We occasionally hear of enormous software projects that run years late and are then canceled. On the other hand, there are a number of software engineering methodologies that claim or hint at objective estimation of development schedules, project complexity, and programmer productivity. With objective estimates of development schedules, projects should rarely run late. Are these failed software projects not using proper software engineering, or is there a deeper problem?

The paper Large Limits to Software Estimation provides a framework for answering this question. Specifically, it shows how software estimation can be cast in algorithmic (Kolmogorov) complexity terms. Standard algorithmic complexity results then indicate that claims of purely objective estimation of project complexity, development time, and programmer productivity are necessarily incorrect.

More specifically, the paper shows that program size and complexity cannot be objectively estimated a priori, that development time cannot be objectively predicted, and that absolute programmer productivity cannot be objectively determined.

The background for this discussion includes some fun topics -- incompleteness, the foundations of computation, logical paradox -- and similar ideas that have arisen in different fields (math, theoretical computer science, complexity).

The document you are reading provides pointers to some of this background material, and some additional anecdotes of failed software projects and exaggerated software estimation claims. Look through the paper, then return here for the background and extra bits.

Misunderstandings

In Nov '01 this paper was discussed on several web news/discussion sites. The majority of the posts seemed to agree with what they guessed the conclusions to be but, as one might expect, many posters had not actually read beyond the first sentence or two. Others argued with statements they believed might be in the paper (but in fact are not).

The argument is in the general ballpark of "software estimation is hard or impossible", but it actually says something more specific than that.

The article does not say the following:

  1. software estimation is impossible
  2. objective software estimation is impossible therefore software estimation is impossible

The article does say that purely objective estimation of project complexity, development time, and programmer productivity is not possible.

From this, it does not conclude either of the points 1, 2 above. Instead, it concludes that software estimation is necessarily subjective in part, and that good estimation therefore requires experience and judgment rather than a claimed-objective process.

Here are some of the posted responses, paraphrased:

Incompleteness

Incompleteness in mathematics

Gödel incompleteness says that every sufficiently powerful, consistent system of mathematics contains true statements that cannot be proven within the system, and that no such system can prove its own consistency. Gödel, Escher, Bach is a rather lengthy popular discussion of Gödel incompleteness. Shorter introductions are found in several popular books by John Casti and by Rudy Rucker (e.g. Mind Tools).

Perhaps the easiest "serious" introduction to Gödel incompleteness is the last section of Rucker's Infinity and the Mind (Princeton U. Press, 1995). Proving mathematical incompleteness directly is difficult, however; it is much easier to first establish that theorem proving and computation are equivalent, and then show incompleteness in computer science or complexity terms. This approach is outlined below.

Incompleteness in computer science

Turing's Halting theorem is the well known example of incompleteness in computer science. Rice's theorem is a more general version: it says that there is no algorithm that can determine any non-trivial extensional property of all programs. An `extensional property' is one that can be seen from the input-output behavior of the program without examining its internal state. Examples are: whether a program ever stops running (the halting problem), whether a program will crash (the automatic debugging problem), whether a program will ever print the secret code '42', whether a (graphics) program will ever produce a pure black image, etc.

The flavor of a proof of halting by contradiction is easy: you say you have written a function that will tell whether any given program halts. I then write a program that calls your function with my program as input: if your function says my program will halt, I loop forever; otherwise I exit.

P:: if Halts(P) loop else halt;

A more realistic version considers programs P(x) that have variable input x. Form the program

P(B):: { if Halts(B(B)) loop else halt; }

Now call P(B) with P itself passed for parameter B. The resulting program is

P(P):: { if Halts(P(P)) loop else halt; }
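
In ordinary code the same trap reads as follows. This is a minimal Python sketch; halts() is the hypothetical oracle, which is exactly the function the argument shows cannot exist.

    def halts(f, x):
        # Hypothetical halting oracle: True iff f(x) eventually halts.
        # Turing's argument shows this cannot actually be implemented.
        raise NotImplementedError

    def paradox(p):
        # Ask the oracle about p applied to itself, then do the opposite.
        if halts(p, p):
            while True:
                pass      # oracle said "halts", so loop forever
        else:
            return        # oracle said "loops", so halt

    # Now consider paradox(paradox). If halts(paradox, paradox) is True,
    # then paradox(paradox) loops forever; if it is False, it halts.
    # Either answer contradicts the oracle, so no correct halts() exists.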

Turing's original proof was by diagonalization, similar to Cantor's diagonalization argument that the infinity of reals is bigger than the infinity of integers. The key to this proof is recognizing that programs can be enumerated: fix a machine architecture, and every possible program is a bit pattern that can be read as a large integer; vice versa, every integer is a bit pattern that you can try to execute as a program (though few will actually function). Turing's proof by diagonalization is:

Define F(N) as one plus the value of the Nth computable function applied to the number N, or zero if the Nth program does not halt on input N. F cannot be computable: if it were, it would be the Mth computable function for some M, and since F always halts, F(M) = F(M)+1 -- a contradiction. Because F(N) is otherwise straightforwardly defined, the ``flaw'' must be that halting cannot be determined.
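
In symbols (writing phi_N for the Nth computable function; the notation is standard, not from the paper):

    F(N) = phi_N(N) + 1   if the Nth program halts on input N
         = 0              otherwise

    F = phi_M  ==>  F(M) = phi_M(M) + 1 = F(M) + 1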

These theorems do not say that it is impossible to determine the stated property for particular programs; they say that no algorithm can do so for every program. The next obvious question is: how many potential programs are there for which halting (or another extensional property) cannot be determined? If the only such program were the one shown above, which calls the halting test function in a paradoxical way, then incompleteness would be somewhat trivial. In fact this is not the case; there are (infinitely) many such exceptions. This is not evident in the proof of halting indicated above, but it is evident in other proofs, e.g. the Chaitin incompleteness theorem quoted in the paper.

In my opinion incompleteness is one of the consequences of allowing the concept of infinity -- infinity brings with it a number of problems and paradoxes. If only a finite number of programs is allowed, the halting problem goes away... but not so fast: impossibility is merely replaced with intractability.

Intractable problems are those for which a solution is theoretically possible but would in practice require aeons of computer time; a faster processor won't begin to help. In both the computer science and Kolmogorov flavors of incompleteness, imposing resource limits turns impossible problems into intractable ones.

An example of an intractable problem is this one, which has a genetic programming feel:

Find the fastest x86 program that implements a particular video codec

This problem has practical value -- one could code a "reference" version of the codec, then do a GP-style search to find its most optimized implementation. To make the problem easier, fix the input video and state, and find the fastest x86 program that produces the same output from this input and state as the existing codec does. The search program is then easy to write: in enumeration order, generate a program, run it for as long as the reference codec takes to run (no longer), check its output if any, and keep track of the fastest-running program found so far that produces the desired output.
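
A sketch of this search loop in Python. Everything named here is an assumption for illustration: enumerate_programs() yielding candidate byte strings in enumeration order, and run_limited() executing a candidate in a sandbox under a step budget; neither is a real API.

    def find_fastest(reference_output, video_input, state, step_budget):
        # Brute-force search for the fastest program matching the reference.
        best = None   # (program, steps) of the fastest correct candidate
        for candidate in enumerate_programs(max_bytes=8):
            output, steps = run_limited(candidate, video_input, state,
                                        step_budget)
            if output == reference_output and (best is None or steps < best[1]):
                best = (candidate, steps)
        return best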

The intractable nature of this will be obvious to many, but for the rest of you, let's look at the numbers. How big is the codec? Perhaps 100k bytes, but let's say it is only eight bytes for now. There are 2^64 possible programs of eight bytes in length. To test them, buy a bunch of gigahertz machines -- and be absurdly generous: assume the pool can test 2^32 (about four billion) candidates every 10 seconds, where 10 seconds is the time one test run takes. Searching all the possibilities still takes 2^64 / 2^32 * 10 = 2^32 * 10 seconds. Calculate this out -- it's a long, long time. Now consider that this is for the universe of toy 8-byte programs, whereas the desired codec is 100k bytes. For realistically sized problems, intractable may as well be impossible, and no conventional increase in compute power (including non-quantum parallel machines) will make a difference.
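
The arithmetic, for anyone who wants to check it:

    batches = 2**64 // 2**32            # groups of 2^32 candidates
    seconds = batches * 10              # 10 seconds per batch
    years = seconds / (60 * 60 * 24 * 365)
    print(round(years))                 # about 1362 years -- for 8-byte programs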

The paper uses "feasible" as a synonym for "tractable".

The book Computability and Complexity from a Programming Perspective by Neil Jones (MIT Press, 1997) is a reasonable introduction to the computer science side of incompleteness.

Kolmogorov Complexity, and Incompleteness therein

Kolmogorov complexity is also called algorithmic complexity and KCS (Kolmogorov, Chaitin, Solomonoff) complexity. The paper briefly describes this concept. A "bible" of this field is Li and Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications (Springer). It is dense but readable and suitable for home study.

Incompleteness in Kolmogorov complexity is simply the fact that the complexity of a string is not computable; this is mentioned and used in the paper.
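
For reference, the standard definition (textbook material, not a quote from the paper): relative to a universal computer U, the complexity of a string x is the length of the shortest program that prints x and halts,

    K_U(x) = min { |p| : U(p) = x }

and the incompleteness result is that K_U(x) is not a computable function of x.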

Incompleteness relationships

Mathematical incompleteness can be shown from computer science incompleteness, and vice versa, and likewise for complexity incompleteness. Some of the demonstrations are easy and require only accepting the notion that theorem proving can be formalized as computation (cf. Turing machines).

Svozil (cited in the paper) lays out one correspondence between mathematics and computation,

axioms <-> program input or initial state
rules of inference <-> program interpreter
theorem(s) <-> program output
derivation <-> computation

but other correspondences are possible depending on the chosen formalization of "computer". Also see the Curry-Howard isomorphism.

Now to give examples of cross-field incompleteness arguments:

Turing implies Gödel

Because there is no procedure for deciding which programs halt, true statements about halting cannot all have proofs: if they did, we could decide halting by running program X while simultaneously searching all possible proofs for a proof that X never halts. So in a mathematical system that is powerful enough to discuss programs (recursive functions), there are true statements of the form ``program X will never halt'' that cannot be proven.

Complexity proof of Turing theorem

Assume the function Halts(P). Write a program Q that looks at all programs up to N bits in size, finds the biggest number produced by any of these programs that halt, doubles it, and prints the result. The search program Q is approximately log_2 N bits in size (enough to state N, plus a constant), yet it prints a number larger than any program of up to N bits can print -- and for large N, Q itself is such a program. The way out of the contradiction is to conclude that Q cannot tell which programs halt; Halts cannot exist.
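
The same argument as a Python sketch; halts(), all_programs_up_to(), and run() are hypothetical helpers, with halts() being exactly the function the argument rules out.

    def Q(n):
        # Largest number returned by any halting program of at most n bits,
        # doubled.
        best = 0
        for p in all_programs_up_to(n):   # roughly 2**n candidates
            if halts(p):
                best = max(best, run(p))
        return 2 * best

    # Q needs only about log2(n) bits to state n, plus a constant, yet it
    # returns a number larger than any program of up to n bits can return.
    # For large n, Q is itself such a program -- contradiction.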

Complexity proof of Incompleteness

The paper contains the Chaitin complexity proof that mathematical systems cannot prove most statements of the form ``the complexity of (some string) is greater than (some value)''.

Etc.

There are more results of this kind: the maximum run time of programs of a given size is not computable, halting implies Church's theorem (the negative answer to Hilbert's Entscheidungsproblem: theoremhood is not decidable by computation), and so on.

Claims of the software estimation industry

Technical bookstores have a couple of feet of shelf space devoted to software engineering. Some of these books make overly optimistic claims, or hint at "objective" estimation based on historical data, somewhere in their introductions.

Here are a few samples:


(The original CMM manifesto) The CMM for Software, pp. 3-4:

"In an immature organization, there is no objective basis for judging product quality or for solving product or process problems..."

[In a mature organization] "There is an objective quantitative basis for judging product quality and analyzing problems with the product and process. Schedules and budgets are based on historical performance and are realistic."


``Software Quality Measurement: A Framework for Counting Problems and Defects,'' CMU/SEI-TR-22, p. 6:

Consistent measurements provide data for doing the following:
  • Quantitatively expressing requirements, goals, and acceptance criteria.
  • Monitoring progress and anticipating problems.
  • Quantifying tradeoffs used in allocating resources.
  • Predicting the software attributes for schedules, cost, and quality.

G. G. Schulmeyer, ``Software Quality Lessons from the Quality Experts,'' in G. G. Schulmeyer and J. I. McManus, Eds., Handbook of Software Quality Assurance (2nd ed.):

In the Certainty state [of quality management], the objective of software development and software quality management, producing quality software on time with a set cost every time, is possible.

Course title: "Productivity Improvement through Defect-Free Development".

Book title: Zero Defect Software, McGraw-Hill, New York, 1990.


etc...

Note that some of these statements do not quite say that they are going to do objective estimation. For example, both this quote

[In a mature organization] "There is an objective quantitative basis for judging product quality and analyzing problems with the product and process. Schedules and budgets are based on historical performance and are realistic."
and this one
Consistent measurements provide data for doing the following:
  • Quantitatively expressing requirements, goals, and acceptance criteria.
  • Monitoring progress and anticipating problems.
  • Quantifying tradeoffs used in allocating resources.
  • Predicting the software attributes for schedules, cost, and quality.
stop just short of saying that quantitative estimation will be done from historical data (though a reader skimming them might well think otherwise).


Software disasters

GAO-93-13 on major software challenges: ``We have repeatedly reported on cost rising by millions of dollars, schedule delays of not months but years, and multi-billion-dollar systems that don't perform as envisioned.''

A few software "disasters" that I came across during the writing of this paper (1997-2000):

Evidently the prize for software disasters goes to the FAA's Advanced Automation System (AAS), an attempt at an improved air traffic control system that ran into trouble in the early 90s after an estimated $6.5 billion was spent. The AAS has been described as "one of the largest and most spectacular computing failures in the history of the field."

R. Britcher, one of the software engineers involved in the AAS project, describes it in the book The Limits of Software (Addison Wesley). Britcher himself characterizes the AAS as

"The greatest debacle in the history of organized work... we learned nothing from it"
The book is a good read.

Importantly, the AAS project did make use of software engineering best practices, such as code inspections, schedule estimates, etc. The software was estimated to be on schedule each and every month... until one month before it was to be delivered. Only at this point did it become evident that the project was hopelessly late.

Another relevant book is The Mythical Man Month. We've all heard of it; it is worth actually reading if you haven't done so yet.

Questioning the Paper

The paper requires assuming that the time to develop a program is (in the large) determined mostly by its complexity, and that Kolmogorov complexity is an appropriate definition of complexity (in the large). It may be possible to challenge the conclusions of the paper by challenging one of these assumptions.

I find both of them fairly self-evident.

If programming time does not depend somehow on complexity, what does it depend on? On the other hand, the nature of this relation is not clear; at most we can expect a monotonic relation between development time and complexity. For example, I would expect there might be a complexity ceiling due to our finite brains, beyond which development time goes to infinity. Development time is also understood here as being independent of any particular programmer: a particular programmer may have difficulty with some of the background material or patterns needed to implement a particular program, but the whole notion of software estimation requires abstracting away from that.

The phrase "in the large" is important. There might be a particular program that is low in complexity but takes a while to write -- perhaps a 100-line program is best written using a particular feature of the latest C++, but that feature requires installing a new compiler, which turns out to require modifications to a header file that you intended to use. These distractions take time, but that time becomes insignificant as the program grows from 100 to 10,000,000 lines.

Kolmogorov (algorithmic) complexity is a well established mathematical definition of the complexity of digital objects. It is a simple but deep concept that can be used to motivate and define notions such as randomness, Occam's razor, and ideas in the philosophy of learning. It also turns out to have a close relationship to entropy, with many parallel theorems; Kolmogorov complexity might be said to be the analog of entropy for individual objects.

To motivate algorithmic complexity, consider an idealized situation in which a very large program is to be written. After every milestone, the program is fully examined and common functionality is factored out. Imagine that this refactoring is done very well. The size of the program then approaches its algorithmic complexity.

Note that the choice of language eventually becomes irrelevant (the invariance theorem). If the language lacks features that make it elegant to code the desired program, at some point the most elegant refactoring is to literally write a new language. The new language can be implemented using the original language, and the cost of the interpreter or compiler is a constant that eventually becomes insignificant.
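
The invariance theorem makes "eventually irrelevant" precise: for any two universal languages U and V there is a constant c_{U,V} -- the cost of writing an interpreter for V in U -- such that for every string x

    K_U(x) <= K_V(x) + c_{U,V}

so complexities measured in the two languages differ by at most a constant that is independent of x.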

Challenge Scenario

You are a manager, and you believe that you can objectively estimate development schedules.

How long will it take your team to write a program to solve the Goldbach conjecture? (Idea from John Casti). By the equivalence of math and computation this is possible, and judging from other number theory proofs the program might even be quite short -- but this is for you to estimate.

The Goldbach conjecture has been unsolved for a couple of centuries, despite a million-dollar prize that was recently offered for its solution.

So here's a perfectly well-formed (no changing requirements) request to write what may be a fairly short program, but I don't think anyone can say how long it would take.
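
To make the connection between the conjecture and computation concrete: Goldbach's conjecture is equivalent to the claim that the short program below never halts, so settling the program's behavior settles the conjecture. A minimal Python sketch (the naive trial-division primality test is just for illustration):

    def is_prime(n):
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    def goldbach_search():
        # Halts (returning a counterexample) iff the conjecture is false.
        n = 4
        while True:
            if not any(is_prime(p) and is_prime(n - p)
                       for p in range(2, n // 2 + 1)):
                return n   # an even number that is not a sum of two primes
            n += 2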

Conclusions: Disclaimer, Programming vs. Manufacturing, Ethics

Disclaimer, repeated

Once again, I do not mean to say that software estimation, formal specifications, etc. are not useful! Regarding formal specifications: providing an alternate view of a program will help uncover errors that are not clear in its original form (code). More generally, software estimation probably does help. The paper only says this:

Claims of purely objective estimation are wrong.

Wrong, and arguably irresponsible as well.

Ethics

Irresponsible? A problematic scenario would be this: a software estimation consultant says that if you take his/her $10,000 course, you can transform your software group to produce bullet-proof software on time every time. The consultant manages to convince the head of R&D at a car company that the software estimation claims are accurate. The car company then decides that it can save $50 per car by removing some manual override and relying entirely on computer control...

Unquestioning faith in software estimation is probably unlikely in the arena of car safety, but computers are everywhere and plenty of smaller scale incidents do happen. Peter Neumann has been tracking such incidents for years, see his risks archives.

Software development: like manufacturing, or like physics?

The CMM (Capability Maturity Model) and the similar ISO-900x got started in an understandable way: the goal was to apply the success and reliability found in manufacturing to software development. Indeed, the CMM was conceived by Watts Humphrey, who writes that industry process control concepts

...are just as applicable to software as they are to automobiles, cameras, wristwatches, and steel.
(``Characterizing the Software Process,'' IEEE Software 5(2), pp. 73-79)

Clearly this is well intended, but there is an obvious difference between manufacturing and software. In manufacturing, you make a chair, then another one, then 10,000 more. Then perhaps you switch to a different chair (but it's still a chair) and you make 10,000 of these.

By the nature of software you are never repeating something you did before. If what you need to do is the same as what you did, you can simply make a digital copy. As such, there is no opportunity for the repeatability and codification that are present in manufacturing. This point is better made by Bollinger (cited in the paper), who responds to Humphrey,

The creation of genuinely new software has far more in common with developing a new theory of physics than it does with producing cars or watches on an assembly line.

Software construction is an intrinsically creative activity and as such has inherent risks. As a broad consequence, software estimation is necessarily subjective; good estimation therefore requires experience and judgment. The software industry should value human experience, intuition, and wisdom rather than claiming false objectivity and promoting entirely impersonal "processes". It should attend to the intrinsic uncertainties and risks of software development and, where necessary, promote the public discussion and honest assessment of these risks.

More feedback and quotes