Mathematical Limits to Software Estimation
Supplementary Material

J.P. Lewis
Stanford University

We occasionally hear of enormous software projects that run years late and are then canceled. On the other hand, there are a number of software engineering methodologies that claim or hint at objective estimation of development schedules, project complexity, and programmer productivity. With objective estimates of development schedules, projects should rarely run late. Are these failed software projects not using proper software engineering, or is there a deeper problem?

The recent paper Large Limits to Software Estimation provides a framework for answering this question. Specifically, it shows how software estimation can be interpreted in algorithmic (Kolmogorov) complexity terms. Algorithmic complexity results then indicate that claims of purely objective estimation of project complexity, development time, and programmer productivity are necessarily incorrect.

More specifically, the paper shows that

  • Program size and complexity cannot be objectively estimated a priori.
  • Development time cannot be objectively predicted.
  • Absolute productivity cannot be objectively determined.
  • Even approximate estimators are suspect: there is no estimator that produces a correct fixed bound on the complexity of all programs.

The background for this discussion includes some fun topics -- incompleteness, the foundations of computation, logical paradox -- and similar ideas that have arisen in different fields (math, theoretical computer science, complexity).

The document you are reading provides pointers to some of this background material, and some additional anecdotes of failed software projects and exaggerated software estimation claims. Look through the paper, then return here for the background and extra bits.

Frequent misunderstandings

In Nov '01 this paper was discussed on several web news/discussion sites. The majority of the posts seemed to agree with what they guessed the conclusions to be, but, as one might expect, many posters had not actually read beyond the first sentence or two. Others argued with statements they believed might be in the paper (but in fact are not).

The argument is in the general ballpark of "software estimation is hard or impossible", but it actually says something more specific than that.

The article does not say the following:

  1. software estimation is impossible
  2. objective software estimation is impossible, therefore software estimation is impossible

The article does say

  • Various software engineering authorities claim that objective software estimation is possible [paragraph 3, quotes on first page].
  • Objective software estimation is in fact not possible [body of article].

From this, it does not conclude either of points 1 and 2 above. Instead, it concludes:

  • Software construction is inherently creative and subjective, having more in common with physics than manufacturing; software estimation is inherently subjective [conclusion, Bollinger quote].
  • Because software is used in government, in vehicles, and in other places where it can potentially have a negative impact on people's lives, we (software writers) have an ethical responsibility not to over-represent our ability to estimate (especially when it comes to software quality; see the correctness claim below).

Here are some of the posted responses, paraphrased:

  • "The article says that estimation must be objective rather than subjective"
    No, it does not say this.

  • "The article says that subjective software estimation is not useful"
    It also does not say this.

  • "The article says that we are looking for exact answers, not estimates" or "the article doesn't understand what `estimate' means" or
    No, the article distinguishes subjective and objective estimates, and specifically discusses the case of an objective estimate with bounds.

  • "The article only is only talking about methods that are completely accurate."
    No, the paper also considers approximate objective estimators.

  • "The article says that `if complexity can be objectively estimated then you can make software estimates; complexity cannot be objectively estimated, therefore software scope cannot be estimated', this is denying the antecedent".
    The reader's claim is that the argument has the structure "If X then Y, but in fact not X, therefore not Y", which would be incorrect if there were ways of obtaining Y other than from X.

    First (and as stated above), the article does NOT say "software scope cannot be estimated"; it only addresses software estimates that claim objectivity and require a complexity estimate.

    Regarding denying the antecedent: this claim is incorrect because the article addresses software estimation methods that require a complexity definition (by their own statement), and of course an objective estimate requires that all its inputs also be objective. As such the argument has an "..only if X.." that is missing from the characterization above. With the added "only if", the "not X" (no objective estimate of complexity) results in "not Y" (objective software estimates that require a complexity estimate cannot exist).

    Of course there are some software estimates that are objective and do not require a complexity estimate, for example, LOC for an existing completed program, for whatever that is worth -- this is not in dispute.

  • "People/organizations can make accurate estimates, I made one last week" or "Estimation is hard, I double my estimates and still miss them".
    Ok, but slightly off topic: the article is specifically talking about those who claim objective estimates.

  • "You can do objective estimation, and I did it last week using METHOD X"
    And where did you get an objective estimate of the complexity of a new project? Read the article...

  • Consider the following two programs:
    1. A program that computes the square of a very large, algorithmically random number.
    2. A program that computes the Fourier transform of 16 small numbers.
    Almost all programmers would say that writing a Fourier transform program is more complex than writing a program to square a number, but Kolmogorov complexity says the opposite.
    (Background -- an algorithmically random number is one that has no regularities that would allow it to be compressed, thus the program must represent it by literally storing its digits. So a program to square a million-byte algorithmically random number would need to be larger than a million bytes, if the number is included in the program).

    This objection confuses the complexity of the program with the complexity of a large random number that the programmer does not write. The difficulty of writing a program is correlated with the complexity of the code that the programmer actually writes, not the complexity of inputs or arguments to the program. The numbers in the programs described above are properly considered inputs or arguments.

    The objection is correct insofar as Kolmogorov complexity conventionally deals with programs that produce output only, and thus would have their inputs or arguments incorporated in the program. The paper sketches how interactivity, arguments, and state might be accommodated in Kolmogorov complexity, but also notes that these are ultimately irrelevant -- complexity will remain non-computable. If you do not want to consider inputs, instead just consider "the complexity of the part that the programmer writes", i.e., measure the complexity of the given number, then that of the finished program, then subtract.
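
    True Kolmogorov complexity is not computable, but compressed size is a common, crude stand-in, and it is enough to illustrate the "subtract the input's complexity" idea in Python. This is a sketch only; the million-byte "number" and the one-line squaring code are invented for the illustration:

    import os, zlib

    def proxy_complexity(data: bytes) -> int:
        # Compressed length: a crude, computable stand-in for Kolmogorov
        # complexity (the true quantity is not computable).
        return len(zlib.compress(data, 9))

    # A million "algorithmically random" bytes (os.urandom output is
    # incompressible in practice) standing in for the given number, plus
    # the small squaring code that the programmer actually writes.
    number = os.urandom(1_000_000)
    code = b"print(int.from_bytes(number, 'big') ** 2)"
    program = number + code    # the number embedded in the program as data

    # Complexity of "the part the programmer writes": program minus input.
    print(proxy_complexity(program) - proxy_complexity(number))  # small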

    (This discussion suggests another ultimately irrelevant consideration: a programmer is assigned to add a bit of code to an existing large program. (S)he must understand the program before being able to add the code. Here the relevant complexity is the complexity of the added code plus some factor times the complexity of the existing code, where the factor indicates the relative difficulty of learning existing code versus producing new code. This does not affect our argument because there is no way of objectively estimating the complexity of either the existing or the new code...)

  • Complexity is not the only thing that should be considered; there are other requirements, like modularity, maintainability, portability, and readability, which must be taken into account.

    Certainly true, but this does not affect the argument one way or the other. The argument says only that software estimates that depend (in part) on complexity cannot be objective. Some such estimates may also depend on other things (which may or may not be objectively estimable), but as long as the estimate rests on one component that is subjective, the estimate itself is subjective.

  • "Software estimation needs common sense, not advanced mathematics."
    Certainly. The 'manufacturing' camp of software estimators (see the Humphrey quote below) says or hints that software construction can be made into a repeatable, fairly boring process where projects are always on time and programmers are like factory workers. This may or may not be true (I don't think it is), but regardless: to make this view seem more science than philosophy, some of these people have fallen into the trap of cloaking their estimating process in formal notation and claiming or hinting at objectivity. This part is wrong.

    On the contrary [see below, and the conclusion of the article]:

    Good estimation therefore requires experience and judgment. The software industry should value human experience, intuition, and wisdom rather than claiming false objectivity and promoting entirely impersonal "processes".

Incompleteness

Incompleteness in mathematics

Gödel incompleteness says that every sufficiently powerful system of mathematics will contain true statements that cannot be proven, and that no such system can prove its own consistency. Gödel, Escher, Bach is a rather lengthy popular discussion of Gödel incompleteness. Shorter popular introductions are found in several popular books by John Casti (Borders usually has one or more) and by Rudy Rucker (e.g. Mind Tools).

Perhaps the easiest "serious" introduction to Gödel incompleteness is the last section of Rucker's Infinity and the Mind (Princeton U. Press, 1995). Mathematical incompleteness is difficult, however; it is much easier to show incompleteness by establishing that theorem proving and computation are equivalent, and then demonstrating it in computer science or complexity terms. This approach is outlined below.

Incompleteness in computer science

Turing's Halting theorem is the well known example of incompleteness in computer science. Rice's theorem is a more general version: it says that there is no algorithm that can decide any nontrivial extensional property of all programs. An `extensional property' is one that can be seen from the input-output behaviour of the program without examining its internal state. Examples are: whether a program ever stops running (the halting problem), whether a program will crash (the automatic debugging problem), whether a program will ever print the secret code '41', whether a (graphics) program will ever produce a pure black image, etc.
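
The flavor of the Rice's theorem argument is a reduction from halting. Here is a minimal Python sketch, where run (a universal interpreter) and prints_41 (a claimed decider for the "ever prints 41" property) are hypothetical helpers:

def build_wrapper(p, x):
    # A program that runs p on x (possibly forever), then prints 41.
    def wrapped():
        run(p, x)     # hypothetical interpreter; returns only if p halts on x
        print(41)     # reached exactly when p halts on x
    return wrapped

# wrapped() prints 41 iff p halts on x, so a working prints_41() would
# decide the halting problem. Hence no such decider can exist.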

The flavor of a proof of the halting theorem by contradiction is easy: you say you have written a function that will tell whether any given program halts. I then write a program that calls your function with my program as input. If your function says my program will exit, I loop; otherwise I exit.

P:: if Halts(P) loop else halt;

A more realistic version considers programs P(x) that have variable input x. Form the program

P(B):: { if Halts(B(B)) loop else halt; }

Now call P(B) with P itself passed for parameter B. The resulting program is

P(P):: { if Halts(P(P)) loop else halt; }
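
A sketch of the same construction in Python, assuming a hypothetical oracle halts(f, arg) that returns True exactly when f(arg) would terminate:

def paradox(p):
    # Do the opposite of whatever the oracle predicts for p run on itself.
    if halts(p, p):
        while True:   # oracle says "halts": loop forever instead
            pass
    else:
        return        # oracle says "loops": halt immediately

# paradox(paradox) halts iff halts(paradox, paradox) is False --
# a contradiction, so no total halts() function can exist.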

Turing's original proof was by diagonalization, similar to the Cantor diagonalization argument that the infinity of reals is bigger than the infinity of integers. The key to this proof is to recognize that programs can be enumerated. That is, fix a machine architecture; then every possible program is a bit pattern that can be interpreted as a large integer, and vice versa, every integer corresponds to a bit pattern that you can try to execute as a program (though few will actually function). Turing's proof by diagonalization is:

Define F(N) as one plus the value of the Nth computable function applied to the number N, or zero if the Nth program does not halt. F cannot be computable because if it were it would be the Mth computable function, and then F(M) = F(M)+1. Because F(N) is straightforwardly defined, the ``flaw'' must be that halting cannot be determined.
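
The diagonal function as a Python sketch, where run(n, x) is a hypothetical universal interpreter applying program number n to input x, and halts(n, x) is the impossible halting oracle:

def F(n):
    # One plus the n-th program's output on input n; zero if it loops.
    if halts(n, n):
        return run(n, n) + 1
    return 0

# If F were computable it would be program number m for some m, and then
# F(m) = run(m, m) + 1 = F(m) + 1. The only assumption that can fail is
# the computability of halts().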

These theorems do not say that it is impossible to determine the stated property for many particular programs; they just say that it is not possible to do so for every program. The next obvious question is: how many potential programs are there for which halting (or other extensional properties) cannot be determined? If the only program for which halting could not be determined were the one shown above that calls the halting test function in a paradoxical way, then incompleteness would be somewhat trivial. In fact this is not the case; there are (infinitely) many such exceptions.

In my opinion incompleteness is one of the consequences of allowing the concept of infinity -- infinity brings with it a number of problems and paradoxes. If an infinite number of programs are not allowed, the halting problem goes away... but not so fast: impossibility is merely replaced with intractability.

Intractable problems are those for which a solution is theoretically possible, but which in practice would require aeons of computer time, and a new processor won't begin to help. In fact in both computer science and Kolmogorov flavors of incompleteness, imposing resource limits results in intractable problems.

An example of an intractable problem is this one, which has a genetic programming feel:

Find the fastest x86 program that implements a particular video codec
This problem has practical value -- one could code a "reference" version of the codec, and then do a GP-style search to find its most optimized implementation. To make the problem easier, fix the input video and state, and find the fastest x86 program that produces the same output from this input and state that an existing codec does. So the search program is now easy: in enumeration order, generate a program, run it for as long as the regular codec takes to run (no longer), check its output if any, and keep track of the fastest-running program so far found that produced the desired output.

The intractable nature of this will be obvious to many, but for the rest of you, let's look at the numbers. How big is the codec? Perhaps 100k bytes, but let's say it is only EIGHT bytes for now. There are 2^64 possible programs of 8 bytes in length. Now buy a bunch of gigahertz machines, and suppose, very optimistically, that together they can test 2^32 (about 4 billion) candidate programs every 10 seconds. Searching all 2^64 possibilities would then take 2^32 * 10 seconds. Calculate this out, using a calculator such as Mathematica that isn't limited to 32-bit computation -- it comes to well over a thousand years. Now consider that this is for the universe of toy 8-byte programs, whereas the desired codec is 100k bytes. For realistically sized problems, intractable may as well be impossible, and no conventional increase in compute power (including non-quantum parallel machines) will make a difference.
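
For anyone who wants the arithmetic spelled out, here it is in a few lines of Python (the candidate-testing rate is the optimistic assumption stated above):

candidates = 2 ** 64            # all possible 8-byte (64-bit) programs
tested_per_10s = 2 ** 32        # optimistic: candidates tested every 10 seconds
seconds = (candidates // tested_per_10s) * 10
years = seconds / (3600 * 24 * 365)
print(seconds, "seconds =", round(years), "years")   # ~1362 years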

The paper uses "feasible" as a synonym for "tractable".

Neil Jones' book Computability and Complexity from a Programming Perspective (MIT Press, 1997) is a reasonable introduction to the computer science side of incompleteness.

Kolmogorov Complexity, and Incompleteness therein

Kolmogorov complexity is also called algorithmic complexity and KCS (Kolmogorov, Chaitin, Solomonoff) complexity. The paper briefly describes this concept. The current "bible" of this field is Li and Vitanyi, Kolmogorov Complexity, Springer. It is dense but readable and suitable for home study.

Incompleteness in Kolmogorov complexity is simply the fact that the complexity of a string is not computable; this is mentioned and used in the paper.

Incompleteness relationships

Mathematical incompleteness can be shown from computer science incompleteness, and vice versa, and likewise for complexity incompleteness. Some of the demonstrations are easy and require only accepting the notion that theorem proving can be formalized as computation (which is the essence of the Turing machine work).

Svozil (cited in the paper) lays out one correspondence between mathematics and computation,

axioms <-> program input or initial state
rules of inference <-> program interpreter
theorem(s) <-> program output
derivation <-> computation

but other correspondences are possible depending on the chosen formalization of "computer". Now to give examples of cross-field incompleteness arguments:

Turing implies Gödel

Because there is no procedure for deciding which programs halt, there can be no proofs of which programs halt. So in a mathematical system that is powerful enough to discuss programs (recursive functions), there are statements ``program X will halt'' that are true but cannot be proven.

Complexity proof of Turing theorem

Assume the function Halts(P). Write a program Q that looks at all programs up to N bits in size and finds the biggest number produced by any of these programs (among those that halt). Double this number and print it. The search program Q is approximately log_2 N bits in size (it needs little more than a representation of N), yet it prints a number larger than any program of up to N bits can print -- and for large N, Q is itself such a program. The way out of the contradiction is to conclude that Q cannot tell which programs halt.
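
A Python sketch of Q, with the impossible pieces shown as named hypothetical helpers: programs_up_to(n) enumerates all bit patterns of at most n bits, halts is the assumed halting oracle, and run returns a program's numeric output:

def Q(n):
    # Print double the largest number produced by any halting program
    # of at most n bits.
    best = 0
    for p in programs_up_to(n):
        if halts(p):
            best = max(best, run(p))
    print(2 * best)

# Q needs only about log2(n) bits to store n, plus a constant amount of
# code, so for large n it is itself far smaller than n bits -- yet it
# prints a bigger number than any <= n-bit program. Contradiction.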

Complexity proof of Incompleteness

The paper contains the Chaitin complexity proof that mathematical systems cannot prove most statements of the form ``the complexity of (some string) is greater than (some value)''.

Etc.

There are many more similar results, including that the maximum run time of a program is not computable, that halting proves Church's theorem (the negative answer to Hilbert's Entscheidungsproblem: theoremhood is not decidable by computation), etc.

Claims of the software estimation industry

Technical bookstores have a couple feet of shelf space devoted to software engineering. Some of these books have overly optimistic claims or hints of "objective" estimation based on historical data somewhere in their introduction.

Here are a few samples:


(The original CMM Manifesto) The CMM for Software, pp. 3-4:

"In an immature organization, there is no objective basis for judging product quality or for solving product or process problems" ...
"[In a mature organization] There is an objective quantitative basis for judging product quality and analyzing problems with the product and process. Schedules and budgets are based on historical performance and are realistic"

``Software Quality Measurement: A Framework for Counting Problems and Defects'' CMU/SEI-TR-22, p.6:

``Consistent measurements provide data for doing the following:
  • Quantitatively expressing requirements, goals, and acceptance criteria.
  • Monitoring progress and anticipating problems.
  • Quantifying tradeoffs used in allocating resources.
  • Predicting the software attributes for schedules, cost, and quality.''

G. G. Schulmeyer, ``Software Quality Lessons from the Quality Experts,'' in G. G. Schulmeyer and J. I. McManus, Eds., Handbook of Software Quality Assurance (2nd ed.):

In the Certainty state [of quality management], the objective of software development and software quality management, producing quality software on time with a set cost everytime, is possible.

Course title: "Productivity Improvement through Defect-Free Development".

Book title: Zero Defect Software, McGraw-Hill, New York, 1990.


etc...

Note that some of these statements do not quite say that they are going to do objective estimation. For example, both this quote

"[In a mature organization] There is an objective quantitative basis for judging product quality and analyzing problems with the product and process. Schedules and budgets are based on historical performance and are realistic"
and this one
``Consistent measurements provide data for doing the following:
  • Quantitatively expressing requirements, goals, and acceptance criteria.
  • Monitoring progress and anticipating problems.
  • Quantifying tradeoffs used in allocating resources.
  • Predicting the software attributes for schedules, cost, and quality.''
stop just short of saying that they will do quantitative estimation based on historical data (though a reader skimming them might well come away with that impression).

Software disasters

GAO-93-13 on major software challenges: ``We have repeatedly reported on cost rising by millions of dollars, schedule delays of not months but years, and multi-billion-dollar systems that don't perform as envisioned.''

A few software "disasters" that I came across during the writing of this paper (1997-2000):

  • The IRS is making another stab at modernizing its aging computer system, after spending $3.3 billion on a failed upgrade over the past decade.
    Richard Stengel, ``An Overtaxed IRS,'' Time, April 7 1997 pp. 59-62.
  • The state of California was faced with scrapping an unfinished $300 million system designed to track child support payments.
    H. Jordan, ``Computer Snafu May Cost State Millions'', San Jose Mercury News, April 1997.
  • Windows 97 slips one year, becomes Windows 97/98 (later renamed to Windows 98). And similarly with NT version 5, Apple's Copland, Mac OS X, etc.
  • An unpublished review of 17 major DOD software contracts found that the average 28-month schedule was missed by 20 months, and no project was on time.

Evidently the prize for software disasters goes to the FAA's Advanced Automation System (AAS), an attempt at an improved air traffic control system that ran into trouble in the early 90s after an estimated $6.5 billion was spent. The AAS has been described as "one of the largest and most spectacular computing failures in the history of the field."

R. Britcher, one of the software engineers involved in the AAS project, describes it in the book The Limits of Software (Addison Wesley). Britcher himself characterizes the AAS as

"The greatest debacle in the history of organized work... we learned nothing from it"
The book is thoughtful and is a good read.

Importantly, the AAS project did make use of software engineering best practices, such as code inspections, schedule estimates, etc. The software was estimated to be on schedule each and every month... until one month before it was to be delivered. Only at this point did it become evident that the project was hopelessly late.

Another relevant book is The Mythical Man Month. We've all heard of it; it is worth actually reading if you haven't done so yet.

Additional Claim

In addition to the claims in the paper, there is one additional (somewhat weaker) statement that can be made:
  • Algorithmic complexity casts doubt on "proofs" of program correctness.
This is contrary to the view that formal specifications and program proofs can guarantee program correctness.

The argument here is that the specification must be formal in order to participate in the proof. Since the specification fully specifies the behavior of the program, the complexity (C) of the specification must be at least approximately as large as the complexity of a minimal program for the task at hand. More specifically, given some input it must be possible to determine the desired behavior of the program from the specification. Using this, a small program can be written that, given some input, exhaustively queries the specification until the corresponding output bit pattern (if any) is determined; this bit pattern is then output, thereby simulating the desired program. Formally, C(specification) + C(query program) >= C(program). We are left with the obvious question: if the specification is approximately of the same order of complexity as the program, how can we know that the specification is correct?
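
A sketch of that query program in Python, under the strong assumption that the formal specification is available as a callable predicate spec(x, out) that accepts exactly the correct input/output pairs:

from itertools import count, product

def simulate(spec, x):
    # Enumerate all finite bit strings in order of increasing length and
    # return the first output the specification accepts for input x.
    # Since the spec fully determines behavior, this reproduces the
    # desired program (slowly, but that is beside the point).
    for n in count(0):
        for bits in product("01", repeat=n):
            out = "".join(bits)
            if spec(x, out):
                return out

The query program is a few fixed lines regardless of the task, which is what makes C(specification) + C(query program) >= C(program) plausible: essentially all of the complexity must reside in the specification.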

This conclusion has been arrived at previously and is often raised when discussing program proofs. The complexity viewpoint strengthens and formalizes this conclusion, by making it clear that a previously intuitive quantity (complexity) can be formally defined and expressed in terms of a familiar measure (bits).

This argument, however, is not as strong as those in the paper. For one, the specification and the code are two different representations of the program, and it may be possible for a human to fully understand one without understanding the other; having both available may help in the understanding of both. Nevertheless, this argument suggests that if the program is far beyond our ability to comprehend it in its entirety, this will likely be true for the specification as well. In fact it appears that humans are only able to fully understand a "page or two" of code (whereas the arbitrary constants in KCS arguments mean that these arguments only apply to much larger programs).

(Note that this argument does not prohibit some particular and incomplete behaviors of a program from being verified. It also says nothing about the case where the "specifications" cannot be formally defined -- in this case one is also using a relaxed definition of "proof".)

Peter Naur's work is worth consulting if there is any doubt about the difficulty of knowing whether a specification is correct (he is the 'N' in BNF; the book Knowing and the Mystique of Logic and Rules, Kluwer, 1995, collects some of his publications). Naur has spent several decades studying how people deal with formal specifications and he reports some disappointing findings, including:

  • Studies he conducted of problem solving using formal versus informal methods: people perform more poorly (catch fewer errors, etc.) when working with formal specifications.
  • He points to numerous instances of errors in published proofs of only a page or two in length. If peer-reviewed proofs this short still have errors, how can we know that a 100-page spec for an air traffic control system is correct?

Challenge Scenario

You are a manager, and you believe that you can objectively estimate development schedules.

How long will it take your team to write a program that settles the Goldbach conjecture? (The idea comes from John Casti.) By the equivalence of mathematics and computation such a program is possible, and judging from other number theory proofs the program might be pretty short (but this is for you to estimate).

The Goldbach conjecture has been unsolved for more than two and a half centuries, despite a million-dollar prize recently offered for its solution.

So here's a perfectly well formed (no changing requirements) request to write what is probably a fairly simple program, but I don't think anyone can say how long it would take.
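
To make this concrete, the mechanical half of the task is a few lines of Python: the search below halts (returning a counterexample) exactly when the conjecture is false. Estimating the remaining part -- determining whether it ever halts -- is estimating a piece of unsolved mathematics:

def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def goldbach_counterexample():
    # Check even numbers 4, 6, 8, ... and return the first that is NOT
    # a sum of two primes; halts iff the Goldbach conjecture is false.
    n = 4
    while True:
        if not any(is_prime(p) and is_prime(n - p) for p in range(2, n)):
            return n
        n += 2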

If you say, "well this sort of program is not realistic", I say, "it's only unrealistic BECAUSE it cannot be predicted". If things like this were realistic, programmers would be hired to solve important mathematics problems. The market certainly exists -- there are plenty of important problems waiting to be solved.

Conclusions: Disclaimer, Programming vs. Manufacturing, Ethics

Disclaimer, repeated

Once again, I do not mean to say that software estimation, formal specifications, etc. are not useful! Regarding formal specifications: providing an alternate view of a program will help uncover errors that are not clear in its original form (code). More generally, software estimation probably does help.

I am saying just this:

Claims of objective estimation are wrong.
Wrong, and arguably irresponsible as well.

Ethics

Irresponsible? A problematic scenario would be this: a software estimation consultant says that if you take his/her $10,000 course, you can transform your software group to produce bullet-proof software on time, every time. The consultant manages to convince the head of R&D at a car company that the software estimation claims are accurate. The car company then decides that it can save $50 per car by removing some manual override and relying entirely on computer control...

Unquestioning faith in software estimation is probably unlikely in the arena of car safety, but computers are everywhere, and plenty of smaller-scale incidents do happen. Peter Neumann has been tracking such incidents for years; see his RISKS archives.

Software development: like manufacturing, or like physics?

The CMM (Capability Maturity Model) and the similar ISO-900x got started in an understandable way: the goal was to apply the success and reliability found in manufacturing to software development. Indeed, the CMM was conceived by Watts Humphrey, who writes that industry process control concepts

``...are just as applicable to software as they are to automobiles, cameras, wristwatches, and steel.''
(``Characterizing the Software Process,'' IEEE Software 5(2), pp. 73-79)

Clearly this is well intended, but there is an obvious difference between manufacturing and software. In manufacturing, you make a chair, then another one, then 10,000 more. Then perhaps you switch to a different chair (but it's still a chair) and you make 10,000 of these.

By the nature of software you are never repeating something you did before. If what you need to do is the same as what you did, you can simply make a digital copy. As such, there is no opportunity for the repeatability and codification that are present in manufacturing. This point is better made by Bollinger (cited in the paper), who responds to Humphrey,

``The creation of genuinely new software has far more in common with developing a new theory of physics than it does with producing cars or watches on an assembly line.''

Software construction is an intrinsically creative activity and as such has inherent risks. As a broad consequence, software estimation is necessarily subjective. Good estimation therefore requires experience and judgment. The software industry should value human experience, intuition, and wisdom rather than claiming false objectivity and promoting entirely impersonal "processes". It should attend to the intrinsic uncertainties and risks of software development and where necessary promote the public discussion and honest assessment of these risks.

More Reader feedback

  • "Programming tools and libraries have to progress to the point where the coding we do is dull and mindless." Posted by slaytanic killer on kuro5hin.org Nov 5th, 2001.

  • "SDM adopting companies tend not to develop products that could not be trivially predicted." -Constantine Plotnikov, paraphrasing DeMarco & Lister Peopleware, second edition Chapter 29. pages 186-193 in paperback edition.

About the Author and this Paper

My career has been split between academic research, mostly in computer graphics, and commercial programming (again, mostly in the graphics industry). Software engineering is not my field -- how did I end up writing "Large Limits to Software Estimation"?

In 1996 I spent some time working on a notion of `visual complexity' at Interval Research in Palo Alto. As part of this effort, I studied Kolmogorov complexity. A year later I was employed at DreamWorks (the animation company), working on an early version of Shrek. The DreamWorks software staff exposed me to the Capability Maturity Model and similar software process management methods. Since these methods were unfamiliar, I read about them with interest... but something was bothering me. After a couple days I realized that the estimation claims could be seen from a Kolmogorov complexity viewpoint, and that there was a problem with a couple of the claims of these methods.

An early draft of the paper was released on the internet in 1997 and became a minor hit, judging from the email I receive. This version is currently cached at the CiteSeer repository. The version published by the ACM is shortened from the internet draft but adds the section on approximate estimation.