We occasionally hear of enormous software projects that run years late and are then canceled. On the other hand, there are a number of software engineering methodologies that claim or hint at objective estimation of development schedules, project complexity, and programmer productivity. With objective estimates of development schedules, projects should rarely run late. Are these failed software projects not using proper software engineering, or is there a deeper problem?

The paper Large Limits to Software Estimation provides a framework for answering this question. Specifically, it shows how software estimation can be cast in algorithmic (Kolmogorov) complexity terms. Standard algorithmic complexity results then indicate that claims of purely objective estimation of project complexity, development time, and programmer productivity are necessarily incorrect.

More specifically, the paper shows that program size and complexity cannot be objectively estimated a priori, that development time cannot be objectively predicted, and that absolute programmer productivity cannot be objectively determined.

The background for this discussion includes some fun topics -- incompleteness, the foundations of computation, logical paradox -- and similar ideas that have arisen in different fields (math, theoretical computer science, complexity).

The document you are reading provides pointers to some of this background material, and some additional anecdotes of failed software projects and exaggerated software estimation claims. Look through the paper, then return here for the background and extra bits.

Misunderstandings

In Nov '01 this paper was discussed on several web news/discussion sites. The majority of the posts seemed to agree with what they guessed the conclusions to be but, as one might expect, many posters had not actually read beyond the first sentence or two. Others argued with statements they believed might be in the paper (but in fact are not).

The argument is in the general ballpark of "software estimation is hard or impossible", but it actually says something more specific than that.

The article does not say the following:

  1. software estimation is impossible
  2. objective software estimation is impossible therefore software estimation is impossible

The article does say that purely objective estimation of project complexity, development time, and programmer productivity is not possible.

From this, it does not conclude either of the points 1, 2 above. Instead, it concludes that software estimation is necessarily subjective in part, and that good estimation therefore requires experience and judgment rather than a claimed-objective process.

Here are some of the posted responses, paraphrased:

Incompleteness

Incompleteness in mathematics

Gödel incompleteness says that every sufficiently powerful, consistent system of mathematics contains true statements that cannot be proven within the system, and that no such system can prove its own consistency. Gödel, Escher, Bach is a rather lengthy popular discussion of Gödel incompleteness. Shorter introductions are found in several popular books by John Casti and by Rudy Rucker (e.g. Mind Tools).

Perhaps the easiest "serious" introduction to Gödel incompleteness is the last section of Rucker's Infinity and the Mind (Princeton U. Press, 1995). Proving mathematical incompleteness directly is difficult, however; it is much easier to first establish that theorem proving and computation are equivalent, and then show incompleteness in computer science or complexity terms. This approach is outlined below.

Incompleteness in computer science

Turing's Halting theorem is the well known example of incompleteness in computer science. Rice's theorem is a more general version: it says that there is no algorithm that can determine any non-trivial extensional property of all programs. An `extensional property' is one that can be seen from the input-output behavior of the program without examining its internal state. Examples are: whether a program ever stops running (the halting problem), whether a program will crash (the automatic debugging problem), whether a program will ever print the secret code '42', whether a (graphics) program will ever produce a pure black image, etc.

The flavor of a proof of halting by contradiction is easy: you say you have written a function that will tell whether any given program halts. I then write a program that calls your function with my program as input: if your function says my program will halt, I loop forever; otherwise I exit.

P:: if Halts(P) loop else halt;

A more realistic version considers programs P(x) that have variable input x. Form the program

P(B):: { if Halts(B(B)) loop else halt; }

Now call P(B) with P itself passed for parameter B. The resulting program is

P(P):: { if Halts(P(P)) loop else halt; }
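
In ordinary code the same trap reads as follows. This is a minimal Python sketch; halts() is the hypothetical oracle, which is exactly the function the argument shows cannot exist.

    def halts(f, x):
        # Hypothetical halting oracle: True iff f(x) eventually halts.
        # Turing's argument shows this cannot actually be implemented.
        raise NotImplementedError

    def paradox(p):
        # Ask the oracle about p applied to itself, then do the opposite.
        if halts(p, p):
            while True:
                pass      # oracle said "halts", so loop forever
        else:
            return        # oracle said "loops", so halt

    # Now consider paradox(paradox). If halts(paradox, paradox) is True,
    # then paradox(paradox) loops forever; if it is False, it halts.
    # Either answer contradicts the oracle, so no correct halts() exists.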

Turing's original proof was by diagonalization, similar to Cantor's diagonalization argument that the infinity of reals is bigger than the infinity of integers. The key to this proof is recognizing that programs can be enumerated: fix a machine architecture, and every possible program is a bit pattern that can be read as a large integer; vice versa, every integer is a bit pattern that you can try to execute as a program (though few will actually function). Turing's proof by diagonalization is:

Define F(N) as one plus the value of the Nth computable function applied to the number N, or zero if the Nth program does not halt on input N. F cannot be computable: if it were, it would be the Mth computable function for some M, and since F always halts, F(M) = F(M)+1 -- a contradiction. Because F(N) is otherwise straightforwardly defined, the ``flaw'' must be that halting cannot be determined.
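
In symbols (writing phi_N for the Nth computable function; the notation is standard, not from the paper):

    F(N) = phi_N(N) + 1   if the Nth program halts on input N
         = 0              otherwise

    F = phi_M  ==>  F(M) = phi_M(M) + 1 = F(M) + 1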

These theorems do not say that it is impossible to determine the stated property for particular programs; they say that no algorithm can do so for every program. The next obvious question is: how many potential programs are there for which halting (or another extensional property) cannot be determined? If the only such program were the one shown above, which calls the halting test function in a paradoxical way, then incompleteness would be somewhat trivial. In fact this is not the case; there are (infinitely) many such exceptions. This is not evident in the proof of halting indicated above, but it is evident in other proofs, e.g. the Chaitin incompleteness theorem quoted in the paper.

In my opinion incompleteness is one of the consequences of allowing the concept of infinity -- infinity brings with it a number of problems and paradoxes. If only a finite number of programs is allowed, the halting problem goes away... but not so fast: impossibility is merely replaced with intractability.

Intractable problems are those for which a solution is theoretically possible but would in practice require aeons of computer time; a faster processor won't begin to help. In both the computer science and Kolmogorov flavors of incompleteness, imposing resource limits turns impossible problems into intractable ones.

An example of an intractable problem is this one, which has a genetic programming feel:

Find the fastest x86 program that implements a particular video codec

This problem has practical value -- one could code a "reference" version of the codec, then do a GP-style search to find its most optimized implementation. To make the problem easier, fix the input video and state, and find the fastest x86 program that produces the same output from this input and state as the existing codec does. The search program is then easy to write: in enumeration order, generate a program, run it for as long as the reference codec takes to run (no longer), check its output if any, and keep track of the fastest-running program found so far that produces the desired output.
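
A sketch of this search loop in Python. Everything named here is an assumption for illustration: enumerate_programs() yielding candidate byte strings in enumeration order, and run_limited() executing a candidate in a sandbox under a step budget; neither is a real API.

    def find_fastest(reference_output, video_input, state, step_budget):
        # Brute-force search for the fastest program matching the reference.
        best = None   # (program, steps) of the fastest correct candidate
        for candidate in enumerate_programs(max_bytes=8):
            output, steps = run_limited(candidate, video_input, state,
                                        step_budget)
            if output == reference_output and (best is None or steps < best[1]):
                best = (candidate, steps)
        return best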

The intractable nature of this will be obvious to many, but for the rest of you, let's look at the numbers. How big is the codec? Perhaps 100k bytes, but let's say it is only eight bytes for now. There are 2^64 possible programs of eight bytes in length. To test them, buy a bunch of gigahertz machines -- and be absurdly generous: assume the pool can test 2^32 (about four billion) candidates every 10 seconds, where 10 seconds is the time one test run takes. Searching all the possibilities still takes 2^64 / 2^32 * 10 = 2^32 * 10 seconds. Calculate this out -- it's a long, long time. Now consider that this is for the universe of toy 8-byte programs, whereas the desired codec is 100k bytes. For realistically sized problems, intractable may as well be impossible, and no conventional increase in compute power (including non-quantum parallel machines) will make a difference.
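
The arithmetic, for anyone who wants to check it:

    batches = 2**64 // 2**32            # groups of 2^32 candidates
    seconds = batches * 10              # 10 seconds per batch
    years = seconds / (60 * 60 * 24 * 365)
    print(round(years))                 # about 1362 years -- for 8-byte programs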

The paper uses "feasible" as a synonym for "tractable".

The book Computability and Complexity from a Programming Perspective by Neil Jones (MIT Press, 1997) is a reasonable introduction to the computer science side of incompleteness.

Kolmogorov Complexity, and Incompleteness therein

Kolmogorov complexity is also called algorithmic complexity and KCS (Kolmogorov, Chaitin, Solomonoff) complexity. The paper briefly describes this concept. A "bible" of this field is Li and Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications (Springer). It is dense but readable and suitable for home study.

Incompleteness in Kolmogorov complexity is simply the fact that the complexity of a string is not computable; this is mentioned and used in the paper.
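
For reference, the standard definition (textbook material, not a quote from the paper): relative to a universal computer U, the complexity of a string x is the length of the shortest program that prints x and halts,

    K_U(x) = min { |p| : U(p) = x }

and the incompleteness result is that K_U(x) is not a computable function of x.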

Incompleteness relationships

Mathematical incompleteness can be shown from computer science incompleteness, and vice versa, and likewise for complexity incompleteness. Some of the demonstrations are easy and require only accepting the notion that theorem proving can be formalized as computation (cf. Turing machines).

Svozil (cited in the paper) lays out one correspondence between mathematics and computation,

axioms <-> program input or initial state
rules of inference <-> program interpreter
theorem(s) <-> program output
derivation <-> computation

but other correspondences are possible depending on the chosen formalization of "computer". Also see the Curry-Howard isomorphism.

Now to give examples of cross-field incompleteness arguments:

Turing implies Gödel

Because there is no procedure for deciding which programs halt, true statements about halting cannot all have proofs: if they did, we could decide halting by running program X while simultaneously searching all possible proofs for a proof that X never halts. So in a mathematical system that is powerful enough to discuss programs (recursive functions), there are true statements of the form ``program X will never halt'' that cannot be proven.

Complexity proof of Turing theorem

Assume the function Halts(P). Write a program Q that looks at all programs up to N bits in size, finds the biggest number produced by any of these programs that halt, doubles it, and prints the result. The search program Q is approximately log_2 N bits in size (enough to state N, plus a constant), yet it prints a number larger than any program of up to N bits can print -- and for large N, Q itself is such a program. The way out of the contradiction is to conclude that Q cannot tell which programs halt; Halts cannot exist.
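
The same argument as a Python sketch; halts(), all_programs_up_to(), and run() are hypothetical helpers, with halts() being exactly the function the argument rules out.

    def Q(n):
        # Largest number returned by any halting program of at most n bits,
        # doubled.
        best = 0
        for p in all_programs_up_to(n):   # roughly 2**n candidates
            if halts(p):
                best = max(best, run(p))
        return 2 * best

    # Q needs only about log2(n) bits to state n, plus a constant, yet it
    # returns a number larger than any program of up to n bits can return.
    # For large n, Q is itself such a program -- contradiction.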

Complexity proof of Incompleteness

The paper contains the Chaitin complexity proof that mathematical systems cannot prove most statements of the form ``the complexity of (some string) is greater than (some value)''.

Etc.

There are more results of this kind: the maximum run time of programs of a given size is not computable, halting implies Church's theorem (the negative answer to Hilbert's Entscheidungsproblem: theoremhood is not decidable by computation), and so on.

Claims of the software estimation industry

Technical bookstores have a couple of feet of shelf space devoted to software engineering. Some of these books make overly optimistic claims, or hint at "objective" estimation based on historical data, somewhere in their introductions.

Here are a few samples:


(The original CMM manifesto) The CMM for Software, pp. 3-4:

"In an immature organization, there is no objective basis for judging product quality or for solving product or process problems..."

[In a mature organization] "There is an objective quantitative basis for judging product quality and analyzing problems with the product and process. Schedules and budgets are based on historical performance and are realistic."


``Software Quality Measurement: A Framework for Counting Problems and Defects,'' CMU/SEI-TR-22, p. 6:

Consistent measurements provide data for doing the following:
  • Quantitatively expressing requirements, goals, and acceptance criteria.
  • Monitoring progress and anticipating problems.
  • Quantifying tradeoffs used in allocating resources.
  • Predicting the software attributes for schedules, cost, and quality.

G. G. Schulmeyer, ``Software Quality Lessons from the Quality Experts,'' in G. G. Schulmeyer and J. I. McManus, Eds., Handbook of Software Quality Assurance (2nd ed.):

In the Certainty state [of quality management], the objective of software development and software quality management, producing quality software on time with a set cost every time, is possible.

Course title: "Productivity Improvement through Defect-Free Development".

Book title: Zero Defect Software, McGraw-Hill, New York, 1990.


etc...

Note that some of these statements do not quite say that they are going to do objective estimation. For example, both this quote

[In a mature organization] "There is an objective quantitative basis for judging product quality and analyzing problems with the product and process. Schedules and budgets are based on historical performance and are realistic."
and this one
Consistent measurements provide data for doing the following:
  • Quantitatively expressing requirements, goals, and acceptance criteria.
  • Monitoring progress and anticipating problems.
  • Quantifying tradeoffs used in allocating resources.
  • Predicting the software attributes for schedules, cost, and quality.
stop just short of saying that quantitative estimation will be done from historical data (though a reader skimming them might well think otherwise).


Software disasters

GAO-93-13 on major software challenges: ``We have repeatedly reported on cost rising by millions of dollars, schedule delays of not months but years, and multi-billion-dollar systems that don't perform as envisioned.''

A few software "disasters" that I came across during the writing of this paper (1997-2000):

Evidently the prize for software disasters goes to the FAA's Advanced Automation System (AAS), an attempt at an improved air traffic control system that ran into trouble in the early 90s after an estimated $6.5 billion was spent. The AAS has been described as "one of the largest and most spectacular computing failures in the history of the field."

R. Britcher, one of the software engineers involved in the AAS project, describes it in the book The Limits of Software (Addison Wesley). Britcher himself characterizes the AAS as

"The greatest debacle in the history of organized work... we learned nothing from it"
The book is a good read.

Importantly, the AAS project did make use of software engineering best practices, such as code inspections, schedule estimates, etc. The software was estimated to be on schedule each and every month... until one month before it was to be delivered. Only at this point did it become evident that the project was hopelessly late.

Another relevant book is The Mythical Man Month. We've all heard of it; it is worth actually reading if you haven't done so yet.

Questioning the Paper

The paper requires assuming that the time to develop a program is (in the large) determined mostly by its complexity, and that Kolmogorov complexity is an appropriate definition of complexity (in the large). It may be possible to challenge the conclusions of the paper by challenging one of these assumptions.

I find both of them fairly self-evident.

If programming time does not depend somehow on complexity, what does it depend on? On the other hand, the nature of this relation is not clear; at most we can expect a monotonic relation between development time and complexity. For example, I would expect there might be a complexity ceiling due to our finite brains, beyond which development time goes to infinity. Development time is also understood here as being independent of any particular programmer: a particular programmer may have difficulty with some of the background material or patterns needed to implement a particular program, but the whole notion of software estimation requires abstracting away from that.

The phrase "in the large" is important. There might be a particular program that is low in complexity but takes a while to write -- perhaps a 100-line program is best written using a particular feature of the latest C++, but that feature requires installing a new compiler, which turns out to require modifications to a header file that you intended to use. These distractions take time, but that time becomes insignificant as the program grows from 100 to 10,000,000 lines.

Kolmogorov (algorithmic) complexity is a well established mathematical definition of the complexity of digital objects. It is a simple but deep concept that can be used to motivate and define notions such as randomness, Occam's razor, and ideas in the philosophy of learning. It also turns out to have a close relationship to entropy, with many parallel theorems; Kolmogorov complexity might be said to be the analog of entropy for individual objects.

To motivate algorithmic complexity, consider an idealized situation in which a very large program is to be written. After every milestone, the program is fully examined and common functionality is factored out. Imagine that this refactoring is done very well. The size of the program then approaches its algorithmic complexity.

Note that the choice of language eventually becomes irrelevant (the invariance theorem). If the language lacks features that make it elegant to code the desired program, at some point the most elegant refactoring is to literally write a new language. The new language can be implemented using the original language, and the cost of the interpreter or compiler is a constant that eventually becomes insignificant.
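
The invariance theorem makes "eventually irrelevant" precise: for any two universal languages U and V there is a constant c_{U,V} -- the cost of writing an interpreter for V in U -- such that for every string x

    K_U(x) <= K_V(x) + c_{U,V}

so complexities measured in the two languages differ by at most a constant that is independent of x.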

Challenge Scenario

You are a manager, and you believe that you can objectively estimate development schedules.

How long will it take your team to write a program to solve the Goldbach conjecture? (Idea from John Casti). By the equivalence of math and computation this is possible, and judging from other number theory proofs the program might even be quite short -- but this is for you to estimate.

The Goldbach conjecture has been unsolved for a couple of centuries, despite a million-dollar prize that was recently offered for its solution.

So here's a perfectly well-formed (no changing requirements) request to write what may be a fairly short program, but I don't think anyone can say how long it would take.
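
To make the connection between the conjecture and computation concrete: Goldbach's conjecture is equivalent to the claim that the short program below never halts, so settling the program's behavior settles the conjecture. A minimal Python sketch (the naive trial-division primality test is just for illustration):

    def is_prime(n):
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    def goldbach_search():
        # Halts (returning a counterexample) iff the conjecture is false.
        n = 4
        while True:
            if not any(is_prime(p) and is_prime(n - p)
                       for p in range(2, n // 2 + 1)):
                return n   # an even number that is not a sum of two primes
            n += 2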

Conclusions: Disclaimer, Programming vs. Manufacturing, Ethics

Disclaimer, repeated

Once again, I do not mean to say that software estimation, formal specifications, etc. are not useful! Regarding formal specifications: providing an alternate view of a program will help uncover errors that are not clear in its original form (code). More generally, software estimation probably does help. The paper only says this:

Claims of purely objective estimation are wrong.

Wrong, and arguably irresponsible as well.

Ethics

Irresponsible? A problematic scenario would be this: a software estimation consultant says that if you take his/her $10,000 course, you can transform your software group to produce bullet-proof software on time every time. The consultant manages to convince the head of R&D at a car company that the software estimation claims are accurate. The car company then decides that it can save $50 per car by removing some manual override and relying entirely on computer control...

Unquestioning faith in software estimation is probably unlikely in the arena of car safety, but computers are everywhere and plenty of smaller scale incidents do happen. Peter Neumann has been tracking such incidents for years, see his risks archives.

Software development: like manufacturing, or like physics?

The CMM (Capability Maturity Model) and the similar ISO-900x got started in an understandable way: the goal was to apply the success and reliability found in manufacturing to software development. Indeed, the CMM was conceived by Watts Humphrey, who writes that industry process control concepts

...are just as applicable to software as they are to automobiles, cameras, wristwatches, and steel.
(``Characterizing the Software Process,'' IEEE Software 5(2), pp. 73-79)

Clearly this is well intended, but there is an obvious difference between manufacturing and software. In manufacturing, you make a chair, then another one, then 10,000 more. Then perhaps you switch to a different chair (but it's still a chair) and you make 10,000 of these.

By the nature of software you are never repeating something you did before. If what you need to do is the same as what you did, you can simply make a digital copy. As such, there is no opportunity for the repeatability and codification that are present in manufacturing. This point is better made by Bollinger (cited in the paper), who responds to Humphrey,

The creation of genuinely new software has far more in common with developing a new theory of physics than it does with producing cars or watches on an assembly line.

Software construction is an intrinsically creative activity and as such has inherent risks. As a broad consequence, software estimation is necessarily subjective; good estimation therefore requires experience and judgment. The software industry should value human experience, intuition, and wisdom rather than claiming false objectivity and promoting entirely impersonal "processes". It should attend to the intrinsic uncertainties and risks of software development and, where necessary, promote the public discussion and honest assessment of these risks.

More feedback and quotes