Demonstrate

Stage four of The Method. Show the derivation holds.

In science, a claim counts for little until it is shown. You may have the most elegant hypothesis in the literature; without evidence it is a conversation, not a contribution. Code is no different, and the engineer who says “trust me, it works” is making a scientific claim with no experiment attached. An untested assertion about software is a rumor about software. It might be true. You have given no one, including yourself, a reason to believe it.

Demonstrate is the stage where the precise claim you formalised meets reality and survives — or doesn’t. It is the closing of the loop that Observe opened: you looked, you proposed, you stated it sharply, and now you show the derivation holds. The word show is doing real work. Not assert. Not believe. Show — with evidence another person, or a machine, or a future maintainer can inspect and re-run.

The sentence every tester should have memorized

Edsger Dijkstra wrote it in 1970, in Notes on Structured Programming, and it has lost none of its bite: “Program testing can be used to show the presence of bugs, but never to show their absence!”¹

This is not an argument against testing. Dijkstra tested. It is an argument about what testing is: evidence, not proof. A passing test shows that the system did the right thing on the path you exercised. By itself, it says little about the paths, states, inputs, timings, and environments you did not exercise. This is precisely why Demonstrate is layered rather than singular — why mature engineering does not rest on one kind of evidence but stacks several, each compensating for the others’ blind spots. Tests find the presence of bugs cheaply and continuously. Other instruments — review, types, proof — reach different blind spots. Understanding what each form of evidence can and cannot establish is the whole craft of this stage.

Testing, and the trade nobody likes to name

Begin with the workhorse. Does disciplined testing actually improve outcomes, or does it just feel responsible? One of the strongest industrial data points comes from Nagappan, Maximilien, Bhat, and Williams, who studied four teams at Microsoft and IBM adopting test-driven development. The result: pre-release defect density fell between 40% and 90% relative to comparable teams that did not practice TDD.² Treat that as evidence, not law: broader TDD research is mixed and context-sensitive, with later work explicitly warning that results vary by study design, environment, participant type, and measurement choices.³

Now the part that makes the finding trustworthy rather than promotional — the same study reports a cost: those teams saw initial development time increase by 15% to 35%.² That trade is the honest center of this stage, and pretending it away would discredit the whole method. You may pay time now to reduce the probability or cost of defects later. Whether the trade is worth it depends heavily on what a defect costs you — trivial for a prototype you’ll delete Friday, decisive for software that touches money, safety, or millions of users. The point of slow, intentional coding is not that the slow path is always right. It is that the trade should be made deliberately, with the real numbers in view, rather than defaulted away under deadline pressure and discovered later as an incident.

There is also a quieter dividend the defect numbers miss. Each test written to demonstrate a claim becomes a maintained sensor for that claim, left running for as long as the code and the test both remain meaningful. The regression test you add when you fix a bug is not really about today’s bug; it is a tripwire set for the day, two years from now, when a well-meaning refactor reintroduces the same fault. Demonstration done this way can accumulate. The suite is the precipitate of claims the system has had to defend, and when it is kept healthy, it keeps checking them long after everyone who remembers the original incident has moved on — which is also, not incidentally, how part of the theory of the system survives the people who built it.

Review demonstrates to a second mind — and that’s not mainly about bugs

The second layer is human. Code review is demonstration to another person: here is my change, here is why it is correct, show me where I’m wrong. But the most interesting empirical finding about review is that its value is not the value teams expect.

Bacchelli and Bird studied modern code review at Microsoft and found a revealing gap between motivation and reported outcome. Developers often value review for finding defects, but the study found that knowledge transfer, increased team awareness, and shared understanding of the changing system were among its most important realized benefits.⁴ Defects are found, but the underappreciated value is that review can spread the theory of the code across more than one head.

Set that beside Peter Naur’s claim from the Observe essay — that a program is a theory held in its builders’ minds, and dies when that theory is lost — and review reveals its deeper function. It is not merely a filter. It is one way the theory propagates, one way a system stops being legible to exactly one person, one way the bus factor begins to climb above one. Demonstration, done socially, is also partial insurance against the slow death Naur warned about. You are not only showing that the derivation holds; you are teaching someone else to see why you think it holds, which is one condition for the system enduring past your tenure on it.

When “show it” means prove it

At the far end of the evidence spectrum, demonstration stops being statistical and becomes deductive within stated assumptions. For most software this is overkill. For the narrow class of systems where a single bug is catastrophic, it can be the strongest honest standard — and, contrary to the folklore, it is achievable in real shipped systems.

The seL4 microkernel is the existence proof. In 2009, Klein and colleagues reported “the first formal proof of functional correctness of a complete, general-purpose operating-system kernel” — a machine-checked proof, in Isabelle/HOL, that the C implementation refines its abstract specification and rules out broad classes of implementation defects within the proof assumptions.⁵ Not merely “extensively tested.” Proven, mechanically, from spec to code, while still depending on the specification and the trusted assumptions beneath the proof.

The companion lesson comes from compilers. Yang, Chen, Eide, and Regehr built Csmith, a tool that generates random C programs to hunt for compiler bugs, and pointed it at many widely used C compilers. They found wrong-code bugs across the compilers they tested — except in one place. Their words: “the under-development version of CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task.”⁶ CompCert’s optimizer is formally verified, and the verified middle end withstood six CPU-years of automated assault that exposed wrong-code errors elsewhere.

State the nuance precisely, because accuracy is the currency of this stage: Csmith did find bugs in CompCert — in its unverified front end. The verified middle end, the part carried by machine-checked proof, is where Csmith could not find wrong-code errors. That is the actual finding, and it is stronger than the rounded-off version, because it isolates the variable. Where proof covered the code, this testing campaign did not find the same class of bugs. Where proof stopped, bugs appeared. Proof is not magic and it is not free, but it is real, it ships, and it does exactly what it claims on the part it covers.

The slogan, refuted by data

Now to the objection this entire series exists to answer, and it belongs here because Demonstrate is where the objection can at least be tested. Slow, intentional coding sounds nice, but the market rewards speed. Move fast. Ship. The careful teams get lapped.

One of the strongest bodies of software-delivery research reports the opposite pattern. In Accelerate, Forsgren, Humble, and Kim summarize years of State of DevOps research as evidence that speed and stability need not trade off.⁷ Elite-performing teams deploy more frequently and with shorter lead times and with lower change-failure rates and with faster recovery. In their data, the best teams are better on throughput and stability simultaneously. The two do not have to trade against each other; under the right practices, they can rise together.

And the mechanism is the subject of this stage. A major part of what lets high-performing teams move fast safely is automated demonstration — test automation and continuous integration, the practices Martin Fowler codified, where changes are integrated frequently and verified against an automated build and test process rather than batched into a perilous big-bang integration.⁸ The fast teams are not fast because they skip the evidence. They are fast because much of the evidence is automated and continuous, so it is less likely to become a bottleneck and less likely to let a regression travel far.

This dissolves the apparent paradox at the heart of “slow, intentional coding.” The slowness lives upstream — in Observe, Conjecture, Formalise, where thinking is cheap and haste is often ruinous. The speed lives here, in Demonstrate, where machines can check many formalised claims in seconds, on every commit, for as long as the pipeline and tests are maintained. Think slowly; verify quickly and automatically; and you get the elite pattern: deliberate where deliberation pays, fast where automated feedback is trustworthy. The teams that “move fast and break things” often have it inverted — hasty where it’s expensive, and then slow and frightened at exactly the moment a strong green pipeline should have let them be bold.

Demonstration must be reproducible, or it isn’t demonstration

There is a failure that hides inside even well-tested systems: evidence that only works on your machine. A demonstration no one else can reproduce is not a demonstration; it is an anecdote with a passing status icon.

The scale of this problem is documented and sobering in computer-systems research. Collberg and Proebsting examined 601 papers from top computer-systems conferences and journals and tried merely to obtain the code and build it — not to verify the results, just to get them to compile. Under realistic conditions resembling a reviewer’s — limited time, no help from the authors — only about a third, 32.3%, were “weakly repeatable.” With unlimited time the figure rose to 48.3%, and with authors’ own optimistic assurances to 54.0%.⁹ In other words: even in published systems research, more often than not, the demonstration was not straightforwardly re-runnable by independent readers under the study’s conditions.

For working software the discipline this implies is concrete. The build should be reproducible on a clean machine when the system is meant to outlive one developer’s environment. The test that passes only with your local environment variables and a database someone hand-seeded in 2023 is weak evidence; it is a story about evidence. Continuous integration is partly a reproducibility instrument: it forces the demonstration to run somewhere that is not your laptop, which is where its result starts to become meaningful to other people. And reproducibility is the seam to the final stage, because the thing that endures should be something others can rebuild, re-run, and trust without you in the room.

The honest limit: evidence is not proof, and confidence can be false

Every instrument in this stage has an edge past which it misleads, and a method worth following names them. Tests give confidence proportional to the paths they cover, which is always a sliver of the whole; high coverage measures how much code ran, not whether it was correct, and a suite can be green and wrong. TDD costs real time, and that time is wasted on throwaway work. Review quality degrades when reviewers are rushed or the change is too large to hold in one head. Proof is expensive, scales poorly, and rests on assumptions — seL4’s proof assumes the compiler, the assembler, and the hardware behave; Csmith found CompCert’s bugs exactly where the proof stopped.

The most treacherous part is that a green suite feels like proof, and the feeling scales with the number of tests rather than their quality. A thousand tests that all exercise the happy path establish, very thoroughly, that the happy path works — and tell you almost nothing about the error handling that will actually page you at 3 a.m. This is why test coverage is such a seductive and misleading number: it measures which lines ran during testing, not whether any assertion would have noticed them misbehaving. A line can be fully “covered” by a test that asserts nothing meaningful about it. Mutation testing exists to expose this gap — it deliberately injects faults and checks whether your tests fail, testing the tests — and recent industrial work argues that mutation testing can identify holes in a suite and provide concrete suggestions for additional tests, while also acknowledging scale and cost as adoption barriers.¹⁰ The discipline is to read a green suite as evidence proportional to its actual severity, never as a trophy.

The discipline, as everywhere in The Method, is proportion. Match the strength of your demonstration to the cost of being wrong. Cheap, fast, automated evidence — types, unit tests, CI — belongs wherever it creates fast feedback at acceptable maintenance cost. Expensive evidence — heavy property testing, model checking, formal proof — belongs near the load-bearing core where a silent defect is catastrophic. Demonstrating too little on the critical path is negligence. Demonstrating too much on the trivial path is its own kind of waste. The skill is knowing, honestly, which path you are on.

What Demonstrate looks like in practice

Reproduce, then fix, then regression-test. For durable code, a bug fix should ship with a test that fails before the fix and passes after. That test is maintained evidence that the specific bug is gone and less likely to return unnoticed.
Make CI the arbiter, not your laptop. If it isn’t green in a clean, shared environment, it isn’t fully demonstrated. The pipeline, not your local shell alone, decides what the team can trust.
Review for understanding, not nitpicks. One realized value of review is shared theory. Optimize for “now two people understand this,” not for style points a linter could have caught.
Use property-based tests for invariants. Don’t hand-pick only three inputs; state the property and let the generator search for counterexamples across a broader input space.
For the catastrophic core, reach further. Where a bug is measured in money, safety, or lives, escalate to model checking or proof. seL4 and CompCert show the bar is reachable when the stakes justify it and the scope is carefully chosen.
Automate the evidence so speed becomes safer. The elite pattern is slow thinking and fast, automatic checking. Every claim a machine can verify on every commit is a claim you can revisit with less fear, provided the check remains meaningful.

The handoff

Demonstrate ends with something rare and valuable: a claim that has survived an honest attempt to refute it, backed by evidence that others can re-run without you. Not “I believe it works.” Here is the proof that it does, and here is how to check it yourself. The derivation holds, and the holding is visible.

But surviving demonstration is a beginning, not an end. The question the final stage asks is harder and quieter than “does it work?” It is: does it deserve to last? Of everything that holds, what is worth carrying forward into the slow layers of the system, into the maintenance burden of years, into the minds of people who will read this long after you’ve forgotten you wrote it? Demonstrate produces candidates. Publish decides which of them endure.

Edsger W. Dijkstra, Notes on Structured Programming (EWD249), 2nd ed., 1970, §3: “Program testing can be used to show the presence of bugs, but never to show their absence!” E. W. Dijkstra Archive, UT Austin: cs.utexas.edu/~EWD/transcriptions/EWD02xx/EWD249.html. A near-identical line appears in the April 1970 NATO Software Engineering Techniques report. ↩
Nachiappan Nagappan, E. Michael Maximilien, Thirumalesh Bhat, and Laurie Williams, “Realizing quality improvement through test driven development: results and experiences of four industrial teams,” Empirical Software Engineering 13, no. 3 (2008): 289–302. DOI: 10.1007/s10664-008-9062-z. Pre-release defect density down 40–90%; initial development time up 15–35%, reported subjectively by management. ↩ ↩²
Mohammad Ghafari, Timm Gross, Davide Fucci, and Michael Felderer, “Why Research on Test-Driven Development is Inconclusive?” ESEC/FSE 2020, 2020. DOI: 10.1145/3382494.3410687. Useful caveat that TDD findings are contradictory and context-sensitive. See also Adrian Santos et al., “A Family of Experiments on Test-Driven Development,” arXiv: 2011.11942, for the claim that quality effects depend heavily on study characteristics. ↩
Alberto Bacchelli and Christian Bird, “Expectations, Outcomes, and Challenges of Modern Code Review,” Proc. 35th International Conference on Software Engineering (ICSE 2013), 712–721. DOI: 10.1109/ICSE.2013.6606617. Code review motivations include defect finding, but realized benefits include knowledge transfer, team awareness, and shared understanding. ↩
Gerwin Klein et al., “seL4: Formal Verification of an OS Kernel,” Proc. 22nd ACM Symposium on Operating Systems Principles (SOSP 2009), 207–220. DOI: 10.1145/1629575.1629596. “The first formal proof of functional correctness of a complete, general-purpose operating-system kernel.” ↩
Xuejun Yang, Yang Chen, Eric Eide, and John Regehr, “Finding and Understanding Bugs in C Compilers,” Proc. 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2011), 283–294. DOI: 10.1145/1993498.1993532. Csmith found wrong-code bugs in every compiler except CompCert’s verified middle end (bugs were found in its unverified front end). ↩
Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and DevOps (IT Revolution Press, 2018). Central finding: speed and stability are not a trade-off; elite performers excel at both, enabled by test automation and continuous integration. ↩
Martin Fowler, “Continuous Integration,” martinfowler.com (rev. May 1, 2006). martinfowler.com/articles/continuousIntegration.html. ↩
Christian Collberg and Todd A. Proebsting, “Repeatability in Computer Systems Research,” Communications of the ACM 59, no. 3 (March 2016): 62–69. DOI: 10.1145/2812803. Weak repeatability: 32.3% under realistic conditions, rising to 48.3% / 54.0% under more generous ones. ↩
Goran Petrović, Marko Ivanković, Gordon Fraser, and René Just, “Does Mutation Testing Improve Testing Practices?” ICSE-SEIP 2021, 2021. DOI: 10.1109/ICSE-SEIP52600.2021.00034. Describes mutation testing as a way to identify holes in test suites and guide additional tests, while noting adoption barriers around scale and cost. ↩