Epistemic Testing, Chapter 2 – Is that a Test or an Experiment?

“The Experiment Prepares, the Test Decides.”

Before we jump in, fair warning, this chapter gets a bit philosophical. Yeah, less code, more thinking. I know what you’re thinking: Didn’t he promise to stay practical?🤨 Don’t worry, I did, and I will. But some ideas just take a little brewing time. So grab your coffee or tea, get comfy, and let this one steep for a bit. It’s slower, deeper, but trust me, we need it.


From Belief to Evidence

Every piece of software begins with belief. We believe a user interface component will render correctly, a complex formula will yield the right calculation, or an API call will return the precise payload we need. We hold these beliefs because we designed them; in our minds, the logic is sound. But in the world of engineering, belief is not knowledge.


Proof is What Transforms Belief into Enduring Trust.

A test is our proof of work, a concrete, repeatable demonstration that an idea, a rule, or an assumption can survive contact with reality. Just as a metallurgist must strike the metal to know its strength, we must execute the code to know its truth. 

In software, this proof of work isn’t about unusual mathematical rigor; it’s about establishing evidence that our understanding of the system and the system’s actual behavior are in constant alignment. The profound beauty of software testing is that this evidence can be automated. A computer can replay our reasoning thousands of times per minute, confirming again and again that our assumptions remain intact.

When you run your test suite, you are not simply checking for green lights. You are replaying your system’s history of agreements (do you know what this phrase means in the xUnit community? The answer is revealed at the end of the chapter); each passing test is a small, self-executing certificate of trust between human intention and machine execution.


The Classic Ritual of Test/Verification

…that I follow but don’t quite love 😉


Chapter 1 introduced the Tea Mug Test to show that human-designed tests need verification. That verification process isn’t random; it follows an almost ritualistic pattern. In software engineering, there are two widely accepted approaches. One is the famous triple: Arrange, Act, Assert (AAA), yep, the acronym sounds nice. The other is Gerard Meszaros’ quadruple fixture setup: preparing the fixture, exercising the system under test, verifying the result, and tearing down the fixture, not quite as Twitter-friendly as AAA. (Hooray, thanks to Gerard, finally I found a way to use quadruple!) The two schools of thought are almost identical in essence. Since AAA is simpler and more widely adopted in the community, that’s what I’ll use going forward.

I have no idea who originally came up with AAA, though the naming is often credited to Bill Wake. If you know the history better, please mention it here!


AAA Gives you the Discipline That Turns Ideas into Measurable Proof

This is the ritual that transforms a fuzzy idea into measurable proof:

The first step in any verifiable experiment is to Arrange the environment. Think of the Tea Mug Test where you had to gather your equipment including the mug, the boiling water, and the thermal scanner. In your code, this translates to setting up the Input State. This is where you establish the necessary conditions and initial state for your test: you might create a new shopping cart instance, define a user with specific permissions, or mock an external dependency like a database or third-party service. This step is crucial because it isolates the code under test, ensuring that the results are purely due to the behavior being examined, not due to external factors.
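To make the Arrange step concrete, here is a minimal sketch of isolating the code under test with Python’s unittest.mock. The tax service and its rate_for() method are hypothetical names invented for illustration, not a real API:

```python
# Sketch of the Arrange step: a mocked external dependency isolates
# the code under test. The tax service and rate_for() are hypothetical.
from unittest.mock import Mock

# Arrange: a fake tax service stands in for the real external call
tax_service = Mock()
tax_service.rate_for.return_value = 0.2

# The code under test would ask the service for a rate; here the mock
# answers instantly and deterministically, with no network involved.
rate = tax_service.rate_for("EU")
assert rate == 0.2
```

Because the dependency is controlled, any failure later in the test points at the behavior under examination, not at the environment.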

The next stage is to Act, where you execute the core behavior you are interested in proving. In the Tea Mug Test, this was the moment you poured the boiling water and waited 20 seconds (do you know and remember why we wait 20 seconds?). Translated into software terms, this is where you execute the Code Under Test by calling the function or method that embodies the idea you are questioning, perhaps invoking cart.addItem(), running a complex internal function like calculateTax(), or triggering a specific service call. This is the stimulus applied to the isolated system, the action that will produce the measurable outcome.

Finally, we Assert the outcome. After performing the action, this step determines if reality matches the original belief. For the mug, this meant measuring the handle’s temperature and checking it against the predefined safe threshold, hopefully in the presence of the customer, if you are sure about your hands. In automated testing, this means using an expect() or assert() statement to verify the Expected Output. We compare the actual result of the executed code (e.g., the cart’s total) against the desired truth (e.g., the total should be $30). The passing assertion is the proof of work; it validates that the belief you held about this specific behavior has been objectively and repeatably proven by the machine.

Every test we write replays this careful sequence, transforming invisible thought into observable, measurable, and repeatable truth. We normally move forward from Arrange to Act and end with Assert, following the steps in order, except when we struggle to set up the arrangement. In that case, we move backward from the end to the start: Assert => Act => Arrange.
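As a sketch of the ritual end to end, here is what the three phases might look like in plain Python. The ShoppingCart class is a made-up stand-in, defined inline so the example is self-contained:

```python
# A minimal AAA sketch. ShoppingCart is a hypothetical stand-in
# for whatever system you are testing.
class ShoppingCart:
    def __init__(self):
        self.prices = []

    def add_item(self, price):
        self.prices.append(price)

    def total(self):
        return sum(self.prices)


def test_cart_total():
    # Arrange: establish the input state
    cart = ShoppingCart()
    cart.add_item(10)
    cart.add_item(20)

    # Act: execute the code under test
    total = cart.total()

    # Assert: compare the actual result against the expected output
    assert total == 30


test_cart_total()
```

Notice how each phase stays on its own lines; when a test grows, this separation is what keeps it readable.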


From AAA to Atomic Judgments(Rethinking the Ritual of Verification)

I Don’t Like the Classic Ritual

You might be thinking: Why? Huh? Hmm… Relax. This isn’t about making anyone feel guilty, it’s about uncovering why the classic ritual of testing can be overly bureaucratic and prone to pitfalls.

Let’s step back and zoom out on the ritual we follow in software verification. For decades, the community has relied on well-known sequences, whether it’s AAA (Arrange, Act, Assert) or Meszaros’ fixture-based quadruple (hahaha), to structure what we loosely call a test. Let’s start the debate with a confession: I personally believe these sequences have value. They give discipline, repeatability, and a sense of control. Yet when you actually look under the hood, cracks appear. The rituals hide ambiguity, overload responsibility, and often leave participants, developers, QAs, PMs, nodding in agreement while secretly talking past each other. Last but not least, they fuel endless, open-ended fights over the separation between unit, integration, and other kinds of tests. Let’s delve into some of the pains.

The first pain is semantic overload. The word test has become a Swiss army knife: it refers to setup, execution, observation, assertion, cleanup, all at once. When a developer says, I’ll write a test for the order feature, what does that mean? Are we preparing the environment? Running the code? Checking results? Cleaning up afterward? Each role, developer, QA, or PM, interprets it differently. This ambiguity leads to miscommunication, wasted effort, and opaque failures: when a test fails, what exactly failed?

The second pain is hidden complexity. Classic rituals hide the scaffolding behind a single label. Preparing state, mocking dependencies, executing behavior, capturing observations, cleaning up, it’s all swallowed by the word test. For small systems, this is manageable. For large systems, it becomes a fragile, expensive, and opaque process, because each test may hide dependencies, side effects, and subtle failures unrelated to the core judgment.

The third pain is loss of focus. When the test is everything, the actual judgment, the thing that answers, Did this behavior satisfy the expectation? is diluted! Developers spend more time orchestrating experiments than reasoning about behavior. A failing test no longer tells you, clearly, what went wrong.

To illustrate, let’s return to the Tea Mug Test. You wanted to know if a mug could safely hold boiling water. The experiment began with preparation: gathering a mug, boiling water, a thermal sensor, maybe even a brave colleague willing to hold the handle. Next came execution: pouring the water and waiting twenty seconds (yes, twenty seconds, do you remember why?). Then observation: the temperature readings, perhaps a cautious touch test. And finally, the judgment: Safe or Not safe.

//TODO image

Notice something? Only one small moment of that entire process was the actual test, the logical judgment that answered your question. Everything else, preparing the mug, boiling the water, watching, recording, cleaning up afterward, was the orchestration, the experiment that made the judgment meaningful. Without it, the test would have no context, no repeatability, and frankly, no credibility.

Now, translate this into software. Imagine verifying a shopping cart’s total after applying a discount:

# Experiment: verifying cart total with discount

# 1- Preparation
cart = ShoppingCart()
user = User(role='premium')
item1 = Item(price=10)
item2 = Item(price=20)
cart.add_user(user)
cart.add_item(item1)
cart.add_item(item2)

# 2- Execution
cart.apply_discount(0.1)

# 3- Observation
total = cart.total()

# 4- Test (assertion)
# This is the only part that counts as the test
assert total == 27  # The actual judgment, precise and atomic

# 5- Teardown
cart.clear()

Here’s the key insight: the assertion is the only thing that counts as the test; it’s the judgment. Everything else, setting up the cart, adding users and items, applying the discount, observing totals, cleaning up, is the experiment. Without the experiment, the judgment has no meaning; with it, the test is precise, repeatable, and verifiable.

Compare with the mug: checking the handle’s temperature is the test; the boiling water, waiting time, and sensor placement are the experiment. In the example above, the assertion on the total is the test; the cart setup, discount application, and observation are the experiment. And yes, thinking of tests this narrowly may feel weird at first, but it solves the three big pains of the classic ritual:

  1. Overloaded terminology: Test no longer means ten different things at once. It’s a single, atomic judgment.
  2. Hidden complexity: The experiment is explicit, you can see setup, execution, observation, and cleanup separately.
  3. Diluted focus: Failures now point precisely to the logical expectation that was violated. No more guessing whether the bug was in setup, execution, or the judgment itself.

To make it more fun, imagine if the Tea Mug Test used the classic software “everything is a test” approach:

  • You’d say, I tested the mug, but you actually poured, waited, recorded, measured, cleaned, and the word test would cover all of that. When someone asked if the mug was safe, you’d shrug: Well… maybe?
  • Now, with our new approach, you clearly say: “The experiment is pouring the water, waiting, and measuring; the test is the handle temperature being below 60°C.” Everyone knows exactly what was verified.

The point is clear: tests judge, experiments orchestrate. By making this distinction explicit, we reclaim clarity, precision, and repeatability. We no longer hide complexity under a single word; we expose it, structure it, and allow the machine to give us the truth in a reproducible, transparent way.

And yes, it’s fun. Because now, every time you write a test, you know exactly what it is and exactly what it isn’t. You know when you’re setting up the world, when you’re triggering behavior, when you’re watching, and when the machine finally gives you a verdict. The ritual transforms from a fuzzy habit into a structured, repeatable, observable, and persuasive practice, one that even a Tea Mug would approve of.


The Bureaucracies of Test

Let’s step away from code for a moment and return to the humble Tea Mug Test. Imagine you want to know if a mug is safe to handle with boiling water: you touch the handle to check its temperature.
Notice what we do not do: we do not create a mug just to test it. The mug exists; we interact with it to verify specific behaviors. Every action, like touching or measuring, is part of an experiment, but only the actual judgment, handle is safe or tea is brewed, counts as the test. The preparation, pouring, waiting, observing, and cleaning up are necessary bureaucracies that make the judgment meaningful, repeatable, and credible.

This is exactly the distinction that often gets blurred in software. The system or feature already exists or will soon exist, but the test isn’t about creating it. It’s about verifying its behavior under controlled conditions. Let me be clear: I’m a strong advocate of TDD. I know its rhythm, its essence, and its crucial role in shaping good design. But I draw a line between two things that are often tangled together: the object under test (its design and structure) and the experiments we run against its behaviors, which include the tests themselves.


It Is an Experiment Not a Test

Here is my proposed idea, and this is where the perspective shifts from ritual to clarity, we separate logical judgment from the bureaucracies that produce it.

  • The Test is the atomic, precise, machine-verifiable judgment. Nothing more, nothing less. Its sole purpose is to determine whether a specific expected outcome matches what actually happened. It does not prepare the environment, it does not execute code in context, it does not clean up afterward. It is pure verification. Think of it as the point of measurement in a scientific experiment; the verdict of truth distilled into a single, unambiguous form.
  • The Experiment is everything else: preparation, execution of behavior, capturing observations, cleaning up resources. The experiment creates the conditions in which the test can meaningfully operate. It isolates the system under scrutiny, ensures repeatability, controls for external factors, and records the observable effects of the behavior. Without the experiment, the test is meaningless; without the test, the experiment is a series of motions without judgment. Together, they form a disciplined, transparent, and repeatable process.

Consider the mundane task of verifying a shopping cart calculation. The experiment begins with preparation: creating a new cart, defining a user with permissions, adding items, mocking necessary services. Next, the experiment executes the behaviors: adding items, applying discounts, triggering calculations. Then, it observes the outcome: the total price, the applied discount, the final tax. Only at the end does the test make its judgment: is the total what it should be? This judgment is simple, precise, and unambiguous: the test. The orchestration that surrounds it is the experiment.

Something interesting happens here: the relationship between the test and the experiment. You can have a single test, yet run multiple experiments around it. Take our famous Tea Mug again: the test might be simple, is the handle safe to hold? But the experiments? Oh, they can go wild. You could pour water at different temperatures, try mugs made of glass, ceramic, or that mysterious microwave-safe metal your cousin swears by, or even compare black mugs versus bright yellow ones just to see if color makes a difference. Each experiment changes the world around the mug, but the test, the judgment about the handle’s safety, stays beautifully constant.
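A toy sketch can make this one-test, many-experiments shape concrete. The materials and the "conduction" numbers below are invented purely for illustration:

```python
# One atomic test (judgment), several experiments around it.
# The conduction factors are made-up numbers, purely illustrative.
SAFE_HANDLE_TEMP_C = 60


def handle_is_safe(handle_temp_c):
    # The test: a single, unchanging judgment
    return handle_temp_c < SAFE_HANDLE_TEMP_C


def run_experiment(material, water_temp_c):
    # The experiment: orchestrates conditions and yields an observation
    conduction = {"ceramic": 0.35, "glass": 0.45, "metal": 0.80}
    return water_temp_c * conduction[material]


# Many experiments, one constant test:
# ceramic and glass pass the judgment, metal fails it
for material in ("ceramic", "glass", "metal"):
    observed = run_experiment(material, water_temp_c=100)
    print(material, handle_is_safe(observed))
```

The experiment function is free to change wildly; the judgment never moves.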

Why is this distinction important? Because it transforms the mental model of verification:

  • Clarity of responsibility: Developers, QAs, and PMs can immediately know what is being verified, what is supporting infrastructure, and what is output. The test is the decision, the experiment is the context.
  • Reusability: The same test (judgment) can operate across multiple experiments with different setups or inputs. Conversely, the same experiment orchestration can host multiple tests without conflating responsibilities.
  • Precision: Narrowing the meaning of the test prevents ambiguity, improves communication, and ensures that failures are informative. When a test fails, you know exactly what logical expectation was violated.
  • Scalability: Large systems benefit from separating assertion from orchestration. Experiments can scale in complexity, tests remain concise and verifiable, and frameworks can orchestrate them efficiently.
  • Alignment with modern practices: TDD implicitly supports this distinction, with tests driving design while the scaffolding (preparation, execution, observation) is secondary. BDD mindset and tools already separate scenarios and examples, which map naturally to experiments and tests, respectively. By formalizing this distinction, we make TDD and BDD reasoning explicit, reducing ambiguity and debate.
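In code, the reusability point often shows up as table-driven tests: the judgment is written once, and each row of the table is a different experiment. A sketch, with a hypothetical ShoppingCart class invented for the example:

```python
# One judgment, many experiments, expressed as a table of cases.
# ShoppingCart is a hypothetical minimal class, not a real library.
class ShoppingCart:
    def __init__(self):
        self.prices = []

    def add_item(self, price):
        self.prices.append(price)

    def apply_discount(self, rate):
        self.prices = [p * (1 - rate) for p in self.prices]

    def total(self):
        return sum(self.prices)


# Each row is an experiment: (prices, discount, expected_total)
cases = [
    ([10, 20], 0.1, 27.0),
    ([10, 20], 0.0, 30.0),
    ([50], 0.5, 25.0),
]

for prices, discount, expected in cases:
    # Experiment: arrange and act
    cart = ShoppingCart()
    for price in prices:
        cart.add_item(price)
    cart.apply_discount(discount)
    # Test: the same atomic judgment, replayed per experiment
    assert abs(cart.total() - expected) < 1e-9
```

Frameworks like pytest formalize this pattern with parametrized tests, but the idea is the same: the table hosts the experiments, the assertion stays singular.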

In short, the ritual remains, but we remove the fuzziness. We keep the discipline, but we sharpen the focus. We stop letting test mean everything, and instead let it mean exactly one thing: precise judgment.

The result is a process that is formal yet flexible, precise yet readable, rigorous yet intuitive. It solves the pains of overloaded terminology, hidden complexity, and diluted focus. It transforms the act of verification from a murky ritual into a transparent, repeatable, and persuasive practice, a practice in which every participant, from developer to PM, understands exactly what is being measured, why, and how.


Experiments and Tests in Software

Let’s compare everything we have done so far.

Object | Experiment                                         | Test (judgment)
Mug    | Fill with boiling water, wait 20s, observe handle  | Handle temperature safe?
Mug    | Taste a sip                                        | Tea brewed properly?
Cart   | Add items, apply discount, observe total           | Total matches expectation?

In both cases, the test is tiny, focused, and atomic, while the experiment orchestrates conditions, actions, and observations.


A Fun, Persuasive Takeaway

Think about the Tea Mug Test again. If we used the classic software approach where everything is a test, we’d pour, wait, observe, clean up, and shrug when someone asked if the mug was safe: Well… maybe?

Now, with experiments and precise tests, clarity emerges. The experiment orchestrates preparation, actions, and observations. The test is a single judgment. For example: the handle is safe, the tea is brewed, the cart total is correct.

We’ve reclaimed precision, transparency, and repeatability. The ritual remains, we still follow structured steps, but tests judge and experiments orchestrate, no ambiguity, no overload, no secret complexity.

And yes, it’s fun. Because every time you write a test, you know exactly what it is and exactly what it isn’t. Every failure points to the logical truth. Every experiment produces repeatable, observable results. And your Tea Mug would approve.


Practices

You are developing a shopping cart for an online store. The cart should correctly calculate the total price after applying discounts.

Step 1: Define the Test (Judgment)
Before writing any code, write a single, atomic test statement:

The cart total should equal the sum of items minus the applied discount.

This is your test, the precise judgment. Nothing else, no setup, no execution, no teardown.

Step 2: Design the Experiment
Now, create the experiment that makes the test meaningful. Use AAA:

  • Arrange:
    • Create a cart
    • Add two items: Item1 ($10), Item2 ($20)
    • Assign a user with a “premium” role
    • Set a 10% discount
  • Act:
    • Apply the discount to the cart
    • Calculate the total
  • Assert (the test):
    • Verify that cart.total() == 27
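If you want to run the exercise, here is one possible sketch. ShoppingCart, User, and Item are hypothetical minimal classes invented for this practice, not a given API:

```python
# A runnable sketch of the practice. All classes are hypothetical
# stand-ins created just for this exercise.
class Item:
    def __init__(self, price):
        self.price = price


class User:
    def __init__(self, role):
        self.role = role


class ShoppingCart:
    def __init__(self):
        self.items = []
        self.user = None
        self.discount = 0.0

    def add_user(self, user):
        self.user = user

    def add_item(self, item):
        self.items.append(item)

    def apply_discount(self, rate):
        self.discount = rate

    def total(self):
        subtotal = sum(item.price for item in self.items)
        return subtotal * (1 - self.discount)


# Experiment, Arrange: build the world around the judgment
cart = ShoppingCart()
cart.add_user(User(role="premium"))
cart.add_item(Item(price=10))
cart.add_item(Item(price=20))

# Experiment, Act: trigger the behavior
cart.apply_discount(0.1)

# Test: the single atomic judgment
assert cart.total() == 27  # (10 + 20) * 0.9
```

Everything above the final assert is experiment; the assert alone is the test.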

Step 3: Experiment Variations
Run multiple experiments against the same test:

  • Different item prices
  • Different discount percentages
  • Different user roles (premium, regular, guest)

Notice how your single test (atomic judgment) stays the same, but the experiments change.

Step 4: Reflection
Answer these questions:

  1. Where exactly is your test, and where is the experiment?
  2. How does separating the judgment from orchestration make failures easier to understand?
  3. Can the same test work for multiple experiments? How does this improve reusability?

Takeaways:

  • Separate tests from experiments: tests mean judgments, while experiments refer to the bureaucracies and context around those judgments.
  • A test is proof of work, a living, verifiable artifact.
  • Experiments create repeatable conditions, making tests meaningful.
  • Use AAA or other sequences for experiments, not tests.
  • Separating test and experiment improves clarity, reusability, precision, scalability.
  • Tests emerge from understanding, not after it; they are a language of truth between humans and machines.
  • Focus on atomic assertions; keep tests precise and understandable.

A test is not just code that checks correctness. It is a small, living proof that an idea survives contact with reality, a measurable act of understanding that replaces fragile trust with enduring evidence.


Answer:

The idea of “replaying your system’s history of agreements” refers to Regression Testing, running previously agreed-upon tests to ensure that what once worked still works after changes.
