Suppose we’ve had a recent error with a Kubernetes cluster. As often happens with a problem in our systems, we noticed it first as the visible error, which we could state as “Builds did not complete.” Now we want to trace backwards to figure out what happened. A common technique is the “Five Whys,” popularized by Lean thinking. So we ask “Why did builds not complete?” and we find “Kubernetes could not start the pod, and the operation timed out after 1 hour.”
We could certainly debate whether that’s a single “why” or two of them in one step, but that’s not the key topic right now. The main thing is that this is a straightforward statement about causality. “Pod no start” leads directly to “build no done.”
The next step in this analysis reveals that the pod would not start because a volume had filled up with too many files. Again, direct causality.
The tricky bit comes next. Why had the volume filled up with too many files? At this point, we’re likely to see a change in the nature of the explanation. Some variation of the following might be offered:
- The admin did not configure file purging.
- The cluster admins did not monitor for “volume full” conditions.
- The developers did not clean up files from old builds.
Do you notice how all these “causes” are stated in the form of something that didn’t happen? They are “counterfactuals.”
A counterfactual is a statement about how the world might be different now if something had happened differently in the past. It’s a kind of “alternate history” idea.
Here’s the rub: a counterfactual cannot be a cause. By definition, the counterfactual did not happen; therefore it cannot have caused anything. Only events that actually occur can cause other events. Causality should be stated in the form “Because X, then Y.” The statement “If not X, then not Y” is not an explanation; it is a kind of wishful thinking about how the past might have unfolded differently.
When performing Five Whys it is important to avoid this counterfactual leap. Stick to the events that actually occurred.
Notice in the incident analysis I outlined earlier, there are three counterfactuals listed. Each of them independently would have been sufficient to avert the incident. But these are hardly the only three counterfactuals we could construct:
- We used Kubernetes for our CI cluster instead of static VMs.
- We use CI instead of a human working at the command line.
- We put code in a repository instead of directly editing files on production instances.
I could go on, but you probably felt that the first three were somehow more reasonable than these three. In some way, the original set is “closer” to actual reality than this one. Nonetheless, I could keep constructing counterfactuals indefinitely. “If the Earth hadn’t been habitable, then we would not be here to care about our CI builds not finishing.” Once you start making counterfactuals, there’s really no end to them. Again, that’s because these are not events that happened. Only a finite number of events actually happened, so the chain of causality is finite. An infinite number of things didn’t happen, so we can always find more “missing things” to blame.
Speaking of Blame
This is also where people come into conflict when analyzing the chain of events. One person might posit a counterfactual about something a different person or team didn’t do. That person or team naturally bristles; it feels like they are being blamed. (And worse, blamed for not doing something, so they are being called negligent!) They feel impelled to put forward their own counterfactual, which might haul in yet another team. If the negative outcome was significant, this cloud of hypotheticals becomes a “blamestorm” looking to rain down on somebody. Defenses go up, and learning stops.
Counterfactuals are the condensation nuclei for blamestorms.
Using Counterfactuals For Good
The counterfactual leap indicates where people stop looking for causes and jump to thinking about solutions. Try to reformulate the counterfactual as a statement about future prevention:
- If we configure file purging, then this won’t happen again.
- If we monitor for “volume full” conditions, then this won’t happen again.
- If we clean up files from old builds, then this won’t happen again.
These are useful statements. When formulated this way, they’re clearly talking about the future, not hypothesizing an alternate history. (You might have noticed that I also snuck in a bunch of “we” statements in place of the more specific attributions above.)
As long as we remain clear that these counterfactuals are not the cause of the problem that already happened, but are changes to our reality that can prevent future occurrences, we can use them without inducing blamestorming.
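To make the first of those injections concrete, here is a minimal sketch of a purge job in Python. The build directory layout and the seven-day retention window are assumptions for illustration, not details from the incident; tune both to your own cluster.

```python
import time
from pathlib import Path


def purge_old_builds(build_dir, retention_days=7):
    """Delete build artifacts older than retention_days, then remove
    any directories the deletions left empty.

    The path and retention window are illustrative assumptions.
    """
    cutoff = time.time() - retention_days * 86400
    # Walk deepest paths first so files are handled before their
    # parent directories can be checked for emptiness.
    for path in sorted(Path(build_dir).rglob("*"), reverse=True):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
        elif path.is_dir() and not any(path.iterdir()):
            path.rmdir()
```

Run from a cron job or a Kubernetes CronJob, this makes cleanup a property of the system rather than something a person has to remember.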
As a practical technique, during a Five Whys or post-incident review, when someone poses a counterfactual as a cause, I suggest capturing it in the forward-looking version in a parking lot of potential changes.
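The second injection, monitoring for “volume full” conditions, could feed a recurring check like this sketch. It is POSIX-only (it relies on `os.statvfs`), and the 10% threshold is an arbitrary assumption. Note that a volume “full with too many files” often exhausts inodes before bytes, so the check looks at both.

```python
import os


def volume_nearly_full(mount, min_free_fraction=0.10):
    """Return True when free space or free inodes on the given mount
    drop below min_free_fraction. Threshold is an assumption; tune it."""
    st = os.statvfs(mount)
    space_free = st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    inodes_free = st.f_favail / st.f_files if st.f_files else 0.0
    return space_free < min_free_fraction or inodes_free < min_free_fraction
```

In practice you would wire this into whatever alerting you already run (a cron job, an exporter, a health endpoint) rather than a standalone script.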
Stepping Farther Away From Reality
This reformulation also helps weed out the more far-fetched counterfactuals… the ones that felt kind of “out there” or even silly before. Let’s try it with the second set from above:
- If we use static VMs instead of Kubernetes for our CI cluster, then this won’t happen again. (Possibly true statement, though somewhat lacking in support.)
- If we use a human working at the command line instead of CI, then this won’t happen again. (Probably. Humans are more adaptable and can figure out when to purge files. But there are likely to be other undesirable effects.)
- If we edit files directly on production instances instead of putting code in a repository, then this won’t happen again. (Umm… definitely a case where the cure is worse than the disease!)
This last one also lets me illustrate something about the counterfactuals from before. You might have felt more resistance to the second set because you were automatically thinking about the negative consequences if that statement had been true. Humans are very good at hypothesizing these counterfactuals. Faced with a bad outcome, our brains spontaneously and instantly conjure a large branching tree of alternate histories. And just as quickly, we prune from that tree the branches we know would produce other negative effects worse than the outcome we had. Just imagine: “If we provoke a nuclear war that ends civilization, then this CI build failure won’t happen again.”
So when I pose a counterfactual that says “if we edit files directly on production instances, this won’t happen again,” your instinctive response is to say, “yeah, but.” This is thinking two steps away from the current reality. Step 1 is to imagine the alternate history where the counterfactual had occurred. Step 2 is to extrapolate the negative consequences of that alternate history. Sometimes we go even further from reality by postulating still more counterfactuals to compensate for the negative consequences of the first one.
Counterfactuals don’t say anything about what actually happened. They express wishful thinking about an alternate history where the bad event didn’t happen. Because they represent “events that didn’t occur” they cannot have caused anything. However, stating a counterfactual can trigger an unhelpful round of blamestorming. Try to reformulate counterfactuals offered as explanations for past events so you can state them as injections to prevent recurrence. Of course, you must also contemplate what other effects those injections would have!
Watch out for the pitfall of counterfactuals when analyzing anything. It’s a common trap in post-incident reviews, retrospectives, project post-mortems, and other cases where you need to reconstruct a chain of events.