Wide Awake Developers

Root Cause Analysis as Storytelling

Humans are great storytellers and even better story-listeners. We love to hear stories so much that when there aren’t any available, we make them up on our own.

From an early age, children grasp the idea of narrative. Even if they haven’t yet mastered the forms of storytelling, you can hear a four-year-old weave a linked list of events from her day.

We look for stories behind everything. At a deep level, we want the world’s events to mean something. Effect follows cause, and causes have an actor to set them in motion.

Our sense of balance also demands that large effects should have large causes, with correspondingly large intent.

A drunk driver speeds through a red light, oblivious. A crossing car stops short. The shaken driver creeps home with a pounding pulse, full of queasy adrenaline. She unbuckles her daughter and hugs her tightly.

A drunk driver speeds through a red light, oblivious. A crossing car is in the intersection. The drunk smashes into it, right at the driver’s side door. The woman’s bloody face is hidden behind airbags. Her daughter sits in her new wheelchair at her mother’s funeral.

The difference between those stories is a matter of a split second in timing. There is absolutely no change in the motives or desires of anyone in the two vignettes. The first drunk, if caught, would get a jail term and a large fine. He would probably lose his driver’s license.

But most people would judge the motives of the second driver far more harshly. They would condemn him to a lengthy prison term and a lifetime ban on driving.

When we see a large effect, we expect a large cause, with a large intent.

The idea that some vast, horrible events strike randomly fills us with dread. People can’t bear the thought that a single unbalanced nobody can change the course of a nation’s history with one rifle shot, so they spend more than 50 years searching for “the truth.”

“Root Cause Analysis” expresses a desire for narrative. With the power of hindsight, we want to find out what went wrong, who did it, and how we can make sure it never happens again. But because we have the posterior event, we judge the prior probabilities differently. Any anomaly or blip suddenly becomes suspect.

People don’t look as hard at anomalies when nothing bad happens.

They don’t notice all the times the same weird log message pops up before … everything continues as normal.

When we look for “root cause,” what we are really trying to discern is not “what made this happen.” We are looking for something that would have stopped it from happening. We are building a counterfactual narrative—an alternate history—where that drunk driver dropped his keys in the parking lot and was thereby delayed a few crucial seconds.

Peel back the surface on a root cause analysis and you almost always see a formula that goes like this: “factor X” could have prevented this. “Factor X” was not present, therefore the bad event happened.

The catch is that there is usually an endless variety of possible counterfactuals. Often, more than one counterfactual narrative would have prevented the bad outcome equally well. Which one was the root cause? The non-existence of “factor X” or the non-existence of “factor Y”?

Next time you have a bad incident, why not focus your efforts in a different way? Work on learning from the times when things don’t go wrong. And be explicit about looking for many possible interventions that would have prevented the problem. Then select the ones with broad ability to prevent or impede many different problems.

Release It Second Edition in Beta

I’m excited to announce the beta of Release It! Second Edition.

It’s been ten years since the first edition was released. Many of the lessons in that book hold strong. Some are even more relevant today than in 2007. But a few things have changed. For one thing, the rise of the cloud means capacity management is much less of an issue, while developers are more exposed to networks than ever. And in this era of microservices, the open source world gives us more and better ops tools than we’ve ever had.

All of that motivated me to update the book for this decade. I’ve removed the section on capacity and capacity optimization and replaced it with a section that builds up a picture of our systems by doing a “slow zoom” out from the hardware, to single processes, to clusters, to the controlling infrastructure, and to security issues.

The first beta does not yet include two additional new parts on deployment and solving systemic problems. Those will be coming in the next few weeks.

In the meantime, I look forward to hearing your comments and feedback! Join the conversation in the book’s forum.

Spectrum of Change

I’ve come to believe that every system implicitly defines a spectrum of changes, ordered by their likelihood. As designers and developers, we make decisions about what to embody as architecture, code, and data based on known requirements and our experience and intuition.

We pick some kinds of changes and say they are so likely that we should represent the current choice as data in the system. For instance, who are the users? You can imagine a system where the user base is so fixed that there’s no data representing the user or users. Consider a single-user application like a word processor.

Another system might implicitly assume there is just one community of users, so there’s no data that represents an organization of users… it’s just implicit. On the other hand, if you’re building a SaaS system, you expect communities of users to come and go. (Hopefully, more come than go!) So you make whole communities into data because you expect that population to change very rapidly.

If you are building a SaaS system for a small, fixed market, you might decide that the population won’t change very often. In that case, you might represent a population of users in the architecture via instancing: one deployed instance per customer, rather than data describing each customer.

So data is at the high-energy end of the spectrum, where we expect constant change. Next would be decisions that are contemplated in code but only made concrete in configuration. These aren’t quite as easy to change as data. Furthermore, we expect that only one answer to any given configuration choice is operative at a time. That’s in contrast to data where there can be multiple choices active simultaneously.
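
To make the contrast concrete, here is a hypothetical sketch (the setting name is invented) of the same decision at two energy levels:

    import os

    # As configuration: exactly one answer is operative per deployment.
    MAX_LOGIN_ATTEMPTS = int(os.environ.get("MAX_LOGIN_ATTEMPTS", "5"))

    # As data: many answers can be active simultaneously, one per tenant.
    def max_login_attempts(tenant_settings: dict, tenant_id: str) -> int:
        return tenant_settings.get(tenant_id, 5)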

Below configuration are decisions represented explicitly in code. Constructs like policy objects, strategy patterns, and plugins all indicate our belief that the answer to a particular decision will change rapidly. We know it is likely to change, so we localize the current answer to a single class or function. This is the origin of the “Single Responsibility Principle.”
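
As a minimal sketch of that localization (all names are hypothetical), a strategy object puts the current answer in exactly one place, so changing the decision means swapping one object:

    from abc import ABC, abstractmethod

    class PricingPolicy(ABC):
        """The decision we expect to change, localized to one hierarchy."""
        @abstractmethod
        def price(self, base: float) -> float: ...

    class RegularPricing(PricingPolicy):
        def price(self, base: float) -> float:
            return base

    class HolidayPricing(PricingPolicy):
        def price(self, base: float) -> float:
            return base * 0.9  # seasonal discount

    def checkout(base: float, policy: PricingPolicy) -> float:
        # Callers never know which answer is currently in force.
        return policy.price(base)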

Farther down the spectrum, we have cross-cutting behavior in a single system. Logging, authentication, and persistence are the typical examples here. Would it be meaningful to push these up to a higher level, like configuration? What about data?

Then we have those things which are so implicit to the service or application that they aren’t even represented. Everybody has a story about when they had to make one of these explicit for the first time. It may be adding a native app to a Web architecture, or going from single-currency, single-language to multinational.

Next we run into things that we expect to change very rarely: cross-cutting behaviors that span multiple systems. Authentication services and schemas often land at this level.

So the spectrum goes like this, from the high-energy, rapidly changing “blue” end to the cool, sedate “red” end:

  • Data
  • Configuration
  • Encapsulated code
  • Cross-cutting code
  • Implicit in application
  • Cross-cutting architecture


The farther toward the “red” end of the spectrum we relegate a concern, the more tectonic it will be to change it.

No particular decision “naturally” falls at one level or another. We just have experience and intuition about which kinds of changes happen with greatest frequency. That intuition isn’t always right.

Efforts to make everything into data in the system lead to rules engines and logic programming. That doesn’t usually deliver the end-user control we expect. It turns out you still need programmers to think through changes to rules in a rules engine. Instead of democratizing the changes, you’ve made them more esoteric.

It’s also not feasible to hoist everything up to be data. The more decisions you energy-boost to that level, the more it costs. And at some point you generalize enough that all you’ve done is create a new programming language. If everything about your application is data, you’ve written an interpreter and recursed one level higher. Now you still have to decide how to encode everything in that new language.

Queuing for QA

Queues are the enemy of high-velocity flow. When we see them in our software, we know they will be a performance limiter. We should look at them in our processes the same way.

I’ve seen meeting rooms full of development managers with a chart of the year, trying to allocate which week each dev project will enter the QA environment. Any project that gets done too early just has to wait its turn in QA. Entry to QA becomes a queue of its own. And as with any queue, the more variability in the processing time, the more likely the queue is to back up and get delayed.
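
Queueing theory makes that claim precise. Kingman’s approximation for a single-server queue (standard queueing theory, not from the original post) gives the expected wait as:

    \[ E[W_q] \approx \left( \frac{\rho}{1 - \rho} \right)
                      \left( \frac{c_a^2 + c_s^2}{2} \right) \tau \]

where ρ is utilization, τ is the mean processing time, and c_a and c_s are the coefficients of variation of arrival and processing times. The wait grows linearly with the squared variability of processing times, and it explodes as utilization approaches one.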

When faced with a situation like that, the team may look for the “right number” of QA environments to build. There is no right number. Any fixed number of environments just changes the queuing equation but keeps the queue. A much better answer is to change the rules of the game. Instead of having long-lived (in other words, broken and irreproducible) QA environments, focus on creating a machine for stamping out QA environments. Each project should be able to get its own disposable, destructible QA system, use it for the duration of a release, and discard it.
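
As a sketch of what such a machine could look like (a hypothetical script, assuming Docker Compose as the environment substrate):

    import subprocess

    def create_qa_env(branch: str) -> str:
        """Spin up an isolated, disposable QA environment for one project."""
        name = f"qa-{branch}"
        subprocess.run(["docker", "compose", "-p", name, "up", "-d"], check=True)
        return name

    def destroy_qa_env(name: str) -> None:
        """Tear it down completely, volumes included, when the release ships."""
        subprocess.run(["docker", "compose", "-p", name, "down", "-v"], check=True)

Because every environment is built from scratch the same way, it is also reproducible, which long-lived environments never are.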

Availability and Stability

The last post covered technical definitions of fault, error, and failure. In this post, we will apply those definitions to a whole system.

Our context is a long-running service or server. It handles requests from many different consumers. Consumers may be human users, as in the case of a web site, or they may be other programs.

Engineering literature has many definitions of “availability.” For our purpose we will use observed availability. That is the probability that the system survives between the time a request is submitted and the time it is retired. Mathematically, this can be expressed as the probability that the system does not fail between time T_0 and T_1, where the difference T_1 - T_0 is the request latency.
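
In symbols:

    \[ A_{\mathrm{observed}} = \Pr\bigl(\text{no failure during } [T_0, T_1]\bigr),
       \qquad T_1 - T_0 = \text{request latency} \]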

(There is a subtle issue with this definition of observed availability, but we can skirt it for the time being. It intrinsically assumes there is some other channel by which we can detect failures in the system. In a pure message-passing network such as TCP/IP, there is no way to distinguish between “failed” and “really, really slow.” From the consumer’s perspective, “too slow” is failed.)

The previous post established that faults will occur. To maintain availability, we must prevent faults from turning into failures. At the component level, we may apply fault-tolerance or fault-intolerance. Either way, we must assume that components will lose availability.

Stability, then, is the architectural characteristic that allows a system to maintain availability in the face of faults, errors, and partial failures.

At the system level, we can create stability by applying the principles of recovery-oriented computing.

  1. Severability. When a component is malfunctioning, we must be able to cut it off from the rest of the system. This must be done dynamically at runtime. That is, it must not require changes to configuration or rebooting of the system as a whole.
  2. Tolerance. Components must be able to absorb “shocks” without transmitting or amplifying them. When a component depends on another component which is failing or severed, it must not exhibit higher latency or generate errors.
  3. Recoverability. Failing components must be restarted without restarting the entire system.
  4. Resilience. A component may have higher latency or error rate when under stress from partial failures or internal faults. However, when the external or internal condition is resolved, the component must return to its previous latency and error rate. That is, it must display no lasting damage from the period of high stress.

Of these characteristics, recoverability may be the easiest to achieve in today’s architectures. Instance-level restarts of processes, virtual machines, or containers all achieve recoverability of a damaged component.

The remaining characteristics can be embedded in the code of a component via Circuit Breakers, Bulkheads, and Timeouts.
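
For example, here is a minimal, single-threaded sketch of a Circuit Breaker (thread safety, half-open accounting, and metrics are all omitted):

    import time

    class CircuitOpenError(Exception):
        """Raised when a call is rejected because the breaker is open."""

    class CircuitBreaker:
        def __init__(self, threshold: int = 5, reset_timeout: float = 30.0):
            self.threshold = threshold          # consecutive failures before tripping
            self.reset_timeout = reset_timeout  # seconds before allowing a trial call
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise CircuitOpenError("failing fast; dependency is severed")
                self.opened_at = None  # half-open: permit one trial call
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                raise
            self.failures = 0
            return result

The breaker provides severability (a failing dependency is cut off at runtime) and tolerance (callers fail fast instead of inheriting the dependency’s latency).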

Fault, Error, Failure

Our systems suffer many insults when they contact the real world: flaky inputs, unreliable networks, and misbehaving users, to name just a few. As we design our components and systems to thrive in the only environment that matters, it pays to have a mental schema and a language for discussing the issues.

A fault is an incorrect internal state in your software. Faults are often introduced at component, module, or subsystem boundaries. There can be a mismatch between the contract a module is designed to implement and its actual behavior. A very simple example is accepting a negative integer or zero when a strictly positive integer was expected.

A fault may also occur when a latent bug in the software is triggered by an external or internal condition. For example, attempting to allocate an object when memory is exhausted will return a null pointer. If the software proceeds with the null pointer it can cause problems later, perhaps in a far distant part of the code.
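
A hypothetical Python analog of the same failure mode: the fault (a None where a real object belongs) is silent at its origin and only becomes visible later, far from the cause.

    def load_user(user_id, cache):
        # Fault: a missing entry silently yields None instead of a User.
        return cache.get(user_id)

    def greet(user_id, cache):
        user = load_user(user_id, cache)
        # The fault propagates; the error surfaces far from its origin:
        # AttributeError: 'NoneType' object has no attribute 'name'
        return f"Hello, {user.name}!"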

Such an incorrect state may be recoverable. A fault-tolerant module will attempt to restore a good internal state after detecting a fault. Exception handlers and error-checking code are efforts to provide fault-tolerance.

Another school of thought says that fault tolerance is unreliable. In this approach, once a fault has occurred, the entire memory state of the program must be regarded as corrupt. Instead of attempting to restore a good state by backtracking or patching up the internal state, fault-intolerant modules will exit to avoid producing errors. A system built from these fault-intolerant modules will include supervisor capabilities to restart exited modules.
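
A sketch of that supervisor style, in the spirit of Erlang’s “let it crash” philosophy (hypothetical, with the worker running as a separate OS process):

    import subprocess
    import time

    def supervise(cmd: list, max_restarts: int = 5) -> None:
        """Restart a fault-intolerant worker whenever it exits abnormally."""
        restarts = 0
        while restarts <= max_restarts:
            result = subprocess.run(cmd)
            if result.returncode == 0:
                return  # clean exit; the work is done
            restarts += 1
            time.sleep(min(2 ** restarts, 30))  # back off between restarts
        raise RuntimeError("worker keeps failing; escalate to a human")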

If a fault propagates in the system, it can produce visibly incorrect behavior. This is an error. Faults may occur without producing errors, as in the case of fault-tolerant modules that correct their own state before an error is observed. An error may be limited to an incorrect output displayed to a user. It can include any incorrect behavior, including data loss or corruption, network flooding, or launching attack drones.

At the component, module, or subsystem level, our mission is to prevent faults from causing errors.

A failure results when a system terminates without completing its job. For a long-running service or server, that means it stops responding to requests within a finite time. For a program that should run to completion and exit, it means exiting abnormally before completing. A failure may be preferable to an error, depending on the harm the error would cause.

Next time, I will address system stability in the face of faults, errors, and failures.

Power Systems

This is an excerpt from something I’m working on this Labor Day holiday:

Large-scale power outages act a lot like software failures. They start with a small event, like a power line grounding out on a tree. Ordinarily that would be no big deal, but under high-stress conditions it can turn into a cascading failure that affects millions of people. We can also learn from how power gets restored after an outage: operators must perform a tricky balancing act between generation, transmission, and demand.

There used to be a common situation where power would be restored and then cut off again in a matter of seconds. It was especially common in the American South, where millions of air conditioners and refrigerators would all start at the same time. When a motor starts up, it draws a lot of current. You can see this in the way the lights dim when you start a circular saw. As the motor starts to spin, though, it creates “back EMF,” a kind of backpressure against the electrical current. (That’s when the lights return to full brightness.)

If you add up the effects of millions of electric motors starting all at once, you see a huge upward blip in current draw, followed by a quick drop due to back EMF. Power transmission systems would see that spike and drop and propagate it to the generation systems: first demand would surge, then fall dramatically. The drop would make the generation systems think they should shut off some of the turbines. Right about the time they started reducing supply, the initial surge of back EMF would decline and the current load would come back up to baseline levels. The increased load hit just as supply was declining, and the excess demand tripped circuit breakers. Lights out, again.

Smarter appliances and more modern control systems have mitigated that particular failure mode now, but there are still useful lessons for us.

Remember DAT?

Do you remember Digital Audio Tape? DAT was supposed to have all the advantages of digital audio—high fidelity and perfect reproduction—plus the “advantages” of tape. (Presumably those advantages did not include melting on the dashboard of your Chevy Chevelle or spontaneously turning into The Best of Queen after a fortnight.)

In hindsight, we can see that DAT was a twilight product. As the sun set on the cassette era, DAT was an attempt to bridge the discontinuous technology change to digital music production. It was a twilight product because it didn’t sufficiently reimagine the existing technology to offer enough of a new advantage, nor did it eliminate enough of the old disadvantages.

We often see such twilight products right before a major, discontinuous shift.

I think we’re in such a period when it comes to software development and deployment for cloud native systems. The tools we have attempt to carry the traditional model into the new environment, but they don’t yet reimagine the world of software development enough. Ten years from now, we will see that they offered some advantages but also carried forward baggage from the client-server era: Unix-like full operating systems, coding one process at a time, treating the network as a secondary concern, and ignoring the memory hierarchy in our languages.

Whatever the “operating system for the cloud” is, we haven’t seen it yet.

QA Instability Implies Production Instability

Many companies that have trouble delivering software on time exhibit a common pathology. Developers working on the next release are frequently interrupted for production support issues with the current release. These interrupts never appear in project schedules but can take up half of the developers’ hours. When you include the cost of task-switching, this means less than half of their available time is spent on the new feature work.

Invariably, when I see a lot of developer effort going to production support, I also find an unreliable QA environment. It is unreliable in two senses: it is frequently unavailable for testing, and the system’s behavior in QA is a poor predictor of its behavior in production.

QA environments are often configured differently than production. Not just in the usual sense of consuming a QA version of external services, but also in more basic ways. Server topology may be different. Memory settings, capacity ratios, and the presence of network components can all vary. QA often has a much simpler traffic routing scheme than production, particularly when a CDN is involved.

The other major source of QA unavailability has to do with data refreshes. QA environments either run with a minuscule, curated test data set, or they use some form of backward migration from production data. Each backward migration can be very disruptive, leading to one or more days when QA is not available.

Disruption arises when testers have to do manual data setup in order to test new features. These setups get overwritten with the next refresh. Sometimes, production data must be cleansed or scrubbed of PII before use in QA. This cleansing process often introduces its own data quality problems. The backward migration process must also be kept up to date so it can propagate data back into the schema for the next release. This requires copying data and schema into QA, then forward-migrating the schema according to the new release.

When many teams contend to get into a QA environment, that contention can result in lost time as well. Time is lost in delays when one team cannot move their code into QA during another team’s test. It is also lost when one team overwrites test data that a different team had set up. And it can be lost when one team’s code has bugs that prevent other teams from proceeding with their tests. Suppose one team works on login and registration, while another team works on friend requests. Clearly, the friend requests team cannot do their testing when login is broken. This last issue also applies across service boundaries: a service consumer may not be able to test because the QA version of their service provider is broken.

Finally, problems in QA simply take a lower priority than problems in production. Thus, the operations team may be fully consumed with current production issues, leaving the QA environment broken for extended periods. In a vicious feedback loop, this makes it likely that the next release will also create production problems.

My recommendations are these:

  • Give priority to well-functioning test environments.
  • Virtualize your test environments, so you can avoid inter-team dependencies on a QA environment.
  • Automate the backward data propagation, and make it part of spinning up a QA environment. When you must scrub PII, automate that process so that every QA environment can draw from a snapshot of cleansed data without impinging on the production DBAs. (A sketch of that scrubbing step follows this list.)
  • If your QA stays unavailable because there are too many production issues, recognize that this is a self-sustaining pattern. You can temporarily redirect a “SWAT” team to fix QA and it will pay dividends for all future releases.
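
Here is a minimal sketch of that scrubbing step (all table and column names are hypothetical). The key property is that the fakes are stable, so joins and uniqueness constraints still hold after cleansing:

    import hashlib

    def _stable_fake(value: str, suffix: str = "") -> str:
        # The same input always yields the same fake, preserving joins.
        return hashlib.sha256(value.encode()).hexdigest()[:12] + suffix

    def scrub_user_row(row: dict) -> dict:
        cleaned = dict(row)
        cleaned["email"] = _stable_fake(row["email"], "@example.test")
        cleaned["name"] = "user-" + _stable_fake(row["name"])
        return cleaned

    def refresh_qa(snapshot_rows, qa_db) -> None:
        """Load a cleansed production snapshot into a fresh QA database."""
        for row in snapshot_rows:
            qa_db.insert("users", scrub_user_row(row))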

Wittgenstein and Design

What does a philosopher born in the 19th Century have to say about software design? More than you might think, particularly his ideas about family resemblance.

Wittgenstein used the subject of “games” to illustrate an idea. We’ll start with a counter-example. Suppose we operate with the then-prevailing notion that words are defined like sets in axiomatic set theory. Then there is a decision procedure that will let us decide whether something is a member of the set “games” or not. Such a decision procedure must include everything that is a game and exclude everything which is not a game. Can we define such a decision procedure?

Does a game require competition? Some do. Not all.

Does a game have a score? Or an objective? Not all.

Does a game involve more than one person? Not necessarily.

Is a game a frivolous expenditure of energy? Some are. Others have deep moral and philosophical lessons.

How is a game of football like a game of solitaire?

It’s easy to see that mancala and go have something in common… little rounded stones. But what do they have in common with Minecraft? Stones?

Wittgenstein said that this is not an issue for set theory. Instead, he talked about family resemblances. As described in Wikipedia, “things which could be thought to be connected by one essential common feature may in fact be connected by a series of overlapping similarities, where no one feature is common to all.”

For games, this means there is no single feature that makes something a game. Instead, there is a set of overlapping similarities that make things more gamelike or less gamelike. We can even think of things that share more of the features as being more like each other. So go and mancala share features like: two players, stones on a board, alternating turns, one winner, ancient, cerebral, positional. This makes them pretty similar. A professionally played team sport with a ball on a field shares few qualities with go. (Although “people excited about the outcome” and “positional” might be common.) So the feature-distance between go and football is large, yet they are both still games.

I think this relates to the tasks of software design and architecture. We have a strong tendency to go looking for nouns in our designs. Once we find a noun in a domain, we want to make a software artifact that captures all members of the set induced by that noun. But that only works if we stick with axiomatic set theory. Set theory works well for well-defined technical concepts and much less well for things in the human sphere.

Consider one simple example: the humble “name” field. Go read Falsehoods programmers believe about names. How do you feel about that “first name”, “last name” database structure now? After reading that list, how much can you confidently say about instances of a “Name” class? Or a “Name” service?

We have all these debates about “noun-first” or “verb-first”. Back in The Perils of Semantic Coupling, I argued for a behavior-oriented approach rather than a noun-oriented approach. Stop asking “what is this thing?” and instead ask “what can you do with it?” That leads us toward segregated interfaces.
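
A small, hypothetical sketch of such segregated, behavior-oriented interfaces in Python (using typing.Protocol; the names are invented):

    from typing import Protocol

    class Greetable(Protocol):
        def display_name(self) -> str: ...

    class Mailable(Protocol):
        def mailing_label(self) -> str: ...

    def greet(party: Greetable) -> str:
        # Depends on one capability, not on what the thing "is".
        return f"Hello, {party.display_name()}!"

Nothing here commits to what a person or an organization is; each interface names one thing you can do with it.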

Now I’d augment that to emphasize those feature descriptions rather than noun-like descriptions. Instead of noun-first or verb-first I’m going to try “adjective-first”.