Wide Awake Developers

Data Is the New Oil


The other day I tweeted that “Data is the New Oil.” A lot of people retweeted it, but quite a few asked what I meant by that. I’ll amplify a bit to explain the analogy.

This ended up being a lot to unpack from a quick tweet! For quite a few years now, I’ve used Twitter as a way to scratch the itch of personal expression. A quick sound bite there, highly compressed and idiosyncratic, was just enough to relieve the mental pressure. As a consequence, I stopped blogging nearly as much. Lately, though, I feel the need for nuance and explanation, so I hope to do more in this space.

Oil was the resource that drove the industrial revolution in the 20th century. That was the age of oil and steel, according to economist and historian Carlota Perez. In Technological Revolutions and Financial Capital, Prof. Perez shows that every technology revolution goes through predictable phases, from irruption to exhaustion. Economics in the 20th C were totally defined by access to and movement of oil. Those who had it either had leverage or became victims, depending on their ability to create military and economic alliances. Oil reserves could put a nation on the world stage. A nation that bargained well with its oil would have power far beyond what its population size or technological ability would usually merit.

In fact, a large part of the U.S. economic dominance in the latter portion of the 20th C can be explained by the petrodollar. Since the Bretton Woods conference after WWII, oil transactions around the world were denominated in USD. If Saudi Arabia sold oil to China, then China had to pay SA in dollars. That meant China needed plenty of USD currency reserves and SA needed the US to hold riyal. (The biggest economic story in the world right now is not the DJIA hitting 26,000 or falling by 0.5% in a day… it’s that China, Russia, Saudi Arabia, and Iran are now trading oil denominated in rubles, yuan, and SDRs.)

But before the internal combustion engine, oil wasn’t a resource; it was a nuisance. The oil-rich land in Oklahoma is where the U.S. Government settled people it wanted to get out of the way. Oil gets in the way of farming. It was the development of new technology that turned oil from a hassle into a resource.

Once oil became a resource, a feedback loop got underway. More demand for oil led to more extraction, which caused industries to find new uses for the stuff. Plastics, fertilizers, etc. Increased demand drove increased supply and more efficient extraction, which in turn led to more demand.

Prof. Perez already identified the next technological revolution as information technology. However, I think her book got the timing wrong. It was published in 2002 and dated the start of the revolution to the advent of the personal computer in 1970. With the advantage of 16 years of additional observation, I think that there were two missing pieces: networking and machine learning. The real irruption of information technology started over the last decade. And as with the previous revolution, this one creates a need for a new resource: data.

Before this, data was a nuisance. It filled up disks and needed to be purged. It was often dirty (meaning not fully correct or conforming to syntactic rules) and incomplete. But toward the end of the 00’s, some people started to see it as a resource. You might spend a lot of time cleansing and canonicalizing small data sets. But with a lot of data, it’s impossible. At the same time though, you don’t need to clean the data to glean information. Some kinds of errors average out and interesting signals emerge.

(If only I had come up with the name “Big Data” instead of “Dirty Data!”)

Of course, we’re well beyond mere Big Data now. With every eye turning toward machine learning, we’ve got a new challenge for our data. That’s training.

A machine learning model is only as good as the training data. The training data itself needs to be classified. In other words, to train a machine to detect cars, you need a lot of photos where some are tagged “this has a car” and others don’t have that tag. Yes, some CAPTCHAs just might be using you to train a machine, instead of proving you aren’t one.

(Aside: we’re going to see a lot of conflict about biases in ML models. We will expect the machines to be free of human cognitive and social biases, but we’re training them with data created and classified by humans! We will actually be asking the machines to make errors in a systematic way to offset humans’ systematic errors in the training inputs. It’s not hard to see why HAL 9000 went spare.)

Data is digital, but it’s not easy to move around in these quantities. We’re not talking about a station wagon full of tapes barreling down the highway… we’re talking about a convoy of 18-wheelers loaded with racks full of disks.

Companies that have tagged or classified data sets are the new oil producing and exporting countries. If you have large quantities of classified photos, video, voice, text, etc. you are well-positioned to train ML models. If you don’t have such a dataset, then you need to create a consumer-oriented startup to get humans to do the initial classification for you or you need to license access to data from one of the big players. (There are some open-access datasets that hobbyists can use, but those will never be as large or as current as the proprietary data sets.) Alternatively, focus on providing the engineering support and tooling for the technostates that have the data, the same way that Norway provides engineering to Saudi Arabia.

Just as oil production led to new uses of oil that reshaped everything from consumer products to food production to hygiene, I fully expect data-fueled ML models to reshape this century. Moreover, we will see demand for ever-greater data production from our homes, workplaces, and devices. This will cause tension and conflict about data use just as happened with land-use, water-use, and mineral rights. That will lead to new legal regimes and doctrines. In extreme cases, it may lead to revolutions similar to the Revolutions of 1848 in Europe.

Coherence Penalty for Humans


This is a brief aside from my ongoing series about avoiding entity services. An interesting dinner conversation led to thoughts that I needed to write down.

Amdahl’s Law

In 1967, Gene Amdahl presented a case against multiprocessing computers. He argued that the maximum speed increase for a task would be limited because only a portion of the task could be split up and parallelized. This portion, the “parallel fraction,” might differ from one kind of job to another, but it would always be present. This argument came to be known as Amdahl’s Law.

When you graph the “speedup” for a job relative to the number of parallel processors devoted to it, you see this:

[Figure: Amdahl’s Law speedup curve (AmdahlsLaw.svg)]

The curve approaches an asymptote determined by the serial fraction, so there is an upper limit to the speedup.
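
For reference, here’s the formula behind that curve as a tiny Python sketch (p is the conventional symbol for the parallel fraction; the 95% figure is just an illustration):

```python
def amdahl_speedup(p, n):
    """Speedup of a job with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# With a 95% parallel fraction the curve flattens out near 1 / (1 - 0.95) = 20.
for n in (1, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

No matter how many processors you add, a job that is 95% parallelizable never beats a 20x speedup.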

From Amdahl to USL

The thing about Amdahl’s Law is that, when Gene made his argument, people weren’t actually building very many multiprocessing computers. His formula was based on first principles: if the serial fraction of a job is exactly zero, then it’s not a job but several.

Neil Gunther extended Amdahl’s Law based on observations of performance measurements from many machines. He arrived at the Universal Scalability Law. It uses two parameters to represent contention (which is similar to the serial fraction) and incoherence. Incoherence refers to the time spent restoring a common view of the world across different processors.

Within a single CPU, incoherence penalties arise from caching. When one core changes a cache line, it tells the other cores to eject that line from their caches. If they need to touch the same line, they spend time reloading it from main memory. (This is a slightly simplified description… but the more precise form still carries an incoherence penalty.)

Across database nodes, incoherence penalties arise from consistency and agreement algorithms. The penalty can be paid when data is changed (as in the case of transactional databases) or when the data is read (as in the case of eventually consistent stores).

Effect of USL

When you graph the USL as a function of number of processors, you get the green line on this graph:

[Figure: USL throughput curve versus Amdahl’s Law prediction (gman-scale4.png)]

(The purple line shows what Amdahl’s Law would predict.)

Notice that the green line reaches a peak and then declines. It means that there is a number of nodes that produces maximum throughput. Add more processors and throughput goes down. Overscaling hurts throughput. I’ve seen this in real-life load testing.
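
If you want to play with that curve yourself, here’s a minimal sketch of Gunther’s formula. Sigma (contention) and kappa (incoherence) are the usual parameter names; the values below are invented purely for illustration:

```python
import math

def usl_throughput(n, sigma, kappa):
    """Relative capacity at n processors under the Universal Scalability Law.
    sigma models contention (the serial fraction), kappa models incoherence.
    With kappa = 0 this reduces to Amdahl's Law."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

def peak_node_count(sigma, kappa):
    """Node count beyond which adding processors reduces throughput."""
    return math.sqrt((1 - sigma) / kappa)

# Invented parameters: 5% contention, a small incoherence penalty.
sigma, kappa = 0.05, 0.0002
print("peak at about", round(peak_node_count(sigma, kappa)), "nodes")
for n in (8, 32, 69, 128, 256):
    print(n, round(usl_throughput(n, sigma, kappa), 1))
```

With those made-up parameters, throughput peaks at roughly 69 nodes and declines from there.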

We’d often like to increase the number of processors and get more throughput. There are exactly two ways to do that:

  1. Reduce the serial fraction
  2. Reduce the incoherence penalty

USL in Teams?

Let’s try an analogy. If the “job” is a project rather than a computational task, then we can look at the number of people on the project as the “processors” doing the work.

In that case, the serial fraction would be whatever portion of the work can only be done one step after another. That may be fodder for a future post, but it’s not what I’m interested in today.

There seems to be a direct analog for the incoherence penalty. Whatever time the team members spend re-establishing a common view of the universe is the incoherence penalty.

For a half-dozen people in a single room, that penalty might be really small. Just a whiteboard session once a week or so.

For a large team across multiple time zones, it could be large and formal. Documents and walkthroughs. Presentations to the team, and so on.

In some architectures coherence matters less. Imagine a team with members across three continents, but each one works on a single service that consumes data in a well-specified format and produces data in a well-specified format. They don’t require coherence about changes in the processes, but would need coherence for any changes in the formats.

Sometimes tools and languages can change the incoherence penalty. One of the arguments for static typing is that it helps communicate across the team. In essence, types in code are the mechanism for broadcasting changes in the model of the world. In a dynamically typed language, we’d either need secondary artifacts (unit tests or chat messages) or we’d need to create boundaries where subteams only rarely needed to re-cohere with other subteams.

All of these are techniques aimed at the incoherence penalty. Let’s recall that overscaling causes reduced throughput. So if you have a high incoherence penalty and too many people, then the team as a whole moves slower. I’ve certainly experienced teams where it felt like we could cut half the people and move twice as fast. USL and the incoherence penalty now help me understand why that was true: it’s not just about getting rid of deadwood. It’s about reducing the overhead of sharing mental models.

In The Fear Cycle I alluded to codebases where people knew large-scale changes were needed, but were afraid of inadvertent harm. This would imply a team that was overscaled and never achieved coherence. Once lost, it seems to be really hard to re-establish. That means ignoring the incoherence penalty is not an option.

USL and Microservices

By the way, I think that the USL explains some of the interest in microservices. By splitting a large system into smaller and smaller pieces, deployed independently, you reduce the serial fraction of a release. In a large system with many contributors, the serial fraction comes from integration, testing, and deployment activities. Part of the premise for microservices is that they don’t need the integration work, integration testing, or delay for synchronized deployment.

But, the incoherence penalty means that you might not get the desired speedup. I’m probably stretching the analogy a bit here, but I think you could regard interface changes between microservices as requiring re-coherence across teams. Too much of that and you won’t get the desired benefit of microservices.

What to do about it?

My suggestion: take a look at your architecture, language, tools, and team. See where you spend time re-establishing coherence when people make changes to the system’s model of the world.

Look for splits. Split the system with internal boundaries. Split the team.

Use your environment to communicate the changes so re-cohering can be a broadcast effort rather than one-to-one conversations.

Look at your team communications. How much of your time and process is devoted to coherence? Maybe you can make small changes to reduce the need for it.

Services by Lifecycle


This post took a lot longer to pull together than I expected. Not because it was hard to write, but because it was too easy to write too much. Like a pre-bonsai tree, it would grow out of control and get pruned back over and over.

In the meantime, I delivered a workshop and spent some lovely holiday time with my family.

But it’s a new year now, and January is devoid of holidays so it’s high time I got back to business.

Avoiding the Entity Service

In my last post, I made a case against entity services. To recap, an entity service is a set of CRUD operations on a business entity such as Person, Location, Contract, Order, etc. It’s an antipattern because it creates high semantic and operational coupling. Edge services suffer from common-mode failures through their shared dependency on the entity services. Changes or outages in the entity services have large “failure domains.”

A lot of good advice springs from Eric Evans’s hugely influential book “Domain-Driven Design.” It was written before the service era, but seems to apply well now. I’m not an expert on DDD, though, so I’m going to offer some techniques that may or may not be described there. (I dig the “bounded context” idea, but need to re-read the whole book before I comment on it more.)

There are several ways to avoid entity services. This post explores just one (though it’s one I particularly like.) Future posts will look at additional techniques.

Focus on Behavior Instead of Data

When you think about what a service knows, you always end up back at CRUD. I recommend thinking in terms of the service’s responsibilities. (And don’t say it’s responsible for knowing some data!) Does it apply policy? Does it aggregate a stream of concepts into a summary? Does it facilitate some kinds of changes? Does it calculate something? And so on.

Once you know what a service does, you can figure out what it needs to know. For instance, a service that restricts content delivery based on local laws needs to know a few things:

  1. What jurisdiction applies?
  2. What classifiers are on the content?
  3. What classifiers are not allowed in that jurisdiction?

Notice that #1 and #2 are specific to an individual request, while #3 is slowly-changing data. Thus it makes sense for the service to know #3 and be told #1 and #2.
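
To make that concrete, here’s a minimal sketch (the names and the rule table are invented, not from any real system). The service holds the slowly-changing rules and gets told the per-request facts:

```python
# Hypothetical sketch: the service *knows* the slowly-changing rules (#3)
# and is *told* the per-request facts (#1 and #2).
BLOCKED_CLASSIFIERS = {
    "JURISDICTION-A": {"gambling", "tobacco-ads"},
    "JURISDICTION-B": {"gambling"},
}

def delivery_allowed(jurisdiction, content_classifiers):
    """True if none of the content's classifiers are blocked in this jurisdiction."""
    blocked = BLOCKED_CLASSIFIERS.get(jurisdiction, set())
    return blocked.isdisjoint(content_classifiers)

print(delivery_allowed("JURISDICTION-A", {"news", "gambling"}))  # False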

This leads us to a deeper discussion about what the service knows. How does that data get into the service? Is there a GUI for a legal team? Maybe there’s a feed we can purchase from a data vendor. Maybe we need to apply machine learning based on lawsuits and penalties incurred when we violate local laws. (I’m kidding! Don’t do the last one!) The answers to these questions will firm up your design and situate your service in its ecosystem.

Model Like It’s 1999

When modeling, I like to use a technique from object-oriented design. CRC cards let me lay out physical tokens that represent services. Then I can play a game where I simulate a request coming in at the edge. A service can add information to a request and pass it along to another service, following the “Tell, Don’t Ask” principle.

If you are in a team, you can deal out cards to players then simulate a request by passing a physical object around. That will quickly reveal gaps in a design. Some common gaps that I see:

  1. A service doesn’t know where to send the request. It lacks knowledge of other services that can continue processing. The solution is either to statically introduce it to the next party or to provide URLs in the data that lead to the handler.
  2. A service receives a request that is insufficient. The incoming request either lacks information or has an implicit context that should be turned into data on the request.

While playing the CRC game, it’s OK to assume your service already has the data it naturally depends on. That is, populating the slowly-changing data the service uses should be treated as an asynchronous process relative to the current request. But do make note of that slowly-changing data so you remember to build the flows needed to populate it.

If you follow “Tell, Don’t Ask” strictly, then the activation set will be a strict tree. Anywhere a service calls out to more than one downstream, it should be sending instructions forward in parallel rather than serially making queries followed by an instruction.
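
Here’s a rough sketch of the difference in shape, with invented service and field names. The “ask” style queries a downstream service and then issues a command; the “tell” style enriches the request with what it already knows and hands it forward:

```python
# Invented names, just to contrast the two shapes of interaction.
LOCAL_TAX_TABLE = {"MN": 0.07375}     # slowly-changing data this service already has

def handle_ask_style(order, tax_service, invoice_service):
    # "Ask": query a downstream service, compute here, then issue a command.
    rate = tax_service.get_rate(order["region"])
    invoice_service.create(order["id"], order["subtotal"] * (1 + rate))

def handle_tell_style(order, invoice_service):
    # "Tell": add what we know to the request and pass it forward in one hop.
    order["tax_rate"] = LOCAL_TAX_TABLE[order["region"]]
    invoice_service.handle(order)     # the downstream completes the work
```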

Dealing with Consistency

If it were just a matter of passing requests along with extra data, then life would be simple. As often happens, trouble comes from side effects.

Services are not pure functions. If a service call always returns the same result for the same parameters, then you don’t need a service. Make a library and avoid the operational overhead! Services only make sense when they change something in the world. That means state and state changes are unavoidable concerns in a service-based architecture.

Consistency immediately comes up as an issue. Many words have already been written about CAP. Some good, some misguided, and some pure marketing. I even wrote an earlier post about the subtle differences between the C in CAP versus the C in ACID.

Let’s look at one way to deal with consistency in the face of changing state.

Divide Services by Lifecycle in a Business Process

Many business processes have entities that go through a series of milestones. In a particular state, changes are allowed to certain attributes but not others. Once a subset of the properties are “valid” (whatever that means) the entity can transition to the next stage of the business process.

Instead of viewing this as a single entity with a bunch of booleans, or a CURRENT_STATE attribute (which implies a state machine that is unknown to consumers), we can view each state as a different thing.

For example, consider this process from a peer-to-peer lending situation:

  1. A loan requestor starts by creating a project proposal. The requestor can provide descriptive text, an amount to request, some media assets (projects with big vivid pictures get funded faster.)
  2. Once the loan request is completed, the requestor submits it for approval. At this point, the requestor is no longer allowed to change the amount requested.
  3. An analyst from the host company reviews the proposal. In parallel, a background job checks the requestor’s credit score, repayment history, and criminal record.
  4. The analyst reviews the request and either assigns a target interest rate, rejects the request outright, or indicates that more information is needed from the requestor.
  5. Once approved, the proposal is visible to funders. Funders can commit to a certain amount (contingent on their credit scores.) At this stage, none of the proposal information can be changed, although the whole proposal could be withdrawn.
  6. Once fully funded, funders must transfer money within 3 days. No additional funders are allowed to join the project at this time, but they can go on a waiting list in case some of the committed funders fail to supply the money.
  7. Once funds are in the funders’ accounts, it all gets transferred into a short-term holding account. The project information and all the individuals’ information (tax IDs, etc.) go to an underwriter who produces a legal loan document for all parties to sign.

For the moment, we’re leaving out some of the tributaries of this flow.

Notice how moving through the business process causes previous information to become effectively read-only?

The original form of this system was a monolith that had a state field on a “Loan” model object. That was a wide object, since it had everything from the initial proposal through to the ultimate payment. If we made that into a “Loan” microservice, we would end up with exactly an entity service, CRUD operations, and high coupling into that service, as shown below.

[Diagram: with a Loan entity service]

Try playing CRC with this design. You’ll find that all requests reach the Loan service.

What is less evident from the diagram is the cost of embedding a state model directly into the entity service. If we put a state field on the Loan, then every Loan must go through the same state machine. It locks us into a single kind of business process. At the time, we already knew the company was exploring direct-funded loans through a banking partner. So there would be a minimum of two flavors of process. (Or one process with proliferating branches.)

I briefly considered using my DEVS library to represent the state plus state machine as EDN data on each Loan, but ultimately decided against it.

Instead, I thought we could make each state into its own service, as shown here.

[Diagram: with lifecycle services]

Now as the business process moves along, we’re really sending documents to each service. For example, from Proposal to Project, we send a “ProjectStarter” document that contains all the attributes needed for a Project. When the analyst approves a project, the analyst GUI (or backend for same) creates a “LoanStarter” and sends it to the UnfundedLoan service. Likewise, once all funding is received, the “Collection” service creates a “LoanPackage” document and sends it to the “Underwriting” service. (That’s “collection” as in “gatherer of documents” not “collection” as in “break your kneecaps.”) Further downstream, we set up a schedule of payments to receive from the requestor and payments to issue to the backers. We also keep a set of ledgers to track balances per account.

Each of the services has facilities to add or update information relevant to that service. It ignores anything in the incoming documents that it doesn’t need.
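
A minimal sketch of that behavior might look like this (the field names are my invention for illustration; the real services would define their own document formats):

```python
# Hypothetical sketch of the UnfundedLoan service accepting a LoanStarter document.
UNFUNDED_LOAN_FIELDS = ("proposal_id", "amount", "target_rate", "requestor_id")

def accept_loan_starter(document):
    """Create an unfunded loan from an incoming LoanStarter document.
    Fields this service doesn't care about are silently ignored, so upstream
    services can add information without breaking this one."""
    return {field: document[field] for field in UNFUNDED_LOAN_FIELDS}
```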

This gives us a lot of flexibility in how we build the overall process.

Consider our direct-funding scenario. We need a new “DirectFunding” service that finds suitable candidates. It sends a document out to the bank and receives a response. On a favorable response, DirectFunding can create its own version of the LoanPackage document for underwriting. In other words, treating these stages as services connected by well-defined document formats allows us to introduce more pathways without creating the state machine from hell.

As an additional benefit, we can easily monitor the flow of documents to see when the process is healthy. We can monitor each service’s activity to create a cumulative flow diagram. We get a lot of visibility. And since some stages are triggered by humans (e.g., the analysts) we can even figure out how our staff model must scale with business throughput.

It should also be clear that this style works well with event transfer instead of document transfer. It would be natural to put all the documents onto a message bus.

Overall, I think this style offers a nice degree of alignment between the technology and the business. The only “downside” I can see is that it requires a service developer to understand how that service contributes to value streams and business processes.

Backtracking, Errors, and Races

There is still a minute window of opportunity for perceived inconsistency to sneak in. For example, what happens if the requestor tries to change the proposal while the analyst is reviewing it? Or worse, what if they change it in those milliseconds between when the analyst clicks “Approve” in the GUI and when the document goes over to the Project service? For that matter, how do we tell the Proposal service that the proposal can no longer be edited without withdrawing the request and resubmitting as a new Project?

This post is already getting too long, so I’m going to answer those questions next time. It shouldn’t take another month since we’re past the holiday-fun-times and into the serious winter months.


If you’re interested in learning more about breaking up monoliths, you might like my Monolith to Microservices workshop.

I’m hosting a session this March in sunny Florida. Especially to all my dear friends and colleagues back home in Minnesota… we know that March is a great time to not be in Minnesota.

Or, contact me to schedule a workshop at your company.

The Entity Service Antipattern


In my last post I talked about the need to keep things separated once they’ve been decoupled. Let’s look at one of the ways this breaks down: entity services.

If a pattern is a solution to a problem in a context, what is an antipattern? An antipattern is a commonly-rediscovered solution to a problem in a context that inadvertently creates a resulting context we like less than the original one. In other words, it’s a pattern that makes things worse (according to some value system.)

I contend that “entity services” are an antipattern.

To make that case, I need to establish that “entity services” are a commonly-rediscovered solution to a problem and that the resulting context is worse than the starting context (a monolith.)

Let’s start with the “commonly-rediscovered” part. Entity services are in Microsoft’s .NET microservices architecture ebook. Spring has a tutorial with them. (Spring may give us the absolute easiest way to create an entity service. The same class can be annotated with JSON mapping and persistence mapping.) RedHat has a Microservice Reference Architecture with product-service and sales-service. Some of the microservice-focused frameworks such as JHipster start with CRUD on data entities.

In order to make the case that the resulting context is worse than the starting context, I need to assume what that starting context actually is. For the sake of generality, I’ll assume a largish, legacy application that is more-or-less a monolith. It may call out to some integration points to get work done, but features are pretty much local and in-process. There are multiple instances of the process running on different hosts. Basically, like the following diagram.

All features reside in the code for the application instances.

Many other authors have enumerated the sins of the monolith, so I won’t belabor them here. (Though I feel compelled to make a brief aside to say that we did somehow deliver quite a lot of working, valuable features that ran in monoliths.)

How might we describe this initial context?

  • It is clear where the code to build a feature goes and how to test the code.
  • The release cadence is dictated by the slowest-delivering subteam.
  • There is little inherent enforcement of boundaries, thus coupling tends to increase over time.
  • Performance problems can be found by profiling a single application.
  • The cause of availability problems is typically found in one place.
  • Building features that rely on multiple entities is straightforward, though it may come at the cost of inappropriate coupling.
  • As the code grows large, the organization is at risk of entering the fear cycle.
  • Feature availability may be compromised by inappropriate coupling via common modes in the application. (E.g., thread pools, connection pools.)
  • Feature availability should be improved by redundancy of the whole application itself. This is reduced, however, if the application is vulnerable to the surrounding environment, as in the case of a Self-Denial Attack, memory leak, or race condition.

Suppose we move this to a microservice architecture with entity services. We might end up with something like this example from the Spring tutorial:

In this version, you should assume that each of the service boxes comprises multiple instances of that service.

Obviously there are more moving parts involved. That immediately means it’s harder to maintain availability.

The challenges of performance analysis and debugging are well documented, so I won’t belabor them.

But in this resulting context, where do features get created? A few of them are direct interactions of the “Online Shopping” service and the individual entity services. For example, creating an account is definitely just between Online Shopping and Accounts.

Most features, however, require more than one of the entities. They will use aggregates or intersections of entities.

For example, suppose we need to calculate the total price of a cartful of items. That involves the cart, the products (for their individual prices) and the account to find the applicable sales tax or VAT. I predict that it will be implemented in the Online Shopping service by making a bunch of calls to the entity services to get their data.
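
The shape of that code is worth seeing, because it’s the shape to watch out for. This is a hypothetical sketch, not anyone’s real API:

```python
# Invented method names; the point is how many services one request activates.
def price_cart(cart_id, cart_svc, product_svc, account_svc):
    cart = cart_svc.get(cart_id)                                          # activate Cart
    prices = [product_svc.get(item)["price"] for item in cart["items"]]   # activate Product
    subtotal = sum(prices)
    tax_rate = account_svc.get(cart["account_id"])["tax_rate"]            # activate Accounts
    return subtotal * (1 + tax_rate)
```

Each of those calls is a dependency that must be up, fast, and semantically stable for this one feature to work.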

We can depict this with an “activation set” (a term I made up) to show which services are activated during processing of a single request type. For this picture, we focus on just the services and elide the infrastructure.

So to price the cart, we have to activate four of the five services in our architecture.

That activation represents operational coupling, which affects availability, performance, and capacity.

It also represents semantic coupling. A change to any of the entity services has the potential to ripple through into the online shopping service. (In particularly bad cases, the online shopping service may find itself brokering between data formats: translating version 5 of the user data produced by Accounts into the version 3 format that Cart expects.)

A common corollary to entity services is the idea of “stateless business process services.” I think this meme dates back to last century with the original introduction of Java EE, with entity beans and session beans. It came back with SOA and again with microservices.

What happens to our picture if we introduce a process service to handle pricing the cart?

Not much improvement.

Bear in mind this is the activation set for just one request type. We have to consider all the different request types and overlay their activation sets. We’ll find the entity services are activated for the majority of requests. That makes them a problem for availability and performance. It also means they won’t be allowed to change as fast as we’d like. (Services with a high fan-in need to be more stable.)

So, let’s look at the resulting context of moving to microservices with entity services:

  • Performance analysis and debugging is more difficult. Tracing tools such as Zipkin are necessary.
  • Additional overhead of marshalling and parsing requests and replies consumes some of our precious latency budget.
  • Individual units of code are smaller.
  • Each team can deploy on its own cadence.
  • Semantic coupling requires cross-team negotiation.
  • Features mainly accrue in “nexuses” such as API, aggregator, or UI servers.
  • Entity services are invoked on nearly every request, so they will become heavily loaded.
  • Overall availability is coupled to many different services, even though we expect individual services to be deployed frequently. (A deployment looks exactly like an outage to callers!)

In summary, I’d say both criteria are met to label entity services as an antipattern.

Stay tuned. In a future post, we’ll look at what to do instead of entity services.


If you’re interested in learning more about breaking up monoliths, you might like my Monolith to Microservices workshop.

There is a session open to the public in March 2018.

Or, contact me to schedule a workshop at your company.

Keep ‘Em Separated


Software doesn’t have any natural boundaries. There are no rivers, mountains, or deserts to separate different pieces of software. If two services interact, then they have a sort of “attractive force” that makes them grow towards each other. The interface between them becomes more specific. Semantic coupling sneaks in. At a certain point, they might as well be one module running in one process.

If you’re building microservices, you need to make sure they don’t grow together into an impenetrable bramble. The whole microservice bet is that we can trade deployment and operational complexity for team-scale autonomy and development speed. The last thing you want is to take on the operational complexity of microservices and still move slowly due to semantic coupling among them.

Maybe you’ve recently broken up a monolith into microservices, but found that things aren’t as easy and rosy as the conference talks led you to believe.

Maybe you have a microservice architecture that is starting to slow down and get harder. Like cooling honey, it seems sweet at first but gets stickier later.

I’m going to write a short series of posts with techniques to keep ‘em separated. This will go into API design for microservices, information architecture, and feature design. It’ll be all about making smaller, more general pieces, that you can rearrange in interesting ways.


If you’re interested in learning more about breaking up monoliths, you might like my Monolith to Microservices workshop.

There is a session open to the public in March 2018.

Or, contact me to schedule a workshop at your company.

Root Cause Analysis as Storytelling


Humans are great storytellers and even better story-listeners. We love to hear stories so much that when there aren’t any available, we make them up on our own.

From an early age, children grasp the idea of narrative. Even if they don’t understand the forms of storytelling so much, you can hear a four-year-old weave a linked list of events from her day.

We look for stories behind everything. At a deep level, we want the world’s events to mean something. Effect follows cause, and causes have an actor to set them in motion.

Our sense of balance also demands that large effects should have large causes, with correspondingly large intent.

A drunk driver speeds through a red light, oblivious. A crossing car stops short. The shaken driver creeps home with a pounding pulse, full of queasy adrenaline. She unbuckles her daughter and hugs her tightly.

A drunk driver speeds through a red light, oblivious. A crossing car is in the intersection. The drunk smashes into it, right at the driver’s side door. The woman’s bloody face is hidden behind airbags. Her daughter sits in her new wheelchair for her mother’s funeral.

The difference between those stories is a matter of a split second in timing. There is absolutely no change in the motives or desires of anyone in the two vignettes. The first drunk, if caught, would get a jail term and large fine. He would probably lose his driver’s license.

But most people would judge the motives of the second driver far more harshly. They would condemn him to a lengthy prison term and a lifetime ban on driving.

When we see a large effect, we expect a large cause, with a large intent.

The idea that some vast, horrible events strike randomly fills us with dread. People can’t bear the thought that a single unbalanced nobody can change the course of a nation’s history with one rifle shot, so they spend more than 50 years searching for “the truth.”

“Root Cause Analysis” expresses a desire for narrative. With the power of hindsight, we want to find out what went wrong, who did it, and how we can make sure it never happens again. But because we have the posterior event, we judge the prior probabilities differently. Any anomaly or blip suddenly becomes suspect.

People don’t look as hard at anomalies when nothing bad happens.

They don’t notice all the times the same weird log message pops up before … everything continues as normal.

When we look for “root cause,” what we are really trying to discern is not “what made this happen.” We are looking for something that would have stopped it from happening. We are building a counterfactual narrative—an alternate history—where that drunk driver dropped his keys in the parking lot and was thereby delayed a few crucial seconds.

Peel back the surface on a root cause analysis and you almost always see a formula that goes like this: “factor X” could have prevented this. “Factor X” was not present, therefore the bad event happened.

The catch is that there is usually an endless variety of possible counterfactuals. Often, more than one counterfactual narrative would have prevented the bad outcome equally well. Which one was the root cause? Non-existence of “factor X” or non-existence of “factor Y?”

Next time you have a bad incident, why not try to focus your efforts in a different way? Work on learning from the times that things don’t go wrong. And be explicit about looking for many possible interventions that would have prevented the problem. Then select ones with broad ability to prevent or impede many different problems.

Release It Second Edition in Beta


I’m excited to announce the beta of Release It! Second edition.

It’s been ten years since the first edition was released. Many of the lessons in that book hold strong. Some are even more relevant today than in 2007. But a few things have changed. For one thing, capacity management is much less of an issue today. The rise of the cloud means that developers are more exposed to networks than ever. And in this era of microservices, we’ve got more and better ops tools in the open source world than ever.

All of that motivated me to update the book for this decade. I’ve removed the section on capacity and capacity optimization and replaced it with a section that builds up a picture of our systems by doing a “slow zoom” out from the hardware, to single processes, to clusters, to the controlling infrastructure, and to security issues.

The first beta does not yet include two additional new parts on deployment and solving systemic problems. Those will be coming in the next few weeks.

In the meanwhile, I look forward to hearing your comments and feedback! Join the conversation in the book’s forum.

Spectrum of Change


I’ve come to believe that every system implicitly defines a spectrum of changes, ordered by their likelihood. As designers and developers, we make decisions about what to embody as architecture, code, and data based on known requirements and our experience and intuition.

We pick some kinds of changes and say they are so likely that we should represent the current choice as data in the system. For instance, who are the users? You can imagine a system where the user base is so fixed that there’s no data representing the user or users. Consider a single-user application like a word processor.

Another system might implicitly indicate there is just one community of users. So there’s no data that represents an organization of users… it’s just implicit. On the other hand, if you’re building a SaaS system, you expect the communities of users to come and go. (Hopefully, more come than go!) So you make whole communities into data because you expect that population to change very rapidly.

If you are building a SaaS system for a small, fixed market you might decide that the population won’t change very often. In that case, you might represent a population of users in the architecture via instancing.

So data is at the high-energy end of the spectrum, where we expect constant change. Next would be decisions that are contemplated in code but only made concrete in configuration. These aren’t quite as easy to change as data. Furthermore, we expect that only one answer to any given configuration choice is operative at a time. That’s in contrast to data where there can be multiple choices active simultaneously.

Below configuration are decisions represented explicitly in code. Constructs like policy objects, strategy patterns, and plugins all indicate our belief that the answer to a particular decision will change rapidly. We know it is likely to change, so we localize the current answer to a single class or function. This is the origin of the “Single Responsibility Principle.”
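
As a minimal sketch of what “localize the current answer” means, here’s an invented shipping-cost example:

```python
# Invented example: the answer to "how do we price shipping?" lives in one
# small policy object, so changing the answer means touching one class.
class FlatRateShipping:
    def cost(self, order):
        return 5.00

class WeightBasedShipping:
    def cost(self, order):
        return 1.25 * order["weight_kg"]

def checkout_total(order, shipping_policy):
    return order["subtotal"] + shipping_policy.cost(order)

print(checkout_total({"subtotal": 40.0, "weight_kg": 2.0}, WeightBasedShipping()))  # 42.5
```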

Farther down the spectrum, we have cross-cutting behavior in a single system. Logging, authentication, and persistence are the typical examples here. Would it be meaningful to push these up to a higher level, like configuration? What about data?

Then we have those things which are so implicit to the service or application that they aren’t even represented. Everybody has a story about when they had to make one of these explicit for the first time. It may be adding a native app to a Web architecture, or going from single-currency, single-language to multinational.

Next we run into things that we expect to change very rarely. These are cross-cutting behavior across multiple systems. Authentication services and schemas often land at this level.

So the spectrum goes like this, from high-energy, rapidly changing blue to cool, sedate red:

  • Data
  • Configuration
  • Encapsulated code
  • Cross-cutting code
  • Implicit in application
  • Cross-cutting architecture

Implications

The farther toward the “red” end of the spectrum we relegate a concern, the more tectonic it will be to change it.

No particular decision “naturally” falls at one level or another. We just have experience and intuition about which kinds of changes happen with greatest frequency. That intuition isn’t always right.

Efforts to make everything into data in the system lead to rules engines and logic programming. That doesn’t usually deliver the end-user control we expect. It turns out you still need programmers to think through changes to rules in a rules engine. Instead of democratizing the changes, you’ve made them more esoteric.

It’s also not feasible to hoist everything up to be data. The more decisions you energy-boost to that level, the more it costs. And at some point you generalize enough that all you’ve done is create a new programming language. If everything about your application is data, you’ve written an interpreter and recursed one level higher. Now you still have to decide how to encode everything in that new language.

Queuing for QA


Queues are the enemy of high-velocity flow. When we see them in our software, we know they will be a performance limiter. We should look at them in our processes the same way.

I’ve seen meeting rooms full of development managers with a chart of the year, trying to allocate which week each dev project will enter the QA environment. Any project that gets done too early just has to wait its turn in QA. Entry to QA becomes a queue of its own. And as with any queue, the more variability in the processing time, the more likely the queue is to back up and get delayed.
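
A toy simulation shows why variability matters (all the numbers are invented; only the comparison is interesting). Two workloads with the same average duration compete for one long-lived QA environment:

```python
import random

def average_wait(service_times, arrival_rate=0.8, seed=42):
    """Average wait for a single FIFO QA environment, given each project's duration."""
    rng = random.Random(seed)
    env_free_at, total_wait, now = 0.0, 0.0, 0.0
    for duration in service_times:
        now += rng.expovariate(arrival_rate)     # next project is ready for QA
        start = max(now, env_free_at)            # wait if the environment is busy
        total_wait += start - now
        env_free_at = start + duration
    return total_wait / len(service_times)

rng = random.Random(1)
steady   = [1.0] * 2000                                      # every project takes 1 week
variable = [rng.choice([0.25, 1.75]) for _ in range(2000)]   # same mean, high variance
print("steady workload:  ", round(average_wait(steady), 2), "weeks average wait")
print("variable workload:", round(average_wait(variable), 2), "weeks average wait")
```

In runs like this, the high-variance workload typically shows a noticeably longer average wait, even though both workloads have the same mean duration.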

When faced with a situation like that, the team may look for the “right number” of QA environments to build. There is no right number. Any fixed number of environments just changes the queuing equation but keeps the queue. A much better answer is to change the rules of the game. Instead of having long-lived (in other words, broken and irreproducible) QA environments, focus on creating a machine for stamping out QA environments. Each project should be able to get its own disposable, destructible QA system, use it for the duration of a release, and discard it.

Availability and Stability


Last post covered technical definitions of fault, error, and failure. In this post we will apply these definitions in a system.

Our context is a long-running service or server. It handles requests from many different consumers. Consumers may be human users, as in the case of a web site, or they may be other programs.

Engineering literature has many definitions of “availability.” For our purpose we will use observed availability. That is the probability that the system survives between the time a request is submitted and the time it is retired. Mathematically, this can be expressed as the probability that the system does not fail between time T_0 and T_1, where the difference T_1 - T_0 is the request latency.

(There is a subtle issue with this definition of observed availability, but we can skirt it for the time being. It intrinsically assumes there is some other channel by which we can detect failures in the system. In a pure message-passing network such as TCP/IP, there is no way to distinguish between “failed” and “really, really slow.” From the consumer’s perspective, “too slow” is failed.)
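
In practice, observed availability reduces to counting: of the requests submitted in some window, how many were retired successfully, within a latency the consumer will tolerate? Here’s a sketch with an invented log format:

```python
# Hypothetical request log entries: succeeded flag plus latency in milliseconds.
# Requests that exceed the consumer's timeout count as failures, per the note above.
def observed_availability(request_log, timeout_ms=5000):
    retired = [r for r in request_log
               if r["succeeded"] and r["latency_ms"] <= timeout_ms]
    return len(retired) / len(request_log)

log = [{"succeeded": True,  "latency_ms": 120},
       {"succeeded": True,  "latency_ms": 7300},   # too slow: failed, from the consumer's view
       {"succeeded": False, "latency_ms": 45}]
print(observed_availability(log))   # 1 of 3 requests retired successfully
```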

The previous post established that faults will occur. To maintain availability, we must prevent faults from turning into failures. At the component level, we may apply fault-tolerance or fault-intolerance. Either way, we must assume that components will lose availability.

Stability, then, is the architectural characteristic that allows a system to maintain availability in the face of faults, errors, and partial failures.

At the system level, we can create stability by applying the principles of recovery-oriented computing.

  1. Severability. When a component is malfunctioning, we must be able to cut it off from the rest of the system. This must be done dynamically at runtime. That is, it must not require changes to configuration or rebooting of the system as a whole.
  2. Tolerance. Components must be able to absorb “shocks” without transmitting or amplifying them. When a component depends on another component which is failing or severed, it must not exhibit higher latency or generate errors.
  3. Recoverability. Failing components must be restarted without restarting the entire system.
  4. Resilience. A component may have higher latency or error rate when under stress from partial failures or internal faults. However, when the external or internal condition is resolved, the component must return to its previous latency and error rate. That is, it must display no lasting damage from the period of high stress.

Of these characteristics, recoverability may be the easiest to achieve in today’s architectures. Instance-level restarts of processes, virtual machines, or containers all achieve recoverability of damaged components.

The remaining characteristics can be embedded in the code of a component via Circuit Breakers, Bulkheads, and Timeouts.
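
As an example of embedding these characteristics in code, here’s a minimal circuit breaker sketch. It is deliberately simplified and not drawn from any particular library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after max_failures consecutive failures the
    circuit opens, and calls fail fast until reset_after seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: let one call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # success closes the circuit again
        return result
```

Wrap calls to a downstream dependency in one of these (plus a timeout on the call itself) and a failing or severed component gets cut off quickly instead of tying up the caller’s threads.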