Wide Awake Developers

Availability and Stability

The last post covered technical definitions of fault, error, and failure. In this post we will apply these definitions to a system.

Our context is a long-running service or server. It handles requests from many different consumers. Consumers may be human users, as in the case of a web site, or they may be other programs.

Engineering literature has many definitions of “availability.” For our purpose we will use observed availability. That is the probability that the system survives between the time a request is submitted and the time it is retired. Mathematically, this can be expressed as the probability that the system does not fail between time T_0 and T_1, where the difference T_1 - T_0 is the request latency.
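Written out as a formula, the definition above might look like the following. The symbol A and the constant failure rate λ are notation assumed here for illustration; the text itself only defines T_0 and T_1, and the exponential form holds only if failures arrive at a constant rate and the system is up at T_0.

```latex
% Observed availability for one request submitted at T_0 and retired at T_1
A(T_0, T_1) = \Pr\left[\text{no failure in } [T_0, T_1]\right]

% Assumption (not from the post): constant failure rate \lambda
A(T_0, T_1) = e^{-\lambda (T_1 - T_0)}
```

Under that assumption, observed availability falls as request latency grows, which fits the consumer's view that “too slow” is failed.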

(There is a subtle issue with this definition of observed availability, but we can skirt it for the time being. It intrinsically assumes there is some other channel by which we can detect failures in the system. In a pure message-passing network such as TCP/IP, there is no way to distinguish between “failed” and “really, really slow.” From the consumer’s perspective, “too slow” is failed.)

The previous post established that faults will occur. To maintain availability, we must prevent faults from turning into failures. At the component level, we may apply fault-tolerance or fault-intolerance. Either way, we must assume that components will lose availability.

Stability, then, is the architectural characteristic that allows a system to maintain availability in the face of faults, errors, and partial failures.

At the system level, we can create stability by applying the principles of recovery-oriented computing.

  1. Severability. When a component is malfunctioning, we must be able to cut it off from the rest of the system. This must be done dynamically at runtime. That is, it must not require changes to configuration or rebooting of the system as a whole.
  2. Tolerance. Components must be able to absorb “shocks” without transmitting or amplifying them. When a component depends on another component which is failing or severed, it must not exhibit higher latency or generate errors.
  3. Recoverability. Failing components must be restarted without restarting the entire system.
  4. Resilience. A component may have higher latency or error rate when under stress from partial failures or internal faults. However, when the external or internal condition is resolved, the component must return to its previous latency and error rate. That is, it must display no lasting damage from the period of high stress.

Of these characteristics, recoverability may be the easiest to achieve in today’s architectures. Instance-level restarts of processes, virtual machines, or containers all achieve recoverability of damaged components.

The remaining characteristics can be embedded in the code of a component via Circuit Breakers, Bulkheads, and Timeouts.
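To make two of those patterns concrete, here is a minimal sketch of a timeout and a circuit breaker in plain Clojure. All of the names (call-with-timeout, make-breaker, call-through, reset-breaker!) are hypothetical, and the breaker is deliberately simplified: it only counts consecutive failures and needs a manual reset, where a production breaker would normally add a cool-down and a half-open state.

```clojure
(defn call-with-timeout
  "Runs thunk on another thread. Throws if no result arrives within
  timeout-ms, so that 'too slow' is treated the same as 'failed'."
  [timeout-ms thunk]
  (let [fut    (future (thunk))
        result (deref fut timeout-ms ::timed-out)]
    (when (= result ::timed-out)
      (future-cancel fut)
      (throw (ex-info "dependency timed out" {:timeout-ms timeout-ms})))
    result))

(defn make-breaker
  "Breaker state: a count of consecutive failures and the trip threshold."
  [max-failures]
  (atom {:failures 0 :max-failures max-failures}))

(defn open?
  "True once the failure count has reached the threshold."
  [breaker]
  (let [{:keys [failures max-failures]} @breaker]
    (>= failures max-failures)))

(defn call-through
  "Calls thunk unless the breaker is open. A success resets the count,
  a failure increments it. Returns fallback instead of propagating errors."
  [breaker fallback thunk]
  (if (open? breaker)
    fallback                          ; severed: do not touch the dependency
    (try
      (let [result (thunk)]
        (swap! breaker assoc :failures 0)
        result)
      (catch Exception _
        (swap! breaker update :failures inc)
        fallback))))

(defn reset-breaker!
  "Manually closes the breaker again (stand-in for a half-open probe)."
  [breaker]
  (swap! breaker assoc :failures 0))
```

The caller gets a fallback value quickly once the breaker is open, which is one way to sever a malfunctioning dependency at runtime without a configuration change or restart.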

Fault, Error, Failure

Our systems suffer many insults when they contact the real world. Flaky inputs, unreliable networks, and misbehaving users, to name just a few. As we design our components and systems to thrive in the only environment that matters, it pays to have mental schema and language to discuss the issues.

A fault is an incorrect internal state in your software. Faults are often introduced at component, module, or subsystem boundaries. There can be a mismatch between the contract a module is designed to implement and its actual behavior. A very simple example is accepting a negative integer or zero when a strictly positive integer was expected.
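As a small illustration of checking a contract at a module boundary, the sketch below rejects zero and negative values before they can become incorrect internal state. The function allocate-slots is hypothetical; the only assumption about the runtime is that assertions are enabled, which is the Clojure default.

```clojure
(defn allocate-slots
  "Hypothetical boundary function whose contract requires a strictly
  positive integer. The :pre condition makes the contract executable."
  [n]
  {:pre [(pos-int? n)]}
  (vec (repeat n :empty)))

(allocate-slots 2)   ; => [:empty :empty]
;; (allocate-slots 0) and (allocate-slots -1) throw an AssertionError
;; instead of letting a bad value into the module.
```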

A fault may also occur when a latent bug in the software is triggered by an external or internal condition. For example, attempting to allocate an object when memory is exhausted will return a null pointer. If the software proceeds with the null pointer it can cause problems later, perhaps in a far distant part of the code.

Such an incorrect state may be recoverable. A fault-tolerant module will attempt to restore a good internal state after detecting a fault. Exception handlers and error-checking code are efforts to provide fault-tolerance.

Another school of thought says that fault tolerance is unreliable. In this approach, once a fault has occurred, the entire memory state of the program must be regarded as corrupt. Instead of attempting to restore a good state by backtracking or patching up the internal state, fault-intolerant modules will exit to avoid producing errors. A system built from these fault-intolerant modules will include supervisor capabilities to restart exited modules.
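The fault-intolerant approach can be sketched in a few lines. This is only an in-process analogy with hypothetical names (supervise, make-flaky-worker): the worker gives up by throwing as soon as it detects a fault, and a supervisor starts it again from scratch a bounded number of times. Real supervisors usually restart whole processes rather than functions.

```clojure
(defn supervise
  "Runs worker-fn. If it exits by throwing, runs it again from scratch,
  up to max-restarts times. Returns a map describing the outcome."
  [max-restarts worker-fn]
  (loop [restarts 0]
    (let [outcome (try
                    {:status :done :value (worker-fn)}
                    (catch Exception e
                      {:status :crashed :cause e}))]
      (cond
        (= :done (:status outcome)) (assoc outcome :restarts restarts)
        (< restarts max-restarts)   (recur (inc restarts))
        :else                       (assoc outcome :restarts restarts)))))

(defn make-flaky-worker
  "Returns a worker that detects a fault and exits on its first
  `crashes` runs, then completes normally."
  [crashes]
  (let [attempts (atom 0)]
    (fn []
      (if (<= (swap! attempts inc) crashes)
        (throw (ex-info "fault detected, exiting" {}))
        :finished))))

(select-keys (supervise 5 (make-flaky-worker 2)) [:status :value :restarts])
;; => {:status :done, :value :finished, :restarts 2}
```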

If a fault propagates in the system, it can produce visibly incorrect behavior. This is an error. Faults may occur without producing errors, as in the case of fault-tolerant modules that correct their own state before an error is observed. An error may be limited to an incorrect output displayed to a user. It can include any incorrect behavior, including data loss or corruption, network flooding, or launching attack drones.

At the component, module, or subsystem level, our mission is to prevent faults from causing errors.

A failure results when a system terminates without completing its job. For a long-running service or server, it stops responding to requests in a finite time. For a program that should run to completion and exit, it exits abnormally before completing. A failure may be preferable to an error, depending on the harm caused by the error.

Next time, I will address system stability in the face of faults, errors, and failures.

Power Systems

This is an excerpt from something I’m working on this Labor Day holiday:

Large-scale power outages act a lot like software failures. An outage starts with a small event, like a power line grounding out on a tree. Ordinarily that would be no big deal, but under high-stress conditions it can turn into a cascading failure that affects millions of people. We can also learn from how power gets restored after an outage. Operators must perform a tricky balancing act between generation, transmission, and demand.

There used to be a common situation where power would be restored and then cut off again in a matter of seconds. It was especially common in the American South, where millions of air conditioners and refrigerators would all start at the same time. When a motor starts up, it draws a lot of current. You can see this in the way that lights dim when you start a circular saw. As the motor starts to spin, though, it creates “back EMF”–a kind of backpressure on the electrical current. (That’s when the lights return to full brightness.) If you add up the effects of millions of electric motors starting all at once, you see a huge upward blip in current draw, followed by a quick drop due to back current. Power transmission systems would see the spike and drop and propagate that to the generation systems. First they would increase their draw then drop it dramatically. That would make the generation systems think they should shut off some of the turbines. Right about the time they started reducing supply, the initial surge of back EMF would decline and current load would come back up to baseline levels. The increased current load hit just when supply was declining, causing excess demand to trip circuit breakers. Lights out, again.

Smarter appliances and more modern control systems have mitigated that particular failure mode now, but there are still useful lessons for us.

Remember DAT?

Do you remember Digital Audio Tape? DAT was supposed to have all the advantages of digital audio—high fidelity and perfect reproduction—plus the “advantages” of tape. (Presumably those advantages did not include melting on the dashboard of your Chevy Chevelle or spontaneously turning into The Best of Queen after a fortnight.)

In hindsight, we can see that DAT was a twilight product. As the sun set on the cassette era, DAT was an attempt to bridge the discontinuous technology change to digital music production. It was a twilight product because it didn’t sufficiently reimagine the existing technology to offer enough of a new advantage nor did it eliminate enough of the old disadvantages.

We often see such twilight products right before a major, discontinuous shift.

I think we’re in such a period when it comes to software development and deployment for cloud native systems. The tools we have attempt to take the traditional model into the new environment. But they don’t yet reimagine the world of software development enough. Ten years from now, we will see that they offered some advantages but also carried forward baggage from the client-server era. Unix-like full operating systems, coding one process at a time, treating network as a secondary concern, ignoring memory hierarchy in the languages.

Whatever the “operating system for the cloud” is, we haven’t seen it yet.

QA Instability Implies Production Instability

Many companies that have trouble delivering software on time exhibit a common pathology. Developers working on the next release are frequently interrupted for production support issues with the current release. These interrupts never appear in project schedules but can take up half of the developers’ hours. When you include the cost of task-switching, this means less than half of their available time is spent on the new feature work.

Invariably, when I see a lot of developer effort in production support I also find an unreliable QA environment. It is both unreliable in that it is frequently not available for testing, and unreliable in the sense that the system’s behavior in QA is not a good predictor of its behavior in production.

QA environments are often configured differently than production. Not just in the usual sense of consuming a QA version of external services, but also in more basic ways. Server topology may be different. Memory settings, capacity ratios, and the presence of network components can all vary. QA often has a much simpler traffic routing scheme than production, particularly when a CDN is involved.

The other major source of QA unavailability has to do with data refreshes. QA environments either run with a minuscule, curated test data set, or they use some form of backward migration from production data. Each backward migration can be very disruptive, leading to one or more days where QA is not available.

Disruption arises when testers have to do manual data setup in order to test new features. These setups get overwritten with the next refresh. Sometimes, production data must be cleansed or scrubbed of PII before use in QA. This cleansing process often introduces its own data quality problems. The backward migration process must also be kept up to date so it can propagate data back into the schema for the next release. This requires copying data and schema into QA, then forward-migrating the schema according to the new release.

When many teams contend to get into a QA environment, that contention can result in lost time as well. Time is lost in delays when one team cannot move their code into QA during another team’s test. It is also lost when one team overwrites test data that a different team had set up. And it can be lost when one team’s code has bugs that prevent other teams from proceeding with their tests. Suppose one team works on login and registration, while another team works on friend requests. Clearly, the friend requests team cannot do their testing when login is broken. This last issue also applies across service boundaries: a service consumer may not be able to test because the QA version of their service provider is broken.

Finally, problems in QA simply take a lower priority than problems in production. Thus, the operations team may be fully consumed with current production issues, leaving the QA environment broken for extended periods. In a vicious feedback loop, this makes it likely that the next release will also create production problems.

My recommendations are these:

  • Give priority to well-functioning test environments.
  • Virtualize your test environments, so you can avoid inter-team dependencies on a QA environment.
  • Automate the backward data propagation, and make it part of spinning up a QA environment. When you must scrub PII, automate that process so that every QA environment can draw from a snapshot of cleansed data without impinging on the production DBAs.
  • If your QA stays unavailable because there are too many production issues, recognize that this is a self-sustaining pattern. You can temporarily redirect a “SWAT” team to fix QA and it will pay dividends for all future releases.
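As one hedged illustration of the PII recommendation above, a scrubbing step can be an ordinary function that runs every time a snapshot is produced. The column names in pii-keys and the record shape (a map with an :id) are assumptions made for this sketch, not something taken from a real schema.

```clojure
(def pii-keys
  "Hypothetical set of columns treated as PII."
  #{:email :full-name :phone})

(defn scrub-record
  "Replaces each PII value with a deterministic placeholder built from
  the column name and the record's :id, so scrubbed values stay unique
  per record and repeatable across refreshes."
  [record]
  (reduce (fn [r k]
            (if (contains? r k)
              (assoc r k (str (name k) "-" (:id r)))
              r))
          record
          pii-keys))

(defn scrub-snapshot
  "Scrubs every record in a snapshot."
  [records]
  (mapv scrub-record records))

(scrub-record {:id 42 :email "someone@example.com" :plan "pro"})
;; => {:id 42, :email "email-42", :plan "pro"}
```

Because the placeholders are deterministic, manual test setups that refer to a scrubbed value have a better chance of surviving the next refresh.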

Wittgenstein and Design

What does a philosopher born in the 19th Century have to say about software design? More than you might think, particularly his ideas about family resemblance.

Wittgenstein used the subject of “games” to illustrate an idea. We’ll start with a counter-example. Suppose we operate with the then-prevailing notion that words are defined like sets in axiomatic set theory. Then there is a decision procedure that will let us decide whether something is a member of the set “games” or not. Such a decision procedure must include everything that is a game and exclude everything which is not a game. Can we define such a decision procedure?

Does a game require competition? Some do. Not all.

Does a game have a score? Or an objective? Not all.

Does a game involve more than one person? Not necessarily.

Is a game a frivolous expenditure of energy? Some are. Others have deep moral and philosophical lessons.

How is a game of football like a game of solitaire?

It’s easy to see that mancala and go have something in common… little rounded stones. But what do they have in common with Minecraft? Stones?

Wittgenstein said that this is not an issue for set theory. Instead, he talked about family resemblances. As described in Wikipedia, “things which could be thought to be connected by one essential common feature may in fact be connected by a series of overlapping similarities, where no one feature is common to all.”

For games, this means there is no single feature that makes something a game. Instead, there are a set of overlapping similarities that make things more gamelike or less gamelike. We can even think about things that share more of the features as being more like each other. So go and mancala share features like: two players, stones on a board, alternating turns, one winner, ancient, cerebral, positional. This makes them pretty similar. A professionally played team sport with a ball on a field shares few qualities with go. (Although “people excited about the outcome” and “positional” might be common.) So the feature-distance between go and football is large, yet they are both still games.
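One way to make “feature-distance” concrete is to treat each game as a set of features and compare the sets. The sketch below uses Jaccard distance, which is my choice for illustration rather than anything Wittgenstein proposed, and the feature sets are my reading of the paragraph above (including the guess that go also has people excited about the outcome).

```clojure
(require '[clojure.set :as set])

(def features
  "Feature sets paraphrased from the text; the exact keywords are assumptions."
  {:go       #{:two-players :stones-on-board :alternating-turns :one-winner
               :ancient :cerebral :positional :excited-about-outcome}
   :mancala  #{:two-players :stones-on-board :alternating-turns :one-winner
               :ancient :cerebral :positional}
   :football #{:team-sport :ball-on-field :professionally-played
               :positional :excited-about-outcome}})

(defn feature-distance
  "Jaccard distance: 1 minus (shared features / all features).
  0 means identical feature sets, 1 means nothing in common."
  [a b]
  (let [all (set/union a b)]
    (if (empty? all)
      0
      (- 1 (/ (count (set/intersection a b)) (count all))))))

(feature-distance (:go features) (:mancala features))   ; => 1/8
(feature-distance (:go features) (:football features))  ; => 9/11
```

No single feature appears in all three sets here, yet every pair still has a measurable distance, which is the family-resemblance idea in miniature.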

I think this relates to the tasks of software design and architecture. We have a strong tendency to go looking for nouns in our designs. Once we find a noun in a domain, we want to make a software artifact that captures all members of the set induced by that noun. But that only works if we stick with axiomatic set theory. Set theory works well for well-defined technical concepts and much less well for things in the human sphere.

One simple example: the humble “name” field. Go read Falsehoods programmers believe about names. How do you feel about that “first name”, “last name” database structure now? After reading that list, how much can you confidently say about instances of a “Name” class? Or a “Name” service?

We have all these debates about “noun-first” or “verb-first”. Back in The Perils of Semantic Coupling, I argued for a behavior-oriented approach rather than a noun-oriented approach. Stop asking “what is this thing?” and ask instead “what can you do with it?” That leads us toward segregated interfaces.

Now I’d augment that to emphasize those feature descriptions rather than noun-like descriptions. Instead of noun-first or verb-first I’m going to try “adjective-first”.

In Love With Your Warts

If you help other people solve problems, you may have run into this phenomenon: a person gleefully tells you how messed up their environment is. Processes make no sense, roadblocks are everywhere, and all previous attempts to make things better have failed. As you explore the details, you talk about directions to try. But every suggestion is met with an explanation of why it won’t work.

I say that these folks are “in love with their own warts.” That is, they know there’s a problem, but they’ve somehow been acclimated to it to such a degree that they can’t imagine a different world. They will consistently point to outside agents as the author of their woes, without realizing how much resistance they generate themselves.

Over time, by the way, there’s a reinforcing process. People who think and talk this way will cluster and drive out the less cynical.

These people can be intensely frustrating to work with, until you understand them. Understanding allows empathy, which is the only way to get past that self-generated resistance.

The first thing to understand is that any conversation about their problems isn’t really about their problems. An opening statement like, “We tried that but it didn’t work,” isn’t really asking for a solution. Instead, it’s an invitation to play a game. That game is called, “Stump the expert.” The player wins when you concede that nothing can ever improve. You “win” by suggesting something that the player cannot find an objection to. It’s not a real victory though, for reasons that will be clear in a moment.

Why does the player want to win this game instead of improving their world? For one thing, any solution you find is an implicit critique of the person who has been there. Suppose the solution is to shift a responsibility from one team to another. That requires management support in both teams. If that solution works, then it means the game-player could have produced the same improvement ages ago, but didn’t have enough courage to make it happen. Other changes might imply the game-player lacked sufficient authority, vision, credibility, or, rarely, technical acumen.

In every case, the game-player feels that your solution highlights a deficiency of theirs.

This is why “winning” the discussion isn’t really a win. You may get a grudging concession about the avenue to explore, but you’re still generating more resistance from that game-player.

My usual approach is to decline the invitation to the game. I don’t try to find point-by-point answers to things that have failed in the past. I usually draw analogies to other organizations that have faced the same challenges and make parallels to their solutions. Failing that, I accept the objections (almost always phrased as roadblocks thrown up by others) and just tell them, “Let me handle that.” (Most of the time, I find that people on the opposite side of a boundary express roadblocks from the other side that all eventually cancel each other out. That is, the roadblock turns out to be illusory.)

I’d like to hear from you, dear Reader. Assume that you cannot simply fire or transfer the game-player. They have value beyond this particular habitual reflex.

How would you handle a situation like this? What have you tried, and what works?

Some Useful Techniques From Bygone Eras

I find the old object-oriented design technique of CRC Cards to be useful when defining service designs. CRC is short for “Class, Responsibilities, Collaborators.” It’s a way to define what behavior belongs inside a service and what behavior it should delegate to other services.

Simulating a system via CRC is a good exercise for a group. Each person takes a CRC card and plays the role of that service. A request starts from outside the services and enters with one person. They can do only those things written down as their “responsibilities.” For anything else, they must send a message to someone else.

Personifying and role-playing really helps identify gaps in service design. You’ll find gaps where data or entire services are needed but don’t exist.

Tell, Don’t Ask

The more services you have, the more operational complexity you take on. Some patterns of service design seem to encourage high coupling and a high fan-in to a small number of critical services. (Usually, these are “entity” services… i.e., CRUDdy REST over a database table.)

Instead, I find it better to tell services what you want them to do. Don’t ask them for information, make a decision, then change some state.

Organizing around Tell, Don’t Ask leads you to design services around behavior instead of data. You’ll probably find you denormalize your data to make T,DA work. That’s OK. The runtime benefit of cleaner coupling will be worth it.
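Here is a small in-process analogy for the difference, with an atom standing in for a service's private state. The inventory example and the names (make-inventory, reserve-by-asking!, reserve!) are hypothetical.

```clojure
(defn make-inventory
  "An atom holding a map of sku -> quantity on hand."
  [stock]
  (atom stock))

;; Ask style: the caller reads state, decides, and writes it back.
;; The decision lives outside the service, and another caller can
;; change the stock between the read and the write.
(defn reserve-by-asking!
  [inventory sku qty]
  (let [on-hand (get @inventory sku 0)]          ; ask
    (if (>= on-hand qty)                         ; decide
      (do (swap! inventory update sku - qty)     ; change state
          :reserved)
      :insufficient)))

;; Tell style: the caller states what it wants. The check and the
;; update happen together, inside the owner of the data.
(defn reserve!
  [inventory sku qty]
  {:pre [(pos-int? qty)]}
  (let [[before after] (swap-vals! inventory
                                   (fn [stock]
                                     (if (>= (get stock sku 0) qty)
                                       (update stock sku - qty)
                                       stock)))]
    (if (= before after) :insufficient :reserved)))

(def inventory (make-inventory {:widget 3}))
(reserve! inventory :widget 2)   ; => :reserved
(reserve! inventory :widget 2)   ; => :insufficient
```

In the tell version the caller never needs to know how stock is stored, which is the kind of looser coupling the paragraph above is after.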

Data Flow Diagrams

If you ask someone who isn’t trained in UML to draw a system’s architecture, they will often draw something close to a Data Flow Diagram. This diagram shows data repositories and the transformation processes that populate them. DFDs are a very helpful tool because they force you to ask a few really key questions:

  1. Where did that information come from?
  2. How did it get there?
  3. Who updates it?
  4. Who uses the data we produce?

In particular, answering that last question forces you to think about whether you’re producing the right data for the downstream consumer.

Generalized Minimalism

My daily language is Clojure. One of the joys of working in Clojure is its great core library. The core library has a wealth of functions that apply broadly across data structures. A typical function looks like this:

(defn nthnext
  "Returns the nth next of coll, (seq coll) when n is 0."
  {:added "1.0"
   :static true}
  [coll n]
    (loop [n n xs (seq coll)]
      (if (and xs (pos? n))
        (recur (dec n) (next xs))
        xs)))

I want to call your attention to two specific forms. The “seq” function works on any “Seqable” collection type. (N.B.: It has special cases for other types, including some to make Java interop more pleasant. But the core behavior is about Seqable.) The “next” function is similar: it works on anything that already is a Seq or anything that can be made into a Seq.

This provides a nice degree of abstraction and through that, generality.

Pretty much all of the core data types implement either ISeq or Seqable. That means I can call “seq”, “next”, and “nthnext” on any of them. Other data types can be brought into the fold by extending one of those interfaces to them. We extend the data to meet the core functions, instead of overloading functions for data types.
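A quick sketch of what that generality looks like at the REPL. The Countdown type is a hypothetical example of bringing a new data type into the fold by implementing Seqable; the string and java.util.ArrayList calls rely on the special cases in “seq” mentioned earlier.

```clojure
;; The same core function works across built-in collection types.
(nthnext [1 2 3 4] 2)                          ; => (3 4)
(nthnext '(1 2 3 4) 2)                         ; => (3 4)
(nthnext "abcd" 2)                             ; => (\c \d)
(nthnext (array-map :a 1 :b 2 :c 3) 1)         ; => ([:b 2] [:c 3])
(nthnext (java.util.ArrayList. [1 2 3 4]) 3)   ; => (4)

;; A hypothetical new type joins in by implementing Seqable.
;; Inside the method body, `seq` is still the core function.
(deftype Countdown [n]
  clojure.lang.Seqable
  (seq [_] (seq (range n 0 -1))))

(seq (Countdown. 3))         ; => (3 2 1)
(nthnext (Countdown. 5) 2)   ; => (3 2 1)
```

Nothing about nthnext had to change for Countdown to work with it.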

YAGNI Isn’t About Being Specific

Under this approach, writing a general function is both simpler and easier than writing a specific one.

For example, suppose I need to do that classic example of trivial functionality: summing a list of integers. The most natural way for me to write that is like this:

(reduce + 0 xs)

That is both simple and general. But it doesn’t meet the spec I stated! It sums any numeric type, not just integers. If I decide that I really must restrict it to integers, I have to add code.

(assert (every? integer? xs))
(reduce + 0 xs)

This is a pattern I find pretty often when working in Clojure. When I generalize, I do it by removing special cases. This goes hand-in-hand with decomposing behavior into smaller and smaller units. As each unit gets smaller, I find it can be more general.

Here’s a less trivial example. Today, I’m working on a library we call Vase. (See Paul deGrandis’ talk on data-driven systems for more about Vase.) In particular, I’m updating it to work with a new routing syntax in Pedestal. With the new routing syntax, we can build routes from ordinary Clojure data… no more need for oddly-placed syntax-quoting.

One of the core concepts in Pedestal is the “interceptor”. They fulfill the same role as middleware in Ring. (One difference: interceptors are data structures that contain functions. Interceptors compose by making a vector of data, whereas Ring middleware composes by creating function closures. I find it easier to debug a stack of data than a stack of opaque closures.) Any particular route in Pedestal will have a list of interceptors that apply to that route.

When a service that uses Pedestal supplies interceptors, it composes a list of them. Suppose I want to make a convenience function that helps application developers build up that list. What would I need to do?

You probably already figured out that any such “convenience” functions I could create would basically duplicate core functions, but with added restrictions. Instead of “cons”, “conj”, “take”, and “drop”, I’d have to create “icons”, “iconj”, “itake”, and “idrop”. What a waste.
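To show why no special functions are needed, here is a sketch that models interceptors as plain maps in a vector. It does not use or reproduce the Pedestal API: the :name and :enter keys, the three interceptors, and the run-enter helper are all assumptions made for illustration.

```clojure
(defn tracing
  "Builds a toy interceptor: a map whose :enter function records its
  own name in the context's :trace vector."
  [nm]
  {:name  nm
   :enter (fn [ctx] (update ctx :trace conj nm))})

(def log-request  (tracing :log-request))
(def authenticate (tracing :authenticate))
(def parse-body   (tracing :parse-body))

;; Because the interceptor list is just a vector of data, the ordinary
;; core functions are the whole "convenience API".
(def common-interceptors [log-request authenticate])
(def route-interceptors  (conj common-interceptors parse-body))

(mapv :name route-interceptors)
;; => [:log-request :authenticate :parse-body]
(mapv :name (drop 1 route-interceptors))
;; => [:authenticate :parse-body]

(defn run-enter
  "Threads a context map through each interceptor's :enter function."
  [interceptors ctx]
  (reduce (fn [c {:keys [enter]}] (enter c)) ctx interceptors))

(run-enter route-interceptors {:trace []})
;; => {:trace [:log-request :authenticate :parse-body]}
```

Printing the vector also shows exactly what will run and in what order, which is the debugging advantage over a stack of closures.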

I have to ask myself, “Do I need some special behavior here?” And the answer is “YAGNI.”

YAGNI Is About Adding “Stuff”

YAGNI is commonly understood to mean “don’t generalize until you need to.” In some languages and libraries, I suppose that’s the right read. In my world, though, it is specializing that requires adding stuff. So I often call YAGNI if someone tries to make a thing less general than it could be.

Small functions that operate on abstractions instead of concrete types are both general and simple.

Redeeming the Original Sin

While reading Bryan Cantrill’s slides from Papers We Love NYC, I was struck by something. One of the very first slides says:

The traditional UNIX security model is simple but inexpressive.

The papers go on to describe a progression of techniques to isolate processes from the host environment to greater and greater degrees. It began with the ancient precursor ‘chroot’ and continued through Jails and Zones. Each builds upon the previous work to improve the degree of isolation.

We’ve seen a parallel series of efforts in the Linux realm with virtual machines and containers.


All of these are introduced to restore the degree of isolation and resource control that was originally present in mainframe operating systems. Furthermore, it was the model that Multics was meant to supply.

Unix started with a simplified security model, meant for single user machines. It was “dumbed down” enough to be easy to implement on the limited machines of the day.

Zones, VMs, containers… they’re all ways to redeem Unix from its original sin. Maybe what we should look at is a better operating system?