Shared State

When programming distributed systems, the hardest kind of data to manage is shared mutable state. It requires some kind of synchronization between writers to avoid missed updates. And, after changes, it requires some kind of mechanism to restore coherence between readers.

I previously wrote about that idea of a coherence penalty as it applies to humans. Following those lines, we might regard the system of development teams in an organization as its own distributed system. Teams pass messages. Both sides must understand the semantics. Packets get lost. Nodes disappear.

Within that framework, we can consider the same dimensions of state as we would with a distributed computing system:

  1. Local, immutable state. Easy.
  2. Local, mutable state. Relatively easy to manage.
  3. Shared, immutable state. Essentially write-once. This is a send-only (unicast or broadcast) item that doesn’t require further synchronization. (But see my note later about the time dimension.)
  4. Shared, mutable state. Both synchronization and coherence penalties apply here.

So what would constitute shared mutable state between teams?

Mutable State for Humans

Teams and the humans on the teams carry around an understanding of how the system works. That definitely constitutes mutable state.

I think that the metadata used by the software also constitutes shared state. It may be mutable or immutable. More about that shortly.

The software these teams create has shared mutable state of its own. That would be data that the software creates and reads. The data may be at rest in a database or it may be in motion, in the form of messages being passed around.

For the teams that create the software, however, the shared state is the protocol or schema definition. When those change, synchronization and coherence mechanisms are required. To some extent, this is just a consequence of Conway’s Law, but it’s taken me ten years to understand it.

Consequences of Shared State

For teams to move quickly and independently, we want to minimize the synchronization and coherence delays between teams, in exactly the same way we would do when making the software itself more scalable. So we want to reduce the amount of shared, mutable metadata across team boundaries.

Some corollaries.

Less Shared Metadata Means Fewer Penalties

Every API has a schema. That means every novel API becomes a new piece of shared state. If you expect to evolve that API, you are planning to mutate the state. Find out if there will be multiple writers!

Where possible, favor a new implementation of an existing API to reduce the amount of state involved. Consider using standard media types and representations, or creating local standards. The time spent creating the standard definitely counts as a synchronization delay, but at least it is explicitly recognized rather than buried in Jira tickets. Time invested in the standard may also produce a better definition that won’t need to change as much. Thus you accept a larger penalty early in exchange for avoiding repeated penalties later.

Integration via database table maximizes the need for concurrent mutation of the schema. This is why we’ve come to believe that we should avoid such integration. But again, there may be a place to use it effectively, so long as we recognize the effect on our team scalability.

Immutable Metadata

Shared, immutable data allows consuming software to scale better by avoiding propagation delays. Shared, immutable data also benefits from caching and can use a publication model.

The same goes for teams. API or schema definitions that never change only require publication. But do they allow for change? Yes, with some constraints.

If every change is strictly additive then we can consider the “publication date” of an updated protocol definition to be part of the protocol’s name. Thus, it isn’t a revision of the old protocol, but rather a new protocol entirely that derives from the old one without replacing or invalidating it.

For instance, the existence of HTTP/2 does not mean that HTTP/1.1 no longer exists.

Likewise, you may create a new API definition under a new name. As long as you continue supporting the old definition, then you have not mutated the old shared state, you’ve just created a new piece of immutable shared data.
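This publication model can be sketched in code. The following is a minimal illustration — the `publish` function and the registry are hypothetical, not any real library — of treating each published definition as immutable: an additive change produces a new entry derived from the old one, and the old entry is never rewritten.

```python
# Hypothetical registry of protocol definitions, keyed by name plus
# publication date. "Updating" a protocol means publishing a new
# immutable entry; existing entries can never be rewritten.

PROTOCOLS = {}

def publish(name: str, date: str, fields: frozenset) -> str:
    """Register an immutable protocol definition and return its key."""
    key = f"{name}/{date}"
    if key in PROTOCOLS:
        raise ValueError(f"{key} is immutable; publish under a new date")
    PROTOCOLS[key] = fields
    return key

orders_v1 = publish("orders", "2023-01-15", frozenset({"id", "total"}))

# A strictly additive change becomes a new protocol derived from the
# old one, without replacing or invalidating it.
orders_v2 = publish("orders", "2024-06-01",
                    PROTOCOLS[orders_v1] | {"currency"})
```

Because nothing is ever mutated in place, consumers of the 2023 definition are unaffected when the 2024 definition appears.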

The technology we use doesn’t make it easy to maintain multiple versions of a piece of shared state. For example, RDBMSs have no way to express the idea that the new schema is a copy of the old schema with an extra table. Not only is their data model all about “update in place” but their metadata is also “update in place.” Similarly, most of our frameworks for writing APIs are too explicit about routes in URLs. They bake in URL parts like “/api/v1” in every route, so it is hard to say that “v3” is “v2” with some changes, and “v2” is “v1” with some changes.
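One way around that framework limitation — sketched below with hypothetical route tables rather than any particular framework’s API — is to express each version as a delta over its predecessor, so that “v2” literally is “v1” with some changes:

```python
# A sketch of route tables where each API version is defined as the
# previous version plus a delta, instead of re-declaring every route
# under a new "/api/vN" prefix. Handler names are placeholders.

V1_ROUTES = {
    "/orders": "list_orders",
    "/orders/{id}": "get_order",
}

# v2 is v1 plus one new route; nothing in v1 is touched.
V2_ROUTES = {**V1_ROUTES, "/orders/{id}/history": "get_order_history"}

def routes_for(version: str) -> dict:
    """Resolve the full route table for a published version."""
    return {"v1": V1_ROUTES, "v2": V2_ROUTES}[version]
```

Every published version remains resolvable in full, but the definition of each new version records only what changed.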

Consider Structuring Teams Around Shared State

This is the dual of Conway’s Law. One way to decide team boundaries is around interfaces. That is, set up your teams such that there is a team boundary everywhere you want an interface.

That interface definition is shared state which may be mutable. So, consider also drawing team boundaries to maximize ownership over that state. Transform it from shared state to private state and the rate of mutation matters less. Of course, as soon as you draw those new lines you may have created new interfaces, so look carefully for team designs that reduce the global amount of shared mutable state.

If you follow that approach when considering all the different interfaces you must negotiate, then everything gets sucked into a single gigantic team. I think this is why there’s a kind of “gravitational” attraction that tries to pull interacting pieces of software into one mass.

Maybe it’s like the life of a star: an unsteady conflict between gravity and pressure. Gravity tries to collapse the star, which ignites fusion. Fusion creates pressure, which holds the star up against collapse.

In software, shared mutable state at interface boundaries plays the role of gravity. Taken to the limit, you get monoliths. Communication overhead and coherence penalties (scaling quadratically with team size) act like pressure. Taken to their limit, you get pico-services with solo owners. Rules like the two-pizza team are meant to impose a constraint via force majeure.
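The quadratic scaling of coherence penalties follows from counting pairwise communication paths in a fully connected team, n(n − 1)/2. A quick calculation shows why limits like the two-pizza team exist:

```python
def communication_paths(team_size: int) -> int:
    """Number of pairwise communication paths in a fully connected
    team: n * (n - 1) / 2, which grows quadratically with n."""
    return team_size * (team_size - 1) // 2

# An 8-person "two-pizza" team has 28 paths; a 20-person team has 190.
for n in (2, 8, 20):
    print(n, communication_paths(n))
```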

More to Explore

Some of what we know from fallible message-passing networks can extend to the system that creates the software systems. But we must also keep in mind that people have resilience mechanisms that computers lack. “Hey, did you get my email?” actually works with humans. Humans can also switch from discussing their shared state (say, a protocol definition) to negotiating a new meta-model for that shared state (the meta-meta-model for the data the software will pass). Software systems cannot renegotiate their protocols dynamically.

There may be more insight available from looking at team and organizational structure as a distributed system.