Wide Awake Developers

The Entity Service Antipattern


In my last post I talked about the need to keep things separated once they’ve been decoupled. Let’s look at one of the ways this breaks down: entity services.

If a pattern is a solution to a problem in a context, what is an antipattern? An antipattern is a commonly rediscovered solution to a problem in a context that inadvertently creates a resulting context we like less than the original one. In other words, it’s a pattern that makes things worse (according to some value system).

I contend that “entity services” are an antipattern.

To make that case, I need to establish that “entity services” are a commonly rediscovered solution to a problem and that the resulting context is worse than the starting context (a monolith).

Let’s start with the “commonly rediscovered” part. Entity services appear in Microsoft’s .NET microservices architecture ebook. Spring has a tutorial with them. (Spring may give us the absolute easiest way to create an entity service: the same class can be annotated with both JSON mapping and persistence mapping.) Red Hat has a Microservice Reference Architecture with a product-service and a sales-service. Some of the microservice-focused frameworks, such as JHipster, start with CRUD on data entities.
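For concreteness, here is a minimal sketch of how easily the pattern falls out of Spring, assuming Spring Data JPA, Jackson, and Spring Data REST are on the classpath. The class and field names are mine, not taken from the tutorial.

```java
// A minimal sketch of the entity service pattern (illustrative names).
// One class serves as both the persistence mapping and the wire format.
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import com.fasterxml.jackson.annotation.JsonProperty;
import org.springframework.data.repository.CrudRepository;
import org.springframework.data.rest.core.annotation.RepositoryRestResource;

@Entity
public class Product {
    @Id @GeneratedValue
    private Long id;

    @JsonProperty("name")
    private String name;

    @JsonProperty("unitPrice")
    private long unitPriceInCents;

    // getters and setters omitted
}

// Exposing the repository turns the entity straight into an HTTP service.
@RepositoryRestResource(path = "products")
interface ProductRepository extends CrudRepository<Product, Long> {
}
```

Exposing the repository over REST is just about all it takes to turn a database table into a network-facing entity service, which is exactly why the pattern keeps getting rediscovered.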

In order to make the case that the resulting context is worse than the starting context, I need to assume what that starting context actually is. For the sake of generality, I’ll assume a largish, legacy application that is more or less a monolith. It may call out to some integration points to get work done, but features are pretty much local and in-process. There are multiple instances of the process running on different hosts. Basically, like the following diagram.

All features reside in the code for the application instances.

Many other authors have enumerated the sins of the monolith, so I won’t belabor them here. (Though I feel compelled to make a brief aside to say that we did somehow deliver quite a lot of working, valuable features that ran in monoliths.)

How might we describe this initial context?

  • It is clear where the code to build a feature goes and how to test the code.
  • The release cadence is dictated by the slowest-delivering subteam.
  • There is little inherent enforcement of boundaries, thus coupling tends to increase over time.
  • Performance problems can be found by profiling a single application.
  • The cause of availability problems is typically found in one place.
  • Building features that rely on multiple entities is straightforward, though it may come at the cost of inappropriate coupling.
  • As the code grows large, the organization is at risk of entering the fear cycle.
  • Feature availability may be compromised by inappropriate coupling via common modes in the application. (E.g., thread pools, connection pools.)
  • Feature availability should be improved by redundancy of the whole application itself. This is reduced, however, if the application is vulnerable to the surrounding environment, as in the case of a Self-Denial Attack, memory leak, or race condition.

Suppose we move this to a microservice architecture with entity services. We might end up with something like this example from the Spring tutorial:

In this version, you should assume that each of the service boxes comprises multiple instances of that service.

Obviously there are more moving parts involved. That immediately means it’s harder to maintain availability.

The challenges of performance analysis and debugging are well documented, so I won’t belabor them.

But in this resulting context, where do features get created? A few of them are direct interactions of the “Online Shopping” service and the individual entity services. For example, creating an account is definitely just between Online Shopping and Accounts.

Most features, however, require more than one of the entities. They will use aggregates or intersections of entities.

For example, suppose we need to calculate the total price of a cartful of items. That involves the cart, the products (for their individual prices) and the account to find the applicable sales tax or VAT. I predict that it will be implemented in the Online Shopping service by making a bunch of calls to the entity services to get their data.
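To make that coupling concrete, here is a hedged sketch of the kind of code that tends to accumulate in the Online Shopping service. The hostnames, paths, and DTO shapes are invented for illustration; they are not from the Spring tutorial.

```java
// Illustrative only: service names, paths, and DTO shapes are invented for this sketch.
import java.math.BigDecimal;
import java.util.List;
import org.springframework.web.client.RestTemplate;

record CartItem(long productId, int quantity) {}
record Cart(List<CartItem> items) {}
record Product(long id, BigDecimal unitPrice) {}
record Account(long id, String region) {}

class CartPricingService {
    private final RestTemplate rest = new RestTemplate();

    BigDecimal priceCart(long accountId, long cartId) {
        // A single "price the cart" request fans out to three entity services.
        Cart cart = rest.getForObject("http://cart-service/carts/{id}", Cart.class, cartId);
        Account account = rest.getForObject("http://account-service/accounts/{id}", Account.class, accountId);

        BigDecimal subtotal = BigDecimal.ZERO;
        for (CartItem item : cart.items()) {
            Product product = rest.getForObject(
                    "http://product-service/products/{id}", Product.class, item.productId());
            subtotal = subtotal.add(product.unitPrice().multiply(BigDecimal.valueOf(item.quantity())));
        }

        // Tax or VAT comes from the account's region, so Accounts is on the critical path too.
        return subtotal.add(subtotal.multiply(taxRateFor(account.region())));
    }

    private BigDecimal taxRateFor(String region) {
        // Placeholder lookup; a real system might call yet another service here.
        return "MN".equals(region) ? new BigDecimal("0.07375") : new BigDecimal("0.05");
    }
}
```

Every call in that loop is a remote hop to an entity service, and all of them sit on the critical path of a single user-visible action.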

We can depict this with an “activation set” (a term I made up) to show which services are activated during processing of a single request type. For this picture, we focus on just the services and elide the infrastructure.

So to price the cart, we have to activate four of the five services in our architecture.

That activation represents operational coupling, which affects availability, performance, and capacity.

It also represents semantic coupling. A change to any of the entity services has the potential to ripple through into the online shopping service. (In particularly bad cases, the online shopping service may find itself brokering between data formats: translating version 5 of the user data produced by Accounts into the version 3 format that Cart expects.)
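As a small, hedged illustration of that brokering (the versions and field names are invented), the shopping service ends up owning translation code like this:

```java
// Invented field names and versions, purely to illustrate the brokering burden.
record AccountV5(long id, String givenName, String familyName, String taxRegion) {}
record CartOwnerV3(long accountId, String displayName, String region) {}

class AccountTranslator {
    // The Online Shopping service has to know both formats and keep this in sync
    // with every change to either entity service.
    CartOwnerV3 toCartFormat(AccountV5 account) {
        return new CartOwnerV3(
                account.id(),
                account.givenName() + " " + account.familyName(),
                account.taxRegion());
    }
}
```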

A common corollary to entity services is the idea of “stateless business process services.” I think this meme dates back to last century with the original introduction of Java EE, with entity beans and session beans. It came back with SOA and again with microservices.

What happens to our picture if we introduce a process service to handle pricing the cart?

Not much improvement.

Bear in mind this is the activation set for just one request type. We have to consider all the different request types and overlay their activation sets. We’ll find the entity services are activated for the majority of requests. That makes them a problem for availability and performance. It also means they won’t be allowed to change as fast as we’d like. (Services with a high fan-in need to be more stable.)

So, let’s look at the resulting context of moving to microservices with entity services:

  • Performance analysis and debugging are more difficult. Tracing tools such as Zipkin are necessary.
  • Additional overhead of marshalling and parsing requests and replies consumes some of our precious latency budget.
  • Individual units of code are smaller.
  • Each team can deploy on its own cadence.
  • Semantic coupling requires cross-team negotiation.
  • Features mainly accrue in “nexuses” such as API, aggregator, or UI servers.
  • Entity services are invoked on nearly every request, so they will become heavily loaded.
  • Overall availability is coupled to many different services, even though we expect individual services to be deployed frequently. (A deployment looks exactly like an outage to callers!)

In summary, I’d say both criteria are met to label entity services as an antipattern.

Stay tuned. In a future post, we’ll look at what to do instead of entity services.


If you’re interested in learning more about breaking up monoliths, you might like my Monolith to Microservices workshop.

There is a session open to the public in March 2018.

Or, contact me to schedule a workshop at your company.

Keep ‘Em Separated


Software doesn’t have any natural boundaries. There are no rivers, mountains, or deserts to separate different pieces of software. If two services interact, then they have a sort of “attractive force” that makes them grow towards each other. The interface between them becomes more specific. Semantic coupling sneaks in. At a certain point, they might as well be one module running in one process.

If you’re building microservices, you need to make sure they don’t grow together into an impenetrable bramble. The whole microservice bet is that we can trade deployment and operational complexity for team-scale autonomy and development speed. The last thing you want is to take on the operational complexity of microservices and still move slowly due to semantic coupling among them.

Maybe you’ve recently broken up a monolith into microservices, but found that things aren’t as easy and rosy as the conference talks led you to believe.

Maybe you have a microservice architecture that is starting to slow down and get harder. Like cooling honey, it seems sweet at first but gets stickier later.

I’m going to write a short series of posts with techniques to keep ‘em separated. This will go into API design for microservices, information architecture, and feature design. It’ll be all about making smaller, more general pieces that you can rearrange in interesting ways.



Root Cause Analysis as Storytelling


Humans are great storytellers and even better story-listeners. We love to hear stories so much that when there aren’t any available, we make them up on our own.

From an early age, children grasp the idea of narrative. Even if they don’t understand the forms of storytelling so much, you can hear a four-year-old weave a linked list of events from her day.

We look for stories behind everything. At a deep level, we want the world’s events to mean something. Effect follows cause, and causes have an actor to set them in motion.

Our sense of balance also demands that large effects should have large causes, with correspondingly large intent.

A drunk driver speeds through a red light, oblivious. A crossing car stops short. The shaken driver creeps home with a pounding pulse, full of queasy adrenaline. She unbuckles her daughter and hugs her tightly.

A drunk driver speeds through a red light, oblivious. A crossing car is in the intersection. The drunk smashes into it, right at the driver’s side door. The woman’s bloody face is hidden behind airbags. Her daughter sits in her new wheelchair for her mother’s funeral.

The difference between those stories is a matter of a split second in timing. There is absolutely no change in the motives or desires of anyone in the two vignettes. The first drunk, if caught, would get a jail term and large fine. He would probably lose his driver’s license.

But most people would judge the motives of the second driver far more harshly. They would condemn him to a lengthy prison term and a lifetime ban on driving.

When we see a large effect, we expect a large cause, with a large intent.

The idea that some vast, horrible events strike randomly fills us with dread. People can’t bear the thought that a single unbalanced nobody can change the course of a nation’s history with one rifle shot, so they spend more than 50 years searching for “the truth.”

“Root Cause Analysis” expresses a desire for narrative. With the power of hindsight, we want to find out what went wrong, who did it, and how we can make sure it never happens again. But because we have the posterior event, we judge the prior probabilities differently. Any anomaly or blip suddenly becomes suspect.

People don’t look as hard at anomalies when nothing bad happens.

They don’t notice all the times the same weird log message pops up before … everything continues as normal.

When we look for “root cause,” what we are really trying to discern is not “what made this happen.” We are looking for something that would have stopped it from happening. We are building a counterfactual narrative—an alternate history—where that drunk driver dropped his keys in the parking lot and was thereby delayed a few crucial seconds.

Peel back the surface of a root cause analysis and you almost always find a formula that goes like this: “Factor X” could have prevented the bad event. “Factor X” was not present; therefore, the bad event happened.

The catch is that there is usually an endless variety of possible counterfactuals. Often, more than one counterfactual narrative would have prevented the bad outcome equally well. Which one was the root cause? Non-existence of “factor X” or non-existence of “factor Y?”

Next time you have a bad incident, why not try to focus your efforts in a different way? Work on learning from the times that things don’t go wrong. And be explicit about looking for many possible interventions that would have prevented the problem. Then select ones with broad ability to prevent or impede many different problems.

Release It Second Edition in Beta


I’m excited to announce the beta of Release It! Second edition.

It’s been ten years since the first edition was released. Many of the lessons in that book hold strong. Some are even more relevant today than in 2007. But a few things have changed. For one thing, capacity management is much less of an issue today. The rise of the cloud means that developers are more exposed to networks than ever. And in this era of microservices, we’ve got more and better ops tools in the open source world than ever.

All of that motivated me to update the book for this decade. I’ve removed the section on capacity and capacity optimization and replaced it with a section that builds up a picture of our systems by doing a “slow zoom” out from the hardware, to single processes, to clusters, to the controlling infrastructure, and to security issues.

The first beta does not yet include two additional new parts on deployment and solving systemic problems. Those will be coming in the next few weeks.

In the meanwhile, I look forward to hearing your comments and feedback! Join the conversation in the book’s forum.

Spectrum of Change


I’ve come to believe that every system implicitly defines a spectrum of changes, ordered by their likelihood. As designers and developers, we make decisions about what to embody as architecture, code, and data based on known requirements and our experience and intuition.

We pick some kinds of changes and say they are so likely that we should represent the current choice as data in the system. For instance, who are the users? You can imagine a system where the user base is so fixed that there’s no data representing the user or users. Consider a single-user application like a word processor.

Another system might implicitly indicate there is just one community of users. So there’s no data that represents an organization of users… it’s just implicit. On the other hand, if you’re building a SaaS system, you expect the communities of users to come and go. (Hopefully, more come than go!) So you make whole communities into data because you expect that population to change very rapidly.

If you are building a SaaS system for a small, fixed market you might decide that the population won’t change very often. In that case, you might represent a population of users in the architecture via instancing.

So data is at the high-energy end of the spectrum, where we expect constant change. Next would be decisions that are contemplated in code but only made concrete in configuration. These aren’t quite as easy to change as data. Furthermore, we expect that only one answer to any given configuration choice is operative at a time. That’s in contrast to data where there can be multiple choices active simultaneously.

Below configuration are decisions represented explicitly in code. Constructs like policy objects, strategy patterns, and plugins all indicate our belief that the answer to a particular decision will change rapidly. We know it is likely to change, so we localize the current answer to a single class or function. This is the origin of the “Single Responsibility Principle.”
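As a sketch of that “encapsulated code” level (the domain and names here are invented), a policy object localizes the current answer so that changing it touches exactly one place:

```java
// Invented example: the decision "how do we calculate shipping?" is expected
// to change, so the current answer is localized behind one small interface.
import java.math.BigDecimal;

interface ShippingPolicy {
    BigDecimal shippingFor(BigDecimal orderTotal);
}

class FlatRateShipping implements ShippingPolicy {
    public BigDecimal shippingFor(BigDecimal orderTotal) {
        return new BigDecimal("4.99");
    }
}

class FreeOverThreshold implements ShippingPolicy {
    private final BigDecimal threshold = new BigDecimal("50.00");

    public BigDecimal shippingFor(BigDecimal orderTotal) {
        return orderTotal.compareTo(threshold) >= 0 ? BigDecimal.ZERO : new BigDecimal("4.99");
    }
}

class CheckoutService {
    private final ShippingPolicy shippingPolicy;

    // Swapping the policy is a one-line change at the point of construction,
    // or a configuration choice one level "hotter" on the spectrum.
    CheckoutService(ShippingPolicy shippingPolicy) {
        this.shippingPolicy = shippingPolicy;
    }

    BigDecimal totalWithShipping(BigDecimal orderTotal) {
        return orderTotal.add(shippingPolicy.shippingFor(orderTotal));
    }
}
```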

Farther down the spectrum, we have cross-cutting behavior in a single system. Logging, authentication, and persistence are the typical examples here. Would it be meaningful to push these up to a higher level, like configuration? What about data?

Then we have those things which are so implicit to the service or application that they aren’t even represented. Everybody has a story about when they had to make one of these explicit for the first time. It may be adding a native app to a Web architecture, or going from single-currency, single-language to multinational.

Next we run into things that we expect to change very rarely. These are cross-cutting behavior across multiple systems. Authentication services and schemas often land at this level.

So the spectrum goes like this, from high-energy, rapidly changing blue to cool, sedate red:

  • Data
  • Configuration
  • Encapsulated code
  • Cross-cutting code
  • Implicit in application
  • Cross-cutting architecture

Implications

The farther toward the “red” end of the spectrum we relegate a concern, the more tectonic it will be to change it.

No particular decision “naturally” falls at one level or another. We just have experience and intuition about which kinds of changes happen with greatest frequency. That intuition isn’t always right.

Efforts to make everything into data in the system lead to rules engines and logic programming. That doesn’t usually deliver the end-user control we expect. It turns out you still need programmers to think through changes to rules in a rules engine. Instead of democratizing the changes, you’ve made them more esoteric.

It’s also not feasible to hoist everything up to be data. The more decisions you energy-boost to that level, the more it costs. And at some point you generalize enough that all you’ve done is create a new programming language. If everything about your application is data, you’ve written an interpreter and recursed one level higher. Now you still have to decide how to encode everything in that new language.

Queuing for QA


Queues are the enemy of high-velocity flow. When we see them in our software, we know they will be a performance limiter. We should look at them in our processes the same way.

I’ve seen meeting rooms full of development managers with a chart of the year, trying to allocate which week each dev project will enter the QA environment. Any project that gets done too early just has to wait its turn in QA. Entry to QA becomes a queue of its own. And as with any queue, the more variability in the processing time, the more likely the queue is to back up and get delayed.

When faced with a situation like that, the team may look for the “right number” of QA environments to build. There is no right number. Any fixed number of environments just changes the queuing equation but keeps the queue. A much better answer is to change the rules of the game. Instead of having long-lived (in other words, broken and irreproducible) QA environments, focus on creating a machine for stamping out QA environments. Each project should be able to get its own disposable, destructible QA system, use it for the duration of a release, and discard it.

Availability and Stability


The last post covered technical definitions of fault, error, and failure. In this post, we will apply those definitions to a system.

Our context is a long-running service or server. It handles requests from many different consumers. Consumers may be human users, as in the case of a web site, or they may be other programs.

Engineering literature has many definitions of “availability.” For our purpose we will use observed availability. That is the probability that the system survives between the time a request is submitted and the time it is retired. Mathematically, this can be expressed as the probability that the system does not fail between time T_0 and T_1, where the difference T_1 - T_0 is the request latency.
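Written out, with notation introduced here rather than taken from any standard:

$$
A_{\text{observed}} = \Pr\!\left[\,\text{no failure during } [T_0, T_1]\,\right], \qquad T_1 - T_0 = \text{request latency}
$$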

(There is a subtle issue with this definition of observed availability, but we can skirt it for the time being. It intrinsically assumes there is some other channel by which we can detect failures in the system. In a pure message-passing network such as TCP/IP, there is no way to distinguish between “failed” and “really, really slow.” From the consumer’s perspective, “too slow” is failed.)

The previous post established that faults will occur. To maintain availability, we must prevent faults from turning into failures. At the component level, we may apply fault-tolerance or fault-intolerance. Either way, we must assume that components will lose availability.

Stability, then, is the architectural characteristic that allows a system to maintain availability in the face of faults, errors, and partial failures.

At the system level, we can create stability by applying the principles of recovery-oriented computing.

  1. Severability. When a component is malfunctioning, we must be able to cut it off from the rest of the system. This must be done dynamically at runtime. That is, it must not require changes to configuration or rebooting of the system as a whole.
  2. Tolerance. Components must be able to absorb “shocks” without transmitting or amplifying them. When a component depends on another component that is failing or severed, it must not exhibit higher latency or generate errors.
  3. Recoverability. Failing components must be restarted without restarting the entire system.
  4. Resilience. A component may have higher latency or error rate when under stress from partial failures or internal faults. However, when the external or internal condition is resolved, the component must return to its previous latency and error rate. That is, it must display no lasting damage from the period of high stress.

Of these characteristics, recoverability may be the easiest to achieve in today’s architectures. Instance-level restarts of processes, virtual machines, or containers all achieve recoverability of damaged components.

The remaining characteristics can be embedded in the code of a component via Circuit Breakers, Bulkheads, and Timeouts.
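As a rough illustration of how those characteristics show up in code, here is a minimal, hand-rolled circuit breaker sketch. A real system would use a hardened library; the thresholds and names here are invented.

```java
// Minimal circuit breaker sketch: trip after consecutive failures, then fail
// fast until a cool-down period has passed. Thresholds are illustrative.
import java.util.concurrent.Callable;

class CircuitBreaker {
    private final int failureThreshold;
    private final long coolDownMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private boolean open = false;

    CircuitBreaker(int failureThreshold, long coolDownMillis) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
    }

    synchronized <T> T call(Callable<T> protectedCall) throws Exception {
        if (open) {
            if (System.currentTimeMillis() - openedAt < coolDownMillis) {
                // Fail fast: don't pass the downstream failure along as added latency.
                throw new IllegalStateException("circuit open; failing fast");
            }
            open = false; // half-open: let one call through to probe for recovery
        }
        try {
            T result = protectedCall.call();
            consecutiveFailures = 0; // resilience: return to normal once calls succeed
            return result;
        } catch (Exception e) {
            if (++consecutiveFailures >= failureThreshold) {
                open = true;          // severability: cut off the failing dependency
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }
}
```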

Fault, Error, Failure


Our systems suffer many insults when they contact the real world. Flaky inputs, unreliable networks, and misbehaving users, to name just a few. As we design our components and systems to thrive in the only environment that matters, it pays to have mental schema and language to discuss the issues.

A fault is an incorrect internal state in your software. Faults are often introduced at component, module, or subsystem boundaries. There can be a mismatch between the contract a module is designed to implement and its actual behavior. A very simple example is accepting a negative integer or zero when a strictly positive integer was expected.

A fault may also occur when a latent bug in the software is triggered by an external or internal condition. For example, attempting to allocate an object when memory is exhausted will return a null pointer. If the software proceeds with the null pointer it can cause problems later, perhaps in a far distant part of the code.

Such an incorrect state may be recoverable. A fault-tolerant module will attempt to restore a good internal state after detecting a fault. Exception handlers and error-checking code are efforts to provide fault-tolerance.

Another school of thought says that fault tolerance is unreliable. In this approach, once a fault has occurred, the entire memory state of the program must be regarded as corrupt. Instead of attempting to restore a good state by backtracking or patching up the internal state, fault-intolerant modules will exit to avoid producing errors. A system built from these fault-intolerant modules will include supervisor capabilities to restart exited modules.
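Two hedged sketches of the same boundary check (the class and field names are invented) show the difference between the two schools:

```java
// Fault-tolerant: detect the fault and restore a known-good internal state.
class TolerantRetryPolicy {
    private int maxAttempts = 3;

    void setMaxAttempts(int requested) {
        if (requested <= 0) {
            this.maxAttempts = 3; // patch up the state; no error escapes
        } else {
            this.maxAttempts = requested;
        }
    }
}

// Fault-intolerant: refuse to continue with possibly corrupt state and let a
// supervisor (or the caller) restart from a clean slate.
class IntolerantRetryPolicy {
    private final int maxAttempts;

    IntolerantRetryPolicy(int maxAttempts) {
        if (maxAttempts <= 0) {
            throw new IllegalArgumentException("maxAttempts must be strictly positive");
        }
        this.maxAttempts = maxAttempts;
    }
}
```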

If a fault propagates in the system, it can produce visibly incorrect behavior. This is an error. Faults may occur without producing errors, as in the case of fault-tolerant modules that correct their own state before an error is observed. An error may be limited to an incorrect output displayed to a user. It can include any incorrect behavior, including data loss or corruption, network flooding, or launching attack drones.

At the component, module, or subsystem level, our mission is to prevent faults from causing errors.

A failure results when a system terminates without completing its job. For a long-running service or server, that means it stops responding to requests within a finite time. For a program that should run to completion and exit, it exits abnormally before completing. A failure may be preferable to an error, depending on the harm caused by the error.

Next time, I will address system stability in the face of faults, errors, and failures.

Power Systems


This is an excerpt from something I’m working on this Labor Day holiday:

Large-scale power outages act a lot like software failures. They start with a small event, like a power line grounding out on a tree. Ordinarily that would be no big deal, but under high-stress conditions it can turn into a cascading failure that affects millions of people. We can also learn from how power gets restored after an outage. Operators must perform a tricky balancing act between generation, transmission, and demand.

There used to be a common situation where power would be restored and then cut off again in a matter of seconds. It was especially common in the American South, where millions of air conditioners and refrigerators would all start at the same time. When a motor starts up, it draws a lot of current. You can see this in the way that lights dim when you start a circular saw. As the motor starts to spin, though, it creates “back EMF”–a kind of backpressure on the electrical current. (That’s when the lights return to full brightness.) If you add up the effects of millions of electric motors starting all at once, you see a huge upward blip in current draw, followed by a quick drop due to back current. Power transmission systems would see the spike and drop and propagate that to the generation systems. First they would increase their draw then drop it dramatically. That would make the generation systems think they should shut off some of the turbines. Right about the time they started reducing supply, the initial surge of back EMF would decline and current load would come back up to baseline levels. The increased current load hit just when supply was declining, causing excess demand to trip circuit breakers. Lights out, again.

Smarter appliances and more modern control systems have mitigated that particular failure mode now, but there are still useful lessons for us.

Remember DAT?


Do you remember Digital Audio Tape? DAT was supposed to have all the advantages of digital audio—high fidelity and perfect reproduction—plus the “advantages” of tape. (Presumably those advantages did not include melting on the dashboard of your Chevy Chevelle or spontaneously turning into The Best of Queen after a fortnight.)

In hindsight, we can see that DAT was a twilight product. As the sun set on the cassette era, DAT was an attempt to bridge the discontinuous technology change to digital music production. It was a twilight product because it didn’t sufficiently reimagine the existing technology to offer enough of a new advantage nor did it eliminate enough of the old disadvantages.

We often see such twilight products right before a major, discontinuous shift.

I think we’re in such a period when it comes to software development and deployment for cloud native systems. The tools we have attempt to take the traditional model into the new environment. But they don’t yet reimagine the world of software development enough. Ten years from now, we will see that they offered some advantages but also carried forward baggage from the client-server era: Unix-like full operating systems, coding one process at a time, treating the network as a secondary concern, ignoring the memory hierarchy in our languages.

Whatever the “operating system for the cloud” is, we haven’t seen it yet.