Wide Awake Developers

What’s Lost With a DevOps Team

Please understand, dear Reader, that I write this with positive intention. I’m not here to impugn any person or organization. I want to talk about some decisions and their natural consequences. These consequences seem negative to me and after reading this post you may agree.

When an established company faces a technology innovation, it often creates a new team to adopt and exploit that innovation. During my career, I’ve seen this pattern play out with microcomputers, client/server architecture, open systems, web development, agile development, cloud architecture, NoSQL, and DevOps. Perhaps we can explore the pros and cons of that overall approach in some other post. For now, I want to specifically address the DevOps team.

A DevOps team gets created as an intermediary between development and operations. This is especially likely when dev and ops report through different management chains. That is to say, in a functionally-oriented structure. In a product-oriented structure, it is less likely.

This intermediary team gets tasked with automating releases and deployments. They are the ones to adopt some code-as-configuration platform. Sometimes they are also tasked with building an internal platform-as-a-service, but that more often falls to the infrastructure and operations teams.

So the DevOps team has development as their customer. Operations has the DevOps team as their customer. Work flows from development, through the tools created by the DevOps team, and into production. This arrangement would seem to capture the benefits of automation: releases become predictable, repeatable, and safe.

All of that is true. However, even though this is an improvement, it misses out on even greater improvements that could be realized.

The key problem is the unclosed feedback loop. When developers are directly exposed to production operations, they learn. Sometimes they learn from negative feedback: getting woken up for support calls, debugging performance problems, or that horrible icy feeling in your stomach when you realize that you just shut down the wrong database in production.

With a DevOps team sitting between development and operations, the operations team remains in the “learning position.” But they lack the ability to directly improve the systems. Suppose a log message is ambiguous. If the operator who sees it can’t directly change the source code, then the message will never get corrected. (It’s important, but small… exactly the thing least likely to be worth filing a change request for.)

Over longer time spans, the things we learn from production should influence the entire architecture: from technology choices to code patterns and common libraries. A DevOps team sitting between development and operations impedes that learning.

DevOps is meant to be a style of interaction: direct collaboration between development and operations. A team in between that automates things is a tools team. It’s OK to call it a tools team. Tools are a good thing, despite what corporate budgeting seems to say these days.

Instead of creating a flow from development to DevOps to operations, consider putting development, tools, and operations all together and giving them the same goals. They should be collaborators working shoulder-to-shoulder rather than work stations in a software factory.

Give Them the Button!

Here’s a syllogism for you:

  • Every technical review process is a queue
  • Queues are evil
  • Therefore, every review process is evil

Nobody likes a review process. Teams who have to go through the review look for any way to dodge it. The reviewers inevitably delegate the task downward and downward.

The only reason we ever create a review process is because we think someone else is going to feed us a bunch of garbage. They get created like this:

It starts when someone breaks a thing that they can’t or aren’t allowed to fix. The responsibility for repair goes to a different person or group. That party shoulders both responsibility for fixing the thing and also blame for allowing it to get screwed up in the first place.

(This is an unclosed feedback loop, but it is very common. Got a separate development and operations group? Got a separate DBA group from development or operations? Got a security team?)

As a followup, to ensure “THIS MUST NEVER HAPPEN AGAIN” the responsible party imposes a review process.

Most of the time, the review process succeeds at preventing the same kind of failure from recurring. The resulting dynamic looks like this:

The hidden cost is the time lost. Every time that review process has to go off, the creator must prepare secondary artifacts: some kind of submission to get on the calendar, a briefing, maybe even a presentation. All of these are non-value-adding to the end customer. Muda. Then there’s the delay on the review meeting or email itself. Consider that there is usually not just one review but several needed to get a major release out the door and you can see how release cycles start to stretch out and out.

Is there a way we can get the benefit of the review process without incurring the waste?

Would I be asking the question if I didn’t have an answer?

The key is to think about what the reviewer actually does. There are two possibilities:

  1. It’s purely a paperwork process. I’ll automate this away with a script that makes a PDF and automatically emails it to whoever needs it. Done.
  2. The reviewer applies knowledge and experience to look for harmful situations.

Let’s talk mostly about the latter case. A lot of our technology has land mines. Sometimes that is because we have very general purpose tools available. Sometimes we use them in ways that would be OK in a different situation but fail in the current one. Indexing an RDBMS schema is a perfect example of this.

Sometimes, it’s also because the creators just lack some experience or education. Or the technology just has giant, truck-sized holes in it.

Whatever the reason, we expect that the reviewer is adding intelligence, like so:

This benefits the system, but it could be much better. Let’s look at some of the downsides:

  • Throughput is limited to the reviewer’s bandwidth. If they truly have a lot of knowledge and experience, then they won’t have much bandwidth. They’ll be needed elsewhere to solve problems.
  • The creator learns from the review meetings… by getting dinged for everything wrong. Not a rewarding process.
  • It is vulnerable to the reviewer’s availability and presence.

I’d much rather see the review codify that knowledge by building it into automation. Make the automation enforce the practices and standards. Make it smart enough to help the creator stay out of trouble. Better still, make it smart enough to help the creator solve problems successfully instead of just rejecting low quality inputs.

With this structure, you get much more leverage from the responsible party. Their knowledge gets applied across every invocation of the process. Because the feedback is immediate, the creator can learn much faster. This is how you build organizational knowledge.
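As a minimal sketch of what "building knowledge into automation" might look like, the reviewer's experience can be written down as small rules that run on every proposed change. Everything here is hypothetical: the rule names, the shape of the `change` dictionary, and the thresholds are illustrative assumptions, not anything prescribed by the post. The point is that each rule returns advice on how to fix the problem rather than a bare rejection.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    rule: str
    advice: str  # how to fix it, not just what is wrong


# A rule inspects a proposed change and returns advice, or None if it passes.
Rule = Callable[[dict], Optional[str]]


def timeout_rule(change: dict) -> Optional[str]:
    # Hypothetical standard: every outbound HTTP call needs a timeout.
    if change.get("http_timeout_seconds") is None:
        return "Set http_timeout_seconds; 2 is a reasonable starting point."
    return None


def pool_rule(change: dict) -> Optional[str]:
    # Hypothetical limit chosen for illustration only.
    if change.get("db_pool_size", 0) > 50:
        return "Pools above 50 connections are risky here; try 20."
    return None


RULES: dict[str, Rule] = {"timeout": timeout_rule, "pool": pool_rule}


def review(change: dict) -> list[Finding]:
    """Run every codified rule and collect the advice for the creator."""
    findings = []
    for name, rule in RULES.items():
        advice = rule(change)
        if advice is not None:
            findings.append(Finding(name, advice))
    return findings
```

Calling `review({"db_pool_size": 200})` would return two findings, one per rule, each carrying a suggested fix. When the responsible party learns something new, it becomes one more entry in `RULES` and applies to every later invocation.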

Some technology is not amenable to this kind of automation. For example, parsing some developer’s DDL to figure out whether they’ve indexed things properly is a massive undertaking. To me, that’s a sufficient reason to either change how you use the technology or just change technology. With the DDL, you could move to a declarative framework for database changes (e.g., Liquibase). Or you could use virtualization to spin up a test database, apply the change, and see how it performs.
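The "spin up a test database and apply the change" option can be sketched roughly as follows. This is an illustration under loud assumptions: an in-memory SQLite database stands in for whatever database you actually run, the `orders` table and the query are made up, and "performs well" is reduced to "the query plan does not scan a whole table." A real check would want representative data and timings.

```python
import sqlite3


def full_scans(ddl: str, queries: list[str]) -> list[str]:
    """Apply ddl to a scratch database; return the queries that scan a whole table."""
    db = sqlite3.connect(":memory:")
    db.executescript(ddl)
    offenders = []
    for query in queries:
        plan = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()
        # The last column of each plan row is a readable description of the
        # step; SQLite starts it with SCAN for a scan and SEARCH for a lookup.
        if any(row[-1].startswith("SCAN") for row in plan):
            offenders.append(query)
    db.close()
    return offenders


# Hypothetical change under review and a query the application is known to run.
DDL = "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);"
INDEX = "CREATE INDEX idx_orders_customer ON orders(customer_id);"
QUERY = "SELECT total FROM orders WHERE customer_id = 42"
```

With only `DDL` applied, `full_scans(DDL, [QUERY])` reports the query. With `DDL + INDEX`, it reports nothing. That is the kind of immediate feedback a human DBA would otherwise have to deliver in a meeting.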

Or you can move to a database where the schema is itself data, available for query and inspection with ordinary program logic.

The automation may not be able to cover 100% of the cases in general-purpose programming. That’s why local context is important. As long as there is at least one way to solve the problem that works with the local infrastructure and automation, then the problem can be solved. In other words, we can constrain our languages and tools to fit the automation, too.

Finally, there may be a need for an exception process, where the automation can’t decide whether something is viable or not. That’s a great time to get the responsible party involved. That review will actually add value because every party involved will learn. Afterward, the RP may improve the automation or may even improve the target system itself.

After all, with all the time that you’re not spending in pointless reviews, you have to find something to do with yourself.

Happy queue hunting!

C9D9 on Architecture for Continuous Delivery

Every single person I’ve heard talk about Continuous Delivery says you have to change your system’s architecture to succeed with it. Despite that, we keep seeing “lift and shift” efforts. So I was happy to be invited to join a panel to discuss architecture for Continuous Delivery. We had an online discussion last Tuesday on the C9D9 series, hosted by Electric Cloud.

They made the recording available immediately after the panel, along with a shiny new embed code.

Best of all, they supplied a transcript, so I can share some excerpts here. (Lightly edited for grammar, since I have relatives who are editors and I must face them with my head held high.)

Pipeline Orchestration

It’s easy to focus on the pipeline as the thing that delivers code into production. But I want to talk about two other central roles that it plays. One is risk management. To me the pipeline is not so much about ushering code out to production as it is about finding every opportunity to reject a harmful or bad change before letting it get into production. So I view the pipeline as an essential part of risk management.

I’ve also had a lot of lean training, so I’d look on the deployment pipeline as the value stream that developers use to deliver value to their customers. In that respect we need to think about the pipeline as production-grade infrastructure, and we need to treat it with production-like SLAs.
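One way to picture the pipeline as a series of chances to reject a change is the sketch below. The stage names and the fields on `change` are hypothetical; the only idea taken from the discussion is that each stage exists to say "no" as early as possible.

```python
from typing import Callable, Optional

# A stage returns a rejection reason, or None to let the change through.
Stage = Callable[[dict], Optional[str]]


def unit_tests(change: dict) -> Optional[str]:
    return None if change.get("tests_passed") else "unit tests failed"


def migration_check(change: dict) -> Optional[str]:
    if change.get("irreversible_migration"):
        return "migration is not reversible"
    return None


def run_pipeline(change: dict, stages: list[Stage]) -> Optional[str]:
    """Return the first rejection reason, or None if every stage accepted."""
    for stage in stages:
        reason = stage(change)
        if reason is not None:
            return reason  # stop here; later stages never see a bad change
    return None
```

A change that fails `unit_tests` never reaches `migration_check`, so the cheapest checks belong first.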

Cattle, Not Pets

I think a lot has been said about “cattle versus pets” over the last ten years or so. I just want to add one thing - the real challenge is identity. There are a ton of systems and frameworks that implicitly assume stable identity on machines, particularly a lot of distributed software toolkits. When you do have the cattle model, a machine identity may disappear and never come back again. I just really hope you’re not building up a queue of undelivered messages for that machine.

Service Orientation and Decoupling

Having teams running in parallel and being able to develop more or less independently - I talk about team scale autonomy. But if there are very long builds, large artifacts and a large number of artifacts, I regard that as the consequence of using languages and tools that are early bound and early linked. I don’t think it’s any accident that the people I heard of first doing continuous delivery were using PHP. You can regard each PHP file as its own deployable artifact, and so things move very quickly. If everything we wrote was extremely late bound, then our deployment would be an rsync command. So to an extent, breaking things down into services is a response to large artifacts and long build times. That’s one side of it.

The other side is team scale autonomy and the fact that you can’t beat Conway’s Law and that absolutely holds true. (Conway’s Law: an organization is constrained to produce software that recapitulates the structure of the organization itself. If you have four teams working on a compiler, you’re going to have a four pass compiler.)

Now, when we talk about decoupling, I need to talk about two different types of decoupling, both important.

The bigger your team gets, the more communication overhead goes up. We have known that since the 1960s, so breaking that down makes sense. But then we have to recompose things at runtime and that’s when coupling becomes a big issue. Operational coupling happens minute by minute by minute. If I have service A calling service B, service B goes down, I have to have some response. If I don’t do anything else, service A is also going to go down. So I need to build in some mechanisms to provide operational decoupling, maybe that’s a cache, maybe it’s timeouts, maybe it’s a circuit breaker, something along those lines, to protect one service from the failure of another service.

It’s not just the failure of the service! A deployment to the other service looks exactly like a failure from the perspective of the consumer. It’s simply not responding to requests within an acceptable time.

So we have to pay attention to the operational decoupling.
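A circuit breaker, one of the mechanisms mentioned above, can be sketched in a few lines. This is a simplified illustration, not a production library: the class name, the thresholds, and the injectable `clock` are assumptions made for the example, and the remote call `fn` is expected to enforce its own timeout and raise when it fails.

```python
import time


class CircuitBreaker:
    """Protect service A from repeated failures of service B."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before retrying B
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        is_open = self.opened_at is not None
        if is_open and self.clock() - self.opened_at < self.reset_after:
            return fallback()  # open: skip the remote call entirely
        try:
            result = fn()  # closed, or a trial call after the waiting period
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result
```

Here `fallback` might return a cached value or a degraded response. Once the breaker trips, service A stops waiting on service B at all, whether B has crashed or is simply in the middle of a deployment.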

Semantic coupling is even more insidious, and that’s what plays out over a span of months and years. We talk about API versioning quite a bit, but there are other kinds of semantic coupling that creep in. I’ve been harping a lot lately about identifiers. If I have to pass an itemID to another system then I’m sort of implicitly saying there is one universe of itemIDs and that system has them all, and I can only talk to that system for items with those IDs.

Similarly with many services that we create, we create the service as though there is one instance of the service. We’d be better off creating the code that can instantiate that service many times for many consumers. So if you create a calendar service, don’t make one calendar that everyone has eventIDs on. Make a calendar service where we can ask for a new calendar and it gives you back a URL for a whole new calendar that is yours and only yours. This is the way you would build it if you were building a SaaS business. That’s how you would need to think about the decoupled services internally.
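The calendar example can be sketched as below. This is an in-memory stand-in, and the class name, method names, and the `calendars.example.com` base URL are all hypothetical. What matters is the shape: the service hands out a fresh calendar to anyone who asks, and event identifiers only mean something inside the calendar that issued them.

```python
import uuid


class CalendarService:
    def __init__(self, base_url="https://calendars.example.com"):
        self.base_url = base_url
        self._calendars = {}  # calendar URL -> list of event titles

    def create_calendar(self) -> str:
        """Create an empty calendar for whoever asks and return its URL."""
        url = f"{self.base_url}/calendars/{uuid.uuid4()}"
        self._calendars[url] = []
        return url

    def add_event(self, calendar_url: str, title: str) -> int:
        events = self._calendars[calendar_url]
        events.append(title)
        return len(events) - 1  # only meaningful within this calendar

    def events(self, calendar_url: str) -> list[str]:
        return list(self._calendars[calendar_url])
```

Two consumers that each call `create_calendar()` get separate URLs and never see each other's events, and the service never needs to know who they are.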

Messaging and Data Management

If I’m truly deploying continuously then I’ve got version N and version N+1 running against the same data source. So I need some way to accommodate that. In older less-flexible kinds of databases, that means triggers, shims, extra views, that kind of scaffolding.

I heard a great story, I think it’s from Pinterest at Velocity a couple of years back. They had started with a monolithic user database and found they needed to split the table. After they already had 60 million users! But they were able to make many small deployments that each added one step of an incremental migration. And once they got that in place, they let it sit for three months. At the end of that they found who was left and did a batch migration of those. Then they did a series of incremental deployments to remove the extra data management stuff.

So it’s one of those cases - doing continuous delivery both necessitates that you’re more sophisticated about your data changes, but it also gives you new tools to accomplish those changes.
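A generic sketch of that style of incremental migration follows. It is not a reconstruction of what any particular company did. The table names (`old_users`, `accounts`, `profiles`) and the fields are hypothetical, and plain dictionaries stand in for database tables. The read path migrates each user the first time they are touched, and a batch step later picks up whoever is left.

```python
def load_user(user_id, old_users, accounts, profiles):
    """Read path that works before, during, and after the migration."""
    if user_id not in accounts:
        row = old_users[user_id]  # not migrated yet: read the old shape
        accounts[user_id] = {"email": row["email"]}
        profiles[user_id] = {"name": row["name"], "bio": row["bio"]}
    return {**accounts[user_id], **profiles[user_id]}


def migrate_stragglers(old_users, accounts, profiles):
    """Batch step for users who never showed up during the lazy phase."""
    moved = 0
    for user_id in old_users:
        if user_id not in accounts:
            load_user(user_id, old_users, accounts, profiles)
            moved += 1
    return moved
```

Each piece can ship as its own small deployment: first the new tables, then the lazy read path, then the batch job, and finally the removal of the old table and the fallback branch.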

There is a wide crop of databases that don’t require that kind of care and feeding when you make deployments. If you are truly architecting for operational ease and delivery, then that might be a sufficient reason to choose one of the newer databases over one of the less flexible relational stores.


The C9D9 discussion was quite enjoyable. The hosts ran the panel well, and even though all of us are pretty long-winded, nobody was able to filibuster. I’ll be happy to join them again for another discussion some time.

Software Eats the World

During this morning’s drive, I crossed several small overpasses. It reminded me that the American Society of Civil Engineers rated more than 20% of our bridges as structurally deficient or functionally obsolete. That got me to thinking about how we even know how many bridges there are in a country as large as the U.S.

Some time in the past, it would require an army of people to go survey all the roads, looking for bridges and adding them to a ledger. Now, I’m sure it’s a query in a geographical database. The information had to be entered at least once, but now that it’s in the database we don’t need people to go wandering about with clicker counters.

Instead of clipboards and paper, the bridge survey needed data import from thousands of state and county GIS databases. That means coders to write the import jobs and DBAs to set up the target systems. It needed queries to count up the bridges and cross-check with inspection reports. So that requires more coders and maybe some UX designers for data visualization.

Back in 2011, Marc Andreessen said “software is eating the world.” There’s no reason to think that’s going to slow down soon. And as software eats the world, work becomes tech work.

Microservices Versus Lean

Back in April, I had the good fortune to speak at Craft Conf in lovely Budapest. It’s a fantastic conference that I would recommend.

During that conference, Randy Shoup talked about his experience migrating from monoliths to microservices at eBay and Google. David, one of the audience members, asked an interesting question at the end of Randy’s talk. (I’m sorry that I didn’t get the full name of the questioner… if you are reading this, please leave a comment to let me know who you are.)

“Isn’t the concept of microservices contradictory with the lean/agile principles of a) collective code ownership, and b) optimizing whole processes and systems instead of small units?”

Randy already did a great job of responding to the first part of that question, so please view the video to hear his answer there. He didn’t have time to respond to the second part so I don’t know what his answer would be, but I will tell you mine.

Start From The “Why”

Let’s start by answering the question with a question. Why do we pursue Lean development in the first place? Your specific answer may vary, but I bet it relates back to “better use of capital” or “turning ideas into profit sooner.” Both of these are statements about efficiency: efficient use of capital and efficient use of time.

One of the first Lean changes is to reorganize people and processes around the value streams. That is a big upheaval! It often means moving from a functional structure to a cross-functional structure. (And I don’t mean matrixing!) Just moving to that cross-functional structure will deliver big improvements to cycle time and process efficiency. After that, teams in each value stream can further optimize to reduce their cycle time.

The next focus is on reducing “inventory.” For development, we consider any unreleased code or stories to be inventory. So, work-in-progress code, features that have been finished but not deployed, and the entirety of the backlog all count as inventory.

Reducing inventory always has the effect of making more problems visible. Maybe there are process bottlenecks to address, or maybe there are high defect rates at certain steps (like failed deployments to production or a lot of rejected builds).

This is the start of the real optimization loop: reduce the inventory until a new problem is revealed. Solve the problem in a way that allows you to further reduce inventory.

Which is the Value Stream?

David’s question seems to originate from the view that the value stream is the request handling process. So if a single request hits a dozen services, then one value stream cuts across multiple organizational boundaries. That would indeed be problematic.

However, I think the more useful viewpoint is that the value stream is “the software delivery process” itself. This is based on the premise that the value stream delivers “things customers would pay for.” Well, a customer wouldn’t pay for a single request to be handled. They would, however, pay for a whole new feature in your product.

Viewed that way, each service in production is the terminal point of its own value stream. So, Lean does not conflict with a microservice architecture. But could a microservice architecture conflict with Lean?

Return to “Why”

We asked, “Why Lean?” Now, let’s ask “Why microservices?” The answer is always “We want to preserve flexibility as we scale the organization.” Microservices are about embracing change at a macroscopic level. That has nothing to do with capital efficiency!

So are these ideas contradictory? To answer that, I need to dig into another aspect of Lean efforts: infrastructure.

Efficiency, Specialization, and Infrastructure

In the early days of aviation, airplanes were made of canvas and wood. They could land at pretty much any meadow that didn’t have cows or sheep in the way. Pilots navigated by sight and landmarks, including giant concrete arrows on the ground. Planes couldn’t go very fast, fly very high, carry many passengers, or haul a lot of cargo.

The maximum takeoff weight of an Airbus A380 is now 1.2 million pounds. It requires a specially reinforced runway of at least 9,020 feet and typically carries 525 passengers. It flies at an altitude of more than 8 miles. This is not an airplane that you navigate by eyeballing landmarks.

This aircraft is amazingly efficient. Achieving that efficiency requires extensive infrastructure. Radar on the plane and on the ground. Multiple comms systems. An extensive array of radio beacons and air traffic controllers on the ground and dozens of satellites in space, all sending signals to the on-board network of flight management systems. Billions of lines of code running across these devices. Airports with jetbridges that have multiple connections to the aircraft. Special vehicles to tow the plane, push the plane out, haul bags, fuel, de-ice, remove waste water… the list goes on and on.

In short, this is not just an airplane. It is part of an elaborate air transportation system.

It should be pretty obvious that the incredible efficiency of modern airliners comes at the expense of flexibility. Not just in terms of the individual aircraft, but in terms of changes to any part of the whole system.

You can see this play out in any technological arena. As we increase the systems’ efficiency, we accumulate infrastructure that both enables the efficient operation and also constrains the system to its current mode of operation.

In Lean initiatives, there is a gradual shift from draining inventory and solving existing problems into creating infrastructure to add efficiency. It’s not a bright line or a milestone to reach, but it is noticeable. As you get further into the infrastructure-efficiency realm, you must recognize two effects:

  • You will get better at certain actions.
  • Other actions become much, much harder.

As an example, suppose you are optimizing the value stream for delivering applications. (A reasonable thing to do.) You will eventually find that you need an automated way to move code into production. You may choose to build golden master images, or automate deployment via scripts, or use Docker to deploy the same configuration everywhere. You may commit to vSphere, Xen, OpenStack, or whatever. As you make these decisions, you make it easier to move code using the chosen stack and much, much harder to do it any other way.

Full Circle

So, with all that background, I’m finally ready to address the question of whether microservices and Lean are in conflict.

Given that:

  1. You want maneuverability from microservices.
  2. Your value stream is delivering features into production.
  3. You pursue Lean past the inventory-draining phase.
  4. Further efficiency improvements require you to commit to infrastructure and an extended system.
  5. That extended system will not be easy to change, no matter what you choose or how you build it.

Then the answer is “no.”

Development Is Production

When I was at Totality, we treated an outage in our customers’ content management system as a Sev 2 issue. It ranked right behind “Revenue Stopped” in priority. Content management is critical to the merchants, copy writers, and editors. Without it, they cannot do their jobs.

For some reason, we always treated dev environment or QA environment issues as a Sev 3 or 4, with the “when I get around to it” SLA. I’ve come to believe that was incorrect.

The development environment and the QA environment are the critical tools needed for developers to do their jobs. When an environment is broken, it means those people are less effective. They might even be idle.

Why would you treat the tools developers use as any less critical? And yet, I see one company after another with unreliable, broken, half-integrated QA environments. They’ve got bad data, unreliable items, and manual test setup.

If any stage of the development pipeline is broken, that’s exactly equivalent to the content pipeline being broken.

Development is production.

QA is production.

Your build pipeline is production.

Treat them accordingly!

The Fear Cycle

Once you begin to fear your technology, you will shortly have cause to fear it even more.

The Fear Cycle goes like this:

  1. Small changes have unpredictable, scary, or costly results.
  2. We begin to fear making changes.
  3. We try to make every change as small and local as possible.
  4. The code base accumulates warts, knobs, and special cases.
  5. Fear intensifies.

Fear starts when an innocuous change goes badly. Maybe a production outage results, or maybe just an embarrassing bug. It may be a bug that gets upper management attention. Nothing instills fear like an executive committee meeting about your code defect!

This sphincter-shrinker originated because a developer couldn’t predict all the ramifications of a change. Maybe the test suite was inadequate. Or there are special cases that are only observed in production. (E.g., that one particular customer whose data setup is different from everyone else’s.) Whatever the specific cause, the general result is, “I didn’t know that would happen.”

Add a few of these events into the company lore and you’ll find that developers and project managers become loath to touch anything outside their narrow scope. They seek local safety.

The trouble with local safety is that it requires kludges. The code base will inevitably deteriorate as pressure for larger changes and broader refactoring builds without release.

The vicious cycle is completed when one of those local kludges is responsible for someone else’s “What? I didn’t know that!” moment. At this point, the fear cycle is self-sustaining. The cost of even small changes will continue to increase without limit. The time needed to get changes released will increase as well.

Breaking Point

One of several things will happen:

  1. A big bang rewrite (usually with a different team). The focus will be “this time, we do it right!” See also: second system syndrome, Things You Should Never Do, Part I.
  2. Large scale outsourcing.
  3. Sell off the damaged assets to another company.

Avoiding the Cycle

The fear cycle starts when people treat a technical problem as a personal one. The first time a seemingly simple change causes a large and unpredictable effect, you need to convene a technical SWAT team to determine why the system allowed it to happen and what technical changes can avoid it in the future.

The worst response to a negative event is a tribunal.

Sadly, the difference between a technical SWAT team and a tribunal is mostly in how the individuals in that group approach the issue. Wise leadership is required to avoid the fear cycle. Look to people with experience in operations or technical management.

Breaking the Cycle

Like many reinforcing loops in an organization, the fear cycle is wickedly hard to break. So far, I have not observed any instance of a company successfully breaking out of it. If you have, I would be very interested to hear your experiences!

Components and Glue

There’s a well-known architectural style in desktop applications called “Components and Glue”. The central idea is that independent components are composed together by a scripting layer. The glue is often implemented in a different or more dynamic language than the components.

The C2 wiki’s page on ComponentGlue has been stable since 2004, so obviously this is not a new idea.

Emacs is one example of this approach. The components are written in C, the glue is ELisp. (To be fair, though, the ELisp outnumbers the C by a pretty large factor.)

Perl was originally conceived as a glue language.

Visual Basic applications also followed this pattern. Components written in C or C++, glue in VB itself.

I think Components and Glue is a relevant architecture style today, especially if we want to compose and recompose our services in novel ways.

My last several posts have been about decomposing services into smaller, more independent units. Each one could be its own micro-SaaS business. Some application needs to stitch these back together. I often see this done in a separate layer that presents a simplified interface to the applications.

This glue layer may be written in a different language than the services themselves. For that matter, the individual services may be written in a variety of languages, but that’s a subject for a different time.

The glue layer changes more rapidly than the back end services, because it needs to keep serving the applications as they change. Even when the back end services are provided by an enterprise IT group, the integration layer will be more affiliated with the front end web & app teams.

We embrace plurality, so if there’s one glue layer, there may be more. We should allow multiple glue layers, where each one is adapted to the needs of its consumers. That begins to look like this:

The smaller and lighter we make the glue, the faster we can adapt it. The endpoint of that progression looks like AWS Lambda where every piece of script gets its own URL. Hit the URL to invoke the script and it can hit services, reshape the results, and reply in a client-specific format.
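A piece of glue at that scale might look like the sketch below. It is deliberately not tied to any real function-hosting API. The two `fetch_*` functions are hypothetical stand-ins for calls to back end services, and the output shape is an invented "mobile client" format. The script does the three things described above: hit services, reshape the results, and reply in a client-specific format.

```python
import json


def fetch_price(sku: str) -> dict:
    # Stand-in for a call to a pricing service (hypothetical).
    return {"sku": sku, "amount_cents": 1999, "currency": "USD"}


def fetch_stock(sku: str) -> dict:
    # Stand-in for a call to an inventory service (hypothetical).
    return {"sku": sku, "on_hand": 3}


def mobile_item_glue(sku: str, price_service=fetch_price, stock_service=fetch_stock) -> str:
    """Combine two service replies into the small payload one client wants."""
    price = price_service(sku)
    stock = stock_service(sku)
    return json.dumps({
        "sku": sku,
        "price": f"{price['amount_cents'] / 100:.2f} {price['currency']}",
        "available": stock["on_hand"] > 0,
    })
```

Because the script is this small, a different client can get its own variant with a different output shape, and neither back end service has to change.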

Once we reach that terminus, we can even think of individual functions as having URLs. Like one-off scripts in ELisp or Perl, we can write glue for incidental needs: one-time marketing events, promotions, trial integrations, and so on.

“Scripts as glue” also lets us deal with a tension that often arises with valuable customers. Sometimes the biggest whales also demand a lot of customization. How should we balance the need to customize our service for large customers (the whales) and the need to generalize to serve the entire market? We can create suites of scripts that present one or more customer-specific interfaces, while the interior of our services remains generalized.

This also allows us to handle one of the hardest cases: when a customer wants us to “plug in” their own service in lieu of one of ours. As I’ve said before, all our services use full URLs for identifiers, so we should be able to point those URLs at our outbound customer glue. That glue calls the customer service according to its API and returns results according to our formats.
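Here is a rough sketch of outbound customer glue. All names are hypothetical: "Acme" is an invented customer, its `PartNo` and `QtyAvail` fields are an invented API, and the `ROUTES` table is a local stand-in for what would really be ordinary HTTP resolution of the URL. The glue translates our request into the customer's terms and their reply back into our format.

```python
def our_inventory(item_url: str) -> dict:
    # Stand-in for our own inventory service.
    return {"item": item_url, "on_hand": 12}


def acme_client(part_number: str) -> dict:
    # Stand-in for the customer's API, which uses its own field names.
    return {"PartNo": part_number, "QtyAvail": "7"}


def acme_inventory_glue(item_url: str) -> dict:
    """Call the customer's service and reshape the reply into our format."""
    part_number = item_url.rstrip("/").rsplit("/", 1)[-1]
    reply = acme_client(part_number)
    return {"item": item_url, "on_hand": int(reply["QtyAvail"])}


# Because identifiers are full URLs, the URL itself decides who answers.
ROUTES = {
    "https://inventory.example.com/items/": our_inventory,
    "https://glue.example.com/acme/items/": acme_inventory_glue,
}


def on_hand(item_url: str) -> int:
    for prefix, handler in ROUTES.items():
        if item_url.startswith(prefix):
            return handler(item_url)["on_hand"]
    raise LookupError(f"no service answers for {item_url}")
```

The caller of `on_hand` cannot tell whether the answer came from our service or the customer's, which is the property that makes the plug-in case workable.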

The components and glue pattern remains viable. As we decompose monoliths, it is a great way to achieve separation between services without undue burden on the front end applications and their developers.

Faceted Identities

I have a rich and multidimensional relationship with Amazon. It started back in 1996 or 1997, when it became the main supplier for my book addiction. As the years went by, I became an “Amazon Affiliate” in a futile attempt to balance out my cash flow with the company. Later, I started using AWS for cloud computing. I also claimed my author page.

Let’s contemplate the data architecture needed to maintain such a set of relationships. Let’s assume for the moment that Amazon were using a SQL RDBMS to hold it all. The obvious approach is something I could call the “Big Fat User Table”. One table, keyed by my secret, internal user ID, with columns for all the different possible things a user can be to Amazon. There would be a dozen columns for my affiliate status, a couple for my author page, a boolean to show I’ve signed up for AWS, and a bunch of booleans for each of the individual services.

Such a table would be an obvious bottleneck. Any DBA worth her salt would split that sucker into many tables, joined by a common key (the user ID). New services would then just add a table in their own database with the common user ID. Let’s call this approach the “Universal Identifier” design.

That would also allow one-to-many relations for some aspects. For example, when I lived in Minnesota, the state demanded that Amazon keep track of tax for each affiliate. Amazon responded by shutting down all the affiliate accounts in Minnesota. I recently moved to Florida and was able to open a new account with my new address. So I have two affiliate accounts attached to my user account.
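To make the Universal Identifier design concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names are my own assumptions, and a single in-memory database stands in for what would really be separate databases owned by separate services.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL
    );
    -- Each service adds its own table, keyed by the common user ID.
    CREATE TABLE affiliate_accounts (
        affiliate_id INTEGER PRIMARY KEY,
        user_id      INTEGER NOT NULL REFERENCES users(user_id),
        state        TEXT NOT NULL,
        active       INTEGER NOT NULL
    );
    """
)
conn.execute("INSERT INTO users VALUES (1, 'reader@example.com')")
# One user, two affiliate accounts: the one-to-many case described above.
conn.executemany(
    "INSERT INTO affiliate_accounts (user_id, state, active) VALUES (?, ?, ?)",
    [(1, "MN", 0), (1, "FL", 1)],
)
rows = conn.execute(
    "SELECT state, active FROM affiliate_accounts "
    "WHERE user_id = ? ORDER BY affiliate_id",
    (1,),
).fetchall()
```

Notice that the user ID is baked into the affiliate table itself.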

For what it’s worth, column family databases would kind of blur the lines between the Big Fat User Table and the Universal Identifier design.

We can get more flexible than the Universal Identifier, though.

You see, if we push the User ID into all the various services, that implies that the “things” that service manages can only be consumed by a User. Maneuverable architecture says we should be able to recompose services in novel configurations to solve business problems.

Instead of pushing the User ID into each service, we should just let each service create IDs for its “things” and return them to us.

For example, a Calendar Service should be willing to create a new calendar for anyone who asks. It doesn’t need to know the ID of the owner. Later, the owner can present the calendar ID as part of a request (usually somewhere in the URL) to add events, check dates, or delete the calendar. Likewise, a Ledger service should be willing to create a new ledger for any consumer, to be used for any purpose. It could be a user, a business, or a one-time special partnership. The calls could be coming from a long-lived application, a bit of script hooked to a URL, or curl in a bash script. Doesn’t matter.
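A rough sketch of such a Calendar Service might look like the following. The class, its method names, and the `calendar.example.com` URL are hypothetical, and an in-memory dictionary stands in for real storage. The point is only that no owner ID appears anywhere in the interface.

```python
import uuid


class CalendarService:
    """Creates calendars for whoever asks and never records an owner."""

    def __init__(self, base_url: str) -> None:
        self._base_url = base_url.rstrip("/")
        self._events: dict[str, list[str]] = {}

    def create_calendar(self) -> str:
        # The service mints the identifier (a full URL) and hands it back.
        calendar_url = f"{self._base_url}/calendars/{uuid.uuid4()}"
        self._events[calendar_url] = []
        return calendar_url

    def add_event(self, calendar_url: str, title: str) -> None:
        self._events[calendar_url].append(title)

    def events(self, calendar_url: str) -> list[str]:
        return list(self._events[calendar_url])

    def delete_calendar(self, calendar_url: str) -> None:
        del self._events[calendar_url]


service = CalendarService("https://calendar.example.com")
team_calendar = service.create_calendar()
service.add_event(team_calendar, "Launch review")
```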

If we’ve got all these services issuing identifiers, we need some way to stitch them back together. That’s where the faceted identities come in. If we start from a user and follow all the related “stuff” connected to that user, it looks a lot like a graph.

When a user logs in to the customer-facing application, that app is responsible for traversing the graph of identities, making requests to services, and assembling the response.
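As a sketch of that traversal, assume the application keeps its graph as a plain mapping from one identifier to the identifiers hanging off it. The URLs, the `FACETS` mapping, and the `fetch` callable (a stand-in for a real HTTP GET) are all invented here.

```python
from typing import Callable

# Edges held by the application, not by the services. All URLs are made up.
FACETS: dict[str, list[str]] = {
    "https://id.example.com/users/42": [
        "https://calendar.example.com/calendars/c1",
        "https://affiliates.example.com/accounts/a1",
    ],
    "https://affiliates.example.com/accounts/a1": [
        "https://ledger.example.com/ledgers/l1",
    ],
}


def assemble(root: str, fetch: Callable[[str], dict]) -> dict[str, dict]:
    """Walk the identity graph from root and fetch each facet once."""
    seen: set[str] = set()
    pending = [root]
    response: dict[str, dict] = {}
    while pending:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        if url != root:
            # The root is the principal itself; everything else is a facet.
            response[url] = fetch(url)
        pending.extend(FACETS.get(url, []))
    return response


page = assemble("https://id.example.com/users/42", lambda url: {"self": url})
```

The services never see this mapping. A different application could hold a different one over the same identifiers.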

I hope you aren’t surprised when I say that different applications may hold different graphs, with different principals as their roots. That goes along with the idea that there’s no privileged vantage point. Every application gets to act like the center of its own universe.

Going Meta

If you’ve been schooled in database design, this probably looks a little weird. I’m removing the join keys from the relational databases. (Some day soon I need to write a post addressing a common misconception: that “relational” databases got their name because they let you relate tables together.)

The key issue I’m aiming at is really about logical dependencies in the data. Foreign key relationships are a policy statement, not a law of nature. Policies change on short notice, so they should be among the most malleable constructs we have. By putting that policy in the bottommost layer of every application, we make it as hard as possible to change!

We can think of a hierarchy of “looseness” in relationships:

  • Two ideas, stored in one entity: As coupled as it gets. Neither idea can be used without the other. (An “entity” here can be a table or a linked data resource with a URL. It’s not about the storage, but about the required relationship.)
  • Two ideas, two entities, one-to-one: Still, both ideas must be used together.
  • Two ideas, two entities, one-to-one optional: Now we can at least decide whether the second item is needed with the first.
  • Two ideas, two entities, one-to-many: This admits that the second idea may come in different quantities than the first.
  • Two ideas, two entities, many-to-many: Much more flexible! Both ideas can be combined in differing quantities as needed. However, this still requires that these ideas are only used together with each other. In other words, if ideas X and Y have a many-to-many relationship, I don’t get to reuse idea X together with idea A.
  • Two ideas, externalized relationship: This is the heart of faceted identities. Ideas X and Y can be completely independent. Each can be used together by other applications.

Interface Segregation Principle

The “I” in SOLID stands for Interface Segregation Principle. It says that a client should only depend on an interface with the minimum set of methods it needs. An object may support a wide set of behavior, but if my object only needs three of those behaviors, then I should depend on an interface with precisely those three behaviors. (One hopes those three make sense together!)

This has an application when we use faceted identities as well. Sometimes we have a very nice separation where the facets don’t need to interact with each other, and only the application interacts with all of them. More often, though, we do need to pass an identifier from one kind of thing into another. That’s when the contract becomes important. If service Y requires a foreign identifier “X” to perform an action, then it needs to be clear about what it will do with “X”. It’s up to the calling application to ensure that the “X” it passes can perform those actions.
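An in-process analogy may help, with the caveat that real facets would be identified by URLs rather than Python objects. In this sketch, an invoicing function plays service Y and states the one thing it will do with its “X” through a `typing.Protocol`. The names `Chargeable`, `Ledger`, and `invoice` are hypothetical.

```python
from typing import Protocol


class Chargeable(Protocol):
    """Everything the invoicing code will ever ask of the 'X' it is given."""

    def post_charge(self, amount_cents: int, memo: str) -> None:
        ...


class Ledger:
    """Supports more behavior than invoicing needs."""

    def __init__(self) -> None:
        self.entries: list[tuple[int, str]] = []

    def post_charge(self, amount_cents: int, memo: str) -> None:
        self.entries.append((amount_cents, memo))

    def close_period(self) -> None:
        # Wider behavior that the invoicing code never depends on.
        self.entries.clear()


def invoice(target: Chargeable, amount_cents: int) -> None:
    # The contract is explicit: one charge gets posted, nothing else.
    target.post_charge(amount_cents, "monthly invoice")


ledger = Ledger()
invoice(ledger, 1500)
```

Anything that can accept a posted charge will do, whether or not it happens to be a ledger.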


Maneuverability is all about composing, recomposing, and combining services in novel configurations. One of the biggest impediments to that is relationships among entities. We want to make those as loose as possible by externalizing the relationships to another service. This allows entities to be used in new ways without coordinated change across services. Furthermore, it allows different applications to use different relationship graphs for their own purposes.

Inverted Ownership, Part 2

My last post on the subject of inverted ownership felt a bit abstract, so I thought I might illustrate it with a typical scenario.

In this first figure, we see a newly-extracted Catalog service, freshly factored out of the old monolithic application. It’s part of the company’s effort to become more maneuverable. We don’t know, or particularly care, what storage model it uses internally. From the outside, it presents an interface that looks like “SKUs have attributes”.

All seems well. It looks and smells like a microservice: independently deployable, released on its own schedule by a small autonomous team.

The problem is what you don’t see in the picture: context. This service has one “universe” of SKUs. It doesn’t serve catalogs. It serves one catalog. The problem becomes evident when we start asking what consumers of this service would want. If we think of the online storefront as the only consumer then it looks fine. Ask around a bit, though, and you’ll find other interested parties.

While IT toils to get down to a single source of record for product information, the wheelers and dealers in the business are out there signing up partners, inventing marketing campaigns, and looking into new lines of business. Pretty much all of those are going to screw around with the very idea of “the catalog”.

Maneuverability demands that we can combine and recombine our services in novel ways. What can we do with this catalog service that would let it be reused in ways that the dev team didn’t foresee?

Instancing might be one approach… multiple deployments from the same code base. High operational overhead, but it’s better than being stuck.

I prefer to make the context explicit instead.

Zero, One, Many

There’s an old saying that the only sensible numbers are zero, one, and infinity. One catalog isn’t enough, so the right number to support is “infinity.” (Or some resource-constrained approximation.)

What does it take? All we have to do is make the catalog service create catalogs for anyone who asks. Any consumer that needs a catalog can create one. That might be a big, sophisticated online storefront. But it could be someone using cURL to manually construct a small catalog for a one-off marketing effort. The catalog service shouldn’t care who wants the catalog or what purpose they are going to put it to.

Of course, this means that subsequent requests need to identify which catalog the item comes from. Good thing we’re already using URLs as our identifiers.


There are some practical issues (and maybe objections) to address.

First, does this mean that the SKUs are duplicated across all those catalogs? Not necessarily. We’re talking about the interface the service presents to consumers. It can do all kinds of deduplication internally. See my post about the immutable shopping cart for some ideas about deduplication and “natural” identifiers.
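One possible shape for this, sketched with invented names: every item URL is scoped to its catalog, while identical attribute sets are stored once under a content hash. The `CatalogService` class, its URL layout, and hashing canonical JSON are my assumptions for illustration, not a description of any particular implementation.

```python
import hashlib
import json
import uuid


class CatalogService:
    """Presents many catalogs; stores each distinct item body only once."""

    def __init__(self, base_url: str) -> None:
        self._base_url = base_url.rstrip("/")
        self._items: dict[str, dict] = {}  # content hash -> attributes
        self._catalogs: dict[str, dict[str, str]] = {}  # catalog URL -> {sku: hash}

    def create_catalog(self) -> str:
        catalog_url = f"{self._base_url}/catalogs/{uuid.uuid4()}"
        self._catalogs[catalog_url] = {}
        return catalog_url

    def put_item(self, catalog_url: str, sku: str, attributes: dict) -> str:
        # Hash a canonical serialization so equal attributes share storage.
        canonical = json.dumps(attributes, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(canonical).hexdigest()
        self._items.setdefault(digest, dict(attributes))
        self._catalogs[catalog_url][sku] = digest
        return f"{catalog_url}/skus/{sku}"

    def get_item(self, item_url: str) -> dict:
        catalog_url, _, sku = item_url.rpartition("/skus/")
        return dict(self._items[self._catalogs[catalog_url][sku]])

    def stored_item_count(self) -> int:
        return len(self._items)


catalogs = CatalogService("https://catalog.example.com")
storefront = catalogs.create_catalog()
promo = catalogs.create_catalog()
widget = {"name": "Blue Widget", "price_cents": 999}
storefront_item = catalogs.put_item(storefront, "W-1", widget)
promo_item = catalogs.put_item(promo, "W-1", widget)
```

Consumers see two independent items at two URLs. Internally there is one stored body.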

Second, and trickier, how do the SKUs get associated to the catalog? Does each microsite and service need to populate its own catalog? Can it just cherry-pick items from a “master” catalog?

You can probably guess that I don’t much like the idea of a “master” catalog. Instead, we would populate a newly-minted catalog by feeding it either item representations (serialized data in a well-known format) or better yet, hyperlinks that resolve to item representations.

How about this: make the service support HTML, RDFa, and a standardized microformat as a representation. Then you just feed your catalog service with URLs that point to HTML. Those can come from a catalog of your own, an internal app for cleansing data feeds, or even a partner or vendor’s web site. Now you’ve unified channel feeds, data import, and catalog creation.
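To give a feel for it, here is a deliberately simplified sketch that pulls item attributes out of HTML by reading RDFa-style `property` attributes with the standard library's `html.parser`. It is nowhere near a conforming RDFa processor (it ignores `vocab`, `typeof`, `content`, and nested markup), and `PropertyExtractor`, `load_item`, and the example URL are hypothetical.

```python
from html.parser import HTMLParser
from typing import Callable, Optional


class PropertyExtractor(HTMLParser):
    """Collects the text of elements that carry a `property` attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.properties: dict[str, str] = {}
        self._current: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("property")
        if name:
            self._current = name

    def handle_data(self, data):
        # Take the first non-blank text inside the annotated element.
        if self._current and data.strip():
            self.properties[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def load_item(url: str, fetch_html: Callable[[str], str]) -> dict[str, str]:
    """Resolve a hyperlink to an item representation and extract its attributes."""
    parser = PropertyExtractor()
    parser.feed(fetch_html(url))
    return parser.properties


def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP GET against a partner or vendor page.
    return (
        '<div typeof="Product">'
        '<span property="name">Blue Widget</span>'
        '<span property="sku">W-1</span>'
        "</div>"
    )


item = load_item("https://vendor.example.com/products/w-1", fake_fetch)
```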

Third, is it really true that just anyone can create a catalog? Doesn’t this open us up to denial-of-service attacks wherein someone could create billions of catalogs and goop up our database? My response is that we don’t ignore questions of authorization and permission, but we do separate those concerns. We can use proxies at trust boundaries to enforce permission and usage limits.
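A sketch of that separation, assuming a simple per-caller creation quota: the proxy counts and refuses, while the service behind it stays unaware of who is calling. `CreationProxy`, `QuotaExceeded`, and the caller names are invented, and a real proxy would also authenticate the caller and persist its counts.

```python
from collections import Counter
from typing import Callable


class QuotaExceeded(Exception):
    """Raised when a caller has used up its creation allowance."""


class CreationProxy:
    """Sits at a trust boundary and enforces a per-caller creation limit."""

    def __init__(self, create: Callable[[], str], limit_per_caller: int) -> None:
        self._create = create
        self._limit = limit_per_caller
        self._used: Counter = Counter()

    def create_for(self, caller: str) -> str:
        if self._used[caller] >= self._limit:
            raise QuotaExceeded(caller)
        self._used[caller] += 1
        # The wrapped service never learns who asked.
        return self._create()


made: list[str] = []


def create_catalog() -> str:
    """Stand-in for the real service's create operation."""
    made.append(f"https://catalog.example.com/catalogs/{len(made) + 1}")
    return made[-1]


proxy = CreationProxy(create_catalog, limit_per_caller=2)
first = proxy.create_for("partner-a")
second = proxy.create_for("partner-a")
```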


When you make the context explicit, you allow a service to support an arbitrary number of consumers. That includes consumers that don’t exist today and even ones you can’t predict. Each service then becomes a part that you can recombine in novel ways to meet future needs.