Microservices were always a technical solution to an organizational problem.
The Road More Traveled
Think back to the early 2010s and imagine yourself as a startup CEO. You have a vision for a better app or website and you’ve gotten a bunch of VC funding. The trouble is that anywhere between ten and a thousand other startups have very similar ideas. As with any Metcalfe’s law company, at most two of you will survive. First place wins 80% of the market. Second place gets 15%. Third place is you’re fired.
Your singular objective is to gain customers faster than your competitors.
You’ve got a CRO with incredible hockey-stick charts. Your CFO has a burndown chart and an “end by” date. If the hockey stick bends upward before the burndown crosses zero, you win. If not, you lose.
You also have a CTO telling you that the team is working as fast as they can, but there’s only so much code anyone can sling in a day, and his ship dates push the hockey stick way too far out to the right. The only possible response you have is “then get a bigger team”. The CTO says something about mythical months and pregnant women, which sounds both totally irrelevant and possibly sexist to you. The CTO – or the next occupant of that chair – will expand the dev team.
So now the CTO has a problem. Expand the team by 10, 20, or 100x and ship faster. But experience shows that adding people to a project slows it down due to communication overhead and coordination cost. The solution is to break the product into many smaller projects, each with its own two-pizza team.
Microservices allow horizontal scaling of a dev organization with sub-exponential coordination cost. Each team only needs to know about its local neighborhood of services. (This also reduces the other negative effect of rapid hiring: the handful of people that have global knowledge about the system are diluted and outnumbered 100 to 1.)
In other words, microservices are an attempt to balance two opposing forces: exponential slowdown from communication and coordination (C&C) versus linear speedup from parallel feature development. If the team gets the balance right and carves the service boundaries well, they can continuously ship small batches of functionality and A/B test your company into the unicorn club. If they get it wrong, they’ll spend all their time at the whiteboard and all your money on Splunk while your customers complain on Reddit.
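The exponential-versus-linear tension can be made concrete with a back-of-the-envelope sketch (my own illustration, with made-up numbers, not data from any real organization): pairwise communication channels in a single team grow as n(n-1)/2, while splitting into two-pizza teams that each coordinate only with a few neighboring teams keeps the total roughly linear.

```python
# Illustrative sketch of Brooks-style coordination cost (hypothetical numbers).

def channels(n: int) -> int:
    """Pairwise communication channels among n people: n*(n-1)/2."""
    return n * (n - 1) // 2

def partitioned_channels(people: int, team_size: int, neighbors: int) -> int:
    """Channels when people are split into small teams and each team
    coordinates only with a fixed number of neighboring teams."""
    teams = people // team_size
    intra = teams * channels(team_size)   # channels within each team
    inter = teams * neighbors // 2        # one channel per neighboring-team pair
    return intra + inter

monolith = channels(200)                  # one 200-person org
micro = partitioned_channels(200, 8, 4)   # 25 teams of 8, 4 neighbors each

print(monolith, micro)  # 19900 vs. 750
```

Same headcount, a 25x difference in coordination load — which is the whole bet behind carving the product into services with local neighborhoods.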
Sidebar - Microservices in Big Companies
Large companies also found benefits to carving ancient monoliths into microservices. In their case it was less about out-competing to win a market – though some of them needed to compete a lot better to retain their market than they had been. Instead they needed to break out of a web of cyclic technical dependencies and architectural decay that was slowing development to a crawl. Basically, many of them found that they were already on the wrong side of the exponential C&C slowdown and needed to get shipping again.
What’s Changed - Flavor 1
The same story about scaling out via microservices and dedicated two-pizza teams can be interpreted differently. From a critical perspective, the result is also organizational boundaries / managerial lines of control around architectural boundaries in the system. (It’s legally required to mention Conway’s Law and the Inverse Conway Maneuver here.) The trouble is that it’s much, much easier to change boundaries in software than in the organization.
In a monolith, nobody gets laid off when you delete a class.
With microservices, no team ever eliminates their own raison d’ĂȘtre and lays themselves off. Instead you’ll get a version two or version three that expands the functionality of their services.
Your startup from the 2010s now has six thousand services owned by hundreds of squads and nobody knows how it all works.
You mandate that every developer (and non-developer) must use AI coding agents for their work, seeking the speedup that you believe is possible. And it works! Pull request volume shoots way up. Lines of code modified is through the roof! Small features are delivered faster: copy edits, graphical tweaks, UX changes. They’re even being tested behind feature flags and experiments and agents are automatically removing changes that test badly.
(A side effect is that your customers never see the same app twice, and encounter paper-cut bugs on a daily basis. But you’ve got an AI chatbot responding to their complaints so you don’t hear their frustration yet.)
The trouble is that the big things aren’t moving any faster. New markets, new products, new cross-cutting features are still just as slow to produce because your overall architecture is still fragmented.
(This will come as no surprise to anyone who has read their Reinertsen, Kim & Spear, or Poppendieck.)
With AI agents you want to scale down your dev team, but the architecture was optimized for scaling out, not down. You need each developer, and their pod of AI agents, to own larger units of code, but the microservice boundaries are too small and fragmented.
Meanwhile, the constant news of large-scale layoffs has every developer scared for their current job and scared they won’t find another one. So absolutely nobody is going to raise their hand and say “I think my service shouldn’t exist any more.” (It seems more likely that you’ll have vicious turf wars and middle managers running annexation campaigns, all with AI-produced docs and decks beautifully justifying their maneuvers.)
The resulting tension is the next Seldon Crisis facing these companies.
What’s Changed - Flavor 2
It is now almost twenty years since the anti-SOA rebellion. It’s fair to call the microservice architecture the dominant architectural style. Startups today begin with a monolith, but once they find a degree of product/market fit, they look to rebuild for scale with microservices.
I don’t know what the next dominant style will be. Maybe we’ll call it “macroservices” or “megaservices”, though I doubt it. Neither of those words has the “cool factor” that will help consultancies sell services.
I can say that we need to find a new way to draw the boundaries. Here are some of the forces I see that will affect how we do that:
- It has to be much, much safer to ship code than it is today. That’s a simple consequence of the risk equation: \[ \text{Expected Loss} = \sum_{i=1}^{n} \left( P(\text{loss}_i) \times \text{Opportunities}_i \times \mathbb{E}[\text{loss per opportunity}_i] \right) \] We rely on automation in our CI pipelines: unit tests, security scans, some performance testing. We also rely on experimentation and feature flags. The trouble is that AI agents are just as likely to rewrite or delete the tests, or to “reinterpret” experimentation results, in order to reach the goals we give them.
- Governance mechanisms must be rebuilt to deal with the scale of changes. Security scanning tools already report far too many false positives at the human-driven rate of change. Domain name governance isn’t sufficiently automated at most companies. Data governance is partially automated, partially human-driven for classification. Systems for data subject rights are cobbled together. As it is, few of the technical staff care about data subject rights and none of the non-technical staff pushing code are worried about it. I predict ever more frequent small-scale data breaches of the stupid “DB endpoint was left public” variety. (In this post I’ve only been talking about AI-driven development, not about agent-in-the-loop systems, but data governance for those is in an impossible double-bind. Everybody wants their agent to have access to all the data, but the agents have no guardrails at all. They’ll readily ship all your private customer data off to a free-tier Firebase if you let them.)
- Access control based on tokens and keys seems fundamentally unfixable. Leaked keys in externally hosted repositories are exploited in minutes and can cost a fortune before they’re discovered. So far, there don’t seem to be guardrails immune to agentic bypass. There are too many external platforms that agents are trained on, but your company’s bespoke platform is in nobody’s corpus. Any non-technical user’s agent can spin up a fresh Vercel account and your pipeline’s checks cannot stop it.
- Each developer must own larger units of code. I have previously valued collective code ownership. However, that’s a principle for spreading knowledge among humans, with the intent of diffusing knowledge faster than the code itself changes. That now appears hopeless. The change rate is too fast. Not even one human fully understands the code they’re accountable for.
- Another reason for larger ownership scope: Merge conflicts are a big issue again. Agents will happily rewrite code that is working fine. Two devs with their swarms create massive merge conflicts. Right now this looks like even more productivity since resolving those conflicts spends tokens and boosts LOC changed metrics. That activity doesn’t create value though.
- We need validation beyond test code in the same repository. Constant rewriting by the agents causes regressions. Agents will “fake it” by changing the test instead of fixing the code. If you try to nail all existing behavior in place with vast suites of unit tests, then you’ll spend all your tokens on reading and updating test code instead of production code. More tokenmaxxing without value.
- The specification of APIs a service offers must be externalized. When changing code, agents don’t value the principle of keeping your interface stable. They’ll change an API that other services depend on. You might catch this in CI, if you’ve built good compatibility checking. It’s better to put the API specification someplace separate and confine the coding agents so they aren’t allowed to mutate it during normal development. (A sub-principle here is that accretive interfaces are more important than ever, so that unilateral change in the specification – by a different agent – is possible.)
- We need to think about team topologies at two different scales: within the “pod” of the developer and agents, and between pods. This will determine how fast you ship large-scale (cross-pod) features more than PR rates or LOC changes will.
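One way to make an externalized API specification enforceable is a CI check that permits only accretive changes – new endpoints and new fields – and rejects removals. This is a minimal sketch under assumed conventions: the spec format here is a hypothetical dict of endpoint to field names, standing in for whatever real schema format (OpenAPI, protobuf, etc.) a team would actually use.

```python
# Minimal accretive-compatibility check against an externalized API spec.
# The spec format (endpoint -> set of field names) is hypothetical; a real
# pipeline would load schema documents from a protected spec repository.

def breaking_changes(old: dict, new: dict) -> list[str]:
    """Return violations: anything present in the old spec but missing
    from the new one. Additions in `new` pass silently (accretion is fine)."""
    problems = []
    for endpoint, old_fields in old.items():
        if endpoint not in new:
            problems.append(f"endpoint removed: {endpoint}")
            continue
        for field in sorted(old_fields - new[endpoint]):
            problems.append(f"field removed from {endpoint}: {field}")
    return problems

old_spec = {"/orders": {"id", "total"}, "/users": {"id", "email"}}
new_spec = {"/orders": {"id", "total", "currency"}}  # an agent deleted /users

for problem in breaking_changes(old_spec, new_spec):
    print(problem)  # prints: endpoint removed: /users
```

In CI, the old spec would come from the protected specification repo and the new one from the service’s generated schema; the coding agents get no write access to the former, so the check can’t be “fixed” by mutating the baseline.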
Summing Up
Every radically new technology has caused a shift in the dominant architecture, and often in the languages and platforms, that we employ. The new architecture will look obvious in hindsight, but what doesn’t? It will hit the right balance of adjacency to existing technology, re-balancing of forces in tension, and mass appeal to become the next “hot topic” of books, conference talks, consultancies, etc. That doesn’t mean it will be the ideal approach… every new solution has within it the seeds of the next problems. So whatever the successor architecture becomes, it will create new niches for tooling, supporting systems, languages, frameworks, and the like.
I’ve tried to avoid cynicism about either the current state of affairs or the future. But I will confess that “doing things right” has never seemed less valued than it does now. In the push to have everybody shipping code, all the time, I worry that we are accumulating mountains of invisible coupling, fragile dependencies, uncontrolled production environments … it adds up to a lot of technical risk. We do not (yet) have platforms or processes to manage that risk. When the bill comes due, the blame will fall on the front line staff not the ones who set up the incentives and reaped the rewards.
AI disclosure: All content human crafted, except for getting the LaTeX math syntax right.