Wide Awake Developers

Some Useful Techniques From Bygone Eras


I find the old object-oriented design technique of CRC Cards to be useful when defining service designs. CRC is short for “Class, Responsibilities, Collaborators.” It’s a way to define what behavior belongs inside a service and what behavior it should delegate to other services.

Simulating a system via CRC is a good exercise for a group. Each person takes a CRC card and plays the role of that service. A request starts from outside the services and enters with one person. They can do only those things written down as their “responsibilities.” For anything else, they must send a message to someone else.

Personifying and role-playing really helps identify gaps in service design. You’ll find gaps where data or entire services are needed but don’t exist.
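A minimal sketch of the same exercise in code, assuming each CRC card is written down as plain data. The service names and responsibilities here are invented for illustration. The two functions look for the gaps the role-play tends to surface: collaborators nobody has a card for, and requests nobody owns.

```clojure
;; Hypothetical CRC cards as plain data. All names are illustrative.
(def cards
  {:cart     {:responsibilities #{:add-item :price-cart}
              :collaborators    #{:catalog :pricing}}
   :catalog  {:responsibilities #{:describe-item}
              :collaborators    #{}}
   :checkout {:responsibilities #{:place-order}
              :collaborators    #{:cart :payments}}})

(defn missing-collaborators
  "Returns the collaborators that some card names but that have no card
  of their own."
  [cards]
  (let [named (into #{} (mapcat :collaborators) (vals cards))]
    (into #{} (remove (set (keys cards))) named)))

(defn who-handles
  "Returns the set of services that list `responsibility` on their card."
  [cards responsibility]
  (into #{}
        (keep (fn [[service card]]
                (when (contains? (:responsibilities card) responsibility)
                  service)))
        cards))

(missing-collaborators cards) ; :pricing and :payments have no card yet
(who-handles cards :refund)   ; empty set: nobody owns refunds
```

An empty result from who-handles is the code equivalent of everyone at the table looking at each other.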

Tell, Don’t Ask

The more services you have, the more operational complexity you take on. Some patterns of service design seem to encourage high coupling and a high fan-in to a small number of critical services. (Usually, these are “entity” services… i.e., CRUDdy REST over a database table.)

Instead, I find it better to tell services what you want them to do. Don’t ask them for information, make a decision, then change some state.

Organizing around Tell, Don’t Ask leads you to design services around behavior instead of data. You’ll probably find you denormalize your data to make T,DA work. That’s OK. The runtime benefit of cleaner coupling will be worth it.
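To make the contrast concrete, here is a small sketch that uses an in-memory atom as a stand-in for a service's state. The inventory example and the function names are assumptions, not taken from any real system.

```clojure
;; Hypothetical inventory state standing in for a service.
(def stock (atom {:widget 3}))

;; Ask style: the caller reads state, makes the decision, then writes.
;; The rule lives outside the service, and another caller can change the
;; stock between the read and the write.
(defn reserve-by-asking! [item qty]
  (let [available (get @stock item 0)]                ; ask
    (if (>= available qty)                            ; decide
      (do (swap! stock assoc item (- available qty))  ; change state
          :reserved)
      :rejected)))

;; Tell style: the caller states what it wants. The service owns the rule
;; and applies it atomically.
(defn reserve! [item qty]
  (let [[before _] (swap-vals! stock
                               (fn [s]
                                 (let [available (get s item 0)]
                                   (if (>= available qty)
                                     (assoc s item (- available qty))
                                     s))))]
    ;; `before` is the state the successful update was computed from.
    (if (>= (get before item 0) qty) :reserved :rejected)))
```

With the tell style, the caller never sees the stock count at all, so it cannot grow a dependency on how that count is stored.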

Data Flow Diagrams

If you ask someone who isn’t trained in UML to draw a system’s architecture, they will often draw something close to a Data Flow Diagram. This diagram shows data repositories and the transformation processes that populate them. DFDs are a very helpful tool because they force you to ask a few really key questions:

  1. Where did that information come from?
  2. How did it get there?
  3. Who updates it?
  4. Who uses the data we produce?

In particular, answering that last question forces you to think about whether you’re producing the right data for the downstream consumer.
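The same questions can be asked of a diagram written down as data. This is only a sketch: the flow names are invented, and a real DFD would carry more detail than a list of edges.

```clojure
;; A hypothetical data flow diagram as data. Each edge says which process
;; or store hands which data to which other one.
(def flows
  [{:from :order-form   :to :order-intake :data :raw-order}
   {:from :order-intake :to :orders-db    :data :validated-order}
   {:from :orders-db    :to :billing      :data :validated-order}
   {:from :orders-db    :to :reporting    :data :validated-order}])

(defn sources-of
  "Where did the information in `node` come from, and who updates it?"
  [flows node]
  (set (map :from (filter #(= node (:to %)) flows))))

(defn consumers-of
  "Who uses the data that `node` produces?"
  [flows node]
  (set (map :to (filter #(= node (:from %)) flows))))

(sources-of flows :orders-db)   ; only :order-intake writes here
(consumers-of flows :orders-db) ; :billing and :reporting read from it
```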

Generalized Minimalism

My daily language is Clojure. One of the joys of working in Clojure is its great core library. The core library has a wealth of functions that apply broadly across data structures. A typical function looks like this:

(defn nthnext
  "Returns the nth next of coll, (seq coll) when n is 0."
  {:added "1.0"
   :static true}
  [coll n]
    (loop [n n xs (seq coll)]
      (if (and xs (pos? n))
        (recur (dec n) (next xs))
        xs)))

I want to call your attention to two specific forms. The “seq” function works on any “Seqable” collection type. (N.B.: It has special cases for other types, including some to make Java interop more pleasant. But the core behavior is about Seqable.) The “next” function is similar: it works on anything that already is a Seq or anything that can be made into a Seq.

This provides a nice degree of abstraction and through that, generality.

Pretty much all of the core data types either implement ISeq or Seqable. That means I can call “seq”, “next”, and “nthnext” on any of them. Other data types can be brought into the fold by extending one of those interfaces to them. We extend the data to meet the core functions, instead of overloading functions for data types.
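As a small illustration of extending data to meet the core functions, the sketch below defines a made-up type, Countdown, that implements Seqable. Nothing about nthnext, reduce, or map has to change for them to accept it.

```clojure
;; Countdown is a hypothetical type, used only to show the mechanism.
;; It implements Seqable, so core functions that call `seq` accept it.
(deftype Countdown [n]
  clojure.lang.Seqable
  (seq [_] (seq (range n 0 -1))))

(nthnext (Countdown. 5) 2)   ; => (3 2 1)
(reduce + 0 (Countdown. 4))  ; => 10
(map inc (Countdown. 3))     ; => (4 3 2)
```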

YAGNI Isn’t About Being Specific

Under this approach, writing a general function is both simpler and easier than writing a specific one.

For example, suppose I need to do that classic example of trivial functionality: summing a list of integers. The most natural way for me to write that is like this:

(reduce + 0 xs)

That is both simple and general. But it doesn’t meet the spec I stated! It sums any numeric type, not just integers. If I decide that I really must restrict it to integers, I have to add code.

(assert (every? integer? xs))
(reduce + 0 xs)

This is a pattern I find pretty often when working in Clojure. When I generalize, I do it by removing special cases. This goes hand-in-hand with decomposing behavior into smaller and smaller units. As each unit gets smaller, I find it can be more general.

Here’s a less trivial example. Today, I’m working on a library we call Vase. (See Paul deGrandis’ talk on data-driven systems for more about Vase.) In particular, I’m updating it to work with a new routing syntax in Pedestal. With the new routing syntax, we can build routes from ordinary Clojure data… no more need for oddly-placed syntax-quoting.

One of the core concepts in Pedestal is the “interceptor”. They fulfill the same role as middleware in Ring. (One difference: interceptors are data structures that contain functions. Interceptors compose by making a vector of data, whereas Ring middleware composes by creating function closures. I find it easier to debug a stack of data than a stack of opaque closures.) Any particular route in Pedestal will have a list of interceptors that apply to that route.

When a service that uses Pedestal supplies interceptors, it composes a list of them. Suppose I want to make a convenience function that helps application developers build up that list. What would I need to do?

You probably already figured out that any such “convenience” functions I could create would basically duplicate core functions, but with added restrictions. Instead of “cons”, “conj”, “take”, and “drop”, I’d have to create “icons”, “iconj”, “itake”, and “idrop”. What a waste.

I have to ask myself, “Do I need some special behavior here?” And the answer is “YAGNI.”
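A sketch of what that looks like in practice, with interceptors written as plain maps. The :name and :enter keys follow Pedestal's convention, but these particular interceptors are invented and the example does not load Pedestal.

```clojure
;; Made-up interceptors as plain maps. :enter is a no-op here.
(def log-request  {:name :log-request  :enter identity})
(def authenticate {:name :authenticate :enter identity})
(def load-user    {:name :load-user    :enter identity})

(def common [log-request authenticate])

;; Ordinary core functions do the job. No icons, iconj, itake, or idrop.
(def with-user       (conj common load-user))
(def without-logging (vec (drop 1 with-user)))

(map :name with-user)        ; => (:log-request :authenticate :load-user)
(map :name without-logging)  ; => (:authenticate :load-user)
```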

YAGNI Is About Adding “Stuff”

YAGNI is commonly understood to mean “don’t generalize until you need to.” In some languages and libraries, I suppose that’s the right read. In my world, though, it is specializing that requires adding stuff. So I often call YAGNI if someone tries to make a thing less general than it could be.

Small functions that operate on abstractions instead of concrete types are both general and simple.

Redeeming the Original Sin

While reading Bryan Cantrill’s slides from Papers We Love NYC, I was struck by something. One of the very first slides says:

The traditional UNIX security model is simple but inexpressive.

The papers go on to describe a progression of techniques to isolate processes from the host environment to greater and greater degrees. It began with the ancient precursor ‘chroot’ and continued through Jails and Zones. Each builds upon the previous work to improve the degree of isolation.

We’ve seen a parallel series of efforts in the Linux realm with virtual machines and containers.


All of these are introduced to restore the degree of isolation and resource control that was originally present in mainframe operating systems. Furthermore, it was the model that Multics was meant to supply.

Unix started with a simplified security model, meant for single user machines. It was “dumbed down” enough to be easy to implement on the limited machines of the day.

Zones, VMs, containers… they’re all ways to redeem Unix from its original sin. Maybe what we should look at is a better operating system?

What’s Lost With a DevOps Team

Please understand, dear Reader, that I write this with positive intention. I’m not here to impugn any person or organization. I want to talk about some decisions and their natural consequences. These consequences seem negative to me and after reading this post you may agree.

When an established company faces a technology innovation, it often creates a new team to adopt and exploit that innovation. During my career, I’ve seen this pattern play out with microcomputers, client/server architecture, open systems, web development, agile development, cloud architecture, NoSQL, and DevOps. Perhaps we can explore the pros and cons of that overall approach in some other post. For now, I want to specifically address the DevOps team.

A DevOps team gets created as an intermediary between development and operations. This is especially likely when dev and ops report through different management chains. That is to say, in a functionally-oriented structure. In a product-oriented structure, it is less likely.

This intermediary team gets tasked with automating releases and deployments. They are the ones to adopt some code-as-configuration platform. Sometimes they are also tasked with building an internal platform-as-a-service, but that more often falls to the infrastructure and operations teams.

So the devops team has development as their customer. Operations has the devops team as their customer. Work flows from development, through the tools created by the devops team, and into production. It would seem to capture the benefits of automation: it becomes predictable, repeatable, and safe.

All of that is true. However, even though this is an improvement, it misses out on even greater improvements that could be realized.

The key problem is the unclosed feedback loop. When developers are directly exposed to production operations, they learn. Sometimes they learn from negative feedback: getting woken up for support calls, debugging performance problems, or that horrible icy feeling in your stomach when you realize that you just shut down the wrong database in production.

With a DevOps team sitting between development and operations, the operations team remains in the “learning position.” But they lack the ability to directly improve the systems. Suppose a log message is ambiguous. If the operator who sees it can’t directly change the source code, then the message will never get corrected. (It’s important, but small… exactly the thing least likely to be worth filing a change request for.)

Over longer time spans, the things we learn from production should influence the entire architecture: from technology choices to code patterns and common libraries. A DevOps team sitting between development and operations impedes that learning.

DevOps is meant to be a style of interaction: direct collaboration between development and operations. A team in between that automates things is a tools team. It’s OK to call it a tools team. Tools are a good thing, despite what corporate budgeting seems to say these days.

Instead of creating a flow from development to DevOps to operations, consider putting development, tools, and operations all together and giving them the same goals. They should be collaborators working shoulder-to-shoulder rather than work stations in a software factory.

Give Them the Button!

Here’s a syllogism for you:

  • Every technical review process is a queue
  • Queues are evil
  • Therefore, every review process is evil

Nobody likes a review process. Teams who have to go through the review look for any way to dodge it. The reviewers inevitably delegate the task downward and downward.

The only reason we ever create a review process is because we think someone else is going to feed us a bunch of garbage. They get created like this:

It starts when someone breaks a thing that they can’t or aren’t allowed to fix. The responsibility for repair goes to a different person or group. That party shoulders both responsibility for fixing the thing and also blame for allowing it to get screwed up in the first place.

(This is an unclosed feedback loop, but it is very common. Got a separate development and operations group? Got a separate DBA group from development or operations? Got a security team?)

As a followup, to ensure “THIS MUST NEVER HAPPEN AGAIN” the responsible party imposes a review process.

Most of the time, the review process succeeds at preventing the same kind of failure from recurring. The resulting dynamic looks like this:

The hidden cost is the time lost. Every time that review process has to go off, the creator must prepare secondary artifacts: some kind of submission to get on the calendar, a briefing, maybe even a presentation. All of these are non-value-adding to the end customer. Muda. Then there’s the delay on the review meeting or email itself. Consider that there is usually not just one review but several needed to get a major release out the door and you can see how release cycles start to stretch out and out.

Is there a way we can get the benefit of the review process without incurring the waste?

Would I be asking the question if I didn’t have an answer?

The key is to think about what the reviewer actually does. There are two possibilities:

  1. It’s purely a paperwork process. I’ll automate this away with a script that makes a PDF and automatically emails it to whoever needs it. Done.
  2. The reviewer applies knowledge and experience to look for harmful situations.

Let’s talk mostly about the latter case. A lot of our technology has land mines. Sometimes that is because we have very general purpose tools available. Sometimes we use them in ways that would be OK in a different situation but fail in the current one. Indexing an RDBMS schema is a perfect example of this.

Sometimes, it’s also because the creators just lack some experience or education. Or the technology just has giant, truck-sized holes in it.

Whatever the reason, we expect that the reviewer is adding intelligence, like so:

This benefits the system, but it could be much better. Let’s look at some of the downsides:

  • Throughput is limited to the reviewer’s bandwidth. If they truly have a lot of knowledge and experience, then they won’t have much bandwidth. They’ll be needed elsewhere to solve problems.
  • The creator learns from the review meetings… by getting dinged for everything wrong. Not a rewarding process.
  • It is vulnerable to the reviewer’s availability and presence.

I’d much rather see the reviewer codify that knowledge by building it into automation. Make the automation enforce the practices and standards. Make it smart enough to help the creator stay out of trouble. Better still, make it smart enough to help the creator solve problems successfully instead of just rejecting low quality inputs.

With this structure, you get much more leverage from the responsible party. Their knowledge gets applied across every invocation of the process. Because the feedback is immediate, the creator can learn much faster. This is how you build organizational knowledge.

Some technology is not amenable to this kind of automation. For example, parsing some developer’s DDL to figure out whether they’ve indexed things properly is a massive undertaking. To me, that’s a sufficient reason to either change how you use the technology or just change technology. With the DDL, you could move to a declarative framework for database changes (e.g., Liquibase). Or you could use virtualization to spin up a test database, apply the change, and see how it performs.

Or you can move to a database where the schema is itself data, available for query and inspection with ordinary program logic.
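When the schema is data, one piece of reviewer knowledge can become a few lines of ordinary code. The sketch below assumes a made-up schema description. The map shape with :columns, :indexes, and :foreign-keys is an illustration, not any real database's catalog format, and the rule it encodes is only one example of what a reviewer might check.

```clojure
;; A hypothetical schema description as plain data.
(def schema
  {:orders      {:columns      #{:id :customer-id :placed-at}
                 :indexes      #{:id}
                 :foreign-keys #{:customer-id}}
   :order-lines {:columns      #{:id :order-id :sku}
                 :indexes      #{:id :order-id}
                 :foreign-keys #{:order-id}}})

(defn unindexed-foreign-keys
  "Codifies one review rule: every foreign key column should be indexed.
  Returns a seq of [table column] pairs that break the rule."
  [schema]
  (for [[table {:keys [indexes foreign-keys]}] schema
        fk (sort foreign-keys)
        :when (not (contains? indexes fk))]
    [table fk]))

(unindexed-foreign-keys schema) ; => ([:orders :customer-id])
```

Run on every change, a check like this gives the creator the reviewer's feedback immediately instead of at the next review meeting.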

The automation may not be able to cover 100% of the cases in general-purpose programming. That’s why local context is important. As long as there is at least one way to solve the problem that works with the local infrastructure and automation, then the problem can be solved. In other words, we can constrain our languages and tools to fit the automation, too.

Finally, there may be a need for an exception process, where the automation can’t decide whether something is viable or not. That’s a great time to get the responsible party involved. That review will actually add value because every party involved will learn. Afterward, the RP may improve the automation or may even improve the target system itself.

After all, with all the time that you’re not spending in pointless reviews, you have to find something to do with yourself.

Happy queue hunting!

C9D9 on Architecture for Continuous Delivery

Every single person I’ve heard talk about Continuous Delivery says you have to change your system’s architecture to succeed with it. Despite that, we keep seeing “lift and shift” efforts. So I was happy to be invited to join a panel to discuss architecture for Continuous Delivery. We had an online discussion last Tuesday on the C9D9 series, hosted by Electric Cloud.

They made the recording available immediately after the panel, along with a shiny new embed code.

Best of all, they supplied a transcript, so I can share some excerpts here. (Lightly edited for grammar, since I have relatives who are editors and I must face them with my head held high.)

Pipeline Orchestration

It’s easy to focus on the pipeline as the thing that delivers code into production. But I want to talk about two other central roles that it plays. One, with regard to risk management. To me the pipeline is not so much about ushering code out to production, but it’s about finding every opportunity to reject a harmful change, or a bad change, before letting it get into production. So I view the pipeline as an essential part of risk management.

I’ve also had a lot of lean training, so I’d look on the deployment pipeline as the value stream that developers use to deliver value to their customers. In that respect we need to think about the pipeline as production-grade infrastructure, and we need to treat it with production-like SLAs.
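Seen as risk management, a pipeline is a sequence of chances to say no. A minimal sketch of that idea, with invented stage names and checks, where each stage returns nil to pass a change along or a reason to reject it:

```clojure
;; Hypothetical stages. Each check returns nil to pass or a reason to reject.
(def stages
  [[:unit-tests   (fn [change] (when-not (:tests-pass? change) "unit tests failed"))]
   [:static-check (fn [change] (when (:uses-banned-api? change) "banned API call"))]
   [:smoke-test   (fn [change] (when-not (:boots? change) "service did not start"))]])

(defn run-pipeline
  "Runs the stages in order and stops at the first rejection.
  Returns {:status :released} if no stage rejects the change."
  [stages change]
  (or (some (fn [[stage check]]
              (when-let [reason (check change)]
                {:status :rejected :stage stage :reason reason}))
            stages)
      {:status :released}))

(run-pipeline stages {:tests-pass? true :uses-banned-api? true :boots? true})
;; => {:status :rejected, :stage :static-check, :reason "banned API call"}
```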

Cattle, Not Pets

I think a lot has been said about “cattle versus pets” over the last ten years or so. I just want to add one thing - the real challenge is identity. There are a ton of systems and frameworks that implicitly assume stable identity on machines. Particularly a lot of distributed software toolkits. When you do have the cattle model, a machine identity may disappear and never come back again. I just really hope you’re not building up a queue of undelivered messages for that machine.

Service Orientation and Decoupling

Having teams running in parallel and being able to develop more or less independently - I talk about team scale autonomy. But if there are very long builds, large artifacts and large number of artifacts, I regard that as the consequence of using languages and tools that are early bound and early linked. I don’t think it’s any accident that the people I heard of first doing continuous delivery were using PHP. You can regard each PHP file as its own deployable artifact, and so things move very quickly. If everything we wrote was extremely late bound, then our deployment would be an rsync command. So to an extent, breaking things down into services is a response to large artifacts, long build times, that’s one side of that.

The other side is team scale autonomy and the fact that you can’t beat Conway’s Law and that absolutely holds true. (Conway’s Law: an organization is constrained to produce software that recapitulates the structure of the organization itself. If you have four teams working on a compiler, you’re going to have a four pass compiler.)

Now, when we talk about decoupling, I need to talk about two different types of decoupling, both important.

The bigger your team gets, the more communication overhead goes up. We have known that since the 1960s, so breaking that down makes sense. But then we have to recompose things at runtime and that’s when coupling becomes a big issue. Operational coupling happens minute by minute by minute. If I have service A calling service B, service B goes down, I have to have some response. If I don’t do anything else, service A is also going to go down. So I need to build in some mechanisms to provide operational decoupling, maybe that’s a cache, maybe it’s timeouts, maybe it’s a circuit breaker, something along those lines, to protect one service from the failure of another service.

It’s not just the failure of the service! A deployment to the other service looks exactly like a failure from the perspective of the consumer. It’s simply not responding to requests within an acceptable time.

So we have to pay attention to the operational decoupling.
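Two of those mechanisms can be sketched in a few lines. This is a simplified illustration and not a production library: the function names are assumptions, and the breaker shown here stays open once tripped, where a real one would retry after a cool-down period.

```clojure
(defn call-with-timeout
  "Runs f on another thread. Returns its result, or `fallback` if f throws
  or has not finished within timeout-ms."
  [f timeout-ms fallback]
  (let [failed (Object.)   ; private marker for 'no usable result'
        fut    (future (try (f) (catch Exception _ failed)))
        result (deref fut timeout-ms failed)]
    (if (identical? failed result)
      (do (future-cancel fut) fallback)
      result)))

(defn make-breaker
  "Wraps f. After max-failures consecutive failures, the returned function
  stops calling f and answers with `fallback` immediately."
  [f max-failures fallback]
  (let [failures (atom 0)]
    (fn [& args]
      (if (>= @failures max-failures)
        fallback
        (try
          (let [result (apply f args)]
            (reset! failures 0)
            result)
          (catch Exception _
            (swap! failures inc)
            fallback))))))
```

In both cases service A keeps answering, with a cached or default value, while service B is down or in the middle of a deployment.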

Semantic coupling is even more insidious, and that’s what plays out over a span of months and years. We talk about API versioning quite a bit, but there are other kinds of semantic coupling that creep in. I’ve been harping a lot lately about identifiers. If I have to pass an itemID to another system then I’m sort of implicitly saying there is one universe of itemIDs and that system has them all, and I can only talk to that system for items with those IDs.

Similarly with many services that we create, we create the service as though there is one instance of the service. We’d be better off creating the code that can instantiate that service many times for many consumers. So if you create a calendar service, don’t make one calendar that everyone has eventIDs on. Make a calendar service where we can ask for a new calendar and it gives you back a URL for a whole new calendar that is yours and only yours. This is the way you would build it if you were building a SaaS business. That’s how you would need to think about the decoupled services internally.
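A rough sketch of the calendar example, using an in-memory atom in place of a real service. The URL scheme and function names are hypothetical.

```clojure
;; In-memory stand-in for a calendar service.
(def calendars (atom {}))

(defn new-calendar!
  "Creates an empty calendar and returns its URL. Every caller gets a
  calendar of its own."
  []
  (let [url (str "/calendars/" (java.util.UUID/randomUUID))]
    (swap! calendars assoc url [])
    url))

(defn add-event!
  "Appends `event` to the calendar at `url`. Returns the event's URL, which
  only means something inside that calendar."
  [url event]
  {:pre [(contains? @calendars url)]}
  (let [events (get (swap! calendars update url conj event) url)]
    (str url "/events/" (dec (count events)))))
```

Event identifiers are scoped to the calendar that issued them, so no consumer has to assume a single universe of eventIDs.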

Messaging and Data Management

If I’m truly deploying continuously then I’ve got version N and version N+1 running against the same data source. So I need some way to accommodate that. In older less-flexible kinds of databases, that means triggers, shims, extra views, that kind of scaffolding.

I heard a great story, I think it’s from Pinterest at Velocity a couple of years back. They had started with a monolithic user database and found they needed to split the table. After they already had 60 million users! But they were able to make many small deployments that each added kind of one step for an incremental migration. And once they got that in place, they let it sit for three months, at the end of that they found who was left and did a batch migration of those. Then they did a series of incremental deployments to remove the extra data management stuff.

So it’s one of those cases - doing continuous delivery both necessitates that you’re more sophisticated about your data changes, but it also gives you new tools to accomplish those changes.
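The shape of that kind of migration can be sketched with an atom standing in for the data store. The record shapes are assumptions made up for the example: version N stores a single :name, and version N+1 stores :first-name and :last-name. Records are upgraded when they are read, and a batch job picks up whoever is left.

```clojure
(require '[clojure.string :as str])

(defn migrate-user
  "Upgrades an old-shape record. New-shape records pass through unchanged,
  so this is safe to apply more than once."
  [user]
  (if (contains? user :name)
    (let [[first-name last-name] (str/split (:name user) #" " 2)]
      (-> user
          (dissoc :name)
          (assoc :first-name first-name :last-name last-name)))
    user))

(defn read-user!
  "Incremental step: migrates the record as a side effect of reading it."
  [db id]
  (when (contains? @db id)
    (get (swap! db update id migrate-user) id)))

(defn migrate-all!
  "Batch step for the stragglers nobody has read yet."
  [db]
  (swap! db (fn [m] (into {} (map (fn [[k v]] [k (migrate-user v)])) m))))
```

Once every record is in the new shape, the old-shape branch in migrate-user can be removed in a later deployment.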

There is a wide crop of databases that don’t require that kind of care and feeding when you make deployments. If you are truly architecting for operational ease and delivery, then that might be a sufficient reason to choose one of the newer databases over one of the less flexible relational stores.


The C9D9 discussion was quite enjoyable. The hosts ran the panel well, and even though all of us are pretty long-winded, nobody was able to filibuster. I’ll be happy to join them again for another discussion some time.

Software Eats the World

During this morning’s drive, I crossed several small overpasses. It reminded me that the American Society of Civil Engineers rated more than 20% of our bridges as structurally deficient or functionally obsolete. That got me to thinking about how we even know how many bridges there are in a country as large as the U.S.

Some time in the past, it would have required an army of people to go survey all the roads, looking for bridges and adding them to a ledger. Now, I’m sure it’s a query in a geographical database. The information had to be entered at least once, but now that it’s in the database we don’t need people to go wandering about with clicker counters.

Instead of clipboards and paper, the bridge survey needed data import from thousands of state and county GIS databases. That means coders to write the import jobs and DBAs to set up the target systems. It needed queries to count up the bridges and cross-check with inspection reports. So that requires more coders and maybe some UX designers for data visualization.

Back in 2011, Marc Andreessen said “software is eating the world”. There’s no reason to think that’s going to slow down soon. And as software eats the world, work becomes tech work.

Microservices Versus Lean

Back in April, I had the good fortune to speak at Craft Conf in lovely Budapest. It’s a fantastic conference that I would recommend.

During that conference, Randy Shoup talked about his experience migrating from monoliths to microservices at eBay and Google. David, one of the audience members, asked an interesting question at the end of Randy’s talk. (I’m sorry that I didn’t get the full name of the questioner… if you are reading this, please leave a comment to let me know who you are.)

“Isn’t the concept of microservices contradictory with the lean/agile principles of a) collective code ownership, and b) optimizing whole processes and systems instead of small units?”

Randy already did a great job of responding to the first part of that question, so please view the video to hear his answer there. He didn’t have time to respond to the second part so I don’t know what his answer would be, but I will tell you mine.

Start From The “Why”

Let’s start by answering the question with a question. Why do we pursue Lean development in the first place? Your specific answer may vary, but I bet it relates back to “better use of capital” or “turning ideas into profit sooner.” Both of these are statements about efficiency: efficient use of capital and efficient use of time.

One of the first Lean changes is to reorganize people and processes around the value streams. That is a big upheaval! It often means moving from a functional structure to a cross-functional structure. (And I don’t mean matrixing!) Just moving to that cross-functional structure will deliver big improvements to cycle time and process efficiency. After that, teams in each value stream can further optimize to reduce their cycle time.

The next focus is on reducing “inventory.” For development, we consider any unreleased code or stories to be inventory. So, work-in-progress code, features that have been finished but not deployed, and the entirety of the backlog all count as inventory.

Reducing inventory always has the effect of making more problems visible. Maybe there are process bottlenecks to address, or maybe there are high defect rates at certain steps (like failed deployments to production, or a lot of rejected builds.)

This is the start of the real optimization loop: reduce the inventory until a new problem is revealed. Solve the problem in a way that allows you to further reduce inventory.

Which is the Value Stream?

David’s question seems to originate from the view that the value stream is the request handling process. So if a single request hits a dozen services, then one value stream cuts across multiple organizational boundaries. That would indeed be problematic.

However, I think the more useful viewpoint is that the value stream is “the software delivery process” itself. This is based on the premise that the value stream delivers “things customers would pay for.” Well, a customer wouldn’t pay for a single request to be handled. They would, however, pay for a whole new feature in your product.

Viewed that way, each service in production is the terminal point of its own value stream. So, Lean does not conflict with a microservice architecture. But could a microservice architecture conflict with Lean?

Return to “Why”

We asked, “Why Lean?” Now, let’s ask “Why microservices?” The answer is always “We want to preserve flexibility as we scale the organization.” Microservices are about embracing change at a macroscopic level. That has nothing to do with capital efficiency!

So are these ideas contradictory? To answer that, I need to dig into another aspect of Lean efforts: infrastructure.

Efficiency, Specialization, and Infrastructure

In the early days of aviation, airplanes were made of canvas and wood. They could land at pretty much any meadow that didn’t have cows or sheep in the way. Pilots navigated by sight and landmarks, including giant concrete arrows on the ground. Planes couldn’t go very fast, fly very high, carry many passengers, or haul a lot of cargo.

The maximum takeoff weight of an Airbus A380 is now 1.2 million pounds. It requires a specially reinforced runway of at least 9,020 feet and typically carries 525 passengers. It flies at an altitude of more than 8 miles. This is not an airplane that you navigate by eyeballing landmarks.

This aircraft is amazingly efficient. Achieving that efficiency requires extensive infrastructure. Radar on the plane and on the ground. Multiple comms systems. An extensive array of radio beacons and air traffic controllers on the ground and dozens of satellites in space, all sending signals to the on-board network of flight management systems. Billions of lines of code running across these devices. Airports with jetbridges that have multiple connections to the aircraft. Special vehicles to tow the plane, push the plane out, haul bags, fuel, de-ice, remove waste water… the list goes on and on.

In short, this is not just an airplane. It is part of an elaborate air transportation system.

It should be pretty obvious that the incredible efficiency of modern airliners comes at the expense of flexibility. Not just in terms of the individual aircraft, but in terms of changes to any part of the whole system.

You can see this play out in any technological arena. As we increase the systems’ efficiency, we accumulate infrastructure that both enables the efficient operation and also constrains the system to its current mode of operation.

In Lean initiatives, there is a gradual shift from draining inventory and solving existing problems into creating infrastructure to add efficiency. It’s not a bright line or a milestone to reach, but it is noticeable. As you get further into the infrastructure-efficiency realm, you must recognize two effects:

  • You will get better at certain actions.
  • Other actions become much, much harder.

As an example, suppose you are optimizing the value stream for delivering applications. (A reasonable thing to do.) You will eventually find that you need an automated way to move code into production. You may choose to build golden master images, or automate deployment via scripts, or use Docker to deploy the same configuration everywhere. You may commit to vSphere, Xen, OpenStack, or whatever. As you make these decisions, you make it easier to move code using the chosen stack and much, much harder to do it any other way.

Full Circle

So, with all that background, I’m finally ready to address the question of whether microservices and Lean are in conflict.

Given that:

  1. You want maneuverability from microservices.
  2. Your value stream is delivering features into production.
  3. You pursue Lean past the inventory-draining phase.
  4. Further efficiency improvements require you to commit to infrastructure and an extended system.
  5. That extended system will not be easy to change, no matter what you choose or how you build it.

Then the answer is “no.”

Development Is Production

When I was at Totality, we treated an outage in our customers’ content management system as a Sev 2 issue. It ranked right behind “Revenue Stopped” in priority. Content management is critical to the merchants, copy writers, and editors. Without it, they cannot do their jobs.

For some reason, we always treated dev environment or QA environment issues as a Sev 3 or 4, with the “when I get around to it” SLA. I’ve come to believe that was incorrect.

The development environment and the QA environment are the critical tools needed for developers to do their jobs. When an environment is broken, it means those people are less effective. They might even be idle.

Why would you treat the tools developers use as any less critical? And yet, I see one company after another with unreliable, broken, half-integrated QA environments. They’ve got bad data, unreliable items, and manual test setup.

If any stage of the development pipeline is broken, that’s exactly equivalent to the content pipeline being broken.

Development is production.

QA is production.

Your build pipeline is production.

Treat them accordingly!

The Fear Cycle

Once you begin to fear your technology, you will shortly have cause to fear it even more.

The Fear Cycle goes like this:

  1. Small changes have unpredictable, scary, or costly results.
  2. We begin to fear making changes.
  3. We try to make every change as small and local as possible.
  4. The code base accumulates warts, knobs, and special cases.
  5. Fear intensifies.

Fear starts when an innocuous change goes badly. Maybe a production outage results, or maybe just an embarrassing bug. It may be a bug that gets upper management attention. Nothing instills fear like an executive committee meeting about your code defect!

This sphincter-shrinker originated because a developer couldn’t predict all the ramifications of a change. Maybe the test suite was inadequate. Or there are special cases that are only observed in production. (E.g., that one particular customer whose data setup is different from everyone else’s.) Whatever the specific cause, the general result is, “I didn’t know that would happen.”

Add a few of these events into the company lore and you’ll find that developers and project managers become loath to touch anything outside their narrow scope. They seek local safety.

The trouble with local safety is that it requires kludges. The code base will inevitably deteriorate as pressure for larger changes and broader refactoring builds without release.

The vicious cycle is completed when one of those local kludges is responsible for someone else’s “What? I didn’t know that!” moment. At this point, the fear cycle is self-sustaining. The cost of even small changes will continue to increase without limit. The time needed to get changes released will increase as well.

Breaking Point

One of several things will happen:

  1. A big bang rewrite (usually with a different team.) The focus will be “this time, we do it right!” See also: second system syndrome, Things You Should Never Do, Part I.
  2. Large scale outsourcing.
  3. Sell off the damaged assets to another company.

Avoiding the Cycle

The fear cycle starts when people treat a technical problem as a personal one. The first time a seemingly simple change causes a large and unpredictable effect, you need to convene a technical SWAT team to determine why the system allowed it to happen and what technical changes can avoid it in the future.

The worst response to a negative event is a tribunal.

Sadly, the difference between a technical SWAT team and a tribunal is mostly in how the individuals in that group approach the issue. Wise leadership is required to avoid the fear cycle. Look to people with experience in operations or technical management.

Breaking the Cycle

Like many reinforcing loops in an organization, the fear cycle is wickedly hard to break. So far, I have not observed any instance of a company successfully breaking out of it. If you have, I would be very interested to hear your experiences!