Wide Awake Developers

Power Systems

This is an excerpt from something I’m working on this Labor Day holiday:

Large-scale power outages act a lot like software failures. An outage starts with a small event, like a power line grounding out on a tree. Ordinarily that would be no big deal, but under high-stress conditions it can turn into a cascading failure that affects millions of people. We can also learn from how power gets restored after an outage. Operators must perform a tricky balancing act between generation, transmission, and demand.

There used to be a common situation where power would be restored and then cut off again in a matter of seconds. It was especially common in the American South, where millions of air conditioners and refrigerators would all start at the same time. When a motor starts up, it draws a lot of current. You can see this in the way that lights dim when you start a circular saw. As the motor starts to spin, though, it creates “back EMF”, a kind of backpressure on the electrical current. (That’s when the lights return to full brightness.) If you add up the effects of millions of electric motors starting all at once, you see a huge upward blip in current draw, followed by a quick drop due to back EMF. Power transmission systems would see the spike and drop and propagate that to the generation systems. First they would increase their draw, then drop it dramatically. That would make the generation systems think they should shut off some of the turbines. Right about the time they started reducing supply, the initial surge of back EMF would decline and current load would come back up to baseline levels. The increased current load hit just when supply was declining, causing excess demand to trip circuit breakers. Lights out, again.

Smarter appliances and more modern control systems have mitigated that particular failure mode now, but there are still useful lessons for us.

Remember DAT?

Do you remember Digital Audio Tape? DAT was supposed to have all the advantages of digital audio—high fidelity and perfect reproduction—plus the “advantages” of tape. (Presumably those advantages did not include melting on the dashboard of your Chevy Chevelle or spontaneously turning into The Best of Queen after a fortnight.)

In hindsight, we can see that DAT was a twilight product. As the sun set on cassette, it was an attempt to bridge the discontinuous technology change to digital music production. It was a twilight product because it didn’t reimagine the existing technology to offer enough of a new advantage nor eliminate enough of the old disadvantages.

We often see such twilight products right before a major, discontinuous shift.

I think we’re in such a period when it comes to software development and deployment for cloud native systems. The products we have attempt to take the traditional model into the new environment. But they don’t yet reimagine the world of software development enough. Ten years from now, we will see that they offer some advantages but also carry forward baggage from the client-server era. (E.g., Unix-like full operating systems, coding one process at a time, treating network as a secondary concern, ignoring memory hierarchy in the languages.)

Whatever the “operating system for the cloud” is, we haven’t seen it yet.

QA Instability Implies Production Instability

Many companies that have trouble delivering software on time exhibit a common pathology. Developers working on the next release are frequently interrupted for production support issues with the current release. These interrupts never appear in project schedules but can take up half of the developers’ hours. When you include the cost of task-switching, this means less than half of their available time is spent on the new feature work.

Invariably, when I see a lot of developer effort in production support I also find an unreliable QA environment. It is unreliable in two senses: it is frequently not available for testing, and the system’s behavior in QA is not a good predictor of its behavior in production.

QA environments are often configured differently than production. Not just in the usual sense of consuming a QA version of external services, but also in more basic ways. Server topology may be different. Memory settings, capacity ratios, and the presence of network components can all vary. QA often has a much simpler traffic routing scheme than production, particularly when a CDN is involved.

The other major source of QA unavailability has to do with data refreshes. QA environments either run with a minuscule, curated test data set, or they use some form of backward migration from production data. Each backward migration can be very disruptive, leading to one or more days where QA is not available.

Disruption arises when testers have to do manual data setup in order to test new features. These setups get overwritten with the next refresh. Sometimes, production data must be cleansed or scrubbed of PII before use in QA. This cleansing process often introduces its own data quality problems. The backward migration process must also be kept up to date so it can propagate data back into the schema for the next release. This requires copying data and schema into QA, then forward-migrating the schema according to the new release.

When many teams contend to get into a QA environment, that contention can result in lost time as well. Time is lost in delays when one team cannot move their code into QA during another team’s test. It is also lost when one team overwrites test data that a different team had set up. And it can be lost when one team’s code has bugs that prevent other teams from proceeding with their tests. Suppose one team works on login and registration, while another team works on friend requests. Clearly, the friend requests team cannot do their testing when login is broken. This last issue also applies across service boundaries: a service consumer may not be able to test because the QA version of their service provider is broken.

Finally, problems in QA simply take a lower priority than problems in production. Thus, the operations team may be fully consumed with current production issues, leaving the QA environment broken for extended periods. In a vicious feedback loop, this makes it likely that the next release will also create production problems.

My recommendations are these:

  • Give priority to well-functioning test environments.
  • Virtualize your test environments, so you can avoid inter-team dependencies on a QA environment.
  • Automate the backward data propagation, and make it part of spinning up a QA environment. When you must scrub PII, automate that process so that every QA environment can draw from a snapshot of cleansed data without impinging on the production DBAs.
  • If your QA stays unavailable because there are too many production issues, recognize that this is a self-sustaining pattern. You can temporarily redirect a “SWAT” team to fix QA and it will pay dividends for all future releases.
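To make the PII recommendation concrete, here is a minimal sketch of an automated scrub step that could run as part of spinning up a QA environment. Everything in it is an assumption for illustration: the field names in `pii-fields`, the `:id` key, and the functions `pseudonym`, `scrub-record`, and `scrub-snapshot` are hypothetical, and a real scrubber would have to follow your actual schema and privacy rules.

```clojure
;; Hypothetical set of fields that count as PII; a real schema will differ.
(def pii-fields #{:name :email :phone})

(defn pseudonym
  "Builds a deterministic placeholder from the field name and the record id,
  so that every refresh produces the same scrubbed value for the same row."
  [field id]
  (str (name field) "-" id))

(defn scrub-record
  "Replaces each PII field present in the record with a pseudonym.
  Non-PII fields pass through untouched."
  [record]
  (reduce (fn [r field]
            (if (contains? r field)
              (assoc r field (pseudonym field (:id r)))
              r))
          record
          pii-fields))

(defn scrub-snapshot
  "Scrubs every record in a snapshot (here, just a sequence of maps)."
  [records]
  (mapv scrub-record records))
```

Because the placeholders are deterministic, a tester’s manual setup that refers to a scrubbed value still matches after the next refresh, which removes one source of the disruption described above.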

Wittgenstein and Design

What does a philosopher born in the 19th Century have to say about software design? More than you might think, particularly his ideas about family resemblance.

Wittgenstein used the subject of “games” to illustrate an idea. We’ll start with a counter-example. Suppose we operate with the then-prevailing notion that words are defined like sets in axiomatic set theory. Then there is a decision procedure that will let us decide whether something is a member of the set “games” or not. Such a decision procedure must include everything that is a game and exclude everything which is not a game. Can we define such a decision procedure?

Does a game require competition? Some do. Not all.

Does a game have a score? Or an objective? Not all.

Does a game involve more than one person? Not necessarily.

Is a game a frivolous expenditure of energy? Some are. Others have deep moral and philosophical lessons.

How is a game of football like a game of solitaire?

It’s easy to see that mancala and go have something in common… little rounded stones. But what do they have in common with Minecraft? Stones?

Wittgenstein said that this is not an issue for set theory. Instead, he talked about family resemblances. As described in Wikipedia, “things which could be thought to be connected by one essential common feature may in fact be connected by a series of overlapping similarities, where no one feature is common to all.”

For games, this means there is no single feature that makes something a game. Instead, there are a set of overlapping similarities that make things more gamelike or less gamelike. We can even think about things that share more of the features as being more like each other. So go and mancala share features like: two players, stones on a board, alternating turns, one winner, ancient, cerebral, positional. This makes them pretty similar. A professionally played team sport with a ball on a field shares few qualities with go. (Although “people excited about the outcome” and “positional” might be common.) So the feature-distance between go and football is large, yet they are both still games.
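One way to make “feature-distance” tangible is to treat each game as a set of features and measure how much two sets overlap. The sketch below uses Jaccard distance, which is my choice for illustration and not something Wittgenstein proposed. The feature keywords and the `games` map are assumptions loosely based on the examples above.

```clojure
(require '[clojure.set :as set])

;; Hypothetical feature sets; the keywords are illustrative only.
(def games
  {:go        #{:two-players :stones-on-board :alternating-turns :one-winner
                :ancient :cerebral :positional :excited-spectators}
   :mancala   #{:two-players :stones-on-board :alternating-turns :one-winner
                :ancient :cerebral :positional}
   :football  #{:professional :team-sport :ball-on-field
                :excited-spectators :positional}
   :solitaire #{:one-player :cards :cerebral}})

(defn feature-distance
  "Jaccard distance between two feature sets: 0 means identical features,
  1 means nothing in common."
  [a b]
  (let [union (set/union a b)]
    (if (empty? union)
      0
      (- 1 (/ (count (set/intersection a b))
              (count union))))))

(defn common-to-all
  "Features shared by every game in the map."
  [games]
  (apply set/intersection (vals games)))
```

With these sets, `(feature-distance (:go games) (:mancala games))` is small and `(feature-distance (:go games) (:football games))` is large, yet `(common-to-all games)` is empty: no single feature defines membership.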

I think this relates to the tasks of software design and architecture. We have a strong tendency to go looking for nouns in our designs. Once we find a noun in a domain, we want to make a software artifact that captures all members of the set induced by that noun. But that only works if we stick with axiomatic set theory. Set theory works well for well-defined technical concepts and much less well for things in the human sphere.

One simple example: the humble “name” field. Go read Falsehoods programmers believe about names. How do you feel about that “first name”, “last name” database structure now? After reading that list, how much can you confidently say about instances of a “Name” class? Or a “Name” service?

We have all these debates about “noun-first” or “verb-first”. Back in The Perils of Semantic Coupling, I argued for a behavior-oriented approach rather than a noun-oriented approach. Stop asking “what is this thing?” and ask instead “what can you do with it?” That leads us toward segregated interfaces.

Now I’d augment that to emphasize those feature descriptions rather than noun-like descriptions. Instead of noun-first or verb-first I’m going to try “adjective-first”.

In Love With Your Warts

If you help other people solve problems, you may have run into this phenomenon: a person gleefully tells you how messed up their environment is. Processes make no sense, roadblocks are everywhere, and all previous attempts to make things better have failed. As you explore the details, you talk about directions to try. But every suggestion is met with an explanation of why it won’t work.

I say that these folks are “in love with their own warts.” That is, they know there’s a problem, but they’ve somehow been acclimated to it to such a degree that they can’t imagine a different world. They will consistently point to outside agents as the author of their woes, without realizing how much resistance they generate themselves.

Over time, by the way, there’s a reinforcing process. People who think and talk this way will cluster and drive out the less cynical.

These people can be intensely frustrating to work with, until you understand them. Understanding allows empathy, which is the only way to get past that self-generated resistance.

The first thing to understand is that any conversation about their problems isn’t really about their problems. An opening statement like, “We tried that but it didn’t work,” isn’t really asking for a solution. Instead, it’s an invitation to play a game. That game is called, “Stump the expert.” The player wins when you concede that nothing can ever improve. You “win” by suggesting something that the player cannot find an objection to. It’s not a real victory though, for reasons that will be clear in a moment.

Why does the player want to win this game instead of improving their world? For one thing, any solution you find is an implicit critique of the person who has been there. Suppose the solution is to shift a responsibility from one team to another. That requires management support in both teams. If that solution works, then it means the game-player could have produced the same improvement ages ago, but didn’t have enough courage to make it happen. Other changes might imply the game-player lacked sufficient authority, vision, credibility, or, rarely, technical acumen.

In every case, the game-player feels that your solution highlights a deficiency of theirs.

This is why “winning” the discussion isn’t really a win. You may get a grudging concession about the avenue to explore, but you’re still generating more resistance from that game-player.

My usual approach is to decline the invitation to the game. I don’t try to find point-by-point answers to things that have failed in the past. I usually draw analogies to other organizations that have faced the same challenges and make parallels to their solutions. Failing that, I accept the objections (almost always phrased as roadblocks thrown up by others) and just tell them, “Let me handle that.” (Most of the time, I find that people on the opposite side of a boundary express roadblocks from the other side that all eventually cancel each other out. That is, the roadblock turns out to be illusory.)

I’d like to hear from you, dear Reader. Assume that you cannot simply fire or transfer the game-player. They have value beyond this particular habitual reflex.

How would you handle a situation like this? What have you tried, and what works?

Some Useful Techniques From Bygone Eras

I find the old object-oriented design technique of CRC Cards to be useful when defining service designs. CRC is short for “Class, Responsibilities, Collaborators.” It’s a way to define what behavior belongs inside a service and what behavior it should delegate to other services.

Simulating a system via CRC is a good exercise for a group. Each person takes a CRC card and plays the role of that service. A request starts from outside the services and enters with one person. They can do only those things written down as their “responsibilities.” For anything else, they must send a message to someone else.

Personifying and role-playing really helps identify gaps in service design. You’ll find gaps where data or entire services are needed but don’t exist.

Tell, Don’t Ask

The more services you have, the more operational complexity you take on. Some patterns of service design seem to encourage high coupling and a high fan-in to a small number of critical services. (Usually, these are “entity” services… i.e., CRUDdy REST over a database table.)

Instead, I find it better to tell services what you want them to do. Don’t ask them for information, make a decision, then change some state.

Organizing around Tell, Don’t Ask leads you to design services around behavior instead of data. You’ll probably find you denormalize your data to make T,DA work. That’s OK. The runtime benefit of cleaner coupling will be worth it.
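As a rough illustration of the difference, here is a sketch of a tiny inventory “service” whose state is an atom. The names `reserve-by-asking!` and `reserve!` and the sku-to-quantity map are hypothetical. The sketch assumes Clojure 1.9 or later for `swap-vals!`.

```clojure
;; Ask style, shown for contrast: the caller reads the state, makes the
;; decision itself, then changes the state. Every caller repeats the rule,
;; and another caller can change the stock between the read and the write.
(defn reserve-by-asking!
  [inventory sku qty]
  (when (>= (get @inventory sku 0) qty)            ; ask
    (swap! inventory update sku (fnil - 0) qty)    ; decide elsewhere, then mutate
    :reserved))

;; Tell style: the caller states what it wants. The rule lives in one place
;; and is applied atomically inside the service.
(defn reserve!
  [inventory sku qty]
  (let [enough? (fn [stock] (>= (get stock sku 0) qty))
        ;; swap-vals! returns [old-state new-state]
        [before _after] (swap-vals! inventory
                                    (fn [stock]
                                      (if (enough? stock)
                                        (update stock sku (fnil - 0) qty)
                                        stock)))]
    (if (enough? before) :reserved :rejected)))
```

A caller just says `(reserve! inventory :widget 2)` and reacts to `:reserved` or `:rejected`. It never needs to know how much stock exists, which is the coupling the ask style creates.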

Data Flow Diagrams

If you ask someone who isn’t trained in UML to draw a system’s architecture, they will often draw something close to a Data Flow Diagram. This diagram shows data repositories and the transformation processes that populate them. DFDs are a very helpful tool because they force you to ask a few really key questions:

  1. Where did that information come from?
  2. How did it get there?
  3. Who updates it?
  4. Who uses the data we produce?

In particular, answering that last question forces you to think about whether you’re producing the right data for the downstream consumer.

Generalized Minimalism

My daily language is Clojure. One of the joys of working in Clojure is its great core library. The core library has a wealth of functions that apply broadly across data structures. A typical function looks like this:

(defn nthnext
  "Returns the nth next of coll, (seq coll) when n is 0."
  {:added "1.0"
   :static true}
  [coll n]
    (loop [n n xs (seq coll)]
      (if (and xs (pos? n))
        (recur (dec n) (next xs))
        xs)))

I want to call your attention to two specific forms. The “seq” function works on any “Seqable” collection type. (N.B.: It has special cases for other types, including some to make Java interop more pleasant. But the core behavior is about Seqable.) The “next” function is similar: it works on anything that already is a Seq or anything that can be made into a Seq.

This provides a nice degree of abstraction and through that, generality.

Pretty much all of the core data types either implement ISeq or Seqable. That means I can call “seq”, “next”, and “nthnext” on any of them. Other data types can be brought into the fold by extending one of those interfaces to them. We extend the data to meet the core functions, instead of overloading functions for data types.
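Here is a small sketch of bringing a new data type into the fold. The `Countdown` type is hypothetical. It implements only `clojure.lang.Seqable`, and that is enough for the core sequence functions to accept it.

```clojure
;; A hypothetical type that counts down from n to 1.
;; Implementing Seqable is the only thing it does.
(deftype Countdown [n]
  clojure.lang.Seqable
  (seq [_]
    ;; Delegate to an ordinary range; seq of an empty range is nil.
    (seq (range n 0 -1))))

;; The core functions work on it without any changes or overloads.
(def from-five (->Countdown 5))

(first from-five)       ; the first element of the countdown
(nthnext from-five 2)   ; the countdown with its first two elements skipped
(map inc from-five)     ; a lazy seq built from the countdown
```

Nothing in “first”, “nthnext”, or “map” knows about `Countdown`. The type met the abstraction, so the functions came along for free.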

YAGNI Isn’t About Being Specific

Under this approach, writing a general function is both simpler and easier than writing a specific one.

For example, suppose I need to do that classic example of trivial functionality: summing a list of integers. The most natural way for me to write that is like this:

(reduce + 0 xs)

That is both simple and general. But it doesn’t meet the spec I said! It sums any numeric type, not just integers. If I decide that I really must restrict it to integers, I have to add code.

(assert (every? integer? xs))
(reduce + 0 xs)

This is a pattern I find pretty often when working in Clojure. When I generalize, I do it by removing special cases. This goes hand-in-hand with decomposing behavior into smaller and smaller units. As each unit gets smaller, I find it can be more general.

Here’s a less trivial example. Today, I’m working on a library we call Vase. (See Paul deGrandis’ talk on data-driven systems for more about Vase.) In particular, I’m updating it to work with a new routing syntax in Pedestal. With the new routing syntax, we can build routes from ordinary Clojure data… no more need for oddly-placed syntax-quoting.

One of the core concepts in Pedestal is the “interceptor”. They fulfill the same role as middleware in Ring. (One difference: interceptors are data structures that contain functions. Interceptors compose by making a vector of data, whereas Ring middleware composes by creating function closures. I find it easier to debug a stack of data than a stack of opaque closures.) Any particular route in Pedestal will have a list of interceptors that apply to that route.

When a service that uses Pedestal supplies interceptors, it composes a list of them. Suppose I want to make a convenience function that helps application developers build up that list. What would I need to do?

You probably already figured out that any such “convenience” functions I could create would basically duplicate core functions, but with added restrictions. Instead of “cons”, “conj”, “take”, and “drop”, I’d have to create “icons”, “iconj”, “itake”, and “idrop”. What a waste.

I have to ask myself, “Do I need some special behavior here?” And the answer is “YAGNI.”
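To show what I mean, here is a sketch that builds interceptor lists with nothing but core functions. The interceptors are written as plain maps with `:name` and `:enter` keys, and `run-enter` is a toy stand-in I made up so the example runs without Pedestal on the classpath. It is not Pedestal’s executor, and the interceptor names are hypothetical.

```clojure
;; Hypothetical interceptors as plain data. Each :enter function takes a
;; context map and returns an updated context map.
(def log-request
  {:name  :log-request
   :enter (fn [ctx] (update ctx :log (fnil conj []) :request-seen))})

(def authenticate
  {:name  :authenticate
   :enter (fn [ctx] (assoc ctx :user "anonymous"))})

(def load-account
  {:name  :load-account
   :enter (fn [ctx] (assoc ctx :account {:id 42}))})

;; Building route-specific lists needs no special helpers:
;; conj, remove, and friends already work on a vector of maps.
(def common [log-request authenticate])

(def account-route (conj common load-account))

(def public-route
  (vec (remove #(= :authenticate (:name %)) account-route)))

;; Toy executor (an assumption, not Pedestal's): thread the context
;; through each :enter function in order.
(defn run-enter
  [interceptors ctx]
  (reduce (fn [c interceptor] ((:enter interceptor) c))
          ctx
          interceptors))
```

Because the list is just a vector, I can also inspect it at the REPL with `(mapv :name account-route)`, which is the debugging advantage of a stack of data over a stack of closures.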

YAGNI Is About Adding “Stuff”

YAGNI is commonly understood to mean “don’t generalize until you need to.” In some languages and libraries, I suppose that’s the right read. In my world, though, it is specializing that requires adding stuff. So I often call YAGNI if someone tries to make a thing less general than it could be.

Small functions that operate on abstractions instead of concrete types are both general and simple.

Redeeming the Original Sin

While reading Bryan Cantrill’s slides from Papers We Love NYC, I was struck by something. One of the very first slides says:

The traditional UNIX security model is simple but inexpressive.

The papers go on to describe a progression of techniques to isolate processes from the host environment to greater and greater degrees. It began with the ancient precursor ‘chroot’ and continued through Jails and Zones. Each builds upon the previous work to improve the degree of isolation.

We’ve seen a parallel series of efforts in the Linux realm with virtual machines and containers.


All of these are introduced to restore the degree of isolation and resource control that was originally present in mainframe operating systems. Furthermore, it was the model that Multics was meant to supply.

Unix started with a simplified security model, meant for single user machines. It was “dumbed down” enough to be easy to implement on the limited machines of the day.

Zones, VMs, containers… they’re all ways to redeem Unix from its original sin. Maybe what we should look at is a better operating system?

What’s Lost With a DevOps Team

Please understand, dear Reader, that I write this with positive intention. I’m not here to impugn any person or organization. I want to talk about some decisions and their natural consequences. These consequences seem negative to me and after reading this post you may agree.

When an established company faces a technology innovation, they often create a new team to adopt and exploit that innovation. During my career, I’ve seen this pattern play out with microcomputers, client/server architecture, open systems, web development, agile development, cloud architecture, NoSQL, and DevOps. Perhaps we can explore the pros and cons of that overall approach in some other post. For now, I want to specifically address the DevOps team.

A DevOps team gets created as an intermediary between development and operations. This is especially likely when dev and ops report through different management chains. That is to say, in a functionally-oriented structure. In a product-oriented structure, it is less likely.

This intermediary team gets tasked with automating releases and deployments. They are the ones to adopt some code-as-configuration platform. Sometimes they are also tasked with building an internal platform-as-a-service, but that more often falls to the infrastructure and operations teams.

So the DevOps team has development as their customer. Operations has the DevOps team as their customer. Work flows from development, through the tools created by the DevOps team, and into production. It would seem to capture the benefits of automation: it becomes predictable, repeatable, and safe.

All of that is true. However, even though this is an improvement, it misses out on even greater improvements that could be realized.

The key problem is the unclosed feedback loop. When developers are directly exposed to production operations, they learn. Sometimes they learn from negative feedback: getting woken up for support calls, debugging performance problems, or that horrible icy feeling in your stomach when you realize that you just shut down the wrong database in production.

With a DevOps team sitting between development and operations, the operations team remains in the “learning position.” But they lack the ability to directly improve the systems. Suppose a log message is ambiguous. If the operator who sees it can’t directly change the source code, then the message will never get corrected. (It’s important, but small… exactly the thing least likely to be worth filing a change request for.)

Over longer time spans, the things we learn from production should influence the entire architecture: from technology choices to code patterns and common libraries. A DevOps team sitting between development and operations impedes that learning.

DevOps is meant to be a style of interaction: direct collaboration between development and operations. A team in between that automates things is a tools team. It’s OK to call it a tools team. Tools are a good thing, despite what corporate budgeting seems to say these days.

Instead of creating a flow from development to DevOps to operations, consider putting development, tools, and operations all together and giving them the same goals. They should be collaborators working shoulder-to-shoulder rather than work stations in a software factory.

Give Them the Button!

Here’s a syllogism for you:

  • Every technical review process is a queue
  • Queues are evil
  • Therefore, every review process is evil

Nobody likes a review process. Teams who have to go through the review look for any way to dodge it. The reviewers inevitably delegate the task downward and downward.

The only reason we ever create a review process is because we think someone else is going to feed us a bunch of garbage. They get created like this:

It starts when someone breaks a thing that they can’t or aren’t allowed to fix. The responsibility for repair goes to a different person or group. That party shoulders both responsibility for fixing the thing and also blame for allowing it to get screwed up in the first place.

(This is an unclosed feedback loop, but it is very common. Got a separate development and operations group? Got a separate DBA group from development or operations? Got a security team?)

As a followup, to ensure “THIS MUST NEVER HAPPEN AGAIN” the responsible party imposes a review process.

Most of the time, the review process succeeds at preventing the same kind of failure from recurring. The resulting dynamic looks like this:

The hidden cost is the time lost. Every time that review process has to go off, the creator must prepare secondary artifacts: some kind of submission to get on the calendar, a briefing, maybe even a presentation. All of these are non-value-adding to the end customer. Muda. Then there’s the delay on the review meeting or email itself. Consider that there is usually not just one review but several needed to get a major release out the door and you can see how release cycles start to stretch out and out.

Is there a way we can get the benefit of the review process without incurring the waste?

Would I be asking the question if I didn’t have an answer?

The key is to think about what the reviewer actually does. There are two possibilities:

  1. It’s purely a paperwork process. I’ll automate this away with a script that makes a PDF and automatically emails it to whoever needs it. Done.
  2. The reviewer applies knowledge and experience to look for harmful situations.

Let’s talk mostly about the latter case. A lot of our technology has land mines. Sometimes that is because we have very general purpose tools available. Sometimes we use them in ways that would be OK in a different situation but fail in the current one. Indexing an RDBMS schema is a perfect example of this.

Sometimes, it’s also because the creators just lack some experience or education. Or the technology just has giant, truck-sized holes in it.

Whatever the reason, we expect that the reviewer is adding intelligence, like so:

This benefits the system, but it could be much better. Let’s look at some of the downsides:

  • Throughput is limited to the reviewer’s bandwidth. If they truly have a lot of knowledge and experience, then they won’t have much bandwidth. They’ll be needed elsewhere to solve problems.
  • The creator learns from the review meetings… by getting dinged for everything wrong. Not a rewarding process.
  • It is vulnerable to the reviewer’s availability and presence.

I’d much rather see the reviewer codify that knowledge by building it into automation. Make the automation enforce the practices and standards. Make it smart enough to help the creator stay out of trouble. Better still, make it smart enough to help the creator solve problems successfully instead of just rejecting low quality inputs.

With this structure, you get much more leverage from the responsible party. Their knowledge gets applied across every invocation of the process. Because the feedback is immediate, the creator can learn much faster. This is how you build organizational knowledge.

Some technology is not amenable to this kind of automation. For example, parsing some developer’s DDL to figure out whether they’ve indexed things properly is a massive undertaking. To me, that’s a sufficient reason to either change how you use the technology or just change technology. With the DDL, you could move to a declarative framework for database changes (e.g., Liquibase). Or you could use virtualization to spin up a test database, apply the change, and see how it performs.

Or you can move to a database where the schema is itself data, available for query and inspection with ordinary program logic.
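When the schema is data, a reviewer’s rule of thumb can become an ordinary function. The sketch below is only an illustration: the shape of the `schema` map and the rule “every foreign-key column should have an index” are assumptions standing in for whatever your DBA actually checks, and no particular database exposes its schema in exactly this form.

```clojure
(require '[clojure.set :as set])

;; Hypothetical schema description as plain data.
(def schema
  {:tables
   {:orders    {:columns      #{:id :customer-id :placed-at}
                :foreign-keys #{:customer-id}
                :indexes      #{:id}}
    :customers {:columns      #{:id :email}
                :foreign-keys #{}
                :indexes      #{:id :email}}}})

(defn unindexed-foreign-keys
  "Returns a map of table name to the set of foreign-key columns that have
  no index. Tables with nothing to report are left out."
  [schema]
  (into {}
        (for [[table {:keys [foreign-keys indexes]}] (:tables schema)
              :let  [missing (set/difference foreign-keys indexes)]
              :when (seq missing)]
          [table missing])))
```

Running `(unindexed-foreign-keys schema)` during the build gives the creator immediate, specific feedback about which table and column to fix, without waiting for a review meeting.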

The automation may not be able to cover 100% of the cases in general-purpose programming. That’s why local context is important. As long as there is at least one way to solve the problem that works with the local infrastructure and automation, then the problem can be solved. In other words, we can constrain our languages and tools to fit the automation, too.

Finally, there may be a need for an exception process, where the automation can’t decide whether something is viable or not. That’s a great time to get the responsible party involved. That review will actually add value because every party involved will learn. Afterward, the RP may improve the automation or may even improve the target system itself.

After all, with all the time that you’re not spending in pointless reviews, you have to find something to do with yourself.

Happy queue hunting!