Wide Awake Developers

Constrain the Provider to Liberate Callers

2024-02-27T05:43:48-06:00

Back in the Before Times, I went to a Haskell-flavored FP conference, where one of the speakers said something that blew my mind. Sadly, it seems that I didn’t write this up at the time (although I swear I wrote it somewhere… maybe in an internal company memo) and I’ve lost the details of who said it. If by some quirk of odds, it was you, dear reader, please let me know!

The speaker posed a question: Given the following function signature–and no additional information at all–how many possible implementations of f are there?

f :: a -> a

For those who don’t read Haskell, it says “the function f takes an argument of type a and returns a value of type a”. We have no information about a whatsoever, and that’s the key to the koan. Without any information about type a, f cannot apply any operation to the value it receives. It cannot create a new instance of a because it doesn’t know how to construct it. It cannot modify the value because it doesn’t know any other function to apply. In fact, the only possible implementation of f (setting aside ones that crash or never return) is id, the identity function. If you give me an a, all I can do is give it back to you.
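Here is a minimal sketch of that claim. The commented-out fBroken is a hypothetical name used only to show the kind of definition the type checker rejects.

```haskell
-- With no constraints on `a`, the only total implementation
-- is to hand the argument straight back.
f :: a -> a
f x = x

-- Anything that tries to inspect or change the value is rejected.
-- Uncommenting the definition below fails to compile, because nothing
-- in the signature says that `a` supports (+) or numeric literals:
--
-- fBroken :: a -> a
-- fBroken x = x + 1
```

Loaded into GHCi, f 42 and f "hello" both simply return their argument, whatever its type.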

At first glance, this seems trivial or even tautological. But there’s something profound under the surface. If, like me, you come from the dynamic side of FP or the world of OOP where you always have a base class with some operations available, this might need a little bit of unpacking. I don’t want to turn this into a Haskell tutorial (I would certainly get a lot of it wrong!) but a couple of ideas are necessary.

Suppose we wanted to say that a different function g could operate on any integer. We would then have a different signature:

g :: Integer -> Integer

In this case, g can take any possible integer into any other possible integer. Now we know way less about the implementation. It could increment or decrement its argument. It could add 42 to odd values and subtract 99 from even ones. If we’re working with 64-bit integers, there are 18,446,744,073,709,551,616 implementations of this function that just ignore their argument and return a constant!
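As a rough illustration, here are a few of the implementations that all fit g's signature. The names (gIncrement, gOddEven, and so on) are made up for this sketch. Note that Haskell's Integer is arbitrary-precision, so the count at the end relies on the 64-bit assumption from the text.

```haskell
-- Several of the many implementations that satisfy Integer -> Integer.
gIncrement :: Integer -> Integer
gIncrement n = n + 1

gDecrement :: Integer -> Integer
gDecrement n = n - 1

-- Add 42 to odd values and subtract 99 from even ones.
gOddEven :: Integer -> Integer
gOddEven n
  | odd n     = n + 42
  | otherwise = n - 99

-- Ignores its argument entirely and returns a constant.
gConstant :: Integer -> Integer
gConstant _ = 7

-- Number of distinct values a 64-bit integer can hold, and so the
-- number of distinct constant functions under that assumption.
constantCount :: Integer
constantCount = 2 ^ (64 :: Int)
```

The signature alone cannot tell a caller which of these it is getting.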

We could also use typeclasses to indicate partial information about the values. Suppose we wanted to create a function h that accepts values which can be compared and placed in an ordering. That would look something like:

h :: (Ord a) => a -> a

(I probably have the syntax wrong, but bear with me… it’s the idea that is important here.)

This says that h can take any value of any type, as long as that type is known to fulfill the requirements of the Ord typeclass. Ord is fairly minimal: it just requires that the type implements a few operations like “less than” and “greater than”. (It also brings in Eq, but we don’t need that at the moment.)
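A small sketch of what an Ord constraint permits. With a single argument, h still has only one value to work with, so the constraint becomes more interesting once a function receives two or more values of the same type. The names larger and clampTo are illustrative only.

```haskell
-- Ord lets the implementer compare values, but nothing else.
larger :: Ord a => a -> a -> a
larger x y
  | x >= y    = x
  | otherwise = y

-- Restrict a value to the range [lo, hi] using only comparisons.
clampTo :: Ord a => a -> a -> a -> a
clampTo lo hi x
  | x < lo    = lo
  | x > hi    = hi
  | otherwise = x
```

Both functions work for numbers, characters, strings, or any other type with an Ord instance, and neither can construct or alter a value. They can only choose among the values they were handed.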

Let’s get back to our original f :: a -> a: By using the type variable a with zero additional information about a, the compiler enforces constraints on the implementer of f. f must accept any value of any type whatsoever. f is therefore not allowed to know anything about the arguments. It cannot make any assumptions about the values it will be called with. In a sense, f is maximally constrained because it has the least information possible about its arguments.

On the flip side, a caller of f is maximally liberated. Because the implementation cannot make any assumptions, it also means the implementation cannot impose any constraints on the caller. The caller can pass any value it wants to. The caller can use f in contexts and situations that the implementer of f never dreamed of.

I think this generalizes to a principle: Constrain the provider to liberate callers.

In my experience, elegant systems emerge when we have functions or modules that can be used in multiple contexts. So we would like to have multiple callers. But whatever functionality the provider offers can only be used in contexts that meet its assumptions. Callers are restricted to meeting the provider’s assumptions. Therefore assumptions in the provider constrain the caller. We should invert this: limit the provider’s ability to make assumptions, thereby allowing callers to use it in various situations.

This extends to services across an enterprise, too. The more a provider is allowed to know about how it will be used, the narrower the ability to reuse or recompose the services will be. Extend that forward and backward along call chains in a distributed services environment, and you will see inflexibility set in as every caller of some service is itself a provider to others.

I think this is also linked to the phenomenon of single-use API definitions, where an API is written for a specific point-to-point interaction. The implementation inevitably makes too many assumptions about how it will be used. So you get an environment with a proliferation of APIs each with their own payload types.

Another related idea is selective amnesia. When designing an API to offer, you can choose to temporarily forget what you know about the caller. Instead think about how a second or third caller might want to invoke your service. This leads to an API that is “one notch more abstract” than you might otherwise design. (Before you shout YAGNI, please recall that we must weaken YAGNI across service boundaries.) Selective amnesia can help constrain the provider from making assumptions.

Rule of Eights

2023-05-28T10:29:28-05:00

The “rule of eights” is a handy way to think about feedback cycle time and the effect it has on human attention spans. This is something I heard–and am probably misremembering to some extent–at an agile conference back in the day. I can’t take credit for this but I also can’t remember who I heard it from. If you know who came up with this, please contact me so I can properly attribute this.

(Edit: Dion Stewart points out that this resembles the Powers of Ten article from Jakob Nielsen. I’ve read some of Nielsen’s work, so it’s entirely possible that I have misremembered and conflated that with some other thoughts on feedback.)

The basic idea is that the speed of feedback has a huge effect on a person’s ability to stay in a flow state. The longer the feedback takes, the more likely the person is to context-switch onto some other activity.

80 milliseconds

Faster than human reflexes. Feels effectively instantaneous. Unlikely to break flow.

800 milliseconds

A noticeable hitch or pause. Enough to be annoying while typing, but not overly irritating when running a command. Reasonable for an “execute on save” command like linting or formatting. Unlikely to break flow.

8 seconds

A discernible pause. Cannot be a continuous part of workflow, but might be a kind of punctuation between tasks. Will cause thoughts to wander. May cause alt-tabbing over to check email.

80 seconds

Annoying. Enough to get bored and provoke tab-switching, with high likelihood of lost flow.

8 minutes

This is a coffee break. Certain to break flow. A diligent dev may alt-tab to HN with the intent to come back when the job is done (but will probably get diverted into other work until long after the 8 minutes are over). Will cause the developer to find ways to avoid incurring this. If this is your test execution time, tests will degrade as they are avoided during most dev work.

80 minutes

Flow is a long-lost dream. This is a lunch break. Devs will organize their day around not having to run this job. They get basically two runs per day at this pace: probably once in the morning and once in the afternoon and the afternoon job may be kicked off before heading out the door.

8 hours

Flow? What flow? This is something the dev starts before leaving for the day. Everything is scheduled around this execution time. The job is very likely to break overnight, and devs will invent ways to “recover” a broken build via monkey-patching, hotfixing, etc. This is waste on top of waste, but feels more “right” than incurring another day of delay. The underlying job will get more and more flaky as people come to rely on spackle and grout instead of fixing the foundation.

8 days

Effectively infinite. Jobs very likely to fail. Flow is irrelevant. This is the domain of researchers, prisoners, and postdocs.
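One way to put the ladder to work is to map a measured duration, such as a build or test run, onto the closest rung. The sketch below is only an illustration: the names Rung, rungs, and nearestRung are invented here, and measuring "closest" on a logarithmic scale is an assumption, since the rule gives points rather than ranges.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- One rung of the ladder: a duration in seconds and its name in the list above.
data Rung = Rung
  { rungSeconds :: Double
  , rungLabel   :: String
  }

rungs :: [Rung]
rungs =
  [ Rung 0.08   "80 milliseconds"
  , Rung 0.8    "800 milliseconds"
  , Rung 8      "8 seconds"
  , Rung 80     "80 seconds"
  , Rung 480    "8 minutes"
  , Rung 4800   "80 minutes"
  , Rung 28800  "8 hours"
  , Rung 691200 "8 days"
  ]

-- Assumption: distance is measured on a log scale, because the rungs
-- are spaced by multiples rather than by fixed differences.
-- The duration must be positive.
nearestRung :: Double -> Rung
nearestRung secs = minimumBy (comparing distance) rungs
  where
    distance r = abs (log secs - log (rungSeconds r))
```

For example, rungLabel (nearestRung 600) places a ten-minute test suite on the "8 minutes" rung.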

The Bad Idea Game

2023-05-02T06:34:46-05:00

About ten years ago, I was introduced to something called “The Bad Idea Game” by Danvers Fleury. We were doing a company strategy retreat. Fortunately we did not spend it all on wordcrafting mission and values statements, and we actually engaged in some good strategy.

The bad idea game was a fun exercise that didn’t seem to produce any directly useful results. At first I thought it had been a waste of a precious hour from our limited supply. Afterwards, however, I noticed that we were thinking more broadly and considering more creative options.

Since then, I have used the Bad Idea Game occasionally when it seems that groupthink has emerged or people have become circular in their thinking. It helps break out of those recursive loops.

It is a facilitated exercise that starts by posing a challenge.

“Instead of hitting the problem head-on, let’s go a different direction. For the next hour, we’re going to think of the worst, most value-destroying ideas we can. For example, instead of thinking about how to improve your company image you would think of ideas for absolutely demolishing your company image. Things like: put out a video of you murdering puppies on main street. We’re going to do this because sometimes we can find good ideas in the ‘shadow’ of bad ones by reversing them or combining their reversals.”

You’ll get some (nervous) laughter at first and people will need some prodding. Use your usual facilitation tricks and tools (having a plant in the audience always helps). After a couple of minutes to explain the silliness of the exercise, have people spend five or ten minutes writing ideas on sticky notes.

Put the notes up on the board and let people riff on them for a while.

Depending on the group and the psychological safety in the company it might be best to not write down the bad ideas.

Why is this a useful exercise? It usually does not produce a good idea directly in that session. Instead, I think it helps in two ways.

First, as I mentioned before, it can break people out of their habits of thought. It engages some creative faculties.

The second, more subtle effect that I didn’t appreciate until later, is that framing things as bad ideas gives you license to start naming elephants. In any company there are some things you just don’t talk about. Sensitive areas. Political hot buttons. Toes on which you must not step. Deeply ingrained assumptions. And sometimes, those things are holding you back. Especially if those deeply held beliefs are the very things that made you successful so far. And doubly so if your company isn’t good at self-reflection and confronting its own sacred cows.

If you’re going to call the CEO’s baby ugly, it’s better to do it under the cover of a bad idea exercise than to come right out and say it.

Everything We Build Has a Future Cost

2023-04-16T10:53:29-05:00

Suppose we build a road. If we build the road and walk away, it will decay into a hazard before long. It will be scoured by wind, rain, and sand. Ultraviolet rays from the sun will break down its molecular structure. The shifting earth beneath will crack and buckle it.

We must maintain what we build, and that requires expense.

Suppose we decide that the road is no longer needed or that it costs more to maintain than it is worth. There is expense to removing what we have built, too.

We cannot just close it and leave it alone. We must divert traffic off of that road to others, which may require some incremental new road construction to make connections. If people have moved into neighborhoods on that road, we must build new routes to connect their homes–or we must relocate them. If there are businesses, we must (somehow) deal with the owners. Once we’ve reduced usage to nothing–which can take years–we must tear up the asphalt or concrete, haul it away (to where?), and restore the land where the road sat.

Let’s consider a smaller case. Consider the humble birdfeeder, a trivial construct. You buy it, hang it, and fill it. The air fills with birdsong and you fill with joy. Of course, the birds eat the feed so you must replenish it. That’s the obvious future cost. Less obvious is that the plastic will degrade and break, and you will eventually replace the whole unit. It may not seem like much to hang the new unit in the old one’s place, but it does cost you time. For the aged or infirm, it may have a capital cost if a handyman is needed. Discarding the old feeder means throwing it in the trash–which you pay to haul away–or recycling–which you or your local government pay for as well.

It would be wise to consider the future cost when you undertake a project to build something. Maintenance, replacement, switching cost, disassembly, and removal… when you account for all of those, perhaps the present value of the project isn’t as appealing as it appeared.

Four Meanings of Priorities

2023-04-12T07:20:11-05:00

When trying to communicate, we sometimes use the same word thinking that it means the same thing to everyone. But words are slippery, multivalent things. I can speak a word with one meaning and you might hear it with another. The result is the illusion of communication.

As a leader you must be aware that your words can be taken in different ways. In one kind of culture, people might look for the most sinister possible interpretation and assume that’s what you must have meant. Even in a healthy culture, it is possible for you to say the exact same thing to two different people and they still end up disagreeing about your intended message.

I’ve been thinking about the word “priority” and its multiple meanings. It’s a word that comes up frequently. Everyone in your organization deserves a clear understanding of priorities and how their work connects to the organization’s goals. However, I’ve identified at least four distinctly different meanings of “priority”.

Sequence

Priority-as-sequence means we will do all of priority 1, then all of priority 2, and so on. In a small team or startup, this might be the only useful definition.

It is the sense of a recipe: chopping vegetables is the first priority, then you can saute them.

Allocation

As allocation, “priority” means we will spend the majority of our time on the top priority, a lesser amount on the second priority, and so on. This might be fractions of a week for a single team, or it might be allocation of headcount across an organization.

This is the sense we mean when we say “health is a priority” or “family is a priority”. We don’t literally plan to “finish all our health first, then all our family”. Instead it is a statement about how we intend to spend our time.

Trade-offs

In this sense, a list of priorities is a pre-decided ordering of trade-offs. When priority 1 and priority 2 are in tension, we know to make the trade-off in favor of priority 1. This somewhat intersects with priority-as-allocation, if the trade-off in question is “where do I devote my time”. However it is distinct when the question at hand is something like “do I optimize for performance or cost?”

The image here is buying a car. You might have “appearance” as a priority over “reliability” or vice versa.

Scope

Priority can also mean “the boundary of what I care about at all.” The list of priorities gives you permission to say “no” to other demands. If your organization–like so many others–is drowning under excess WIP and a years-long backlog, people will eagerly adopt this meaning of priority. (Even if what you meant was “allocation” or “trade-offs”.)

The image here is a backpack for hiking. There are only so many things that will fit into it. Bringing a tennis racket is probably not a priority.

As with any question of definitions, none of these is more right than others. The key is to make sure you have a shared understanding so that everyone has clarity and can work toward the same purpose.

Transactions Aren't Everything

2023-04-06T00:00:00+00:00

When building an application, we tend to select a database technology based on its transactional characteristics. We consider raw performance, API style, consistency model, data model, and deployment architecture. That’s about as much as your service cares about: can it meet the functional and non-functional requirements for the production behavior of the service?

Even in a microservice architecture where no other application is allowed to access the service’s database, that database probably has a bunch of other clients. You may not think about these when making your selection, but how well or badly your database supports them has a big effect on its eventual success.

ETL connectors. Your company has some kind of ETL or ELT to feed bulk data from your service to analytical processing.
PII Discovery and Classification Tools. These tools are part of data governance and will be a horizontal capability. They look through your schema and samples of your data to see if there is undeclared PII.
Backup and Recovery. Whether this is something like cross-datacenter replication or “cold” storage of backup files, everybody’s got one. (And by the way, have you tested your backups lately?)
Query Optimization. Your service doesn’t need this but you do. (Especially in the cloud where performance equals savings!)

The horizontal tools are probably licensed at an enterprise level. Your company has to pay for the connectors to each different type of database in use. So if you’re the only user of a particular DB technology, that connector cost (whether license fee or labor to build it) is part of the cost of choosing that technology. It’s not a trivial component of the overall price tag, so make sure you understand the full cost/benefit equation when making your choices.

Counterfactuals are not Causality

2021-06-19T16:27:50+00:00

Suppose we’ve had a recent error with a Kubernetes cluster. As often happens with a problem in our systems, we noticed it first in terms of the visible error, which we could state as “Builds did not complete.” Now we want to trace backwards to figure out what happened. A common technique is the “Five Whys” popularized by Lean thinking. So we ask “Why did builds not complete?” and we find “Kubernetes could not start the pod, and the operation timed out after 1 hour.”

We could certainly debate whether that’s a single “why” or two of them in one step, but that’s not the key topic right now. The main thing is that this is a straightforward statement about causality. “Pod no start” leads directly to “build no done.”

The next step in this analysis reveals that the pod would not start because a volume was full with too many files. Again, direct causality.

The tricky bit comes next. Why was the volume full with too many files? At this point, we’re likely to see a change in the nature of the explanation. Some variation of the following might be offered:

The admin did not configure file purging.
The cluster admins did not monitor for “volume full” conditions.
The developers did not clean up files from old builds.

Do you notice how all these “causes” are stated in the form of something that didn’t happen? They are “counterfactuals.”

A counterfactual is a statement about how the world might be different now if something had happened differently in the past. It’s a kind of “alternate history” idea.

Here’s the rub: a counterfactual cannot be a cause. By definition the counterfactual did not happen, therefore it cannot have caused anything. Only events that actually occur can be causes of other events. Causality should be stated in the form “Because X then Y”. The statement “If not X then not Y” is not an explanation; it is a kind of wishful thinking about how the past might have unfolded differently.

When performing Five Whys it is important to avoid this counterfactual leap. Stick to the events that actually occurred.

Unlimited Counterfactuals

Notice in the incident analysis I outlined earlier, there are three counterfactuals listed. Each of them independently would have been sufficient to avert the incident. But these are hardly the only three counterfactuals we could construct:

We used Kubernetes for our CI cluster instead of static VMs.
We use CI instead of a human working at the command line.
We put code in a repository instead of directly editing files on production instances.

I could go on, but you probably felt like the first three were somehow more reasonable than these three. In some way, the original set are “closer” to actual reality than these three. Nonetheless, I could go on constructing counterfactuals for an unlimited period of time. “If the Earth hadn’t been habitable then we would not be here to care about our CI builds not finishing.” Once you start making counterfactuals, there’s really no end to them. Again, that’s because these are not events that happened. Only a finite number of events actually happened so the chain of causality is finite. An infinite number of things didn’t happen so we can always find more “missing things” to blame.

Speaking of Blame

This is also where people come into conflict when analyzing the chain of events. One person might posit a counterfactual about an event a different person or team didn’t do. That person or team naturally bristles–it feels like they are being blamed. (And worse, being blamed for not doing something, so they are being called negligent!) They would be impelled to put forward their own counterfactual which might haul in yet another team. If the negative outcome was significant, this cloud of hypotheticals becomes a “blamestorm” looking to rain down on somebody. Defenses go up, and learning stops.

Counterfactuals are the condensation nuclei for blamestorms.

Using Counterfactuals For Good

The counterfactual leap indicates where people stop looking for causes and jump to thinking about solutions. Try to reformulate the counterfactual as a statement about future prevention:

If we configure file purging, then this won’t happen again.
If we monitor for “volume full” conditions, then this won’t happen again.
If we clean up files from old builds, then this won’t happen again.

These are useful statements. When formulated this way, they’re clearly talking about the future and not hypothesizing an alternate history. (You might have noticed that I also snuck in a bunch of “we” statements in place of the more specific attributions above.)

As long as we remain clear that these counterfactuals are not the cause of the problem that already happened, but are changes to our reality that can prevent future occurrences, we can use them without inducing blamestorming.

As a practical technique, during a Five Whys or post-incident review, when someone poses a counterfactual as a cause, I suggest capturing it in the forward-looking version in a parking lot of potential changes.

Stepping Farther Away From Reality

This reformulation also helps weed out the more far-fetched counterfactuals… the ones that felt kind of “out there” or even silly before. Let’s try it with the second set from above:

If we use static VMs instead of Kubernetes for our CI cluster, then this won’t happen again. (Possibly true statement, though somewhat lacking in support.)
If we use a human working at the command line instead of CI, then this won’t happen again. (Probably. Humans are more adaptable and can figure out when to purge files. But there are likely to be other undesirable effects.)
If we edit files directly on production instances instead of putting code in a repository, then this won’t happen again. (Umm… definitely a case where the cure is worse than the disease!)

This last one also lets me illustrate something about the counterfactuals from before. You might have felt more resistance to the second set because you were automatically thinking about negative consequences if that statement had been true. Humans are very good at hypothesizing these counterfactuals. Faced with a bad outcome, our brains spontaneously and instantly conjecture a large branching tree of alternate histories. And just as quickly, we prune that tree of those branches which we know would produce other negative effects that are worse than the outcome we had. Just imagine, “If we provoke a nuclear war that ends civilization, then this CI build failure won’t happen again.”

So when I pose a counterfactual that says “if we edit files directly on production instances, this won’t happen again,” your instinctive response is to say, “yeah, but.” This is now thinking about two steps away from the current reality. Step 1 is to imagine the alternative history where the counterfactual had occurred. Step 2 is to extrapolate the negative outcome of the consequences of that alternative history. Sometimes we even go further steps away from reality by postulating still more counterfactuals that could compensate for the negative consequences of the first one.

Conclusions

Counterfactuals don’t say anything about what actually happened. They express wishful thinking about an alternate history where the bad event didn’t happen. Because they represent “events that didn’t occur” they cannot have caused anything. However, stating a counterfactual can trigger an unhelpful round of blamestorming. Try to reformulate counterfactuals offered as explanations for past events so you can state them as injections to prevent recurrence. Of course, you must also contemplate what other effects those injections would have!

Watch out for the pitfall of counterfactuals when analyzing anything. It’s a common trap for post-incident reviews, retrospectives, project post-mortems, and other cases when you need to reconstruct a chain of events.

"Manual" and "Automated" are just words

2020-10-15T13:35:45-05:00

Driving down a shady road, windows down, listening to the frogs and crickets, my family was in the car talking about various stuff and things. This summer evening we happened to talk about the invention and emergence of the word “yeet.” I observed that it was kind of cool to have a word with a known origin and etymology, even if that was only because it was a made-up word. My daughter instantly responded that “all words were made up by someone.”

What could I say? Of course it’s true!

I’ve previously talked about the difficulty that words present. In 2015 I discussed the perils of semantic coupling that could emerge when we get fooled by nouns. The existence of a noun makes us think we understand a concept. Once we try to define a predicate to answer “Is X an instance of Y?” for any noun Y it becomes difficult, verging on impossible, to find a categorical statement. Instead we fall back to the Potter Stewart method.

In Wittgenstein and Design (say that three times fast) I talked about pursuing adjectives instead of nouns as a way to carve a design space.

Today, I want to talk about how we use words as signifiers for their semiotic content. In particular, the words “manual” and “automated.”

Two-legs good, four-legs bad

We are now ten years into the DevOps era. Among both practitioners and adopters, there is a tendency to use “automated” as a pseudo-synonym (psynonym?) for “good” while “manual” stands in for “bad.” The trouble is that the closer you look the harder it gets to tell whether any particular thing is manual or automated!

Suppose we are in an incident. I invoke the “break glass” process to ssh into a server to run a bash script. Was that manual or automated? Well, both, sort of.

We are in an incident… probably initiated without human intervention based on monitoring systems that detected a triggering condition.
I invoke the break glass process… wait a second. How did I even get involved? Maybe the systems notified me directly via PagerDuty. That would have no human intervention. Or maybe our operations center decided to escalate to level 3 support, and I’m the on-call this week. In the second case, a human decided the escalation was required and clicked a button in ServiceNow. ServiceNow then used a database to contact me. Was that manual? Automated? Semi-automated?
I invoke the break glass process… wait another second. Once I’m involved, I have to bring information into my head. That information came from humans and systems. I have to then decide a course of action. I guess we’d call that manual? (Although “manual” derives from Latin “manus” which means hand powered, not brain powered.) Invoking the break glass process is an action in a system that I trigger by entering a rationale and clicking a button.
to ssh into a server… entirely facilitated by the systems.
to run a bash script… does a bash script count as automated? Or is it manual because I had to invoke the script? What if there’s no script but a wiki page with a list of commands I keystroke each time? Sounds more manual, but I’m still invoking tools that already exist. At some level, everything above toggling a program is automated.

Out of the Morass

Instead of applying a blanket statement like “manual” or “automated”, we should look more closely. Specifically, what actions are being executed by which people or systems via which tools in response to which stimuli.

When we engage with detail at that level we can begin to ask and answer more useful questions than “is it automated”. For example:

How long does it take from the stimulus to the action? Bear in mind that shorter is not always better.
What is the probability of error in performing the action? Toggling in that 1401 program… pretty high probability of error. Running a bash script… low probability of error. (But that probability rises geometrically with each argument to the script!)
What judgement or decision-making is required to choose an action in response to a stimulus? As we build ever-more-powerful levers to move our systems, and particularly as we give our systems their own internal feedback loops through the control plane, we need to think of them more like cybernetic systems. (Think about PID controllers, Kalman filters, inertial models, or creating a radar track from a series of intermittent “blips.”)

Breaking the question down this way won’t help us answer whether something is “automated” or “manual.” But it will help us answer how likely the process is to deliver availability, stability, security; or conversely, how likely it is to amplify noise, create oscillation, or induce drag.

Blocker? Pre-requisite.

2020-09-22T10:27:07-05:00

In discussions about change in a complex system I commonly hear people object, “We can’t do that because X.”

(That statement often follows a passive-aggressive prelude such as “That’s all well and good” or “being tactical for a moment.” Depending on your organizational culture you may also hear “That’s great in theory…” Or if your company is more aggressive-aggressive, “Get real!”)

My advice is to reformulate that statement. Treat the blocker as a missing prerequisite: “In order to do that, X must be true. Let’s see what it would take.”

At that point, you may find “We can’t do X because Y.” Keep going, turn Y into a prerequisite for X. As you continue this process, you’re building out a “future reality tree”. It is a tree of preconditions that ultimately arrive at a desirable effect.

Now comes the really hard part. You have to scrutinize the resulting tree. I recommend using the “Categories of Legitimate Reservation” that emerged from Eliyahu Goldratt’s work on the Theory of Constraints.

Once you’re satisfied that the tree is a true depiction of the preconditions, you need to get brutally honest and look for unintended consequences. For every precondition in your tree, ask “what other effects will result.” Those effects are consequences. You must add those to your tree; otherwise you only consider benefits, not costs or drawbacks.
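To make the shape of such a tree concrete, here is one possible sketch as a data structure. Everything in it is hypothetical: the Node type, its field names, and the example statements are invented for illustration and are not part of the Theory of Constraints vocabulary.

```haskell
-- One node in a hypothetical future reality tree.
data Node = Node
  { statement     :: String   -- the condition that must be true
  , prerequisites :: [Node]   -- what must be true first ("we can't do X because Y")
  , consequences  :: [String] -- other effects of making this condition true
  }

-- Leaves have no further prerequisites: these are where action can start.
startingPoints :: Node -> [String]
startingPoints (Node s [] _) = [s]
startingPoints (Node _ ps _) = concatMap startingPoints ps

-- Every consequence in the tree, so costs are weighed alongside benefits.
allConsequences :: Node -> [String]
allConsequences (Node _ ps cs) = cs ++ concatMap allConsequences ps

-- An invented example: the desired effect at the root, blockers
-- rewritten as prerequisites beneath it.
example :: Node
example =
  Node "Deploy daily"
    [ Node "Automated regression tests exist" []
        ["Test suite needs ongoing maintenance"]
    , Node "Release approval takes under a day"
        [ Node "Change board delegates low-risk approvals" [] [] ]
        ["Fewer eyes on each change"]
    ]
    []
```

Here startingPoints example lists the leaf preconditions where work could begin, and allConsequences example collects the side effects to weigh when evaluating the tree.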

When you’re done with this tree, you need to evaluate it. Does the net result of all the consequences produce a better outcome than the situation you’re in? Are the actions needed to create the preconditions possible? Feasible?

If you’ve truly captured the prerequisites and consequences, then people who both support the changes and dislike the changes should be able to agree on the truth of the tree. If not, you are either missing preconditions, disagree about the likelihood of the consequences, or you are working from different sets of axioms.

Delay Induces Lamination

2020-09-21T00:00:00+00:00

I’ve seen a repeated pattern that plays out in many companies. Delay, or more accurately, the perception of delay induces the creation of “extra” layers in the architecture. The pattern goes like this:

A component or subsystem needs to add a capability to serve some end-user need.
It will take “too long” to implement that capability in the component. (This is where the perception part really steps in.) Maybe the team is stretched too thinly. Maybe the capability is low value relative to the rest of the pipeline and gets scheduled out in the future. Maybe the team has a lot of technical debt to contend with. Or maybe it just really does take a long time to implement in a particular layer.
The requestor then moves up the call stack and looks for a component at a layer closer to the end user, so the capability can be added there. Often this means introducing a new layer between the end user and the “slow” component. This might be a kind of strategic maneuver to engulf and extinguish the other component. In a strongly political environment, you will see this play out as executives jockey for position against each other. It might be a good faith effort to create a new “orchestration” layer to bring together diverse capabilities.
If the effort succeeds, there is a loss of coherence: the new layer never implements the same interface as the one it decorates. So callers must decide which layer to invoke. Behavior differs. Maybe even the data available differs.
If there is more than one community of end users, they are likely to pick different layers to interact with. Legacy users may prefer to stick with calls to the lower layer, as they see only cost and no benefit to switching. Newer users may prefer the newer layer, especially if its interface style is more contemporary.

The net result is:

Increase in complexity and technical debt.
Increase in “organizational debt” (measured by the number of teams needed to effect a user-visible change.)
Customer frustration once they experience channel disparity.

Complexity Collapse

2020-09-20T11:53:00-05:00

There’s a pattern I’ve observed a few times through scientific and computing history. I think of it as “complexity collapse”. It’s probably related to Kuhn’s paradigm shift.

The pattern starts with an approach that worked in the past. Gaps in the approach lead to accretions and additions. These restore the approach to functionality, but at the expense of added complexity.

That added complexity at first appears preferable to rebuilding the approach from the ground up. Eventually, however, the tower of complexity becomes impossible to extend further. At this point, the field is ripe for a complexity collapse and replacement with a fundamentally different approach.

In the realm of science, this complexity collapse has led to the most famous reformulations in history:

Ancient astronomers assumed that the heavens were perfect. The stars were permanently fixed to a sphere, except for the “wanderers.” Planets and our moon, being heavenly bodies, must move in circles. The fly in the ointment was that circles alone could not explain apparent retrograde motion. Hipparchus and Ptolemy believed the explanation to be epicycles – circles superimposed on circles. Eventually, Copernicus showed that the number of epicycles needed would be drastically reduced with a heliocentric model. However, further improvements in optics and measurements caused the epicycles to proliferate again.
Kepler swept the epicycles away with a clean, simple explanation: orbits are ellipses. His three laws, derived with the aid of Tycho Brahe’s incredibly accurate observations, described the motion of all the wanderers with a single explanation. Newton later showed the inverse square law of gravity would produce those ellipses. Newton could not have discovered the universal law of gravitation in the paradigm of epicycles.
Near the end of the nineteenth century, physicists faced a similar tower of complexity when it came to explaining black-body radiation spectra. All the existing models for light, heat, and emission predicted much higher energy radiation at high frequencies than was actually observed. This would imply something called the “ultraviolet catastrophe” (which should absolutely be your next band name). It meant the night sky should be blazing with hard ultraviolet light. Not just that, but the farther you looked to the high frequency end of the spectrum, the higher the energy you would find. In other words, a black-body radiator could produce nearly infinite energy by just sitting there.
As with the epicycles, the first response was to add adjustments to fix the ultraviolet catastrophe within existing equations of classical mechanics. Many such models were created by theorists whose names are only known to students of science history today. They all focused on adding corrective terms–based on unknown mechanisms–to the high end of the spectrum.
Max Planck showed instead that the entire observed spectrum could be explained with one simple law. It only required that light came in particles (later called “photons”) whose energy depended on their wavelength rather than their mass times velocity. Planck swept away the complexity of the old model, replaced it with a simple set of equations, and laid the foundation for quantum mechanics.
In the Java programming world, the challenge of building data-based shared systems with HTML front ends led a collection of vendors (virtually all gone now, absorbed into either Oracle or IBM) to create the “Java Enterprise Edition” specifications including the notorious Enterprise JavaBeans (EJB). This built on a tower of complex specifications for remote invocation and activation, interface descriptions, several roles that didn’t exist before (or since.) This stack could indeed allow programmers to create HTML based applications with a database hiding in the shadows.
The Spring framework emerged as an alternative that focused instead on “plain old Java objects” (POJOs). It replaced complex interactions across development time, configuration time, deployment time, and run time with a simple model: objects that could be “injected”. Then it offered a collection of libraries that included classes useful for building the kind of applications developers needed to build.

Admittedly, the examples from astronomy and quantum physics are more fundamental to our understanding of the universe than XML-based dependency injection. But these examples all illustrate a similar dynamic. Complexity accumulates, a new theory replaces the old one, leading to complexity collapse.

All those examples include a common coda as well: complexity grows again!

The elliptical orbits of Kepler and Newton work when bodies are far from very massive objects. There is a “correction” needed. That correction was predicted in 1915 and observed in 1919 in what may be the only planetary occultation to reach the front page of the New York Times. The corrected theory neatly explained the same simple elliptical orbits. It also predicted that Mercury, being very close to our Sun, would exhibit orbital precession because it made a simple ellipse on curved space (a geodesic ellipse.) That new theory explained the earlier results along with some new ones… but we did have to fundamentally change our view of space and time, and open the door to black holes, the twins paradox, and a host of other counterintuitive (but verified!) phenomena. Einstein’s beautiful equation fits on a single line of one page… but it takes a stack of books on one side of you to understand the equation, and a taller stack of books on the other side to explore the incredible implications.
Planck’s law is simple, but nobody would make the same claim about the quantum mechanics it ushered in. The more we pursued Planck’s implications, the weirder our universe got.
Of Spring, many people have quoted Harvey Dent from “The Dark Knight”… “You either die a hero or live long enough to see yourself become the villain.”

Today, the contradictions between quantum mechanics and general relativity lead many physicists to look for a new model. Not adjustments but a new paradigm to sweep away and unify the towers of complexity in both fields.

In the realm of data-based applications with web-ish interfaces, complexity collapse led many to embrace Ruby on Rails or Node.js. Both ecosystems have had mini-collapses but no complete replacement, yet.

Staggering Skeleton

2020-09-19T20:09:53-05:00

We’ve talked before about a walking skeleton. That is a fully connected, but not very functional, system that includes all the major integrations. It serves to demonstrate that anything at all can run in the expected topology.

But some languages and frameworks ask you to get more correct to form a walking skeleton. Strongly typed languages, frameworks that require you to run from a non “-SNAPSHOT” library version, deployment tools that only fetch from official repositories, etc.

On the other hand, some languages and ecosystems let you put a bunch of half-broken, mostly inconsistent crap together and it still kind of works. There’s a good chance that the copy-and-pasted monstrosity will fall apart if you poke it with unexpected input. And it might blow up, crash, or delete the contents of your high school permanent record. But it still works just enough to say “I can improve on it from here.”

This is a staggering skeleton. It could fall over at any moment, but eppur si muove.

Depending on your background, you might view the staggering skeleton as yet another way we allow a mostly broken stack of complex, unreliable software to keep proliferating. Or you might say it makes an easier on-ramp and allows more people to contribute without forcing them through Hindley-Milner hoops.

Either way, you can’t deny that staggering skeletons were a big influence on the web. Early websites were a lot of copy-paste-and-modify. That’s part of how the web grew so fast.

Weakness Invites Competition

2020-09-15T00:00:00+00:00

Today, nobody wants to start up a competitor to Amazon. New ecommerce retailers aim at niche markets because Amazon is such a juggernaut and fierce competitor that it would be foolhardy to go against them.

Those niche retailers look at Amazon as an exit strategy more than a competitor. Like Microsoft in the 90’s, Amazon isn’t the competition, they’re the environment that any entrant deals with.

Competitors emerge when they sense an opportunity to take away market share from a weakened incumbent. This starts at the periphery: a new entrant takes away a small, uninteresting, or insignificant portion of the market.

The incumbent gradually finds themselves hemmed in by upstarts each nibbling away at their fringes.

The upstarts expand the edges into areas the incumbent never realized were part of their TAM. Meanwhile, they make incursions into the core.

Eventually one or two of the upstarts will become the dominant player in the new, expanded market.

This is the classic “Innovator’s Dilemma.” It starts when the dominant player is seen as vulnerable.

Scaffold or Straightjacket?

2020-08-27T00:00:00+00:00

Douglas Adams' classic sci-fi comedy novel “The Hitchhiker’s Guide to the Galaxy” opens with a bulldozer approaching Arthur Dent’s house. Since Arthur is still inside the house, he is naturally concerned.

When Arthur confronts the foreman of the demolition crew, he is informed that his house is to be destroyed to make way for a highway bypass. When discussing the public notice that the local planning office had posted, they have this conversation:

“But the plans were on display…”

“On display? I eventually had to go down to the cellar to find them.”

“That’s the display department.”

“With a flashlight.”

“Ah, well, the lights had probably gone.”

“So had the stairs.”

“But look, you found the notice, didn’t you?”

“Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard’.”

For the record, there was no leopard.

This is a funny moment for both the absurdity and the familiarity. Anyone who has interacted with a bureaucracy can recognize their experience in Arthur’s. Unfortunately, we tend to attach the name process to both that kind of experience and a very different one.

Process as (Accidental) Constraint

Some processes are deliberately designed to limit or constrain the “consumer” of the process. These are the exception though.

Most of the time, a group or department manager will create a process for how that particular group does its work. Where the trouble arises is that an Arthur Dent doesn’t just interact with that one group. Instead, he has to deal with several groups that each have their own processes. Each group knows their own process, but probably has no view into the processes of the other groups. They can point Arthur from their own department to another (to go get a signed form of some kind or another.)

Each group acted reasonably, but the experience from Arthur’s point of view is absurd.

As a personal anecdote, my wife is a US citizen who was born to two US parents in a US Army field hospital in Bangkok, Thailand. Consequently she had dual citizenship until age 18. At that point, the US State Department contacted her to declare which citizenship she intended to keep. She had to send a form back to that department, including a consular certificate of natural birth. Where would that certificate come from? The US State Department. In other words, she had to send a request to one office in the State Department to get a document to send back to another office in the State Department. For years we have joked that those offices are probably across the hall from each other.

We have here what the outside observer experiences as one “process.” But no-one can tell them the entire process because there is no global designer. It is a piecemeal assembly of constantly-changing internal departmental processes. Thus the whole picture is shrouded (the lights had gone) and there are leopards.

Taiichi Ohno’s Kind of Process

Taiichi Ohno created what we now call the Toyota Production System. It has inspired decades of study in quality and rapid improvement. From TPS we have gained vocabulary like “kanban”, “andon cord”, and “kaizen.” You could get a Master’s level course in process design by studying just Toyota and Waffle House. (The American restaurant chain.)

One of the most eye-opening things about TPS is how they approached processes. Every work station had the work process printed and posted right where the work is done. Every process was updated almost every day. In fact, Ohno would walk the factory floor looking for process documents that looked aged: stained paper, yellowing, tears, etc. He would then ask the worker why they had not learned anything in such a long time. He would then ask that worker’s manager why they had not learned anything. In TPS, kaizen is not a big enterprise initiative… it happens a thousand times a day in small groups of workers and their managers solving problems together.

On a previous project, we had an XP lab full of pairing stations. We wanted all the pairing stations to be identical: same OS, same configuration, same IDE. That way any pair could take any station on any day and be productive. To make this work, we had a wiki page with setup instructions. Every time we needed to do a fresh setup, we would walk through the instructions. I wrote the initial instructions, but even so I walked through the instructions each time to make sure I incorporated improvements that other people had made. If we found errors, we updated the process. If we found ways to improve efficiency, we updated the process. In fact, the most common kind of problem came because we didn’t do the process often enough so today we would probably reimage one station every day to make sure we kept pressure on improving that process. The consistency of the stations paid dividends every day because we never had contention for “the QA machine” or “the machine with the memory card reader.”

Scaffolding versus Straightjackets

It’s an unfortunate collision in the English language that both of these experiences have the word “process.”

Taiichi Ohno’s kind of process is a scaffolding. It supports the work and lifts up the worker to perform at a higher level of quality. It captures the best of what we’ve learned about how to do the work so that everyone can benefit.

Because the process is written and posted right with the work, it means that changing the process document actually changes the process. (As opposed to changing the document then holding training sessions, sending work in progress back to square one, and having stragglers following the old process for months.) In other words, they write the process down exactly so it can be changed!

Ohno’s processes allow the worker to improve his or her own work. The Arthur Dent style of process is defined by the worker for other people to follow. The difference is immense.

Deleting From Databases is Not Cleanup

2020-08-05T00:00:00+00:00

Creating thousands or millions of entities and then deleting them does not return your database to its initial state.

Queries won’t show the deleted entities, but operational results can.

For example, a table in an RDBMS may have extra storage segments allocated to it. These can generate higher I/O times until someone runs an analyze job to reset the table stats for the query planner. Some databases treat “DELETE FROM USER” very differently from “TRUNCATE USER”.

Some non-relational DBs use tombstone records to indicate where a deleted entity had been. That’s to facilitate eventual consistency when propagating the deletion overlaps with propagating other modifications.
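
As a rough illustration of the storage point, here is a minimal sketch using SQLite from Python's standard library. SQLite is only a stand-in here: other engines keep and release space differently, and the table name, row count, and the function name measure_delete_footprint are all made up for the example. It fills a table, deletes every row, and then looks at what the database file still holds.

```python
import os
import sqlite3
import tempfile


def measure_delete_footprint(rows=20000):
    """Fill a table, delete every row, and report what is left behind."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "demo.db")
        # isolation_level=None puts the connection in autocommit mode, so the
        # explicit BEGIN/COMMIT below is the only multi-statement transaction.
        conn = sqlite3.connect(path, isolation_level=None)
        conn.execute("PRAGMA auto_vacuum = NONE")  # the usual default, made explicit
        conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")

        conn.execute("BEGIN")
        conn.executemany(
            "INSERT INTO user (name) VALUES (?)",
            ((f"user-{i}",) for i in range(rows)),
        )
        conn.execute("COMMIT")
        size_when_full = os.path.getsize(path)

        # Queries no longer see the rows...
        conn.execute("DELETE FROM user")
        visible_rows = conn.execute("SELECT COUNT(*) FROM user").fetchone()[0]
        # ...but the pages they occupied are parked on the free list.
        free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
        size_after_delete = os.path.getsize(path)

        # Only an explicit maintenance step hands the space back.
        conn.execute("VACUUM")
        size_after_vacuum = os.path.getsize(path)
        conn.close()

    return {
        "visible_rows": visible_rows,
        "free_pages": free_pages,
        "size_when_full": size_when_full,
        "size_after_delete": size_after_delete,
        "size_after_vacuum": size_after_vacuum,
    }
```

Calling measure_delete_footprint() should report zero visible rows alongside a file that is no smaller than it was when full, and that only shrinks after the VACUUM. The exact byte counts depend on the SQLite build and page size, so I would not read anything into the specific numbers.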

Narrow but Deep?

2020-07-27T07:20:35-05:00

In “A Philosophy of Software Design,” (ISBN-13: 978-1732102200) John Ousterhout describes the ideal functional interface as “narrow but deep.” That is, it should not expose many methods or functions, but the ones it does expose should be powerful.

I have mixed reactions to this principle, so I’d like to explore some examples that support it and others that argue against it. Throughout this section, my lens is malleability.

First, imagine a somewhat typical Java domain object with a “broad but shallow” interface. That is, it exposes getters and setters for many attributes. That gives it a wide surface area. The functionality provided by those methods is slim. One could argue (and I have) that this is no better than making the object’s attributes public. It adheres to a naming convention that was created for 90’s era GUI builders and the pedantic rule that members ought to be private.

Thin as that Java object’s interface is, it can still inhibit change if any of the members are references to other objects. A caller must navigate a graph of references, thereby coupling to what should be the internal structure of the object and preventing the object from changing those internals. (cf. the Law of Demeter.) I will consider this example as supportive of the “Narrow but Deep” principle, in that we see a clear failure mode of the contrapositive.

Second, consider a more intelligent Java object that does not merely expose attributes but provides behavior beyond “addXxxListener” or “addOrderLine”. It likely has a wider interface, making it “broad but deep”. Would this object inhibit change? Possibly. In this case it largely depends on how much of that surface area any particular caller engages. The broader the object’s interface, the more specialized its use becomes. A very wide interface on an object that is used just once indicates it is exquisitely adapted to its current usage. One would not expect to use it in different compositions. On the other hand, an object with the same wide interface might be used in multiple contexts where the additional breadth indicates affordances added to facilitate reuse. This style would be characteristic of Smalltalk or its cousin Objective-C. “Broad but Deep” then sits on the border for me. I think it can work but easily becomes a barrier to change.

Third, let us consider a case that we might call “Narrow but Too Deep.” A very narrow interface would be something like the Interpreter pattern from the Gang of Four (ISBN-13: 978-0201633610). An interpreter has basically one method interpret which takes an object that supplies instructions. Perhaps the argument is an AST or even a string. This is very narrow and very deep. How does it do on change?

Change in the caller is very well handled. The caller can readily supply a different set of instructions to achieve new behavior. Change within the interpreter is a more complex question. Changes to the implementation of the interpret method are easily done. Callers have no visibility into the machinery of execution. Changes to the instruction set are more difficult. Additions to the instructions are easily accomplished. Forward-compatible modifications to the instructions are feasible (though they may or may not be easy.) Removal of instructions will be difficult. That is because callers have absolute freedom to construct their instructions however they like. Thus, narrowing the instruction set either requires an extended deprecation period for callers to upgrade, or it requires the ability for the interpreter’s authors to change the call sites.

When we consider those change cases together, we see that a) expansion is easy; b) modification is possible if it is forward compatible; and c) contraction is a breaking change. These are the characteristics of an interface! We have created a new level in which we’ve defined a broad (and potentially shallow) interface: the instruction set. This should not surprise us–after all the pattern is called “Interpreter” so the fact that we’ve created a new language is implicit in the pattern. It is the interface in that new language which becomes challenging to evolve.
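
To make that concrete, here is a toy sketch in Python. It is not the Gang of Four's example; the stack-machine instructions ("push", "add", "neg") and the class name are invented for illustration. The class has exactly one public method, yet the dictionary of instructions is what callers actually couple to.

```python
class Interpreter:
    """Narrow interface: the only public method is interpret()."""

    def __init__(self):
        # The instruction set is the real interface. It is broad, shallow,
        # and invisible in the method signature.
        self._ops = {
            "push": self._push,
            "add": self._add,
            "neg": self._neg,
        }

    def interpret(self, program):
        """Run a list of (name, *args) tuples and return the final stack."""
        stack = []
        for name, *args in program:
            try:
                op = self._ops[name]
            except KeyError:
                raise ValueError(f"unknown instruction: {name}") from None
            op(stack, *args)
        return stack

    @staticmethod
    def _push(stack, value):
        stack.append(value)

    @staticmethod
    def _add(stack):
        right = stack.pop()
        left = stack.pop()
        stack.append(left + right)

    @staticmethod
    def _neg(stack):
        stack.append(-stack.pop())


# A caller is free to compose instructions however it likes.
PROGRAM = [("push", 2), ("push", 3), ("add",), ("neg",)]
```

Interpreter().interpret(PROGRAM) returns [-5]. Adding a new entry to the dictionary breaks nobody. Deleting "neg" leaves the interpret signature untouched and still breaks every caller whose program mentions it, which is exactly the contraction problem.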

(Best advice about the interface in that language: you are a language designer. Design carefully. Be conservative about additions, because whatever you add will be very difficult to retract.)

We can see this same effect with interprocess communication interfaces as well. HTTP offers a narrow interface: headers, a handful of methods, a URL, and a payload. (The headers are probably the broadest part, especially when you consider their mutual interactions. But most non-browser use of HTTP is restricted to a tiny handful of those headers.) HTTP is too narrow by itself, so application programmers have variously adopted XMLRPC, SOAP, REST, and GraphQL to provide a new level of language atop raw HTTP. Let’s consider REST for a moment.

As a new level, we should think of a collection of REST resources as a language. Indeed, we commonly see resource representations, URL schemas, and response codes defined with an interface definition language (IDL) called OpenAPI. Looking back through previous IPC mechanisms, we always find some kind of IDL, whether it is called as such or not. That IDL in fact defines the new language level. The collection of IDLs in play within the boundaries of a set of collaborating applications supplies the grammar of that specific distributed system. Perhaps this is one reason why it is so difficult to achieve a coherent distributed system, because the grammar is amalgamated from many disparate sources that lack an overall design.

Another example of “Narrow but Too Deep” would be SQL. If you take a hard look at JDBC and strip away everything that is just there to construct other JDBC objects, you have basically two parts: execution and introspection. Introspection allows Java code to examine the constructs created and consumed by the SQL language. Ignore that for a moment and consider execution. Executing a SQL statement from inside an application closely resembles executing it from a command line. Submit a string to the database and read the results. Most of SQL reduces to one method: execute. There are variations that serve only to bridge between the two language levels: batching, cached statements, query versus modification. I consider these to be accumulated cruft that is not the essence of the interface.

Nobody, anywhere would say that using SQL from inside an application makes it easier to modify the system. Instead, as with our Interpreter pattern, we have to consider what the interface is in the new language level. That is, we must design the SQL interface to enhance malleability. One way to do that is to create views for consumers rather than having them tie directly to tables. This is the exact analog of programming to segregated interfaces on objects instead of directly engaging the objects' full complement of methods. Elaborate joins in application code are the SQL equivalent to violating the Law of Demeter.
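
Here is a small sketch of the view idea, again using SQLite from Python's standard library as a stand-in database. The table, view, and column names (customer, customer_v2, customer_directory, display_name) are hypothetical. The consumer's query is written once against the view, and the provider then restructures the tables underneath it.

```python
import sqlite3


def consumer_query(conn):
    # The consumer couples only to the view's columns, never to the tables.
    return conn.execute(
        "SELECT display_name FROM customer_directory ORDER BY display_name"
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")

    # Original schema: one column holds the whole name.
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, full_name TEXT);
        INSERT INTO customer (full_name) VALUES ('Ada Lovelace'), ('Grace Hopper');
        CREATE VIEW customer_directory AS
            SELECT id, full_name AS display_name FROM customer;
    """)
    before = consumer_query(conn)

    # The provider splits the name into two columns in a new table, then
    # redefines the view so that it still presents the same shape.
    conn.executescript("""
        DROP VIEW customer_directory;
        CREATE TABLE customer_v2 (
            id INTEGER PRIMARY KEY, given_name TEXT, family_name TEXT
        );
        INSERT INTO customer_v2 (id, given_name, family_name)
            VALUES (1, 'Ada', 'Lovelace'), (2, 'Grace', 'Hopper');
        DROP TABLE customer;
        CREATE VIEW customer_directory AS
            SELECT id, given_name || ' ' || family_name AS display_name
            FROM customer_v2;
    """)
    after = consumer_query(conn)

    conn.close()
    return before, after
```

Both calls to consumer_query return the same rows even though the table it ultimately reads from no longer exists in its original form. A consumer that had selected full_name straight from customer would have broken at the DROP TABLE.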

(Ironically, I usually see this problem solved in the exact opposite way: with a mapping layer on the caller side that makes it even easier to directly couple to the precise table definitions.)

We refer to the structure of tables and constraints in the database as a model, but we could also describe it as a grammar. We can only make statements in the language of the database that the grammar permits. And again, we find that it is easy to expand the grammar (schema), may or may not be easy to modify, and very difficult to contract.

This characteristic of creating a new language level seems to pop up every time we make the interface at one level sufficiently narrow and deep. Then we need to worry about coupling and malleability of the new level.

Consequences are not Pros or Cons

2020-06-28T00:00:00+00:00

I’ve noticed a pattern in much business writing, including technical writing. People feel compelled to label every effect as “pro” or “con”. I think this springs from our primary-school training in persuasive writing. As a result, what should be an engineering analysis often reads like marketing copy.

(A related effect: when writing persuasively, people tend to minimize or discount the effects they don’t like. Richard Feynman advised students to be their own harshest critics, to find ways to poke holes in their own arguments. It’s the only way to avoid fooling yourself, he said, and you are the easiest person in the world for you to fool.)

Instead, I suggest we first simply describe consequences, not benefits or problems. That’s because a consequence is just a statement about how the future will differ from the past. It is objective.

Whether you judge that consequence to be a “pro” or “con” depends entirely on your relationship to the change. If you perceive the change as an improvement to status quo then you call it a “pro”. If you don’t like the version of the future which includes that consequence, then you call it a “con”. That means labelling a consequence as a benefit is subjective. It describes the relationship of you and the change.

What about the changes that you don’t particularly like or dislike? The ones that are neither “pro” nor “con”? Most of the time those don’t get written down at all!

(As an aside, I also see technical writeups that include a list of “pros” for the recommended solution, where each “pro” precisely lines up with a “con” of the current world. This always tells me the author chose the solution first, then stated the problem in such a way that their chosen solution appears to be the best option.)

I recommend that you begin by listing the consequences. Find all the ways that the future will be unlike the past, if we choose that path. Look for second-order effects–the consequences of the consequences.

Look for interactions. How does this approach combine with other systems, processes, or people?

As you make this list of consequences, try to avoid coloring your thoughts about the consequences by what your intentions are. Whether you proposed a technical system, a process change, or a policy difference, once the change is made your intentions are irrelevant. Only the resulting system state matters. Anything the system allows will be done regardless of whether that matches your intended application. Therefore, think beyond the intended outcome or purpose of your approach. How could this be accidentally misused? Or deliberately abused?

Armed with this list, you will be ready to think about how the consequences affect you and your organization. This is when you judge whether the effect of those consequences creates a future that you like better than today.

Why did we stop at 2?

2020-06-24T00:00:00+00:00

In the dim reaches of Unix history, the first shell was written. It attached file descriptor 0 as a pipe from the TTY device to a process. That became “stdin”. File descriptor 1 is a pipe from the process out to the TTY. That’s “stdout”. I don’t know when FD 2 became “stderr” but it was early.

When you write a Unix program, you don’t have to open these file descriptors. They’re opened by the parent process, before it uses “exec” to load the new program’s code. So by the time “main()” is called in the child program, FDs 0, 1, and 2 are already connected.
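
A small sketch of that handoff, using Python's standard subprocess module. On POSIX systems it does the fork, the descriptor wiring, and the exec on our behalf. The child program and the function name launch are made up for the example. Notice that the child never opens anything: it writes straight to descriptors 1 and 2 and trusts that the parent has already connected them.

```python
import subprocess
import sys

# The child's entire program. It opens nothing; FDs 1 and 2 already exist
# by the time its first line runs.
CHILD_PROGRAM = (
    "import os\n"
    "os.write(1, b'hello from fd 1\\n')\n"
    "os.write(2, b'hello from fd 2\\n')\n"
)


def launch():
    """Play the parent: create two pipes, attach them as the child's
    FD 1 and FD 2, run the child, and return what came out of each pipe."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_PROGRAM],
        stdout=subprocess.PIPE,  # parent-created pipe becomes the child's FD 1
        stderr=subprocess.PIPE,  # parent-created pipe becomes the child's FD 2
        check=True,
    )
    return result.stdout, result.stderr
```

The parent decides where the bytes go. Swap subprocess.PIPE for an open file or a socket and the child's code does not change at all.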

For back end services, we kind of abandoned stdout for a long time, in favor of logging frameworks that wrote output into files. Then we added log scrapers and aggregators to gather those logs on a server.

That looked like this:

Logging framework (extra dependency in codebase, impediment to library composition) writes to file
Agent on host tails file, sends to collector
Collector daemon on log aggregator writes to FS there
Search engine indexes logs

Recently, stdout has had a bit of a renaissance with the advent of sidecars. Before your application starts (usually in a container now), the container platform connects a pipe to FD 1. The other end of that pipe goes to a socket which is connected to a “sidecar”. The sidecar reads from the socket and passes the data along to a log collector.

So now instead of this linkage that requires a logging framework inside your application, you just use builtin functions like printf() or System.out.println(). You still have to format the log line, which might want a library function in your app. But now different libraries that each spit to stdout can compose nicely. We’ll leave it up to the log collector and indexer to ingest different formats.

Let’s pursue this idea further. What else could we simply provide to an application by hooking up file descriptors before executing it?

Messaging Topics

When an application wants to use messaging, it has to include a client library that knows how to connect to the messaging service. That requires authentication so the application has to manage credentials to supply to the client library to connect to the messaging service.

Those credentials are not part of the application code base, they have to get mixed in by some build or deployment step.

Because the application has to include a client library, the application becomes specific to a particular messaging product.

What if we said “fd 3 and up are for messaging topics?” Each FD could be bound to a topic as either input or output. The application would just use “send” and “recv” socket operations on those FDs. (If we used “read” and “write” file operations on the FD, we’d have to figure out how many bytes to read. What we really want is “a message” as a unit.)

It would then be the responsibility of the runtime platform to supply “pipes” that connect those FDs for the application to the actual messaging infrastructure. We would certainly implement that connection via sidecars again.

With this approach, the application no longer needs a client library. The platform would be responsible to provide some messaging ability. Applications that need precise control over acknowledgements might not be able to use this but simple applications that don’t worry about batching or distributed transactions could go a long way with basic send and receive operations.
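To make that concrete, here is a rough Python sketch, assuming a Unix-like host. The demo function stands in for the platform and its sidecar, and a pair of Unix datagram sockets stands in for the real messaging infrastructure. The names app_main and demo, and the choice of datagram sockets to preserve message boundaries, are my assumptions and not part of any real platform.

```python
import socket

MAX_MESSAGE_BYTES = 65536  # assumed upper bound on the size of one message


def app_main(in_fd, out_fd):
    """Application code: it knows two inherited descriptors, no broker, no credentials."""
    with socket.socket(fileno=in_fd) as inbound, socket.socket(fileno=out_fd) as outbound:
        message = inbound.recv(MAX_MESSAGE_BYTES)  # one recv returns one whole message
        outbound.send(message.upper())             # one send emits one whole message


def demo():
    """Stand-in for the platform: wires up the 'topics' and then runs the application."""
    sidecar_pub, app_in = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    app_out, sidecar_sub = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    with sidecar_pub, sidecar_sub:
        sidecar_pub.send(b"first")
        sidecar_pub.send(b"second")
        # detach() hands over the raw descriptor, much as inheriting fd 3 and fd 4 would.
        app_main(app_in.detach(), app_out.detach())
        return sidecar_sub.recv(MAX_MESSAGE_BYTES)
```

The application reads exactly one message even though two were waiting, which is the "message as a unit" behavior that plain read and write on a byte stream would not give us.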

Databases

Similar situation with databases. Why do we need all kinds of specific wire protocols? How about a file descriptor that is connected directly to a database? Write to “stddb” and the DB gets it as a SQL statement or query. Read from “stddb” to read results.

Now the application doesn’t need driver libraries in it. Nor does it need to manage credentials. That would be part of the platform configuration for the application, so we’re separating concerns in a different way.
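A toy version of "stddb" might look like the following Python sketch. Everything about the wire format is an assumption on my part: one SQL statement per line going in, one JSON array of rows per line coming back. An in-memory SQLite database plays the part of the real database, and a thread plays the part of the sidecar.

```python
import json
import socket
import sqlite3
import threading


def sidecar(db_end):
    """Platform side: owns the driver and the connection details."""
    conn = sqlite3.connect(":memory:")
    with db_end, db_end.makefile("r", encoding="utf-8") as reader, \
            db_end.makefile("w", encoding="utf-8") as writer:
        for statement in reader:          # loop ends when the application closes its end
            rows = conn.execute(statement).fetchall()
            conn.commit()
            writer.write(json.dumps(rows) + "\n")
            writer.flush()
    conn.close()


def query(reader, writer, sql):
    """Application side: write a statement to 'stddb', read the result back."""
    writer.write(sql + "\n")
    writer.flush()
    return json.loads(reader.readline())


def demo():
    app_end, db_end = socket.socketpair()
    worker = threading.Thread(target=sidecar, args=(db_end,))
    worker.start()
    with app_end, app_end.makefile("r", encoding="utf-8") as reader, \
            app_end.makefile("w", encoding="utf-8") as writer:
        query(reader, writer, "CREATE TABLE hats (name TEXT)")
        query(reader, writer, "INSERT INTO hats VALUES ('bsd')")
        rows = query(reader, writer, "SELECT name FROM hats")
    worker.join()
    return rows
```

The query function is the whole "driver" as far as the application is concerned. A real design would need to say something about errors, multi-line statements, and result streaming, all of which this sketch ignores.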

Other Uses

What else could we simplify if we renew the idea that a program’s environment is set up by the runtime that launches the program?

Time Emerges From Events

2020-06-18T19:33:35-05:00

Without an event, no time passes. This may seem like an odd assertion. You may say, “I can see time passing all around me!” But how do you see it? Do you look at the ticking hands of a clock? In a mechanical clock, each tick is an event: when the tension on an escapement exceeds the friction between its prong and the gear, and the escapement knocks over to the other side with the familiar “tick.” That motion transfers to gears which torque the hands a bit further around.

A digital clock? An oscillating quartz crystal resonates at a frequency, causing a changing voltage. That voltage feeds a transistor, and when the voltage is high enough the transistor feeds current to a counter. Transistors inside the counter flip and flop, eventually charging some LCD segments and discharging others.

Events everywhere.

Then there are the photons that bounce off the clock into your eyeballs. They excite your retinal neurons which fire signals to your brain and trigger a whole new cascade of electrochemical activity.

Without all those events, you can’t even perceive the current time.

All clocks require physical interactions, whether mediated by springs and gears, quartz oscillators, or network packets (which arrive as self-propagating excitations of the electromagnetic field.)

What about computers? How do they understand time? Let’s start with the easy case of a physical machine like a laptop or desktop machine.

Inside the computer is an oscillator, just like in your digital clock. It may be a piezo-electric quartz oscillator, or it may be an “LC oscillator” (a capacitor and an inductor.) That oscillator emits a voltage to a clock circuit in your CPU which increments a counter. A program executing on that CPU can run an instruction like RDTSC to get that counter value. Your operating system gives the impression of multiple simultaneous programs by generating an interrupt every so often, which makes the CPU stop what it’s doing and go execute something else. Physical interactions all over the place! There’s the mechanical vibration of the crystal, or the back-and-forth of electric to magnetic field in the LC oscillator. In the CPU, the transistors flip on and off shuttling electrons around.

What about a virtual machine? It doesn’t have an oscillator, but the underlying host machine does. So the VM can send an I/O instruction to ask the “hypervisor” what its clock says. Or, after waking up the virtual machine, the hypervisor can just send an I/O packet to the VM with the current time. More events: all the physical interaction of the physical host’s clock, plus the electron-shuffling of I/O to the VM.

If you were to somehow stop all those physical interactions, time would not pass.

Reading List

2020-04-27T08:36:55-05:00

Architecture & Development

Required Reading

Architecture Decision Records
C4 Model (Note: we will only use the first 3 C's.)
Accelerate
Wardley Maps
Failure Modes and Continuous Resilience

Shared Mutable Team State

2019-03-21T08:36:55-05:00

Shared State

When programming distributed systems, the hardest kind of data to manage is shared mutable state. It requires some kind of synchronization between writers to avoid missed updates. And, after changes, it requires some kind of mechanism to restore coherence between readers.

I previously wrote about that idea of a coherence penalty as it applies to humans. Following those lines, we might regard the system of development teams in an organization as its own distributed system. Teams pass messages. Both sides must understand the semantics. Packets get lost. Nodes disappear.

Within that framework, we can consider the same dimensions of state as we would with a distributed computing system:

Local, immutable state. Easy.
Local, mutable state. Relatively easy to manage.
Shared, immutable state. Essentially write-once. This is a send-only (unicast or broadcast) item that doesn’t require further synchronization. (But see my note later about the time dimension.)
Shared, mutable state. Both synchronization and coherence penalties apply here.

So what would constitute shared mutable state between teams?

Mutable State for Humans

Teams and the humans on the teams carry around an understanding of how the system works. That definitely constitutes mutable state.

I think that the metadata used by the software also constitutes shared state. It may be mutable or immutable. More about that shortly.

The software these teams create has shared mutable state of its own. That would be data that the software creates and reads. The data may be at rest in a database or it may be in motion, in the form of messages being passed around.

For the teams that create the software, however, the shared state is the protocol or schema definition. When those change, synchronization and coherence mechanisms are required. To some extent, this is just a consequence of Conway’s Law, but it’s taken me ten years to understand it.

Consequences of Shared State

For teams to move quickly and independently, we want to minimize the synchronization and coherence delays between teams, in exactly the same way we would do when making the software itself more scalable. So we want to reduce the amount of shared, mutable metadata across team boundaries.

Some corollaries.

Less Shared Metadata Means Fewer Penalties

Every API has a schema. That means every novel API becomes a new piece of shared state. If you expect to evolve that API, you are planning to mutate the state. Find out if there will be multiple writers!

Where possible, favor a new implementation of an existing API to reduce the amount of state involved. Consider using standard media types and representations, or creating local standards. The time spent creating the standard definitely counts as a synchronization delay, but at least it is explicitly recognized rather than buried in Jira tickets. Also, this time spent creating the standard may cause you to create a better definition that won’t need to change as much. Thus you trade a larger early penalty for repeated penalties later.

Integration via database table maximizes the need for concurrent mutation of the schema. This is why we’ve come to believe that we should avoid such integration. But again, there may be a place to use it effectively, so long as we recognize the effect on our team-scalability.

Immutable Metadata

Shared, immutable data allows consuming software to scale better by avoiding propagation delays. Shared, immutable data also benefits from caching and can use a publication model.

The same goes for teams. API or schema definitions that never change only require publication. But do they allow for change? Yes, with some constraints.

If every change is strictly additive then we can consider the “publication date” of an updated protocol definition to be part of the protocol’s name. Thus, it isn’t a revision of the old protocol, but rather a new protocol entirely that derives from the old one without replacing or invalidating it.

For instance, the existence of HTTP/2 does not mean that HTTP/1.1 no longer exists.

Likewise, you may create a new API definition under a new name. As long as you continue supporting the old definition, then you have not mutated the old shared state, you’ve just created a new piece of immutable shared data.
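One way to picture "publication date as part of the name" is a publish-only registry. This is a minimal Python sketch with hypothetical names (SchemaRegistry, publish, and the example dates are all mine). A definition is identified by its name plus its publication date, it can derive from an earlier one by adding fields, and nothing that has been published can be changed.

```python
class SchemaRegistry:
    """Publish-only registry: definitions can be added, never modified."""

    def __init__(self):
        self._schemas = {}

    def publish(self, name, published, fields, derives_from=None):
        key = (name, published)  # the publication date is part of the identity
        if key in self._schemas:
            raise ValueError(f"{name}@{published} is already published and immutable")
        base = self._schemas[derives_from] if derives_from else frozenset()
        self._schemas[key] = base | frozenset(fields)  # strictly additive
        return key

    def fields(self, key):
        return self._schemas[key]


registry = SchemaRegistry()
old = registry.publish("hat-request", "2018-04-28", {"subject", "thing"})
new = registry.publish("hat-request", "2018-09-01", {"status"}, derives_from=old)
```

After the second publication, the old key still resolves to exactly the fields it always had. Consumers of the old definition never need to hear about the new one.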

The technology we use doesn’t make it easy to maintain multiples of some shared state. For example, RDBMSs have no way to express the idea that the new schema is a copy of the old schema with an extra table. Not only is their data model all about “update in place” but their metadata is also “update in place.” Similarly, most of our frameworks for writing APIs are too explicit about routes in URLs. They bake in URL parts like “/api/v1” in every route so it is hard to say that “v3” is “v2” with some changes, and “v2” is “v1” with some changes.
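For contrast, here is a small Python sketch of what it could look like if a framework let us say "v2 is v1 with some changes." No particular framework is implied. The route tables, handlers, and the mount helper are all hypothetical.

```python
def list_hats():
    return ["bsd", "docker"]


def list_hats_with_wearers():
    return [{"hat": "bsd", "wearers": 3}, {"hat": "docker", "wearers": 5}]


def list_requests():
    return []


# Each version is expressed as a delta on the previous one.
ROUTES_V1 = {("GET", "/hats"): list_hats}
ROUTES_V2 = {**ROUTES_V1, ("GET", "/requests"): list_requests}          # v1 plus a route
ROUTES_V3 = {**ROUTES_V2, ("GET", "/hats"): list_hats_with_wearers}     # v2 with one change


def mount(versions):
    """Expand the per-version tables into one table keyed by full URL path."""
    table = {}
    for version, routes in versions.items():
        for (method, path), handler in routes.items():
            table[(method, f"/api/{version}{path}")] = handler
    return table


APP = mount({"v1": ROUTES_V1, "v2": ROUTES_V2, "v3": ROUTES_V3})
```

The version prefix appears in exactly one place, and every older version stays mounted unchanged next to the newer ones.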

Consider Structuring Teams Around Shared State

This is the dual of Conway’s Law. One way to decide team boundaries is around interfaces. That is, set up your teams such that there is a team boundary everywhere you want an interface.

That interface definition is shared state which may be mutable. So, consider also drawing team boundaries to maximize ownership over that state. Transform it from shared state to private state and the rate of mutation matters less. Of course, as soon as you draw those new lines you may have created new interfaces, so look carefully for team designs that reduce the global amount of shared mutable state.

If you follow that approach when considering all the different interfaces you must negotiate, then everything gets sucked into a single gigantic team. I think this is why there’s a kind of “gravitational” attraction that tries to pull interacting pieces of software into one mass.

Maybe it’s like the life of a star. The life of a star is the unsteady conflict between gravity and pressure. Gravity tries to collapse the star, which creates fusion. Fusion makes pressure which holds the star up from collapse.

In software, shared mutable state at interface boundaries plays the role of gravity. Taken to the limit you get monoliths. Communication overhead and coherence penalties (scaling quadratically with team size) act like pressure. Taken to their limit you get pico-services with solo owners. Rules like the two-pizza team are meant to impose a constraint via force majeure.
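The quadratic part is easy to see with a little arithmetic. This sketch assumes the simplest model, where every pair of people on a team is a potential communication path. That model is my assumption here, not a measurement.

```python
def communication_paths(team_size):
    """Number of distinct pairs in a team of the given size: n * (n - 1) / 2."""
    return team_size * (team_size - 1) // 2


for size in (2, 4, 8, 16):
    print(size, communication_paths(size))
```

Doubling the team from 8 to 16 takes the number of pairs from 28 to 120, which is the "pressure" that pushes back against pulling everything into one team.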

More to Explore

Some of what we know from fallible message-passing networks can extend to the system that creates the software systems. But we must also keep in mind that people have resilience mechanisms that computers lack. “Hey, did you get my email?” actually works with humans. Humans can also switch from discussing their shared state (say, a protocol definition) to negotiating a new meta-model for that shared state (the meta-meta-model for the data the software will pass.) Software systems cannot renegotiate their protocols dynamically.

There may be more insight available from looking at team and organizational structure as a distributed system.

My Favorite Bit of Language Design

2018-12-26T09:08:55-05:00

An elegant design conserves mechanisms. It combines a small number of primitives in various ways. When I first learned about this elegant bit of design in Smalltalk-80, I laughed with delight.

In Smalltalk, the primitives are “object” and “message”. That’s basically it – except for blocks, which we will see a little later. Behavior arises via objects sending messages to each other. In fact, Smalltalk doesn’t even need control structures in the language grammar. Objects and messages suffice. How does a language without control structures do anything useful? How can any conditional logic work?

The key is with the class hierarchy for Boolean, True, and False.

In most languages, “boolean” is a primitive type that doesn’t have any behavior. True and false are values of the type boolean. Not so in Smalltalk. Boolean is an abstract class that has two subclasses: True and False.

Boolean defines selectors like ifTrue: and ifTrue:ifFalse: but does not implement them. Each parameter is a block: an object wrapping a chunk of behavior that can be invoked later. (Ruby also calls these blocks, but only allows one at the end of a parameter list.) In Smalltalk, arguments are interleaved with the words in the method selectors. Here’s an example from Squeak, a modern Smalltalk, in the Character class:

isSeparator
   "Answer whether the receiver is one of the separator characters--space, 
   cr, tab, line feed, or form feed."

   | integerValue |
   (integerValue := self asInteger) > 32 ifTrue: [ ^false ].
   integerValue
   	caseOf: {
   		[ 32 "space" ] -> [ ^true ].
   		[ 9 "tab" ] -> [ ^true ].
   		[ 13 "cr"] -> [ ^true ].
   		[ 10 "line feed" ] -> [ ^true ].
   		[ 12 "form feed"] -> [ ^true ] }
   	otherwise: [ ^false  ]

The first line just names the method. The quoted string is documentation visible in the class browser. | integerValue | says this method uses one local variable. Then we get to the interesting line. (integerValue := self asInteger) sends the asInteger message to self and assigns the result to integerValue. The assignment returns the value which was assigned, an integer object. Next, the > message is sent to the integer object, with 32 as a parameter. Yes, comparison “operators” are also just messages sent to objects. Every number is an object. The result of > is an instance of Boolean. So the paradoxical-seeming ifTrue: [ ^false ] will be sent to whichever Boolean was returned from >. The caret just means “return” and false is a literal that names the singleton instance of the class False.

That’s a lot of messages in one short line of code. Thanks to the hard work of many brilliant programmers and quite a few transistor-doublings since 1980, it performs well today. There are also many tricks with pointer tagging and flyweight objects that make it reasonable to have numbers represented with objects.

Now we get to the punchline and the genius of Smalltalk’s little trio for Boolean logic. So True implements ifTrue: something like this:

ifTrue: trueBlock [
  "We are true -- evaluate trueBlock"

  
  ^trueBlock value
]

(This sample is from GNU Smalltalk.)

True knows it is true, so it unconditionally evaluates the block. It won’t surprise you to see that False implements ifTrue: like this:

ifTrue: trueBlock [
  "We are false -- answer nil"

  
  ^nil
]

All the other variants such as ifTrue:ifFalse: and ifFalse: are implemented similarly. In fact, and: and or: operate the same way.

The beautiful part about this is how a small number of features, used consistently and pervasively, combine to allow simplicity to emerge.

Control structures can be discarded in favor of objects sending messages and evaluating blocks. Polymorphism subsumes conditionality, but it only works if objects are pervasive. If Smalltalk had the same split between boxed numerics versus primitive numbers that Java uses, this wouldn’t work. Numbers must be objects. True and false must be objects, not primitive values or puns for distinguished values of uint8_t.
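For readers who don't have a Smalltalk image handy, here is a rough translation of the trio into Python. It is only a sketch: the class and method names are mine, zero-argument callables stand in for blocks, and Python of course still has its own if statement. The point is that none of this code needs it.

```python
class TrueObject:
    """Plays the role of Smalltalk's True: always takes the 'true' branch."""

    def if_true(self, true_block):
        return true_block()

    def if_true_if_false(self, true_block, false_block):
        return true_block()

    def and_(self, other_block):
        return other_block()

    def or_(self, other_block):
        return self


class FalseObject:
    """Plays the role of Smalltalk's False: never takes the 'true' branch."""

    def if_true(self, true_block):
        return None

    def if_true_if_false(self, true_block, false_block):
        return false_block()

    def and_(self, other_block):
        return self

    def or_(self, other_block):
        return other_block()


TRUE = TrueObject()
FALSE = FalseObject()


def greater_than(a, b):
    """Bridge from Python numbers to our boolean objects, by indexing rather than branching."""
    return (FALSE, TRUE)[a > b]


answer = greater_than(65, 32).if_true_if_false(lambda: "not a separator", lambda: "maybe")
```

Dispatch on the receiver's class does all the work that a conditional would do. The bridge function is the one place where the sketch leans on Python's built-in booleans, because Python's numbers don't answer with our objects the way Smalltalk's do.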

Since I learned about Smalltalk’s elegant trio, I’ve tried to apply the same principle in my own designs. Maybe we can push an idea farther. Make it more pervasive. Get more mileage out of it. Represent some other behavior (like conditionality) with a simpler but more general idea (like polymorphism.) Ask the question, “What if we made everything an X?” for some value of X.

Networking Topics

2018-09-30T12:59:52-05:00

Another quick post based on a Twitter exchange. (Maybe this will help save content from the ephemera of Tweets.)

A short, incomplete list of topics in networking that programmers should know about:

ICMP messages
Frame size and fragmentation
Socket options
Listen queue and behavior when full.
All the timeouts and why they exist.
When read, write, and connect calls block and why.
When memory buffers are copied and how to avoid.

A reference I love is the encyclopedic The TCP/IP Guide (note: affiliate link.) It’s got detail on pretty much everything, including bogons and the secret masters of the Internet—the DNS root servers and BGP administration.

Joyful Isolation

2018-09-27T09:08:55-05:00

Way back in January, Sam Newman tweeted this (perhaps rhetorical) question:

I was in the middle of creating this slide (wrt patch hygiene) and had to stop half-way through and ask myself - aren’t we all just making this worse? pic.twitter.com/fCTAYDc3Pn
— Sam Newman (@samnewman) January 14, 2018

It got a handful of retweets recently, and I responded with:

I've said it before, but each of these layers is another attempt to achieve isolation between apps. It could (should?) be fixed with a new OS at the bottom, ditching every layer above that.
— Michael Nygard (@mtnygard) September 24, 2018

Which definitely needs some expansion as Ola Bini pointed out. So here goes. (Caution: long ramble ahead. Second caution: I’m going to gloss over a lot of details in an effort to convey a bigger picture.)

The textbook definition of an operating system is that it provides process isolation, memory management, and hardware abstraction. Some useful operating systems have been built that remove various parts of this definition, but for now I’ll use that.

Let’s look at what various degrees of process isolation could mean and why we’ve got that stack of stuff in Sam’s slide.

The most basic degree of isolation would be that one process cannot read or modify another process’s memory. “Modern” operating systems like Linux, Windows, and macOS do pretty well on that. (They’re modern in the sense of “widely used today” but all are based on 30+ year old foundations.)

Memory isolation offers some degree of security. Security will be a recurring theme.

Other conflicts between processes might arise from error rather than malice. Overusing the CPU, for example. Or consuming all available memory. This would allow one process to unfairly deny service to other processes, so an operating system must also enforce usage limits.

When today’s operating systems were invented (and here I’m describing the Linux kernel as an instance of the Unix family), the idea of multitenant workload was strictly a mainframe concern. Mini- and micro-computers barely existed. The largest networks consisted of a few dozen intermittently connected machines. Most of the users knew each other by first name. Active, anonymous threats were unknown.

Benign noninterference between processes sufficed.

Process isolation now needs to mean much more than just memory protection and quota enforcement. In fact, the definition of “process” breaks down a bit, too.

In an operating system, a “process” consists of allocated memory (some of which may be paged out to storage), a memory mapping, and control information: threads' stacks, open files, network sockets, entitlements or permissions, interrupt vectors, and so on. The operating system prevents one process from interfering with another, but it doesn’t prevent it from detecting the presence of others.

Preventing that detection is exactly what’s needed for multitenant cloud workload. A process from user A should have no way to detect the presence or absence of a process from user B. They might come from competing organizations. For government workload, they might operate under different security classification schemes.

As we look at the stack of virtualization and containerization in Sam’s slide, we can see how each layer attempts to plug some detection holes in lower layers.

The hypervisor is an operating system. It runs other operating systems because the guest operating systems are bad at preventing detection.

For example, each process should have its own IP address so it cannot detect other processes by their use of TCP ports it would like to occupy.
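This kind of detection takes only a few lines. The Python sketch below uses hypothetical function names and the loopback address. One "process" holds a TCP port and another learns that fact just by trying to bind the same port. Strictly, a failed bind can have other causes (a privileged port, for instance), so treat this as an illustration rather than a reliable probe.

```python
import socket


def port_is_taken(port, host="127.0.0.1"):
    """Return True if binding fails, which usually means some other socket holds the port."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
    except OSError:
        return True
    else:
        return False
    finally:
        probe.close()


def demo():
    """A 'neighbor' workload grabs a port; we observe it without any special permission."""
    neighbor = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    neighbor.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
    neighbor.listen()
    port = neighbor.getsockname()[1]
    seen_while_running = port_is_taken(port)
    neighbor.close()
    seen_after_exit = port_is_taken(port)
    return seen_while_running, seen_after_exit
```

With a shared IP address, the first probe reveals the neighbor's presence and the second typically reveals its absence. Give each workload its own address and the question can't even be asked.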

Each process should appear to have full control over the filesystem. Otherwise, processes could detect each other via changes to files. (Implemented by the VM, and again by the container.) That means the application’s own configuration files should be isolated. But it also means the operating system configurations should be isolated.

Each process should have its own namespace for users. Otherwise they could detect each other via the user listing. (Implemented by the VM and again by containers.)

An aside about containers: a “container” is a process with its own view of a filesystem plus an isolated “namespace” for kernel objects. That means a process running in a container is really executing on the same underlying kernel as the host operating system. It’s just not allowed to see other processes. Add a virtual NIC and IP address to the container and it has the kind of isolation I’m talking about.

When we look at this stack of layers in terms of detection-prevention, the crucial need for strong patch hygiene becomes clear. Any hole in an underlying layer allows detections that should not be allowed. Since no layer really provides perfect isolation, we must treat a patch at any layer with the same priority as a ring 0 bug in the lowest level.

(I also wonder if mainframes still have something to offer here. I just don’t know enough about their operating systems to say one way or the other. But think about this: IBM had virtual machines in the 1960’s.)

What could we do to create an operating system that meets our needs today?

Elevate non-detectability to the primary design goal. There should be no call or action an isolated workload can perform that would reveal the presence or absence of other workload on the same system. That includes other instances of the same workload!

A program can’t know what physical host it runs on. In a really extreme interpretation, programs can’t even be allowed to sample the clock too quickly, or else they could use timing attacks to detect other workloads!

Such non-detectability is not possible with Unix-style kernels. Likewise for Windows kernels. A microkernel like Mach might be able to achieve it, but Darwin as built would not. All of these embed the multi-user, multi-process, shared-filesystem model too deeply. Thus, the stack of virtualization and containerization.

There are some capability-based operating systems that offer promise. seL4 comes to mind.

I find unikernels interesting as a way of packaging applications. An operating system that aims toward true non-detectability might well use such a “super-fat binary” as a unikernel. It would carry the program text along with the expected filesystem. (A program binary today is mostly an image of the bytes that will go into memory for execution, called the text. There is some additional information about variable initialization and relinking symbols based on their actual load address.)

Functions as a service certainly step toward greater isolation. Each function execution might as well happen in a new operating system, as far as the function itself can tell.

It’s likely that this kind of operating system would have a very different notion of the “unit of workload” than a process. A process with threads is a compromise notion anyway. It allows the threads to share each other’s memory but assigns permissions, resources, and quotas for the collection of threads.

In a container, we get these levels of grouping:

The container has a process space (meaning PIDs), IP address, sockets, file descriptors, file system, and user base. It has an overall quota on CPU, memory, and network usage.
A process in the container has permissions of one user, resources, fine-grained quotas. It cannot see the memory of other processes in the container.
Threads in a process share memory, but do not have their own permissions or quotas.

If we extend that to cover the VM, hypervisor, and host operating system, we get 6 levels of grouping but each level has a totally different model.

I don’t know what the design would look like if we aimed for a homogeneous structure that allowed grouping or isolation at each level. It would probably look more like an Erlang supervision tree or seL4 style capability delegation. It would look very different from the Unix-derived systems we have now.

Evolving Away From Entities

2018-04-28T16:48:10-05:00

Hat tip to Stuart Halloway… once again a 10 minute conversation with Stu grew into a combination of code and writing that helped me clarify my thoughts.

I've been working on new content for my Monolith to Microservices workshop. As the name implies, we start with a monolith. Everyone gets their own fully operational production environment, running a fork of the code for Lobsters. It's a link sharing site with a small but active group of users. I chose that application for a few reasons:

It's very well written and has good internal structure. That makes it malleable enough to use in a classroom setting.
It's small enough to be useful during a week-long workshop.
The domain is familiar enough that students don't need a ton of domain-specific introduction.

One of the features of Lobsters is "hats." From the site's own description:

Hats are a more formal process of allowing users to post comments while "wearing such and such hat" to give their words more authority (such as an employee speaking for the company, or an open source developer speaking for the project).

Hats are the first feature that we factor out into its own microservice. I thought it might be interesting to walk through that process and how the new service is defined.

This is going to be a long post, because I'm trying to recapitulate my whole thought process. Please let me know if I've skipped steps.

Point of Departure

The feature basically works like this:

A user can have zero or more "hats." Each hat has a short name that designates a project or product. Examples include "bsd" or "docker".
When posting a comment or sending a private message, the user can choose a hat to "wear". This could be to demonstrate credibility or make an official statement.
It's the job of site admins to verify the user's identity and standing relative to the hat.

Given that description, it's pretty natural to think about an API like this one. (Follow the link to read an API doc in "Blueprint" format.)

As you read the API description, notice that most of the routes read like "create this thing", "delete this thing", and so on. It sounds suspiciously close to CRUD, and that should trigger an uneasy memory about entity services.

Entity services are what you get when you only think about the data and not how you are going to use it.

A more subtle problem with this API is that it provides the wrong point of entry for the most common use case. When a user starts posting a comment, the site needs to find out what hats that user wears. It seldom needs to find all the wearers for a hat.

We can plaster over this gap by adding more routes to the API. But there's probably a better way to approach the whole issue.

Think About Behavior

If I have a mantra for architecture and design, it's "Think about the behavior." I advise people to evaluate their designs by walking through use cases. What components have the ability to make progress toward the goal? How does the flow of control get there? What information do I need to supply to it?

If we think about behavior in the "hats" feature, we'll see that the original API has some big gaps.

How does an admin know that someone needs a hat?
When does the system need to read all the users for a hat?
Can someone who has a hat bestow it on someone else?
What happens if someone has a hat when they make a comment but later loses that hat?

Let's take a behavior-oriented view on the hats. In particular, let's think about the lifecycle:

A user requests an admin to bestow a hat upon them.
The admin can grant that request. In that case, the hat is attached to the user.
The admin can reject that request.

If we stick with the CRUD style API, that behavior is "pushed" out to someplace else. Requests, approvals, rejections all have to go in the caller. What's left in the service isn't enough to be interesting. We'd have a caller that still knows all the details of the data. Any other callers of the Hats service would also be completely coupled to the details of the data. It might as well just be an RDBMS table.

Why would we just take a single table from an RDBMS and put it on the far end of an HTTP interface?

Take Two

Let's try making an API that maps the actions in the original controller. After all, we're decomposing a monolith into microservices. The monolith already works so we know the current design solves for the features needed.

The controller has these methods:

index
build_request
create_request
requests_index
approve_request
reject_request

Notice something interesting about those methods? None of them talk about hats! The only one that actually cares about hats is index, and all it does is get all the hats to serve the hats page that shows everything. That's the least interesting of the behaviors. The hats controller appears to be almost entirely about the process of requesting a hat and approving or rejecting such a request. Let's set a bookmark called "Requests" here and come back to it later.

Where do hats actually get used? Let's look at comments. When a user starts to post a comment or reply to another comment, there's this bit of code that checks to see if the user has any hats. If so, the comment fragment offers the user the option to "put on a hat". That gets carried through to the comments controller, where the hat is attached to the comment. (Now we know what happens if the user doffs a hat… old comments still show that they did wear the hat at the time of the comment. Nice!)

The comments controller doesn't have any methods about the hats, although it does use hat data. It gets the hat data from the User model which has_many :hats.

Now we understand the feature much better. Instead of talking about it in the abstract, we can talk about when each part of the feature is activated and used. We understand the lifecycle of the data: who creates it? When? In response to what signals?

All of these are essential to designing successful microservices! If you try to design services in a vacuum, you'll find they don't work together and you need a bunch of glue code in the service consumers. (Hint: if you start trying to solve distributed two-phase commit across microservices, then you haven't gotten concrete enough about the actual use cases.)

Not One But Two

Recall that the HatsController class seemed to be about requesting a hat more than the actual hat? Suppose we created microservice methods like:

Request hat
Approve hat request
Reject hat request
List pending hat requests
See my hat request

That would pretty much map one-to-one with the existing HatsController methods. Seems like an easy way to solve the problem. The only problem is that it seems a touch too specific. We can often make microservices simpler, more general, and more useful by abstracting the interface up one step. Let's try that out:

Request "thing"
Approve request for "thing"
Reject request for "thing"
List pending requests
See my request

"Request thing" and "Approve or reject request" seem to be pretty general ideas. I bet you can think of half a dozen other uses for that concept in your company. What is the "thing" though? We need some kind of concrete representation, right? Let's try to avoid premature commitment to that. I want to see how long we can just use a URL to identify the thing.

With this in mind, take a look at the new request service API. Like the HatsController, this doesn't say anything about hats. In fact, it seems to rely on external information for almost everything. It uses URLs to identify the person (or system) making the request, the thing being requested, and the person (or system) that approves or rejects the request.
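As a rough illustration of the shape of such a service (not the actual API definition linked above), here is an in-memory Python sketch. The class name, method names, record fields, and example URLs are all hypothetical. What matters is that the subject, the thing, and the approver are each just a URL, and nothing in the service mentions hats.

```python
import itertools


class RequestService:
    """In-memory stand-in for a generic request/approval service."""

    def __init__(self):
        self._requests = {}
        self._ids = itertools.count(1)

    def request(self, subject_url, thing_url):
        """Record that the subject asks for the thing; returns the new request id."""
        request_id = next(self._ids)
        self._requests[request_id] = {
            "id": request_id,
            "subject": subject_url,
            "thing": thing_url,
            "status": "pending",
            "decided_by": None,
        }
        return request_id

    def _decide(self, request_id, approver_url, status):
        record = self._requests[request_id]
        if record["status"] != "pending":
            raise ValueError(f"request {request_id} was already {record['status']}")
        record["status"] = status
        record["decided_by"] = approver_url

    def approve(self, request_id, approver_url):
        self._decide(request_id, approver_url, "approved")

    def reject(self, request_id, approver_url):
        self._decide(request_id, approver_url, "rejected")

    def pending(self):
        return [dict(r) for r in self._requests.values() if r["status"] == "pending"]

    def get(self, request_id):
        return dict(self._requests[request_id])


service = RequestService()
hat_request = service.request("https://example.com/users/alice",
                              "https://example.com/hats/bsd")
service.approve(hat_request, "https://example.com/users/admin")
```

The same five operations would serve a request for a hat, a laptop, or access to a repository. Only the URLs change.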

This may seem like premature generalization and you may cry "YAGNI" at me. I understand. But there's something about YAGNI you have to keep in mind… it applies when you can keep the cost of change low and refactor across interfaces. Microservices do well at keeping the cost of change low, but are much more difficult when refactoring. The whole idea is that a service interface is isomorphic to a boundary in your organization. So we don't have collective code ownership, we don't have refactoring across the interface, and I contend that YAGNI must be greatly weakened as a rule.

What Was That About Hats Again?

Requests are sorted, more or less. Now we need to turn our attention to the question of the actual hats. We saw earlier that hats appear in three places:

When building a page with comments on it, the (initially hidden) comment form needs to know what hats a user has.
When posting a comment, copy the ID of whatever hat the user was wearing into the comment itself.
When rendering a comment, display the text of whatever hat is attached to the comment.

Hats for a User

We can mostly handle the first case by querying for requests by subject (the subject being the user.) This could be done when the user logs in or the first time the user goes to a comment page.

However, the current method for querying by subject will return too much. First of all, it will return requests that are pending or were rejected. We can easily handle that using a matrix-query style of URL with both subject and status as parameters. Second, if we really do use the Request service for more than the hats feature, we don't want other "kinds" of requests appearing in the comments page. This one is trickier, since it needs a kind of meta-data that doesn't exist on the current definition of the Request service.
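To make the matrix-query idea concrete, here is a small sketch that builds such a URL. The /requests path and the parameter names subject and status are assumptions for illustration, not the real service's contract.

```python
from urllib.parse import quote


def matrix_url(path: str, **params: str) -> str:
    """Append matrix parameters (;name=value) to a path segment."""
    # Percent-encode each value fully, since the values are themselves URLs.
    parts = [f"{name}={quote(value, safe='')}" for name, value in params.items()]
    return ";".join([path, *parts])


# Approved requests for one user (hypothetical URLs).
url = matrix_url(
    "/requests",
    subject="https://example.com/users/42",
    status="approved",
)
```

The status filter takes care of pending and rejected requests. It does nothing about other "kinds" of requests, which is the harder problem.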

I'm not going to add that metadata just yet. That's because my workshop simulates the process of progressively splitting services out from a monolith. It's a common case to discover that your existing functionality is a subset of some more general, more valuable use case. That's when you go back and apply some data migrations and define a new API that deals with the general case. You then make the original API "magically" add the new metadata.

This may result in API names like "foo2" and "baz3." That's OK. The refactored, evolved version of your system won't look like a greenfield design would. Your system will show its history. Don't think of that as ugly scars. Think of it like laugh lines.

Adding a hat to a comment

When we find out what hats a user has, we get a list of URLs. Adding a hat to a comment doesn't need any additional interaction with requests or hats. Just copy the URL onto the comment where the code used to copy the ID.

Displaying a hat

One last interesting bit. We need to exchange a hat URL (from the 'object' of the original request) for a text label to display. This is the first thing we've encountered that is truly unique to hats… and it's basically a reference table.

This post is getting quite long as it is and I need to save something for people who come to my class. So I'll leave you with these quick thoughts:

Reference tables are a common need. Maybe we can create a more general service for curating reference tables. That would include information like who is allowed to add entries.
Someone may request a hat that doesn't exist yet. If the request is approved, then the hat "poofs" into existence. So what is the difference between "proposed" reference data and "current" reference data?
Is that lifecycle both general and interesting? Maybe there are two different APIs for dealing with curating the data versus just looking at current.
On the consuming side, we might decide to simply cache all the existing entries for a reference table. It's reasonable to have a query method on the table that says "give me the complete list of Hats" (or countries, or currencies, time zones, etc.) Fetching those at startup time is reasonable, but on a cache miss we still need to go ask about a single entry.
If we do create a reference data service, which deployment model do we want to use: A single instantiation of the service with all of our reference tables? Or one instantiation for each reference table? Think about the tradeoffs here both in terms of infrastructure cost and operations cost. (More instances = more infrastructure. Fewer instances = less ops cost at low volume, but more operations as scaling becomes harder.)
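The consumer-side caching thought above can be sketched in a few lines. This assumes a hypothetical reference data client that exposes two calls: one returning the complete table and one returning a single entry. Both are passed in as plain functions here.

```python
from typing import Callable, Dict


class ReferenceTableCache:
    """Load the whole table at startup, then fetch single entries on a miss."""

    def __init__(
        self,
        fetch_all: Callable[[], Dict[str, str]],  # hypothetical "give me all hats"
        fetch_one: Callable[[str], str],          # hypothetical single-entry lookup
    ) -> None:
        self._fetch_one = fetch_one
        self._entries = dict(fetch_all())         # warm the cache once

    def label(self, url: str) -> str:
        if url not in self._entries:
            # Cache miss: the entry was probably created after startup.
            self._entries[url] = self._fetch_one(url)
        return self._entries[url]
```

A hat that "poofs" into existence after startup shows up as a cache miss, which is why the single-entry lookup is still needed.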

Conclusions

Our first idea is usually not the best one. To understand the boundaries and interface that make sense for a service, we have to think about it in situ. We aren't trying to model the world. We are trying to build systems that deliver features. Those features are specific and we must design our APIs to deliver those specific features. At the same time, however, we can often deliver the features just as easily by abstracting the API up one level. This makes a service more general and more reusable. It delivers more marginal value (i.e., it makes future work cheaper) and may even be simpler to write because it has less special-case logic or constraints.

We need to be careful to not push work into the gaps between services. One way to avoid that is to design APIs in terms of the caller's needs rather than the provider's view of the world.

Finally, sometimes the original service we set out to build evaporates completely when you discover that an apparently unitary concept is actually a composition of different concepts hiding under a noun.

Data is the New Oil

2018-03-02T16:11:35-06:00

The other day I tweeted that "Data is the New Oil." A lot of people retweeted, but quite a few asked what I meant by that. I'll amplify a bit to explain the analogy.

This ended up being a lot to unpack from a quick tweet! For quite a few years now, I've used Twitter as a way to scratch the itch of personal expression. A quick sound bite there, highly compressed and idiosyncratic, was just enough to relieve the mental pressure. As a consequence, I stopped blogging nearly as much. Lately, though, I feel the need for nuance and explanation, so I hope to do more in this space.

—

First, oil was the key resource that drove the industrial revolution in the 20th century. That was the age of oil and steel, according to economist and historian Carlota Perez. In Technological Revolutions and Financial Capital, Prof. Perez shows that every technology revolution goes through predictable phases, from irruption to exhaustion. Economics in the 20th C were totally defined by access to and movement of oil. Those who had it either had leverage or became victims, depending on their ability to create military and economic alliances. Oil reserves could put a nation on the world stage. A nation that bargained well with its oil would have power far beyond what its population size or technological ability would usually merit.

In fact, a large part of the U.S. economic dominance in the latter portion of the 20th C can be explained by the petrodollar. Since the Bretton Woods conference after WWII, oil transactions around the world were denominated in USD. If Saudi Arabia sold oil to China, then China had to pay SA in dollars. That meant China needed plenty of USD currency reserves and SA needed the US to hold riyal. (The biggest economic story in the world right now is not the DJIA hitting 26,000 or falling by 0.5% in a day… it's that China, Russia, Saudi Arabia, and Iran are now trading oil denominated in rubles, yuan, and SDRs.)

But before the internal combustion engine, oil wasn't a resource; it was a nuisance. The oil-rich land in Oklahoma is where the U.S. Government settled people it wanted to get out of the way. Oil gets in the way of farming. It was development of the new technology that turned oil from a hassle into a resource.

Once oil became a resource, a feedback loop got underway. More demand for oil led to more extraction, which caused industries to find new uses for the stuff. Plastics, fertilizers, etc. Increased demand drove increased supply and more efficient extraction, which in turn led to more demand.

Prof. Perez already identified the next technological revolution as information technology. However, I think her book got the timing wrong. It was published in 2002 and dated the start of the revolution to the advent of the personal computer in 1970. With the advantage of 16 years of additional observation, I think that there were two missing pieces: networking and machine learning. The real irruption of information technology started over the last decade. And as with the previous revolution, this one creates a need for a new resource: data.

Before this, data was a nuisance. It filled up disks and needed to be purged. It was often dirty (meaning not fully correct or conforming to syntactic rules) and incomplete. But toward the end of the 00's, some people started to see it as a resource. You might spend a lot of time cleansing and canonicalizing small data sets. But with a lot of data, it's impossible. At the same time though, you don't need to clean the data to glean information. Some kinds of errors average out and interesting signals emerge.

(If only I had come up with the name "Big Data" instead of "Dirty Data!")

Of course, we're well beyond mere Big Data now. With every eye turning toward machine learning, we've got a new challenge for our data. That's training.

A machine learning model is only as good as the training data. The training data itself needs to be classified. In other words, to train a machine to detect cars, you need a lot of photos where some are tagged "this has a car" and others don't have that tag. Yes, some CAPTCHAs just might be using you to train a machine, instead of proving you aren't one.

(Aside: we're going to see a lot of conflict about biases in ML models. We will expect the machines to be free of human cognitive and social biases, but we're training them with data created and classified by humans! We will actually be asking the machines to make errors in a systematic way to offset humans' systematic errors in the training inputs. It's not hard to see why HAL 9000 went spare.)

Data is digital, but it's not easy to move around in these quantities. We're not talking about a station wagon full of tapes barreling down the highway… we're talking about a convoy of 18-wheelers loaded with racks full of disks.

Companies that have tagged or classified data sets are the new oil producing and exporting countries. If you have large quantities of classified photos, video, voice, text, etc. you are well-positioned to train ML models. If you don't have such a dataset, then you need to create a consumer-oriented startup to get humans to do the initial classification for you or you need to license access to data from one of the big players. (There are some open-access datasets that hobbyists can use, but those will never be as large or as current as the proprietary data sets.) Alternatively, focus on providing the engineering support and tooling for the technostates that have the data, the same way that Norway provides engineering to Saudi Arabia.

Just as oil production led to new uses of oil that reshaped everything from consumer products to food production to hygiene, I fully expect data-fueled ML models to reshape this century. Moreover, we will see demand for ever-greater data production from our homes, workplaces, and devices. This will cause tension and conflict about data use just as happened with land-use, water-use, and mineral rights. That will lead to new legal regimes and doctrines. In extreme cases, it may lead to revolutions similar to the Revolutions of 1848 in Europe.

Coherence Penalty for Humans

2018-01-09T10:22:23-08:00

This is a brief aside from my ongoing series about avoiding entity services. An interesting dinner conversation led to thoughts that I needed to write down.

Amdahl's Law

In 1967, Gene Amdahl presented a case against multiprocessing computers. He argued that the maximum speed increase for a task would be limited because only a portion of the task could be split up and parallelized. This portion, the "parallel fraction," might differ from one kind of job to another, but it would always be present. This argument came to be known as Amdahl's Law.

When you graph the "speedup" for a job relative to the number of parallel processors devoted to it, you see this:

The graph is asymptotic in the serial fraction, so there is an upper limit to the speedup.
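For readers who want numbers, the usual statement of Amdahl's Law is speedup = 1 / ((1 - p) + p / N), where p is the parallel fraction and N is the number of processors. A minimal sketch (the function name is just for this example):

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's Law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# With 95% of the job parallelizable, the ceiling is 1 / 0.05 = 20x,
# no matter how many processors are added.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```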

From Amdahl to USL

The thing about Amdahl's Law is that, when Gene made his argument, people weren't actually building very many multiprocessing computers. His formula was based on first principles: if the serial fraction of a job is exactly zero, then it's not a job but several.

Neil Gunther extended Amdahl's Law based on observations of performance measurements from many machines. He arrived at the Universal Scalability Law. It uses two parameters to represent contention (which is similar to the serial fraction) and incoherence. Incoherence refers to the time spent restoring a common view of the world across different processors.

In a single CPU, incoherence penalties arise from caching. When one core changes a cache line, it tells other cores to eject that line from their caches. If they need to touch the same line, they spend time reloading it from main memory. (This is a slightly simplified description… but the more precise form still has incoherence penalty.)

Across database nodes, incoherence penalties arise from consistency and agreement algorithms. The penalty can be paid when data is changed (as in the case of transactional databases) or when the data is read in the case of eventually consistent stores.

Effect of USL

When you graph the USL as a function of number of processors, you get the green line on this graph:

(Image from perfdynamics.com)

(The purple line shows what Amdahl's Law would predict.)

Notice that the green line reaches a peak and then declines. It means that there is a number of nodes that produces maximum throughput. Add more processors and throughput goes down. Overscaling hurts throughput. I've seen this in real-life load testing.
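The peak is easy to reproduce. Gunther's formula for relative capacity is C(N) = N / (1 + α(N - 1) + βN(N - 1)), where α is the contention parameter and β is the incoherence parameter. The maximum falls at N = sqrt((1 - α) / β). The parameter values below are made up purely for illustration.

```python
import math


def usl_throughput(n: float, contention: float, incoherence: float) -> float:
    """Relative capacity at n processors under the Universal Scalability Law."""
    return n / (1 + contention * (n - 1) + incoherence * n * (n - 1))


def peak_nodes(contention: float, incoherence: float) -> float:
    """Processor count at which USL throughput is highest."""
    return math.sqrt((1 - contention) / incoherence)


alpha, beta = 0.05, 0.001          # illustrative values, not measurements
best = peak_nodes(alpha, beta)     # a bit under 31 processors
for n in (10, 31, 100):
    print(n, round(usl_throughput(n, alpha, beta), 2))
```

With the incoherence parameter set to zero, the formula reduces to Amdahl's Law, which is why that curve never turns downward.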

We'd often like to increase the number of processors and get more throughput. There are exactly two ways to do that:

Reduce the serial fraction
Reduce the incoherence penalty

USL in Teams?

Let's try an analogy. If the "job" is a project rather than a computational task, then we can look at the number of people on the project as the "processors" doing the work.

In that case, the serial fraction would be whatever portion of the work can only be done one step after another. That may be fodder for a future post, but it's not what I'm interested in today.

There seems to be a direct analog for the incoherence penalty. Whatever time the team members spend re-establishing a common view of the universe is the incoherence penalty.

For a half-dozen people in a single room, that penalty might be really small. Just a whiteboard session once a week or so.

For a large team across multiple time zones, it could be large and formal. Documents and walkthroughs. Presentations to the team, and so on.

In some architectures coherence matters less. Imagine a team with members across three continents, but each one works on a single service that consumes data in a well-specified format and produces data in a well-specified format. They don't require coherence about changes in the processes, but would need coherence for any changes in the formats.

Sometimes tools and languages can change the incoherence penalty. One of the arguments for static typing is that it helps communicate across the team. In essence, types in code are the mechanism for broadcasting changes in the model of the world. In a dynamically typed language, we'd either need secondary artifacts (unit tests or chat messages) or we'd need to create boundaries where subteams only rarely needed to re-cohere with other subteams.

All of these are techniques aimed at the incoherence penalty. Let's recall that overscaling causes reduced throughput. So if you have a high coherence penalty and too many people, then the team as a whole moves slower. I've certainly experienced teams where it felt like we could cut half the people and move twice as fast. USL and the incoherence penalty now help me understand why that was true. It's not just about getting rid of deadwood. It's about reducing the overhead of sharing mental models.

In The Fear Cycle I alluded to codebases where people knew large scale changes were needed, but were afraid of inadvertent harm. This would imply a team that was overscaled and never achieved coherence. Once lost, it seems to be really hard to re-establish. That means ignoring the incoherence penalty is not an option.

USL and Microservices

By the way, I think that the USL explains some of the interest in microservices. By splitting a large system into smaller and smaller pieces, deployed independently, you reduce the serial fraction of a release. In a large system with many contributors, the serial fraction comes from integration, testing, and deployment activities. Part of the premise for microservices is that they don't need the integration work, integration testing, or delay for synchronized deployment.

But, the incoherence penalty means that you might not get the desired speedup. I'm probably stretching the analogy a bit here, but I think you could regard interface changes between microservices as requiring re-coherence across teams. Too much of that and you won't get the desired benefit of microservices.

What to do about it?

My suggestion: take a look at your architecture, language, tools, and team. See where you spend time re-establishing coherence when people make changes to the system's model of the world.

Look for splits. Split the system with internal boundaries. Split the team.

Use your environment to communicate the changes so re-cohering can be a broadcast effort rather than one-to-one conversations.

Look at your team communications. How much of your time and process is devoted to coherence? Maybe you can make small changes to reduce the need for it.

Services By Lifecycle

2018-01-05T14:00:44-06:00

This post took a lot longer to pull together than I expected. Not because it was hard to write, but because it was too easy to write too much. Like a pre-bonsai tree, it would grow out of control and get pruned back over and over.

In the meantime, I delivered a workshop and spent some lovely holiday time with my family.

But it’s a new year now, and January is devoid of holidays so it’s high time I got back to business.

Avoiding the Entity Service

In my last post, I made a case against entity services. To recap, an entity service is a set of CRUD operations on a business entity such as Person, Location, Contract, Order, etc. It’s an antipattern because it creates high semantic and operational coupling. Edge services suffer from common-mode failures through their shared dependency on the entity services. Changes or outages in the entity services have large “failure domains.”

A lot of good advice springs from Eric Evans’s hugely influential book “Domain-Driven Design.” It was written before the service era, but seems to apply well now. I’m not an expert on DDD, though, so I’m going to offer some techniques that may or may not be described there. (I dig the “bounded context” idea, but need to re-read the whole book before I comment on it more.)

There are several ways to avoid entity services. This post explores just one (though it’s one I particularly like.) Future posts will look at additional techniques.

Focus on Behavior Instead of Data

When you think about what a service knows, you always end up back at CRUD. I recommend thinking in terms of the service’s responsibilities. (And don’t say it’s responsible for knowing some data!) Does it apply policy? Does it aggregate a stream of concepts into a summary? Does it facilitate some kinds of changes? Does it calculate something? And so on.

Once you know what a service does, you can figure out what it needs to know. For instance, a service that restricts content delivery based on local laws needs to know a few things:

1. What jurisdiction applies?
2. What classifiers are on the content?
3. What classifiers are not allowed in that jurisdiction?

Notice that #1 and #2 are specific to an individual request, while #3 is slowly-changing data. Thus it makes sense for the service to know #3 and be told #1 and #2.
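Here is a minimal sketch of that split, with made-up names and data. The service holds the slowly-changing table of disallowed classifiers, and each call supplies the jurisdiction and the content's classifiers.

```python
from typing import Dict, FrozenSet, Iterable


class ContentRestrictionService:
    """Hypothetical policy service: knows the rules, is told the request facts."""

    def __init__(self, banned: Dict[str, FrozenSet[str]]) -> None:
        # Slowly-changing data the service owns: jurisdiction -> banned classifiers.
        self._banned = banned

    def is_allowed(self, jurisdiction: str, classifiers: Iterable[str]) -> bool:
        # Per-request facts arrive as arguments; nothing is looked up elsewhere.
        banned = self._banned.get(jurisdiction, frozenset())
        return banned.isdisjoint(classifiers)


service = ContentRestrictionService({"XX": frozenset({"gambling"})})
service.is_allowed("XX", ["sports", "gambling"])   # False
service.is_allowed("YY", ["sports", "gambling"])   # True
```

Note that there are no CRUD operations on "content" or "jurisdiction" anywhere in the interface.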

This leads us to a deeper discussion about what the service knows. How does that data get into the service? Is there a GUI for a legal team? Maybe there’s a feed we can purchase from a data vendor. Maybe we need to apply machine learning based on lawsuits and penalties incurred when we violate local laws. (I’m kidding! Don’t do the last one!) The answers to these questions will firm up your design and situate your service in its ecosystem.

Model Like It’s 1999

When modeling, I like to use a technique from object-oriented design. CRC cards let me lay out physical tokens that represent services. Then I can play a game where I simulate a request coming in at the edge. A service can add information to a request and pass it along to another service, following the “Tell, Don’t Ask” principle.

If you are in a team, you can deal out cards to players then simulate a request by passing a physical object around. That will quickly reveal gaps in a design. Some common gaps that I see:

A service doesn’t know where to send the request. It lacks knowledge of other services that can continue processing. The solution is either to statically introduce it to the next party or to provide URLs in the data that lead to the handler.
A service receives a request that is insufficient. The incoming request either lacks information or has an implicit context that should be turned into data on the request.

While playing the CRC game, it’s OK to assume your service already has data it naturally depends on. That is, slowly-changing data the service uses should be considered an asynchronous process relative to the current request. But do make note of that slowly-changing data so you remember to build in the flows needed to populate it.

If you follow “Tell, Don’t Ask” strictly, then the activation set will be a strict tree. Anywhere a service calls out to more than one downstream, it should be sending instructions forward in parallel rather than serially making queries followed by an instruction.

Dealing with Consistency

If it were just a matter of passing requests along with extra data, then life would be simple. As often happens, trouble comes from side effects.

Services are not pure functions. If a service call always results in the same result for the same parameters, then you don’t need a service. Make a library and avoid the operational overhead! Services only make sense when they change something in the world. That means state and state changes are unavoidable concerns in a service-based architecture.

Consistency immediately comes up as an issue. Many words have already been written about CAP. Some good, some misguided, and some pure marketing. I even wrote an earlier post about the subtle differences between the C in CAP versus the C in ACID.

Let’s look at one way to deal with consistency in the face of changing state.

Divide Services by Lifecycle in a Business Process

Many business processes have entities that go through a series of milestones. In a particular state, changes are allowed to certain attributes but not others. Once a subset of the properties are “valid” (whatever that means) the entity can transition to the next stage of the business process.

Instead of viewing this as a single entity with a bunch of booleans, or a CURRENT_STATE attribute (which implies a state machine that is unknown to consumers) we can view each state as a different thing.

For example, consider this process from a peer-to-peer lending situation:

A loan requestor starts by creating a project proposal. The requestor can provide descriptive text, an amount to request, and some media assets (projects with big vivid pictures get funded faster.)
Once the loan request is completed, the requestor submits it for approval. At this point, the requestor is no longer allowed to change the amount requested.
An analyst from the host company reviews the proposal. In parallel, a background job checks the requestor’s credit score, repayment history, and criminal record.
The analyst reviews the request and either assigns a target interest rate, rejects the request outright, or indicates that more information is needed from the requestor.
Once approved, the proposal is visible to funders. Funders can commit to a certain amount (contingent on their credit scores.) At this stage, none of the proposal information can be changed, although the whole proposal could be withdrawn.
Once fully funded, funders must transfer money within 3 days. No additional funders are allowed to join the project at this time, but they can go on a waiting list in case some of the committed funders fail to supply the money.
Once funds are in the funders' accounts, it all gets transferred into a short-term holding account. The project information, all the individuals' information (tax IDs, etc.) goes to an underwriter who produces a legal loan document for all parties to sign.

For the moment, we’re leaving out some of the tributaries of this flow.

Notice how moving through the business process causes previous information to become effectively read-only?

The original form of this system was a monolith that had a state field on a “Loan” model object. That was a wide object, since it had everything from the initial proposal through to the ultimate payment. If we made that into a “Loan” microservice we would exactly end up with an entity service, CRUD operations, and high coupling into that service, as shown below.

Try playing CRC with this design. You’ll find that all requests reach the Loan service.

What is less evident from the diagram is the cost of embedding a state model into the entity service directly. If we put a state field on the Loan, then every Loan must go through the same state machine. It locks us into a single kind of business process. At the time, we already knew the company was exploring direct-funded loans through a banking partner. So there would be a minimum of two flavors of process. (Or one process with proliferating branches.)

I briefly considered using my DEVS library to represent the state plus state machine as EDN data on each Loan, but ultimately decided against it.

Instead, I thought we could make each state into its own service, as shown here.

Now as the business process moves along, we’re really sending documents to each service. For example, from Proposal to Project, we send a “ProjectStarter” document that contains all the attributes needed for a Project. When the analyst approves a project, the analyst GUI (or backend for same) creates a “LoanStarter” and sends it to the UnfundedLoan service. Likewise, once all funding is received, the “Collection” service creates a “LoanPackage” document and sends it to the “Underwriting” service. (That’s “collection” as in “gatherer of documents” not “collection” as in “break your kneecaps.”) Further downstream, we set up a schedule of payments to receive from the requestor and payments to issue to the backers. We also keep a set of ledgers to track balances per account.

Each of the services has facilities to add or update information relevant to that service. It ignores anything in the incoming documents that it doesn’t need.
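A small sketch of the receiving side, under invented assumptions: the fields on Project and the shape of the “ProjectStarter” document are made up here, since the post doesn’t spell them out. The service keeps only the attributes it cares about and treats what it received as read-only.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)          # frozen: handed-over data is effectively read-only
class Project:
    title: str                   # hypothetical attribute
    amount_requested: int        # hypothetical attribute


def project_from_starter(document: dict) -> Project:
    """Build a Project from an incoming document, ignoring unknown keys."""
    wanted = {f.name for f in fields(Project)}
    return Project(**{k: v for k, v in document.items() if k in wanted})


starter = {
    "title": "Bakery oven",
    "amount_requested": 5000,
    "draft_notes": "only meaningful to the Proposal service",
}
project = project_from_starter(starter)   # draft_notes is silently dropped
```

Because the receiver ignores extra keys, the sending service can add attributes for other consumers without breaking this one.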

This gives us a lot of flexibility in how we build the overall process.

Consider our direct-funding scenario. We need a new “DirectFunding” service that finds suitable candidates. It sends a document out to the bank and receives a response. On a favorable response, DirectFunding can create its own version of the LoanPackage document for underwriting. In other words, treating these stages as services connected by well-defined document formats allows us to introduce more pathways without creating the state machine from hell.

As an additional benefit, we can easily monitor the flow of documents to see when the process is healthy. We can monitor each service’s activity to create a cumulative flow diagram. We get a lot of visibility. And since some stages are triggered by humans (e.g., the analysts) we can even figure out how our staff model must scale with business throughput.

It should also be clear that this style works well with event transfer instead of document transfer. It would be natural to put all the documents onto a message bus.

Overall, I think this style offers a nice degree of alignment between the technology and the business. The only “downside” I can see is that it requires a service developer to understand how that service contributes to value streams and business processes.

Backtracking, Errors, and Races

There is still a minute window of opportunity for perceived inconsistency to sneak in. For example, what happens if the requestor tries to change the proposal while the analyst is reviewing it? Or worse, what if they change it in those milliseconds between when the analyst clicks “Approve” in the GUI and when the document goes over to the Project service? For that matter, how do we tell the Proposal service that the proposal can no longer be edited without withdrawing the request and resubmitting as a new Project?

This post is already getting too long, so I’m going to answer those questions next time. It shouldn’t take another month since we’re past the holiday-fun-times and into the serious winter months.

If you’re interested in learning more about breaking up monoliths, you might like my Monolith to Microservices workshop.

I’m hosting a session this March in sunny Florida. Especially to all my dear friends and colleagues back home in Minnesota… we know that March is a great time to not be in Minnesota.

Or, contact me to schedule a workshop at your company.

The Entity Service Antipattern

2017-12-05T12:53:44-06:00

In my last post I talked about the need to keep things separated once they've been decoupled. Let's look at one of the ways this breaks down: entity services.

If a pattern is a solution to a problem in a context, what is an antipattern? An antipattern is a commonly-rediscovered solution to a problem in a context, that inadvertently creates a resulting context we like less than the original context. In other words, it's a pattern that makes things worse (according to some value system.)

I contend that "entity services" are an antipattern.

To make that case, I need to establish that "entity services" are a commonly-rediscovered solution to a problem and that the resulting context is worse than the starting context (a monolith.)

Let's start with the "commonly-rediscovered" part. Entity services are in Microsoft's .NET microservices architecture ebook. Spring has a tutorial with them. (Spring may give us the absolute easiest way to create an entity service. The same class can be annotated with JSON mapping and persistence mapping.) RedHat has a Microservice Reference Architecture with product-service and sales-service. Some of the microservice-focused frameworks such as JHipster start with CRUD on data entities.

In order to make the case that the resulting context is worse than the starting context, I need to assume what that starting context actually is. For the sake of generality, I'll assume a largish, legacy application that is more-or-less a monolith. It may call out to some integration points to get work done, but features are pretty much local and in-process. There are multiple instances of the process running on different hosts. Basically, like the following diagram.

All features reside in the code for the application instances.

Many other authors have enumerated the sins of the monolith, so I won't belabor them here. (Though I feel compelled to make a brief aside to say that we did somehow deliver quite a lot of working, valuable features that ran in monoliths.)

How might we describe this initial context?

It is clear where the code to build a feature goes and how to test the code.
The release cadence is dictated by the slowest-delivering subteam.
There is little inherent enforcement of boundaries, thus coupling tends to increase over time.
Performance problems can be found by profiling a single application.
The cause of availability problems is typically found in one place.
Building features that rely on multiple entities is straightforward, though it may come at the cost of inappropriate coupling.
As the code grows large, the organization is at risk of entering the fear cycle.
Feature availability may be compromised by inappropriate coupling via common modes in the application. (E.g., thread pools, connection pools.)
Feature availability should be improved by redundancy of the whole application itself. This is reduced, however, if the application is vulnerable to the surrounding environment, as in the case of a Self-Denial Attack, memory leak, or race condition.

Suppose we move this to a microservice architecture, with entity services. We might end up with something like this example from the Spring tutorial:

In this version, you should assume that each of the service boxes comprises multiple instances of that service.

Obviously there are more moving parts involved. That immediately means it's harder to maintain availability.

The challenges of performance analysis and debugging are well documented, so I won't belabor them.

But in this resulting context, where do features get created? A few of them are direct interactions of the "Online Shopping" service and the individual entity services. For example, creating an account is definitely just between Online Shopping and Accounts.

Most features, however, require more than one of the entities. They will use aggregates or intersections of entities.

For example, suppose we need to calculate the total price of a cartful of items. That involves the cart, the products (for their individual prices) and the account to find the applicable sales tax or VAT. I predict that it will be implemented in the Online Shopping service by making a bunch of calls to the entity services to get their data.

We can depict this with an "activation set" (a term I made up) to show which services are activated during processing of a single request type. For this picture, we focus on just the services and elide the infrastructure.

So to price the cart, we have to activate four of the five services in our architecture.

That activation represents operational coupling, which affects availability, performance, and capacity.
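To put a rough number on that operational coupling, here is a minimal sketch. It assumes services fail independently and that a request fails if any service in its activation set is unavailable. The service names and the 99.9% figure are illustrative assumptions, not taken from the Spring tutorial.

```python
from math import prod

def request_availability(activation_set, availability):
    """Availability of one request type, assuming independent failures
    and that every activated service must respond."""
    return prod(availability[service] for service in activation_set)

# Hypothetical figures: each service is up 99.9% of the time.
availability = {
    "online-shopping": 0.999,
    "cart": 0.999,
    "products": 0.999,
    "accounts": 0.999,
}

# The activation set for pricing a cart: four services.
price_cart = {"online-shopping", "cart", "products", "accounts"}
print(f"{request_availability(price_cart, availability):.4%}")
```

Under those assumptions, four services at "three nines" each give the request only about 99.6%. Every service added to the activation set multiplies in another factor below one.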

It also represents semantic coupling. A change to any of the entity services has the potential to ripple through into the online shopping service. (In particularly bad cases, the online shopping service may find itself brokering between data formats: translating version 5 of the user data produced by Accounts into the version 3 format that Cart expects.)

A common corollary to entity services is the idea of "stateless business process services." I think this meme dates back to last century with the original introduction of Java EE, with entity beans and session beans. It came back with SOA and again with microservices.

What happens to our picture if we introduce a process service to handle pricing the cart?

Not much improvement.

Bear in mind this is the activation set for just one request type. We have to consider all the different request types and overlay their activation sets. We'll find the entity services are activated for the majority of requests. That makes them a problem for availability and performance. It also means they won't be allowed to change as fast as we'd like. (Services with a high fan-in need to be more stable.)
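Overlaying activation sets amounts to counting, per service, how many request types touch it. The sketch below does that for a handful of request types. The request types and service names are hypothetical; only "price cart" follows the example above.

```python
from collections import Counter

# Hypothetical request types and the services each one activates.
activation_sets = {
    "price cart": {"online-shopping", "cart", "products", "accounts"},
    "create account": {"online-shopping", "accounts"},
    "add to cart": {"online-shopping", "cart", "products"},
    "view product": {"online-shopping", "products"},
}

def overlay(activation_sets):
    """Count how many request types activate each service."""
    counts = Counter()
    for services in activation_sets.values():
        counts.update(services)
    return counts

fan_in = overlay(activation_sets)
for service, count in fan_in.most_common():
    print(f"{service}: activated by {count} of {len(activation_sets)} request types")
```

The services at the top of that list are the ones whose outages and slowdowns reach the most features, and whose interfaces are hardest to change.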

So, let's look at the resulting context of moving to microservices with entity services:

Performance analysis and debugging is more difficult. Tracing tools such as Zipkin are necessary.
Additional overhead of marshalling and parsing requests and replies consumes some of our precious latency budget.
Individual units of code are smaller.
Each team can deploy on its own cadence.
Semantic coupling requires cross-team negotiation.
Features mainly accrue in "nexuses" such as API, aggregator, or UI servers.
Entity services are invoked on nearly every request, so they will become heavily loaded.
Overall availability is coupled to many different services, even though we expect individual services to be deployed frequently. (A deployment looks exactly like an outage to callers!)

In summary, I'd say both criteria are met to label entity services as an antipattern.

Stay tuned. In a future post, we'll look at what to do instead of entity services.

If you're interested in learning more about breaking up monoliths, you might like my Monolith to Microservices workshop.

There is a session open to the public in March 2018.

Or, contact me to schedule a workshop at your company.

Keep 'Em Separated

2017-11-27T15:45:31-06:00

Software doesn't have any natural boundaries. There are no rivers, mountains, or deserts to separate different pieces of software. If two services interact, then they have a sort of "attractive force" that makes them grow towards each other. The interface between them becomes more specific. Semantic coupling sneaks in. At a certain point, they might as well be one module running in one process.

If you're building microservices, you need to make sure they don't grow together into an impenetrable bramble. The whole microservice bet is that we can trade deployment and operational complexity for team-scale autonomy and development speed. The last thing you want is to take on the operational complexity of microservices and still move slowly due to semantic coupling among them.

Maybe you've recently broken up a monolith into microservices, but found that things aren't as easy and rosy as the conference talks led you to believe.

Maybe you have a microservice architecture that is starting to slow down and get harder. Like cooling honey, it seems sweet at first but gets stickier later.

I'm going to write a short series of posts with techniques to keep 'em separated. This will go into API design for microservices, information architecture, and feature design. It'll be all about making smaller, more general pieces, that you can rearrange in interesting ways.

Root Cause Analysis as Storytelling

2017-11-08T17:01:06-06:00

Humans are great storytellers and even better story-listeners. We love to hear stories so much that when there aren't any available, we make them up on our own.

From an early age, children grasp the idea of narrative. Even if they don't understand the forms of storytelling so much, you can hear a four-year-old weave a linked list of events from her day.

We look for stories behind everything. At a deep level, we want the world's events to mean something. Effect follows cause, and causes have an actor to set them in motion.

Our sense of balance also demands that large effects should have large causes, with correspondingly large intent.

A drunk driver speeds through a red light, oblivious. A crossing car stops short. The shaken driver creeps home with a pounding pulse, full of queasy adrenaline. She unbuckles her daughter and hugs her tightly.

A drunk driver speeds through a red light, oblivious. A crossing car is in the intersection. The drunk smashes into it, right at the driver's side door. The woman's bloody face is hidden behind airbags. Her daughter sits in her new wheelchair for her mother's funeral.

The difference between those stories is a matter of a split second in timing. There is absolutely no change in the motives or desires of anyone in the two vignettes. The first drunk, if caught, would get a jail term and large fine. He would probably lose his driver's license.

But most people would judge the motives of the second driver far more harshly. They would condemn him to a lengthy prison term and a lifetime ban on driving.

When we see a large effect, we expect a large cause, with a large intent.

The idea that some vast, horrible events strike randomly fills us with dread. People can't bear the thought that a single unbalanced nobody can change the course of a nation's history with one rifle shot, so they spend more than 50 years searching for "the truth."

"Root Cause Analysis" expresses a desire for narrative. With the power of hindsight, we want to find out what went wrong, who did it, and how we can make sure it never happens again. But because we have the posterior event, we judge the prior probabilities differently. Any anomaly or blip suddenly becomes suspect.

People don't look as hard at anomalies when nothing bad happens.

They don't notice all the times the same weird log message pops up before … everything continues as normal.

When we look for "root cause," what we are really trying to discern is not "what made this happen." We are looking for something that would have stopped it from happening. We are building a counterfactual narrative—an alternate history—where that drunk driver dropped his keys in the parking lot and was thereby delayed a few crucial seconds.

Peel back the surface on a root cause analysis and you almost always see a formula that goes like this: "factor X" could have prevented this. "Factor X" was not present, therefore the bad event happened.

The catch is that there is usually an endless variety of possible counterfactuals. Often, more than one counterfactual narrative would have prevented the bad outcome equally well. Which one was the root cause? Non-existence of "factor X" or non-existence of "factor Y?"

Next time you have a bad incident, why not try to focus your efforts in a different way? Work on learning from the times that things don't go wrong. And be explicit about looking for many possible interventions that would have prevented the problem. Then select ones with broad ability to prevent or impede many different problems.

Release It Second Edition in Beta

2017-08-24T08:09:51-05:00

I’m excited to announce the beta of Release It! Second edition.

It’s been ten years since the first edition was released. Many of the lessons in that book hold strong. Some are even more relevant today than in 2007. But a few things have changed. For one thing, capacity management is much less of an issue today. The rise of the cloud means that developers are more exposed to networks than ever. And in this era of microservices, we’ve got more and better ops tools in the open source world than ever.

All of that motivated me to update the book for this decade. I’ve removed the section on capacity and capacity optimization and replaced it with a section that builds up a picture of our systems by doing a “slow zoom” out from the hardware, to single processes, to clusters, to the controlling infrastructure, and to security issues.

The first beta does not yet include two additional new parts on deployment and solving systemic problems. Those will be coming in the next few weeks.

In the meanwhile, I look forward to hearing your comments and feedback! Join the conversation in the book's forum.

Spectrum of Change

2017-06-23T11:26:33-05:00

I’ve come to believe that every system implicitly defines a spectrum of changes, ordered by their likelihood. As designers and developers, we make decisions about what to embody as architecture, code, and data based on known requirements and our experience and intuition.

We pick some kinds of changes and say they are so likely that we should represent the current choice as data in the system. For instance, who are the users? You can imagine a system where the user base is so fixed that there’s no data representing the user or users. Consider a single-user application like a word processor.

Another system might implicitly indicate there is just one community of users. So there’s no data that represents an organization of users–it’s just implicit. On the other hand, if you’re building a SaaS system, you expect the communities of users to come and go. (Hopefully, more come than go!) So you make whole communities into data because you expect that population to change very rapidly.

If you are building a SaaS system for a small, fixed market you might decide that the population won’t change very often. In that case, you might represent a population of users in the architecture via instancing.

So data is at the high-energy end of the spectrum, where we expect constant change. Next would be decisions that are contemplated in code but only made concrete in configuration. These aren’t quite as easy to change as data. Furthermore, we expect that only one answer to any given configuration choice is operative at a time. That’s in contrast to data where there can be multiple choices active simultaneously.

Below configuration are decisions represented explicitly in code. Constructs like policy objects, strategy patterns, and plugins all indicate our belief that the answer to a particular decision will change rapidly. We know it is likely to change, so we localize the current answer to a single class or function. This is the origin of the “Single Responsibility Principle.”
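As a rough illustration of those first three levels, here is one small pricing function that draws on all of them. Every name in it (the `Community` record, the `CONFIG` dictionary, the discount functions) is hypothetical and only there to show where each kind of decision lives.

```python
from dataclasses import dataclass
from typing import Callable

# Configuration: one answer is operative at a time for the whole deployment.
CONFIG = {"default_currency": "USD"}

# Data: many answers are active at once, one per community of users.
@dataclass
class Community:
    name: str
    currency: str  # empty string means "use the configured default"

# Encapsulated code: the current answer to "how do we discount?" lives in
# one function that can be swapped without touching the callers.
def no_discount(amount: float) -> float:
    return amount

def ten_percent_off(amount: float) -> float:
    return amount * 0.9

def quote(amount: float, community: Community,
          discount: Callable[[float], float] = no_discount) -> str:
    currency = community.currency or CONFIG["default_currency"]
    return f"{discount(amount):.2f} {currency}"

for community in [Community("acme", "EUR"), Community("globex", "")]:
    print(quote(100.0, community, ten_percent_off))
```

Changing a community's currency is a data edit, changing the default is a configuration change, and changing the discount rule means writing a new function. Each step down costs more to change.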

Farther down the spectrum, we have cross-cutting behavior in a single system. Logging, authentication, and persistence are the typical examples here. Would it be meaningful to push these up into a higher level like configuration? What about data?

Then we have those things which are so implicit to the service or application that they aren’t even represented. Everybody has a story about when they had to make one of these explicit for the first time. It may be adding a native app to a Web architecture, or going from single-currency, single-language to multinational.

Next we run into things that we expect to change very rarely. These are cross-cutting behavior across multiple systems. Authentication services and schemas often land at this level.

So the spectrum goes like this, from high energy, rapidly changing, blue to cool, sedate red:

Data
Configuration
Encapsulated code
Cross-cutting code
Implicit in application
Cross-cutting architecture

Implications

The farther toward the “red” end of the spectrum we relegate a concern, the more tectonic it will be to change it.

No particular decision “naturally” falls at one level or another. We just have experience and intuition about which kinds of changes happen with greatest frequency. That intuition isn’t always right.

Efforts to make everything into data in the system lead to rules engines and logic programming. That doesn’t usually end up with the end-user control we think. It turns out you still need programmers to think through changes to rules in a rules engine. Instead of democratizing the changes, you’ve made them more esoteric.

It’s also not feasible to hoist everything up to be data. The more decisions you energy-boost to that level, the more it costs. And at some point you generalize enough that all you’ve done is create a new programming language. If everything about your application is data, you’ve written an interpreter and recursed one level higher. Now you still have to decide how to encode everything in that new language.

Queuing for QA

2017-05-01T20:40:29-05:00

Queues are the enemy of high-velocity flow. When we see them in our software, we know they will be a performance limiter. We should look at them in our processes the same way.

I've seen meeting rooms full of development managers with a chart of the year, trying to allocate which week each dev project will enter the QA environment. Any project that gets done too early just has to wait its turn in QA. Entry to QA becomes a queue of its own. And as with any queue, the more variability in the processing time, the more likely the queue is to back up and get delayed.
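The effect of variability can be sketched with the Pollaczek-Khinchine formula for the mean wait in a single-server queue. This assumes projects arrive at random (Poisson arrivals) and share one QA environment, which is a simplification of any real schedule. The arrival rate and durations below are made-up numbers.

```python
def mean_wait_in_queue(arrival_rate, mean_service_time, service_cv):
    """Pollaczek-Khinchine mean wait for an M/G/1 queue.

    arrival_rate: projects arriving per week (assumed Poisson arrivals)
    mean_service_time: average weeks a project occupies QA
    service_cv: coefficient of variation of that time (std dev / mean)
    """
    utilization = arrival_rate * mean_service_time
    if utilization >= 1:
        raise ValueError("queue grows without bound at utilization >= 1")
    return (utilization / (1 - utilization)) * ((1 + service_cv ** 2) / 2) * mean_service_time

# Same average load, increasing variability in how long QA takes.
for cv in (0.0, 1.0, 2.0):
    wait = mean_wait_in_queue(arrival_rate=0.4, mean_service_time=2.0, service_cv=cv)
    print(f"cv={cv}: mean wait {wait:.1f} weeks")
```

The average load is identical in all three cases. Only the variability changes, and the wait grows with the square of it.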

When faced with a situation like that, the team may look for the "right number" of QA environments to build. There is no right number. Any fixed number of environments just changes the queuing equation but keeps the queue. A much better answer is to change the rules of the game. Instead of having long-lived (in other words, broken and irreproducible) QA environments, focus on creating a machine for stamping out QA environments. Each project should be able to get its own disposable, destructible QA system, use it for the duration of a release, and discard it.

Availability and Stability

2016-11-27T13:26:17-06:00

Last post covered technical definitions of fault, error, and failure. In this post we will apply these definitions in a system.

Our context is a long-running service or server. It handles requests from many different consumers. Consumers may be human users, as in the case of a web site, or they may be other programs.

Engineering literature has many definitions of "availability." For our purpose we will use observed availability. That is the probability that the system survives between the time a request is submitted and the time it is retired. Mathematically, this can be expressed as the probability that the system does not fail between time T_0 and T_1, where the difference T_1 - T_0 is the request latency.

(There is a subtle issue with this definition of observed availability, but we can skirt it for the time being. It intrinsically assumes there is some other channel by which we can detect failures in the system. In a pure message-passing network such as TCP/IP, there is no way to distinguish between "failed" and "really, really slow." From the consumer's perspective, "too slow" is failed.)
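To see how this definition ties availability to latency, here is a small sketch. It assumes failures arrive at a constant rate (an exponential time-to-failure), which is my simplification and not part of the definition. The failure rate is a made-up figure.

```python
from math import exp

def observed_availability(failure_rate, latency):
    """P(no failure during one request), assuming failures arrive at a
    constant rate. failure_rate and latency must use the same time unit."""
    return exp(-failure_rate * latency)

# Hypothetical system: one failure per 1,000 hours on average.
rate_per_second = 1 / (1000 * 3600)
for latency_seconds in (0.1, 10.0, 600.0):
    print(latency_seconds, observed_availability(rate_per_second, latency_seconds))
```

Under that assumption, the longer a request stays in flight, the lower the availability its consumer observes, even though nothing about the system's failure rate has changed.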

The previous post established that faults will occur. To maintain availability, we must prevent faults from turning into failures. At the component level, we may apply fault-tolerance or fault-intolerance. Either way, we must assume that components will lose availability.

Stability, then, is the architectural characteristic that allows a system to maintain availability in the face of faults, errors, and partial failures.

At the system level, we can create stability by applying the principles of recovery-oriented computing.

Severability. When a component is malfunctioning, we must be able to cut it off from the rest of the system. This must be done dynamically at runtime. That is, it must not require changes to configuration or rebooting of the system as a whole.
Tolerance. Components must be able to absorb "shocks" without transmitting or amplifying them. When a component depends on another component which is failing or severed, it must not exhibit higher latency or generate errors.
Recoverability. Failing components must be restarted without restarting the entire system.
Resilience. A component may have higher latency or error rate when under stress from partial failures or internal faults. However, when the external or internal condition is resolved, the component must return to its previous latency and error rate. That is, it must display no lasting damage from the period of high stress.

Of these characteristics, recoverability may be the easiest to achieve in today's architectures. Instance-level restarts of processes, virtual machines, or containers all achieve recoverability of damaged components.

The remaining characteristics can be embedded in the code of a component via Circuit Breakers, Bulkheads, and Timeouts.
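As one example, a circuit breaker provides severability at runtime: after repeated failures it stops calling the dependency and fails fast instead. What follows is a minimal single-threaded sketch, not a production implementation. The class name, thresholds, and injectable clock are my own choices for illustration.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency severed; failing fast")
            # Half-open: the wait has elapsed, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller wraps each remote call, for example `breaker.call(fetch_price, sku)` where `fetch_price` is whatever function talks to the dependency. A real breaker would also need thread safety, and it pairs naturally with a timeout on the wrapped call so that "too slow" counts as a failure.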

Fault, Error, Failure

2016-11-27T13:01:50-06:00

Our systems suffer many insults when they contact the real world. Flaky inputs, unreliable networks, and misbehaving users, to name just a few. As we design our components and systems to thrive in the only environment that matters, it pays to have mental schema and language to discuss the issues.

A fault is an incorrect internal state in your software. Faults are often introduced at component, module, or subsystem boundaries. There can be a mismatch between the contract a module is designed to implement and its actual behavior. A very simple example is accepting a negative integer or zero when a strictly positive integer was expected.

A fault may also occur when a latent bug in the software is triggered by an external or internal condition. For example, attempting to allocate an object when memory is exhausted will return a null pointer. If the software proceeds with the null pointer it can cause problems later, perhaps in a far distant part of the code.

Such an incorrect state may be recoverable. A fault-tolerant module will attempt to restore a good internal state after detecting a fault. Exception handlers and error-checking code are efforts to provide fault-tolerance.

Another school of thought says that fault tolerance is unreliable. In this approach, once a fault has occurred, the entire memory state of the program must be regarded as corrupt. Instead of attempting to restore a good state by backtracking or patching up the internal state, fault-intolerant modules will exit to avoid producing errors. A system built from these fault-intolerant modules will include supervisor capabilities to restart exited modules.
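Here is a small sketch of the fault-intolerant style. The worker refuses to continue when it sees a bad state, and a supervisor replaces it with a fresh one. In a real system the worker would usually be a separate process; an in-process function stands in for it here, and all the names are hypothetical.

```python
class Fault(Exception):
    """Signals that a module detected an incorrect internal state."""

def parse_quantity(raw):
    """Fault-intolerant boundary check: refuse to continue on bad state."""
    quantity = int(raw)
    if quantity <= 0:
        raise Fault(f"expected a strictly positive integer, got {quantity}")
    return quantity

def supervise(make_worker, inputs, max_restarts=3):
    """Run a worker over inputs, replacing it with a fresh one after a fault."""
    worker, restarts, results = make_worker(), 0, []
    for item in inputs:
        try:
            results.append(worker(item))
        except Fault:
            if restarts >= max_restarts:
                raise  # give up and let a higher-level supervisor decide
            restarts += 1
            worker = make_worker()  # discard all state from the faulty worker
    return results, restarts

results, restarts = supervise(lambda: parse_quantity, ["3", "0", "7"])
print(results, restarts)
```

The supervisor never tries to repair the worker's state. It only counts restarts and escalates when the limit is reached.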

If a fault propagates in the system, it can produce visibly incorrect behavior. This is an error. Faults may occur without producing errors, as in the case of fault-tolerant modules that correct their own state before an error is observed. An error may be limited to an incorrect output displayed to a user. It can include any incorrect behavior, including data loss or corruption, network flooding, or launching attack drones.

At the component, module, or subsystem level, our mission is to prevent faults from causing errors.

A failure results when a system terminates without completing its job. For a long-running service or server, it stops responding to requests in a finite time. For a program that should run to completion and exit, it exits abnormally before completing. A failure may be preferable to an error, depending on the harm caused by the error.

Next time, I will address system stability in the face of faults, errors, and failures.

Power Systems

2016-09-05T10:14:18-05:00

This is an excerpt from something I'm working on this Labor Day holiday:

–

Large scale power outages act a lot like software failures. It starts with a small event, like a power line grounding out on a tree. Ordinarily that would be no big deal but under high-stress conditions it can turn into a cascading failure that affects millions of people. We can also learn from how power gets restored after an outage. Operators must perform a tricky balancing act between generation, transmission, and demand.

There used to be a common situation where power would be restored and then cut off again in a matter of seconds. It was especially common in the American South, where millions of air conditioners and refrigerators would all start at the same time. When a motor starts up, it draws a lot of current. You can see this in the way that lights dim when you start a circular saw. As the motor starts to spin, though, it creates "back EMF"–a kind of backpressure on the electrical current. (That's when the lights return to full brightness.) If you add up the effects of millions of electric motors starting all at once, you see a huge upward blip in current draw, followed by a quick drop due to back current. Power transmission systems would see the spike and drop and propagate that to the generation systems. First they would increase their draw then drop it dramatically. That would make the generation systems think they should shut off some of the turbines. Right about the time they started reducing supply, the initial surge of back EMF would decline and current load would come back up to baseline levels. The increased current load hit just when supply was declining, causing excess demand to trip circuit breakers. Lights out, again.

Smarter appliances and more modern control systems have mitigated that particular failure mode now, but there are still useful lessons for us.

Remember DAT?

2016-07-29T13:18:11-05:00

Do you remember Digital Audio Tape? DAT was supposed to have all the advantages of digital audio—high fidelity and perfect reproduction—plus the "advantages" of tape. (Presumably those advantages did not include melting on the dashboard of your Chevy Chevelle or spontaneously turning into The Best of Queen after a fortnight.)

In hindsight, we can see that DAT was a twilight product. As the sun set on the cassette era, DAT was an attempt to bridge the discontinuous technology change to digital music production. It was a twilight product because it didn't sufficiently reimagine the existing technology to offer enough of a new advantage nor did it eliminate enough of the old disadvantages.

We often see such twilight products right before a major, discontinuous shift.

I think we're in such a period when it comes to software development and deployment for cloud native systems. The tools we have attempt to take the traditional model into the new environment. But they don't yet reimagine the world of software development enough. Ten years from now, we will see that they offered some advantages but also carried forward baggage from the client-server era. Unix-like full operating systems, coding one process at a time, treating network as a secondary concern, ignoring memory hierarchy in the languages.

Whatever the "operating system for the cloud" is, we haven't seen it yet.

QA Instability Implies Production Instability

2016-07-14T10:06:11-05:00

Many companies that have trouble delivering software on time exhibit a common pathology. Developers working on the next release are frequently interrupted for production support issues with the current release. These interrupts never appear in project schedules but can take up half of the developers' hours. When you include the cost of task-switching, this means less than half of their available time is spent on the new feature work.

Invariably, when I see a lot of developer effort in production support I also find an unreliable QA environment. It is both unreliable in that it is frequently not available for testing, and unreliable in the sense that the system's behavior in QA is not a good predictor of its behavior in production.

QA environments are often configured differently than production. Not just in the usual sense of consuming a QA version of external services, but also in more basic ways. Server topology may be different. Memory settings, capacity ratios, and the presence of network components can all vary. QA often has a much simpler traffic routing scheme than production, particularly when a CDN is involved.

The other major source of QA unavailability has to do with data refreshes. QA environments either run with a minuscule, curated test data set, or they use some form of backward migration from production data. Each backward migration can be very disruptive, leading to one or more days where QA is not available.

Disruption arises when testers have to do manual data setup in order to test new features. These setups get overwritten with the next refresh. Sometimes, production data must be cleansed or scrubbed of PII before use in QA. This cleansing process often introduces its own data quality problems. The backward migration process must also be kept up to date so it can propagate data back into the schema for the next release. This requires copying data and schema into QA, then forward-migrating the schema according to the new release.

When many teams contend to get into a QA environment, that contention can result in lost time as well. Time is lost in delays when one team cannot move their code into QA during another team's test. It is also lost when one team overwrites test data that a different team had set up. And it can be lost when one team's code has bugs that prevent other teams from proceeding with their tests. Suppose one team works on login and registration, while another team works on friend requests. Clearly, the friend requests team cannot do their testing when login is broken. This last issue also applies across service boundaries: a service consumer may not be able to test because the QA version of their service provider is broken.

Finally, problems in QA simply take a lower priority than problems in production. Thus, the operations team may be fully consumed with current production issues, leaving the QA environment broken for extended periods. In a vicious feedback loop, this makes it likely that the next release will also create production problems.

My recommendations are these:

Give priority to well-functioning test environments.
Virtualize your test environments, so you can avoid inter-team dependencies on a QA environment.
Automate the backward data propagation, and make it part of spinning up a QA environment. When you must scrub PII, automate that process so that every QA environment can draw from a snapshot of cleansed data without impinging on the production DBAs.
If your QA stays unavailable because there are too many production issues, recognize that this is a self-sustaining pattern. You can temporarily redirect a "SWAT" team to fix QA and it will pay dividends for all future releases.
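As a sketch of what automated scrubbing might look like, the function below replaces PII values with stable pseudonyms, so the same production value always maps to the same QA value and joins still line up. The field names and salt are hypothetical. Note that a salted hash is pseudonymization, not full anonymization, so treat this as a starting point only.

```python
import hashlib

# Hypothetical column names; a real schema would define its own PII list.
PII_FIELDS = {"email", "name", "phone"}

def scrub(record, salt="qa-refresh"):
    """Replace PII values with stable pseudonyms; leave other fields alone."""
    cleaned = {}
    for field, value in record.items():
        if field in PII_FIELDS and value is not None:
            digest = hashlib.sha256(f"{salt}:{field}:{value}".encode()).hexdigest()
            cleaned[field] = f"{field}-{digest[:12]}"
        else:
            cleaned[field] = value
    return cleaned

row = {"id": 42, "email": "pat@example.com", "name": "Pat", "plan": "gold"}
print(scrub(row))
```

Because the function is deterministic, it can run as one step in spinning up each QA environment, with every environment drawing from the same cleansed snapshot.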

Wittgenstein and Design

2016-07-10T09:27:27-05:00

What does a philosopher born in the 19th Century have to say about software design? More than you might think, particularly his ideas about family resemblance.

Wittgenstein used the subject of "games" to illustrate an idea. We'll start with a counter-example. Suppose we operate with the then-prevailing notion that words are defined like sets in axiomatic set theory. Then there is a decision procedure that will let us decide whether something is a member of the set "games" or not. Such a decision procedure must include everything that is a game and exclude everything which is not a game. Can we define such a decision procedure?

Does a game require competition? Some do. Not all.

Does a game have a score? Or an objective? Not all.

Does a game involve more than one person? Not necessarily.

Is a game a frivolous expenditure of energy? Some are. Others have deep moral and philosophical lessons.

How is a game of football like a game of solitaire?

It's easy to see that mancala and go have something in common… little rounded stones. But what do they have in common with Minecraft? Stones?

Wittgenstein said that this is not an issue for set theory. Instead, he talked about family resemblances. As described in Wikipedia, "things which could be thought to be connected by one essential common feature may in fact be connected by a series of overlapping similarities, where no one feature is common to all."

For games, this means there is no single feature that makes something a game. Instead, there are a set of overlapping similarities that make things more gamelike or less gamelike. We can even think about things that share more of the features as being more like each other. So go and mancala share features like: two players, stones on a board, alternating turns, one winner, ancient, cerebral, positional. This makes them pretty similar. A professionally played team sport with a ball on a field shares few qualities with go. (Although "people excited about the outcome" and "positional" might be common.) So the feature-distance between go and football is large, yet they are both still games.
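One way to make "feature-distance" concrete is the Jaccard distance between feature sets. That metric is my choice for illustration; nothing in the family-resemblance idea requires it. The feature lists come from the paragraph above, plus my assumption that people get excited about the outcome of a go match.

```python
def feature_distance(a, b):
    """Jaccard distance: 0.0 for identical feature sets, 1.0 for disjoint ones."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

shared_board_game = {"two players", "stones on a board", "alternating turns",
                     "one winner", "ancient", "cerebral", "positional"}
games = {
    "go": shared_board_game | {"people excited about the outcome"},
    "mancala": shared_board_game,
    "football": {"professional", "team sport", "ball on a field",
                 "people excited about the outcome", "positional"},
}

for a, b in [("go", "mancala"), ("go", "football")]:
    print(f"{a} vs {b}: {feature_distance(games[a], games[b]):.2f}")
```

Go and mancala come out close together, go and football far apart, and at no point does the code need a test for "is a game."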

I think this relates to the tasks of software design and architecture. We have a strong tendency to go looking for nouns in our designs. Once we find a noun in a domain, we want to make a software artifact that captures all members of the set induced by that noun. But that only works if we stick with axiomatic set theory. Set theory works well for well-defined technical concepts and much less well for things in the human sphere.

One simple example: the humble "name" field. Go read Falsehoods programmers believe about names. How do you feel about that "first name", "last name" database structure now? After reading that list, how much can you confidently say about instances of a "Name" class? Or a "Name" service?

We have all these debates about "noun-first" or "verb-first". Back in The Perils of Semantic Coupling, I argued for a behavior-oriented approach rather than a noun-oriented approach. Stop asking "what is this thing?" and ask instead "what can you do with it?" That leads us toward segregated interfaces.

Now I'd augment that to emphasize those feature descriptions rather than noun-like descriptions. Instead of noun-first or verb-first I'm going to try "adjective-first".

In Love With Your Warts

2016-04-08T14:24:17-04:00

If you help other people solve problems, you may have run into this phenomenon: a person gleefully tells you how messed up their environment is. Processes make no sense, roadblocks are everywhere, and all previous attempts to make things better have failed. As you explore the details, you talk about directions to try. But every suggestion is met with an explanation of why it won't work.

I say that these folks are "in love with their own warts." That is, they know there's a problem, but they've somehow been acclimated to it to such a degree that they can't imagine a different world. They will consistently point to outside agents as the author of their woes, without realizing how much resistance they generate themselves.

Over time, by the way, there's a reinforcing process. People who think and talk this way will cluster and drive out the less cynical.

These people can be intensely frustrating to work with, until you understand them. Understanding allows empathy, which is the only way to get past that self-generated resistance.

The first thing to understand is that any conversation about their problems isn't really about their problems. An opening statement like, "We tried that but it didn't work," isn't really asking for a solution. Instead, it's an invitation to play a game. That game is called, "Stump the expert." The player wins when you concede that nothing can ever improve. You "win" by suggesting something that the player cannot find an objection to. It's not a real victory though, for reasons that will be clear in a moment.

Why does the player want to win this game instead of improving their world? For one thing, any solution you find is an implicit critique of the person who has been there. Suppose the solution is to shift a responsibility from one team to another. That requires management support in both teams. If that solution works, then it means the game-player could have produced the same improvement ages ago, but didn't have enough courage to make it happen. Other changes might imply the game-player lacked sufficient authority, vision, credibility, or, rarely, technical acumen.

In every case, the game-player feels that your solution highlights a deficiency of theirs.

This is why "winning" the discussion isn't really a win. You may get a grudging concession about the avenue to explore, but you're still generating more resistance from that game-player.

My usual approach is to decline the invitation to the game. I don't try to find point-by-point answers to things that have failed in the past. I usually draw analogies to other organizations that have faced the same challenges and make parallels to their solutions. Failing that, I accept the objections (almost always phrased as roadblocks thrown up by others) and just tell them, "Let me handle that." (Most of the time, I find that people on the opposite side of a boundary express roadblocks from the other side that all eventually cancel each other out. That is, the roadblock turns out to be illusory.)

I'd like to hear from you, dear Reader. Assume that you cannot simply fire or transfer the game-player. They have value beyond this particular habitual reflex.

How would you handle a situation like this? What have you tried, and what works?

Some Useful Techniques From Bygone Eras

2016-03-02T13:06:44-06:00

CRC

I find the old object-oriented design technique of CRC Cards to be useful when defining service designs. CRC is short for "Class, Responsibilities, Collaborators." It's a way to define what behavior belongs inside a service and what behavior it should delegate to other services.

Simulating a system via CRC is a good exercise for a group. Each person takes a CRC card and plays the role of that service. A request starts from outside the services and enters with one person. They can do only those things written down as their "responsibilities." For anything else, they must send a message to someone else.

Personifying and role-playing really helps identify gaps in service design. You'll find gaps where data or entire services are needed but don't exist.

Tell, Don't Ask

The more services you have, the more operational complexity you take on. Some patterns of service design seem to encourage high coupling and a high fan-in to a small number of critical services. (Usually, these are "entity" services… i.e., CRUDdy REST over a database table.)

Instead, I find it better to tell services what you want them to do. Don't ask them for information, make a decision, then change some state.

Organizing around Tell, Don't Ask leads you to design services around behavior instead of data. You'll probably find you denormalize your data to make T,DA work. That's OK. The runtime benefit of cleaner coupling will be worth it.
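Here is a minimal sketch of the difference, using a made-up inventory service. The class, method names, and the stock rule are all hypothetical. The "ask" caller pulls state out, decides, and pushes state back. The "tell" caller states its intent and lets the service own the decision.

```python
class InsufficientStock(Exception):
    """Raised when a reservation asks for more than is on hand."""


class InventoryService:
    """Hypothetical service that owns both the stock data and the rule."""

    def __init__(self, stock):
        self._stock = dict(stock)

    # "Ask" style: exposes state so callers can decide for themselves.
    def quantity_on_hand(self, sku):
        return self._stock.get(sku, 0)

    def set_quantity(self, sku, quantity):
        self._stock[sku] = quantity

    # "Tell" style: the caller states intent; the decision lives here.
    def reserve(self, sku, quantity):
        available = self._stock.get(sku, 0)
        if quantity > available:
            raise InsufficientStock(f"{sku}: wanted {quantity}, have {available}")
        self._stock[sku] = available - quantity


def ask_style_checkout(inventory, sku, quantity):
    # Two calls with a decision in between; the rule leaks into every caller.
    on_hand = inventory.quantity_on_hand(sku)
    if on_hand < quantity:
        return False
    inventory.set_quantity(sku, on_hand - quantity)
    return True


def tell_style_checkout(inventory, sku, quantity):
    # One call; the caller only learns whether the request succeeded.
    try:
        inventory.reserve(sku, quantity)
        return True
    except InsufficientStock:
        return False
```

Across a network, the ask-style version also opens a window between the read and the write where another caller can change the stock. The tell-style version keeps that window inside one service.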

Data Flow Diagrams

If you ask someone who isn't trained in UML to draw a system's architecture, they will often draw something close to a Data Flow Diagram. This diagram shows data repositories and the transformation processes that populate them. DFDs are a very helpful tool because they force you to ask a few really key questions:

Where did that information come from?
How did it get there?
Who updates it?
Who uses the data we produce?

In particular, answering that last question forces you to think about whether you're producing the right data for the downstream consumer.

Generalized Minimalism

2016-02-29T10:22:43-06:00

My daily language is Clojure. One of the joys of working in Clojure is its great core library. The core library has a wealth of functions that apply broadly across data structures. A typical function looks like this:

(defn nthnext
  "Returns the nth next of coll, (seq coll) when n is 0."
  {:added "1.0"
   :static true}
  [coll n]
    (loop [n n xs (seq coll)]
      (if (and xs (pos? n))
        (recur (dec n) (next xs))
        xs)))

I want to call your attention to two specific forms. The “seq” function works on any “Seqable” collection type. (N.B.: It has special cases for other types, including some to make Java interop more pleasant. But the core behavior is about Seqable.) The “next” function is similar: it works on anything that already is a Seq or anything that can be made into a Seq.

This provides a nice degree of abstraction and through that, generality.

Pretty much all of the core data types either implement ISeq or Seqable. That means I can call “seq”, “next”, and “nth” on any of them. Other data types can be brought into the fold by extending one of those interfaces to them. We extend the data to meet the core functions, instead of overloading functions for data types.

YAGNI Isn’t About Being Specific

Under this approach, writing a general function is both simpler and easier than writing a specific one.

For example, suppose I need to do that classic example of trivial functionality: summing a list of integers. The most natural way for me to write that is like this:

(reduce + 0 xs)

That is both simple and general. But it doesn’t meet the spec I said! It sums any numeric type, not just integers. If I decide that I really must restrict it to integers, I have to add code.

(assert (every? integer? xs))
(reduce + 0 xs)

This is a pattern I find pretty often when working in Clojure. When I generalize, I do it by removing special cases. This goes hand-in-hand with decomposing behavior into smaller and smaller units. As each unit gets smaller, I find it can be more general.

Here’s a less trivial example. Today, I’m working on a library we call Vase. (See Paul deGrandis' talk on data-driven systems for more about Vase.) In particular, I’m updating it to work with a new routing syntax in Pedestal. With the new routing syntax, we can build routes from ordinary Clojure data–no more need for oddly-placed syntax-quoting.

One of the core concepts in Pedestal is the “interceptor”. They fulfill the same role as middleware in Ring. (One difference: interceptors are data structures that contain functions. Interceptors compose by making a vector of data, whereas Ring middleware composes by creating function closures. I find it easier to debug a stack of data than a stack of opaque closures.) Any particular route in Pedestal will have a list of interceptors that apply to that route.

When a service that uses Pedestal supplies interceptors, it composes a list of them. Suppose I want to make a convenience function that helps application developers build up that list. What would I need to do?

You probably already figured out that any such “convenience” functions I could create would basically duplicate core functions, but with added restrictions. Instead of “cons”, “conj”, “take”, and “drop”, I’d have to create “icons”, “iconj”, “itake”, and “idrop”. What a waste.

I have to ask myself, “Do I need some special behavior here?” And the answer is “YAGNI.”

YAGNI Is About Adding “Stuff”

YAGNI is commonly understood to mean “don’t generalize until you need to.” In some languages and libraries, I suppose that’s the right read. In my world, though, it is specializing that requires adding stuff. So I often call YAGNI if someone tries to make a thing less general than it could be.

Small functions that operate on abstractions instead of concrete types are both general and simple.

Redeeming the Original Sin

2016-02-12T10:24:38-06:00

While reading Bryan Cantrill's slides from Papers We Love NYC, I was struck by something. One of the very first slides says:

The traditional UNIX security model is simple but inexpressive.

The papers go on to describe a progression of techniques to isolate processes from the host environment to greater and greater degrees. It began with the ancient precursor 'chroot' and continued through Jails and Zones. Each builds upon the previous work to improve the degree of isolation.

We've seen a parallel series of efforts in the Linux realm with virtual machines and containers.

However!

All of these are introduced to restore the degree of isolation and resource control that was originally present in mainframe operating systems. Furthermore, it was the model that Multics was meant to supply.

Unix started with a simplified security model, meant for single user machines. It was "dumbed down" enough to be easy to implement on the limited machines of the day.

Zones, VMs, containers… they're all ways to redeem Unix from its original sin. Maybe what we should look at is a better operating system?

What's Lost With a DevOps Team

2016-01-27T15:03:28-06:00

Please understand, dear Reader, that I write this with positive intention. I'm not here to impugn any person or organization. I want to talk about some decisions and their natural consequences. These consequences seem negative to me and after reading this post you may agree.

When an established company faces a technology innovation, they often create a new team to adopt and exploit that innovation. During my career, I've seen this pattern play out with microcomputers, client/server architecture, open systems, web development, agile development, cloud architecture, NoSQL, and DevOps. Perhaps we can explore the pros and cons of that overall approach in some other post. For now, I want to specifically address the DevOps team.

A DevOps team gets created as an intermediary between development and operations. This is especially likely when dev and ops report through different management chains. That is to say, in a functionally-oriented structure. In a product-oriented structure, it is less likely.

This intermediary team gets tasked with automating releases and deployments. They are the ones to adopt some code-as-configuration platform. Sometimes they are also tasked with building an internal platform-as-a-service, but that more often falls to the infrastructure and operations teams.

So the DevOps team has development as their customer. Operations has the DevOps team as their customer. Work flows from development, through the tools created by the DevOps team, and into production. It would seem to capture the benefits of automation: it becomes predictable, repeatable, and safe.

All of that is true. However, even though this is an improvement, it misses out on even greater improvements that could be realized.

The key problem is the unclosed feedback loop. When developers are directly exposed to production operations, they learn. Sometimes they learn from negative feedback: getting woken up for support calls, debugging performance problems, or that horrible icy feeling in your stomach when you realize that you just shut down the wrong database in production.

With a DevOps team sitting between development and operations, the operations team remains in the "learning position." But they lack the ability to directly improve the systems. Suppose a log message is ambiguous. If the operator who sees it can't directly change the source code, then the message will never get corrected. (It's important, but small… exactly the thing least likely to be worth filing a change request for.)

Over longer time spans, the things we learn from production should influence the entire architecture: from technology choices to code patterns and common libraries. A DevOps team sitting between development and operations impedes that learning.

DevOps is meant to be a style of interaction: direct collaboration between development and operations. A team in between that automates things is a tools team. It's OK to call it a tools team. Tools are a good thing, despite what corporate budgeting seems to say these days.

Instead of creating a flow from development to DevOps to operations, consider putting development, tools, and operations all together and giving them the same goals. They should be collaborators working shoulder-to-shoulder rather than work stations in a software factory.

Give Them The Button!

2015-10-23T10:37:45-05:00

Here's a syllogism for you:

Every technical review process is a queue
Queues are evil
Therefore, every review process is evil

Nobody likes a review process. Teams who have to go through the review look for any way to dodge it. The reviewers inevitably delegate the task downward and downward.

The only reason we ever create a review process is because we think someone else is going to feed us a bunch of garbage. They get created like this:

It starts when someone breaks a thing that they can't or aren't allowed to fix. The responsibility for repair goes to a different person or group. That party shoulders both responsibility for fixing the thing and also blame for allowing it to get screwed up in the first place.

(This is an unclosed feedback loop, but it is very common. Got a separate development and operations group? Got a separate DBA group from development or operations? Got a security team?)

As a followup, to ensure "THIS MUST NEVER HAPPEN AGAIN" the responsible party imposes a review process.

Most of the time, the review process succeeds at preventing the same kind of failure from recurring. The resulting dynamic looks like this:

The hidden cost is the time lost. Every time that review process has to go off, the creator must prepare secondary artifacts: some kind of submission to get on the calendar, a briefing, maybe even a presentation. All of these are non-value-adding to the end customer. Muda. Then there's the delay on the review meeting or email itself. Consider that there is usually not just one review but several needed to get a major release out the door and you can see how release cycles start to stretch out and out.

Is there a way we can get the benefit of the review process without incurring the waste?

Would I be asking the question if I didn't have an answer?

The key is to think about what the reviewer actually does. There are two possibilities:

It's purely a paperwork process. I'll automate this away with a script that makes a PDF and automatically emails it to whoever needs it. Done.
The reviewer applies knowledge and experience to look for harmful situations.

Let's talk mostly about the latter case. A lot of our technology has land mines. Sometimes that is because we have very general purpose tools available. Sometimes we use them in ways that would be OK in a different situation but fail in the current one. Indexing an RDBMS schema is a perfect example of this.

Sometimes, it's also because the creators just lack some experience or education. Or the technology just has giant, truck-sized holes in it.

Whatever the reason, we expect that the reviewer is adding intelligence, like so:

This benefits the system, but it could be much better. Let's look at some of the downsides:

Throughput is limited to the reviewer's bandwidth. If they truly have a lot of knowledge and experience, then they won't have much bandwidth. They'll be needed elsewhere to solve problems.
The creator learns from the review meetings… by getting dinged for everything wrong. Not a rewarding process.
It is vulnerable to the reviewer's availability and presence.

I'd much rather see the reviewer codify that knowledge by building it into automation. Make the automation enforce the practices and standards. Make it smart enough to help the creator stay out of trouble. Better still, make it smart enough to help the creator solve problems successfully instead of just rejecting low quality inputs.

With this structure, you get much more leverage from the responsible party. Their knowledge gets applied across every invocation of the process. Because the feedback is immediate, the creator can learn much faster. This is how you build organizational knowledge.

Some technology is not amenable to this kind of automation. For example, parsing some developer's DDL to figure out whether they've indexed things properly is a massive undertaking. To me, that's a sufficient reason to either change how you use the technology or just change technology. With the DDL, you could move to a declarative framework for database changes (e.g., Liquibase). Or you could use virtualization to spin up a test database, apply the change, and see how it performs.

Or you can move to a database where the schema is itself data, available for query and inspection with ordinary program logic.
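To show what "schema as data" buys you, here is a small sketch. The schema layout and the rule (every foreign key should be the leading column of some index) are my own assumptions, standing in for whatever your reviewer actually checks. The point is that once the schema is plain data, a reviewer's rule of thumb becomes a short function that gives advice immediately.

```python
# Hypothetical schema description: plain data, not DDL text.
SCHEMA = {
    "orders": {
        "columns": ["id", "customer_id", "placed_at"],
        "foreign_keys": ["customer_id"],
        "indexes": [["id"]],
    },
    "customers": {
        "columns": ["id", "email"],
        "foreign_keys": [],
        "indexes": [["id"], ["email"]],
    },
}


def unindexed_foreign_keys(schema):
    """Return advice strings for foreign keys that no index leads with."""
    advice = []
    for table, spec in sorted(schema.items()):
        # Only the first column of an index helps a lookup on that column alone.
        leading = {index[0] for index in spec["indexes"] if index}
        for column in spec["foreign_keys"]:
            if column not in leading:
                advice.append(
                    f"{table}.{column} is a foreign key with no index; "
                    f"consider adding an index on ({column})."
                )
    return advice
```

The messages say what to do next rather than just "rejected." That is the difference between automation that helps the creator and automation that merely replaces the review meeting.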

The automation may not be able to cover 100% of the cases in general-purpose programming. That's why local context is important. As long as there is at least one way to solve the problem that works with the local infrastructure and automation, then the problem can be solved. In other words, we can constrain our languages and tools to fit the automation, too.

Finally, there may be a need for an exception process, where the automation can't decide whether something is viable or not. That's a great time to get the responsible party involved. That review will actually add value because every party involved will learn. Afterward, the RP may improve the automation or may even improve the target system itself.

After all, with all the time that you're not spending in pointless reviews, you have to find something to do with yourself.

Happy queue hunting!

C9D9 on Architecture for Continuous Delivery

2015-10-18T09:29:01-05:00

Every single person I've heard talk about Continuous Delivery says you have to change your system's architecture to succeed with it. Despite that, we keep seeing "lift and shift" efforts. So I was happy to be invited to join a panel to discuss architecture for Continuous Delivery. We had an online discussion last Tuesday on the C9D9 series, hosted by Electric Cloud.

They made the recording available immediately after the panel, along with a shiny new embed code.

Best of all, they supplied a transcript, so I can share some excerpts here. (Lightly edited for grammar, since I have relatives who are editors and I must face them with my head held high.)

Pipeline Orchestration

It's easy to focus on the pipeline as the thing that delivers code into production. But I want to talk about two other central roles that it plays. One, with regard to risk management. To me the pipeline is not so much about ushering code out to production, but it's about finding every opportunity to reject a harmful change, or a bad change, prior to letting it get into production. So I view the pipeline as an essential part of risk management.

I've also had a lot of lean training, so I'd look on the deployment pipeline as the value stream that developers use to deliver value to their customers. In that respect we need to think about the pipeline as production-grade infrastructure, and we need to treat it with production-like SLAs.

Cattle, Not Pets

I think a lot has been said about "cattle versus pets" over the last ten years or so. I just want to add one thing - the real challenge is identity. There are a ton of systems and frameworks that implicitly assume stable identity on machines. Particularly a lot of distributed software toolkits. When you do have the cattle model, a machine identity may disappear and never come back again. I just really hope you're not building up a queue of undelivered messages for that machine.

Service Orientation and Decoupling

Having teams running in parallel and being able to develop more or less independently - I talk about team scale autonomy. But if there are very long builds, large artifacts and a large number of artifacts, I regard that as the consequence of using languages and tools that are early bound and early linked. I don’t think it's any accident that the people I heard of first doing continuous delivery were using PHP. You can regard each PHP file as its own deployable artifact, and so things move very quickly. If everything we wrote was extremely late bound, then our deployment would be an rsync command. So to an extent, breaking things down into services is a response to large artifacts, long build times, that's one side of that.

The other side is team scale autonomy and the fact that you can't beat Conway’s Law and that absolutely holds true. (Conway’s Law: an organization is constrained to produce software that recapitulates the structure of the organization itself. If you have four teams working on a compiler, you're going to have a four pass compiler.)

Now, when we talk about decoupling, I need to talk about two different types of decoupling, both important.

The bigger your team gets, the more communication overhead goes up. We have known that since the 1960s, so breaking that down makes sense. But then we have to recompose things at runtime and that's when coupling becomes a big issue. Operational coupling happens minute by minute by minute. If I have service A calling service B, service B goes down, I have to have some response. If I don't do anything else, service A is also going to go down. So I need to build in some mechanisms to provide operational decoupling, maybe that's a cache, maybe it’s timeouts, maybe it's a circuit breaker, something along those lines, to protect one service from the failure of another service.

It's not just the failure of the service! A deployment to the other service looks exactly like a failure from the perspective of the consumer. It's simply not responding to requests within an acceptable time.

So we have to pay attention to the operational decoupling.
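As a rough illustration of one of those mechanisms, here is a minimal circuit breaker. It is a sketch under simplifying assumptions: single-threaded, counts consecutive failures only, and takes an injectable clock so it can be exercised without waiting. The names and default thresholds are mine, not from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("downstream call skipped")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures in a row, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def fetch_with_fallback(breaker, fetch, cached_value):
    """Service A's side: answer from cache when service B is failing or skipped."""
    try:
        return breaker.call(fetch)
    except Exception:
        return cached_value
```

Service A gets an answer either way. Service B gets some breathing room while it restarts or finishes a deployment, because A stops hammering it until the cool-down passes.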

Semantic coupling is even more insidious, and that's what plays out over a span of months and years. We talk about API versioning quite a bit, but there are other kinds of semantic coupling that creep in. I've been harping a lot lately about identifiers. If I have to pass an itemID to another system then I'm sort of implicitly saying there is one universe of itemIDs and that system has them all, and I can only talk to that system for items with those IDs.

Similarly with many services that we create, we create the service as though there is one instance of the service. We'd be better off creating the code that can instantiate that service many times for many consumers. So if you create a calendar service, don’t make one calendar that everyone has eventIDs on. Make a calendar service where we can ask for a new calendar and it gives you back a URL for a whole new calendar that is yours and only yours. This is the way you would build it if you were building a SaaS business. That's how you would need to think about the decoupled services internally.
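A small sketch of the calendar idea, with an in-memory stand-in for the service. The class, the base URL, and the URL layout are all hypothetical. What matters is that callers ask for a new calendar, get back a URL, and every event identifier lives under that URL instead of in one global universe of eventIDs.

```python
import uuid


class CalendarService:
    """Hypothetical in-memory service that hands out whole calendars."""

    def __init__(self, base_url="https://calendars.example.com"):
        self._base_url = base_url
        self._calendars = {}  # calendar URL -> {event URL -> event data}

    def create_calendar(self):
        # Each consumer gets its own calendar, identified only by its URL.
        url = f"{self._base_url}/calendars/{uuid.uuid4()}"
        self._calendars[url] = {}
        return url

    def add_event(self, calendar_url, title):
        events = self._calendars[calendar_url]
        # Event identifiers are scoped to the calendar, not the whole service.
        event_url = f"{calendar_url}/events/{len(events) + 1}"
        events[event_url] = {"title": title}
        return event_url

    def events(self, calendar_url):
        return dict(self._calendars[calendar_url])
```

Because consumers hold URLs and not bare IDs, nothing in their code assumes which system owns a calendar. A second deployment of the service could hand out calendars too, and the callers would never know.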

Messaging and Data Management

If I'm truly deploying continuously then I've got version N and version N+1 running against the same data source. So I need some way to accommodate that. In older less-flexible kinds of databases, that means triggers, shims, extra views, that kind of scaffolding.

I heard a great story, I think it's from Pinterest at Velocity a couple of years back. They had started with a monolithic user database and found they needed to split the table. After they already had 60 million users! But they were able to make many small deployments that each added kind of one step for an incremental migration. And once they got that in place, they let it sit for three months, at the end of that they found who was left and did a batch migration of those. Then they did a series of incremental deployments to remove the extra data management stuff.

So it's one of those cases - doing continuous delivery necessitates that you're more sophisticated about your data changes, but it also gives you new tools to accomplish those changes.
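The shape of that kind of migration can be sketched in a few lines. To be clear, this is not how Pinterest did it; I don't know their implementation. It is a toy in-memory store, with hypothetical field names, that migrates a user on first access and then sweeps up whoever is left in a batch.

```python
class UserStore:
    """Hypothetical store mid-migration: 'legacy' holds old monolithic rows,
    'profiles' and 'credentials' are the new split tables."""

    def __init__(self, legacy):
        self.legacy = dict(legacy)
        self.profiles = {}
        self.credentials = {}

    def _migrate(self, user_id):
        # Move one user from the old shape to the new one.
        row = self.legacy.pop(user_id)
        self.profiles[user_id] = {"name": row["name"]}
        self.credentials[user_id] = {"password_hash": row["password_hash"]}

    def get_profile(self, user_id):
        # Deployed while both shapes exist: migrate on first touch.
        if user_id in self.legacy:
            self._migrate(user_id)
        return self.profiles[user_id]

    def migrate_remaining(self):
        # Batch step for users who never showed up during the waiting period.
        remaining = list(self.legacy)
        for user_id in remaining:
            self._migrate(user_id)
        return len(remaining)
```

Each method corresponds to a small deployment. Once migrate_remaining reports zero, the legacy branch in get_profile is dead code and a later deployment can delete it.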

There is a wide crop of databases that don't require that kind of care and feeding when you make deployments. If you are truly architecting for operational ease and delivery, then that might be a sufficient reason to choose one of the newer databases over one of the less flexible relational stores.

Conclusion

The C9D9 discussion was quite enjoyable. The hosts ran the panel well, and even though all of us are pretty long-winded, nobody was able to filibuster. I'll be happy to join them again for another discussion some time.

Software Eats the World

2015-10-03T18:10:31-04:00

During this morning's drive, I crossed several small overpasses. It reminded me that the American Society of Civil Engineers rated more than 20% of our bridges as structurally deficient or functionally obsolete. That got me to thinking about how we even know how many bridges there are in a country as large as the U.S.

Some time in the past, it would require an army of people to go survey all the roads, looking for bridges and adding them to a ledger. Now, I'm sure it's a query in a geographical database. The information had to be entered at least once, but now that it's in the database we don't need people to go wandering about with clicker counters.

Instead of clipboards and paper, the bridge survey needed data import from thousands of state and county GIS databases. That means coders to write the import jobs and DBAs to set up the target systems. It needed queries to count up the bridges and cross-check with inspection reports. So that requires more coders and maybe some UX designers for data visualization.

Back in 2011, Marc Andreessen said "software is eating the world". There's no reason to think that's going to slow down soon. And as software eats the world, work becomes tech work.

Microservices versus Lean

2015-08-11T06:38:51-05:00

Back in April, I had the good fortune to speak at Craft Conf in lovely Budapest. It's a fantastic conference that I would recommend.

During that conference, Randy Shoup talked about his experience migrating from monoliths to microservices at eBay and Google. David, one of the audience members, asked an interesting question at the end of Randy's talk. (I'm sorry that I didn't get the full name of the questioner… if you are reading this, please leave a comment to let me know who you are.)

"Isn't the concept of microservices contradictory with the lean/agile principles of a) collective code ownership, and b) optimizing whole processes and systems instead of small units?"

Randy already did a great job of responding to the first part of that question, so please view the video to hear his answer there. He didn't have time to respond to the second part so I don't know what his answer would be, but I will tell you mine.

Start From The "Why"

Let's start by answering the question with a question. Why do we pursue Lean development in the first place? Your specific answer may vary, but I bet it relates back to "better use of capital" or "turning ideas into profit sooner." Both of these are statements about efficiency: efficient use of capital and efficient use of time.

One of the first Lean changes is to reorganize people and processes around the value streams. That is a big upheaval! It often means moving from a functional structure to a cross-functional structure. (And I don't mean matrixing!) Just moving to that cross-functional structure will deliver big improvements to cycle time and process efficiency. After that, teams in each value stream can further optimize to reduce their cycle time.

The next focus is on reducing "inventory." For development, we consider any unreleased code or stories to be inventory. So, work-in-progress code, features that have been finished but not deployed, and the entirety of the backlog all count as inventory.

Reducing inventory always has the effect of making more problems visible. Maybe there are process bottlenecks to address, or maybe there are high defect rates at certain steps (like failed deployments to production, or a lot of rejected builds).

This is the start of the real optimization loop: reduce the inventory until a new problem is revealed. Solve the problem in a way that allows you to further reduce inventory.

Which is the Value Stream?

David's question seems to originate from the view that the value stream is the request handling process. So if a single request hits a dozen services, then one value stream cuts across multiple organizational boundaries. That would indeed be problematic.

However, I think the more useful viewpoint is that the value stream is "the software delivery process" itself. This is based on the premise that the value stream delivers "things customers would pay for." Well, a customer wouldn't pay for a single request to be handled. They would, however, pay for a whole new feature in your product.

Viewed that way, each service in production is the terminal point of its own value stream. So, Lean does not conflict with a microservice architecture. But could a microservice architecture conflict with Lean?

Return to "Why"

We asked, "Why Lean?" Now, let's ask "Why microservices?" The answer is always "We want to preserve flexibility as we scale the organization." Microservices are about embracing change at a macroscopic level. That has nothing to do with capital efficiency!

So are these ideas contradictory? To answer that, I need to dig into another aspect of Lean efforts: infrastructure.

Efficiency, Specialization, and Infrastructure

In the early days of aviation, airplanes were made of canvas and wood. They could land at pretty much any meadow that didn't have cows or sheep in the way. Pilots navigated by sight and landmarks, including giant concrete arrows on the ground. Planes couldn't go very fast, fly very high, carry many passengers, or haul a lot of cargo.

The maximum takeoff weight of an Airbus A380 is now 1.2 million pounds. It requires a specially reinforced runway of at least 9,020 feet and typically carries 525 passengers. It flies at an altitude of more than 8 miles. This is not an airplane that you navigate by eyeballing landmarks.

This aircraft is amazingly efficient. Achieving that efficiency requires extensive infrastructure. Radar on the plane and on the ground. Multiple comms systems. An extensive array of radio beacons and air traffic controllers on the ground and dozens of satellites in space, all sending signals to the on-board network of flight management systems. Billions of lines of code running across these devices. Airports with jetbridges that have multiple connections to the aircraft. Special vehicles to tow the plane, push the plane out, haul bags, fuel, de-ice, remove waste water… the list goes on and on.

In short, this is not just an airplane. It is part of an elaborate air transportation system.

It should be pretty obvious that the incredible efficiency of modern airliners comes at the expense of flexibility. Not just in terms of the individual aircraft, but in terms of changes to any part of the whole system.

You can see this play out in any technological arena. As we increase the systems' efficiency, we accumulate infrastructure that both enables the efficient operation and also constrains the system to its current mode of operation.

In Lean initiatives, there is a gradual shift from draining inventory and solving existing problems into creating infrastructure to add efficiency. It's not a bright line or a milestone to reach, but it is noticeable. As you get further into the infrastructure-efficiency realm, you must recognize two effects:

You will get better at certain actions.
Other actions become much, much harder.

As an example, suppose you are optimizing the value stream for delivering applications. (A reasonable thing to do.) You will eventually find that you need an automated way to move code into production. You may choose to build golden master images, or automate deployment via scripts, or use Docker to deploy the same configuration everywhere. You may commit to vSphere, Xen, OpenStack, or whatever. As you make these decisions, you make it easier to move code using the chosen stack and much, much harder to do it any other way.

Full Circle

So, with all that background, I'm finally ready to address the question of whether microservices and Lean are in conflict.

Given that:

You want maneuverability from microservices.
Your value stream is delivering features into production.
You pursue Lean past the inventory-draining phase.
Further efficiency improvements require you to commit to infrastructure and an extended system.
That extended system will not be easy to change, no matter what you choose or how you build it.

Then the answer is "no."

Development is Production

2015-08-06T08:06:33-05:00

When I was at Totality, we treated an outage in our customers' content management system as a Sev 2 issue. It ranked right behind "Revenue Stopped" in priority. Content management is critical to the merchants, copy writers, and editors. Without it, they cannot do their jobs.

For some reason, we always treated dev environment or QA environment issues as a Sev 3 or 4, with the "when I get around to it" SLA. I've come to believe that was incorrect.

The development environment and the QA environment are the critical tools needed for developers to do their jobs. When an environment is broken, it means those people are less effective. They might even be idle.

Why would you treat the tools developers use as any less critical? And yet, I see one company after another with unreliable, broken, half-integrated QA environments. They've got bad data, unreliable items, and manual test setup.

If any stage of the development pipeline is broken, that's exactly equivalent to the content pipeline being broken.

Development is production.

QA is production.

Your build pipeline is production.

Treat them accordingly!

The Fear Cycle

2015-07-15T07:11:38-05:00

Once you begin to fear your technology, you will shortly have cause to fear it even more.

The Fear Cycle goes like this:

Small changes have unpredictable, scary, or costly results.
We begin to fear making changes.
We try to make every change as small and local as possible.
The code base accumulates warts, knobs, and special cases.
Fear intensifies.

Fear starts when an innocuous change goes badly. Maybe a production outage results, or maybe just an embarrassing bug. It may be a bug that gets upper management attention. Nothing instills fear like an executive committee meeting about your code defect!

This sphincter-shrinker originated because a developer couldn't predict all the ramifications of a change. Maybe the test suite was inadequate. Or there are special cases that are only observed in production. (E.g., that one particular customer whose data setup is different from everyone else's.) Whatever the specific cause, the general result is, "I didn't know that would happen."

Add a few of these events into the company lore and you'll find that developers and project managers become loath to touch anything outside their narrow scope. They seek local safety.

The trouble with local safety is that it requires kludges. The code base will inevitably deteriorate as pressure for larger changes and broader refactoring builds without release.

The vicious cycle is completed when one of those local kludges is responsible for someone else's "What? I didn't know that!" moment. At this point, the fear cycle is self-sustaining. The cost of even small changes will continue to increase without limit. The time needed to get changes released will increase as well.

Breaking Point

One of several things will happen:

A big bang rewrite (usually with a different team.) The focus will be "this time, we do it right!" See also: second system syndrome, Things You Should Never Do, Part I.
Large scale outsourcing.
Sell off the damaged assets to another company.

Avoiding the Cycle

The fear cycle starts when people treat a technical problem as a personal one. The first time a seemingly simple change causes a large and unpredictable effect, you need to convene a technical SWAT team to determine why the system allowed it to happen and what technical changes can avoid it in the future.

The worst response to a negative event is a tribunal.

Sadly, the difference between a technical SWAT team and a tribunal is mostly in how the individuals in that group approach the issue. Wise leadership is required to avoid the fear cycle. Look to people with experience in operations or technical management.

Breaking the Cycle

Like many reinforcing loops in an organization, the fear cycle is wickedly hard to break. So far, I have not observed any instance of a company successfully breaking out of it. If you have, I would be very interested to hear your experiences!

Components and Glue

2015-06-17T07:08:34-05:00

There's a well-known architectural style in desktop applications called "Components and Glue". The central idea is that independent components are composed together by a scripting layer. The glue is often implemented in a different or more dynamic language than the components.

The C2 wiki's page on ComponentGlue has been stable since 2004, so obviously this is not a new idea.

Emacs is one example of this approach. The components are written in C, the glue is ELisp. (To be fair, though, the ELisp outnumbers the C by a pretty large factor.)

Perl was originally conceived as a glue language.

Visual Basic applications also followed this pattern. Components written in C or C++, glue in VB itself.

I think Components and Glue is a relevant architecture style today, especially if we want to compose and recompose our services in novel ways.

My last several posts have been about decomposing services into smaller, more independent units. Each one could be its own micro-SaaS business. Some application needs to stitch these back together. I often see this done in a separate layer that presents a simplified interface to the applications.

This glue layer may be written in a different language than the services themselves. For that matter, the individual services may be written in a variety of languages, but that's a subject for a different time.

The glue layer changes more rapidly than the back end services, because it needs to keep serving the applications as they change. Even when the back end services are provided by an enterprise IT group, the integration layer will be more affiliated with the front end web & app teams.

We embrace plurality, so if there's one glue layer, there may be more. We should allow multiple glue layers, where each one is adapted to the needs of its consumers. That begins to look like this:

The smaller and lighter we make the glue, the faster we can adapt it. The endpoint of that progression looks like AWS Lambda where every piece of script gets its own URL. Hit the URL to invoke the script and it can hit services, reshape the results, and reply in a client-specific format.
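To make that concrete, here is a minimal sketch of one such glue script in Python. The back end services are stubbed as plain functions so the sketch runs on its own, and every name in it (`fetch_profile`, `fetch_orders`, `mobile_summary`) is hypothetical. Real glue would make HTTP calls and sit behind its own URL.

```python
# Hypothetical back end services, stubbed so the sketch needs no network.
def fetch_profile(user_url):
    return {"url": user_url, "name": "Pat Example", "locale": "en-US"}


def fetch_orders(user_url):
    return [
        {"id": "o-1", "total_cents": 1999, "status": "shipped"},
        {"id": "o-2", "total_cents": 500, "status": "pending"},
    ]


def mobile_summary(user_url):
    """Glue for one client: hit the services, reshape the results, and
    reply in the format this particular client wants."""
    profile = fetch_profile(user_url)
    orders = fetch_orders(user_url)
    return {
        "greeting": f"Hi, {profile['name']}",
        "open_orders": [o["id"] for o in orders if o["status"] != "shipped"],
        "lifetime_spend_cents": sum(o["total_cents"] for o in orders),
    }
```

A second client with different needs would get its own small function rather than a new option on this one.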

Once we reach that terminus, we can even think of individual functions as having URLs. Like one-off scripts in ELisp or Perl, we can write glue for incidental needs: one-time marketing events, promotions, trial integrations, and so on.

"Scripts as glue" also lets us deal with a tension that often arises with valuable customers. Sometimes the biggest whales also demand a lot of customization. How should we balance the need to customize our service for large customers (the whales) and the need to generalize to serve the entire market? We can create suites of scripts that present one or more customer-specific interfaces, while the interior of our services remains generalized.

This also allows us to handle one of the hardest cases: when a customer wants us to "plug in" their own service in lieu of one of ours. As I've said before, all our services use full URLs for identifiers, so we should be able to point those URLs at our outbound customer glue. That glue calls the customer service according to its API and returns results according to our formats.
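As a rough sketch of that outbound glue in Python: the customer's API shape, the field names, and the 7% rate below are all invented for illustration. The point is only that the translation in both directions lives in one small function.

```python
# Hypothetical customer-owned tax service with its own request/response shape.
def customer_tax_api(payload):
    return {"TaxAmtCents": payload["amt_cents"] * 7 // 100, "Ccy": payload["ccy"]}


def outbound_tax_glue(request):
    """Accept a request in our format, call the customer's service
    according to its API, and return results according to our format."""
    theirs = customer_tax_api(
        {"amt_cents": request["amount_cents"], "ccy": request["currency"]}
    )
    return {"tax_cents": theirs["TaxAmtCents"], "currency": theirs["Ccy"]}
```

The rest of our system only ever sees the URL of the glue and our own formats, so it can't tell whose service answered.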

The components and glue pattern remains viable. As we decompose monoliths, it is a great way to achieve separation between services without undue burden on the front end applications and their developers.

Faceted Identities

2015-06-12T06:48:23-05:00

I have a rich and multidimensional relationship with Amazon. It started back in 1996 or 1997, when it became the main supplier for my book addiction. As the years went by, I became an "Amazon Affiliate" in a futile attempt to balance out my cash flow with the company. Later, I started using AWS for cloud computing. I also claimed my author page.

Let's contemplate the data architecture needed to maintain such a set of relationships. Let's assume for the moment that Amazon were using a SQL RDBMS to hold it all. The obvious approach is something I could call the "Big Fat User Table". One table, keyed by my secret, internal user ID, with columns for all the different possible things a user can be to Amazon. There would be a dozen columns for my affiliate status, a couple for my author page, a boolean to show I've signed up for AWS, and a bunch of booleans for each of the individual services.

Such a table would be an obvious bottleneck. Any DBA worth her salt would split that sucker into many tables, joined by a common key (the user ID.) New services would then just add a table in their own database with the common user ID. Let's call this approach the "Universal Identifier" design.

That would also allow one-to-many relations for some aspects. For example, when I lived in Minnesota, the state demanded that Amazon keep track of tax for each affiliate. Amazon responded by shutting down all the affiliate accounts in Minnesota. I recently moved to Florida and was able to open a new account with my new address. So I have two affiliate accounts attached to my user account.

For what it's worth, column family databases would kind of blur the lines between the Big Fat User Table and the Universal Identifier design.

We can get more flexible than the Universal Identifier, though.

You see, if we push the User ID into all the various services, that implies that the "things" that service manages can only be consumed by a User. Maneuverable architecture says we should be able to recompose services in novel configurations to solve business problems.

Instead of pushing the User ID into each service, we should just let each service create IDs for its "things" and return them to us.

For example, a Calendar Service should be willing to create a new calendar for anyone who asks. It doesn't need to know the ID of the owner. Later, the owner can present the calendar ID as part of a request (usually somewhere in the URL) to add events, check dates, or delete the calendar. Likewise, a Ledger service should be willing to create a new ledger for any consumer, to be used for any purpose. It could be a user, a business, or a one-time special partnership. The calls could be coming from a long-lived application, a bit of script hooked to a URL, or curl in a bash script. Doesn't matter.
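Here is a minimal in-memory sketch of such a Calendar Service in Python. The class, its method names, and the URL layout are assumptions made for illustration. What matters is that no method takes an owner or user ID.

```python
import uuid


class CalendarService:
    """Hypothetical calendar service. It issues its own identifiers and
    never asks who the owner is."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self._calendars = {}  # calendar URL -> list of events

    def create_calendar(self):
        # Anyone who asks gets a fresh calendar and its identifier.
        calendar_url = f"{self.base_url}/calendars/{uuid.uuid4()}"
        self._calendars[calendar_url] = []
        return calendar_url

    def add_event(self, calendar_url, date, title):
        self._calendars[calendar_url].append({"date": date, "title": title})

    def events_on(self, calendar_url, date):
        return [e["title"] for e in self._calendars[calendar_url] if e["date"] == date]

    def delete_calendar(self, calendar_url):
        del self._calendars[calendar_url]
```

Whoever holds the returned URL can use the calendar. Remembering which URL belongs to which user is somebody else's job.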

If we've got all these services issuing identifiers, we need some way to stitch them back together. That's where the faceted identities come in. If we start from a user and follow all the related "stuff" connected to that user, it looks a lot like a graph.

When a user logs in to the customer-facing application, that app is responsible for traversing the graph of identities, making requests to services, and assembling the response.

I hope you aren't surprised when I say that different applications may hold different graphs, with different principals as their roots. That goes along with the idea that there's no privileged vantage point. Every application gets to act like the center of its own universe.
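A small Python sketch of that graph, held by one application. The `IdentityGraph` class, the facet kinds, and the example URLs are all hypothetical, and the services are stubbed as functions keyed by facet kind.

```python
class IdentityGraph:
    """Hypothetical store of externalized relationships: edges from a
    principal to identifiers issued by independent services."""

    def __init__(self):
        self._edges = {}  # principal URL -> list of (facet kind, resource URL)

    def link(self, principal, kind, resource_url):
        self._edges.setdefault(principal, []).append((kind, resource_url))

    def facets(self, principal):
        return list(self._edges.get(principal, []))


def assemble_view(graph, principal, services):
    """Traverse the principal's facets, ask the owning service about each
    one, and collect the answers by facet kind."""
    view = {}
    for kind, url in graph.facets(principal):
        view.setdefault(kind, []).append(services[kind](url))
    return view


# One user, two affiliate accounts, one author page.
graph = IdentityGraph()
user = "https://accounts.example.com/users/42"
graph.link(user, "affiliate", "https://affiliates.example.com/accounts/mn-1")
graph.link(user, "affiliate", "https://affiliates.example.com/accounts/fl-2")
graph.link(user, "author", "https://authors.example.com/pages/7")

# Stub services: each only understands identifiers it issued itself.
services = {
    "affiliate": lambda url: {"account": url},
    "author": lambda url: {"page": url},
}
view = assemble_view(graph, user, services)
```

A different application could keep its own `IdentityGraph`, rooted at a business or a partner instead of a user, and point at the very same resource URLs.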

Going Meta

If you've been schooled in database design, this probably looks a little weird. I'm removing the join keys from the relational databases. (Some day soon I need to write a post addressing a common misconception: that "relational" databases got their name because they let you relate tables together.)

The key issue I'm aiming at is really about logical dependencies in the data. Foreign key relationships are a policy statement, not a law of nature. Policies change on short notice, so they should be among the most malleable constructs we have. By putting that policy in the bottommost layer of every application, we make it as hard as possible to change!

We can think of a hierarchy of "looseness" in relationships:

Two ideas, stored in one entity: As coupled as it gets. Neither idea can be used without the other. (An "entity" here can be a table or linked data resources with URLs. It's not about the storage, but about the required relationship.)
Two ideas, two entities, one-to-one: Still, both ideas must be used together.
Two ideas, two entities, one-to-one optional: Now we can at least decide whether the second item is needed with the first.
Two ideas, two entities, one-to-many: This admits that the second idea may come in different quantities than the first.
Two ideas, two entities, many-to-many: Much more flexible! Both ideas can be combined in differing quantities as needed. However, this still requires that these ideas are only used together with each other. In other words, if ideas X and Y have a many-to-many relationship, I don't get to reuse idea X together with idea A.
Two ideas, externalized relationship: This is the heart of faceted identities. Ideas X and Y can be completely independent. Each can be used together by other applications.

Interface Segregation Principle

The "I" in SOLID stands for Interface Segregation Principle. It says that a client should only depend on an interface with the minimum set of methods it needs. An object may support a wide set of behavior, but if my object only needs three of those behaviors, then I should depend on an interface with precisely those three behaviors. (One hopes those three make sense together!)

This has an application when we use faceted identities as well. Sometimes we have a very nice separation where the facets don't need to interact with each other; only the application interacts with all of them. More often though, we do need to pass an identifier from one kind of thing into another. That's when the contract becomes important. If service Y requires a foreign identifier "X" to perform an action, then it needs to be clear about what it will do with "X". It's up to the calling application to ensure that the "X" it passes can perform those actions.
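In code, that contract can be written down as a narrow interface. This Python sketch uses `typing.Protocol`. The `Chargeable` protocol, the `Ledger` class, and `bill_monthly` are made-up names, and in a real system the "X" would be a URL to a remote resource rather than a local object.

```python
from typing import Protocol


class Chargeable(Protocol):
    """The only behavior the billing code needs from whatever it is handed."""

    def charge(self, cents: int) -> str: ...


class Ledger:
    """A wide object: it supports more behavior than billing depends on."""

    def __init__(self) -> None:
        self.entries: list = []

    def charge(self, cents: int) -> str:
        self.entries.append(cents)
        return f"entry-{len(self.entries)}"

    def balance(self) -> int:
        return sum(self.entries)

    def close(self) -> None:
        self.entries.clear()


def bill_monthly(account: Chargeable, cents: int) -> str:
    # Depends on Chargeable alone, so anything that can be charged will do.
    return account.charge(cents)
```

`bill_monthly` states exactly what it will do with the thing it receives. The caller decides whether a ledger, a gift card, or something not invented yet gets passed in.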

Summary

Maneuverability is all about composing, recomposing, and combining services in novel configurations. One of the biggest impediments to that is relationships among entities. We want to make those as loose as possible by externalizing the relationships to another service. This allows entities to be used in new ways without coordinated change across services. Furthermore, it allows different applications to use different relationship graphs for their own purposes.

Inverted Ownership, Part 2

2015-05-26T06:22:39-05:00

My last post on the subject of inverted ownership felt a bit abstract, so I thought I might illustrate it with a typical scenario.

In this first figure, we see a newly-extracted Catalog service, freshly factored out of the old monolithic application. It's part of the company's effort to become more maneuverable. We don't know, or particularly care, what storage model it uses internally. From the outside, it presents an interface that looks like "SKUs have attributes".

All seems well. It looks and smells like a microservice: independently deployable, released on its own schedule by a small autonomous team.

The problem is what you don't see in the picture: context. This service has one "universe" of SKUs. It doesn't serve catalogs. It serves one catalog. The problem becomes evident when we start asking what consumers of this service would want. If we think of the online storefront as the only consumer then it looks fine. Ask around a bit, though, and you'll find other interested parties.

While IT toils to get down to a single source of record for product information, the wheelers and dealers in the business are out there signing up partners, inventing marketing campaigns, and looking into new lines of business. Pretty much all of those are going to screw around with the very idea of "the catalog".

Maneuverability demands that we can combine and recombine our services in novel ways. What can we do with this catalog service that would let it be reused in ways that the dev team didn't foresee?

Instancing might be one approach… multiple deployments from the same code base. High operational overhead, but it's better than being stuck.

I prefer to make the context explicit instead.

Zero, One, Many

There's an old saying that the only sensible numbers are zero, one, and infinity. One catalog isn't enough, so the right number to support is "infinity." (Or some resource-constrained approximation.)

What does it take? All we have to do is make the catalog service create catalogs for anyone who asks. Any consumer that needs a catalog can create one. That might be a big, sophisticated online storefront. But it could be someone using cURL to manually construct a small catalog for a one-off marketing effort. The catalog service shouldn't care who wants the catalog or what purpose they are going to put it to.

Of course, this means that subsequent requests need to identify which catalog the item comes from. Good thing we're already using URLs as our identifiers.

Considerations

There are some practical issues (and maybe objections) to address.

First, does this mean that the SKUs are duplicated across all those catalogs? Not necessarily. We're talking about the interface the service presents to consumers. It can do all kinds of deduplication internally. See my post about the immutable shopping cart for some ideas about deduplication and "natural" identifiers.
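One way that could work, sketched in Python: the interface talks about catalogs, while storage keeps a single copy of each distinct item, keyed by a hash of its content. This is my own illustration, not necessarily the scheme from the shopping cart post, and all names here are hypothetical.

```python
import hashlib
import json
import uuid


class CatalogService:
    """Hypothetical catalogs service: many catalogs on the outside,
    deduplicated item storage on the inside."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self._items = {}     # content hash -> item attributes, stored once
        self._catalogs = {}  # catalog URL -> {sku: content hash}

    def create_catalog(self):
        url = f"{self.base_url}/catalogs/{uuid.uuid4()}"
        self._catalogs[url] = {}
        return url

    def put_item(self, catalog_url, sku, attributes):
        # Hash a canonical serialization so equal content gets an equal key.
        canonical = json.dumps(attributes, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(canonical).hexdigest()
        self._items.setdefault(digest, dict(attributes))
        self._catalogs[catalog_url][sku] = digest

    def get_item(self, catalog_url, sku):
        return self._items[self._catalogs[catalog_url][sku]]

    def stored_item_count(self):
        return len(self._items)
```

Two catalogs that carry the same item each see it as their own, yet `stored_item_count()` only goes up by one.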

Second, and trickier, how do the SKUs get associated to the catalog? Does each microsite and service need to populate its own catalog? Can it just cherry-pick items from a "master" catalog?

You can probably guess that I don't much like the idea of a "master" catalog. Instead, we would populate a newly-minted catalog by feeding it either item representations (serialized data in a well-known format) or better yet, hyperlinks that resolve to item representations.

How about this: make the service support HTML, RDFa, and a standardized microformat as a representation. Then you just feed your catalog service with URLs that point to HTML. Those can come from a catalog of your own, an internal app for cleansing data feeds, or even a partner or vendor's web site. Now you've unified channel feeds, data import, and catalog creation.

Third, is it really true that just anyone can create a catalog? Doesn't this open us up to denial-of-service attacks wherein someone could create billions of catalogs and goop up our database? My response is that we don't ignore questions of authorization and permission, but we do separate those concerns. We can use proxies at trust boundaries to enforce permission and usage limits.
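A bare-bones Python sketch of that separation. The `QuotaProxy` class and its per-caller limit are assumptions for illustration. A production proxy would also need authentication, persistence, and so on.

```python
class QuotaExceeded(Exception):
    pass


class QuotaProxy:
    """Hypothetical proxy at a trust boundary. It caps how many catalogs
    each caller may create; the wrapped service knows nothing about callers."""

    def __init__(self, create_catalog, max_per_caller):
        self._create_catalog = create_catalog
        self._max = max_per_caller
        self._counts = {}  # caller ID -> catalogs created so far

    def create_catalog(self, caller_id):
        used = self._counts.get(caller_id, 0)
        if used >= self._max:
            raise QuotaExceeded(f"{caller_id} has reached {self._max} catalogs")
        self._counts[caller_id] = used + 1
        # caller_id stops here. The service behind the proxy never sees it.
        return self._create_catalog()
```

The catalog service keeps its "create one for anyone who asks" interface, and the policy about who may ask, and how often, can change without touching it.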

Conclusion

When you make the context explicit, you allow a service to support an arbitrary number of consumers. That includes consumers that don't exist today and even ones you can't predict. Each service then becomes a part that you can recombine in novel ways to meet future needs.

Inverted Ownership

2015-05-08T09:44:43-04:00

One of the sources of semantic coupling has to do with identifiers, and especially with synthetic identifiers. Most identifiers are just alphanumeric strings. Systems share those identifiers to ensure that both sides of an interface agree on the entity or entities they are manipulating.

In the move to services, there is an unfortunate tendency to build in a dependency on an ambient space of identifiers. This limits your organization's maneuverability.

Contextualized IDs

The trouble is that a naked identifier doesn't tell you what space of identifiers it comes from. There is just this ambient knowledge that a field called Policy ID is issued from the System of Record for policies. That means there can only be one "space" of policy numbers, and they must all be issued by the same SoR.

I don't believe in the idea of a single system of record. One of my rules for architecture without an end state is "Embrace Plurality". Whether through business changes or system migrations, you will always end up with multiple systems of record for any concept or entity.

In that world, it's important that IDs carry along their context. It isn't enough to have an alphanumeric Policy ID field. You need a URN or URI to identify which policy system issued that policy number.
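For instance, here is a small Python sketch where the identifier is a full URI and the issuing system is recoverable from it. The `PolicyRef` type and the host names are invented for the example.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class PolicyRef:
    """Hypothetical contextualized policy identifier: a full URI, not a bare number."""

    uri: str

    @property
    def issuer(self):
        # The host names the system that issued this policy number.
        return urlparse(self.uri).netloc

    @property
    def local_id(self):
        return urlparse(self.uri).path.rsplit("/", 1)[-1]


legacy = PolicyRef("https://policies-legacy.example.com/policies/12345")
acquired = PolicyRef("https://policies-acquired.example.com/policies/12345")
```

Both references carry the policy number 12345, but they compare as different policies because the issuing system is part of the identifier.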

Liberal Issuance

Imagine a Calendar service that tracks events by date and time. It would seem weird for that service to keep all events for every user in the same calendar, right? We should really think of it as a Calendars service. I'd expect to see API to create a calendar, which returns the URL to "my" calendar. Every other API call then includes that URL, either as a prefix or a parameter.

In the same way, your services should serve all callers and allow them to create their own containers. If you're building a Catalog service, think of it as a Catalogs service. Anybody can create a catalog, for any purpose. Likewise, a Ledger service should really be Ledgers. Any client can create a ledger for any reason.

This is the way to create services that can be recombined in novel ways to create maneuverability.

The Perils of Semantic Coupling

2015-04-29T06:03:21-05:00

On the subject of maneuverability, many organizations run into trouble when they try to enter new lines of business, create a partnership, or merge with another company. Updating enterprise systems becomes a large cost factor in these business initiatives, sometimes large enough to outweigh the benefits case. This is a terrible irony: our automation provides efficiency, but removes flexibility.

If you break down the cost of such changes, you'll find it comes in equal parts from changes to individual systems and changes to integrations across systems. Integrations are always costly and full of risk, and never more so than when we are changing cardinalities. Partnerships and mergers pretty much always change cardinalities, too.

The cost factor arises from "semantic coupling." That is the coupling between services introduced because the services need to share concepts. It usually appears as data types or entity names that pop up in many services.

As an example, let's think about a tiny retailing system with a small set of what I'll call "macroservices". One of the most important entity types here is the Stock Keeping Unit, or SKU. It represents "a thing which can be sold". In a typical retail system, it has a large number of attributes that describe how the item is priced, delivered, displayed on the web, upsold and cross-sold, reviewed, categorized, and taxed.

SKUs are created in a master data management system. There may be a variety of feeds that get massaged into MDM, but we'll consider that to be outside the boundary of our interest for now. From MDM, the SKU must be distributed to a number of other services:

Each of these macroservices uses aspects of the SKU for its own purpose. Content management attaches "telling and selling" content to the SKU so it can be presented nicely on the web. Pricing adds it to the pricing rules. Shipping identifies the carriers, options, and costs to deliver it. Order management–probably a great big silver beast of a system–tracks inventory, orders, delivery rules, returns, and a lot more.

Now what happens if we have to make a major change to the SKU? Let's imagine that we want to change how we manage prices. In the past, merchants set prices on each item individually. Now, we've got too much in the catalog for that to scale so we introduce the idea of price points for digital items. A price point is a price that applies to a large number of SKUs. When we change the price point, all SKUs that refer to it should be changed at the same time. So, if we decide to reduce the price of a low-bitrate MP3 track from $0.99 to $0.89, we can just change a single price point record.

How many systems do we have to change for this new concept?

If we consider "price point" to be part of our core domain, then we have to add that concept everywhere. The surface area of that change is really large, and it will be a costly change to make. It might even be too costly to be worth doing. We could hire a small army of temp workers to update price records by hand twice a year and still come out ahead. That's not a very satisfying answer though. All this automation is supposed to make us more efficient! What good is it if we are stuck with outdated processes because our systems are too hard to change?

The key problem is semantic coupling. There are a lot of systems here that shouldn't need to care about the "price point" concept. It has no bearing on the digital locker, shipping, or ratings & reviews.

In this example, we can reduce the semantic coupling. Simply decide that "price point" is not a core concept. It is a detail of data management for the MDM system. Everything downstream receives SKUs with a list price. No downstream system should care how that list price was determined.

This decision flattens a many-to-one relationship from SKU to price point. In so doing, we get a huge benefit. We eliminate an entire entity and all references to it from all the downstream systems.
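Sketched in Python, with hypothetical types and field names: the price point exists only inside MDM, and the export step flattens it away before anything goes downstream.

```python
from dataclasses import dataclass


@dataclass
class PricePoint:
    name: str
    cents: int


@dataclass
class MdmSku:
    sku: str
    title: str
    price_point: PricePoint  # many SKUs refer to one price point


def export_for_downstream(item):
    """Flatten the many-to-one relationship: downstream gets a list price only."""
    return {
        "sku": item.sku,
        "title": item.title,
        "list_price_cents": item.price_point.cents,
    }


low_bitrate = PricePoint("low-bitrate-mp3", 99)
tracks = [
    MdmSku("T-1", "Track One", low_bitrate),
    MdmSku("T-2", "Track Two", low_bitrate),
]

low_bitrate.cents = 89  # one change inside MDM
feed = [export_for_downstream(t) for t in tracks]
```

Every record in `feed` now carries the new list price, and nothing in it mentions a price point.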

I would even make a case for shattering the concept of SKU into multiple separate concepts. MDM may keep that concept. Downstream, though, each system has its own set of internal concepts. We should treat identifiers from other systems as opaque tokens that we map onto our own system's space.

For example, the pricing service doesn't need to know that it is pricing SKUs. It just needs to price "things that can be priced." I know, it sounds tautological, but I think we get misled as humans… we think of SKU as a unitary concept so we build it as such in our systems. But look what happens if we say a pricing service can price "stuff and things" as long as they have some mapping in the pricing service itself. We can add an entirely new universe of things to price, without forcing everything on Earth to be a SKU!
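A minimal Python sketch of a pricing service built that way. The class and method names are assumptions. Foreign identifiers are used only as dictionary keys, never parsed or interpreted.

```python
class PricingService:
    """Hypothetical pricing service. It prices anything it has a mapping
    for and treats identifiers from other systems as opaque tokens."""

    def __init__(self):
        self._next_id = 1
        self._by_foreign = {}  # opaque foreign token -> internal priceable ID
        self._prices = {}      # internal priceable ID -> price in cents

    def register(self, foreign_token, cents):
        # Map the foreign token onto this service's own identifier space.
        internal_id = self._by_foreign.get(foreign_token)
        if internal_id is None:
            internal_id = self._next_id
            self._next_id += 1
            self._by_foreign[foreign_token] = internal_id
        self._prices[internal_id] = cents
        return internal_id

    def price_of(self, foreign_token):
        return self._prices[self._by_foreign[foreign_token]]


pricing = PricingService()
pricing.register("https://catalogs.example.com/catalogs/main/skus/T-1", 89)
pricing.register("https://services.example.com/gift-wrap/standard", 350)
```

The second registration is not a SKU at all, and the pricing service neither knows nor cares.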

We should scrutinize each of the other services, asking ourselves, "Does this really care about a SKU? Or does it care about something that a SKU happens to possess?" I would argue that in each case, the service really cares about "Thing that can be Xed". Priced, taxed, shipped, reviewed, etc. Are SKUs the only things that can be taxed? Are they the only things that can be reviewed? Etc.

Iterate this process and four things will happen:

Your services will shrink.
Your services will become much more general.
Each service will own its own space of identifiers.
Your organization will become more maneuverable.

The key point I want to make here is that a concept may appear to be atomic just because we have a single word to cover it. Look hard enough and you will find seams where you can fracture that concept. Don't share the whole thing. Don't couple all your downstream systems to the whole concept, and definitely don't couple your downstream to a complex of related concepts! It's a cardinal sin.

Maneuverability

2015-04-23T12:13:21+02:00

Agile development works best at team scale. When a team can self-organize, refine their methods, build the toolchain, and adapt it to their needs, they will execute effectively. We should be happy to achieve that! I worry when we try to force-fit the same techniques at larger scales.

At the scale of a whole organization, we need to look at the qualities we want to have. (We can't necessarily produce those qualities directly, but we can create the conditions that allow them to emerge.) When we look at attempts to scale agile development up, the quality the org wants is maneuverability.

Maneuverability is the ability to change your vector rapidly. It's about gaining, shedding, or redirecting momentum. Keeping with the analogy of momentum, we can call that which resists change in the momentum vector "inertial mass." Personnel are mass, because it's relatively hard to add or shed personnel. Technical debt is a component of mass, too. It makes changes to your technical strategy harder. Actually, I'd even go so far as to say that code itself is mass. KLOCs kill.

Maneuverability has been explored most fully by the military. Superior maneuverability allows a fighter aircraft to get inside the enemy's turn radius, then shoot for the kill. An army with high maneuverability can engage, disengage, and reorient to exploit an enemy's weakness. In the words of John Boyd, it allows you to separate your opponent into multiple, non-cooperating centers of gravity.

Maneuverability is an emergent property. It requires a number of prerequisites in the organization's structure, leadership style, operations, and ability to execute. I firmly believe that maneuverability requires a great ability to execute at the micro scale.

Agile development provides that ability to execute in software development. It is a necessary, but not sufficient, part of maneuverability. There are other necessary capabilities in the technical arena. I think that infrastructure and architecture have important roles to play for maneuverability as well.

I have previously given talks on the subject of maneuverability. I'll also be posting some further thoughts about pertinent architecture decisions.

Bad Layering

2015-04-14T06:28:22-05:00

If I had to guess, I would say that "Layers" is probably the most commonly applied architecture pattern. And why not? Parfaits have layers, and who doesn't like a parfait? So layers must be good.

Like everything else, though, there's a good way and a bad way.

The usual Neapolitan stack looks like this:

On one of my favorite projects of all, we used more layers because we wanted to further isolate different behaviors. In that project, we added a "UI Model" distinct from the "Domain."

We impose this style because we want to separate concerns. This should provide us with two big benefits. First, we can change the contents of each layer independently. So changes to the GUI should not affect the domain, and changes to the domain should not affect persistence. The second benefit we want is the ability to substitute a layer. We may swap out a layer for the sake of testing (often in the case of persistence layers) or for different product configurations.

People sometimes make an argument for swapping out a layer in case of technology change. That argument is used for ORMs in the persistence layer, but I don't find it convincing. Changing persistence on an existing application is far from the most common kind of change. You'd be buying an expensive option that is seldom exercised.

When Good Layers Go Bad

The trouble arises when the layers are built such that we have to drill through several of them to do something common. Have you ever checked in a commit that had a bunch of new files like "Foo", "FooController", "FooForm", "FooFragment", "FooMapper", "FooDTO", and so on? That, dear reader, is a breakdown in layering.

It comes from each layer being decomposed along the same dimension. In this case, aligned by domain concept. That means the domain layer is dominating the other layers.

I would much rather see each layer have objects and functions that express the fundamental concepts of that layer. "Foo" is not a persistence concept, but "Table" and "Row" are. "Form" is a GUI concept, as is "Table" (but a different kind of table than the persistence one!) The boundary between each layer should be a matter of translating concepts.

In the UI, a domain object should be atomized into its constituent attributes and constraints. In persistence, it should be atomized into rows in one or more tables (in SQL-land) or one or more linked documents.

What appears as a class in one layer should be mere data to every other layer.

How Does It Happen?

This breakdown in layering can arise from more than one dynamic process.

The application framework may impose this structure.
The language may not have abstractions powerful enough to make it pleasant to work with data.
TDD without enough refactoring. Each thin slice through the application adds one more strand of "Foo and Friends". Truly merciless refactoring would pull out the common behavior sideways into the layer-specific concepts I described above. Lacking merciless refactoring, the project will accrete sticky strands like cotton candy on a toddler.
The team may not have seen it done any other way.

What If It Happens To You?

Maybe you already have degenerate layers. Assuming they aren't required by your framework, start looking for opportunities to refactor. Don't just build a class hierarchy so you can inherit implementations. Rather, look for common patterns of interaction. Figure out how to turn the code you've got in classes into data acted on by classes relevant to the layer.

Use maps. Convert objects into maps from field identifier to an object that represents the salient aspect of the field for that layer:

For a GUI, those aspects will be something like "lexical type", "editable", "constraint" / "validation", "semantic class", and so on.
For persistence, they will deal with "length", "representation format", "referent," etc.
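
Here is a minimal sketch of what I mean, in Python. Everything in it is made up for illustration: the `customer` record, the aspect names, and the two helper functions. The point is only that the GUI code and the persistence code are each written once, against field maps, rather than once per domain concept.

```python
# Hypothetical example: a "customer" is plain data, not a family of
# CustomerForm / CustomerMapper / CustomerDTO classes.
customer = {"name": "Ada Lovelace", "email": "ada@example.com"}

# GUI layer: field identifier -> the aspects the GUI cares about.
gui_fields = {
    "name":  {"lexical_type": "text",  "editable": True},
    "email": {"lexical_type": "email", "editable": False},
}

# Persistence layer: same field identifiers, different aspects.
storage_fields = {
    "name":  {"column": "full_name",  "length": 80},
    "email": {"column": "email_addr", "length": 254},
}


def render_form(record, fields):
    """Generic GUI code: builds widget descriptions for any record."""
    widgets = []
    for field_id, aspects in fields.items():
        widgets.append({
            "label": field_id,
            "type": aspects["lexical_type"],
            "value": record[field_id],
            "readonly": not aspects["editable"],
        })
    return widgets


def to_row(record, fields):
    """Generic persistence code: translates a record into column -> value."""
    row = {}
    for field_id, aspects in fields.items():
        value = record[field_id]
        if len(value) > aspects["length"]:
            raise ValueError(f"{field_id} exceeds {aspects['length']} characters")
        row[aspects["column"]] = value
    return row


form = render_form(customer, gui_fields)
row = to_row(customer, storage_fields)
```

Adding an "order" or an "invoice" to this sketch means adding data (another pair of field maps), not another strand of "Foo and Friends" classes.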

Seek and destroy DTOs. They should be maps.

A DTO clearly indicates that your class is crossing a boundary. And yet, it requires code on both sides of the boundary to be written against the DTO's method signatures. That means there is precisely zero translation at the boundary.

Where To Go From Here

Let me be clear, I like parfaits. (Yogurt and fruit! Ice cream, nuts, caramel!) I have nothing against layers. Most of my applications are built from layers. It's just that getting the benefits we seek requires more effort than smearing a single domain concept across multiple subdirectories.

If "Layers" is the only architecture pattern you've used, then you're in for a treat. There are plenty of other fundamental structures to explore. Pipes and filters. Blackboard. Components. Set GoF aside and go read Pattern-Oriented Software Architecture. The whole series is a treasure trove and an encyclopedia.

People Don't Belong to Organizations

2015-04-11T15:45:57-05:00

One company that gets this right is GitHub. I exist as my own person there. I'm affiliated with my employer as well as other organizations.

We are long past the days of "the company man," when a person's identity was solely bound to their employer. That relationship is much more fluid now.

A company that gets it wrong is Atlassian. I've left behind a trail of accounts in various Jirae and Confluences. Right now, the biggest offender in their product lineup is HipChat. My account is identified by my email address, but it's bound up with an organization. If I want to be part of my employer's HipChat as well as a client's, I have to resort to multiple accounts signed up with plus addresses. It's great that Gmail supports that, but I still can't log in to more than one account at a time.

More generally, this is a failure in modeling. Somewhere along the line, somebody drew a line between `Organization` and `Person` on their model, with a one-to-many relationship. One `Organization` can have many `Person` entities, but each `Person` belongs to exactly one `Organization`.

I'll go even further. The proper way to approach this today is to relate `Organization` and `Person` by way of another entity. Reify the association! Is it employment? Put the start and end dates on the employment. Oh, and don't delete the association once it ends… that's erasing it from history.

I think the default for pretty much any relationship these days should be many-to-many. Particularly any data relationship that models a real relationship in the external world. We shouldn't let the bad old days of SQL join tables deter us from doing the right thing now.
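
As a rough sketch of the reified association, here is one way it could look in Python. The `Affiliation` class, its `kind` field, and the sample names and dates are all my own illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Person:
    name: str


@dataclass(frozen=True)
class Organization:
    name: str


@dataclass
class Affiliation:
    """The association itself, promoted to an entity with its own history."""
    person: Person
    organization: Organization
    kind: str                   # e.g. "employment" or "client" (assumed labels)
    start: date
    end: Optional[date] = None  # None means the affiliation is still active

    def active_on(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day <= self.end)


def organizations_of(person, affiliations, day):
    """Names of every organization the person is affiliated with on a day."""
    return sorted(
        a.organization.name
        for a in affiliations
        if a.person == person and a.active_on(day)
    )


me = Person("Pat")
affiliations = [
    # Ended affiliations stay in the list; closing one sets `end`, nothing is deleted.
    Affiliation(me, Organization("OldCo"), "employment", date(2008, 1, 1), date(2012, 6, 30)),
    Affiliation(me, Organization("Employer"), "employment", date(2012, 7, 1)),
    Affiliation(me, Organization("ClientCo"), "client", date(2014, 3, 1)),
]

current = organizations_of(me, affiliations, date(2015, 4, 11))
past = organizations_of(me, affiliations, date(2010, 1, 1))
```

One person, several organizations at once, and the old employment is still answerable as a historical question.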

Glue Fleet and Compojure Together Using Protocols

2011-01-15T14:09:08-06:00

Inspired by Glenn Vanderburg's article on Clojure templating frameworks, I decided to try using Fleet for my latest pet project. Fleet has a very nice interface. I can call a single function to create new Clojure functions for every template in a directory. That really makes the templates feel like part of the language. Unfortunately, Glenn's otherwise excellent article didn't talk about how to connect Fleet into Compojure or Ring. I chose to interpret that as a compliment, springing from his high esteem of our abilities.

My first attempt, just calling the template function directly as a route handler, resulted in the following:

java.lang.IllegalArgumentException: No implementation of method: :render of protocol: #'compojure.response/Renderable found for class: fleet.util.CljString

Ah, you've just got to love Clojure errors. After you understand the problem, you can always see that the error precisely described what was wrong. As an aid to helping you understand the problem... well, best not to dwell on that.

The clue is the protocol. Compojure knows how to turn many different things into valid response maps. It can handle nil, strings, maps, functions, references, files, seqs, and input streams. Not bad for 22 lines of code!

There's probably a simpler way that I can't see right now, but I decided to have CljString support the same protocol.

Take a close look at the call to extend-protocol on lines 12 through 15. I'm adding a protocol--which I didn't create--onto a Java class--which I also didn't create. My extension calls a function that was created at runtime, based on the template files in a directory. There's deep magic happening beneath those 3 lines of code.

Because I extended Renderable to cover CljString, I can use any template function directly as a route function, as in line 17. (The function views/index was created by the call to fleet-ns on line 10.)

So, I glued together two libraries without changing the code to either one, and without resorting to Factories, Strategies, or XML-configured injection.
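
For readers who don't speak Clojure, here is a loose Python analogue of the same move, using `functools.singledispatch`. This is not the Clojure code discussed above, and it is not Compojure's or Fleet's API. The `render` function and the `TemplateString` class are stand-ins I invented to play the roles of "a protocol I didn't create" and "a class I didn't create."

```python
from functools import singledispatch


# Stand-in for a web library's protocol: turn a handler's return value
# into a response map. Pretend this lives in someone else's package.
@singledispatch
def render(value):
    raise TypeError(f"No implementation of render for {type(value).__name__}")


@render.register
def _(value: str):
    return {"status": 200, "headers": {}, "body": value}


@render.register
def _(value: dict):
    return {"status": 200, "headers": {}, **value}


# Stand-in for a template library's own string type. Pretend this also
# lives in someone else's package.
class TemplateString:
    def __init__(self, text):
        self.text = text


# The glue: extend the protocol to cover the foreign type, without
# touching the source of either "library".
@render.register
def _(value: TemplateString):
    return render(value.text)


response = render(TemplateString("<h1>Hello</h1>"))
```

Before the last registration, calling `render` on a `TemplateString` fails with a "no implementation" error much like the one I hit. After it, template output can flow straight through as a response.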

Metaphoric Problems in REST Systems

2011-01-14T10:12:13-06:00

I used to think that metaphor was just a literary technique, that it was something you could use to dress up some piece of creative writing. Reading George Lakoff’s Metaphors We Live By, though, has changed my mind about that.

I now see that metaphor is not just something we use in writing; it’s actually a powerful technique for structuring thought. We use metaphor when we are creating designs. We say that a class is like a factory, that an object is a kind of a thing. The thing may be an animal, it may be a part of a whole, or it may be representative of some real world thing.

All those are uses of metaphor, but there is a deeper structure of metaphors that we use every day, without even realizing it. We don’t think of them as metaphors because in a sense these are actually the ways that we think. Lakoff uses the example of “The tree is in front of the mountain.” Perfectly ordinary sentence. We wouldn’t think twice about saying it.

But the mountain doesn’t actually have a front, and neither does the tree. Or if the mountain has a front, how do we know it’s facing us? What we actually mean, if we unpack that metaphor, is something like, “The distance from me to the tree is less than the distance from me to the mountain.” Or, “The tree is closer to me than the mountain is.” That we assign that to being in front is actually a metaphoric construct.

When we say, “I am filled with joy,” we are actually using a double metaphor: two different metaphors related structurally. One is “A Person Is A Container”; the other is “An Emotion Is A Physical Quantity.” Together, it makes sense to say that if a person is a container and an emotion is a physical thing, then the person can be full of that emotion. In reality, of course, the person is no such thing. The person is full of all the usual things a person is full of: tissues, blood, bones, other fluids that are best kept on the inside.

But we are embodied beings, we have an inside and an outside and so we think of ourselves as a container with something on the inside.

This notion of containers is actually really important.

Because we are embodied beings, we tend to view other things as containers as well. It would make perfect sense to you if I said, “I am in the room.” The room is a container, the building is a container. The building contains the room. The room contains me. No problem.

It would also make perfect sense to you, if I said, “That program is in my computer.” Or we might even say, “that video is on the Internet.” As though the Internet itself were a container rather than a vast collection of wires and specialized computers.

None of these things are containers, but it’s useful for us to think of them as such. Metaphorically, we can treat them as containers. This isn’t just an abstraction about the choice of prepositions. Rather, I think the use of “in” and “on” reflects the way that we think about these things.

We also tend to think about our applications as containers. The contents that they hold are the features they provide. This has provided a powerful way of thinking about and structuring our programs for a long time. In reality, no such thing is happening. The program source text doesn’t contain features. It contains instructions to the computer. The features are actually sort of emergent properties of the source text.

Increasingly the features aren’t even fully specified within the source text. We went through a period for a while where we could pretend that everything was inside of an application. Take web systems for example. We would pretend that the source text specified the program completely. We even talked about application containers. There was always a little bit of fuzziness around the edges. Sure, most of the behavior was inside the container. But there were always those extra bits. There was the web server, which would have some variety of rules in it about access control, rewrite rules, ways to present friendly URLs. There were load balancers and firewalls. These active components meant that it was really necessary to understand more than the program text, in order to fully understand what the program was doing.

The more the network devices edged into Layer 7, previously the domain of the application, the more false the metaphor of program as container became. Look at something like a web application firewall. Or the miniature programs you can write inside of an F5 load balancer. These are functional behavior. They are part of the program. However, you will never find them in the source text. And most of the time, you don’t find them inside the source control systems either.

Consequently, systems today are enormously complex. It’s very hard to tell what a system is going to do once you put it into production, especially in those edge cases within hard-to-reach sections of the state space. We are just bad at thinking about emergent properties. It’s hard to design properties to emerge from simple rules.

I think we’ll find this most true in RESTful architectures. In a fully mature REST architecture, the state of the system doesn’t really exist in either the client or the server, but rather in the communication between the two of them. We say HATEOAS, “Hypertext As The Engine Of Application State” (which is a sort of shibboleth used to identify true RESTafarians from the rest of the world), but the truth is this: what the client is allowed to do is told to it by the server at any point in time, and the next state transition is whatever the client chooses to invoke. Once we have that, the true behavior of the system can’t actually be known just by the service provider.

In a REST architecture we follow an open world assumption. When we’re designing the service provider, we don’t actually know who all the consumers are going to be or what their individual and particular workflows may be. Therefore we have to design for a visible system, an open system that communicates what it can do, and what it has done at any point in time. Once we do that, the behavior is no longer just in the server. And in a sense it’s not really in the client either. It’s in the interaction between the two of them, in the collaborations.
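
To make that concrete, here is a small sketch in Python. The order resource, the link relation names, and the URLs are all hypothetical. The server only advertises what is possible right now; which link gets followed, and in what sequence, is up to a client the server knows nothing about.

```python
# Server side: the representation carries the currently allowed transitions.
def represent_order(order):
    links = {"self": f"/orders/{order['id']}"}
    if order["status"] == "open":
        links["payment"] = f"/orders/{order['id']}/payment"
        links["cancel"] = f"/orders/{order['id']}/cancellation"
    elif order["status"] == "paid":
        links["receipt"] = f"/orders/{order['id']}/receipt"
    return {"status": order["status"], "links": links}


# Client side: no hard-coded workflow, just "what does the server say I may do?"
def allowed_actions(representation):
    return sorted(rel for rel in representation["links"] if rel != "self")


open_order = represent_order({"id": 42, "status": "open"})
paid_order = represent_order({"id": 42, "status": "paid"})
```

Nothing in `represent_order` says which client will pay, which will cancel, or which will poll the order for a week and do nothing. That part of the system's behavior lives outside the server's source text.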

That means the features of our system are emergent properties of the communication between these several parts. They’re externalized. They’re no longer in anything. There is no container. One could almost say there’s no application. The features exist somewhere in the white space between those boxes on the architecture diagram.

I think we lack some of the conceptual tools for that as well. We certainly don’t have a good metaphorical structure for thinking about behavior as a hive-like property emerging from the collaboration of these relatively independent and self-directed pieces of software.

I don’t know where the next set of metaphors will come from. I do know that the attempt to force web-shaped systems into the “application is container” metaphor simply won’t work anymore. In truth, it never worked all that well. But now it has broken down completely.

Metaphoric Problems in REST Systems (audio)

2011-01-14T10:02:52-06:00

Metaphoric problems in rest systems by mtnygard

Time motivates architecture

2010-04-21T08:00:00-05:00

Let’s engage in a thought experiment for a moment. Suppose that software was trivial to create and only ever needed to be used once. Completely disposable. So, somebody comes to you and says, “I have a problem and I need you to solve it. I need a tool that will do blah-de-blah for a little while.” You could think of the software the way that a carpenter thinks of a jig for cutting a piece of wood on a table saw, or a metalworker thinks of creating a jig to drill a hole at the right angle and depth.

If software were like this, you would never care about its architecture. You would spend a few minutes to create the thing that was needed, it would be used for the job at hand, and then it would be thrown away. It really wouldn’t matter how good the software was on the inside–how easy it was to change–because you’d never change it! It wouldn’t matter how it adapted to changing business requirements, because you’d just create a new one when the new requirement came up. In this thought experiment we wouldn’t worry about architecture.

The key difference between this thought experiment and actual software? Of course, actual software is not disposable. It has a lifespan over some amount of time. Really, it’s the time dimension that makes architecture important.

Over time, we need for many different people to work effectively in the software. Over time, we need the throughput of features to stay constant, or hopefully not decrease too much. Maybe it even increases in particularly nice cases. Over time, the business needs change so we need to adapt the software.

It’s really time that makes us care about architecture.

Isn’t it interesting then, that we never include time as a dimension in our architecture descriptions?

Circuit Breaker in Scala

2010-04-21T06:00:00-05:00

FaKod (I think that translates as "The Fatalistic Coder"?) has written a nice Scala implementation of the Circuit Breaker pattern, and even better, has made it available on GitHub.

Check out http://github.com/FaKod/Circuit-Breaker-for-Scala for the code.

The Circuit Breaker can be mixed in to any type. See http://wiki.github.com/FaKod/Circuit-Breaker-for-Scala/ for an example of usage.

The Future of Software Development

2010-04-20T06:00:00-05:00

I’ve been asked to sit on a panel regarding the future of software development. This is always risky and makes me nervous, for two reasons. First, prediction is a notoriously low success-rate activity. Second, the people you always see making predictions like this are usually well past their “use by” date. Nevertheless, here is a collection of barely-related thoughts I have on that subject.

Two obvious trends are cloud computing and mobile access. They are complementary. As the number of people and devices on the net increases, our ability to shape traffic on the demand side gets worse. Spikes in demand will happen faster and reach higher levels over time. Mobile devices exacerbate the demand side problems by greatly increasing both the number of people on the net and the fraction of their time they are able to access it.
Large traffic volumes both create and demand large data. Our tools for processing tera- and petabyte datasets will improve dramatically. Map/Reduce computing (a la Hadoop) has created attention and excitement in this space, but it is ultimately just one tool among many. We need better languages to help us think and express large data problems. In particular, we need a language that makes big data processing accessible to people with little background in statistics or algorithms.
Speaking of languages, many of the problems we face today cannot be solved inside a single language or application. The behavior of a web site today cannot be adequately explained or reasoned about just by examining the application code. Instead, a site picks up attributes of behavior from a multitude of sources: application code, web server configuration, edge caching servers, data grid servers, offline or asynchronous processing, machine learning elements, active network devices (such as application firewalls), and data stores. “Programming” as we would describe it today–coding application behavior in a request handler–defines a diminishing portion of the behavior. We lack tools or languages to express and reason about these distributed, extended, fragmented systems. Consequently, it is difficult to predict the functionality, performance, capacity, scalability, and availability of these systems.
Some of this will be mitigated naturally as application-specific functions disappear into tools and frameworks. Companies innovating at the leading edge of scalability today are doing things in application-specific behavior to compensate for deficiencies in tools and platforms. For example, caching servers could arguably disappear into storage engines and no-one would complain. In other words, don’t count the database vendors out yet. You’ll see key-value stores and in-memory data grid features popping up in relational databases any day now.
In general, it appears that Objects will diminish as a programming paradigm. Object-oriented programming will still exist… I’m not claiming “the death of objects” or something silly like that. However, OO will become just one more paradigm among several, rather than the dominant paradigm it has been for the last 15 years. “Object oriented” will no longer be synonymous with “good”.
Some people have talked about “polyglot programming”. I think this is a red herring. Polyglot is a reality, but it should not be a goal. That is, programmers should know many languages and paradigms, but deliberately mixing languages in a single application should be avoided. What I think we will find instead is mixing of paradigms, supported by a single primary language, with adjunct languages used only as needed for specialized functions. For example, an application written in Scala may mix OO, functional, and actor-based concepts, and it may have portions of behavior expressed in SQL and JavaScript. Nevertheless, it will still primarily be a Scala application. The fact that Groovy, Scala, Clojure, and Java all run on the Java Virtual Machine shouldn’t mislead us into thinking that they are interchangeable… or even interoperable!
Regarding Java. I fear that Java will have to be abandoned to the “Enterprise Development” world. It will be relegated to the hands of cut-rate business coders bashing out their gray business applications for $30 / hour. We’ve passed the tipping point on this one. We used to joke that Java would be the next COBOL, but that doesn’t seem as funny now that it’s true. Java will continue to exist. Millions of lines of it will be written each year. It won’t be the driver of innovation, though. As individual programmers, I’d recommend that you learn another language immediately and differentiate yourself from the hordes of low-skill, low-rent outsource coders that will service the mainstream Java consumer.
Where will innovation come from? Although some of the blush seems to be coming off Ruby, the reduction in hype has mainly allowed Ruby and Ruby on Rails developers to knuckle down and produce. That community continues to drive tremendous innovation. Many of the interesting developments here relate to process. Ruby developers have given us fantastic tools like Gems and Capistrano, that let small teams outperform and outproduce groups four times their size.
To my great surprise, data storage has become a hotbed of innovation in the last few years. Some of this is driven by the high-scalability fetishists, which is probably the wrong reason for 98% of companies and teams. However, innovations around column stores, graph databases, and key-value stores offer developers new tools to reduce the impedance mismatch between their data storage and their programming language. We spent twenty years trying to squeeze objects into relational databases. Aside from the object databases, which were an early casualty of Oracle’s ascension, we mostly focused on changing the application code through framework after framework and ORM after ORM. It’s refreshing to see storage models that are easier to use and easier to modify.
This will also cause another flurry of “reactive innovation” from the database vendors, just as we saw with “Universal Databases” in the mid-90s. The big players here–Microsoft and Oracle–won’t let some schemaless little upstarts erode their market share. More significantly, they aren’t about to let their flagship products–and the ones which give them beachheads inside every major corporation–get intermediated by some open-source frameworks banged up by the social network giants. Look for big moves by these vendors into high scalability, agile storage, and eventual consistency storage.

Failover: Messy Realities

2010-04-19T06:00:00-05:00

People who don't live in operations can carry some funny misconceptions in their heads. Some of my personal faves:

Just add some servers!
I want a report of every configuration setting that's different between production and QA!
We're going to make sure this (outage) never happens again!

I've recently been reminded of this during some discussions about disaster recovery. This topic seems to breed misconceptions. Somewhere, I think most people carry around a mental model of failover that looks like this:

That is, failover is essentially automatic and magical.

Sadly, there are many intermediate states that aren't found in this mental model. For example, there can be quite some time between failure and its detection. Depending on the detection and notification, there can be quite a delay before failover is initiated at all. (I once spoke with a retailer whose primary notification mechanism seemed to be the Marketing VP's wife.)

Once you account for delays, you also have to account for faulty mechanisms. Failover itself often fails, usually due to configuration drift. Regular drills and failover exercises are the only way to ensure that failover works when you need it. When the failover mechanisms themselves fail, your system gets thrown into one of these terminal states that require manual recovery.

Just off the cuff, I think the full model looks a lot more like this:

It's worth considering each of these states and asking yourself the following questions:

Is the state transition triggered automatically or manually?
Is the transition step executed by hand or through automation?
How long will the state transition take?
How can I tell whether it worked or not?
How can I recover if it didn't work?
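
One way to keep yourself honest about those questions is to write the answers down as data and let a script point out the holes. The sketch below is Python. The state names, time budgets, and answers are invented for illustration and are not taken from any real runbook.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    source: str
    target: str
    trigger: str        # "automatic" or "manual"
    execution: str      # "automated" or "by hand"
    max_minutes: float  # how long the transition is allowed to take
    verify: str         # how to tell whether it worked ("" = unanswered)
    recover: str        # how to recover if it didn't ("" = unanswered)


# Illustrative assumptions only.
transitions = [
    Transition("failed", "detected", "automatic", "automated", 5,
               "alert fires in monitoring", "fall back to customer reports"),
    Transition("detected", "failover_started", "manual", "by hand", 30,
               "on-call acknowledges the page", "escalate to secondary on-call"),
    Transition("failover_started", "running_on_secondary", "manual", "automated", 15,
               "synthetic transaction succeeds", ""),
]


def unanswered(transitions):
    """Transitions where we can't say how to verify or how to recover."""
    return [(t.source, t.target) for t in transitions
            if not t.verify or not t.recover]


def worst_case_minutes(transitions):
    """Upper bound on time from failure to recovery along this path."""
    return sum(t.max_minutes for t in transitions)


gaps = unanswered(transitions)
budget = worst_case_minutes(transitions)
```

In this made-up example, the last step has no recovery plan, and the optimistic "automatic and magical" failover has a worst case of 50 minutes before anyone has even hit a problem.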

Life's Little Frustrations

2010-04-18T18:24:40-05:00

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. -Leslie Lamport

On my way to QCon Tokyo and QCon China, I had some time to kill so I headed over to Delta's Skyclub lounge. I've been a member for a few years now. And why not? I mean, who could pass up tepid coffee, stale party snacks, and a TV permanently locked to CNN? Wait... that actually doesn't sound like such a hot deal.

Oh! I remember, it's for the wifi access. (Well, that plus reliably clean bathrooms, but we need not discuss that.) Being able to count on wifi access without paying for yet another data plan has been pretty helpful for me. (As an aside, I might change my tune once I try a mifi box. Carrying my own hotspot sounds even better.)

Like most wifi providers, the Skyclub has a captive portal. Before you can get a TCP/IP connection to anything, you have to submit a form with a checkbox to agree to 89 pages of terms and conditions. I'm well aware that Delta's lawyers are trying to make sure the company isn't liable if I go downloading bootlegs of every Ally McBeal episode. But I really don't know if these agreements are enforceable. For all I know, page 83 has me agreeing to 7 years indentured servitude cleaning Delta's toilets.

Anyway, Delta has outsourced operations of their wifi network to Concourse Communications. And apparently, they've had an outage all morning that has blocked anyone from using wifi in the Minneapolis Skyclubs. When I submit the form with the checkbox, I get the following error page:

Including this bit of stacktrace:

There's a lot to dislike here.

Why is this yelling at me, the user? To anyone who isn't a web site developer, this makes it sound like the user did something wrong. There's a ton of scary language here: "instance-specific error", "allow remote connections", "Named Pipes Provider"... heck, this sounds like it's accusing the user of hacking servers. "Stack trace" sure sounds like the Feds are hot on somebody's trail, doesn't it?
Isn't it fabulous to know that Ken keeps his projects on his D: drive? If I had to lay bets, I'd say that Ken screwed up his configuration string. In fact, the whole problem smells like a failed deployment or poorly executed change. Ken probably pushed some code out late on a Friday afternoon, then boogied out of town. My prediction (totally unverifiable, of course) is that this problem will take less than 5 minutes to resolve, once Ken gets his ass back from the beach.
We mere users get to see quite a bit of internal information here. Nothing really damaging, unless of course Wilson ORMapper has some security defects or something like that.
Stepping back from this specific error message, we have the larger question: is it sensible to couple availability of the network to the availability of this check-the-box application? Accessing the network is the primary purpose of this whole system. It is the most critical feature. Is collecting a compulsory boolean "true" from every user really as important as the reason the whole damn thing was built in the first place? Of course not! (As an aside, this is an example of Le Chatelier's Principle: "Complex systems tend to oppose their own proper function.")

We see this kind of operational coupling all the time. Non-critical features are allowed to damage or destroy critical features. Maybe there's a single thread pool that services all kinds of requests, rather than reserving a separate pool for the important things. Maybe a process is overly linearized and doesn't allow for secondary, after-the-fact processing. Or, maybe a critical and a non-critical system both share an enterprise service---producing a common-mode dependency.

Whatever the proximate cause, the underlying problem is lack of diligence in operational decoupling.
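
As a minimal sketch of the thread pool case, here is what reserving a separate pool looks like in Python with `concurrent.futures`. The function names and pool sizes are hypothetical, and the "hung" terms-and-conditions service is simulated with an event that never fires until the end of the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Separate pools: trouble in the non-critical pool cannot starve the critical one.
critical_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="critical")
noncritical_pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="noncritical")

release = threading.Event()


def record_terms_acceptance(user):
    """Non-critical: simulates a dependency that hangs until `release` is set."""
    release.wait()
    return f"recorded {user}"


def grant_network_access(user):
    """Critical: the reason the system exists."""
    return f"{user} online"


# Saturate the non-critical pool with stuck work.
stuck = [noncritical_pool.submit(record_terms_acceptance, f"user{i}") for i in range(3)]

# The critical path still completes, because it never queues behind the stuck work.
result = critical_pool.submit(grant_network_access, "alice").result(timeout=5)

# Unblock the simulated outage so the example shuts down cleanly.
release.set()
recorded = [f.result(timeout=5) for f in stuck]
critical_pool.shutdown()
noncritical_pool.shutdown()
```

With a single shared pool of one worker, the call for "alice" would have sat in the queue behind three hung requests. The same idea applies to connection pools, queues, and any other shared resource.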

Topics in Architecture

2010-01-03T17:16:06-06:00

I’m working on a syllabus for an extensive course on web architecture. This will be for experienced programmers looking to become architects.

Like all of my work about architecture, this covers technology, business, and strategic aspects, so there’s an emphasis on creating high-velocity, competitive organizations.

In general, I’m aiming for a mark that’s just behind the bleeding edge. So, I’m including several of the NoSQL persistence technologies, for example, but not including Erjang because it’s too early. (Or is that “erl-y”?)

(What I’d really love to do is make a screencast series out of all of these. I’m daunted, though. There’s a lot of ground to cover here!)

EDIT: Added functional and OO styles of programming. (Thanks @deanwampler.) Added JRuby/Java under languages. (Thanks @glv.)

I’m interested in hearing your feedback. What would you add? Remove?

Methods and Processes
- Systems Thinking/Learning Organization
- High Velocity Organizations
- Safety Culture
- Error-Inducing Systems (“Normal Accidents”)
- Points of Leverage
- Fundamental Dynamics: Iteration, Variation, Selection, Feedback, Constraint
- 5D architecture
- Failures of Intuition
- ToC
- Critical Chain
- Lean Software Development
- Real Options
- Strategic Navigation
- OODA
- Tempo, Adaptation
- XP
- Scrum
- Lean
- Kanban
- TDD
Architecture Styles
- REST / ROA
- SOA
- Pipes & Filters
- Actors
- App-server centric
- Event-Driven Architecture
Web Foundations
- The “architecture” of the web
- HTTP 1.0 & 1.1
- Browser fetch behaviors
- HTTP Intermediaries
The Nature of the Web
- Crowdsourcing
- Folksonomy
- Mashups/APIs/Linked Open Data
Testing
- TDD
- Unit testing
- BDD/Spec testing
- ScalaCheck
- Selenium
Persistence
- Redis
- CouchDB
- Neo4J
- eXist
- “Web-shaped” persistence
Technical architecture
- 8 Fallacies of Distributed Computing
- CAP Theorem
- Scalability
- Reliability
- Performance
- Latency
- Capacity
- Decoupling
- Safety
Languages and Frameworks
- Spring
- Groovy/Grails
- Scala
  - Lift
- Clojure
  - Compojure
- JRuby
  - Rails
- OSGi
Design
- Code Smells
- Object Thinking
- Object Design
- Functional Thinking
- API Design
- Design for Operations
- Information Hiding
- Recognizing Coupling
Deployment
- Physical
- Virtual
- Multisite
- Cloud (AWS)
- Chef
- Puppet
- Capistrano
Build and Version Control
- Git
- Ant
- Maven
- Leiningen
- Private repos
- Collaboration across projects

"If the last one goes, we'll be up here all night!"

2009-12-18T15:54:21-06:00

There’s an old joke about a couple of folks on a plane who hear the captain successively announce that they’ve lost one, two, then three engines. Each time, he reassures the passengers that they’re OK, but will be progressively later to land. After losing the third engine, one passenger tells the other, “If the last one goes, we’ll be up here all night!”

It’s a remarkable aircraft that can fly on just one out of four engines. Most four engine jets need at least two to cruise. (I’ve been told that they can make a controlled descent on one engine, but can’t maintain altitude.)

Likewise, your web app probably needs more than just one functioning server to handle demand. The usual approach to computing availability is to compute the odds that at least one server survives:

If all the servers are identical, meaning that we expect them to have the same failure rate, then this reduces to the more familiar form:
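As a sketch of the two forms described above, assuming each server i fails independently with availability a_i (the symbols are assumed notation, not taken from the original figures):

```latex
A_{\text{any}} = 1 - \prod_{i=1}^{n} (1 - a_i)
\qquad\text{and, for identical servers,}\qquad
A_{\text{any}} = 1 - (1 - a)^{n}
```

With three servers at 99% each, the chance that at least one is up works out to 1 - 0.01^3 = 99.9999%, which says nothing about whether the survivors can carry the load.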

Coupling and Coevolution

2009-12-03T11:12:03-06:00

The mighty Mississippi River starts in Minnesota, at Lake Itasca. Every kid in Minnesota has to make the ritual pilgrimage to Itasca State Park at some point, where wading across North America’s longest river is a rite of passage.

One of the very interesting things in Itasca State Park is a section of forest that is fenced off so that deer cannot enter it. It’s part of a decades-long experiment to see how forests are affected by browsing herbivores. What’s really interesting is that not only is the quantity of plants different inside the protected area, but the types of plants and trees are different, too. Because deer prefer to nibble on younger trees, fewer saplings survive in the main body of the forest than in the fenced-off portion. Outside the fence, the distribution of tree size and age is biased toward older trees. The population of trees is weighted more toward resinous species like pines, which deer prefer not to eat. Inside the fence, more saplings survive into young maturity, so you see a more even distribution of tree ages and a wider diversity of species represented in the mature trees. The changes in the canopy affect the ground cover, which, in turn, changes how deer could (if allowed) reach the trees and browse them.

So, here’s a feedback loop that involves deer, trees, leaves and brush. The net result is a different ecosystem (albeit a slightly artificial one).

Most physical and biological systems are like this in several ways, particularly relating to feedback. In our artificial systems (electrical, mechanical, symbolic, or semantic) we build in feedback mechanisms as a deliberate control. These are often one dimensional, proportional, and negative.

In natural systems, feedback arises everywhere. Sometimes it proves to be helpful for the long-term stability of the system, in which case the feedback itself gets reinforced by the existence and perpetuation of the system it exists within. In a sense, the system adapts to reinforce beneficial feedback. Conversely, feedback webs that cause too much instability will, like an overly aggressive virus, lead to destruction of their host system and disappear. So, we can see the constituents of a system co-evolving with each other and the system itself.

The old “microphone-amplifier-speaker-squealing” example of feedback really fails here. We lack both language and metaphor to really grasp this kind of interaction over time. In part, I think that’s because we like to separate the world into isolated components and only talk about components at a single level of abstraction. The trouble is that abstractions like “level of abstraction” only exist in our minds.

Here’s another example of coevolution, courtesy of Jared Diamond in “Guns, Germs, and Steel”. I’ll apologize in advance for oversimplifying; I’m devoting a paragraph to an argument he develops across entire chapters.

At some point, a group of nomads decided that the seeds of these particular grasses were tasty. In collecting the grasses, they spread them around. Some kinds of seeds survived the winter better and responded well to being sown by humans. Now, nobody sat down and systematically picked out which seeds grew better or worse. They didn’t have to, because the seeds that grew better produced more seeds for the next generation. Over time, a tiny difference (fractions of a percent) in productivity would lead some strains to supplant the others. Meanwhile, inextricably linked, some humans figured out how to plant, harvest, and eat these early grains. These humans had an advantage over their neighbors, so they were able to feed more babies. That turns out to be a benefit, because farming is hard work and requires more offspring to help produce food. (Another feedback loop.) Oh, and this kind of labor makes it advantageous to keep livestock, too. Over time, these farmers would breed and feed more children than the nomads, so farmers would come to be a larger and larger percentage of the population. Just as an added wrinkle, keeping livestock and fertilizing fields both lead to diseases that simultaneously harm the individuals and occasionally decimate the population, but also provide some long-term benefits such as better disease resistance and inadvertent biological warfare when encountering other civilizations.

Try to diagram the feedback loops here: nomads, farmers, livestock, grains, birthrates, and so on. Everything is connected to everything else. It’s really hard to avoid slipping into teleological language here. We’ve got feedback and feedforward at several different levels and timescales, from the scale of microbes to livestock to civilizations, and across centuries. This dynamic altered the course of many species’ evolution: cattle, wheat, maize, and yes, good old H. sapiens.

This complexity of interaction extends to planetary and stellar levels as well. At some sufficiently long time scale, the intergalactic medium is coupled to our planetary ecosystem.

The human intellectual penchant for decomposition, isolation, and leveled abstraction is purely an artifact of the size of our bodies and the duration of our lives.

GMail Outage Was a Chain Reaction

2009-09-02T09:25:03-05:00

Google has published an explanation of the widespread GMail outage from September 1st. In this explanation, they trace the root cause to a layer of “request routers”:

…a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.

This perfectly describes the “Chain Reaction” stability antipattern from Release It!
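A toy model makes the cascade easy to see. This is only a sketch under made-up assumptions (load is split evenly, an overloaded router drops out entirely, and the function name and numbers are hypothetical); it is not a description of Google's actual routing layer.

```python
def surviving_routers(capacities, total_load):
    """Repeatedly drop overloaded routers and respread the load evenly.

    Returns the capacities of the routers still serving traffic and the
    number of rounds in which at least one router dropped out.
    """
    alive = list(capacities)
    rounds = 0
    while alive:
        share = total_load / len(alive)          # even split across survivors
        still_ok = [c for c in alive if c >= share]
        if len(still_ok) == len(alive):          # nobody overloaded: stable
            break
        alive = still_ok                         # overloaded routers drop out
        rounds += 1
    return alive, rounds


# Ten routers, two of them slightly weaker than the rest.
capacities = [100] * 8 + [80, 80]

stable, stable_rounds = surviving_routers(capacities, total_load=700)
collapsed, collapse_rounds = surviving_routers(capacities, total_load=850)
print(len(stable), stable_rounds)        # everything holds at the lower load
print(len(collapsed), collapse_rounds)   # the higher load takes out the whole layer
```

At the higher load only the two weaker routers are overloaded at first, but once they shed their traffic the remaining eight are pushed past their own limit.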

Hadoop versus VPN

2009-07-31T10:03:13-05:00

I’ve been doing some work with Hadoop lately, and I just ran into an interesting problem with networking. This isn’t a bug, per se, but a conflict in my configuration.

I’m running on a laptop, using a pseudo-distributed cluster. That means all the different processes are running, but they’re all running on one box. That makes it possible to test jobs with full network communication, but without deploying to a production cluster.

I’m also working remotely, connecting to the corporate network by VPN. As is commonly done, our VPN is configured to completely separate the client machine from its local network. (If it didn’t, you could use the VPN machine to bridge the secure corporate network to your home ISP, coffeeshop, airport, etc.)

Here’s the problem: when on the VPN, my machine can’t talk to its own IP address. Right now, ifconfig reports the laptop’s IP address as 192.168.1.105. That’s the address associated with the physical NIC on the machine.

The odd part is that Hadoop mostly works this way. I’ve configured the name node, job tracker, task tracker, datanodes, etc. to all use “localhost”. I can use HDFS, I can submit jobs, and all the map tasks work fine. The only problem is that when the map tasks finish, the task tracker cannot send data from the map tasks to the reduce tasks. The job appears to hang.

In the task tracker’s log file, I see reports every 20 seconds or so that say

2009-07-31 11:01:33,992 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200907310946_003_r_000000_0 0.0% reduce > copy >

The instant I disconnected from the VPN, the copy proceeded and the reduce job ran.

I’m sure there’s a configuration property somewhere within Hadoop that I can change. When (if) I find it, I’ll update this post.

An AspectJ Circuit Breaker

2009-07-16T09:01:15-05:00

Spiros Tzavellas pointed me to his implementation of Circuit Breaker. His approach uses AspectJ and can be applied using a bytecode weaver or AspectJ compiler. He's also got unit tests with 85% coverage.

Spiros' project page is here, and the code is (where else?) on GitHub. He appears to be quite actively developing the project.

Two New Circuit Breaker Implementations

2009-07-16T07:35:45-05:00

The excellent Will Sargent has created a Circuit Breaker gem that's quite nice. You can read the docs at rdoc.info. He's released the code (under LGPL) on GitHub.

The other one has actually been out for a couple of months now, but I forgot to blog about it. Scott Vlamnick created a Grails plugin that uses AOP to weave Circuit Breaker functionality as "around" advice. This one can also report its state via JMX. In a particularly nice feature, this plugin supports different configurations in different environments.

Workmen, tools, etc.

2009-05-20T20:17:03-05:00

We’ve all heard the old saw, “It’s a poor workman that blames his tools.” Let’s think about that for a minute. Does it actually mean that a skilled craftsman can do great work with shoddy implements?

Well, can a chef make a souffle with a skillet?

Can a cabinetmaker round an edge with dull router bits?

I’m not going to rule it out. Perhaps there’s a brilliant chef who—at this very moment—is preparing to introduce the world to the “skiffle.” And, it’s possible that one could coax a dull router into making a better quarter round through care, attention, and good speed control.

Going by the odds, though, I’d bet on scrambled eggs and splinters.

Like a lot of old sayings, this one doesn’t make much sense in its usual interpretation. Most people take this proverb to mean that you should be able to turn out top-notch work with whatever tools you’re given. It’s an excuse for bad tools, or lack of interest in improving them.

This homily dates back to a time when workers would bring their own tools to the job, leading to the popular origin story for the phrase “getting sacked”. (No comments about møøse bites, please.) Some crafts have evaded the assembly line, and in those, craftsmen still bring their own tools. Chefs bring their prized knives. Fine carpenters bring their own hand and bench tools.

There is a grain of truth in the common interpretation that good tools don’t make a good workman. There’s another level of truth under the surface, though. The 13th Century French version of this saying translates as, “A bad workman will never find a good tool.” I like this version a lot better. Tools cannot make one good, but bad tools can hurt a good worker’s performance. That sounds a lot less like “quit whining and use whatever’s at hand,” doesn’t it?

On the other hand, if you supply your own tools, you’re not as likely to tolerate bad ones, are you? I think this is the most important interpretation. Good workers—if given the choice—will select the best tools and keep them sharp.

Minireview: Beginning Scala

2009-05-18T14:41:57-05:00

As you can probably tell from my recent posts, I’ve been learning Scala. I recently dug into another Scala book, Beginning Scala by David Pollak.

Beginning Scala is a nice, gentle introduction to this language. It takes a gradual, example driven approach that emphasizes running code early. This makes it a good intro for people who want to use the language for applications first, then worry about creating frameworks later.

Don’t let that fool you, though. Pollak gets to the sophisticated parts soon enough. I particularly like an example of creating a new “control structure” to execute stuff in the context of a JDBC connection. This puts some meat on the argument that Scala is a “scalable language.” Where other languages either implement this as a keyword (as in Groovy’s “with”) or a framework (Spring’s “templates”), here it can be added with one page of example code.

Beginning Scala also has a very thorough discussion of actors. I appreciate this, because actors were my main motivation for learning Scala in the first place.

Pollak separates the act of consuming a library from that of creating a library. He advises us to worry most about types, traits, co- and contravariance, etc. mainly when we are creating libraries. True to this notion, chapter 7 is called “Traits and Types and Gnarly Stuff for Architects”. It doesn’t sound like much fun, but it is important material. I find that Scala makes me think more about the type system than other languages. It’s strongly, and statically, typed. (So much so, in fact, that it makes me realize just how loose Java’s own type system is.) As such, it pays to have a firm understanding of how code turns into types. Scala has a rich set of tools for building an expressive type system, but there is also complexity there. Checking in at 60 pages, this chapter covers Scala’s tools along with guidance on good styles and idioms.

Interestingly, although there is a Lift logo on the cover, there’s nothing about Lift in the book itself. Considering that Pollak is the creator of Lift, it’s curious that this book doesn’t deal with it. Perhaps that’s being left for another title.

Overall, I endorse Beginning Scala.

Units of Measure in Scala

2009-05-07T22:00:09-05:00

Failure to understand or represent units has caused several major disasters, including the costly loss of the Mars Climate Orbiter in 1999. This is one of those things that DSLs often get right, but mainstream programming languages just ignore. Or, worse, they implement a clunky unit of measure library that ensures you can never again write a sensible arithmetic expression.

While I was at JAOO Australia this week, Amanda Laucher showed some F# code for a recipe that caught my attention. It used numeric literals that directly attached units to quantities. What’s more, it was intelligent about combining units.

I went looking for something similar in Scala. I googled my fingertips off, but without much luck, until Miles Sabin pointed out that there’s already a compiler plugin sitting right next to the core Scala code itself.

Installing Units

Scala has its own package manager, called sbaz. It can directly install the units extension:

sbaz install units

This will install it under your default managed installation. If you haven’t done anything else, that will be your Scala install directory. If you have done something else, you probably already know what you’re doing, so I won’t try to give you instructions.

Using Units

To use units, you first have to import the library’s “Preamble”. It’s also helpful to go ahead and import the “StandardUnits” object. That brings in a whole set of useful SI units.

I’m going to do all this from the Scala interactive interpreter.

scala> import units.Preamble._
import units.Preamble._

scala> import units.StandardUnits._
import units.StandardUnits._

After that, you can multiply any number by a unit to create a dimensional quantity:

scala> 20*m
res0: units.Measure = 20.0*m

scala> res0*res0
res1: units.Measure = 400.0*m*m

scala> Math.Pi*res0*res0
res2: units.Measure = 1256.6370614359173*m*m

Notice that when I multiplied a length (in meters) times itself, I got an area (square meters). To me, this is a really exciting thing about the units library. It can combine dimensions sensibly when you do math on them. In fact, it can help prevent you from incorrectly combining units.

scala> val length = 5*mm
length: units.Measure = 5.0*mm

scala> val weight = 12*g
weight: units.Measure = 12.0*g

scala> length + weight
units.IncompatibleUnits: Incompatible units: g and mm

I can’t add grams and millimeters, but I can multiply them.
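The internals of the units library aren't shown here, so the following is only a rough Python sketch of the general idea: carry a map of unit exponents alongside each number, add exponents on multiplication, and refuse addition when the maps differ. The `Measure` class below is hypothetical and is not the Scala library's API; it also ignores prefixes such as mm versus m.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    """A number tagged with unit exponents, e.g. (("m", 2),) for square meters."""
    value: float
    dims: tuple = ()   # sorted (unit name, exponent) pairs

    def __mul__(self, other):
        if not isinstance(other, Measure):
            # Plain number times a measure: scale the value, keep the units.
            return Measure(self.value * other, self.dims)
        exponents = dict(self.dims)
        for name, power in other.dims:
            exponents[name] = exponents.get(name, 0) + power
        dims = tuple(sorted((n, p) for n, p in exponents.items() if p != 0))
        return Measure(self.value * other.value, dims)

    __rmul__ = __mul__   # lets us write 20 * m as well as m * 20

    def __add__(self, other):
        if self.dims != other.dims:
            raise ValueError(f"Incompatible units: {self.dims} and {other.dims}")
        return Measure(self.value + other.value, self.dims)


m = Measure(1.0, (("m", 1),))
g = Measure(1.0, (("g", 1),))

area = (20 * m) * (20 * m)    # exponents add: m * m
mixed = (5 * m) * (12 * g)    # multiplying different units is fine

try:
    (5 * m) + (12 * g)        # adding them is not
except ValueError as err:
    print(err)
```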

Creating Units

The StandardUnits package includes a lot of common units relating to basic physics. It doesn’t have any relating to system capacity metrics, so I’d like to create some units for that.

scala> import units._
import units._

scala> val requests = SimpleDimension("requests")
requests: units.SimpleDimension = requests

scala> val req = SimpleUnit("req", requests, 1.0)
req: units.SimpleUnit = req

scala> val Kreq = SimpleUnit("Kreq", requests, 1000.0)
Kreq: units.SimpleUnit = Kreq

Now I can combine that simple dimension with others. If I want to express requests per second, I can just write it directly.

scala> 565*req/s
res4: units.Measure = 565.0*req/s

Conclusion

This extension will be the first thing I add to new projects from now on. The convenience of literals, with the extensibility of adding my own dimensions and units means I can easily keep units with all of my numbers.

There’s no longer any excuse to neglect your units in a mainstream programming language.

Kudos to Relevance and Clojure

2009-05-06T01:18:26-05:00

It’s been a while since I blogged anything, mainly because most of my work lately has either been mind-numbing corporate stuff, or so highly contextualized that it wouldn’t be productive to write about.

Something came up last week, though, that just blew me away.

For various reasons, I’ve engaged Relevance to do a project for me. (Actually, the first results were so good that I’ve now got at least three more projects lined up.) They decided—and by “they”, I mean Stuart Halloway—to write the engine at the heart of this application in Clojure. That makes it sound like I was reluctant to go along, but actually, I was interested to see if the result would be as expressive and compact as everyone says.

Let me make a brief aside here and comment that I’m finding it much harder to be the customer on an agile project than to be a developer. I think there are two main reasons. First, it’s hard for me to keep these guys supplied with enough cards to fill an iteration. They’re outrunning me all the time. Big organizations like my employer just take a long time to decide anything. Second, there’s nobody else I can defer to when the team needs a decision. It often takes two weeks just for me to get a meeting scheduled with all of the stakeholders inside my company. That’s an entire iteration gone, just waiting to get to the meeting to make a decision! So, I’m often in the position of making decisions that I’m not 100% sure will be agreeable to all parties. So far, they have mostly worked out, but it’s a definite source of anxiety.

Anyway, back to the main point I wanted to make.

My personal theme is making software production-ready. That means handling all the messy things that happen in the real world. In a lab, for example, only one batch file ever needs to be processed at once. You never have multiple files waiting for processing, and files are always fully present before you start working on them. In production, that only happens if you guarantee it.

Another example, from my system. We have a set of rules (which are themselves written in Clojure code) that can be changed by privileged users. After changing the configuration, you can tell the daemonized Clojure engine to “(reload-rules!)”. The “!” at the end of that function means it’s an imperative with major side effects, so the rules get reloaded right now.

I thought I was going to catch them out when I asked, oh so innocently, “So what happens when you say (reload-rules!) while there’s a file being processed on the other thread?” I just love catching people when they haven’t dealt with all that nasty production stuff.

After a brief sidebar, Stu and Glenn Vanderburg decided that, in fact, nothing bad would happen at all, despite reloading rules in one thread while another thread was in the middle of using the rules.

Clojure uses a flavor of transactional memory, along with persistent data structures. No, that doesn’t mean they go in a database. It means that changes to a data structure can only be made inside of a transaction. The new version of the data structure and the old version exist simultaneously, for as long as there are outstanding references to them. So, in my case, that meant that the daemon thread would “see” the old version of the rules, because it had dereferenced the collection prior to the “reload-rules!” Meanwhile, the reload-rules! function would modify the collection in its own transaction. The next time the daemon thread comes back around and uses the reference to the rules, it’ll just see the new version of the rules.

In other words, two threads can both use the same reference, with complete consistency, because they each see a point-in-time snapshot of the collection’s state. The team didn’t have to do anything special to make this happen… it’s just the way that Clojure’s references, persistent data structures, and transactional memory work.
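The snapshot behavior can be imitated in a few lines of Python. This is a loose sketch of the idea only, not Clojure's transactional memory: the `Ref` class, its `deref` and `swap` methods, and the `max_size` rule are all hypothetical names chosen for illustration. The key point is that the value behind the reference is never mutated in place; a reload builds a new value and swaps the reference.

```python
import threading
from types import MappingProxyType


class Ref:
    """Holds a reference to a read-only value; updates replace the reference."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def deref(self):
        # Readers get whatever version is current at this instant.
        return self._value

    def swap(self, update):
        # Writers build a new value from the old one and swap it in.
        with self._lock:
            self._value = update(self._value)


rules = Ref(MappingProxyType({"max_size": 10}))

# The worker dereferences once and keeps using that snapshot.
snapshot = rules.deref()

# Meanwhile, a reload installs a new version of the rules.
rules.swap(lambda old: MappingProxyType({**old, "max_size": 20}))

print(snapshot["max_size"])        # the in-flight work still sees the old rules
print(rules.deref()["max_size"])   # the next dereference sees the new ones
```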

Even though I didn’t get to catch Stu and Glenn out on a production readiness issue, I still had to admit that was pretty frickin' cool.

JAOO Australia in 1 Month

2009-04-03T12:12:06-05:00

The Australian JAOO conferences are now just one month away. I’ve wanted to get to Australia for at least ten years now, so I am thrilled to finally get there.

I’ll be delivering a tutorial on production ready software in both the Brisbane and Sydney conferences. This tutorial was a hit at QCon London, where I first delivered it. The Australian version will be further improved.

During the main conference, I’ll be delivering a two-part talk on the common ways distributed systems break and how to recover from such breakage. These talks apply whether you’re building web facing systems or internal shared services/SOA projects.

Quantum Backups

2009-03-20T08:40:54-05:00

Backups are the only macroscopic system we commonly deal with that exhibits quantum mechanical effects. This is odd enough that I’ve spent some time getting tangled up in these observations.

Until you attempt a restore, a backup set is neither good nor bad, but a superposition of both. This is the superposition principle.

The peculiarity of the superposition principle is dramatically illustrated with the experiment of Schrödinger’s backup. This is when you attempt to restore Schrödinger’s pictures of his cat, and discover that the cat is not there.

In a startling corollary, if you use offsite vaulting, a second quantum variable is introduced, in that the backup set exists and does not exist simultaneously. A curious effect emerges upon applying the Hamiltonian operator. The operator shows that certain eigenvalues are always zero, revealing that prime numbered tapes greater than 5 in a set never exist.

Finally, the Heisenbackup principle says that the user of a system is entangled with the system itself. As a result, within 30 days of consciously deciding that you do not need to run a backup, you will experience a complete disk crash. Because you’ve just read this, your 30 days start now.

Sorry about that.

Update: Sun Cloud API Not the Same as Amazon

2009-03-19T07:29:33-05:00

It looks like the early reports that Sun’s cloud API would be compatible with AWS resulted from the reporters' exuberance (or mere confusion.)

It’s actually nicer than Amazon’s.

It is based on the REST architectural style, with representations in JSON. In fact, I might start using it as the best embodiment of REST principles. You start with an HTTP GET of “/”. In the response to this and every other request, it is the hyperlinks in the response that indicate what actions are allowed.

Sun has a wiki to describe the API, with a very nicely illustrated “Hello, Cloud” example.

Can you make that meeting?

2009-03-18T15:55:13-05:00

I’m convinced that the next great productivity revolution will be de-matrixing the organizations we’ve just spent ten years slicing and dicing.

Yesterday, I ran into a case in point: What are the odds that three people can schedule a meeting this week versus having to push it into next week?

Turns out that if they’re each 75% utilized, then there’s only a 15% chance they can schedule a one hour meeting this week. (If you always schedule 30 minute meetings instead of one hour, then the odds go up to about 25%.)

Here’s the probability curve that the meeting can happen. This assumes, by the way, that there are no lunches or vacation days, and that all parties are in the same time zone. It only gets worse from here.

So, overall, there’s about an 85% chance that 3 random people in a meeting-driven company will have to defer until next week.

Bring it up to 10 people, in a consensus-driven, meeting-oriented company, and the odds drop to 0.00095%.
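The post doesn't spell out the model behind these numbers. One simple model that lands close to them assumes ten candidate one-hour slots (twenty for half-hour meetings), with each person independently free in any slot with probability 1 minus their utilization. The slot count and the function name are assumptions made for this sketch.

```python
def chance_of_common_slot(people, utilization, slots):
    """Probability that at least one slot is free for everybody.

    Assumes each person is independently free in each slot with
    probability (1 - utilization).
    """
    free_together = (1.0 - utilization) ** people     # one given slot works
    return 1.0 - (1.0 - free_together) ** slots       # at least one slot works


hour_meeting = chance_of_common_slot(people=3, utilization=0.75, slots=10)
half_hour_meeting = chance_of_common_slot(people=3, utilization=0.75, slots=20)
ten_people = chance_of_common_slot(people=10, utilization=0.75, slots=10)

print(f"{hour_meeting:.1%}")        # roughly 15%
print(f"{half_hour_meeting:.1%}")   # roughly 27%
print(f"{ten_people:.5%}")          # roughly 0.00095%
```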

No wonder “time to first meeting” seems to dominate “time to do stuff.”

Amazon as the new Intel

2009-03-18T11:31:07-05:00

Update: Please read this update. The information underlying this post was based on early, somewhat garbled, reports.

A brief digression from the unpleasantness of reliability.

This morning, Sun announced their re-entry into the cloud computing market. After Sun withdrew Network.com from the marketplace a few months ago, we were all wondering what Sun’s approach would be. No hardware vendor can afford to ignore the cloud computing trend… it’s going to change how customers view their own data centers and hardware purchases.

One thing that really caught my interest was the description of Sun’s cloud offering. It sounded really, really similar to AWS. Then I heard the E-word and it made perfect sense. Sun announced that they will use EUCALYPTUS as the control interface to their solution. EUCALYPTUS is an open-source implementation of the AWS APIs.

Last week at QCon London, we heard Simon Wardley give a brilliant talk, in which he described Canonical’s plan to create a de facto open standard for cloud computing by seeding the market with open source implementations. Canonical’s plan? Ubuntu and private clouds running EUCALYPTUS.

It looks like Amazon may be setting the standard for cloud computing, in the same way that Intel set the standard for desktop and server computing, by defining the programming interface.

I don’t worry about this, for two reasons. One, it forestalls any premature efforts to force a de jure standard. This space is still young enough that an early standard can’t help but be a drag on exploration of different business and technical models. Two, Amazon has done an excellent job as a technical leader. If their APIs “win” and become de facto standards, well, we could do a lot worse.

Getting Real About Reliability

2009-03-16T17:10:02-05:00

In my last post, I used some back-of-the-envelope reliability calculations, with just one interesting bit, to estimate the availability of a single-stacked web application, shown again here. I cautioned that there were a lot of unfounded assumptions baked in. Now it's time to start removing those assumptions, though I reserve the right to introduce a few new ones.

Is it there when I want it?

First, let's talk about the hardware itself. It's very likely that these machines are what some vendors are calling "industry-standard servers." That's a polite euphemism for "x86" or "ia64" that just doesn't happen to mention Intel. ISS servers are expected to exhibit 99.9% availability.

There's something a little bit fishy about that number, though. It's one thing to say that a box is up and running ("available") 99.9% of the times you look at it. If I check it every hour for a year, and find it alive at least 8,756 out of 8,765 times, then it's 99.9% available. It might have broken just once for 9 hours, or it might have broken 9 times for an hour each, or it might have broken 18 times for half an hour each.

This is the difference between availability and reliability. Availability measures the likelihood that a system can perform its function at a specific point in time. Reliability, on the other hand, measures the likelihood that a system will not have failed before a point in time. Availability and reliability both matter to your users. In fact, a large number of small outages can be just as frustrating as a single large event. (I do wonder... since both ends of the spectrum seem to stick out in users' memories, perhaps there's an optimum value for the duration and frequency of outages, where they are seldom enough to seem infrequent, but short enough to seem forgivable?)

We need a bit more math at this point.

It must be science... it's got integrals.

Let's suppose that hardware failures can be described as a function of time, and that they are essentially random. It's not like the story of the "priceless" server room, where failure can be expected based on actions or inaction. We'll also carry over the previous assumption that hardware failures among these three boxes are independent. That is, failure of any one box does not make other boxes more likely to fail.

We want to determine the likelihood that the box is available, but the random event we're concerned with is a fault. Thus, we first need to find the probability that a fault has occurred by time t. Checking for a failure is sampling for an event X between times 0 and t.
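In symbols, writing f for the failure density and F for its cumulative form (standard reliability-theory notation, assumed here since the original formula is not reproduced):

```latex
P(X \le t) = F(t) = \int_{0}^{t} f(\tau)\, d\tau
```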

The function f(t) is the probability density function that describes failures of this system. We'll come back to that shortly, because a great deal hinges on what function we use here. The reliability of the system, then, is the probability that the event X didn't happen by time t.
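Written out, with R(t) as an assumed symbol for reliability, that is the complement of the probability that a fault has occurred by time t:

```latex
R(t) = P(X > t) = 1 - \int_{0}^{t} f(\tau)\, d\tau = \int_{t}^{\infty} f(\tau)\, d\tau
```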

One other equation that will help in a bit is the failure rate, the number of failures to expect per unit time. Like reliability, the failure rate can vary over time. The failure rate is:
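The usual textbook definition, assumed here with λ(t) as the symbol for the failure rate, divides the failure density by the probability of having survived to time t:

```latex
\lambda(t) = \frac{f(t)}{R(t)} = \frac{f(t)}{\int_{t}^{\infty} f(\tau)\, d\tau}
```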

Failure distributions

So now we've got integrals to infinity of unknown functions. This is progress?

It is progress, but there are some missing pieces. Next time, I'll talk about different probability distributions, which ones make sense for different purposes, and how to calibrate them with observations.

Reliability Math

2009-02-27T23:20:16-06:00

Suppose you build a web site out of a single stack of one web, app, and database server. What sort of availability SLA should you be willing to support for this site?

We'll approach this in a few steps. For the first cut, you'd say that the appropriate SLA is just the expected availability of the site. Availability is defined in different ways depending on when and how you expect to measure it, but for the time being, we'll say that availability is the probability of getting an HTTP response when you submit a request. This is the instantaneous availability.

What is the probability of getting a response from the web server? Assuming that every request goes through all three layers, then the probability of a response is the probability that all three components are working. That is:
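Assuming the three layers fail independently, and using A for availability (the subscripts are assumed notation), the product can be written as:

```latex
A_{\text{site}} = A_{\text{web}} \times A_{\text{app}} \times A_{\text{db}}
```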

This follows our intuition pretty closely. Since any of the three servers can go down, and any one server down takes down the site, we'd expect to just multiply the probabilities together. But what should we use for the reliability of the individual boxes? We haven't done a test to failure or life cycle test on our vendor's hardware. In fact, if our vendor has any MTBF data, they're keeping it pretty quiet.

We can spend some time hunting down server reliability data later. For now, let's just try to estimate it. In fact, let's estimate widely enough that we can be 90% confident that the true value is within our range. This will give us some pretty wide ranges, but that's OK... we haven't dug up much data yet, so there should be a lot of uncertainty. Uncertainty isn't a show stopper, and it isn't an excuse for inaction. It just means there are things we don't yet know. If we can quantify our uncertainty, then we can still make meaningful decisions. (And some of those decisions may be to go study something to reduce the uncertainty!)

Even cheap hardware is getting pretty reliable. Would you expect every server to fail once a year? Probably not. It's less frequent than that. One out of the three servers fails every two years? Seems to be a little pessimistic, but not impossible. Let's start there. If every server fails once every two years, at a constant rate [1], then we can say that the lower bound on server reliability is 60.6%. Would we expect all of these servers to run for five years straight without a failure? Possible, but unlikely. Let's use one failure over five years as our upper bound. One failure out of fifteen server-years would give an annual availability of 93.5% for each server.
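The two bounds can be checked with a few lines of Ruby. This is a minimal sketch that assumes the constant-rate (exponential) model from footnote 1, where the chance of getting through a period with no failure is exp(-rate × time). The method name `survival_probability` is made up for the illustration.

```ruby
# Probability of zero failures over `years`, assuming a constant failure
# rate (exponential model). The method name is hypothetical.
def survival_probability(failures_per_year, years = 1.0)
  Math.exp(-failures_per_year * years)
end

lower = survival_probability(1.0 / 2)   # one failure per server every two years
upper = survival_probability(1.0 / 15)  # one failure per fifteen server-years

printf("lower bound: %.4f\n", lower)
printf("upper bound: %.4f\n", upper)
```

Truncated to one decimal place, the results agree with the 60.6% and 93.5% quoted above.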

So, each server's availability is somewhere between 60.6% and 93.5%. That's a pretty wide range, and won't be satisfactory to many people. That's OK, because it reflects our current degree of uncertainty.

To find the overall reliability, I could just take the worst case and plug it in for all three probabilities, then plug in the best case. That slightly overstates the edge cases, though. I'm better off getting Excel to help me run a Monte Carlo analysis to give me an average across a bunch of scenarios. I'll construct a row that randomly samples a scenario from within these ranges. It will pick three values between 60.6% and 93.5% and compute their product. Then, I'll copy that row 10,000 times by dragging it down the sheet. Finally, I'll average out the computed products to get a range for the overall reliability. When I do that, I get a weighted range of 28.9% to 62.6%. [2] [3]
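The same experiment can be sketched outside of Excel. The version below assumes each server's availability is drawn uniformly from the range, and it reports the mean along with the 5th and 95th percentiles of the products. How the spreadsheet turned its 10,000 rows into a "weighted range" isn't stated here, so treat the percentile choice as a guess and don't expect it to reproduce 28.9% and 62.6% exactly.

```ruby
# Monte Carlo sketch: sample three availabilities per trial and multiply them.
# Uniform sampling and the 5th/95th percentile summary are assumptions.
LOW    = 0.606
HIGH   = 0.935
TRIALS = 10_000

rng = Random.new(2009)  # fixed seed so repeated runs agree

products = Array.new(TRIALS) do
  Array.new(3) { rng.rand(LOW..HIGH) }.reduce(:*)
end

sorted = products.sort
mean   = products.sum / TRIALS
p05    = sorted[(TRIALS * 0.05).floor]
p95    = sorted[(TRIALS * 0.95).floor]

printf("mean %.3f, 5th percentile %.3f, 95th percentile %.3f\n", mean, p05, p95)
```

For comparison, plugging the worst case into all three slots gives 0.606 cubed, and the best case gives 0.935 cubed, so every sampled product has to land between those two extremes.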

Yep, this single stack web site will be available somewhere between 28.9% of the time and 62.6%. [4]

Actually, it's likely to be worse than that. There are two big problems in the analysis so far. First, we've only accounted for hardware failures, but software failures are a much bigger contributor to downtime. Second, more seriously, the equation for overall reliability assumes that all failures are independent. That is, we implicitly assumed that nothing could cause more than one of these servers to fail simultaneously. Talk about Pollyanna! We've got common mode failures all over the place, especially in the network, power, and data center arenas.

Next time, we'll start working toward a more realistic calculation.

1. I'm using a lot of simplifying assumptions right now. Over time, I'll strip these away and replace them with more realistic calculations. For example, a constant failure rate implies an exponential distribution function. It is mathematically convenient, but doesn't represent the effects of aging on moving components like hard drives and fans.

2. You can download the spreadsheet here.

3. These estimation and analysis techniques are from "How to Measure Anything" by Doug Hubbard.

4. Clearly, for a single-threaded stack like this, you can achieve much higher reliability by running all three layers on a single physical host.

2009 Calendar as OmniGraffle Stencil

2009-02-27T23:10:32-06:00

I had need of a stencil that would let me drop monthly calendars on a number of pages. I found it useful, and someone else might, too.

Download the stencil.

Fast Iteration versus Elegant Design

2009-02-21T16:00:42-06:00

I love the way that proggit bubbles stuff around. Today, for a while at least, the top link is to a story from Salon in May of 2000 about Bill and Lynne Jolitz, the creators of 386BSD.

[An aside: I'm not sure exactly when I became enough of a graybeard to remember as current events things which are now discussed as history. It's really disturbing that an article from almost a decade ago talks about events seven years earlier than that, and I remember them happening! To me, the real graybeards are the guys that created UNIX and C to begin with. Me? I'm part of the second or third UNIX generation, at best. Sigh...]

Anyway, Bill and Lynne Jolitz created the first free, open-source UNIX that ran on x86 chips. Coherent was around before that, and I think SCO UNIX was available for x86 at the same time. SCO wasn't evil then, just expensive. In those days, you had to lay down some serious jing to get UNIX on your PC. Minix was available for free, but Tanenbaum held firm that Minix should teach principles rather than be a production OS, so he favored pedagogical value over functionality. Consequently, Minix wasn't a full UNIX implementation. (At least at that time. It might be now.)

Just contemplate the hubris of two programmers deciding that they would create their own operating system, to be UNIX, but fixing the flaws, hacks, and workarounds that had built up over more than a decade. Not only that, but they would choose to give it away for the cost of floppies! And not only that, but they would build it for a processor that serious UNIX people sneered at. Most impressive of all, they succeeded. 386BSD was a technically superior, well-architected version of UNIX for commodity hardware. The Jolitzes extrapolated Intel's growth curve and rapid product cycles and saw that x86 processors would advance far faster than the technically superior RISC chips.

At various times, I ran Minix, 386BSD, and SCO UNIX on my PC well before I even heard of Linux. Each of them had the field before Linus even made his 0.1 release.

So why is Linux everywhere, and we only hear about 386BSD in historical contexts? There is exactly one answer, and it's what Eric Raymond was really talking about in The Cathedral and the Bazaar. TCatB has been seen mostly as an argument for open-source versus commercial software, but what Raymond saw was that the real competition comes down to an open contribution model versus closed contributions. Linus' promiscuous contribution policy simply let Linux out-evolve 386BSD. More contributors meant more drivers, more bug fixes, more enhancements... more ideas, ultimately. Two people, no matter how talented, cannot outcode thousands of Linux contributors. The best programmers are 10 times more productive than the average, and I would rate Bill and Lynne among the very best. But, as of last April, the Linux Foundation reported that more than 3,600 people had contributed to the kernel alone.

Iteration is one of the fundamental dynamics. Iteration facilitates adaptation, and adaptation wins competition. History is littered with the carcasses of "superior" contenders that simply didn't adapt as fast as their victorious challengers.

Why Do Enterprise Applications Suck?

2009-02-20T22:23:55-06:00

What is it about enterprise applications that makes them suck?

I mean, have you ever seen someone write 1,500 words about how much they love their corporate expense reporting system? Or spend their free time mashing up the job posting system together with Google maps? Of course not. But why not?

There’s a quality about some software that inspires love in its users, and it’s totally absent from enterprise software. The best you can ever say about enterprise software is that it doesn’t get in the way of the business. At its worst, enterprise software creates more work than it automates.

For example, in my company, we’ve got a personnel management tool that’s so unpredictable that every admin in the company keeps his or her own spreadsheet of requests that have been entered. They have to do that because the system itself randomly eats input, drops requests, or silently trashes forms. It’s not a training problem, it’s just lousy software.

We’ve got a time-tracking system that has a feature where an employee can enter in a vacation request. There’s a little workflow triggered to have the supervisor approve the vacation request. I’ve seen it used inside two groups. In both cases, the employee negotiates the leave request via email then enters it into the time tracking system. I know several people who use Travelocity to find their flights before they log in to our corporate travel system. And you wouldn’t even believe how hard our sales force automation system is compared to Salesforce.com.

Way back in 1937, Ronald Coase elaborated his theory about why corporations exist. He said that a firm’s boundaries should be drawn so as to minimize transaction costs… search and information costs, bargaining costs, and cost of policing behavior. By almost every measure, then, external systems offer lower transaction costs than internal ones. No wonder some people think IT doesn’t matter.

If the best you can do is not mess up a nondifferentiating function like personnel management, it’s tough to claim that IT can be a competitive advantage. So, again I’ll ask, why?

I think there are exactly four reasons that internal corporate systems are so unloved and unlovable.

They serve their corporate overlords, not their users.

This is simple. Corporate systems are built according to what analysts believe will make the company more efficient. Unfortunately, this too often falls prey to penny-wise-pound-foolish decisions that micro-optimize costs while suboptimizing the overall value stream. Optimizing one person’s job with a system that creates more work for a number of other people doesn’t do any good for the company as a whole.

They only do gray-suited, stolidly conservative things.

Corporate IM now looks like an obvious idea, but messaging started frivolously. It was blocked, prohibited, and firewalled. In 1990, who would have spent precious capital on something to let cubicle-dwellers ask each other what they were doing for lunch? As it turns out, a few companies were on the leading edge of that wave, but their illicit communications were done in spite of IT. How many companies would build something to Create Breakthrough Products Through Collaborative Play?

They have captive audiences.

If your company has six purchasing systems, that’s a problem. If you have a choice of six online stores, that’s competition.

They lack “give-a-shitness”.

I think this one matters most of all. Commerce sites, Web 2.0 startups, IM networks… the software that people love was created by people who love it, too. It’s either their ticket to F-U money, it’s their brainchild, or it’s their livelihood. The people who build those systems live with them for a long time, often years. They have reason to care about the design and about keeping the thing alive.

This is also why, once acquired, startups often lose their luster. The founders get their big check and cash out. The barnstormers that poured their passion into it discover they don’t like being assimilated and drift away.

Architects, designers, and developers of corporate systems usually have little or no voice in what gets built, or how, or why. (Imagine the average IT department meeting where one developer says this system really ought to be built using Scala and Lift.) They don’t sign on, they get assigned. I know that individual developers do care passionately about their work, but they usually have no way to really make a difference.

The net result is that corporate software is software that nobody gives a shit about: not its creators, not its investors, and not its users.

Tracking and Trouble

2009-02-19T11:55:27-06:00

Pick something in your world and start measuring it. Your measurements will surely change a little from day to day. Track those changes over a few months, and you might have a chart something like this.

Now that you've got some data assembled, you can start analyzing it. The average over this sample is 59.5. It's got a variance of 17, which is about 28% of the mean. You can look for trends. For example, we seem to see an upswing for the first few months, then a pullback starting around 90 days into the cycle. In addition, it looks like there is a pretty regular oscillation superimposed on the main trend, so you might be looking at some kind of weekly pattern as well.

The next few months of data should make the patterns clearer.

Indeed, from this chart, it looks pretty clear that the pullback around 100 days was the early indicator of a flattening in the overall growth trend from the first few months. Now, the weekly oscillations are pretty much the only movement, with just minor wobbles around a ceiling.

I'll fast forward and show the full chart, spanning 1000 samples (over three years' worth of daily measurements.)

Now we can see that the ceiling established at 65 held against upward pressure until about 250 days in, when it finally gave way and we reached a new support at about 80. That support lasted for another year, when we started to see some gradual downward pressure resulting in a pullback to the mid-70s.

You've probably realized by now that I'm playing a bit of a game with you. These charts aren't from any stock market or weather data. In fact, they're completely random. I started with a base value of 55 and added a little random value each "day".
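A random walk like that takes only a few lines to generate. This is a sketch, not the code behind the charts: the base value of 55 comes from the post, while the step size, the uniform distribution, and the name `random_walk` are all assumptions.

```ruby
# Random walk sketch: start at a base value and add a small random step
# each "day". Step size and distribution are assumed, not taken from the
# original charts.
def random_walk(days: 1000, start: 55.0, step: 1.0, rng: Random.new(42))
  values = [start]
  (days - 1).times { values << values.last + rng.rand(-step..step) }
  values
end

series   = random_walk
mean     = series.sum / series.size
variance = series.sum { |v| (v - mean)**2 } / series.size

printf("samples: %d  mean: %.1f  variance: %.1f\n", series.size, mean, variance)
```

Changing the seed gives a completely different "history", with its own apparent ceilings, supports, and pullbacks.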

When you see the final chart, it's easy to see it as the result of a random number generator. If you were to live this chart, day by day, however, it's exceedingly hard not to impose some kind of meaning or interpretation on it. The tough part is that you actually can see some patterns in the data. I didn't force the weekly oscillations into the random number function, they just appeared in the graph. We are all exceptionally good at pattern detection and matching. We're so good, in fact, that we find patterns all over the place. When we are confronted with obvious patterns, we tend to believe that they're real or that they emerge from some underlying, meaningful structure. But sometimes, they're really just nothing more than randomness.

Nassim Nicholas Taleb is today's guru of randomness, but Benoit Mandelbrot wrote about it earlier in the decade, and Benjamin Graham wrote about this problem back in the 1920's. I suspect someone has sounded this warning every decade since statistics were invented. Graham, Mandelbrot, and Taleb all tell us that, if we set out to find patterns in historical data, we will always find them. Whether those patterns have any intrinsic meaning is another question entirely. Unless we discover that there are real forces and dynamics that underlie the data, we risk fooling ourselves again and again.

We can't abandon the idea of prediction, though. Randomness is real, and we have a tendency to be fooled by it. Still, even in the face of those facts, we really do have to make predictions and forecasts. Fortunately, there are about a dozen really effective ways to deal with the fundamental uncertainty of the future. I'll spend a few posts exploring them.

Booklist

2009-02-14T18:10:34-06:00

I made a LibraryThing list of books relevant to the stuff that’s banging around in my head now. These are in no particular order or organization. In fact, this is a live widget, so it might change as I think of other things that should be on the list.

The key themes here are time, complexity, uncertainty, and constraints. If you’ve got recommendations along these lines, please send them my way.

2024 update: the widget no longer works, the account is lost, this list is a victim of linkrot

Cold Turkey

2009-02-13T22:10:23-06:00

Subtle Interactions, Non-local Problems

2009-02-12T09:54:57-06:00

Alex Miller has a really interesting blog post up today. In LBQ + GC = slow, he shows how LinkedBlockingQueue can leave a chain of references from tenured dead objects to live young objects. That sounds really dirty, but it actually means something to Java programmers. Something bad.

The effect here is a subtle interaction between the code and the mostly hidden, yet omnipresent, garbage collector. This interaction just happens to hit a known sore spot for the generational garbage collector. I won't spoil the ending, because I want you to read Alex's piece.

In effect, a one-line change to LinkedBlockingQueue has a dramatic effect on the garbage collector's performance. In fact, because the problem causes more full GC's, you'd be likely to observe this problem in an area completely unconnected with the queue itself. By leaving these refchains worming through multiple generations in the heap, the queue damages a resource needed by every other part of the application.

This is a classic common-mode dependency, and it's very hard to diagnose because it results from hidden and asynchronous coupling.

Combining here docs and blocks in Ruby

2009-02-06T10:43:21-06:00

Like a geocache, this is another post meant to help somebody who stumbles across it in a future Google search. (Or as an external reminder for me, when I forget how I did this six months from now.)

I've liked here-documents since the days of shell programming. Ruby has good support for here docs with variable interpolation. For example, if I want to construct a SQL query, I can do this:

def build_query(customer_id)
  <<-STMT
    select *
      from customer
     where id = #{customer_id}
  STMT
end

Disclaimer: Don't do this if customer_id comes from user input!

Recently, I wanted a way to build inserts using a matching number of column names and placeholders.

def build_query
  <<-STMT
    insert into #{table} ( #{columns()} ) values ( #{column_placeholders()} )
  STMT
end

In this case, columns and column_placeholders were both functions.
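The two helper functions aren't shown, so here is one way they might look. This is a sketch under assumptions: the `COLUMNS` list, the `customer` table name, and the `?` placeholder style are invented for the example.

```ruby
# Hypothetical column list and table name, just for illustration.
COLUMNS = %w[id name email].freeze

def table
  'customer'
end

# Comma-separated column names for the insert statement.
def columns
  COLUMNS.join(', ')
end

# One '?' per column, so the two lists always have the same length.
def column_placeholders
  (['?'] * COLUMNS.size).join(', ')
end

def build_query
  <<-STMT
    insert into #{table} ( #{columns()} ) values ( #{column_placeholders()} )
  STMT
end

puts build_query.strip
```

Deriving both lists from the same array is what keeps the column names and placeholders matched.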

One oddity I ran into is the combination of here documents and block syntax. RubyDBI lets you pass a block when executing a query, the same way you would pass a block to File::open(). The block gets a "statement handle", which gets cleaned up when the block completes.

  dbh.execute(query) { |sth| 
    sth.fetch() { |row|
      # do something with the row
    }
  }

Combining these two lets you write something that looks like SQL invading Ruby:

  dbh.execute(<<-STMT) { |sth|
      select distinct customer, business_unit_id, business_unit_key_name
       from problem_ticket_lz
       order by customer
    STMT
    sth.fetch { |row|
      print "#{row[1]}\t#{row[0]}\t#{row[2]}\n"
    }
  }

This looks pretty good overall, but take a look at how the block opening interacts with the here doc. The here doc appears to be line-oriented, so it always begins on the line after the <<-STMT token. On the other hand, the block open follows the function, so the here doc gets lexically interpolated in the middle of the block, even though it has no syntactic relation to the block. No real gripe, just an oddity.
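The same layout can be tried without a database. In this sketch, `execute` is a stand-in method invented for the example (it is not RubyDBI); it just yields the query text to the block.

```ruby
# Stand-in for dbh.execute: yields the query string to the block.
# This method is hypothetical and exists only for the demonstration.
def execute(query)
  yield query
end

# The block opens on the same line as the <<-STMT token, but the here doc
# body occupies the following lines, up to the closing STMT.
result = execute(<<-STMT) { |q|
    select 1
  STMT
  q.strip.upcase
}

puts result
```

The block body only really starts after the terminating STMT line, even though its opening brace appears before the here doc text.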

Beautiful Architecture

2009-02-05T15:45:56-06:00

O'Reilly has released "Beautiful Architecture," a compilation of essays by software and system architects. I'm happy to announce that I have a chapter in this book. The finished book is shipping now, and available through Safari. I think the whole thing has turned out amazingly well, both instructive and interesting.

One of the editors, Diomidis Spinellis, has posted an excellent description and summary.

Another Cause of TNS-12541

2009-02-05T15:13:21-06:00

Using a custom WindowProc from Ruby

2009-01-26T09:24:06-06:00

This is off the beaten path today, maybe even off the whole reservation. Still, I searched for some code to do this, and couldn't find it. Maybe this will help somebody else trying to do the same thing.

I'm currently prototyping a desktop utility using Ruby and wxRuby. The combination actually makes Windows desktop programming palatable, which is a very pleasant surprise.

Part of what I'm doing involves showing messages with Snarl. I want my Ruby program to generate messages that can be clicked. Snarl is happy to tell you that your message has been clicked. It does it by sending your window a message, using whatever message code you want.

So, for example, if I want to get a WM_USER message back, then I create a new notification like this:

@msg = Snarl.new('Clickable message', {:message => 'Click me, please!', :timeout => Snarl::NO_TIMEOUT, :reply_window => @win_handle, :reply_window_message => Windows::WM_USER})

If the user clicks on my message, I'll get a WM_USER event delivered to my window (identified by @win_handle). Since I'm using wxRuby, which wraps wxWidgets, that presents a bit of a problem. Although wxWidgets allows you to subclass its default window proc, wxRuby does not. A couple of forum posts suggested using the Windows API to hook the window proc, which is what I did.

Here's the code:

begin
  require 'rubygems'
rescue LoadError
end

I installed wxRuby as a gem, so that's boilerplate.

require 'lib/snarl'
require 'wx'
require 'windows/api'

module WindProc
  include Windows
  
  GWL_WNDPROC = -4

  # The Win32 WM_USER base is 0x0400; 0x04FF is a private message in that range.
  WM_USER = 0x04FF

  API.auto_namespace = 'WindProc'
  API.auto_constant = true

  API.new('SetWindowLong', 'LIK', 'L', 'user32')
  API.new('CallWindowProc', 'PIIIL', 'L', 'user32')
end

This module just gets me access to the Windows API functions SetWindowLong and CallWindowProc. SetWindowLong is deprecated in favor of SetWindowLongPtr, but I couldn't get that to load properly through the windows/api module. At some point, when you're prototyping something, you just have to decide not to solve every puzzle, especially if you can find a workable alternative.

API.new() constructs a Ruby object implemented by some C native code. It uses the prototype string in the second argument to translate Ruby parameters into C values when you eventually call the API function. The conversion is done in glue code that knows how to map some Ruby primitives to C values, but it's not all that bright. In particular, there's no way to introspect on the Win32 API itself to see if you're lying to the glue code. In fact, I'm lying a little bit here. The prototype I used---'LIK'---tells the API module that I'm looking for a function that takes a long, an integer, and a callback. Strictly speaking, this should have been 'LIL', but I needed the glue code to convert a Ruby procedure into a C pointer.

The next section defines a subclass of Wx::Frame, the base type for all standalone windows.

class HookedFrame < Wx::Frame
  def initialize(parent, id, title)
    super(parent, -1, title)

    evt_window_create() { |event| on_window_create(event) }
  end

I register a handler for the window create event. At this point, I'm still within the bounds of wxWidgets' own event handling framework. The interesting bits happen inside the on_window_create method.

  def on_window_create(event)
    @old_window_proc = 0
    @my_window_proc = Win32::API::Callback.new('LIIL', 'I') { |hwnd, umsg, wparam, lparam|
      if not self.hooked_window_proc(hwnd, umsg, wparam, lparam) then
        WindProc::CallWindowProc.call(@old_window_proc, hwnd, umsg, wparam, lparam)
      end
    }
    @old_window_proc = WindProc::SetWindowLong.call(self.handle, WindProc::GWL_WNDPROC, @my_window_proc)
  end

There are several juicy bits here. First, I'm using Win32::API::Callback.new() to create a callback object. How does this get used? It's a little roundabout. When I call WindProc::SetWindowLong(), I pass the callback object. (This is why I used 'LIK' as the prototype string earlier.) Now, WindProc::SetWindowLong() isn't just a pointer to the native Windows library function. It's actually a Ruby object that wraps the library function. The API object is implemented by C code. Like the API object, the callback object is a Ruby object implemented by C code. In particular, it has an ivar that points to a Ruby procedure. Because I passed a block to Callback.new(), the block itself will be the procedure. Inside API.call(), any argument of type "K" gets set as the "active callback" and then substituted with a C function called CallbackFunction. CallbackFunction looks up the active callback, translates parameters according to the callback's prototype, then tells Ruby to invoke the proc associated with the callback.

Whew.

So, I call SetWindowLong.call(), passing it the Callback I created with a block. SetWindowLong.call() ultimately calls the Windows DLL function SetWindowLong, passing it the address of CallbackFunction. When Windows calls CallbackFunction, it looks up the Ruby Callback object and invokes its procedure.

Another oddity. For some reason, although the callback object has an instance variable called @function, there seems to be no way to set it after construction. If you pass a block, @function will point to the block. If you don't, @function will be nil, with no way to set it to anything else. In other words, the API will happily let you create useless Callback objects.

The rest is easy. Inside my block, I just call out to a method that can be overridden by descendants of HookedFrame. My test implementation just blurts out some stuff to let me know the plumbing is working.

  def hooked_window_proc(hwnd, uMsg, wParam, lParam)
    puts "In the hook: 0x#{uMsg.to_s(16)}\t#{wParam}\t#{lParam}\n"
    if uMsg == NotifierApp::WM_USER then
      puts "That's what I've been waiting to hear:\t#{wParam}\t#{lParam}\n"
      # Handled here, so tell the hook not to fall through to the old window proc.
      return true
    end
    false
  end
end

As I reviewed this post, I realized something else. ActiveCallback is static in the C glue code. That means there can only be one callback set at a time. If I called some other Windows API function with its own callback, that would overwrite the reference to my Ruby code. But, Windows would still keep calling to the same pointer as before. In other words, calling any other Windows API function that takes a callback would cause that callback to become my window proc! Yikes!
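The hazard is easier to see in a toy model. The sketch below is pure Ruby and does not touch the real glue code: `ToyGlue`, `register`, and `trampoline` are invented names that mimic the behavior described above, with one shared slot for the active callback and one shared function that dispatches to whatever is in the slot.

```ruby
# Toy model only. None of these names come from the win32-api library.
module ToyGlue
  @active_callback = nil  # the single "static" slot

  # Mimics passing a 'K' argument: remember the proc, and hand the
  # native side the one shared trampoline.
  def self.register(callback)
    @active_callback = callback
    method(:trampoline)
  end

  # Mimics CallbackFunction: dispatches to whichever callback is active now.
  def self.trampoline(*args)
    @active_callback.call(*args)
  end
end

# "Windows" holds on to this pointer as the window proc.
window_proc = ToyGlue.register(->(msg) { "window proc saw #{msg}" })
first = window_proc.call(:WM_USER)

# Some unrelated API call registers its own callback...
ToyGlue.register(->(msg) { "other callback saw #{msg}" })

# ...and the pointer Windows kept now reaches the wrong Ruby code.
second = window_proc.call(:WM_USER)

puts first
puts second
```

Giving each registration its own slot would avoid the overwrite in this model; whether the real C glue could do the same is not something this sketch can answer.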

Overall, this works, but seems like a kludge. Ironically, even as I got this working, I started getting dissatisfied with Snarl itself. I think I need more flexibility to display persistent information, rather than just alerts.

OTUG Tonight

2009-01-20T12:09:02-06:00

Attack of Self-Denial, 2008 Style

2008-12-13T10:42:09-06:00

(Human | Pattern) Languages, part 2

2008-12-08T00:56:17-06:00

At the conclusion of the modulating bridge, we expect to be in the contrasting key of C minor. Instead, the bridge concludes in the distantly related key of F sharp major... Instead of resolving to the tonic, the cadence concludes with two isolated E pitches. They are completely ambiguous. They could belong to E minor, the tonic for this movement. They could be part of E major, which we've just heard peeking out from behind the minor mode curtains. [He] doesn't resolve them into a definite key until the beginning of the third movement, characteristically labeled a "Scherzo".

In my last post, I lamented the missed opportunity we had to create a true pattern language about software. Perhaps calling it a missed opportunity is too pessimistic. Bear with me on a bit of a tangent. I promise it comes back around in the end.

The example text above is an amalgam of a lecture series I've been listening to. I'm a big fan of The Teaching Company and their courses. In particular, I've been learning about the meaning and structure of classical, baroque, romantic, and modern music from Professor Robert Greenberg. [1] The sample I used here is from a series on Beethoven's piano sonatas. This isn't an actual quote, but a condensation of statements from one of the lectures. I'm not going to go into all the music theory behind this, but it is interesting. [2]

There are two things I want you to observe about the sample text. First, it's loaded with jargon. It has to be! You'd exhaust the conversational possibilities about the best use of a D-sharp pretty quickly. Instead, you'll talk about structures, tonalities, relationships between that D-sharp and other pitches. (D-sharp played together with a C? Very different from a quick sequence of D-sharp, E, D-sharp, C.) You can be sure that composers don't think in terms of individual notes. A D-sharp by itself doesn't mean anything. It only acquires meaning by its relation to other pitches. Hence all that stuff about keys---tonic, distantly related, contrasting. "Key" is a construct for discussing whole collections of pitches in a kind of shorthand. To a musician, there's a world of difference between G major and A flat minor, even though the basic pitch (the tonic) is only one half-step apart.

Also notice that the text addresses some structural features. The purpose and structure of a modulating bridge is pretty well understood, at least in certain circles. The notion that you can have an "expected" key certainly implies that there are rules for a sonata. In fact, the term "sonata" itself means some fairly specific things [3]... although to know whether we're talking about "a sonata" or "a movement in sonata form" requires some additional context.

In fact, this paragraph is all about context. It exists in the context of late Classical, early Romantic era music, specifically the music of Beethoven. In the Classical era, musical forms---such as sonata form---pretty much dictated the structure of the music. The number of movements, their relationships to each other, their keys, and even their tempos were well understood. A contemporary listener had every reason to expect that a first movement would be fast and bright, and if the first movement was in C major, then the second, slower movement would be a minuet and trio in G major.

Music and music theory have evolved over the last thousand-odd years. We have a vocabulary---the potentially off-putting jargon of the field. We have nesting, interrelating contexts. Large scale patterns (a piano sonata) create context for medium scale patterns (the first movement "allegretto") which in turn, create context for the medium and small scale patterns (the first theme in the allegretto consists of an ABA'BA phrasing, in which the opening theme sequences a motive upward over octaves.) We even have the ability to talk about non sequiturs---like the modulating bridge above---where deliberate violation of the pattern language is done for effect. [4]

What is all this stuff if it isn't a pattern language?

We can take a few lessons, then, from the language of music.

The first lesson is this: give it time. Musical language has evolved over a long time. It has grown and been pruned back over centuries. New terms are invented as needed to describe new answers to a context. In turn, these new terms create fresh contexts to be exploited with yet other inventions.

Second, any such language must be able to assimilate change. Nothing is lost, even amidst the most radical revolutions. When the Twentieth Century modernists rejected the tonal system, they could only reject the structures and strictures of that language. They couldn't destroy the language itself. Phish plays fugues in concert... they just play them with electric guitars instead of harpsichords. There are Baroque orchestras today. They play in the same concert halls as the Pops and Philharmonics. The homophonic texture of plain chant still exists, and so do the once-heretical polyphony and church-sanctioned monophony. Nothing is lost, but new things can be encompassed and incorporated.

And, mainframes still exist with their COBOL programs, together with distributed object systems, message passing, and web services. The Singleton and Visitor patterns will never truly go away, any more than batch programming will disappear.

Third, we must continue to look at the relationships between different parts of our nascent pattern language. Just as individual objects aren't very interesting, isolated patterns are less interesting than the ways they can interact with each other.

I believe that the true language of software has as much to do with programming languages as the language of music has to do with notes. So, instead of missed opportunity, let us say instead that we are just beginning to discover our true language.

1. Professor Greenberg is a delightful traveling companion. He's witty, knowledgeable and has a way of teaching complex subjects without ever being condescending. He also sounds remarkably like Penn Jillette.

2. The main reason is that I would surely get it wrong in some details and risk losing the main point of my post here.

3. And here we see yet another of the complexities of language. The word "sonata" refers, at different times, to a three movement concert work, a single movement in a characteristic structure, a four movement concert work, and in Beethoven's case, to a couple of great fantasias that he declares to be sonatas simply because he says so.

4. For examples ad nauseam, see Richard Wagner and the "abortive gesture".

(Human | Pattern) Languages

2008-12-08T00:19:36-06:00

We missed the point when we adopted "patterns" in the software world. Instead of an organic whole, we got a bag of tricks.

The commonly accepted definition of a pattern is "a solution to a problem in a context." This is true, but limiting. This definition loses an essential characteristic of patterns: Patterns relate to other patterns.

We talk about the context of a problem. "Context" is a mental shorthand. If we unpack the context it means many things: constraints, capabilities, style, requirements, and so on. We sometimes mislead ourselves by using the fairly fuzzy, abstract term "context" as a mental handle on a whole variety of very concrete issues. Context includes stated constraints like the functional requirements, along with unstated constraints like, "The computation should complete before the heat death of the universe." It includes other forces like, "This program is written in C#, so the solution to this problem should be in the same language or a closely related one." It should not require a supercooled quantum computer, for example.

Where does the context for a small-scale pattern originate?1 Context does not arise ex nihilo. No, the context for a small-scale pattern is created by larger patterns. Large grained patterns create the fabric of forces that we call the context for smaller patterns. In turn, smaller patterns fit into this fabric and, by their existence, they change it. Thus, the small scale patterns create feedback that can either resolve or exacerbate tensions inherent in the larger patterns.

Solutions that respect their context fit better with the rest of the organic whole. It would be strange to be reading some Java code, built into a layered architecture with a relational database for storage, then suddenly find one component that has its own LISP interpreter and some functional code. With all respect to "polyglot programming", there'd better be a strong motivation for such an odd inclusion. It would be a discontinuity... in other words, it doesn't fit the context I described. That context---the layered architecture, the OO language, relational database---was created by other parts of the system.

If, on the other hand, the system was built as a blackboard architecture, using LISP as glue code over intelligent agents acting asynchronously, then it wouldn't be at all odd to find some recursive lambda expressions. In that context, they fit naturally and the Java code would be an oddity.

This interrelation across scale knits patterns together into a pattern language. By and large, what we have today is a growing group of proper nouns. Please don't get me wrong, the nouns themselves have use. It's very helpful to say "you want a Null Object there," and be understood. That vocabulary and the compression it provides is really important.

But we shouldn't mistake a group of nouns for a real pattern language. A language is more than just its nouns. A language also implies ways of connecting statements sensibly. It has idioms and semantics and semiotics.2 In a language, you can have dialog and argumentation. Imagine a dialog in patterns as they exist today:

"Pipes and filters."

"Observer?"

"Chain of Responsibility!"

You might be able to make a comedy sketch out of that, but not much more. We cannot construct meaningful dialogs about patterns at all scales.

What we have are fragments of what might become a pattern language. GoF, the PLoPD books, the PoSA books... these are like a few charted territories on an unmapped continent. We don't yet have the language that would even let us relate these works together, let alone relate them to everything else.

Everything else? Well, yes. By and large, patterns today are an outgrowth of the object-oriented programming community. I contend, however, that "object-oriented" is a pattern! It's a large-scale pattern that creates really significant context for all the other patterns that can work within it. Solutions that work within the "object-oriented" context make no sense in an actor-oriented context, or a functional context, or a procedural context, and so on. Each of these other large-scale patterns admits different solutions to similar problems: persistence, user interaction, and system integration, to name a few. I can imagine a pattern called "Event Driven" that would work very well with "Object oriented", "Functional", and "Actor Oriented", but somewhat less well with "Procedural programming", and contradict utterly with "Batch Processing". (Though there might be a link between them called "Buffer file" or something like that.)

That's the piece that we missed. We don't have a pattern language yet. We're not even close.

1. By "large" and "small", I don't mean to imply that patterns simply nest hierarchically. It's more complex and subtle than that. When we do have a real pattern language, we'll find that there are medium-grained patterns that work together with several, but not all, of the large ones. Likewise, we'll find small-scale patterns that make medium sized ones more or less practical. It's not a decision tree or a heuristic.

2. That's what keeps, "Fill the idea with blue" from being a meaningful sentence. All the words work, and they're even the right part of speech, yet the sentence as a whole doesn't fit together.

Connection Pools and Engset

2008-12-03T09:35:25-06:00

In my last post, I talked about using Erlang models to size the front end of a system. By using some fundamental capacity models that are almost a century old, you can estimate the number of request handling threads you need for a given traffic load and request duration.

Inside the Box

It gets tricky, though, when you start to consider what happens inside the server itself. Processing the request usually involves some kind of database interaction with a connection pool. (There are many ways to avoid database calls, or at least minimize the damage they cause. I'll address some of these in a future post, but you can also check out Two Ways to Boost Your Flagging Web Site for starters.) Database calls act like a kind of "interior" request that can be considered to have its own probability of queuing.

Because this interior call can block, we have to consider what effects it will have on the duration of the exterior call. In particular, the exterior call must take at least the sum of the blocking time plus the processing time for the interior call.

At this point, we need to make a few assumptions about the connection pool. First, the connection pool is finite. Every connection pool should have a ceiling. If nothing else, the database server can only handle a finite number of connections. Second, I'm going to assume that the pool blocks when exhausted. That is, calling threads that can't get a connection right away will happily wait forever rather than abandoning the request. This is a simplifying assumption that I need for the math to work out. It's not a good configuration in practice!

With these assumptions in place, I can predict the probability of blocking within the interior call. It's a formula closely related to the Erlang model from my last post, but with a twist. The Erlang models assume an essentially infinite pool of requestors. For this interior call, though, the pool of requestors is quite finite: it's the number of request handling threads for the exterior calls. Once all of those threads are busy, there aren't any left to generate more traffic on the interior call!

The formula to compute the blocking probability with a finite number of sources is the Engset formula. Like the Erlang models, Engset originated in the world of telephony. It's useful for predicting the outbound capacity needed on a private branch exchange (PBX), because the number of possible callers is known. In our case, the request handling threads are the callers and the connection pool is the PBX.
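For readers who want to experiment, here is a minimal sketch of the textbook Engset blocking formula in Python. It assumes the common parameterization in which a is the load offered by each idle source (its request rate while idle times the mean holding time). The function name, argument names, and the value a = 1.0 are illustrative assumptions, and this parameterization is not necessarily the one behind the table in the practical example, so the printed numbers need not match it.

```python
from math import comb


def engset_blocking(servers, sources, a):
    """Chance that a request from one of `sources` callers finds all `servers` busy.

    `a` is the offered load per idle source (dimensionless): the rate at which
    an idle source generates requests multiplied by the mean holding time.
    """
    if servers >= sources:
        # With at least one server per source, nobody can ever be blocked.
        return 0.0
    # Textbook Engset form: C(S-1, N) a^N / sum_{i=0..N} C(S-1, i) a^i
    terms = [comb(sources - 1, i) * a ** i for i in range(servers + 1)]
    return terms[-1] / sum(terms)


if __name__ == "__main__":
    threads = 40          # request handling threads acting as the "callers"
    per_source_load = 1.0  # hypothetical value, chosen only for illustration
    for pool_size in (0, 10, 20, 30, 40):
        p = engset_blocking(pool_size, threads, per_source_load)
        print(f"{pool_size:2d} connections: {p:.5%}")
```

The useful property to notice is that blocking falls to exactly zero once the pool is as large as the number of callers, which the Erlang models, with their unbounded population, never predict.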

Practical Example

Using our 1,000,000 page views per hour from last time, Table 1 shows the Engset table for various numbers of connections in the pool. This assumes that the application server has a maximum of 40 request handling threads. This also supposes that the database processing time uses 200 milliseconds of the 250 milliseconds we measured for the exterior call.

N	Engset(N,A,S)
0	100.00000%
1	98.23183%
2	96.37740%
3	94.43061%
4	92.38485%
5	90.23293%
6	87.96709%
7	85.57891%
8	83.05934%
9	80.39867%
10	77.58656%
11	74.61210%
12	71.46397%
13	68.13065%
14	64.60087%
15	60.86421%
16	56.91211%
17	52.73932%
18	48.34604%
19	43.74105%
20	38.94585%
21	34.00023%
22	28.96875%
23	23.94730%
24	19.06718%
25	14.49235%
26	10.40427%
27	6.97050%
28	4.30152%
29	2.41250%
30	1.21368%
31	0.54082%
32	0.21081%
33	0.07093%
34	0.02028%
35	0.00483%
36	0.00093%
37	0.00014%
38	0.00002%
39	0.00000%
40	0.00000%

Notice that when we get to 18 connections in the pool, the probability of blocking drops below 50%. Also, notice how sharply the probability of blocking drops off around 23 to 31 connections in the pool. This is a decidedly nonlinear effect!

From this table, it's clear that even though there are 40 request handling threads that could call into this pool, there's not much point in having more than 30 connections in the pool. At 30 connections, the probability of blocking is already down to about 1.2%, and at 31 it falls below 1%, meaning that the queuing time is only going to add a few milliseconds to the average request.

Why do we care? Why not just crank up the connection pool size to 40? After all, if we did, then no request could ever block waiting for a connection. That would minimize latency, wouldn't it?

Yes, it would, but at a cost. Increasing the number of connections to the database by a third means more memory and CPU time on the database just managing those connections, even if they're idle. If you've got two app servers, then the database probably won't notice an extra 10 connections. Suppose you scale out at the app tier, though, and you now have 50 or 60 app servers. You'd better believe that the DB will notice an extra 500 to 600 connections. They'll affect memory needs, CPU utilization, and your ability to fail over correctly when a database node goes down.

Feedback and Coupling

There's a strong coupling between the total request duration in the interior call and the request duration for the exterior call. If we assume that every request must go through the database call, then the exterior response time must be strictly greater than the interior blocking time plus the interior processing time.

In practice, it actually gets a little worse than that, as this causal loop diagram illustrates.

It reads like this: "As the interior call blocking time increases, the exterior call duration increases. As the exterior call duration increases, the interior call blocking time increases." This type of representation helps clarify relations between the different layers. It's very often the case that you'll find feedback loops this way. Any time you do find a feedback loop, it means that slowdowns will produce increasing slowdowns. Blocking begets blocking, quickly resulting in a site hang.

Conclusions

Queues are like timing dots. Once you start seeing them, you'll never be able to stop. You might even start to think that your entire server farm looks like one vast, interconnected set of queues.

That's because it is.

People use database connection pools because creating new connections is very slow. Tuning your database connection pool size, however, is all about optimizing the cost of queueing against the cost of extra connections. Each connection consumes resources on the database server and in the application server. Striking the right balance starts by identifying the required exterior response time, then sizing the connection pool---or changing the architecture---so the interior blocking time doesn't break the SLA.

For much, much more on the topic of capacity modeling and analysis, I definitely recommend Neil Gunther's website, Performance Agora. His books are also a great---and very practical---way to start applying performance and capacity management.

Thread Pools and Erlang Models

2008-11-30T20:32:12-06:00

Sizing, Danish Style

Folks in telecommunications and operations research have used Erlang models for almost a century. A. K. Erlang, a Danish telephone engineer, developed these models to help plan the capacity of the phone network and predict the grade of service that could be guaranteed, given some basic metrics about call volume and duration. Telephone networks are expensive to deploy, particularly when upgrading your trunk lines involves digging up large portions of rocky Danish ground or running cables under the North Sea.

The Erlang-B formula predicts the probability that an incoming call cannot be serviced, based on the call arrival rate, average call time, and number of lines available. Erlang-C is similar, but allows for calls to be queued while waiting for service. It predicts the probability that a call will be queued. It can also show when calls will never be serviced, because the rate of arriving calls exceeds the system's total capacity to serve them.

Erlang models are widely used in telecom, including GPRS network sizing, trunk line sizing, call center staffing models, and other capacity planning arenas where request arrival is apparently random. In fact, you can use them to predict the capacity and wait time at a restaurant, bank branch, or theme park, too.

It should be pretty obvious that Erlang models are widely applicable in computer performance analysis, too. There's a rich body of literature on this subject that goes back to the dawn of the mainframe. Erlang models are the foundation of most capacity management groups. I'm not even going to scratch the surface here, except to show how some back-of-the-envelope calculations can help you save millions of dollars.

One Million Page Views

In my case, I wanted to look at thread pool sizing. Suppose you have an even 1,000,000 requests per hour to handle. This implies an arrival rate (or lambda) of 0.27777... requests per millisecond. (Erlang units are dimensionless, but you need to start with the same units of time, whether it's hours, days, or milliseconds.) I'm going to assume for the moment that the system is pretty fast, so it handles a request in 250 milliseconds, on average.

(Please note that there are many assumptions underneath simple statements like "on average". For the moment, I'll pretend that request processing time follows a normal distribution, even though any modern system is more likely to be bimodal.)

Table 1 shows a portion of the Erlang-C table for these parameters. Feel free to double-check my work with this spreadsheet or this short C program to compute the Erlang-B and Erlang-C values for various numbers of threads. (Thanks to Kenneth J. Christensen for the original program. I can only claim credit for the extra "for" loop.)

Table 1. Erlang-C values at 250 ms / request

N	Pr_Queue (Erlang-C)
67	undef
68	undef
69	undef
70	0.921417281
71	0.791698369
72	0.676255938
73	0.574128540
74	0.484342834
75	0.405921606
76	0.337892350
77	0.279296163
78	0.229196685
79	0.186688788
80	0.150906701
81	0.121031288
82	0.096296202
83	0.075992736
84	0.059473196
85	0.046152756
86	0.035509802
87	0.027084849
88	0.020478191
89	0.015346497
90	0.011398581
91	0.008390600
92	0.006120940
93	0.004424999
94	0.003170077
95	0.002250524
96	0.001583268
97	0.001103786
98	0.000762573
99	0.000522098

From Table 1, I can immediately see that anything fewer than 70 threads will never keep up. With fewer than 70 threads, the queue of unprocessed requests will grow without bound. I need at least 91 threads to get below a 1% chance that a request will be delayed by queueing.
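The calculation behind a table like this can be sketched in a few lines of Python. This is not the C program mentioned above; it is a minimal sketch using the standard Erlang-B recurrence and the usual conversion from Erlang-B to Erlang-C, with function names of my own choosing. It returns None where the table shows "undef", that is, when the offered load meets or exceeds the number of threads.

```python
def erlang_b(servers, load):
    """Erlang-B: probability that a request is blocked, for `load` Erlangs."""
    b = 1.0  # with zero servers every request is blocked
    for n in range(1, servers + 1):
        # Standard recurrence: B(n) = A*B(n-1) / (n + A*B(n-1))
        b = load * b / (n + load * b)
    return b


def erlang_c(servers, load):
    """Erlang-C: probability that a request must queue, or None if overloaded."""
    if servers <= load:
        return None  # arrivals outpace capacity, so the queue grows without bound
    b = erlang_b(servers, load)
    rho = load / servers  # utilization per server
    return b / (1.0 - rho * (1.0 - b))


if __name__ == "__main__":
    arrival_rate = 1_000_000 / 3_600_000  # requests per millisecond
    service_time = 250                    # milliseconds per request
    load = arrival_rate * service_time    # offered load in Erlangs

    for threads in (69, 70, 80, 90, 91):
        print(threads, erlang_c(threads, load))
```

The offered load is just the arrival rate multiplied by the service time, which is why both have to be expressed in the same unit of time before multiplying.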

Performance and Capacity

Now, what happens if the average request processing time goes up by 100 milliseconds on those same million requests? Adjusting the parameters, I get Table 2.

Table 2. Erlang-C values at 350 ms / request

N	Pr_Queue (Erlang-C)
96	undef
97	undef
98	0.907100356
99	0.797290966
100	0.697789489
101	0.608014385
102	0.527376532
103	0.455282634
104	0.391138874
105	0.334354749
106	0.284347016
107	0.240543652
108	0.202387733
109	0.169341130
110	0.140887936
111	0.116537521
112	0.095827141
113	0.078324041
114	0.063626999
115	0.051367297
116	0.041209109
117	0.032849334
118	0.026016901
119	0.020471625
120	0.016002658
121	0.012426630
122	0.009585560
123	0.007344611
124	0.005589775
125	0.004225555

Now we need a minimum of 98 threads before we can even expect to keep up and we need 122 threads to get down under that 1% queuing threshold.

On the other hand, what about increasing performance by 100 milliseconds per request? I'll let you run the calculator for that, but it looks to me like we need between 42 and 59 threads to meet the same thresholds.

That swing, from 150 to 350 milliseconds per request, makes a huge difference in the number of concurrent threads your system must support to handle a million requests per hour---more than a factor of 2. Would you be willing to more than double your hardware for the same request volume? Next time anyone says that "CPU is cheap", fold your arms and tell them "Erlang would not approve." On the flip side, it might be worth spending some administrator time on performance tuning to bring down your average page latency. Or maybe some programmer time to integrate memcached so every single page doesn't have to trudge all the way to the database.
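To see how thread counts move with request latency, the sweep can be automated. The following Python sketch assumes the standard Erlang-C formula; threads_needed and its parameters are hypothetical names introduced for illustration. For each service time it reports the smallest thread count that can keep up at all and the smallest count that keeps the chance of queueing under a chosen threshold.

```python
def erlang_c(servers, load):
    """Probability that a request queues (standard Erlang-C); needs servers > load."""
    b = 1.0
    for n in range(1, servers + 1):
        b = load * b / (n + load * b)  # Erlang-B recurrence
    rho = load / servers
    return b / (1.0 - rho * (1.0 - b))


def threads_needed(requests_per_hour, service_ms, max_queue_prob=0.01):
    """Return (minimum threads to keep up, threads to stay under max_queue_prob)."""
    load = requests_per_hour / 3_600_000 * service_ms  # offered load in Erlangs
    minimum = int(load) + 1  # smallest whole number strictly above the load
    n = minimum
    while erlang_c(n, load) >= max_queue_prob:
        n += 1
    return minimum, n


if __name__ == "__main__":
    for service_ms in (150, 250, 350):
        keep_up, under_one_percent = threads_needed(1_000_000, service_ms)
        print(f"{service_ms} ms: keep up with {keep_up}, "
              f"under 1% queueing with {under_one_percent}")
```

Changing max_queue_prob shows how much extra headroom a stricter service target costs at each latency.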

Summary and Extension

Obviously, there's a lot more to performance analysis for web servers than this. Over time, I'll be mixing more analytic pieces with the pragmatic, hands-on posts that I usually make. It'll take some time. For one thing, I have to go back and learn about stochastic processes and Markov chains. Pattern recognition and signal processing I've got. Advanced probability and statistics I don't got.

In fact, I'll offer a free copy of Release It to the first commenter who can show me how to derive an Erlang-like model that accounts for a) garbage collection times (bimodal processing time distribution), b) multiple coupled wait states during processing, c) non-equilibrium system states, and d) processing time that varies as a function of system utilization.

Constraint, Chaos, Collapse

2008-11-16T09:23:47-06:00

Licensing for Windows on EC2

2008-10-26T07:11:53-05:00

One thing I noticed when I fired up my first Windows instances on EC2 was that Windows never asked me for a license key. From examining the registry, it appears that a valid license key is installed at boot time. On two instances of image ami-b53cd8dc (ec2-public-windows-images/Server2003r2-i386-anon-v1.01 for i386) I got exactly the same key.

Likewise, on two different instances of ami-7b2bcf12 (ec2-public-windows-images/Server2003r2-x86_64-anon-v1.00 or x64), I got the same license key--though not the same key as the i386 image.

This tells me that the license key is probably baked into the image. It's also possible that these particular license keys are unique to my account. If someone else wants to compare keys, it'd be an interesting experiment.

Either way, the extra 2.5 cents per hour on the small instance must go to Microsoft to pay for license rental.

Windows on EC2, from a Mac

2008-10-23T13:54:30-05:00

It may be a bit perverse, but I wanted to hit a Windows EC2 instance from my Mac. After a little hitch getting started, I got it to work. There are a few quirks about accessing Windows instances, though.

First off, SSH is not enabled by default. You'll need to use remote desktop to access your instance. Remote desktop uses port 3389, so the first step is to create a new security group for Windows desktop access:

$ ec2-add-group windows -d 'Windows remote desktop access'
GROUP    windows    Windows remote desktop access

Then, allow access to port 3389 from your desired origin. I'm allowing it from anywhere, which isn't a great idea, but I'm on the road a lot. I never know what the hotel's network origin will be.

$ ec2-authorize windows -p 3389 -P tcp
GROUP        windows    
PERMISSION        windows    ALLOWS    tcp    3389    3389    FROM    CIDR    0.0.0.0/0

Obviously, you could add that permission to any existing group that you already use.

There's a bit of a song and dance to log in. Where Linux instances typically use SSH with public-key authentication, Windows server requires a typed password. Amazon has come up with a reasonable, but slightly convoluted, way to extract a randomized password.

You will need to start your instance in the new security group and with a keypair. The docs could be a little clearer, in that here you're providing the name of the keypair as it was registered with EC2. The first few times I tried this, I was giving it the path of the file containing the keypair, which doesn't work.

$ ec2-describe-keypairs
KEYPAIR    devkeypair    02:10:65:9e:51:73:7e:93:bd:30:e2:5d:91:03:d5:e1:d4:0e:c0:f4
$ ec2-run-instances ami-782bcf11 -g windows -k devkeypair
RESERVATION    r-82429ceb    001356815600    windows
INSTANCE    i-f172db98    ami-782bcf11            pending    devkeypair    0        m1.small    2008-10-23T20:01:36+0000    us-east-1a            windows

After all that, and waiting through a Windows boot cycle, you can access the Windows desktop through RDP.

What's that? You don't have an RDP client, because you're a Mac user? I like CoRD for that. I also saw a lot of references to rdesktop, which is available through Darwin Ports. (For today, I wasn't prepared to install Ports just to try out the Windows EC2 instance!)

Extract the public IP address of your instance:

$ ec2-describe-instances
RESERVATION    r-82429ceb    001356815600    windows
INSTANCE    i-f172db98    ami-782bcf11    ec2-75-101-252-238.compute-1.amazonaws.com    domU-12-31-39-02-48-31.compute-1.internal    running    devkeypair    0        m1.small    2008-10-23T20:01:36+0000    us-east-1a        windows

Fire up CoRD and paste the IP address into "Quick Connect".

Well, now what? Obviously, you'll use "Administrator" as the username, but what's the password? There's a new command in the latest release of ec2-api-tools called "ec2-get-password".

$ ec2-get-password i-f172db98 -k keys/devkeypair.pem
edhnsNG1J5

Note that this time, I'm using the path of my keypair file. EC2 uses this to decrypt the password from the instance's console output. At boot time, Windows prints out the password, encrypted with the public key from the keypair you named when starting the instance.

Success at last: fully logged in to my virtual Windows server from my Mac desktop.

Don't Break My Heart, EC2!

2008-10-23T11:22:59-05:00

I'm a huge booster of AWS and EC2. I have two talks about cloud computing, and one that's pretty specific to AWS, on the No Fluff, Just Stuff traveling symposium.

With today's announcement about EC2 coming out of beta, and about Windows support, I wanted to try out a Windows server on EC2.

Heartbreak!

ec2-describe-images -a | grep windows
IMAGE    ami-782bcf11    ec2-public-windows-images/Server2003r2-i386-anon-v1.00.manifest.xml    amazon    available    public        i386    machine        
IMAGE    ami-792bcf10    ec2-public-windows-images/Server2003r2-i386-EntAuth-v1.00.manifest.xml    amazon    available    public        i386    machine        
IMAGE    ami-7b2bcf12    ec2-public-windows-images/Server2003r2-x86_64-anon-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-7a2bcf13    ec2-public-windows-images/Server2003r2-x86_64-EntAuth-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-3934d050    ec2-public-windows-images/SqlSvrExp2003r2-i386-Anon-v1.00.manifest.xml    amazon    available    public        i386    machine        
IMAGE    ami-0f34d066    ec2-public-windows-images/SqlSvrExp2003r2-i386-EntAuth-v1.00.manifest.xml    amazon    available    public        i386    machine        
IMAGE    ami-8135d1e8    ec2-public-windows-images/SqlSvrExp2003r2-x86_64-Anon-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-9835d1f1    ec2-public-windows-images/SqlSvrExp2003r2-x86_64-EntAuth-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-6834d001    ec2-public-windows-images/SqlSvrStd2003r2-x86_64-Anon-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-6b34d002    ec2-public-windows-images/SqlSvrStd2003r2-x86_64-EntAuth-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-cd8b6ea4    khaz_windows2003srvEE/image.manifest.xml    602961847481    available    public        i386    machine        

mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10
Server.InsufficientInstanceCapacity: Insufficient capacity.
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10
Server.InsufficientInstanceCapacity: Insufficient capacity.
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10 -z us-east-1a
Server.InsufficientInstanceCapacity: Insufficient capacity.
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10 -z us-east-1b
Server.InsufficientInstanceCapacity: Insufficient capacity.
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10 -z us-east-1c
Server.InsufficientInstanceCapacity: Insufficient capacity.

Ack! Insufficient capacity?! That's not supposed to happen. Wait a second... let me try my own image:

mtnygard@donk /var/tmp/nms $ ec2-describe-images
IMAGE    ami-8a0beee3    com.michaelnygard/nms-base-v1.manifest.xml    001356815600    available    private        i386    machine        
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-8a0beee3
RESERVATION    r-0c4a9465    001356815600    default
INSTANCE    i-8e79d0e7    ami-8a0beee3            pending        0        m1.small    2008-10-23T17:25:21+0000    us-east-1c        
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10
Server.InsufficientInstanceCapacity: Insufficient capacity.

Very interesting. Looks like there's enough capacity to run all the Linux based images, but not enough for Windows?

Seems like there might be some contractual limit on how many Windows licenses Amazon is allowed to rent out. I would also infer some serious pent-up demand to eat them all up this quickly.

Or maybe it's just a glitch. We'll see.

Update [1:15 PM] I was just able to start five instances. Could be fluctuations in demand, or it could be clearing of a glitch. It's always hard to tell what's really happening inside the cloud.

Update [2:50 PM] My plaintive post in the AWS forums got a very quick response. The inscrutable wizard JeffW posted "we're working on it" and "it's fixed" messages just 3 minutes apart. We'll probably never know quite what was going on.

Perfection is Not Always Required

2008-10-14T21:08:22-05:00

In my series on dirty data, I made the argument that sometimes incomplete, inaccurate, or inconsistent data was OK. In fact, not only is it OK, but it can be an advantage.

There's a really slick Ruby library called WhatLanguage that illustrates this beautifully. The author also wrote a nice article introducing the library. WhatLanguage automatically determines the language that a piece of text is written in.

For example (from the article)

require 'whatlanguage'

"Je suis un homme".language      # => :french

Very nice.

WhatLanguage works by comparing words in the input text to a data structure that can tell you whether a word exists in the corpus. There's the catch, though. It can return a false positive! That would mean you get an incorrect "yes" sometimes for words that aren't in the language in question. On the other hand, it's guaranteed against false negatives.

You might imagine that there are pretty limited circumstances when you'd use a data structure that sometimes returns incorrect answers. (There is a calculable probability of a false positive. It never reaches zero.) It works for WhatLanguage, though.

You see, each word contributes to a histogram binned by possible language. Ultimately, one language "wins", based on whichever has the most entries in the histogram. False positives may contribute an extra point to incorrect languages, but the correct language will pretty much always emerge from the noise, provided there's enough source text to work from.
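The structure described here, one that may answer a false "yes" but never a false "no", matches a Bloom filter. The sketch below assumes that is what is in play and shows the voting idea in Python rather than Ruby. BloomFilter, guess_language, and the tiny word lists are hypothetical stand-ins for illustration, not WhatLanguage's actual API or data.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Set-like structure: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, word):
        # Derive several bit positions per word by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def __contains__(self, word):
        return all((self.bits >> pos) & 1 for pos in self._positions(word))


def guess_language(text, filters):
    """Each word votes for every language whose filter claims to contain it."""
    votes = Counter()
    for word in text.lower().split():
        for language, bloom in filters.items():
            if word in bloom:
                votes[language] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy vocabularies, far smaller than any real corpus.
FRENCH = ["je", "suis", "un", "homme", "le", "et"]
ENGLISH = ["i", "am", "a", "man", "the", "and"]

filters = {"french": BloomFilter(), "english": BloomFilter()}
for w in FRENCH:
    filters["french"].add(w)
for w in ENGLISH:
    filters["english"].add(w)

if __name__ == "__main__":
    print(guess_language("Je suis un homme", filters))
```

A stray false positive only adds one vote to the wrong bin, so with enough words the right language still comes out on top, which is the point of the histogram.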

So, there's another example of information emerging from noisy inputs, just as long as there's enough of it.

Arrival at JAOO

2008-09-27T23:34:14-05:00

Considering that it's 7:30 AM local time---where "local" means Aarhus, Denmark---and I'm awake and online, it looks like I've successfully reset my internal clock. Of course, my approach consisted of staying awake for 28 hours continuously then having three excellent beers with dinner. There are probably easier ways, and there may be repercussions later.

I've always heard good things about JAOO, so it was an honor and a delight to be invited. So far, just hanging around the hotel has been interesting. Waiting to check in yesterday evening, I encountered Richard Gabriel and one of the guys who designed Windows PowerShell. (He still calls it Monad, which I think was a much better name than "PowerShell". Also, I wish I'd gotten his name, but I was too distracted by the problem with my reservation.)

After dinner, I started chatting with some ThoughtWorkers over a game of ZombieFluxx. Two observations: first, ZombieFluxx is the kind of game that only a computer programmer or a lawyer could love. The deck of cards includes many cards that change the rules of the game itself. Gameplay changes from turn to turn based on the current state of the rule cards showing. There's even a card that requires you to groan like the undead whenever you turn over a new "zombie" card. Very meta. Second, it seems that TW people make up half of every conference I go to. They must have a fantastic training budget, because they are disproportionately represented relative to their much larger competitors like Accenture, Deloitte, and that crowd. Woe to the conference industry if ThoughtWorks falls on hard times.

My primary goal for today was to get over jetlag. Having accomplished that before 8 AM, I'll now see about straightening out my hotel situation. It's hard to think much about software when you may not have a roof over your head come nightfall.

Update: Got my hotel issues resolved. Now at a thoroughly modern, thoroughly Danish hotel called the "Best Western Oasia". Funny, but I always think of "Best Western" as the cruddy, mildewed cheap hotels off the Interstate in places like west Texas and Birmingham, Alabama. This hotel may cause me to reevaluate that image! It's nice, in a kind of "living inside Ikea" way.

(And, yes, I know Ikea is Swedish, not Danish. It's the bare wood, spare furnishings, and black lacquer I'm talking about.)

The Infamous Seinfeld-Gates Ad

2008-09-05T17:07:07-05:00

In Korean

2008-09-04T16:55:30-05:00

"Release It" has now been translated into Korean. I just received three copies of a work that's hauntingly familiar, but totally opaque to me.

I kind of wonder how the pop-culture jokes came through. I bet C3PO and R2D2 made it OK, but I wonder whether "dodge, duck, dip, dive, and dodge" made it past the Korean copy editor. (For that matter, I'm faintly surprised it made it past the English copy editor.)

ReadWriteWeb on Dirty Data

2008-08-24T22:29:50-05:00

A short while back, I did a brief series on the value of "dirty data"---copious amounts of unstructured, non-relational data created by the many interactions users have with your site and each other.

ReadWriteWeb has a post up about Four Ad-Free Ways that Mined Data Can Make Money, along very similar lines. Well worth a read.

97 Things Every Software Architect Should Know

2008-08-19T14:12:26-05:00

O'Reilly is creating a new line of "community-authored" books. One of them is called "97 Things Every Software Architect Should Know".

All of the "97 Things" books will be created by wiki, with the best entries being selected from all the wiki contributions.

I've contributed several axioms that have been selected for the book:

Long-time readers of this blog may recognize some of these themes.

You can see the whole wiki here.

How Buildings Learn

2008-08-19T10:14:43-05:00

Stewart Brand's famous book How Buildings Learn has been on my reading queue for a while, possibly a few years. Now that I've begun reading it, I wish I had gotten it sooner. Listen to this:

The finished-looking model and visually obsessive renderings dominate the let's-do-it meeting, so that shallow guesses are frozen as deep decisions. All the design intelligence gets forced to the earliest part of the building process, when everyone knows the least about what is really needed.

Wow. It's hard to tell what industry he's talking about there. It could easily apply to software development. No wonder Brand is so well-regarded in the Agile community!

Another wonderful parallel is between what Brand calls "Low Road" and "High Road" buildings. A Low Road building is one that is flexible, cheap, and easy to modify. It's hackable. Lofts, garages, old factory floors, warehouses, and so on. Each new owner can gut and modify it without qualms. A building where you can drill holes through the walls, run your own cabling, and rip out every interior wall is a Low Road building.

High Road buildings evolve gradually over time, through persistent care and love. There doesn't necessarily have to be a consistent--or even coherent--vision, but each owner does need to feel a strong sense of preservation. High Road buildings become monuments, but they aren't made that way. They just evolve in that direction as each generation adds its own character.

Then there are the buildings that aren't High or Low Road. Too static to be Low Road, but not valued enough to be High Road. Resistant to change, bureaucratic in management. Diffuse responsibility produces static (i.e., dead) buildings. Deliberately setting out to design a work of art, paradoxically, prevents you from creating a living, livable building.

Again, I see some clear parallels to software architecture here. On the one hand, we've got Low Road architecture. Easy to glue together, easy to rip apart. Nobody gets bent out of shape if you blow up a hodge-podge of shoestring batch jobs and quick-and-dirty web apps. CGI scripts written in perl are classic Low Road architecture. It doesn't mean they're bad, but they're probably not going to go a long time without being changed in some massive ways.

High Road architecture would express a conservatism that we don't often see. High Road is not "big" architecture. Rather, High Road means cohesive systems lovingly tended. Emacs strikes me as a good example of High Road architecture. Yes, it's accumulated a lot of bits and oddments over the years, but it's quite conservative in its architecture.

Enterprise SOA projects, to me, seem like dead buildings. They're overspecified and too focused on the moment of rollout. They're the grand facades with leaky roofs. They're the corporate office buildings that get gerrymandered into paralysis. They preach change, but produce stasis.

Dan Pritchett on Availability

2008-08-17T08:03:19-05:00

Dan Pritchett is a man after my own heart. His latest post talks about the path to availability enlightenment. The obvious path--reliable components and vendor-supported commercial software--leads only to tears.

You can begin on the path to enlightenment when you set aside dreams of perfect software running on perfect hardware, talking over perfect networks. Instead, embrace the reality of fallible components. Don't design around them, design for them.

How do you design for failure-prone components? That's what most of Release It! is all about.

Agile Tool Vendors

2008-08-08T08:06:35-05:00

There seems to be something inherently contradictory about "Enterprise" agile tool vendors. There's never been a tool invented that's as flexible in use or process as the 3x5 card. No matter what, any tool must embed some notion of a process, or at least a meta-process.

I've looked at several of the "agile lifecycle management" and "agile project management" tools this week. To me, they all look exactly like regular project management tools. They just have some different terminology and ajax-y web interfaces.

Vendors listen: just because you've got a drag-and-drop rectangle on a web page doesn't make it agile!

The point of agile tools isn't to move cards around the board in ever-cooler ways. It isn't to automatically generate burndown graphs and publish them for management.

The point of agile tools is this: at any time, the team can choose to rip up the pavement and do it differently next iteration.

What happens once you've paid a bunch of money for some enterprise lifecycle management tool from one of these outfits? (Name them and they appear; so I won't.) Investment requires use. Once you've paid for something---or once your boss has paid for it---you'll be stuck using it.

Now look, I'm not against tools. I use them as force multipliers all the time. I just don't want to get stuck with some albatross of a PLM, ALM, LFCM, or LEM, just because we paid a gob of money for it.

The only agile tools I want are those I can throw away without qualm when the team decides it doesn't fit any more. If the team cannot change its own processes and tools, then it cannot adapt to the things it learns. If it cannot adapt, it isn't agile. Period.

Beyond the Village

2008-07-29T06:40:25-05:00

As an organization scales up, it must navigate several transitions. If it fails to make these transitions well, it will stall out or disappear.

One of them happens when the company grows larger than "village-sized". In a village of about 150 people or fewer, it's possible for you to know everyone else. Larger than that, and you need some kind of secondary structures, because personal relationships don't reach from every person to every other person. Not coincidentally, this is also the size where you see startups introducing mid-level management.

There are other factors that can bring this on sooner. If the company is split into several locations, people at one location will lose track of those in other locations. Likewise, if the company is split into different practice areas or functional groups, those groups will tend to become separate villages on their own. In either case, the village transition will happen sooner than 150.

It's a tough transition, because it takes the company from a flat, familial structure to a hierarchical one. That implicitly moves the axis of status from pure merit to positional. Low-numbered employees may find themselves suddenly reporting to a newcomer with no historical context. It shouldn't come as a surprise when long-time employees start leaving, but somehow the founders never expect it.

This is also when the founders start to lose touch with day-to-day execution. They need to recognize that they will never again know every employee by name, family, skills, and goals. Beyond village size, the founders have to be professional managers. Of course, this may also be when the board (if there is one) brings in some professional managers. It shouldn't come as a surprise when founders start getting replaced, but somehow they never expect it.

S3 Outage Report and Perspective

2008-07-26T20:39:20-05:00

Amazon has issued a more detailed statement explaining the S3 outage from July 20, 2008. In my company, we'd call this a "Post Incident Report" or PIR. It has all the necessary sections:

Observed behavior
Root cause analysis
Followup actions: corrective and operational

This is exactly what I'd expect from any mature service provider.

There are a few interesting bits from the report. First, the condition seems to have arisen from an unexpected failure mode in the platform's self-management protocol. This shouldn't surprise anyone. It's a new way of doing business, and some of the most innovative software development, applied at the largest scales. Bugs will creep in.

In fact, I'd expect to find more cases of odd emergent behavior at large scale.

Second, the back of my envelope still shows S3 at 99.94% availability for the year. That's better than most data center providers. It's certainly better than most corporate IT departments do.
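
For anyone who wants to redo the envelope math, here is a minimal sketch of the conversion between an availability percentage and a yearly downtime budget. It assumes a 366-day year (2008 was a leap year) and does not use any figures from Amazon's report; the function names are mine.

```python
# Convert between an availability percentage and hours of downtime per year.
HOURS_PER_YEAR = 366 * 24  # assumption: a leap year, as 2008 was


def downtime_hours(availability_pct, hours_in_period=HOURS_PER_YEAR):
    """Hours of unavailability implied by an availability percentage."""
    return hours_in_period * (1 - availability_pct / 100.0)


def availability_pct(downtime, hours_in_period=HOURS_PER_YEAR):
    """Availability percentage implied by a number of downtime hours."""
    return 100.0 * (1 - downtime / hours_in_period)


print(round(downtime_hours(99.94), 2))  # about 5.27 hours in the year
```

In other words, 99.94% leaves room for a little over five hours of total outage in a year.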

Third, Amazon rightly recognizes that transparency is a necessary condition for trust. Many service providers would fall into the "bunker mentality" of the embattled organization. That's a deteriorating spiral of distrust, coverups, and spin control. Transparency is most vital after an incident. If you cannot maintain transparency then, it won't matter at any other time.

Article on Building Robust Messaging Applications

2008-07-22T09:37:46-05:00

I've talked before about adopting a failure-oriented mindset. That means you should expect every component of your system or application to someday fail. In fact, they'll usually fail at the worst possible times.

When a component does fail, whatever unit of work it's processing at the time will most likely be lost. If that unit of work is backed up by a transactional database, well, you're in luck. The database will do its Omega-13 bit on the transaction and it'll be like nothing ever happened.

Of course, if you've got more than one database, then you either need two-phase commit or pixie dust. (OK, compensating transactions can help, too, but only if the thing that failed isn't the thing that would do the compensating transaction.)

I don't favor distributed transactions, for a lot of reasons. They're not scalable, and I find that the risk of deadlock goes way up when you've got multiple systems accessing multiple databases. Yes, uniform lock ordering will prevent that, but I never want to trust my own application's stability to good coding practices in other people's apps.
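
For readers who haven't run into the term, here is a minimal sketch of uniform lock ordering, using in-process locks as a stand-in for database locks. The `Account` class and `transfer` function are hypothetical illustrations, not anything from a real system.

```python
import threading


class Account:
    """Hypothetical resource guarded by its own lock."""

    def __init__(self, account_id, balance):
        self.id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move money between accounts, always locking in ascending id order."""
    if src is dst:
        return  # nothing to do, and locking the same lock twice would hang
    # Uniform ordering: every caller acquires the lower id first, so two
    # opposite transfers can never each hold one lock while waiting on the other.
    first, second = sorted((src, dst), key=lambda account: account.id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount
```

The catch is visible right in the code: it only works if every piece of code that touches these locks sorts them the same way. One caller that grabs them in argument order brings the deadlock back.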

Besides, enterprise integration through database access is just... icky.

Messaging is the way to go. Messaging offers superior scalability, better response time to users, and better resilience against partial system failures. It also provides enough spatial, temporal, and logical decoupling between systems that you can evolve the endpoints independently.

Udi Dahan has published an excellent article with several patterns for robust messaging. It's worth reading, and studying. He addresses the real-world issues you'll encounter when building messaging apps, such as giant messages clogging up your queue, or "poison" messages that sit in the front of the queue causing errors and subsequent rollbacks.
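
To make the "poison" message problem concrete, here is a minimal in-memory sketch of one common remedy: cap the number of retries and park the offender in a dead-letter list. This is my own illustration, not necessarily the pattern from Udi's article, and the retry limit and names are assumptions.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry limit


def drain(queue, handler, max_attempts=MAX_ATTEMPTS):
    """Process every message; return the ones that kept failing."""
    dead_letters = []
    attempts = {}
    while queue:
        msg = queue.popleft()
        try:
            handler(msg)
        except Exception:
            attempts[msg] = attempts.get(msg, 0) + 1
            if attempts[msg] >= max_attempts:
                # Stop retrying and set the message aside for a human.
                dead_letters.append(msg)
            else:
                # Simulate a rollback: the message returns to the front.
                queue.appendleft(msg)
    return dead_letters
```

Without the attempt counter, the `appendleft` branch would loop forever on a bad message, which is exactly the failure mode described above.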

Kingpins of Filthy Data

2008-07-17T20:50:12-05:00

If large amounts of dirty data are actually valuable, how do you go about collecting it? Who's in the best position to amass huge piles?

One strategy is to scavenge publicly visible data. Go screen-scrape whatever you can from web sites. That's Google's approach, along with one camp of the Semantic Web tribe.

Another approach is to give something away in exchange for that data. Position yourself as a connector or hub. Brokers always have great visibility. The IM servers, the Twitter crowd, and the social networks in general sit in the middle of great networks of people. LinkedIn is pursuing this approach, as are Twitter+Summize, and BlogLines. Facebook has already made multiple, highly creepy, attempts to capitalize on their "man-in-the-middle" status. Meebo is in a good spot, and trying to leverage it further. Metcalfe's Law will make it hard to break into this space, but once you do, your visibility is a great natural advantage.

Aggregators get to see what people are interested in. FriendFeed is sitting on a torrential flow of dirty data. ("Sewage", perhaps?) FeedBurner sees the value in their dirty data.

Anyone at the endpoint of traffic should be able to get good insight into their own world. While the aggregators and hubs get global visibility, the endpoints are naturally more limited. Still, that shouldn't stop them from making the most of the dirt flowing their way. Amazon has done well here.

Sun is making a run at this kind of visibility with Project Hydrazine, but I'm skeptical. They aren't naturally in a position to collect it, and off-to-the-side instrumentation is never as powerful. Although, companies like Omniture have made a market out of off-to-the-side instrumentation, so there's a possibility there.

Carriers like Verizon, Qwest, and AT&T are in a natural position to take advantage of the traffic crossing their networks, but as parties in a regulated industry, they are mostly prohibited from looking at the traffic crossing their networks.

So, if you're a carrier or a transport network, you're well positioned to amass tons of dirty data. If you are a hub or broker, then you've already got it. Otherwise, consider giving away a service to bring people in. Instead of supporting it with ad revenue, support it by gleaning valuable insight.

Just remember that a little bit of dirty data is a pain in the ass, but mountains of it are pure gold.

Inverting the Clickstream

2008-07-16T11:00:00-05:00

Continuing my theme of filthy data.

A few years ago, there was a lot of excitement around clickstream analysis. This was the idea that, by watching a user's clicks around a website, you could predict things about that user.

What a backwards idea.

For any given user, you can imagine a huge number of plausible explanations for any given browsing session. You'll never enumerate all the use cases that motivate someone to spend ten minutes on seven pages of your web site.

No, the user doesn't tell us much about himself by his pattern of clicks.

But the aggregate of all the users' clicks... that tells us a lot! Not about the users, but about how the users perceive our site. It tells us about ourselves!

A commerce company may consider two products to be related for any number of reasons. Deliberate cross-selling, functional alignment, interchangeability, whatever. Any such relationships we create between products in the catalog only reflect how we view our own catalog. Flip that around, though, and look at products that the users view as related. Every day, in every session, users are telling us that products have some relationship to each other.

Hmm. But, then, what about those times when I buy something for myself and something for my kids during the same session? Or when I got that prank gift for my brother?

Once you aggregate all that dirty data, weak connections like the prank gift will just be part of the background noise. The connections that stand out from the noise are the real ones, the only ones that ultimately matter.
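
Here is a minimal sketch of that aggregation, assuming each session is simply a list of product names the user viewed. The function names and the threshold are hypothetical; a real site would tune the cutoff against its own noise floor.

```python
from collections import Counter
from itertools import combinations


def co_view_counts(sessions):
    """Count how many sessions contain each unordered pair of products."""
    counts = Counter()
    for viewed in sessions:
        # Deduplicate and sort so (a, b) and (b, a) land on the same key.
        for pair in combinations(sorted(set(viewed)), 2):
            counts[pair] += 1
    return counts


def strong_pairs(counts, min_sessions):
    """Keep only the pairs seen together in at least min_sessions sessions."""
    return {pair: n for pair, n in counts.items() if n >= min_sessions}
```

A one-off pairing like the prank gift shows up with a count of one and falls below the threshold, while pairs that many users view together survive the cut.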

This is an inversion of the clickstream. It tells us nearly nothing about the clicker. Instead, it illuminates the clickee.

Mounds of Filthy Data

2008-07-16T08:15:58-05:00

Data is the future.

The barriers to entering online business are pretty low, these days. You can do it with zero infrastructure, which means no capital spent on depreciating assets like servers and switches. Open source operating systems, databases, servers, middleware, libraries, and development tools mean that you don't spend money on software licenses or maintenance contracts. All you need is an idea, followed by a SMOP.

With both of those costs trending toward zero, how can there be any barrier to entry?

The "classic" answer is the network effect, also known as Metcalfe's Law. (The word "classic" in web business models means anything more than two years old, of course.) The first Twitter user didn't get a whole lot out of it. The ten-million-and-first gets a lot more benefit. That makes it tough for a newcomer like Plurk to get an edge.

I see a new model emerging, though. Metcalfe's Law is part of it, keeping people engaged. The best thing about having users, though, is that they do things. Every action by every user tells you something, if you can keep track of it all.

Twitter gets a lot of its value from the people connected at the endpoints. But, they also get enormous power from being the hub in the middle of it. Imagine what you can do when you see the content of every message passing through a system that large. A few things come to mind right away. You could extract all the links that people are posting to see what's hot today. (Zeitgeist.) You could use semantic analysis to tell how people feel about current topics, like Presidential candidates in the U.S. You could track product names and mentions to see which products delight people and which cause frustration. You could publish a slang dictionary that actually keeps up! The possibilities are enormous.
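
The first of those ideas, pulling out links to see what's hot, takes very little code. This is a rough sketch, assuming messages arrive as plain strings; the regular expression is deliberately naive and the function name is mine.

```python
import re
from collections import Counter

# Naive pattern: anything that starts with http:// or https:// up to whitespace.
LINK_PATTERN = re.compile(r"https?://\S+")


def hot_links(messages, top_n):
    """Return the top_n most frequently posted links with their counts."""
    counts = Counter()
    for message in messages:
        counts.update(LINK_PATTERN.findall(message))
    return counts.most_common(top_n)
```

The sentiment and slang ideas are much harder, but they start from the same place: a hub that sees every message can count things nobody at the endpoints can.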

Ah, I can already sense an objection forming. How the heck is anyone supposed to figure out all that stuff from noisy, messy textual human communication? We're cryptic, ironic, and oblique. We sometimes mean the exact opposite of what we say. Any machine intelligence that tries to grok all of Twitter will surely self-destruct, right? That supposed "data" is just a big steaming pile of human contradictions!

In my view, though, it's the dirtiness of the data that makes it beautiful. Yes, there will be contradictions. There will be ironic asides. But, those will come out in the wash. They'll be balanced out by the sincere, meaningful, or obvious. Not every message will be semantically clear or consistent, but given enough messy data, clear patterns will still emerge.

There's the key: enough data to see patterns. Large amounts. Huge amounts. Vast piles of filthy data.

Over the next couple of days, I'll post a series of entries exploring how to amass dirty data, who's got a natural advantage, and programming models that work with it.

Hard Problems in Architecture

2008-07-07T12:42:35-05:00

Many of the really hard problems in web architecture today exist outside the application server. Here are three problems that haven't been solved. Partial solutions exist today, but nothing comprehensive.

Uncontrolled demand

Users tend to arrive at web sites in huge gobs, all at once. As the population of the Net continues to grow, and the need for content aggregators/filters grows, the "front page" effect will get worse.

One flavor of this is the "Attack of Self-Denial", an email, radio, or TV campaign that drives enough traffic to crash the site. Marketing can slow you down. Really good marketing can kill you at any time.

Versioning, deployment and rollback

With large scale applications, as with large enterprise integrations, versioning rapidly becomes a problem. Versioning of schemas, static assets, protocols, and interfaces. Add in a dash of SOA, and you have a real nightmare. You can count on having at least one interface broken at any given time. Or, you introduce such powerful governors that nothing ever changes.

As the number of nodes increases, you eventually find that there's always at least one deployment going on. A "deployment" becomes less of a point-in-time activity than it is a rolling wave. A new service version will take hours or days to be deployed to every node. In the meantime, both the old and new service version must coexist peacefully. Since both service versions will need to support multiple protocol versions (see above) you have a combinatorial problem.
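
To put numbers on that combinatorial problem, here is a small sketch that enumerates what can be live at once during a rolling deployment. The version labels are made up for illustration.

```python
from itertools import product


def coexisting_configurations(service_versions, protocol_versions):
    """Every (service version, protocol version) pairing that can be live at once."""
    return list(product(service_versions, protocol_versions))


def caller_callee_pairings(configs):
    """Every way one live configuration can end up calling another."""
    return list(product(configs, repeat=2))


configs = coexisting_configurations(
    ["svc-1.4", "svc-1.5"],                # old and new service versions
    ["proto-v1", "proto-v2", "proto-v3"],  # protocol versions still supported
)
print(len(configs))                          # 6
print(len(caller_callee_pairings(configs)))  # 36
```

Two service versions and three protocol versions already give 36 caller/callee pairings to keep working, and that is for a single service.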

And, of course, some of these deployments will have problems of their own. Today, many application deployments are "one way" events. The deployment process itself has irreversibly destructive effects. This will have to change, so every deployment can be done both forward and back. Oh, and every deployment will also be deploying assets to multiple targets---web, application, and database---while also triggering cache flushes and, possibly, metadata changes to external partners like Akamai.

Applications will need to participate in their own versioning, deployment, and management.

Blurring the lines

There used to be a distinction between the application and the infrastructure it ran on. That meant you could move applications around and they would behave pretty much the same in a development environment as in the real production environment. These days, firewalls, load balancers, and caching appliances blur the lines between "infrastructure" and "application". It's going to get worse as the lines between "operational support system" and "production applications" get blurred, too. Automated provisioning, SLA management, and performance management tools will all have interactions with the applications they manage. These will inevitably introduce unexpected interactions... in a word, bugs.

Creeping Fees

2008-06-25T16:25:00-05:00

A couple of years ago, the Minneapolis-St. Paul airport introduced self-pay parking gates. Scan a credit card on the way in and on the way out, and it just debits the card. This obviously saves money on parking attendants, and it's pretty convenient for parkers.

At first, to encourage adoption, they offered a discount of $2 per day. Every time you'd approach the entry, a friendly voice from a Douglas Adams novel would ask, "Would you like to save $2 per day on parking?" For general parking, that meant $14 instead of $16 per day.

Some time later, this switched from being an incentive for adopting the system to a penalty for avoiding it. How? They raised the rates by $2 per day. So now, the top rate if you use self-pay is back to $16. If you don't use it, then your top rate bumped up to $18. Clearly they put somebody from the banking industry in charge of this parking system.

Now, it's changed again, from $2 per day to $2 per transaction. So it's just $2 off the top of whatever your overall parking fees are.

This gradual creep is really interesting. I wonder what the next step will be. A $2 per year discount would be one way to approach it. Maybe a "frequent parker" program. More likely the discount will drop to $1 per transaction, or it will just be discarded altogether.

That's OK with me, because swiping the credit card is still more convenient than exchanging cash money with a human anyway.

Besides, back when it was cash based, I always got tagged with the ATM fee anyway.

Word Cloud Bandwagon

2008-06-17T14:57:19-05:00

Wordle has been meming its way around the 'Net lately. Figured I'd join the crowd by doing a word cloud for Release It. This is from the preface.

Considering that this is just from fairly simple text analysis, I'm surprised at how accurately it represents the key concerns. "Software" and "money" have roughly equal prominence. "Life" appears near the middle, along with "excitement", "revenue", "production" and "systems". Not bad for an algorithm.

Webber and Fowler on SOA Man-Boobs

2008-06-07T21:39:20-05:00

InfoQ posted a video of Jim Webber and Martin Fowler doing a keynote speech at QCon London this Spring. It's a brilliant deconstruction of the concept of the Enterprise Service Bus. I can attest that they're both funny and articulate (whether on the stage or off.)

Along the way, they talk about building services incrementally, delivering value at every step along the way. They advocate decentralized control and direct alignment between services and the business units that own them.

I agree with every word, though I'm vaguely uncomfortable with how often they say "enterprise man boobs".

Coincidence or Back-end Problem?

2008-06-07T09:48:24-05:00

An odd thing happened to me today. Actually, an odd thing happened yesterday, but it's having the same odd thing happen today that really makes it odd. With me so far?

Yesterday, while I was shopping at Amazon, Amazon told me that my American Express card had expired. While it is set for a May expiration, it's several years in the future. I didn't think too much of it, because when I re-entered the same information, Amazon accepted it.

Today, I got the same thing with the same card on iTunes!

Online stores don't do a whole lot with your credit cards. For the most part, they just make a call out to a credit card processor. Small stores have to go through a second-tier CCVS system that charges a few pennies per transaction. Large ones---and do they get larger than Amazon?---generally connect directly to a payment processor. The payment processor may charge a fraction of a cent per transaction, but they definitely make it up in volume.

(There are other business factors, too, like the committed transaction volume, response time SLAs, and the like.)

Asynchronously, the payment processor collects from the issuing bank. It's the issuing bank that actually bills you, and sets your interest rate and payment terms.

Whereas VISA and MasterCard work with thousands of issuers, American Express doesn't. When you get an AmEx card, they are the issuing bank as well as the payment processor.

Which makes it highly suspect that the same card gave me the same error through two different sites. It makes me think that American Express has introduced a bug in their validation system, causing spurious declines for expiration.

Social Factors

2008-06-06T15:45:00-05:00

I mentioned Tom DeMarco just a couple of days ago. I'm re-reading his great book, Why Does Software Cost So Much? for the first time in about ten years.

Personally, I credit Tom as one of the unsung progenitors of the agile movement. Long before we had "Agile" or even "lightweight methods", Tom was talking about the psycho-social nature of software development.

For instance, here's an excerpt from essay 8, "Nontechnological Issues in Software Engineering":

Imagine your boss just plunked a specification on your desk and asked, "How long will it take you and one other person to get this job done?" What's the first question out of your mouth?
Would you ask, "Can we use object-oriented methods?" or "What CASE system can we buy?" or "Is it okay to use rapid prototyping?" Of course not. Your first question is,
Who is the other person?

Absolutely. Right on, Tom.

Plurk.

2008-06-06T10:13:33-05:00

A friend invited me to Plurk. So far, I've resisted Twitter for no good reason (other than a vague sense of social insecurity.) I figure I'll dip my toe into Plurk, though.

This link is an open invite to Plurk. It'll let anyone join. Fair warning, it's also a "friend" link.

Six Word Methods

2008-06-03T20:44:37-05:00

In his great collection of essays Why Does Software Cost So Much?, Tom DeMarco makes the interesting point that the software industry had grown from zero to $300 billion (in 1993). This indicates that the market had at least $300B worth of demand for software, even while complaining continuously about the cost and quality of the very same software. It seems to me that the demand for software production, together with the time and cost pressures, has only increased dramatically since then.

(DeMarco enlightens us that the perennial question, "Why does software cost so much?" is not really a question at all, but rather a goad or a negotiation. Also very true.)

Fundamentally, the demand for software production far outstrips our industry's ability to supply it. In fact, I believe that we can classify most software methods and techniques by their relation and response to the problem of surplus demand. Some try to optimize for least-cost production, others for highest quality, still others for shortest cycle time.

In the spirit of six-word memoirs, here are the sometimes dubious responses that various technology and development methods offer to the overwhelming demand for software production.

Waterfall: Nevermind backlog, requirements were signed off.

RAD: Build prototypes faster than discarding them.

Offshore outsourcing: Army of cheap developers producing junk.

Onshore outsourcing: Same junk, but with expensive developers.

Agile: Avoid featuritis; outrun pesky business users.

Domain-specific languages: Compress every problem into one-liners.

CMMi: Enough Process means nothing's ever wasted.

Relational Databases: Code? Who cares? Data lives forever.

Model-driven architecture: Jackson Pollock's models into inscrutable code.

Web Services: Terrorize XML until maximum reuse achieved.

FORTH: backward writing IF punctuation time SAVE.

SOA: Iron-fisted governance ensures total calcification.

Intentional programming: Parallelize programming... make programmers of everyone.

Google as IDE: It's been done, probably in Befunge.

Open-source: Bury the world in abandoned code.

Mashups: Parasitize others' apps, then APIs change.

LISP: With enough macros, one uberprogrammer suffices.

perl: Too busy coding to maintain anyway.

Ruby: Meta-programming: same problems, mysterious solutions.

Ocaml: No, try meta-meta-meta-programming.

Groovy: Faster Java coding, runs like C-64.

Software-as-a-Service: Don't write your own, rent ours.

Cloud Computing: Programmers would go faster without administrators.

New Article: S2AP + Eclipse + Maven walkthrough

2008-05-30T18:52:17-05:00

See Getting Started With SpringSource Application Platform, Eclipse, and Maven.

Most of the information out there about programming in S2AP is in blogs or references to really old OSGi tutorials. It took me long enough to configure some basic Eclipse project support that I figured it was worth writing down. All of the frameworks and tool sets are very flexible, which means you have more choices to deal with when setting up a project. Sometimes, being concrete helps... there may be a lot of options, but when it's time to do a project, you only care about one set of choices for those options. This guide is completely specific to using Eclipse to write bundle projects for SpringSource Application Platform.

If that's your specific set of needs, great! If not, that's OK too, because the beauty of the Web is that somebody else will have a tutorial on your exact combination, too.

Canadian Privacy Commissioner Highlights Cloud Privacy Concerns

2008-05-28T13:42:26-05:00

A little while ago, I wrote a piece about the conflict between "clouds" and the hard boundaries of the political sphere. There's no physical place called "cyberspace", and any cloud computing infrastructure has to actually exist somewhere.

Like many U.S. citizens, I really hate the idea that facts about me become somebody else's copyrighted property just because they get stored in a database. Canada has a justifiably good reputation for protecting its citizens' privacy. Their legal framework takes the refreshing position of protecting individuals rather than protecting the ability of non-corporeal entities (a.k.a. "incorporated persons", a.k.a. "corporations") to collect any and all information.

I hadn't realized that there were such offices as the "Information and Privacy Commissioner of Ontario", however.

Better still, Ontario's IPC Commissioner, Dr. Ann Cavoukian, is very current. She's just released a white paper on the privacy implications of cloud computing. She's calling for open standards around digital identity management, and outlines some technological building blocks needed for controllable trust and identity verification.

Unlike the U.S. approach to identity verification, Dr. Cavoukian's approach has nothing to do with catching illegal aliens, welfare frauds, or terrorists. Instead, it's about creating open, trustworthy ways for humans to interact in all their various modalities from commerce, to entertainment, and even to romance.

Quickie: GAE is GA

2008-05-28T12:56:12-05:00

According to eWeek, Google will make GAE open to public use on May 28th. Which would be today.

The original GAE site isn't updated at this point, but you can get started anyway. I just set up my account and registered an app. (I predict tens of thousands of empty apps. Long-tail distribution here, just like SourceForge: an overwhelming majority of empty projects, with a vanishingly tiny minority that have 99% of the traffic.)

Now I just need to find time to learn Python and write something cool.

Wii Wescue

2008-05-16T15:56:58-05:00

So, I got a Wii for Father's Day last year. It's been a lot of fun to play together with my kids, my wife, and even my parents and in-laws. It's fantastic to have a game system that we can all play together and be reasonably competitive. My six-year-old can hold her own in Wii bowling, but she cries a lot when we play Halo. (I'm just kidding...)

Unfortunately, my three-year-old put a shiny disc of her own into it: a plastic toy coin. Well, it does say "Play Money" right on the front. Right in the drive slot. I figured my Wii was a goner for sure.

I set about opening the thing up to remove the coin, but got stumped by these custom screws, kind of like a Phillips head, but with three prongs. Turns out these are called "Tri-wing" screws and they're specifically designed to keep end users out of the machine, on the theory that these are not widely used screws, so most people won't have the means to unscrew them. True, it slowed me down a bit. I had to order a kit from Thinkgeek that has driver bits for every console on the market.

Opened it up, got the coin out, and the Wii still works!

But, surely these belong somewhere, don't they?

Opening Up SpringSource AP

2008-05-14T15:57:20-05:00

Just now getting my hands on the SpringSource Application Platform. It's deceptive, because there's very little functionality exposed when you run it. It starts up with less ceremony than Apache or Tomcat. (Which is kind of funny, when you consider that it includes Tomcat.)

When you look at the bundle repository, though, it's clear that a lot of stuff is packaged in here. In a way, that's like the Spring framework itself. On the surface, it looks like just a bean configurator. All the really powerful stuff is in the libraries built out of that small core.

Here's a quick listing of the bundles in version 1.0.0.beta:

./bundles/ext/com.springsource.com.google.common.collect-0.5.0.alpha.jar
./bundles/ext/com.springsource.edu.emory.mathcs.backport-3.0.0.jar
./bundles/ext/com.springsource.javax.activation-1.1.0.jar
./bundles/ext/com.springsource.javax.annotation-1.0.0.jar
./bundles/ext/com.springsource.javax.ejb-3.0.0.jar
./bundles/ext/com.springsource.javax.el-2.1.0.jar
./bundles/ext/com.springsource.javax.jms-1.1.0.jar
./bundles/ext/com.springsource.javax.mail-1.4.0.jar
./bundles/ext/com.springsource.javax.persistence-1.0.0.jar
./bundles/ext/com.springsource.javax.servlet-2.5.0.jar
./bundles/ext/com.springsource.javax.servlet.jsp-2.1.0.jar
./bundles/ext/com.springsource.javax.servlet.jsp.jstl-1.1.2.jar
./bundles/ext/com.springsource.javax.xml.bind-2.0.0.jar
./bundles/ext/com.springsource.javax.xml.rpc-1.1.0.jar
./bundles/ext/com.springsource.javax.xml.soap-1.3.0.jar
./bundles/ext/com.springsource.javax.xml.stream-1.0.1.jar
./bundles/ext/com.springsource.javax.xml.ws-2.1.1.jar
./bundles/ext/com.springsource.json-1.0.0.BUILD-20080422112602.jar
./bundles/ext/com.springsource.org.antlr-3.0.1.jar
./bundles/ext/com.springsource.org.aopalliance-1.0.0.jar
./bundles/ext/com.springsource.org.apache.catalina-6.0.16.jar
./bundles/ext/com.springsource.org.apache.commons.fileupload-1.2.0.jar
./bundles/ext/com.springsource.org.apache.commons.io-1.4.0.jar
./bundles/ext/com.springsource.org.apache.commons.logging-1.1.1.jar
./bundles/ext/com.springsource.org.apache.coyote-6.0.16.jar
./bundles/ext/com.springsource.org.apache.el-6.0.16.jar
./bundles/ext/com.springsource.org.apache.jasper-6.0.16.jar
./bundles/ext/com.springsource.org.apache.jasper.org.eclipse.jdt-6.0.16.jar
./bundles/ext/com.springsource.org.apache.juli.extras-6.0.16.jar
./bundles/ext/com.springsource.org.apache.taglibs.standard-1.1.2.jar
./bundles/ext/com.springsource.org.aspectj.runtime-1.6.0.m2.jar
./bundles/ext/com.springsource.org.aspectj.weaver-1.6.0.m2.jar
./bundles/ext/com.springsource.slf4j.org.apache.commons.logging-1.5.0.jar
./bundles/ext/com.springsource.slf4j.org.apache.log4j-1.5.0.jar
./bundles/ext/org.springframework.aop-2.5.4.A.jar
./bundles/ext/org.springframework.aspects-2.5.4.A.jar
./bundles/ext/org.springframework.beans-2.5.4.A.jar
./bundles/ext/org.springframework.context-2.5.4.A.jar
./bundles/ext/org.springframework.context.support-2.5.4.A.jar
./bundles/ext/org.springframework.core-2.5.4.A.jar
./bundles/ext/org.springframework.jdbc-2.5.4.A.jar
./bundles/ext/org.springframework.jms-2.5.4.A.jar
./bundles/ext/org.springframework.orm-2.5.4.A.jar
./bundles/ext/org.springframework.osgi.core-1.1.0.M2A.jar
./bundles/ext/org.springframework.osgi.extender-1.1.0.M2A.jar
./bundles/ext/org.springframework.osgi.io-1.1.0.M2A.jar
./bundles/ext/org.springframework.transaction-2.5.4.A.jar
./bundles/ext/org.springframework.web-2.5.4.A.jar
./bundles/ext/org.springframework.web.portlet-2.5.4.A.jar
./bundles/ext/org.springframework.web.servlet-2.5.4.A.jar
./bundles/ext/org.springframework.web.struts-2.5.4.A.jar
./bundles/subsystems/com.springsource.platform.common/com.springsource.platform.common.env-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.common/com.springsource.platform.common.math-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.concurrent/com.springsource.platform.concurrent.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.config/com.springsource.platform.config.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.control/com.springsource.platform.control.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.deployer/com.springsource.platform.deployer.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.deployer/com.springsource.platform.deployer.hot-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.ffdc/com.springsource.platform.ffdc.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.io/com.springsource.platform.io.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.kernel/com.springsource.platform.kernel.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.kernel/com.springsource.platform.kernel.dm-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.management.proxy/com.springsource.platform.management.proxy-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.profile/com.springsource.platform.profile.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.serviceability/com.springsource.platform.serviceability.ffdc-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.serviceability/com.springsource.platform.serviceability.ffdc.aspects-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.serviceability/com.springsource.platform.serviceability.tracing.aspects-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.servlet/com.springsource.platform.servlet.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.servlet/com.springsource.platform.servlet.tomcat-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.system/com.springsource.platform.system.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.web/com.springsource.platform.web.core-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.web/com.springsource.platform.web.dm-1.0.0.beta.jar
./bundles/subsystems/com.springsource.platform.web/com.springsource.platform.web.support-1.0.0.beta.jar

There's clearly a lot of functionality built in, but how do you get at it? The SAP, erm, SpringSource AP documentation screams for improvement. Maybe they think that, because all the parts are documented elsewhere, there's no need for any integrated docset. If so, they would be wrong. Despite that, I'm interested enough to keep poking away at it.

Oh, and one other thing: the default administrator account is admin/springsource. (It's actually defined in servlet/conf/tomcat-users.xml.) For some reason, that's buried in chapter 5 of the user guide. It would be handy to make that more prominent.

JavaOne is a Hot Zone

2008-05-09T13:44:53-05:00

Apparently, there's a virus attack. Not a computer virus. A real virus. Hot zone instead of a hot spot.

From my inbox this morning:

The JavaOne conference team has been notified by the San Francisco Department of Public Health about an identified outbreak of a virus in the San Francisco area. Testing is still underway to identify the specific virus in question, but they believe it to be the Norovirus, a common cause of the "stomach flu", which can cause temporary flu-like symptoms for up to 48 hours. Part of the San Francisco area impacted includes the Moscone Center, the site of the JavaOne conference which is being held this week. We are working with the appropriate San Francisco Department of Public Health and Moscone representatives to mitigate the impact this will have on the conference and steps are being taken overnight to disinfect the facility. We have not received any indication that the show should end early, so will have the full schedule of events on Friday as planned. We hope to see you then.

Please see the attached notification from the Department of Public Health.

For further information, as well as Frequently Asked Questions related to the Norovirus, please visit the San Francisco Department of Public Health website at http://sfcdcp.org/norovirus.cfm

The CDC description includes the phrase "acute gastroenteritis."

Grab Bag of Demos

2008-05-09T12:54:18-05:00

Sun opened the final day of JavaOne with a general session called "Extreme Innovation". This was a showcase for novel, interesting, and out-of-this-world uses of Java based technology.

VisualVM

VisualVM works with local or remote applications, using JMX over RMI to connect to remote apps. While you have to run VisualVM itself under JDK 1.6, it can connect to any version of JVM from 1.4.2 through 1.7. Local apps are automatically detected and offered in the UI for debugging. VisualVM uses the Java Platform Debugger Architecture to show thread activities, memory usage, object counts, and call timing. It can also take snapshots of the application's state for post-mortem or remote analysis.
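To get a feel for the kind of data a JMX-based tool has to work with, you can sample the same management beans from inside a running JVM. This is only a minimal sketch against the local JVM using the standard java.lang.management API. The JvmSnapshot class is a made-up name for illustration, and it says nothing about how VisualVM itself is implemented. A remote connection over RMI would go through javax.management.remote instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: samples the local JVM through the platform MXBeans.
final class JvmSnapshot {
    final long heapUsedBytes;
    final int liveThreads;

    private JvmSnapshot(long heapUsedBytes, int liveThreads) {
        this.heapUsedBytes = heapUsedBytes;
        this.liveThreads = liveThreads;
    }

    // Reads current heap usage and the live thread count of this JVM.
    static JvmSnapshot take() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return new JvmSnapshot(heap.getUsed(), threads.getThreadCount());
    }

    public static void main(String[] args) {
        JvmSnapshot snapshot = JvmSnapshot.take();
        System.out.println("heap used (bytes): " + snapshot.heapUsedBytes);
        System.out.println("live threads:      " + snapshot.liveThreads);
    }
}
```

Reading these two values is cheap. Profiling and heap dumps are a different story.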

Memory problems can be a bear to diagnose. VisualVM includes a heap analyzer that can show reference chains. From the UI, it looks like it can also detect and indicate reference loops.

One interesting feature of VisualVM is the ability to add plug-ins for application-specific behavior. Sun demonstrated a Glassfish plugin that adds custom metrics for request latency and volume, and the ability to examine each application context independently.

The application does not require any special instrumentation, so you can run VisualVM directly against a production application. According to Sun, it adds "almost no overhead" to the application being examined. I'd still be very cautious about that. VisualVM allows you to enable CPU and memory profiling in real-time, so that will certainly have an effect on the application. Not to mention, it also lets you trigger a heap dump, which is always going to be costly.

VisualVM is available for download now.

JavaScript Support in NetBeans

Sun continues to push NetBeans at every turn. In this case, it was a demo of the JavaScript plugin for NetBeans. This is really a nice plugin. It uses type inferencing to provide autocompletion and semantic warnings. For example, it would warn you if a function had inconsistent return statements. (Such as returning an object from one code path mixed with a void return from another.)

It also has a handy developer aid: it warns developers about browser compatibility.

I don't do a whole lot of JavaScript, but I couldn't help thinking about other dynamic languages. If the plugin can do that kind of type inferencing---without executing the code---for one dynamic language, then it should be possible to do for other dynamic languages. That could remove a lot of objections about Groovy, Scala, JRuby, etc.

Fluffy Stuff at the Edge

We got a couple of demos of Java in front of the end-user. One was a cell phone running an OpenGL scene at about 15 frames per second on an NVidia chipset. All the rendering was done in Java and displayed via OpenGL ES, with 3D positional audio. Not bad at all.

Project Darkstar got a few moments in the spotlight, too. They showed off a game called Call of the Kings, a multiplayer RTS that looked like it came from 1999. Call of the Kings uses the jMonkey Engine (built on top of JOAL, JOGL, and jInput) on the client and Project Darkstar's game server on the backend. It's OK, but as game engines go, I'm not sure how it will be relevant.

There was also a JavaCard demo, running Robocode players on JavaCards. That's not just storing the program on the card, it was actually executing on the card. Two finalists were brought up on stage (but not given microphones of their own, I noticed) for a final battle between their tanks. Yellow won, and received a PS3. Red lost, but got a PSP for making it to the finals.

Sentilla tried to get out from the "creepy" moniker by bouncing mesh-networked, location-tracking beachballs around the audience. Each one had a Sentilla "mote" in it, with a 3D accelerometer inside. Receivers at the perimeter of the hall could triangulate the beachballs' locations by signal strength. For me, the most interesting thing here was James Gosling's talk about powering the motes. They draw so little power that it's possible to power them from ambient sources: vibration and heat. Interesting. Still creepy, but interesting.

The next demo was mind-blowing. The Livescribe Pulse is a Java computer built into a pen. It's hard to describe how wild this thing is; you almost have to see it for any of this to make sense.

At one point, the presenter wrote down a list, narrating as he went. For item one, he wrote the numeral "1" and the word "pulse", describing the pen as he went. For item two, he wrote the numeral "2" and drew a little doodle of a desktop. Item three was the numeral and a vague cloudy thing. All this time, the pen was recording his audio, and associating bits of the audio stream with the page locations. So when he tapped the numeral "1" that he had written, the pen played back his audio. Not bad.

Then he put an "application card" on the table and tapped "Spanish" on it. He wrote down the word "one"... and the pen spoke the word "uno". He wrote "coffee please" and it said "cafe por favor". Then he had it do the same phrase in Mandarin and Arabic. Handwriting recognition, machine translation, and speech synthesis all in the pen. Wow.

Next, he selected a program from the pen's menu. The special notebook has a menu crosshair on it, but you can draw your own crosshair and it works the same way: use the pen to tap the up-arrow on paper, and the menu changes on the display. He picked a piano program, and the pen started to give him directions on how to draw a piano. Once he was done drawing it, he could tap the "keys" on paper to play notes.

The pen captures x, y, and t information as you write, so it's digitizing the trajectory rather than the image. This is great for data compression when you're sharing pages across the Livescribe web site. It's probably also great for forgers, so there might be a concern there.

Industrial Strength

Emphasizing real-time Java for a bit, Sun showed off "Blue Wonder", an industrial controller built out of an x86 computer running Solaris 10 and Java RTS 2.0. This is suitable for factory control applications and is, apparently, very exciting to factory control people.

From the DARPA Urban Challenge event, we saw "Tommy Jr.", an autonomous vehicle. It followed Paul Perrone into the room, narrating each move it was making. Fortunately, nobody tried to demonstrate its crowd control or law enforcement features. Instead, they showed off an array of high resolution sensors and actuators. It's all controlled, under very tight real-time constraints, by a single x86 board running Solaris and Java RTS.

Into New Realms

Next, we saw a demo of JMars. This impressive application helps scientists make sense out of the 150 terabytes of data we've collected from various Mars probes. It combines data and imaging layers from many different probes. One example overlaid hematite concentrations on top of an infrared image layer. It also knows enough about the various satellites' orbits to help plan imaging requests.

Ultimately, JMars was built to help target landing sites for both scientific interest and technical viability. We'll soon see how well they did: the Phoenix lander arrives in about two weeks, targeting a site that was selected using JMars.

JMars is both free to use and open source. Dr. Phil Christensen from Arizona State University invited the Java community to explore Mars for themselves, and perhaps join the project team.

CERN

Thousands of people, physicists and otherwise, are eagerly awaiting the LHC's activation. We got to see a little bit behind the scenes about how Java is being used within CERN.

On the one hand, some very un-sexy business process work is being done. LHC is a vast project, so it's got people, budget, and materials to manage. Ho hum. It's not easy to manage all those business processes, but it sure doesn't demo well.

On the other hand, showing off the grid computing infrastructure does.

Once it's operating, the ATLAS detectors alone will produce a gigabyte an hour of image data. All of it needs to be processed. "Processing" here means running through some amazing pattern recognition programs to analyze events, looking for anomalies. There will be far too many collisions generated every day for a physicist to look at all of them, so automated techniques have to weed out "uninteresting" collisions and call attention to ones that don't fit the profile.

CERN estimates that 100,000 CPUs will be needed to process the data. They've built a coalition of facilities into a multi-tier grid. Even today, they're running 16,000 jobs on the grid across hundreds of data centers. With that many nodes involved, they need some good management and visualization tools, and we got to see one. It's a 3D world model with iconified data centers showing their status and capacity. Jobs fly from one to another along geodesic links. Very cool stuff.

Summary

Java is a mature technology that's being used in many spheres other than application server programming. For me, and many other JavaOne attendees, this session really underscored the fact that none of our own projects are anywhere near as cool as these demos. I'm left with the desire to go build something cool, which was probably the point.

SOA: Time For a Rethink

2008-05-08T16:00:00-05:00

The notion of a service-oriented architecture is real, and it can deliver. The term "SOA", however, has been entirely hijacked by a band of dangerous Taj Mahal architects. They seem innocuous, in part because they'll lull you to sleep with endless protocol diagrams. Behind the soporific technology discussion lies a grave threat to your business.

"SOA" has come to mean top-down, up-front, strong-governance, all-or-nothing process (moving at glacial speed) implemented by an ill-conceived stack of technologies. SOAP is not the problem. WSDL is not the problem. Even BPEL is not the problem. The problem begins with the entire world view.

We need to abandon the term "SOA" and invent a new one. "SOA" is chasing a false goal. The idea that services will be so strongly defined that no integration point will ever break is unachievable. Moreover, it's optimizing for the wrong thing. Most businesses today are not safety-critical. Instead, they are highly competitive.

We need loosely-coupled services, not orchestration.

We need services that emerge from the business units they serve, not an IT governance panel.

We need services to change as rapidly as the business itself changes, not after a chartering, funding, and governance cycle.

Instead of trying to build an antiseptic, clockwork enterprise, we need to embrace the messy, chaotic, Darwinian nature of business. We should be enabling rapid experimentation, quick rollout of "barely sufficient" systems, and fast feedback. We need to enable emergence, not stifle it.

Anything that slows down that cycle of experimentation and adjustment puts your business on the same evolutionary path as the Great Auk. I never thought I'd find myself quoting Tom Peters in a tech blog, but the key really is to "Test fast, fail fast, adjust fast."

The JVM is Great, But

2008-05-08T14:29:06-05:00

Much of the interest in dynamic languages like Groovy, JRuby, and Scala comes from running on the JVM. That lets them leverage the tremendous R&D that’s gone into JVM performance and stability. It also opens up the universe of Java libraries and frameworks.

And yet, much of my work deals with the 80% of cost that comes after the development project is done. I deal with the rest of the software’s lifetime. The end of development is the beginning of the software’s life. Throughout that life, many of the biggest, toughest problems exist around and between the JVMs: scalability, stability, interoperability, and adaptability.

As I previously showed in this graphic, the easiest thing for a Java developer to create is a slow, unscalable, and unstable web application. Making high-performance, robust, scalable applications still requires serious expertise. This is a big problem, and I don’t see it getting better. Scala might help here in terms of stability, but I’m not yet convinced it’s suitable for the largest masses of Java developers. Normal attrition means that the largest population of developers will always be the youngest and least experienced. This is not a training problem: in the post-commoditization world, the majority of code will always be written by undertrained, least-cost coders. That means we need platforms where the easiest thing to do is also the right thing to do.

Scaling distributed systems has gotten better over the last few years. Distributed memory caching has reached the mainstream. Terracotta and Coherence are both mature products, and they both let you try them out for free. In the open source crowd, as usual, you lose some manageability and some time-to-implement, but the projects work when you use them right. All of these do the job of connecting individual JVMs to a caching layer. On the other hand, I can’t help but feel that the need for these products points to a gap in the platform itself.

OSGi is finally reaching the mainstream. It’s been needed for a long time, for a couple of reasons. First, it’s still too common to see gigantic classpaths containing multiple versions of JAR files, leading to the perennial joy of finding obscure, it-works-fine-in-QA bugs. So, keeping individual projects in their own bundles, with no classpath pollution will be a big help. Versioning application bundles is also important for application management and deployment. OSGi is what we should have had since the beginning, instead of having the classpath inflicted on us.

I predict that we’ll see more production operations moving to hot deployment on OSGi containers. For enterprise services that require 100% uptime, it’s just no longer acceptable to bring down the whole cluster in order to do deployments. Even taking an entire server down to deploy a new revision may become a thing of the past. In the Erlang world, it’s common to see containers running continuously for months or years. In Programming Erlang, Joe Armstrong talks about sending an Erlang process a message to “become” a new package. It works without disrupting any current requests and it happens atomically between one service request and the next. (In fact, Joe says that one of the first things he does on a new system is deploy the container processes, at the very beginning of the project. Later, once he knows what the system is supposed to do, he deploys new packages into those containers.) Hot deployment can be safe, if the code being deployed is sufficiently decoupled from the container itself. OSGi does that.
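The "become" idea can be approximated in plain Java with an atomically replaceable handler. This is a rough sketch of the principle only. It is not how OSGi or Erlang actually implement hot deployment, and HotSwapContainer and its methods are hypothetical names made up for this example.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Hypothetical container: holds the current handler and lets it be replaced while running.
final class HotSwapContainer {
    private final AtomicReference<Function<String, String>> handler;

    HotSwapContainer(Function<String, String> initial) {
        this.handler = new AtomicReference<>(initial);
    }

    // Each request reads the handler once, so it runs entirely on one version of the code.
    String handle(String request) {
        return handler.get().apply(request);
    }

    // "Become" a new package: the reference swap is atomic, so the next request sees the new code.
    void become(Function<String, String> replacement) {
        handler.set(replacement);
    }

    public static void main(String[] args) {
        HotSwapContainer container = new HotSwapContainer(request -> "v1:" + request);
        System.out.println(container.handle("ping"));
        container.become(request -> "v2:" + request);
        System.out.println(container.handle("ping"));
    }
}
```

The container knows nothing about the handler beyond its interface. That decoupling is what makes the swap safe.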

OSGi also enables strong versioning of the bundles and their dependencies. This is an all-around good thing, since it will let developers and operations agree on exactly which versions of which components belong in production at a given time.

SAP's SOA ESR

2008-05-07T13:01:22-05:00

SAP has been talking up their suite of SOA tools. The names all run together a bit, since they each contain some permutation of "enterprise" and "builder", but it's a very impressive set of tools.

Everything SAP does comes off of an enterprise service repository (ESR). This includes a UDDI registry, and it supports service discovery and lookup. Development tools allow developers to search and discover services through their "ES Workspace". Interestingly, this workspace is open to partners as well as internal developers.

From the ESR, a developer can import enough of a service definition to build a composite application. Composite applications include business process definitions, new services of their own, local UI components, and remote service references.

Once a developer creates a composite application, it can be deployed to a local container or a test server. Presumably, there's a similar tool available for administrators to deploy services, composite applications, and other enterprise components onto servers.

Through it all, the complete definition of every component goes into the ESR.

In order to make the entire service lifecycle work, SAP has defined a strong meta-model and a very strong governance process.

This is the ultimate expression of the top-down, strong-governance model for enterprise SOA.

If you're into that sort of thing.

Type Inference Without Gagging

2008-05-07T09:34:09-05:00

I am not a language designer, nor a researcher in type systems. My concerns are purely pragmatic. I want usable languages, ones where doing the easy thing also happens to be doing the right thing.

Even today, I see a lot of code that handles exceptions poorly (whether checked or unchecked!). Even after 13 years and some trillion words of books, most Java developers barely understand when to synchronize code. (And, by the way, I now believe that there's only one book on concurrency you actually need.)

I still recall the agony of converting a large C++ code base to const-correctness. That's something that you can't just do a little bit. You add one little "const" keyword and sooner or later, you end up writing some gibberish that looks like:

const int foo(const char * const * blah) = const;

I'm exaggerating a little bit, but I bet somebody more current on C++ can come up with an even worse example.

That's the path I don't want to see Java tread.

On the other hand, Robert Fischer pointed out that type inference doesn't have to hurt so much. His post on OCaml's type inferencing system is a breath of fresh air.

There's quite a bit of other interesting stuff in there, too. I particularly like this remark:

What the Rubyists call a "DSL", Ocamlists call "readable code".

I'm still working on wrapping my head around Erlang right now (it's my "new language" for 2008), but I might just have to give OCaml preferred position for my 2009 new language.

When Should You Jump? JSR 308. That's When.

2008-05-06T23:10:11-05:00

One of the frequently asked questions at the No Fluff, Just Stuff expert panels boils down to, "When should I get off the Java train?" There may be good money out there for the last living COBOL programmer, but most of the Java developers we see still have a lot of years left in their careers, too many to plan on riding Java off into its sunset.

Most of the panelists talk about the long future ahead of Java the Platform, no matter what happens with Java the Language. Reasonable. I also think that a young developer's best bet is to stick with the Boy Scout motto: Be Prepared. Keep learning new languages and new programming paradigms. Work in many different domains, styles, and architectures. That way, no matter what the future brings, the prepared developer can jump from one train to the next.

After today, I think I need to revise my usual answer.

When should a Java developer jump to a new language? Right after JSR 308 becomes part of the language.

Beware: this stuff is like Cthulhu rising from the vasty deep. There's an internal logic here, but if you're not mentally prepared, it could strip away your sanity like a chill wind across a foggy moor. I promise that's the last hypergolic metaphor. Besides, that was another post.

JSR 308 aims to bring Java a more precise type system, and to make the type system arbitrarily extensible. I'll admit that I had no idea what that meant, either. Fortunately, presenter and MIT Associate Professor Michael Ernst gave us several examples to consider.

The expert group sees two problems that need to be addressed.

The first problem is a syntactic limitation with annotations today: they can only be applied to type declarations. So, for example, we can say:

@NonNull List strings;

If the right annotation processor is loaded, this tells the compiler that strings will never be null. The compiler can then help us enforce that by warning on any assignment that could result in strings taking on a null value.

Today, however, we cannot say:

@NonNull List<@NonNull String> strings;

This would mean that the variable strings will never take a null value, and that no list element it contains will be null.
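For readers on Java 8 or later, where the type-annotation syntax from JSR 308 did eventually ship, the declaration can be tried directly. A minimal sketch follows. NonNull is declared locally here and is not a JDK annotation, and TypeAnnotationDemo is a made-up class name. The reflection calls only show that both annotations are recorded on the type. Nothing in this sketch enforces them. That would take a separate checker.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.AnnotatedParameterizedType;
import java.lang.reflect.AnnotatedType;
import java.util.ArrayList;
import java.util.List;

final class TypeAnnotationDemo {
    // Hypothetical qualifier, declared locally; TYPE_USE lets it appear on any use of a type.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE_USE)
    @interface NonNull {}

    // Annotates both the list type and its element type.
    static @NonNull List<@NonNull String> strings = new ArrayList<>();

    // Looks up the annotated type of the 'strings' field.
    private static AnnotatedType fieldType() {
        try {
            return TypeAnnotationDemo.class.getDeclaredField("strings").getAnnotatedType();
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    // True when the List type itself carries @NonNull.
    static boolean listIsAnnotated() {
        return fieldType().isAnnotationPresent(NonNull.class);
    }

    // True when the element type (String) carries @NonNull.
    static boolean elementIsAnnotated() {
        AnnotatedParameterizedType parameterized = (AnnotatedParameterizedType) fieldType();
        AnnotatedType element = parameterized.getAnnotatedActualTypeArguments()[0];
        return element.isAnnotationPresent(NonNull.class);
    }

    public static void main(String[] args) {
        System.out.println("list annotated:    " + listIsAnnotated());
        System.out.println("element annotated: " + elementIsAnnotated());
    }
}
```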

Consider another example:

@NonEmpty List<@NonNull String> strings = ...;

This is a list whose elements may not be null. The list itself will not be empty. The compiler---more specifically, an annotation processor used by the compiler---will help enforce this.

They would also add the ability to annotate method receivers:

void marshal(@Readonly Object jaxbElement, @Mutable Writer writer) @Readonly { ... }

This tells the type system that jaxbElement will not be changed inside the method, that writer will be changed, and that executing marshal will not change the receiving object itself.

Presumably, to enforce that final constraint, marshal would only be permitted to call other methods that the compiler could verify as consistent with @Readonly. In other words, applying @Readonly to one method will start to percolate through into other methods it calls.

The second problem the expert group addresses is more about semantics than syntax. The compiler keeps you from making obvious errors like:

int i = "JSR 308";

But, it doesn't prevent you from calling getValue().toString() when getValue() could return null. More generally, there's no way to tell the compiler that a variable is not null, immutable, interned, or tainted.
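Here is a small sketch of that gap. The compiler accepts every line, and the failure only shows up at run time. The NullGap class and the getValue signature are made up for illustration.

```java
final class NullGap {
    // Hypothetical lookup: the declared return type says Object, but null is a legal result.
    static Object getValue(boolean present) {
        return present ? "JSR 308" : null;
    }

    // Compiles cleanly; fails only at run time when getValue returns null.
    static String describe(boolean present) {
        return getValue(present).toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(true));
        try {
            describe(false);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException at run time");
        }
    }
}
```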

Their solution is to add a pluggable type system to the Java compiler. You would be able to annotate types (both at declaration and at usage) with arbitrary type qualifiers. These would be statically carried through compilation and made available to pluggable processors. Ernst showed us an example of a processor that can check and enforce not-null semantics. (Available for download from the link above.) In a sample source code base (of approximately 5,000 LOC) the team added 35 not-null annotations and suppressed 4 warnings to uncover 8 latent NullPointerException bugs.

Significantly, Findbugs, Jlint, and PMD all missed those errors, because none of them include an inferencer that could trace all usages of the annotated types.

That all sounds good, right? Put the compiler to work. Let it do the tedious work tracing the extended semantics and checking them against the source code.

Why the Lovecraftian gibbering, then?

Every language has a complexity budget. Java blew through it with generics in Java 5. Now, seriously, take another look at this:

@NotEmpty List<@NonNull String> strings = new ArrayList<@NonNull String>();

Does that even look like Java? That complexity budget is just a dim smudge in our rear-view mirror. We're so busy keeping the compiler happy here, we'll completely forget what our actual project is.

All this is coming at exactly the worst possible time for Java the Language. The community is really, really excited about dynamic languages now. Instead of those contortions, we could just say:

var strings = ["one", "two"];

Now seriously, which one would you rather write? True, the dynamic version doesn't let me enlist the compiler's aid for enforcement. True, I do need many more unit tests with the dynamic code. Still, I'd prefer that "low ceremony" approach to the mouthful of formalism above.

So, getting back to that mainstream Java developer... it looks like there are only two choices: more dynamic or more static. More formal and strict, or more loosey-goosey and terse. JSR 308 will absolutely accelerate this polarization.

And, by the way, in case you were thinking that Java the Language might start to follow the community move toward dynamic languages, Alex Buckley, Sun's spec lead for the Java language, gave us the answer today.

He said, "Don't look for any 'var' keywords in Java."

SOA at 3.5 Million Transactions Per Hour

2008-05-06T15:47:01-05:00

Matthias Schorer talked about FIDUCIA IT AG and their service-oriented architecture. This financial services provider works with 780 banks in Europe, processing 35,000,000 transactions during the banking day. That works out to a little over 3.5 million transactions per hour.

Matthias described this as a service-oriented architecture, and it is. Be warned, however, that SOA does not imply or require web services. The services here exist in the middle tier. Instead of speaking XML, they mainly use serialized Java objects. As Matthias said, "if you control both ends of the communication, using XML is just crazy!"

They do use SOAP when communicating out to other companies.

They've done a couple of interesting things. They favor asynchronous communication, which makes sense when you architect for latency. Where many systems push data into the async messages, FIDUCIA does not. Instead, they put the bulk data into storage (usually files, sometimes structured data) and send control messages instructing the middle tier to process the records. This way, large files can be split up and processed in parallel by a number of the processing nodes. Obviously, this works when records are highly independent of each other.
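
Matthias didn't show code for this, so what follows is only my own sketch of the idea: the message carries a pointer to a slice of the stored data, never the data itself. The class names, the fields, and the file path are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ControlMessageSketch {
    // A control message points at a slice of the bulk data instead of carrying it.
    static final class ControlMessage {
        final String location;   // where the bulk data lives (hypothetical path)
        final long firstRecord;  // index of the first record in this slice
        final long recordCount;  // number of records one worker should process

        ControlMessage(String location, long firstRecord, long recordCount) {
            this.location = location;
            this.firstRecord = firstRecord;
            this.recordCount = recordCount;
        }
    }

    // Cut one large file into independent slices, one control message per slice.
    static List<ControlMessage> split(String location, long totalRecords, long sliceSize) {
        if (sliceSize <= 0) {
            throw new IllegalArgumentException("sliceSize must be positive");
        }
        List<ControlMessage> messages = new ArrayList<>();
        for (long start = 0; start < totalRecords; start += sliceSize) {
            long count = Math.min(sliceSize, totalRecords - start);
            messages.add(new ControlMessage(location, start, count));
        }
        return messages;
    }

    public static void main(String[] args) {
        for (ControlMessage m : split("/data/batch.dat", 10, 4)) {
            System.out.println(m.location + " from " + m.firstRecord + " count " + m.recordCount);
        }
    }
}
```

Each message can go to a different processing node, which is exactly why the records need to be independent of each other.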

Second, they have defined explicit rules and regulations about when to expect transactional integrity. There are enough restrictions that these are a minority of transactions. In all other cases, developers are required to design for the fact that ACID properties do not hold.

Third, they've built a multi-layered middle tier. Incoming requests first hit a pair of "Central Process Servers" which inspect the request. Requests are dispatched to individual "portals" based on their customer ID. Different portals will run different versions of the software, so FIDUCIA supports customers with different production versions of their software. Instead of attempting to combine versions on a single cluster, they just partition the portals (clusters).

Each portal has its own load distribution mechanism, using work queues that the worker nodes listen to.

This multilevel structure lets them scale to over 1,000 nodes while keeping each cluster small and manageable.
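
The dispatch step can be pictured as a simple lookup from customer ID to portal. This is a sketch under my own assumptions, not FIDUCIA's code: the class name, the portal names, and the fallback to a default portal are all invented.

```java
import java.util.HashMap;
import java.util.Map;

class PortalRouterSketch {
    private final Map<String, String> portalByCustomer = new HashMap<>();
    private final String defaultPortal; // assumption: unknown customers go to a default portal

    PortalRouterSketch(String defaultPortal) {
        this.defaultPortal = defaultPortal;
    }

    // Pin a customer to a portal, and so to the software version that portal runs.
    void assign(String customerId, String portal) {
        portalByCustomer.put(customerId, portal);
    }

    // What the front-end servers do after inspecting a request: pick the portal.
    String route(String customerId) {
        return portalByCustomer.getOrDefault(customerId, defaultPortal);
    }

    public static void main(String[] args) {
        PortalRouterSketch router = new PortalRouterSketch("portal-current");
        router.assign("bank-042", "portal-previous");
        System.out.println(router.route("bank-042"));
        System.out.println(router.route("bank-777"));
    }
}
```

Moving a customer to a new software version is then a change to one table entry, rather than a redeployment on a shared cluster.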

The net result is that they can process up to 2,500 transactions per second, with no scaling limit in sight.
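
A quick back-of-the-envelope check ties these numbers together. The only inputs are the figures quoted above. The banking-day length is derived from them and is not something Matthias stated, and the class and method names are mine.

```java
class ThroughputSketch {
    // Hours implied by a daily total and an hourly rate.
    static double impliedHours(long perDay, long perHour) {
        return (double) perDay / perHour;
    }

    // Average transactions per second at a given hourly rate.
    static double perSecond(long perHour) {
        return perHour / 3600.0;
    }

    // Hourly volume if a per-second rate were sustained for a full hour.
    static long perHourAtRate(long perSecond) {
        return perSecond * 3600;
    }

    public static void main(String[] args) {
        System.out.println(impliedHours(35_000_000L, 3_500_000L)); // implied banking day, in hours
        System.out.println(perSecond(3_500_000L));                 // average rate per second
        System.out.println(perHourAtRate(2_500L));                 // peak rate expressed per hour
    }
}
```

So the banking day works out to about ten hours, the average load is a bit under 1,000 transactions per second, and the 2,500 per second ceiling corresponds to 9 million per hour, which leaves headroom of roughly two and a half times the average.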

Project Hydrazine

2008-05-06T11:56:16-05:00

Part of Sun's push behind JavaFX will be called "Project Hydrazine". (Hydrazine is a toxic and volatile rocket fuel.) This is still a bit fuzzy, and they only left the boxes-and-arrows slide up for a few seconds, but here's what I was able to glean.

Hydrazine includes common federated services for discovery, personalization, deployment, location, and development. There's a "cloud" component to it, which wasn't entirely clear from their presentation. Overall, the goal appears to be an easier model for creating end-user applications based on a service component architecture. All tied together and presented with JavaFX, of course.

One very interesting extension---called "Project Insight"---that Rich Green and Jonathan Schwartz both discussed is the ability to instrument your applications to monitor end-user activity in your apps.

(This immediately reminded me of Valve's instrumentation of Half-Life 2, episode 2. The game itself reports back to Valve on player stats: time to complete levels, map locations where they died, play time and duration, and so on. Valve has previously talked about using these stats to improve their level design by finding out where players get frustrated, or quit, and redesigning those levels.)

I can see this being used well: making apps more usable, proactively analyzing what features users appreciate or don't understand, and targeting development effort at improving the overall experience.

Of course, it can also be used to target advertising and monitor impressions and clicks. Rich promoted this as the way to monetize apps built using Project Hydrazine. I can see the value in it, but I'm also ambivalent about creating even more channels for advertising.

In any event, users will be justifiably anxious about their TV watching them back. It's just a little too Max Headroom for a lot of people. Sun says that the data will only appear in the aggregate. This leads me to believe that the apps will report to a scalable, cloud-based aggregation service from which developers can get the aggregated data. Presumably, this will be run by Sun.

Unlike Apple's iron-fisted control over iPhone application delivery, Sun says they will not be exercising editorial control. According to Schwartz, Hydrazine will all be free: free in price, freely available, and free in philosophy.

JavaOne: After the Revolution

2008-05-06T11:37:19-05:00

What happens to the revolutionaries, once they've won?

It's been about ten years since I last made the pilgrimage to JavaOne, back when Java was still being called an "emerging technology".

Many things have changed since then. Java is now so mainstream that the early adopters are getting itchy feet and looking hard for the next big thing. (The current favorite is some flavor of dynamic language running on the JVM: Groovy, Scala, JRuby, Jython, etc.) Java, the language, has found a home inside large enterprises and their attendant consultancies and commoditized outsourcers.

We just heard Sun say that Java SE is on 91% of all PCs and laptops, 85% of mobile phones, and 100% of all Blu-Ray players. It's safe to say that the revolution is over. We won.

A couple of things haven't changed about JavaOne in the last ten years.

The crowds in Moscone are still completely absurd. There aren't lines, so much as there are tides. People ebb and flow like a non-Newtonian fluid.

Sun still keeps a tight rein on the Message. (This control is one of the major tensions between Sun and the broader Java community.) This year, Sun's focus is clearly on JavaFX. The leading keynote talked repeatedly about "all the screens of your life" and said that the JavaFX runtime will be the access layer to reach your content from any device anywhere. We also heard about JavaFX's animation, 3D, audio, and video capabilities.

Glassfish got a brief mention. Version 3 is supposed to have a new kernel that slims down to 98KB in its minimal deployment. Add-on modules provide HTTP service, SIP service, and so on. Rich Green said that Glassfish will scale up to the data center and down to set top boxes.

Perhaps it's just my perspective, since I'm mostly a server-side developer, but I had the oddest sense of deja-vu. Instead of Rich Green in 2008, I felt the strange sense that I was listening to Scott McNealy in 1998. Same message: Java from the handset to the data center. Set top boxes. Headspace for audio. (Anyone else remember Thomas Dolby at the keynote? This year we got Neil Young.)

So, here we are, at the 13th JavaOne, and Sun is still trying to get developers to see Java as more than a server-side platform.

Well, the more things change, the more they stay the same, I suppose.

Who Ordered That?

2008-05-05T18:29:57-05:00

Yesterday, I let myself get optimistic about what Jonathan Schwartz coyly hinted about over the weekend.

The actual announcement came today. OpenSolaris will be available on EC2. Honestly, I'm not sure how relevant that is. Are people actually demanding Solaris before they'll support EC2?

There is a message here for Microsoft, though. The only sensible license cost for a cloud-based platform is $0.00 per instance.

Addendum

I said that OpenSolaris would be available on EC2. Looks like I should have used the present tense, instead.

$ ec2-describe-images -a | grep -i solaris
IMAGE	ami-8946a3e0	opensolaris.thoughtworks.com/opensolaris-mingle-2_0_8540-64.manifest.xml	089603041495	available	public		x86_64	machine	aki-ab3cd9c2	ari-2838dd41

Yep, ThoughtWorks already has an OpenSolaris image configured as a Mingle server.

(I've said it before, but there's just no need to pay money for development infrastructure any more. Conversely, there's no excuse for any development team to run without version control, automated builds, and continuous integration.)

Sun to Emerge from Behind in the Clouds?

2008-05-04T20:48:25-05:00

Nobody can miss the dramatic proliferation of cloud computing platforms and initiatives over the last couple of years. All through the last year, Sun has remained oddly silent on the whole thing. There is a clear, natural synergy between Linux, commodity x86 hardware, and cloud computing. Sun is conspicuously absent from all of those markets. Sun clearly needs to regain relevance in this space.

On the one hand, Project Caroline now has its own website. Anybody can create an account that allows forum reading, but don't count on getting your hands on hardware unless you've got an idea that Sun approves of.

Apart from that, Om Malik reports that we may see a joint announcement Monday morning from Sun and Amazon.

I suspect that the announcement will look something like this:

Based on AWS for accounts, billing, storage, and infrastructure
Java-based application deployment into a Sun grid container
AWS to handle load balancing, networking, etc.

In other words: it will look a lot like Project Caroline and the Google App Engine, running Java applications using Sun containers on top of AWS.

Agile IT! Experience

2008-04-23T14:53:10-05:00

On June 26-28, 2008, I'll be speaking at the inaugural Agile IT! Experience symposium in Reston, VA. Agile ITX is about consistently delivering better software. It's for development teams and management, working and learning together.

It's a production of the No Fluff, Just Stuff symposium series. Like all NFJS events, attendance is capped, so be sure to register early.

From the announcement email:

The central theme of the Agile ITX conference (www.agileitx.com) is to help your development team/management consistently deliver better software. We'll focus on the entire software development life cycle, from requirements management to test automation to software process. You'll learn how to Develop in Iterations, Collaborate with Customers, and Respond to Change. Software is a difficult field with high rates of failure. Our world-class speakers will help you implement best practices, deal with persistent problems, and recognize opportunities to improve your existing practices.

Dates: June 26-28, 2008

Location: Sheraton Reston

Attendance: Developers/ Technical Management

Sessions at Agile ITX will cover topics such as:

Continuous Integration (CI)
Test Driven Development (TDD)
Testing Strategies, Team Building
Agile Architecture
Dependency Management
Code Metrics & Analysis
Acceleration & Automation
Code Quality

Agile ITX speakers are successful leaders, authors, mentors, and trainers who have helped thousands of developers create better software. You will have the opportunity to hear and interact with:

Jared Richardson - co-author of Ship It!
Michael Nygard - author of Release It!
Johanna Rothman - author of Manage It!
Esther Derby - co-author of Behind Closed Doors: Secrets of Great Management
Venkat Subramaniam - co-author of Practices of an Agile Developer
David Hussman - Agility Instructor/Mentor
Andrew Glover - co-author of Continuous Integration
J.B. Rainsberger - author of JUnit Recipes
Neal Ford - Application Architect at ThoughtWorks
Kirk Knoernschild - contributor to The Agile Journal
Chris D'Agostino - CEO of Near Infinity
David Bock - Principal Consultant with CodeSherpas
Mark Johnson - Director of Consulting at CGI
Ryan Shriver - Managing Consultant with Dominion Digital
John Carnell - IT Architect at Thrivent Financial
Scott Davis - Testing Expert

Amazon Blows Away Objections

2008-04-14T07:59:31-05:00

Amazon must have been burning more midnight oil than usual lately.

Within the last two weeks, they've announced three new features that basically eliminate any remaining objections to their AWS computing platform.

Elastic IP Addresses

Elastic IP addresses solve a major problem on the front end. When an EC2 instance boots up, the "cloud" assigns it a random IP address. (Technically, it assigns two: one external and one internal. For now, I'm only talking about the external IP.) With a random IP address, you're forced to use some kind of dynamic DNS service such as DynDNS. That lets you update your DNS entry to connect your long-lived domain name with the random IP address.

Dynamic DNS services work pretty well, but not universally well. For one thing, there is a small amount of delay. Dynamic DNS works by setting a very short time-to-live (TTL) on the DNS entries, which instructs intermediate DNS servers to cache the entry only for a few minutes. When that works well, you still have a few minutes of downtime when you need to reassign your DNS name to a new IP address. For some parts of the Net, dynamic DNS doesn't work well, usually when some ISP doesn't respect the TTL on DNS entries, but caches them for a longer time.

Elastic IP addresses solve this problem. You request an elastic IP address through a Web Services call. The easiest way is with the command-line API:

$ ec2-allocate-address
ADDRESS 75.101.158.25

Once the address is allocated, you own it until you release it. At this point, it's attached to your account, not to any running virtual machine. Still, this is good enough to go update your domain registrar with the new address. After you start up an instance, then you can attach the address to the machine. If the machine goes down, then the address is detached from that instance, but you still "own" it.

So, for a failover scenario, you can reassign the elastic IP address to another machine, leave your DNS settings alone, and all traffic will now come to the new machine.

Now that we've got elastic IPs, there's just one piece missing from a true HA architecture: load distribution. With just one IP address attached to one instance, you've got a single point of failure (SPOF). Right now, there are two viable options to solve that. First, you can allocate multiple elastic IPs and use round-robin DNS for load distribution. Second, you can attach a single elastic IP address to an instance that runs a software load balancer: pound, nginx, or Apache+mod_proxy_balancer. (It wouldn't surprise me to see Amazon announce an option for load-balancing-in-the-cloud soon.) You'd run two of these, with the elastic IP attached to one at any given time. Then, you need a third instance monitoring the other two, ready to flip the IP address over to the standby instance if the active one fails. (There are already some open-source and commercial products to make this easy, but that's the subject for another post.)
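
The monitoring instance's job is easy to sketch. The code below is only an illustration of the flip logic. `AddressBinder` is a hypothetical interface standing in for whatever call actually reassigns the elastic IP, and the health check is passed in as a plain predicate rather than a real probe. The instance names are made up.

```java
import java.util.function.Predicate;

class ElasticIpFailoverSketch {
    // Hypothetical stand-in for the call that reattaches the address to an instance.
    interface AddressBinder {
        void bind(String address, String instanceId);
    }

    private final String address;
    private final String first;
    private final String second;
    private String active;

    ElasticIpFailoverSketch(String address, String first, String second) {
        this.address = address;
        this.first = first;
        this.second = second;
        this.active = first; // assumption: the address starts out on the first balancer
    }

    // One monitoring pass: if the active balancer is down and the other is up, flip the address.
    String check(Predicate<String> isHealthy, AddressBinder binder) {
        if (!isHealthy.test(active)) {
            String other = active.equals(first) ? second : first;
            if (isHealthy.test(other)) {
                binder.bind(address, other);
                active = other;
            }
        }
        return active;
    }

    public static void main(String[] args) {
        ElasticIpFailoverSketch monitor =
                new ElasticIpFailoverSketch("75.101.158.25", "lb-a", "lb-b");
        // Simulate lb-a failing: only lb-b reports healthy.
        String nowActive = monitor.check(
                id -> id.equals("lb-b"),
                (addr, id) -> System.out.println("moving " + addr + " to " + id));
        System.out.println("active: " + nowActive);
    }
}
```

DNS never changes in this scheme. The address stays put and only its attachment moves, which is the whole point of owning the IP at the account level.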

Availability Zones

The second big gap that Amazon closed recently deals with geography.

In the first rev of EC2, there was absolutely no way to control where your instances were running. In fact, there wasn't any way inside the service to even tell where they were running. (You had to resort to pingtracing or geomapping of the IPs.) This presents a problem if you need high availability, because you really want more than one location.

Availability Zones let you specify where your EC2 instances should run. You can get a list of them through the command-line (which, let's recall, is just a wrapper around the web services):

$ ec2-describe-availability-zones
AVAILABILITYZONE    us-east-1a    available
AVAILABILITYZONE    us-east-1b    available
AVAILABILITYZONE    us-east-1c    available

Amazon tells us that each availability zone is built independently of the others. That is, they might be in the same building or separate buildings, but they have their own network egress, power systems, cooling systems, and security. Beyond that, Amazon is pretty opaque about the availability zones. In fact, not every AWS user will see the same availability zones. They're mapped per account, so "us-east-1a" for me might map to a different hardware environment than it does for you.

How do they come into play? Pretty simply, as it turns out. When you start an instance, you can specify which availability zone you want to run it in.
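
If you want instances spread across zones for availability, the simplest policy is round-robin placement. This sketch only computes which zone each instance should ask for. It makes no AWS calls, and the class and method names are my own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ZoneSpreadSketch {
    // Assign instance i to zone (i mod number of zones), so no single zone holds everything.
    static List<String> placements(List<String> zones, int instanceCount) {
        if (zones.isEmpty()) {
            throw new IllegalArgumentException("need at least one availability zone");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < instanceCount; i++) {
            result.add(zones.get(i % zones.size()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> zones = Arrays.asList("us-east-1a", "us-east-1b", "us-east-1c");
        System.out.println(placements(zones, 4));
    }
}
```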

Combine these two features, and you get a bunch of interesting deployment and management options.

Persistent Storage

Storage has been one of the most perplexing issues with EC2. Simply put, anything you stored to disk while your instance was running would be lost when you restart the instance. Instances always go back to the bundled disk image stored on S3.

Amazon has just announced that they will be supporting persistent storage in the near future. A few lucky users get to try it out now, in its pre-beta incarnation.

With persistent storage, you can allocate space in chunks from 1 GB to 1 TB. That's right, you can make one web service call to allocate a freaking terabyte! Like IP addresses, storage is owned by your account, not by an individual instance. Once you've started up an instance---say a MySQL server, for example---you attach the storage volume to it. To the virtual machine, the storage looks just like a device, so you can use it raw or format it with whatever filesystem you want.

Best of all, because this is basically a virtual SAN, you can do all kinds of SAN tricks, like snapshot copies for backups to S3.

Persistent storage done this way obviates some of the other dodgy efforts that have been going on, like FUSE-over-S3, or the S3 storage engine for MySQL.

SimpleDB is still there, and it's still much more scalable than plain old MySQL data storage, but we've got scores of libraries for programming with relational databases, and very few that work with key-value stores. For most companies, and for the foreseeable future, programming to a relational model will be the easiest thing to do. This announcement really lowers the barrier to entry even further.

With these announcements, Amazon has cemented AWS as a viable computing platform for real businesses.

Geography Imposes Itself On the Clouds

2008-04-09T07:54:04-05:00

In a comment to my last post, gvwilson asks, "Are you aware that the PATRIOT Act means it's illegal for companies based in Ontario, BC, most European jurisdictions, and many other countries to use S3 and similar services?"

This is another interesting case of the non-local networked world intersecting with real geography. Not surprisingly, it quickly becomes complex.

I have heard some of the discussion about S3 and the interaction between the U.S. PATRIOT act and the EU and Canadian privacy laws. I'm not a lawyer, but I'll relate the discussion for other readers who haven't been tracking it.

Canada and the European Union have privacy laws that lean toward their citizens, and are quite protective of them. In the U.S., where laws are written about privacy at all, they are heavily biased in favor of large data-collecting corporations, such as credit rating agencies. A key provision of the privacy laws in Canada and the EU is that companies cannot transmit private data to any jurisdiction that lacks substantially similar protections. It's kind of like the "incorporation" clause in the GPL that way.

In the U.S., particularly with respect to the USA PATRIOT act, companies are required to turn over private customer data to a variety of government agencies. In some cases, they are required to do this even without a search warrant or court order. These are pretty much just fishing expeditions; casting a broad net to see if you catch anything. Therefore, the EU/Canadian privacy laws judge that the U.S. does not have substantially similar privacy protections, and companies in those covered nations are barred from exporting, transmitting, or storing customer data in any U.S. location where they might be subject to PATRIOT act search.

(Strictly speaking, this is not just a PATRIOT act problem. It also relates to RICO and a wide variety of other U.S. laws, mostly aimed at tracking down drug dealers by their banking transactions.)

Enter S3. S3 is built to be a geographically-replicated distributed storage mechanism! There is no way even to figure out where the individual bits of your data are physically located. Nor is there any way to tell Amazon what legal jurisdictions your data can, or must, reside in. This is a big problem for personal customer data. It's also a problem that Amazon is aware they must solve. For EC2, they recently introduced Availability Zones that let you define what geographic location your virtual servers will exist in. I would expect to see something similar for S3.

This would also appear to be a problem for EU and Canadian companies using Google's AppEngine. It does not offer any way to confine data to specific geographies, either.

Does this mean it's illegal for Canadian companies to use S3? Not in general. Web pages, software downloads, media files... these would all be allowed. Just stay away from the personal data.

Suggestions for a 90-minute app

2008-04-08T21:10:44-05:00

Some of you know my obsession with Lean, Agile, and ToC. Ideas are everywhere. Idea is nothing. Execution is everything.

In that vein, one of my No Fluff, Just Stuff talks is called "The 90 Minute Startup". In it, I build a real, live dotcom site during the session. You can't get a much shorter time-to-market than 90 minutes, and I really like that.

In case you're curious, I do it through the use of Amazon's EC2 and S3 services.

The app I've used for the past couple of sessions is a quick and dirty GWT app that implements a Net Promoter Score survey about the show itself. It has a little bit of AJAX-y stuff to it, since GWT makes that really, really simple. On the other hand, it's not all that exciting as an application. It certainly doesn't make anyone sit up and go "Wow!"

So, anyone want to offer up a suggestion for a "Wow!" app they'd like to see built and deployed in 90 minutes or less? Since this is for a talk, it should be about the size of one user story. I doubt I'll be taking live requests from the audience during the show, but I'm happy to take suggestions here in the comments.

(Please note: thanks to the pervasive evil of blog comment spam, I moderate all comments here. If you want to make a suggestion, but don't want it published, just make a note of that in the comment.)

Google's AppEngine Appears, Disappoints

2008-04-08T09:37:02-05:00

Google finally got into the cloud infrastructure game, announcing their Google AppEngine. As rumored, AppEngine opens parts of Google's legendary scalable infrastructure for hosted applications.

AppEngine is in beta, with only 10,000 accounts available. They're already long gone, but you can download the SDK and run a local container.

Here are some quick pros and cons:

Pro

Dynamically scalable
Good lifecycle management
Quota-based management for cost containment

Con

Python apps only
You deploy code, not virtual machines
Web apps only

At this point, I'm a bit underwhelmed. Essentially, they're providing a virtual scalable app runtime, but not a generalized computing platform. (Similar to Sun's Project Caroline.) Access to the really cool Google features, like GFS, is through Python APIs that Google provides.

If you fit Google's profile of a Python-based Web application developer, this could be a very fast path to market with dynamic scalability. Still, I think I'm going to stick with Amazon Web Services, instead.

Reality

2008-03-26T22:51:58-05:00

OmniFocus Coming to the iPhone

2008-03-18T23:37:25-05:00

Over the last six months, I've grown thoroughly dependent on OmniFocus. It's a "Getting Things Done" application that lets me juggle more projects, personal and professional, than I ever thought I could.

Now, Omni says they're going to bring OmniFocus to the iPhone. So far, the iPhone hasn't compelled me, but I think that will be the trigger.

Release It has won a Jolt Productivity award

2008-03-06T21:55:10-06:00

It's an honor and a thrill for me to report that Release It received a Jolt Productivity award!

Steve Jobs Made Me Miss My Flight

2008-03-06T13:25:16-06:00

Or: On my way to San Jose.

On waking, I reach for my blackberry. It tells me what city I'm in; the hotel rooms offer no clues. Every Courtyard by Marriott is interchangeable. Many doors into the same house. From the size of my suitcase, I can recall the length of my stay: one or two days, the small bag. Three or four, the large. Two bags means more than a week.

CNBC, shower, coffee, email. Quick breakfast, $10.95 (except in California, where it's $12.95. Another clue.)

Getting there is the worst part. Flying is an endless accumulation of indignities. Airlines learned their human factors from hospitals. I've adapted my routine to minimize hassles.

Park in the same level of the same ramp. Check in at the less-used kiosks in the transit level. Check my bag so I don't have to fuck around with the overhead bins. I'd rather dawdle at the carousel than drag the thing around the terminal anyway.

Always the frequent flyer line at the security checkpoint. Sometimes there's an airline person at the entrance of that line to check my boarding pass, sometimes not. An irritation. I'd rather it was always, or never. Sometimes means I don't know if I need my boarding pass out or not.

Same words to the TSA agent. Standard responses. "Doing fine," whether I am or not. Same belt. It's gone through the metal detector every time. I don't need to take it off.

Only... today, something is different. Instead of my bags trundling through the x-ray machine, she stops the belt. Calls over another agent, a palaver. Another agent flocks to the screen. A gabble, a conference, some consternation.

They pull my laptop, my new laptop making its first trip with me, out of the flow of bags. One takes me aside to a partitioned cubicle. Another of the endless supply of TSA agents takes the rest of my bags to a different cubicle. No yellow brick road here, just a pair of yellow painted feet on the floor, and my flight is boarding. I am made to understand that I should stand and wait. My laptop is on the table in front of me, just beyond reach, like I am waiting to collect my personal effects after being paroled.

I'm standing, watching my laptop on the table, listening to security clucking just behind me. "There's no drive," one says. "And no ports on the back. It has a couple of lines where the drive should be," she continues.

A younger agent joins the crew. I must now be occupying ten, perhaps twenty, percent of the security force. At this checkpoint anyway. There are three score more at the other five checkpoints. The new arrival looks at the printouts from x-ray, looks at my laptop sitting small and alone. He tells the others that it is a real laptop, not a "device". That it has a solid-state drive instead of a hard disc. They don't know what he means. He tries again, "Instead of a spinning disc, it keeps everything in flash memory." Still no good. "Like the memory card in a digital camera." He points to the x-ray, "Here. That's what it uses instead of a hard drive."

The senior agent hasn't been trained for technological change. New products on the market? They haven't been TSA approved. Probably shouldn't be permitted. He requires me to open the "device" and run a program. I do, and despite his inclination, the lead agent decides to release me and my troublesome laptop. My flight is long gone now, so I head for the service center to get rebooked.

Behind me, I hear the younger agent, perhaps not realizing that even the TSA must obey TSA rules, repeating himself.

"It's a MacBook Air."

The Granularity Problem

2008-02-20T19:24:16-06:00

I spend most of my time dealing with large sites. They're always hungry for more horsepower, especially if they can serve more visitors with the same power draw. Power goes up much faster with more chassis than with more CPU cores. Not to mention, administrative overhead tends to scale with the number of hosts, not the number of cores. For them, multicore is a dream come true.

I ran into an interesting situation the other day, on the other end of the spectrum.

One of my team was working with a client that had relatively modest traffic levels. They're in a mature industry with a solid, but not rabid, customer base. Their web traffic needs could easily be served by one Apache server running one CPU and a couple of gigs of RAM.

The smallest configuration we could offer, and still maintain SLAs, was two hosts, with a total of 8 CPU cores running at 2 GHz, 32 gigs of RAM, and 4 fast Ethernet ports.

Of course that's oversized! Of course it's going to cost more than it should! But at this point in time, if we're talking about dedicated boxes, that's the smallest configuration we can offer! (Barring some creative engineering, like using fully depreciated "classics" hardware that's off its original lease, but still has a year or two before EOL.)

As CPUs get more cores, the minimum configuration is going to become more and more powerful. The quantum of computing is getting large.

Not every application will need it, and that's another reason I think private clouds make a lot of sense. Companies can buy big boxes, then allocate them to specific applications in fractions. They gain cost efficiency in administration, power, and space consumption (though not heat production!) while still letting business units optimize their capacity downward to meet their actual demand.

Sun Joining the Cloud Crowd

2008-02-20T19:11:19-06:00

As I was writing my last post, I somehow missed the news that Sun is building their own cloud platform, called Project Caroline.

There's a PDF about it. It appears to be a presentation for JavaOne. It may be locked down at any minute, so the link might not work by the time you read this.

Caroline looks a lot like Amazon EC2, but with some very nice control over VLANs (I suppose they would be Virtual VLANs?), load balancing policies, and DNS... all things that EC2 lacks today. ZFS instead of S3, that will make for a more familiar storage model. No trickery needed to make data persist across restarts.

All in all, it looks very nice.

(Hmmm. On second glance, this presentation is from JavaOne 2007! Not much of a scoop there, Reg.)

Does anyone know what happened to this project?

A Cloud For Everyone

2008-02-20T18:17:57-06:00

The trajectory of many high-tech products looks like this:

Very expensive. Only a few exist in the world. They are heavily time-shared, and usually oversubscribed.
Within the reach of institutions and corporations, but not individuals. The organization wants to maximize utilization.
Corporations own many, as productivity enhancers, some wealthy or forward-looking individuals own one. Families time share theirs.
Virtually everyone has one. To lack one is to fall behind. No longer a competitive advantage, the lack of the technology puts one at a disadvantage.
Invisibility. Most people have or use several, but are not aware of it.

Depending on your age, you might have been thinking "cell phones", "computers", or even "televisions". I don't think I have any blog readers old enough to have been thinking "telephones", "telegraphs", or "electric motors", but they all went through the same stages, too.

I feel very comfortable putting "cloud computing" in that list, too. Cloud computing is at stage 1. It's expensive enough that there are a few in the world: Amazon AWS, Mosso, BungeeConnect, even Force.com. They're shared, multitenant, and soon to be oversubscribed.

One day, I suspect that we'll each have our own computing cloud attending us, formed out of the many computing devices that surround us every day, but I'm getting ahead of myself.

Before that, we'll see enterprises, first large then medium and small, building their own computing clouds.

"Wait a minute," you object. "That misses the whole point of cloud computing. The entire purpose is to not own the infrastructure."

That's true, today. It was also true, at one time, that farmers did not want to own their own steam engines. So, they outsourced the job. Farmers would own machines like threshers that had everything except the troublesome boiler and engine. Those required technical expertise to run, so the farmers left that job up to folks who would bring their steam engine around, hook it up to the thresher, and charge the farmer for the length of time he needed it. As steam engines got cheaper and safer, they eventually got built right into the thresher.

This next part may sound like FUD. It isn't. I like cloud computing. I like virtualization. In fact, I think it's about to revolutionize our industry.

I like it so much that I think every company should have one.

Why should a company build its own cloud, instead of going to one of the providers? Several reasons, some positive, some not so much.

On the positive side, an IT manager running a cloud can finally do real chargebacks to the business units that drive demand. Some do today, but on a larger-grained level... whole servers. With a private cloud, the IT manager could charge by the compute-hour, or by the megabit of bandwidth. He could charge for storage by the gigabyte, and with tiered rates for different availability/continuity guarantees. Even better, he could allow the business units to do the kind of self-service that I can do today with a credit card and The Planet. (OK, The Planet isn't a cloud provider, but I bet they're thinking about it. Plus, I like them.)

I actually think this kind of self-service and fine-grained chargeback could help curb the out-of-control growth in IT spending, but that's a different post.

This would seriously raise the level of discourse. Instead of fighting about server classes, rack space, power consumption, and rampant storage sprawl, IT could talk to the business about levels of service. Does this app need 24x7 performance management with automatic resource allocation to maintain a 2 second response time? Great, we can do that! This other one doesn't need to be fast, but it had better work every single time a transaction goes through? We can do that, too! This application needs user experience monitoring, that database only needs non-redundant storage, because it can be recreated from other sources... it's a better conversation to have than, "No, our corporate standard is WebSphere running on RedHat Enterprise Linux 4, with Dell PowerEdge servers. You can have any server you want, as long as it's a Dell PowerEdge."

I also think that the gloss will come off of the cloud computing providers. (I know, most people still haven't heard of them yet, but the gloss will inevitably come off.)

Accidents happen. Networks still break, today, and they will in the future too. Power failures happen. How would you defend yourself in a shareholders' lawsuit after millions in losses thanks to a service provider failure? (Actually, that suggests there may be an insurance market developing here. Any time you've got quantifiable risk and someone willing to pay to defray that risk, sure as hell, you'll find insurance companies.)

Service providers get oversubscribed. What happens when your application is slow, and remains slow for months? Having an SLA only means you get some money back, it doesn't mean your problem will get fixed. It's a dirty secret that some service providers are quite happy paying out credits, if they can avoid bigger costs. What's your recourse? Transition costs. It costs a lot.

Latency matters. It might matter more today than ever before, since most internal applications have gone to web interfaces. Keeping your endpoints on your own network at least lets you control your own latency.

Then there's security. Many of my clients are dealing with PCI audits and compliance. I have no idea what they'd say if I suggested moving their data into the cloud. I'm pretty certain I wouldn't still be in the room to hear what they said. I'd probably be standing outside in the rain, trying to catch a cab back to the airport.

Like I said, I'm not trying to FUD cloud computing. I think that it's so good that every company should have one.

There's one more reason I think it makes sense to build internal clouds. I'll talk about that in my next post.

Outrunning Your Headlights

2008-02-19T15:13:29-06:00

Agile developers measure their velocity. Most teams define velocity as the number of story points delivered per iteration. Since the size of a "story point" and the length of an iteration vary from team to team, there's not much use in comparing velocity from one team to the next. Instead, the team tracks its own velocity from iteration to iteration.

Tracking velocity has two purposes. The first is estimation. If you know how many story points are left for this release, and you know how many points you complete per iteration, then you know how long it will be until you can release. (This is the "burndown chart".) After two or three iterations, this will be a much better projection of release date than I've ever seen any non-agile process deliver.
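The estimation arithmetic can be sketched in a few lines of Java. This is only an illustration: the class name, method names, and the sample numbers are all made up, and it assumes velocity is taken as a plain average over the iterations observed so far.

```java
import java.util.List;

// Hypothetical helper: projects the remaining iterations from observed velocity.
class BurndownProjection {
    // Average story points completed per iteration over the observed history.
    static double averageVelocity(List<Integer> pointsPerIteration) {
        return pointsPerIteration.stream()
                .mapToInt(Integer::intValue)
                .average()
                .orElseThrow(() -> new IllegalArgumentException("need at least one iteration"));
    }

    // Iterations still needed, rounded up because a partly used iteration still has to run.
    static int iterationsRemaining(int pointsLeft, List<Integer> pointsPerIteration) {
        double velocity = averageVelocity(pointsPerIteration);
        if (velocity <= 0) {
            throw new IllegalArgumentException("velocity must be positive");
        }
        return (int) Math.ceil(pointsLeft / velocity);
    }

    public static void main(String[] args) {
        // Made-up numbers: 120 points left, the last three iterations delivered 18, 22, and 20.
        System.out.println(iterationsRemaining(120, List.of(18, 22, 20)));
    }
}
```

Multiply the result by the iteration length to get a projected release date.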

The second purpose of velocity tracking is to figure out ways to go faster.

In the iteration retrospective, a team will first recalibrate its estimating technique, to see if they can actually estimate the story cards or backlog items. Second, they'll look at ways to accomplish more during an iteration. Maybe that's refactoring part of the code, or automating some manual process. It might be as simple as adding templates to the IDE for commonly recurring code patterns. (That should always raise a warning flag, since recurring code patterns are a code smell. Some languages just won't let you completely eliminate it, though. And by "some languages" here, I mainly mean Java.)

Going faster should always be better, right? That means the development team is delivering more value for the same fixed cost, so it should always be a benefit, shouldn't it?

I have an example of a case where going faster didn't matter. To see why, we need to look past the boundaries of the development team. Developers often treat software requirements as if they come from a sort of ATM; there's an unlimited reserve of requirements and we just need to decide how many of them to accept into development.

Taking a cue from "Lean Software Development", though, we can look at the end-to-end value stream. The value stream is drawn from the customer's perspective. Step by step, the value stream map shows us how raw materials (requirements) are turned into finished goods. "Finished goods" does not mean code. Code is inventory, not finished goods. A finished good is something a customer would buy. Customers don't buy code. On the web, customers are users, interacting with a fully deployed site running in production. For shrink-wrapped software, customers buy a CD, DVD, or installer from a store. Until the inventory is fully transformed into one of these finished goods, the value stream isn't done.

Figure 1 shows a value stream map for a typical waterfall development process. This process has an annual funding cycle, so "inventory" from "suppliers" (i.e., requirements from the business unit) wait, on average, six months to get funded. Once funded and analyzed, they enter the development process. For clarity here, I've shown the development process as a single box, with 100% efficiency. That is, all the time spent in development is spent adding value---as the customer perceives it---to the product. Obviously, that's not true, but we'll treat it as a momentarily convenient fiction. Here, I'm showing a value stream map for a web site, so the final steps are staging and deploying the release.

Figure 1 - Value Stream Map of a Waterfall Process

This is not a very efficient process. It takes 315 business days to go from concept to cash. Out of that time, at most 30% of it is spent adding value. In reality, if we unpack the analysis and development processes, we'll see that efficiency drop to around 5%.

From the "Theory of Constraints", we know that the throughput of any process is limited by exactly one constraint. An easy way to find the constraint is by looking at queue sizes. In an unoptimized process, you almost always find the largest queue right before the constraint. In the factory environment that ToC came from, it's easy to see the stacks of WIP (work in progress) inventory. In a development process, WIP shows up in SCR systems, requirements spreadsheets, prioritization documents, and so on.
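As a rough sketch of the "look at the queue sizes" heuristic, here it is in Java. The stage names loosely follow the waterfall example, but the queue counts are invented for illustration, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: guesses the constraint from the work queued in front of each stage.
class ConstraintFinder {
    // Keys are stage names in process order; values are work items waiting in front of that stage.
    // Returns the stage with the largest input queue, or null if the map is empty.
    static String likelyConstraint(Map<String, Integer> queueBeforeStage) {
        String constraint = null;
        int largest = -1;
        for (Map.Entry<String, Integer> entry : queueBeforeStage.entrySet()) {
            if (entry.getValue() > largest) {
                largest = entry.getValue();
                constraint = entry.getKey();
            }
        }
        return constraint;
    }

    public static void main(String[] args) {
        // Invented queue sizes, for illustration only.
        Map<String, Integer> queues = new LinkedHashMap<>();
        queues.put("Funding", 12);
        queues.put("Analysis", 8);
        queues.put("Development and Testing", 40);
        queues.put("Staging", 0);
        queues.put("Deployment", 0);
        System.out.println(likelyConstraint(queues));
    }
}
```

This is only a first guess. In a process that has already been tuned, the queues may not line up with the constraint so neatly.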

Indeed, if we overlay the queues on that waterfall process, as in Figure 2, it's clear that Development and Testing is the constraint. After Development and Testing completes, Staging and Deployment take almost no time and have no queued inventory.

Figure 2 - Waterfall Value Stream, With Queues

In this environment, it's easy to see why development teams get flogged constantly to go faster, produce more, catch up. They're the constraint.

Lean Software Development has ten simple rules to optimize the entire value stream.

ToC says to elevate the constraint and subordinate the entire process to the throughput of the constraint. Elevating the constraint---by either going faster with existing capacity, or expanding capacity---adds to throughput, while running the whole process at the throughput of the constraint helps reduce waste and WIP.

In a certain sense, Agile methods can be derived from Lean and ToC.

All of that, though, presupposes a couple of things:

Development is the constraint.

There's an unlimited supply of requirements.

Figure 3 shows the value stream map for a project I worked on in 2005. This project was to replace an existing system, so at first, we had a large backlog of stories to work on. As we approached feature parity, though, we began to run out of stories. The users had been waiting for this system for so long, that they hadn't given much thought, or at least recent thought, to what they might want after the initial release. Shortly after the second release (a minor bug fix), it became clear that we were actually consuming stories faster than they would be produced.

Figure 3 - Value Stream Map of an Agile Project

On the output side, we ran into the reverse problem. This desktop software would be distributed to hundreds of locations, with over a thousand users who needed to be expert on the software in short order. The internal training group, responsible for creating manuals and computer based training videos, could not keep revising their training modules as quickly as we were able to change the application. We could create new user interface controls, metaphors, and even whole screens much faster than they could create training materials.

Once past the training group, a release had to be mastered and replicated onto installation discs. These discs were distributed to the store locations, where associates would call the operations group for a "talkthrough" of the installation process. Operations had a finite capacity, and could only handle so many installations every day. That set a natural throttle on the rate of releases. At one stage---after I rolled off the project---I know that a release which had passed acceptance testing in October was still in the training group by the following March.

In short, the development team wasn't the constraint. There was no point in running faster. We would exhaust the inventory of requirements and build up a huge queue of WIP in front of training and deployment. The proper response would be to slow down, to avoid the buildup of unfinished inventory. Creating slack in the workday would be one way to slow down, but drawing down the team size would be another perfectly valid response. Another perfectly valid response would be to increase the capacity of the training team. There are other places to optimize the value stream, too. But the one thing that absolutely wouldn't help would be increasing the development team's velocity.

For nearly the entire history of software development, there has been talk of the "software crisis", the ever-widening gap between government and industry's need for software and the rate at which software can be produced. For the first time in that history, agile methods allow us to move the constraint off of the development team.

Software Failure Takes Down Blackberry Services

2008-02-13T08:41:16-06:00

Anyone who's addicted to a Blackberry already knows about Monday's four-hour outage. For some of us, the Blackberry isn't just an electronic leash, it's part of our business operations.

Like cell phones, Blackberries have a huge, hidden infrastructure behind them. Corporate Blackberry Enterprise Servers (BES) relay email, calendar, and contact information through RIM's infrastructure, out through the wireless carriers. It was RIM's own infrastructure that suffered from intermittent failures during the outage.

Data Center Knowledge reports that the outage was caused by a failed software upgrade.

Releases are risky. We use testing and QA to reduce the risk, but every line of new or modified code represents an unknown.

How can we reduce the risk of an upgrade? One way is to roll it out slowly. Companies with widely distributed point-of-sale (POS) systems know this. They never push a release out to every store at once. They start with one or two. If that works, they go up to a larger handful, maybe four to eight. After a couple of days, they'll roll it out to an entire district. It can take a week or more to roll the release out everywhere.

In the interim, there are plenty of checkpoints where the release can be rolled back.

I strongly recommend approaching Web site releases the same way. Roll the new release out to one or two servers in your farm. Let a fraction of your customers into the new release. Watch for performance regressions, capacity problems, and functional errors. Absolutely ensure that you can roll it back if you need to. Once it's "baked" for a while in production, then roll it to the remaining app servers.
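Here is a minimal Java sketch of that rollout loop, under some loud assumptions: the Release interface is hypothetical and stands in for whatever deployment tooling you actually have, the doubling wave sizes are just one possible schedule, and the "bake" period is reduced to a single health check per wave.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a staged rollout with a rollback checkpoint after every wave.
class StagedRollout {
    // Stand-in for real deployment tooling; all three operations are assumptions.
    interface Release {
        void deploy(String server);
        boolean healthy(String server);   // e.g. performance, capacity, and functional checks
        void rollback(String server);
    }

    // Splits the farm into waves: a small first wave, then each later wave twice as large.
    static List<List<String>> planWaves(List<String> servers, int firstWaveSize) {
        if (firstWaveSize < 1) {
            throw new IllegalArgumentException("firstWaveSize must be at least 1");
        }
        List<List<String>> waves = new ArrayList<>();
        int start = 0;
        int size = firstWaveSize;
        while (start < servers.size()) {
            int end = Math.min(start + size, servers.size());
            waves.add(new ArrayList<>(servers.subList(start, end)));
            start = end;
            size *= 2;
        }
        return waves;
    }

    // Deploys wave by wave. If any server in the current wave is unhealthy,
    // every server upgraded so far is rolled back and the rollout stops.
    static boolean rollOut(List<List<String>> waves, Release release) {
        List<String> upgraded = new ArrayList<>();
        for (List<String> wave : waves) {
            for (String server : wave) {
                release.deploy(server);
                upgraded.add(server);
            }
            // Checkpoint: a real rollout would let the wave bake for a while first.
            for (String server : wave) {
                if (!release.healthy(server)) {
                    for (String deployed : upgraded) {
                        release.rollback(deployed);
                    }
                    return false;
                }
            }
        }
        return true;
    }
}
```

The point of the structure is that there is a decision point after every wave, and the rollback path is exercised by the same code that does the rollout.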

This approach demands a few corollaries. First, your database updates have to be structured in a forward-compatible way, and they must always allow for rollback. There can be no irrevocable updates. Second, two versions of your software will be operating simultaneously. That means your integration protocols and static assets have to be able to accommodate both versions. I discuss specific strategies for each of these aspects in Release It.

Finally, an aside: RIM's statement about the outage isn't reflected anywhere on their site. Once again, if what you want is the latest true information about a company, the very last place to find it is the company's own web site.

Tim Ross' C# Circuit Breaker

2008-02-10T17:18:00-06:00

Tim Ross has published his implementation of the Circuit Breaker pattern from Release It, complete with unit tests.

I barely speak C#, so I'm not in any position to review his implementation, but I'm delighted to see it!

The Pragmatic Architect on Security

2008-02-06T11:45:07-06:00

Catching up on some reading, I finally got a chance to read Ted Neward's article "Pragmatic Architecture: Security". It's very good. (Actually, the whole series is pretty good, and I recommend them all. At least as of February 2008... I make no assertions about future quality!)

Ted nails it. I agree with all of the principles he identifies, and I particularly like his advice to "fail securely".

I would add one more, though: Be visible.

After any breach, the three worst questions are always:

How long has this been happening?
How much have we lost?
Why didn't we know about it sooner?

The answers are always, respectively, "Far too long", "We have no idea", and "We didn't expect that exploit". To which the only possible response is, "Well, duh, if you'd expected it, you would have closed the vulnerability."

Successful exploits are always successful because they stay hidden. Are you sure that nobody's in your systems right now, leeching data, stealing credit card numbers, or stealing products? Of course not. For a vivid case in point, google "Kerviel Societe Generale".

While you cannot prove a negative, you can improve your odds of detecting nefarious activity by making sure that everything interesting is logged. (And by "interesting", I mean "potentially valuable".)

There are some pretty spiffy event correlation tools out there these days. They can monitor logs across hundreds of servers and network devices, extracting patterns of anomalous behavior. But, they only work if your application exposes data that could indicate a breach.

For example, you might not be able to log every login attempt, but you probably should log every admin login attempt.

Or, you might consider logging every price change. (I shudder to think about collusion between a merchant with pricing control and an outside buyer. Imagine a 10-minute long sale on laptops: 90% off for 10 minutes only.)

If your internal web service listens on a port, then it should only accept connections from known sources. Whether you enforce that through IPTables, a hardware firewall, or inside the application itself, make sure you're logging refused connections.
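For the "inside the application itself" option, a minimal sketch might look like the following. It uses the standard java.util.logging package; the class name, the logger name "security.audit", and the idea of matching on a plain address string are all assumptions for illustration.

```java
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical allowlist check that makes every refusal visible in the logs.
class SourceAllowlist {
    // Logger name is an assumption; use whatever your log correlation tool watches.
    private static final Logger AUDIT = Logger.getLogger("security.audit");

    private final Set<String> allowedSources;

    SourceAllowlist(Set<String> allowedSources) {
        this.allowedSources = Set.copyOf(allowedSources);
    }

    // Returns true if the caller may connect. Every refusal is logged,
    // so a monitoring tool has something to correlate.
    boolean permit(String sourceAddress, int port) {
        if (allowedSources.contains(sourceAddress)) {
            return true;
        }
        AUDIT.warning("Refused connection from " + sourceAddress + " on port " + port);
        return false;
    }
}
```

The same shape works for admin logins and price changes: record who, what, and when at the moment the interesting event happens.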

Then, of course, once you're logging the data, make sure someone's monitoring it and keeping pattern and signature definitions up to date!

Two Books That Belong In Your Library

2008-01-19T16:50:07-06:00

I seldom plug books---other than my own, that is. I've just read two important books, however, that really deserve your attention.

Concurrency, Everybody's Doing It

The first is "Java Concurrency in Practice" by Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea. I've been doing Java development for close to thirteen years now, and I learned an enormous amount from this fantastic book. For example, I knew what the textbook definition of a volatile variable was, but I never knew why I would actually want to use one. Now I know when to use them and when they won't solve the problem.

Of course, JCP talks about the Java 5 concurrency library at great length. But this is no paraphrasing of the javadoc. (It was Doug Lea's original concurrency utility library that eventually got incorporated into Java, and we're all better off for it.) The authors start with illustrations of real issues in concurrent programming. Before they introduce the concurrency utilities, they explain a problem and illustrate potential solutions. (Usually involving at least one naive "solution" that has serious flaws.) Once they show us some avenues to explore, they introduce some neatly-packaged, well-tested utility class that either solves the problem or makes a solution possible. This removes the utility classes from the realm of "inscrutable magic" and presents them as "something difficult that you don't have to write."

The best part about JCP, though, is the combination of thoroughness and clarity with which it presents a very difficult subject. For example, I always understood about the need to avoid concurrent modification of mutable state. But, thanks to this book, I also see why you have to synchronize getters, not just setters. (Even though assignment to an integer is guaranteed to happen atomically, that isn't enough to guarantee that the change is visible to other threads. The only way to guarantee ordering is by crossing a synchronization barrier on the same lock.)
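To make the getter point concrete, here is a tiny sketch (my own example, not one taken from the book). Both accessors synchronize on the same lock, so a reader is guaranteed to see the most recent completed write.

```java
// Minimal illustration: the getter takes the same lock as the setter.
class SharedCounter {
    private int value;  // guarded by "this"

    // The write happens while holding the lock...
    synchronized void set(int newValue) {
        value = newValue;
    }

    // ...and the read takes the same lock. Without "synchronized" here, another
    // thread could keep seeing a stale value even though the assignment is atomic.
    synchronized int get() {
        return value;
    }
}
```

For a single field like this one, declaring it volatile would also provide the visibility guarantee. The lock becomes necessary once several fields have to change together.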

Blocked Threads are one of my stability antipatterns. I've seen hundreds of web site crashes. Every single one of them eventually boils down to blocked threads somewhere. Java Concurrency in Practice has the theory, practice, and tools that you can apply to avoid deadlocks, live locks, corrupted state, and a host of other problems that lurk in the most innocuous-looking code.

Capacity Planning is Science, Not Art

The second book that I want to recommend today is "Capacity Planning for Web Services". I've had this book for a while. When I first started reading it, I put it down right away thinking, "This is way too basic to solve any real problems." That was a big error.

Capacity Planning may get off to a slow start, but that's only because the authors are both thorough and deliberate. Later in the book, that deliberate pace is very helpful, because it lets us follow the math.

This is the only book on capacity planning I've seen that actually deals with transmission time for HTTP requests and responses. In fact, some of the examples even compute the number of packets that a request or reply will need.

I have objected to some capacity planning books because they assume that every process can be represented by an average. Not this one. In the section on standalone web servers, for example, the authors break files into several classes, then use a weighted distribution of file sizes to compute the expected response time and bandwidth requirements. This is a very real-world approach, since web requests tend toward a bimodal distribution: small HTML, Javascript, and CSS intermixed with large media files and images. (In fact, I plan on using the models in this book to quantify the effect of segregating media files from dynamic pages.)
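The flavor of that calculation can be sketched in Java. To be clear, this is not the book's model, just the simplest weighted-average version of the idea, and the class names and sample workload are invented.

```java
import java.util.List;

// Hypothetical sketch: expected response size and bandwidth from a weighted mix of file classes.
class WorkloadModel {
    // One class of files: the fraction of requests that hit it and its average size in bytes.
    static final class FileClass {
        final String name;
        final double requestFraction;
        final double avgBytes;

        FileClass(String name, double requestFraction, double avgBytes) {
            this.name = name;
            this.requestFraction = requestFraction;
            this.avgBytes = avgBytes;
        }
    }

    // Expected bytes per request: the sum over classes of (fraction of requests * average size).
    static double expectedBytesPerRequest(List<FileClass> classes) {
        double totalFraction = classes.stream().mapToDouble(c -> c.requestFraction).sum();
        if (Math.abs(totalFraction - 1.0) > 1e-9) {
            throw new IllegalArgumentException("request fractions must sum to 1");
        }
        return classes.stream().mapToDouble(c -> c.requestFraction * c.avgBytes).sum();
    }

    // Outbound bandwidth, in bits per second, needed to sustain a given request rate.
    static double requiredBitsPerSecond(List<FileClass> classes, double requestsPerSecond) {
        return expectedBytesPerRequest(classes) * 8 * requestsPerSecond;
    }

    public static void main(String[] args) {
        // Invented bimodal mix: many small text files, a few large media files.
        List<FileClass> mix = List.of(
                new FileClass("html/js/css", 0.75, 8_000),
                new FileClass("media", 0.25, 400_000));
        System.out.println(expectedBytesPerRequest(mix));
        System.out.println(requiredBitsPerSecond(mix, 10));
    }
}
```

Even in this toy version, the large media class dominates the bandwidth number despite being a minority of requests, which is why a single overall average is misleading.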

This is also the only book I've seen that recognizes that capacity limits can propagate both downward and upward through tiers. There's a great example of how doubling the CPU performance in an app tier ends up increasing the demand on the database server, which almost totally nullifies the effect of the CPU upgrade. It also recognizes that all requests are not created equal, and recommends clustering request types by their CPU and I/O demands, instead of averaging them all together.

Nearly every result or abstract law has an example, written in concrete terms, which helps bridge theory and practice.

Both of these books deal with material that easily leads off into clouds of theory and abstraction. (JCP actually quips, "What's a memory model, and why would I want one?") These excellent works avoid the Ivory Tower trap and present highly pragmatic, immediately useful wisdom.

Well Begun Is Half Done

2008-01-15T23:49:56-06:00

How long is your checklist for setting up a new development environment? It might seem like a trivial thing, but setup costs are part of the overall friction in your project. I've seen three page checklists that required multiple downloads, logging in as several users (root and non-root), and hand-typing SQL strings to set up the local database server.

I think the paragon of environment setup is the ubiquitous GNU autoconf system. Anyone familiar with Linux, BSD, or other flavors of UNIX will surely recognize this three-line incantation:

./configure
make
make install

The beauty of autoconf is that it adapts to you. In the open-source world, you can't stipulate one particular set of packages or versions, at least, not if you actually want people to use your software and contribute to your project. In the corporate world, though, it's pretty common to see a project that requires a specific point-point rev of some Jakarta Commons library, but without actually documenting the version.

Then there are different places to put things: inside the project, in source control, or in the system. I recently went back to a project's code base after being away for more than two years. I thought we had done a good job of addressing the environment setup. We included all the deliverable jars in the codebase, so they were all version controlled. But, we decided to keep the development-only jars (like EasyMock, DBUnit, and JUnit) outside the code base. We did use Eclipse variables to abstract out the exact filesystem location, but when I returned to that code base, finding and restoring exactly the right versions of those build-time jars wasn't easy. In retrospect, we should have put the build-time jars under version control and kept them inside the code base.

Yes, I know that version control systems aren't good at versioning binaries like jar files. Who cares? We don't rev the jar files so often that the lack of deltas matters. Putting a new binary in source control when you upgrade from Spring 2.5 to Spring 2.5.1 really won't kill your repository. The cost of the extra disk space is nothing compared to the benefit of keeping your code base self-contained.

Maven users will be familiar with another approach. On a Maven project, you express external dependencies in a project model file. On the first build, Maven will download those dependencies from their "official" archives, then cache them locally. After that, Maven will just use the locally cached jar file, at least until you move your declared dependency to a newer revision. I have nothing against Maven. I know some people who swear by it, and others who swear at it. Personally, I just never got into it.

Then there are JRE extensions. This project uses JAI, which wants to be installed inside the JRE itself. We went along with that, but I was stumped for a while today when I saw hundreds of compile errors even though my Eclipse project's build path didn't show any unresolved dependencies. Of course, when you install JAI inside the JRE, it just becomes part of the Java runtime. That makes it an implicit dependency. I eventually remembered that trick, but it took a while. In retrospect, I wish we had tried harder to bring JAI's jars and native libraries into the code base as an explicit dependency.

Does developer environment setup time matter? I believe it does. It might be tempting to say, "That's a one-time cost, there's no point in optimizing it." It's not really a one-time cost, though. It's one time per developer, every time that developer has to reinstall. My rough observation says that, between migrating to a new workstation, Windows reinstalls, corporate re-imaging, and developer churn, you should expect three to five developer setups per year on an internal project.

For an open-source project, the sky is the limit. Keep in mind that you'll lose potential contributors at every barrier they encounter. Environment setup is the first one.

So, what's my checklist for a good environment setup checklist?

Keep the project self contained. Bring all dependencies into the code base. Same goes for RPMs or third-party installers.
Make sure all JAR files have version numbers in their file names. If the upstream project doesn't build their JAR files with version numbers, go ahead and rename the jars.
Make bootstrap scripts for database actions such as user creation or schema builds.
If you absolutely must embed a dependency on something that lives outside the code base, make your build script detect its location. Don't rely on specific path names.
Don't assume your code base is in any particular filesystem on the build machine.

I'd love to see your own rules for easy development setup.

"Release It" is a Jolt Award Finalist

2008-01-13T15:31:06-06:00

The Jolt Awards have been described as "the Oscar's of our industry". (Really. It's on the front page of the site.) The list of past book winners reads like an essential library for the software practitioner. Even the finalists and runners-up are essential reading.

Release It has now joined the company of finalists. The competition is very tough... I've read "Beautiful Code" and "Manage It!", and both are excellent. I'll be on pins and needles until the awards ceremony on March 5th. Honestly, though, I'm just thrilled to be in such good company.

Should Email Errors Keep Customers From Buying?

2008-01-06T22:31:54-06:00

Somewhere inside every commerce site, there's a bit of code sending emails out to customers. Email campaigning might have been in the requirements and that email code stands tall at the brightly-lit service counter. On the other hand, it might have been added as an afterthought, languishing in some dark corner with the "lost and found" department. Either way, there's a good chance it's putting your site at risk.

The simplest way to code an email sending routine looks something like this:

Get a javax.mail.Session instance
Get a javax.mail.Transport instance from the Session
Construct a javax.mail.internet.MimeMessage instance
Set some fields on the message: from, subject, body. (Setting the body may involve reading a template from a file and interpolating values.)
Set the recipients' Addresses on the message
Ask the Transport to send the message
Close the Transport
Discard the Session

This goes into a servlet, a controller, or a stateless session bean, depending on which MVC framework or JEE architecture blueprint you're using.

There are two big problems here. (Actually, there are three, but I'm not going to deal with the "one connection per message" issue.)

Request-Handling Threads at Risk

As written, all the work of sending the email happens on the request-handling thread that's also responsible for generating the response page. Even on a sunny day, that means you're spending some precious request-response cycles on work that doesn't help build the page.

You should always look at a call out to an external server with suspicion. Many of them can execute asynchronously to page generation. Anything that you can offload to a background thread, you should offload so the request-handler can get back in the pool sooner. The user's experience will be better, and your site's capacity will be better, if you do.

Also, keep in mind that SMTP servers aren't always 100% reliable. Neither are the DNS servers that point you to them. That goes double if you're connecting to some external service. (And please, please don't even tell me you're looking up the recipient's MX record and contacting the receiving MTA directly!)

If the MTA is slow to accept your connection, or to process the email, then the request-handling thread could be blocked for a long time: seconds or even minutes. Will the user wait around for the response? Not likely. He'll probably just hit "reload" and double-post the form that triggered the email in the first place.

Poor Error Recovery

The second problem is the complete lack of error recovery. Yes, you can log an exception when your connection to the MTA fails. But that only lets the administrator know that some amount of mail failed. It doesn't say what the mail was! There's no way to contact the users who didn't get their messages. Depending on what the messages said, that could be a very big deal.

At a minimum, you'd like to be able to detect and recover from interruptions at the MTA---scheduled maintenance, Windows patching, unscheduled index rebuilds, and the like. Even if "recovery" means someone takes the users' info from the log file and types in a new message on their desktops, that's better than nothing.

A Better Way

The good news is that there's a handy way to address both of these problems at once. Better still, it works whether you're dealing with internal SMTP-based servers or external XML-over-HTTP bulk mailers.

Whenever a controller decides it's time to reach out and touch a user through email, it should drop a message on a JMS queue. This lets the request-handling thread continue with page generation immediately, while leaving the email for asynchronous processing.

You can either go down the road of message-driven beans (MDB) or you can just set up a pool of background threads to consume messages from the queue. On receipt of a message, the subscriber just executes the same email generation and transmission as before, with one exception. If the message fails due to a system error, such as a broken socket connection, the message can just go right back onto the message queue for later retry. (You'll probably want to update the "next retry time" to avoid livelock.)

Better Still

If you have a cluster of application servers that can all generate outbound email, why not take the next step? Move the MDBs out into their own app server and have the message queues from all the app servers terminate there? (If you're using pub-sub instead of point-to-point, this will be pretty much transparent.) This application will resemble a message broker... for good reason. It's essentially just pulling messages in from one protocol, transforming them, then sending them out over another protocol.

The best part? You don't even have to write the message broker yourself. There are plenty of open-source and commercial alternatives.

Summary

Sending email directly from the request-handling thread performs poorly, creates unpredictable page latency for users and risks dropping their emails right on the floor. It's better to drop a message in a queue for asynchronous transformation by a message broker: it's faster, more reliable, and there's less code for you to write.

Two Sites, One Antipattern

2007-12-20T20:36:19-06:00

This week, I had Groundhog Day in December. I was visiting two different clients, but they each told the same tale of woe.

At my first stop, the director of IT told me about a problem they had recently found and eliminated.

They're a retailer. Like many retailers, they try to increase sales through "upselling" and "cross-selling". So, when you go to check out, they show you some other products that you might want to buy. It's good to show customers relevant products that are also advantageous to sell. For example, if a customer buys a big HDTV, offer them cables (80% margin) instead of DVDs (3% margin).

All but one of the slots on that page are filled through deliberate merchandising. People decide what to display there, the same way they decide what to put in the endcaps or next to the register in a physical store. The final slot, though, gets populated automatically according to the products in the customer's cart. Based on the original requirements for the site, the code to populate that slot looked for products in the catalog with similar attributes, then sorted through them to find the "best" product. (Based on some balance of closely-matched attributes and high margin, I suspect.)

The problem was that there were too many products that would match. The attributes clustered too much for the algorithm, so the code for this slot would pull back thousands of products from the catalog. It would turn each row in the result set into an object, then weed through them in memory.

Without that slot, the page would render in under a second. With it, two minutes, or worse.

It had been present for more than two years. You might ask, "How could that go unnoticed for two years?" Well, it didn't, of course. But, because it had always been that way, most everyone was just used to it. When the wait times would get too bad, this one guy would just restart app servers until it got better.

Removing that slot from the page not only improved their stability, it vastly increased their capacity. Imagine how much more they could have added to the bottom line if they hadn't overspent for the last two years to compensate.

At my second stop, the site suffered from serious stability problems. At any given time, it was even odds that at least one app server would be vapor locked. Three to five times a day, that would ripple through and take down all the app servers. One key symptom was a sudden spike in database connections.

Some nice work by the DBAs revealed a query from the app servers that was taking way too long. No query from a web app should ever take more than half a second, but this one would run for 90 seconds or more. Usually that means the query logic is bad. In this case, though, the logic was OK, but the query returned 1.2 million rows. The app server would doggedly convert those rows into objects in a Vector, right up until it started thrashing the garbage collector. Eventually, it would run out of memory, but in the meantime, it held a lot of row locks. All the other app servers would block on those row locks. The team applied a band-aid to the query logic, and those crashes stopped.

What's the common factor here? It's what I call an "Unbounded Result Set". Neither of these applications limited the amount of data they requested, even though there certainly were limits to how much they could process. In essence, both of these applications trusted their databases. The apps weren't prepared for the data to be funky, weird, or oversized. They assumed too much.

You should make your apps be paranoid about their data. If your app processes one record at a time, then looping through an entire result set might be OK---as long as you're not making a user wait while you do. But if your app turns rows into objects, then it had better be very selective about its SELECTs. The relationships might not be what you expect. The data producer might have changed in a surprising way, particularly if it's not under your control. Purging routines might not be in place, or might have gotten broken. Definitely don't trust some other application or batch job to load your data in a safe way.
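
One way to be paranoid is to cap how many rows you will ever turn into objects, no matter what the query returns. The sketch below uses Python's built-in `sqlite3` module purely so it can run anywhere. The `fetch_bounded` helper, the `product` table, and the cap of 100 rows are all hypothetical, not taken from either client's code. A `LIMIT` clause in the SQL is still the first line of defense; this is the backstop for when the data surprises you anyway.

```python
import sqlite3

MAX_ROWS = 100  # hypothetical cap: the most rows this page could ever use


def fetch_bounded(conn, sql, params=(), max_rows=MAX_ROWS):
    """Run a query but never materialize more than max_rows rows."""
    cursor = conn.execute(sql, params)
    rows = cursor.fetchmany(max_rows + 1)  # one extra row reveals that the cap was hit
    truncated = len(rows) > max_rows
    cursor.close()  # abandon the rest of the result set instead of reading it
    return rows[:max_rows], truncated


# Demo data: a catalog where far too many products match.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, margin REAL)")
conn.executemany("INSERT INTO product (margin) VALUES (?)",
                 [(i % 80,) for i in range(5000)])

rows, truncated = fetch_bounded(conn, "SELECT id, margin FROM product")
print(len(rows), truncated)  # 100 True
```

When `truncated` comes back true, that's worth logging: it means the data has outgrown an assumption somebody made.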

No matter what odd condition your app stumbles across in the database, it should not be vulnerable.

Read-write splitting with Oracle

2007-12-12T11:07:49-06:00

Speaking of databases and read/write splitting, Oracle had a session at OpenWorld about it.

Building a read pool of database replicas isn't something I usually think of doing with Oracle, mainly due to their non-zero license fees. It changes the scaling equation.

Still, if you are on Oracle and the fees work for you, consider Active Data Guard. Some key facts from the slides:

Average latency for replication was 1 second.
The maximum latency spike they observed was 10 seconds.
A node can take itself offline if it detects excessive latency.
You can use DBLinks to allow applications to think they're writing to a read node. The node will transparently pass the writes through to the master.
This can be done without any tricky JDBC proxies or load-balancing drivers, just the normal Oracle JDBC driver with the bugs we all know and love.
Active Data Guard requires Oracle 11g.

Budgetecture and its ugly cousins

2007-12-12T09:38:17-06:00

It's the time of year for family gatherings, so here's a repulsive group portrait of some nearly universal pathologies. Try not to read this while you're eating.

Budgetecture

We've all been hit with budgetecture. That's when sound technology choices go out the window in favor of cost-cutting. The conversation goes something like this.

"Do we really need X?" asks the project sponsor. (A.k.a. the gold owner.)

For "X", you can substitute nearly anything that's vitally necessary to make the system run: software licenses, redundant servers, offsite backups, or power supplies. It's always asked with a sort of paternalistic tone, as though the grown-up has caught us blowing all our pocket money on comic books and bubble gum, whilst the serious adults are trying to get on with buying more buckets to carry their profits around in.

The correct way to answer this is "Yes. We do." That's almost never the response.

After all, we're trained as engineers, and engineering is all about making trade-offs. We know good and well that you don't really need extravagances like power supplies, so long as there's a sufficient supply of hamster wheels and cheap interns in the data center. So instead of simply saying, "Yes. We do," we go on with something like, "Well, you could do without a second server, provided you're willing to accept downtime for routine maintenance and whenever a RAM chip gets hit by a cosmic ray and flips a bit, causing a crash, but if we get error-checking parity memory then we get around that, so we just have to worry about the operating system crashing, which it does about every three-point-nine days, so we'll have to institute a regime of nightly restarts that the interns can do whenever they're taking a break from the power-generating hamster wheels."

All of which might be completely true, but is utterly the wrong thing to say. The sponsor has surely stopped listening after the word, "Well..."

The problem is that you see your part as an engineering role, while your sponsor clearly understands he's engaged in a negotiation. And in a negotiation, the last thing you want to do is make concessions on the first demand. In fact, the right response to the "do we really need" question is something like this:

"Without a second server, the whole system will come crashing down at least three times daily, particularly when it's under heaviest load or when you are doing a demo for the Board of Directors. In fact, we really need four servers so we can take an HA pair down independently at any time while still maintaining 100% of our capacity, even in case one of the remaining pair crashes unexpectedly."

Of course, we both know you don't really need the third and fourth servers. This is just a gambit to get the sponsor to change the subject to something else. You're upping the ante and showing that you're already running at the bare, dangerous, nearly-irresponsible minimum tolerable configuration. And besides, if you do actually get the extra servers, you can certainly use one to make your QA environment match production, and the other will make a great build box.

Schedule Quid Pro Quo

Another situation in which we harm ourselves by bringing engineering trade-offs to a negotiation comes when the schedule slips. Statistically speaking, we're more likely to pick up the bass line from "La Bamba" from a pair of counter-rotating neutron stars than we are to complete a project on time. Sooner or later, you'll realize that the only way to deliver your project on time and under budget is to reduce it to roughly the scope of "Hello, world!"

When that happens, being a responsible developer, you'll tell your sponsor that the schedule needs to slip. You may not realize it, but by uttering those words, you've given the international sign of negotiating weakness.

Your sponsor, who has his or her own reputation---not to mention budget---tied to the delivery of this project, will reflexively respond with, "We can move the date, but if I give you that, then you have to give me these extra features."

The project is already going to be late. Adding features will surely make it more late, particularly since you've already established that the team isn't moving as fast as expected. So why would someone invested in the success of the project want to further damage it by increasing the scope? It's about as productive as soaking a grocery store bag (the paper kind) in water, then dropping a coconut into it.

I suspect that it's sort of like dragging a piece of yarn in front of a kitten. It can't help but pounce on it. It's just what kittens do.

My only advice in this situation is to counter with data. Produce the burndown chart showing when you will actually be ready to release with the current scope. Then show how the fractally iterative cycle of slippage followed by scope creep produces a delivery date that will be moot, as the sun will have exploded before you reach beta.

The Fallacy of Capital

When something costs a lot, we want to use it all the time, regardless of how well suited it is or is not.

This is sort of the inverse of budgetecture. For example, relational databases used to cost roughly the same as a battleship. So, managers got it in their heads that everything needed to be in the relational database. Singular. As in, one.

Well, if one database server is the source of all truth, you'd better be pretty careful with it. And the best way to be careful with it is to make sure that nobody, but nobody, ever touches it. Then you collect a group of people with malleable young minds and a bent toward obsessive-compulsive abbreviation forming, and you make them the Curators of Truth.

But, because the damn thing cost so much, you need to get your money's worth out of it. So, you mandate that every application must store its data in The Database, despite the fact that nobody knows where it is, what it looks like, or even if it really exists. Like Schrodinger's cat, it might already be gone, it's just that nobody has observed it yet. Still, even that genetic algorithm with simulated annealing, running ten million Monte Carlo fitness tests is required to keep its data in The Database.

(In the above argument, feel free to substitute IBM Mainframe, WebSphere, AquaLogic, ESB, or whatever your capital fallacy du jour may be.)

Of course, if databases didn't cost so much, nobody would care how many of them there are. Which is why MySQL, Postgres, SQLite, and the others are really so useful. It's not an issue to create twenty or thirty instances of a free database. There's no need to collect them up into a grand "enterprise data architecture". In fact, exactly the opposite is true. You can finally let independent business units evolve independently. Independent services can own their own data stores, and never let other applications stick their fingers into their guts.

So there you have it, a small sample of the rogues' gallery. These bad relations don't get much photo op time with the CEO, but if you look, you'll find them lurking in some cubicle just around the corner.

Releasing a free SingleLineFormatter

2007-12-08T19:14:38-06:00

A number of readers have asked me for reference implementations of the stability and capacity patterns.

I've begun to create some free implementations to go along with Release It. As of today, it just includes a drop-in formatter that you can use in place of the java.util.logging default (which is horrible).

This formatter keeps all the fields lined up in columns, including truncating the logger name and method name if necessary. A columnar format is much easier for the human eye to scan. We all have great pattern-matching machinery in our heads. I can't for the life of me understand why so many vendors work so hard to defeat it. The one thing that doesn't get stuffed into a column is a stack trace. It's good for a stack trace to interrupt the flow of the log file... that's something that you really want to pop out when scanning the file.
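
To show the idea without reproducing the library, here is a rough analog for Python's standard `logging` module. It is not the SingleLineFormatter itself: the `ColumnFormatter` name, the column widths, and the choice to keep the tail of a long name when truncating are all my assumptions for illustration.

```python
import logging


class ColumnFormatter(logging.Formatter):
    """Pad or truncate each field to a fixed width so the columns line up."""

    def __init__(self, name_width=20, func_width=16):
        super().__init__()
        self.name_width = name_width
        self.func_width = func_width

    @staticmethod
    def _fit(text, width):
        # Assumption: keep the tail of a long name, since the last
        # segments of a dotted logger name are the most specific.
        return text[-width:].ljust(width)

    def format(self, record):
        line = " ".join([
            self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            record.levelname.ljust(8),
            self._fit(record.name, self.name_width),
            self._fit(record.funcName or "", self.func_width),
            record.getMessage(),
        ])
        if record.exc_info:
            # Stack traces deliberately break out of the columns.
            line += "\n" + self.formatException(record.exc_info)
        return line
```

Plugging it in is one call: `handler.setFormatter(ColumnFormatter())` on whatever handler writes your log file.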

It only takes a minute to plug in the SingleLineFormatter. Your admins will thank you for it.

Read about the library.

Download it as .zip or .tgz.

A Dozen Levels of Done

2007-11-28T16:09:59-06:00

What does "done" mean to you? I find that my definition of "done" continues to expand. When I was still pretty green, I would say "It's done" when I had finished coding. (Later, a wiser and more cynical colleague taught me that "done" meant that you had not only finished the work, but made sure to tell your manager you had finished the work.)

The next meaning of "done" that I learned had to do with version control. It's not done until it's checked in.

Several years ago, I got test infected and my definition of "done" expanded to include unit testing.

Now that I've lived in operations for a few years and gotten to know and love Lean Software Development, I have a new definition of "done".

Here goes:

A feature is not "done" until all of the following can be said about it:

All unit tests are green.
The code is as simple as it can be.
It communicates clearly.
It compiles in the automated build from a clean checkout.
It has passed unit, functional, integration, stress, longevity, load, and resilience testing.
The customer has accepted the feature.
It is included in a release that has been branched in version control.
The feature's impact on capacity is well-understood.
Deployment instructions for the release are defined and do not include a "point of no return".
Rollback instructions for the release are defined and tested.
It has been deployed and verified.
It is generating revenue.

Until all of these are true, the feature is just unfinished inventory.

Postmodern Programming

2007-11-19T08:00:00-06:00

It's taken me a while to get to this talk. Not because it was uninteresting, just because it sent my mind in so many directions that I needed time to collect my scattered thoughts.

Objects and Lego Blocks

On Thursday, James Noble delivered a Keynote about "The Lego Hypothesis". As you might guess, he was talking about the dream of building software as easily as a child assembles a house from Lego bricks. He described it as an old dream, using quotes from the very first conference on Software Engineering... the one where they utterly invented the term "Software Engineering" itself. In 1968.

The Lego Hypothesis goes something like this: "In the future, software engineering will be set free from the mundane necessity of programming." To realize this dream, we should look at the characteristics of Lego bricks and see if software at all mirrors those characteristics.

Noble ascribed the following characteristics to components:

Small
Indivisible
Substitutable
More similar than different
Abstract encapsulations
Coupled to a few, close neighbors
No action at a distance

(These actually predate that 1968 software engineering conference by quite a bit. They were first described by the Greek philosopher Democritus in his theory of atomos.)

The first several characteristics sound a lot like the way we understand objects. The last two are problematic, though.

Examining many different programs and languages, Noble's research group has found that objects are typically not connected to just a few nearby objects. The majority of objects are coupled to just one or two others. But the extremal cases are very, very extreme. In a Self program, one object had over 10,000,000 inbound references. That is, it was coupled to more than 10,000,000 other objects in the system. (It's probably 'nil', 'true', 'false', or perhaps the integer object 'zero'.)

In fact, object graphs tend to form scale-free networks that can be described by power laws.

Lots of other systems in our world form scale-free networks with power law distributions:

City sizes
Earthquake magnitudes
Branches in a roadway network
The Internet
Blood vessels
Galaxy sizes
Impact crater diameters
Income distributions
Book sales

One of the first things to note about power law distributions is that they are not normal. That is, words like "average" and "median" are very misleading. If the average inbound coupling is 1.2, but the maximum is 10,000,000, how much does the average tell you about the large scale behavior of the system?
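
A little arithmetic makes the point concrete. The histogram below is invented: I picked the counts so the average comes out near 1.2 with a single hub at 10,000,000, matching the figures above. It is not Noble's data.

```python
# Hypothetical histogram: inbound reference count -> number of objects with that count.
histogram = {1: 90_000_000, 2: 9_999_999, 10_000_000: 1}

objects = sum(histogram.values())
references = sum(count * n for count, n in histogram.items())

average = references / objects
largest = max(histogram)
hub_share = largest / references  # fraction of all references that point at one object

print(f"average inbound coupling: {average:.2f}")   # 1.20
print(f"maximum inbound coupling: {largest:,}")     # 10,000,000
print(f"share of references on the hub: {hub_share:.1%}")  # 8.3%
```

In this made-up system, one object out of a hundred million receives about one in every twelve references. The average of 1.2 gives no hint that such an object exists.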

(An aside: this is the fundamental problem that makes random events so problematic in Nassim Taleb's book The Black Swan. Benoit Mandelbrot also considers this in The (Mis)Behavior of Markets. Yes, that Mandelbrot.)

Noble made a pretty good case that the Lego Hypothesis is dead as disco. Then came a leap of logic that I must have missed.

Postmodernism

"The ultimate goal of computer science is the program."

You are assigned to write a program to calculate the first 100 prime numbers. If you are a student, you have to write this as if it exists in a vacuum. That is, you code as if this is the first program in the universe. It isn't. Once you leave the unique environs of school, you're not likely to sit down with a pad of lined paper and a mechanical pencil to derive your own prime-number-finding algorithm. Instead, your first stop is probably Google.

Searching for "prime number sieve" currently gives me about 644,000 results in three-tenths of a second. The results include implementations in JavaScript, Java, C, C++, FORTRAN, PHP, and many others. In fact, if I really need prime numbers rather than a program to find numbers, I can just parasitize somebody else's computing power with online prime number generators.

Noble quotes Steven Conner from the Cambridge Companion to Postmodernism:

"...that condition in which, for the first time, and as a result of technologies which allow the large-scale storage, access, and re-production of records of the past, the past appears to be included in the present."

In art and literature, postmodernism incorporates elements of past works, directly and by reference. In programming, it means that every program ever written is still alive. They are "alive" in the sense that even dead hardware can be emulated. Papers from the dawn of computing are available online. There are execution environments for COBOL that run in Java Virtual Machines, possibly on virtual operating systems. Today's systems can completely contain every previous language, program, and execution environment.

I'm now writing well beyond my actual understanding of postmodern critical theory and trying to report what Noble was talking about in his keynote.

The same technological changes that caused the rise of postmodernism in art, film, and literature are now in full force in programming. In a very real sense, we did it to ourselves! We technologists and programmers created the technologies---globe-spanning networks, high compression codecs, indexing and retrieval, collaborative filtering, virtualization, emulation---that are now reshaping our profession.

In the age of postmodern programming, there are no longer "correct algorithms". Instead, there are contextual decisions, negotiations, and contingencies. Instead of The Solution, we have individual solutions that solve problems in a context. This should sound familiar to anyone in the patterns movement.

Indeed, he directly references patterns and eXtreme Programming as postmodern programming phenomena, along with "scrap-heap" programming, mashups, glue programming, and scripting languages.

I searched for a great way to wrap this piece up, but ultimately it seemed more appropriate to talk about the contextual impact it had on me. First, I've never been fond of postmodernism; it always seemed simultaneously precious and pretentious. Now, I'll be giving that movement more attention. Second, I've always thought of mashups as sort of tawdry and sordid---not real programming, you know? I'll be reconsidering that position as well.

Conference: "Velocity"

2007-11-16T10:19:56-06:00

O'Reilly has announced an upcoming conference called Velocity.

From the announcement:

Web companies, big and small, face many of the same challenges: sites must be faster, infrastructure needs to scale, and everything must be available to customers at all times, no matter what. Velocity is the place to obtain the crucial skills and knowledge to build successful web sites that are fast, scalable, resilient, and highly available.
Unfortunately, there are few opportunities to learn from peers, exchange ideas with experts, and share best practices and lessons learned.
Velocity is changing that by providing the best information on building and operating web sites that are fast, reliable, and always up. We're bringing together people from around the world who are doing the best performance work, to improve the experience of web users worldwide. Pages will be faster. Sites will have higher up-time. Companies will achieve more with less. The next cool startup will be able to more quickly scale to serve a larger audience, globally. Velocity is the key for crossing over from cool Web 2.0 features to sustainable web sites.

That statement could have been the preface to my book, so I'll be submitting several proposals for talks.

Putting My Mind Online

2007-11-13T15:44:44-06:00

Along with the longer analysis pieces, I've decided to post the entirety of my notes from QCon San Francisco. A few of my friends and colleagues are fellow mind-mappers, so this is for them.

Nygard's Mind Map from QCon

This file works with FreeMind, a fast, fluid, and free mind mapping tool.

Two Ways To Boost Your Flagging Web Site

2007-11-10T00:52:56-06:00

Being fast doesn't make you scalable. But it does mean you can handle more capacity with your current infrastructure. Take a look at this diagram of request handlers.

You can see that it takes 13 request handling threads to process this amount of load. In the next diagram, the requests arrive at the same rate, but in this picture it takes just 200 milliseconds to answer each one.

Same load, but only 3 request handlers are needed at a time. So, shortening the processing time means you can handle more transactions during the same unit of time.
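
The relationship behind those two pictures is Little's Law: the average number of busy handlers equals the arrival rate times the time each request spends being handled. Here's a small sketch. The arrival rate of 15 requests per second and the 850 millisecond response time for the first picture are hypothetical values I chose so the results match the 13 and 3 handlers described above; only the 200 milliseconds comes from the text.

```python
def handlers_needed(requests_per_second, service_time_ms):
    """Little's Law: average concurrency = arrival rate x time in system."""
    busy_ms_per_second = requests_per_second * service_time_ms
    return -(-busy_ms_per_second // 1000)  # integer ceiling division


ARRIVAL_RATE = 15  # hypothetical requests per second

print(handlers_needed(ARRIVAL_RATE, 850))  # 13
print(handlers_needed(ARRIVAL_RATE, 200))  # 3
```

This gives the average, so treat it as a floor. Bursty traffic will need headroom above it.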

Suppose your site is built on the classic "six-pack" architecture shown below. As your traffic grows and the site slows, you're probably looking at adding more oomph to the database servers. Scaling that database cluster up gets expensive very quickly. Worse, you have to bulk up both guns at once, because each one still has to be able to handle the entire load. So you're paying for big boxes that are guaranteed to be 50% idle.

Let's look at two techniques almost any site can use to speed up requests, without having the Hulk Hogan and Andre the Giant of databases lounging around in your data center.

Cache Farms

Cache farming doesn't mean armies of Chinese gamers stomping rats and making vests. It doesn't involve registering a ton of domain names, either.

Pretty much every web app is already caching a bunch of things at a bunch of layers. Odds are, your application is already caching database results, maybe as objects or maybe just query results. At the top level, you might be caching page fragments. HTTP session objects are nothing but caches. The net result of all this caching is a lot of redundancy. Every app server instance has a bunch of memory devoted to caching. If you're running multiple instances on the same hosts, you could be caching the same object once per instance.

Caching is supposed to speed things up, right? Well, what happens when those app server instances get short on memory? Those caches can tie up a lot of heap space. If they do, then instead of speeding things up, the caches will actually slow responses down as the garbage collector works harder and harder to free up space.

So what do we have? If there are four app instances on each of the two app server hosts, then a frequently accessed object---like a product featured on the home page---will be duplicated eight times. Can we do better? Well, since I'm writing this article, you might suspect the answer is "yes". You'd be right.

The caches I've described so far are in-memory, internal caches. That is, they exist completely in RAM and each process uses its own RAM for caching. There exist products, commercial and open-source, that let you externalize that cache. By moving the cache out of the app server process, you can access the same cache from multiple instances, reducing duplication. With those objects out of the heap, you can make the app server heap smaller, which will also reduce garbage collection pauses. If you make the cache distributed, as well as external, then you can reduce duplication even further.

External caching can also be tweaked and tuned to help deal with "hot" objects. If you look at the distribution of accesses by ID, odds are you'll observe a power law. That means the popular items will be requested hundreds or thousands of times as often as the average item. In a large infrastructure, making sure that the hot items are on cache servers topologically near the application servers can make a huge difference in time lost to latency and in load on the network.

External caches are subject to the same kind of invalidation strategies as internal caches. On the other hand, when you invalidate an item from each app server's internal cache, they're probably all going to hit the database at about the same time. With an external cache, only the first app server hits the database. The rest will find that it's already been re-added to the cache.
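
A toy version of that behavior, with a plain Python object standing in for the cache server. The `ExternalCache` class, `load_product`, and the SKU are hypothetical names for illustration. A real external cache is a separate process reached over the network, and it would not serialize every lookup behind one lock the way this sketch does.

```python
import threading


class ExternalCache:
    """Stand-in for a shared cache server that all app server instances talk to."""

    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()

    def get_or_load(self, key, loader):
        # Holding the lock while loading means concurrent misses wait for
        # the first loader instead of all hitting the database at once.
        with self._lock:
            if key not in self._items:
                self._items[key] = loader(key)
            return self._items[key]

    def invalidate(self, key):
        with self._lock:
            self._items.pop(key, None)


db_hits = []


def load_product(product_id):
    db_hits.append(product_id)  # pretend this is the expensive database query
    return {"id": product_id, "price": 1999}


cache = ExternalCache()
for app_server in range(8):  # eight instances ask for the same hot item
    cache.get_or_load("sku-123", load_product)

print(len(db_hits))  # 1
```

With eight internal caches, the same invalidation would have cost eight database queries instead of one.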

External cache servers can run on the same hosts as the app servers, but they are often clustered together on hosts of their own. Hence, the cache farm.

If the external cache doesn't have the item, the app server hits the database as usual. So I'll turn my attention to the database tier.

Read Pools

The toughest thing for any database to deal with is a mixture of read and write operations. The write operations have to create locks and, if transactional, locks across multiple tables or blocks. If the same tables are being read, those reads will have highly variable performance, depending on whether a read operation randomly encounters one of the locked rows (or pages, blocks, or tables, depending).

But the truth is that your application almost certainly does more reads than writes, probably to an overwhelming degree. (Yes, there are some domains where writes exceed reads, but I'm going to momentarily disregard mindless data collection.) For a travel site, the ratio will be about 10:1. For a commerce site, it will be from 50:1 to 200:1. There are a lot of variables here, especially when you start doing more effective caching, but even then, the ratios are highly skewed.

When your database starts to get that middle-age paunch and it just isn't as zippy as it used to be, think about offloading those reads. At a minimum, you'll be able to scale out instead of up. Scaling out with smaller, consistent, commodity hardware pleases everyone more than forklift upgrades. In fact, you'll probably get more performance out of your writes once all that pesky read I/O is off the write master.

How do you create a read pool? Good news! It uses nothing more than built-in replication features of the database itself. Basically, you just configure the write master to ship its archive logs (or whatever your DB calls them) to the read pool databases. They spin up the logs to bring their state into synch with the write master.
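
On the application side, something has to decide which database each operation goes to. Here's one possible shape for that decision, as a sketch only. The `ReadWriteRouter` class and the server names are hypothetical, the caller is assumed to supply each replica's current lag from monitoring, and the 300-second staleness limit is just an example threshold.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the write master and spread reads across the read pool."""

    def __init__(self, master, replicas, max_lag_seconds=300):
        self.master = master
        self.replicas = list(replicas)
        self.max_lag_seconds = max_lag_seconds
        self._next = itertools.cycle(range(len(self.replicas)))

    def for_write(self):
        return self.master

    def for_read(self, lag_seconds):
        """lag_seconds maps replica name -> current replication lag in seconds."""
        for _ in range(len(self.replicas)):
            candidate = self.replicas[next(self._next)]  # round-robin
            if lag_seconds.get(candidate, float("inf")) <= self.max_lag_seconds:
                return candidate
        return self.master  # every replica is too stale, so fall back to the master


router = ReadWriteRouter("write-master", ["read-1", "read-2"])
print(router.for_read({"read-1": 1, "read-2": 2}))  # read-1
```

Falling back to the master keeps reads correct when the pool falls behind, at the cost of putting read load back where you didn't want it. Whether that trade is right depends on your domain.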

By the way, for read pooling, you really want to avoid database clustering approaches. The overhead needed for synchronization obviates the benefits of read pooling in the first place.

At this point, you might be objecting, "Wait a cotton-picking minute! That means the read machines are garun-damn-teed to be out of date!" (That's the Foghorn Leghorn version of the objection. I'll let you extrapolate the Tony Soprano and Geico Gecko versions yourself.) You would be correct. The read machines will always reflect an earlier point in time.

Does that matter?

To a certain extent, I can't answer that. It might matter, depending on your domain and application. But in general, I think it matters less often than it seems. I'll give you an example from the retail domain that I know and love so well. Take a look at this product detail page from BestBuy.com. How often do you think each data field on that page changes? Suppose there is a pricing error that needs to be corrected immediately (for some definition of immediately). What's the total latency before that pricing error will be corrected? Let's look at the end-to-end process.

A human detects the pricing error.
The observer notifies the responsible merchant.
The merchant verifies that the price is in error and determines the correct price.
Because this is an emergency, the merchant logs in to the "fast path" system that bypasses the nightly batch cycle.
The merchant locates the item and enters the correct price.
She hits the "publish" button.
The fast path system connects to the write master in production and updates the price.
The read pool receives the logs with the update and applies them.
The read pool process sends a message to invalidate the item in the app servers' caches.
The next time users request that product detail page, they see the correct price.

That's the best-case scenario! In the real world, the merchant will be in a meeting when the pricing error is found. It may take a phone call or lookup from another database to find out the correct price. There might be a quick conference call to make the decision whether to update the price or just yank the item off the site. All in all, it might take an hour or two before the pricing error gets corrected. Whatever the exact sequence of events, odds are that the replication latency from the write master to the read pool is the very least of the delays.

Most of the data is much less volatile or critical than the price. Is an extra five minutes of latency really a big deal? When it can save you a couple of hundred thousand dollars on giant database hardware?

Summing It Up

The reflexive answer to scaling is, "Scale out at the web and app tiers, scale up in the data tier." I hope this shows that there are other avenues to improving performance and capacity.

References

For more on read pooling, see Cal Henderson's excellent book, "Building Scalable Web Sites: Building, scaling, and optimizing the next generation of web applications".

The most popular open-source external caching framework I've seen is memcached. It's a flexible, multi-lingual caching daemon.

On the commercial side, GigaSpaces provides distributed, external, clustered caching. It adapts to the "hot item" problem dynamically to keep a good distribution of traffic, and it can be configured to move cached items closer to the servers that use them, reducing network hops to the cache.

Two Quick Observations

2007-11-09T14:13:41-06:00

Several of the speakers here have echoed two themes about databases.

1. MySQL is in production in a lot of places. I think the high cost of commercial databases (read: Oracle) leads to a kind of budgetechture that concentrates all data in a single massive database. If you remove that cost from the equation, the idea of either functionally partitioning your data stores or creating multiple shards becomes much more palatable.

2. By far the most common database cluster structure has one write master with many read masters. Ian Flint spoke to us about the architectures behind Yahoo Groups and Yahoo Bix. Bix has 30 MySQL read servers and just one write master. Dan Pritchett from eBay had a similar ratio. (His might have been 10:1 rather than 30:1.) In a commerce site, where 98% of the traffic is browsing and only 2% is buying, a read-pooled cluster makes a lot of sense.

Three Vendors Worth Evaluating

2007-11-09T11:11:48-06:00

Several vendors are sponsoring QCon. (One can only wonder what the registration fees would be if they didn't.) Of these, I think three have products worth immediate evaluation.

Semmle

In the category of "really cool, but would I pay for it?" is Semmle. Their flagship product, SemmleCode, lets you treat your codebase as a database against which you can run queries. SemmleCode groks the static structure of your code, including relationships and dependencies. Along the way, it calculates pretty much every OO metric yet invented. It also looks at the source repository.

What can you do with it? Well, you can create a query that shows you all the cyclic dependencies in your code. The results can be rendered as a tree with explanations, a graph, or a chart. Or, you can chart your distribution of cyclomatic complexity scores over time. You can look for the classes or packages most likely to create a ripple effect.

Semmle ships with a sample project: the open-source drawing framework JHotDraw. In a stunning coincidence, I'm a contributor to JHotDraw. I wrote the glue code that uses Batik to export a drawing as SVG. So I can say with confidence, that when Semmle showed all kinds of cyclic dependencies in the exporters, it's absolutely correct. Every one of the queries I saw run against JHotDraw confirmed my own experience with that codebase. Where Semmle indicated difficulty, I had difficulty. Where Semmle showed JHotDraw had good structure, it was easy to modify and extend.

There are an enormous number of things you could do with this, but one thing they currently lack is build-time automation. Semmle integrates with Eclipse, but not ANT or Maven. I'm told that's coming in a future release.

3Tera

Virtualization is a hot topic. VMWare has the market lead in this space, but I'm very impressed with 3Tera's AppLogic.

AppLogic takes virtualization up a level. It lets you visually construct an entire infrastructure, from load balancers to databases, app servers, proxies, mail exchangers, and everything. These are components they keep in a library, just like transistors and chips in a circuit design program.

Once you've defined your infrastructure, a single button click will deploy the whole thing into the grid OS. And there's the rub. AppLogic doesn't work with just any old software and it won't work on top of an existing "traditional" infrastructure.

As a comparison, HP's SmartFrog just runs an agent on a bunch of Windows, Linux, or HP-UX servers. A management server sends instructions to the agents about how to deploy and configure the necessary software. So SmartFrog could be layered on top of an existing traditional infrastructure.

Not so with AppLogic. You build a grid specifically to support this deployment style. That makes it possible to completely virtualize load balancers and firewalls along with servers. Of course, it also means complete and total lock-in to 3Tera.

Still, for someone like a managed hosting provider, 3Tera offers the fastest, most complete definition and provisioning system I've seen.

GigaSpaces

What can I say about GigaSpaces? Anyone who's heard me speak knows that I adore tuple-spaces. GigaSpaces is a tuple-space in the same way that Tibco is a pub-sub messaging system. That is to say, the foundation is a tuple-space, but they've added high-level capabilities based on their core transport mechanism.

So, they now have a distributed caching system. (They call it an "in-memory data grid". Um, OK.) There's a database gateway, so your front end can put a tuple into memory (fast) while a back-end process takes the tuple and writes it into the database.

Just this week, they announced that their entire stack is free for startups. (Interesting twist: most companies offer the free stuff to open-source projects.) They'll only start charging you money when you get over $5M in revenue.

I love the technology. I love the architecture.

Catching up through the day

2007-11-09T11:03:36-06:00

One of the great things about virtual infrastructure is that you can treat it as a service. I use Yahoo's shared hosting service for this blog. That gives me benefits: low cost and very quick setup. On the down side, I can't log in as root. So when Yahoo has a problem, I have a problem.

Yesterday, there was something wrong with Yahoo's install of Movable Type. As a result, I couldn't post my "five things". I'll be catching up today, as time permits.

My butt is planted in one track all day today, "Architectures You've Always Wondered About." We'll be hearing about the architecture that runs Second Life, Yahoo, eBay, LinkedIn, and Orbitz. I may need a catheter and an IV.

Architecting for Latency

2007-11-09T09:42:43-06:00

Dan Pritchett, Technical Fellow at eBay, spoke about "Architecting for Latency". His aim was not to talk about minimizing latency, as you might expect, but rather to architect as though you believe latency is unavoidable and real.

We all know the effect latency can have on performance. That's the first-level effect. If you consider synchronous systems---such as SOAs or replicated DR systems---then latency has a dramatic effect on scalability as well. Whenever a synchronous call reaches across a long wire, the latency gets added directly to the processing time.

For example, if client A calls service B, then A's processing time will be at least the sum of B's processing time, plus the latency between A and B. (Yes, it seems obvious when you state it like that, but many architectures still act as though latency is zero.)

Furthermore, latency over IP networks is fundamentally variable. That means A's performance is unpredictable, and can never be made predictable.

Latency also introduces semantic problems. A replicated database will always have some discrepancy with the master database. A functionally or horizontally partitioned system will either allow discrepancies or must serialize traffic and give up scaling. You can imagine that eBay is much more interested in scaling than serializing traffic.

For example, when a new item is posted to eBay, it does not immediately show up in the search results. The ItemNode service posts a message that eventually causes the item to show up in search results. Admittedly, this is kept to a very short period of time, but still, the item will reach different data centers at different times. So, the search service inside the nearest data center will get the item before the search service inside the farthest. I suspect many eBay users would be shocked, and probably outraged, to hear that shoppers see different search results depending on where they are.

Now, the search service is designed to get consistent within a limited amount of time---for that item. With a constant flow of items, being posted from all over the country, you can imagine that there is a continuous variance among the search services. Like the quantum foam, however, this is near-impossible to observe. One user cannot see it, because a user gets pinned to a single data center. It would take multiple users, searching in the right category, with synchronized clocks, taking snapshots of the results to even observe that the discrepancies happen. And even then, they would only have a chance of seeing it, not a certainty.

Another example. Dan talked about payment transfer from one user to another. In the traditional model, that would look something like this.

You can think of the two databases as being either shards that contain different users or tables that record different parts of the transaction.

This is a design that pretends latency doesn't exist. In other words, it subscribes to Fallacy #2 of the Fallacies of Distributed Computing. Performance and scalability will suffer here.

(Frankly, it has an availability problem, too, because the availability of the payment service Pr(Payment) will now be Pr(Database 1) * Pr(Network 1) * Pr(Database 2) * Pr(Network 2). In other words, the availability of the payment service is coupled to the availability of Database 1, Database 2, and the two networks connecting Payment to the two databases.)
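To put rough numbers on that product, here is a small worked example. The availability figures are invented for illustration, not measurements from eBay or anywhere else, and the multiplication assumes the four components fail independently.

```python
from math import prod

# Illustrative availabilities only; these figures are assumptions, not
# measurements from any real system.
availability = {
    "Database 1": 0.999,
    "Network 1": 0.9995,
    "Database 2": 0.999,
    "Network 2": 0.9995,
}

# A synchronous payment succeeds only if every dependency is up at once,
# so (assuming independent failures) the availabilities multiply.
payment = prod(availability.values())

# Convert to expected downtime per 30-day month, in minutes.
minutes_per_month = 30 * 24 * 60
downtime_minutes = (1 - payment) * minutes_per_month

# The composite can never be better than its weakest dependency.
weakest = min(availability.values())
```

With these made-up inputs the payment service comes out at roughly 99.7% available, worse than any single component it depends on.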

Instead, Dan recommends a design more like this:

In this case, the payment service can set an expectation with the user that the money will be credited within some number of minutes. The availability and performance of the payment service is now independent from that of Database 2 and the reconciliation process. Reconciliation happens in the back end. It can be message-driven, batch-driven, or whatever. The main point is that it is decoupled in time and space from the payment service. Now the databases can exist in the same data center or on separate sides of the globe. Either way, the performance, availability, and scalability characteristics of the payment service don't change. That's architecting for latency.
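A toy sketch of that shape, under loud assumptions: two dicts play the role of Database 1 and Database 2, a deque stands in for whatever queue or batch file carries the pending credits, and the function names are invented. In a real system the debit and the pending record would presumably be written in one local transaction against Database 1.

```python
from collections import deque

# Hypothetical in-memory stand-ins: two dicts play Database 1 and Database 2,
# and a deque plays the message queue or batch file between them.
database_1 = {"alice": 100}
database_2 = {"bob": 20}
pending_credits = deque()


def pay(sender, recipient, amount):
    """Payment service: touches only Database 1, then returns immediately."""
    if database_1.get(sender, 0) < amount:
        raise ValueError("insufficient funds")
    database_1[sender] -= amount
    # Record the intent; the credit is applied later by reconciliation.
    pending_credits.append((recipient, amount))
    return "credit pending"


def reconcile():
    """Back-end process: drains pending credits into Database 2."""
    applied = 0
    while pending_credits:
        recipient, amount = pending_credits.popleft()
        database_2[recipient] = database_2.get(recipient, 0) + amount
        applied += 1
    return applied


status = pay("alice", "bob", 30)
balance_before = database_2["bob"]  # unchanged until reconciliation runs
applied = reconcile()
balance_after = database_2["bob"]   # credited once reconciliation has run
```

Notice that `pay` never touches `database_2`. If Database 2 is slow, down, or on another continent, the payment call neither knows nor cares.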

Instead of ACID semantics, we think about BASE semantics. (A cute retronym for Basically Available Soft-state Eventually-consistent.)

Now, many analysts, developers, and business users will object to the loss of global consistency. We heard a spirited debate last night between Dan and Nati Shalom, founder and CTO of GigaSpaces about that very subject.

I have two arguments of my own to support architecting for latency.

First, any page you show to a user represents a point-in-time snapshot. It can always be inaccurate even by the time you finish generating the page. Think about a commerce site saying "Ships in 2 - 3 days". That's true at the instant when the data is fetched. Ten milliseconds later, it might not be true. By the time you finish generating the page and the user's browser finishes rendering it (and fetching the 29 JavaScript files needed for the whizzy AJAX interface) the data is already a few seconds old. So global consistency is kind of useless in that case, isn't it? Besides, I can guarantee there's already a large amount of latency in the data path from inventory tracking to the availability field in the commerce database anyway.

Second, the cost of global consistency is global serialization. If you assume a certain volume of traffic you must support, the cost of a globally consistent solution will be a multiple of the cost of a latency-tolerant solution. That's because global consistency can only be achieved by making a single master system of record. When you try to reach large scales, that master system of record is going to be hideously expensive.

Latency is simply an immutable fact of the universe. If we architect for it, we can use it to our advantage. If we ignore it, we will instead be its victims.

For more of Dan's thinking about latency, see his article on InfoQ.com.

SOA Without the Edifice

2007-11-08T08:38:59-06:00

Sometimes the best interactions at a conference aren't the talks, they're shouting. A crowded bar with an over-amped DJ may seem like an unlikely place for a discussion on SOA. Even so, when it's Jim Webber, ThoughtWorks' SOA practice lead, doing the shouting, it works. Given that Jim's topic is "Guerrilla SOA", shouting is probably more appropriate than the hushed and reverential cathedral whispers that usually attend SOA discussions.

Jim's attitude is that SOA projects tend to attract two things: Taj Mahal architects and parasitic vendors. (My words, not Jim's.) The combined efforts of these two groups result in monumentally expensive edifices that don't deliver value. Worse still, these efforts consume work and attention that could go to building services aligned with the real business processes, not some idealized vision of what the processes ought to be.

Jim says that services should be aligned with business processes. When the business process changes, change the service. (To me, this automatically implies that the service cannot be owned by some enterprise governance council.) When you retire the business process, simply retire the service.

These sound like such common sense, that it's hard to imagine they could be controversial.

I'll be in the front row for Jim's talk later today.

Cameron Purdy: 10 Ways to Botch Enterprise Java Scalability and Reliability

2007-11-07T22:08:52-06:00

Here at QCon, well-known Java developer Cameron Purdy gave a fun talk called "10 Ways to Botch Enterprise Java Scalability and Reliability". (He also gave this talk at JavaOne.)

While I could quibble with Cameron's counting---there were actually more like 16 points thanks to some numerical overloading---I liked his content. He echoes many of the antipatterns from Release It. In particular, he talks about the problem I call "Unbounded Result Sets". That is, whether using an ORM tool or straight database queries, you can always get back more than you expect.

Sometimes, you get back way, way more than you expect. I once saw a small messaging table, that normally held ten or twenty rows, grow to over ten million rows. The application servers never contemplated there could be so many messages. Each one would attempt to fetch the entire contents of the table and turn them into objects. So, each app server would run out of memory and crash. That rolled back the transaction, allowing the next app server to impale itself on the same table.

Unbounded Result Sets don't just happen from "SELECT * FROM FOO;", though. Think about an ORM handling the parent-child relationship for you. Simply calling something like customer.getOrders() will return every order for that customer. By writing that call, you implicitly assume that the set of orders for a customer will always be small. Maybe. Maybe not. How about blogUser.getPosts()? Or tickerSymbol.getTrades()?

Unbounded Result Sets also happen with web services and SOAs. A seemingly innocuous request for information could create an overwhelming deluge---an avalanche of XML that will bury your system. At the least, reading the results can take a long time. In the worst case, you will run out of memory and crash.

The fundamental flaw with an Unbounded Result Set is that you are trusting someone else not to harm you, either a data producer or a remote web service.

Take charge of your own safety!

Be defensive!

Don't get hurt again in another dysfunctional relationship!
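One way to be defensive, sketched with Python's built-in sqlite3 module. The `fetch_bounded` helper and the `MAX_ROWS` ceiling are hypothetical names invented for this example; the idea is simply that the caller, not the data producer, decides how many rows it is willing to hold in memory.

```python
import sqlite3

MAX_ROWS = 100  # hypothetical ceiling; choose one that fits your memory budget


def fetch_bounded(conn, sql, params=(), max_rows=MAX_ROWS):
    """Run a query but never materialize more than max_rows rows.

    Returns (rows, truncated) so the caller can tell when it hit the bound.
    """
    cursor = conn.execute(sql, params)
    # Ask for one extra row purely to detect that more data was available.
    rows = cursor.fetchmany(max_rows + 1)
    truncated = len(rows) > max_rows
    return rows[:max_rows], truncated


# A messaging table that has quietly grown far beyond its usual size.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO messages (body) VALUES (?)",
    [("message %d" % i,) for i in range(10_000)],
)

rows, truncated = fetch_bounded(conn, "SELECT id, body FROM messages ORDER BY id")
```

The caller gets at most `MAX_ROWS` rows plus a flag saying there was more, and can then log it, page through the rest, or refuse to continue. The same habit applies to ORM collections and web service responses, though the mechanics will differ.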

Three Programming Language Problems Solved Forever

2007-11-07T21:38:41-06:00

It's often been the case that a difficult problem can be made easier by transforming it into a different representation. Nowhere is that more true than in mathematics and the pseudo-mathematical realm of programming languages.

For example, LISP, Python, and Ruby all offer beautiful and concise constructs for operating on lists of things. In each of them, you can make a function which iterates across a list, performing some operation on each element, and returning the resulting list. C, C++, and Java do not offer any similar construct. In each of these languages, iterating a list is a control-flow structure that requires multiple lines to express. More significantly, the function expression of list comprehension can be composed. That is, you can embed a list comprehension structure inside of another function call or list operation. In reading Programming Collective Intelligence, which uses Python as its implementation language, I've been amazed at how eloquent complex operations can be, especially when I mentally transliterate the same code into Java.
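A small Python illustration of that composability. The order data is invented for the example and does not come from the book.

```python
# Hypothetical data: (customer, order total) pairs.
orders = [("ann", 120.0), ("bob", 35.5), ("ann", 60.0), ("cy", 15.0)]

# A list comprehension is an expression, so it can sit inside other calls.
big_spenders = sorted(set([name for name, total in orders if total > 50]))

# Comprehensions also nest: total spend per customer in a single expression,
# using the dict and generator forms of the same syntax.
totals = {
    name: sum(total for other, total in orders if other == name)
    for name in [customer for customer, _ in orders]
}
```

Transliterating either expression into the Java of the day means declaring a collection, writing a loop, and mutating the collection inside it, once for each level of nesting.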

In the evening keynote at QCon, Richard Gabriel covered 50 language topics, with a 50 word statement about each---along with a blend of music, art, and poetry. (If you've never seen Richard perform at a conference, it's quite an experience.) His presentation "50 in 50" also covered 50 years of programming and introduced languages as diverse as COBOL, SNOBOL, Piet, LISP, Perl, C, Algol, APL, IPL, Befunge, and HQ9+.

HQ9+ particularly caught my attention. It takes the question of "simplifying the representation of problems" to the utmost extreme.

HQ9+ has a simple grammar. There are 4 operations, each represented by a single character.

'+' increments the register.

'H' prints every language's natal example, "Hello, world!"

'Q' makes every program into a quine. It causes the interpreter to print the program text. Quines are notoriously difficult assignments for second-year CS students.

'9' causes the interpreter to print the lyrics to the song "99 Bottles of Beer on the Wall." This qualifies HQ9+ as a real programming language, suitable for inclusion in the ultimate list of languages.

These three operators solve for some very commonly expressed problems. In a certain sense, they are the ultimate solution to those problems. They cannot be reduced any further... you can't get shorter than one character.
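For the curious, the whole language fits in a few lines of Python. This is a sketch, not a reference implementation: the function names are mine, the lyrics follow one common wording of the song, and silently ignoring unknown characters is an assumption.

```python
def bottles(start=99):
    """Return the song lyrics, counting down from `start` (one common wording)."""
    def count(n):
        if n == 0:
            return "no more bottles"
        return "%d bottle%s" % (n, "" if n == 1 else "s")

    verses = []
    for n in range(start, 0, -1):
        verses.append(
            "%s of beer on the wall, %s of beer.\n"
            "Take one down and pass it around, %s of beer on the wall."
            % (count(n), count(n), count(n - 1))
        )
    return "\n\n".join(verses)


def run_hq9plus(program):
    """Interpret an HQ9+ program; return (printed output, final register)."""
    printed = []
    register = 0
    for op in program:
        if op == "H":
            printed.append("Hello, world!")
        elif op == "Q":
            printed.append(program)  # a quine: print the program text itself
        elif op == "9":
            printed.append(bottles())
        elif op == "+":
            register += 1  # incremented, and never observable by the program
        # Any other character is ignored (an assumption of this sketch).
    return "\n".join(printed), register


output, register = run_hq9plus("HQ+")
```

Running "HQ+" prints the greeting, then the program text, and leaves the register at 1, where nothing can ever read it.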

Of course, in an audience of programmers, HQ9+ always gets a laugh. In fact, it was created specifically to make programmers laugh. And, in fact, it's a kind of meta-level humor. It's not the programs that are funny, but the design of the language itself... an inside joke from one programmer to the rest of us.

Eric Evans: Strategic Design

2007-11-07T21:23:57-06:00

Eric Evans, author of Domain-Driven Design and founder of Domain Language, embodies the philosophical side of programming.

He gave a wonderful talk on "Strategic Design". During this talk, he stated a number of maxims that are worth pondering.

"Not all of a large system will be well designed."

"There are always multiple models."

"The diagram is not the model, but it is an expression of part of the model."

These are not principles to be followed, Evans says. Rather, these are fundamental laws of the universe. We must accept them and act accordingly, because disregarding them ends in tears.

Much of this material comes from Part 4 of Domain-Driven Design. Evans laconically labeled this, "The part no one ever gets to." Guilty. But when I get back home to my library, I will make another go of it.

Evans also discusses the relative size of code, amount of time spent, and value of the three fundamental portions of a system: the core domain, supporting subdomains, and generic subdomains.

Generic subdomains are horizontal. You might find these in any system in any company in the world.

Supporting subdomains are business-specific, but not of value to this particular system. That is, they are necessary cost, but do not provide value.

The core domain is the reason for the system. It is the business-specific functionality that makes this system worth building.

Now, in a typical development process (and especially a rewrite project), where does the team's time go? Most of it will go to the largest bulk: the generic subdomains. This is the stuff that has to exist, but it adds no value and is not specific to the company's business. The next largest fraction goes to the supporting subdomains. Finally, the smallest portion of time---and usually the last portion of time---goes to the core domain.

That means the very last thing delivered is the reason for the system's existence in the first place. Ouch.

Kent Beck's Keynote: "Trends in Agile Development"

2007-11-07T12:50:28-06:00

Kent Beck spoke with his characteristic mix of humor, intelligence, and empathy. Throughout his career, Kent has brought a consistently humanistic view of development. That is, software is written by humans--emotional, fallible, creative, and messy--for other humans. Any attempt to treat development as robotic will end in tears.

During his keynote, Kent talked about engaging people through appreciative inquiry. This is a learnable technique, based in human psychology, that helps focus on positive attributes. It counters the negativity that so many developers and engineers are prone to. (My take: we spend a lot of time, necessarily, focusing on how things can go wrong. Whether by nature or by experience, that leads us to a pessimistic view of the world.)

Appreciative inquiry begins by asking, "What do we do well?" Even if all you can say is that the garbage cans get emptied every night, that's at least something that works well. Build from there.

He specifically recommended The Thin Book of Appreciative Inquiry, which I've already ordered.

I should also note that Kent has a new book out, called Implementation Patterns, which he described as being about, "Communicating with other people, through code."

From QCon San Francisco

2007-11-07T12:39:10-06:00

I'm at QCon San Francisco this week. (An aside: after being a speaker at No Fluff, Just Stuff, it's interesting to be the audience again. As usual, on returning from travels in a different domain, one has a new perspective on familiar scenes.) This conference targets senior developers, architects, and project managers. One of the very appealing things is the track on "Architectures you've always wondered about". This covers high-volume architectures for sites such as LinkedIn and eBay as well as other networked applications like Second Life. These applications live and work in thin air, where traffic levels far outstrip most sites in the world. Performance and scalability are two of my personal themes, so I'm very interested in learning from these pioneers about what happens when you've blown past the limits of traditional 3-tier, app-server centered architecture.

Through the remainder of the week, I'll be blogging five ideas, insights, or experiences from each day of the conference.

Pragmatic Podcast

2007-10-26T16:44:27-05:00

Has anyone ever been happy to listen to their own voice? Probably not.

The Pragmatic Podcast is up and running on the redesigned Pragmatic Programmers site. In the first episode, Daniel Steinberg interviews me about the book.

Also available on iTunes.

Make Time a Weapon

2007-10-23T21:50:30-05:00

Here's a list of books about putting time to work as your own weapon, instead of being victimized by it:

The Mind of War
The Art of Maneuver
Lean Thinking
Birth of the Chaordic Age
Agile Software Development
Lean Software Development
Software by Numbers

Normal Accidents

2007-09-24T10:27:53-05:00

While I was writing Release It!, I was influenced by James R. Chiles's book Inviting Disaster. One of Chiles's sources is Normal Accidents, by Charles Perrow. I've just started reading, and even the first two pages offer great insight.

Normal Accidents describes systems that are inherently unstable, to the point that system failures are inevitable and should be expected. These "normal" accidents result from systems that exhibit the characteristics of high "interactive complexity" and "tight coupling".

Interactive complexity refers to internal linkages, hidden from the view of operators. These invisible relations between components or subsystems produce multiple effects from a single cause. They can also produce outcomes that do not seem to relate to their inputs.

In software systems, interactive complexity is endemic. Any time two programs share a server or database, they are linked. Any time a system contains a feedback loop, it inherently has higher interactive complexity. Feedback loops aren't always obvious. For example, suppose a new software release consumes a fraction more CPU per transaction than before. That small increment might push the server from a non-contending regime into a contending one. Once in contention, the added CPU usage creates more latency. That latency, and the increase in task-switching overhead, produces more latency. Positive feedback.

High interactive complexity leads operators to misunderstand the system and its warning signs. Thus misinformed, they act in ways that do not avert the crisis and may actually precipitate it.

When processes happen very fast, and there is no way to isolate one part of the system from another, the system is tightly coupled. Tight coupling allows small incidents to spread into large-scale failures.

Classic "web architecture" exhibits both high interactive complexity and tight coupling. Hence, we should expect "normal" accidents. Uptime will be dominated by the occurrence of these accidents, rather than the individual probability of failure in each component.

The first section of Release It! deals exclusively with system stability. It shows how to reduce coupling and diminish interactive complexity.

You Keep Using That Word. I Do Not Think It Means What You Think It Means.

2007-09-16T10:35:17-05:00

"Scalable" is a tricky word. We use it like there's one single definition. We speak as if it's binary: this architecture is scalable, that one isn't.

The first really tough thing about scalability is finding a useful definition. Here's the one I use:

Marginal revenue / transaction > Marginal cost / transaction

The cost per transaction has to account for all cost factors: bandwidth, server capacity, physical infrastructure, administration, operations, backups, and the cost of capital.

(And, by the way, it's even better when the ratio of revenue to cost per transaction grows as the volume increases.)

The second really tough thing about scalability and architecture is that there isn't one that's right. An architecture may work perfectly well for a range of transaction volumes, but fail badly as one variable gets large.

Don't treat "scalability" as either a binary issue or a moral failing. Ask instead, "how far will this architecture scale before the marginal cost deteriorates relative to the marginal revenue?" Then, follow that up with, "What part of the architecture will hit a scaling limit, and what can I incrementally replace to remove that limit?"
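To make the first of those questions concrete, here is a toy calculation. The revenue and cost curves are pure invention, chosen only so that the cost per extra transaction creeps upward with volume, and the function names are mine.

```python
def marginal(total, volume, step=1):
    """Change in a total (cost or revenue) per transaction, measured at `volume`."""
    return (total(volume + step) - total(volume)) / step


# Hypothetical curves, invented purely for illustration.
def revenue(volume):
    return 0.50 * volume  # a flat 50 cents of revenue per transaction


def cost(volume):
    # Cheap at first, then each extra transaction costs more as some shared
    # resource (say, the database) approaches its limit.
    return 0.10 * volume + 2e-7 * volume ** 2


def scaling_limit(max_volume, step=10_000):
    """First sampled volume where marginal cost meets or exceeds marginal revenue."""
    for volume in range(0, max_volume, step):
        if marginal(cost, volume) >= marginal(revenue, volume):
            return volume
    return None  # still scalable across the whole sampled range


limit = scaling_limit(5_000_000)
```

With these made-up curves the architecture "scales" up to about a million transactions and stops being worth it beyond that. The number itself is meaningless; what matters is that the answer is a volume, not a yes or no.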

Engineering in the White Space

2007-09-13T12:59:27-05:00

"Is software Engineering, or is it Art?"

Debate between the Artisans and the Engineers has simmered, and occasionally boiled, since the very introduction of the phrase "Software Engineering". I won't restate all the points on both sides here, since I would surely forget someone's pet argument, and also because I see no need to be redundant.

Deep in my heart, I believe that building programs is art and architecture, but not engineering.

But, what if you're not just building programs?

Programs and Systems

A "program" has a few characteristics that I'll assign here:

It accepts input.
It produces output.
It runs a sequence of instructions.
Statically, it exhibits cohesion in its executable form. [*]
Dynamically, it exhibits cohesion in its address space. [**]

* That is, the transitive closure of all code to be executed is finite, although it may not all be known in advance of execution. This allows dynamic extension via plugins, but not, for example, dynamic execution of any scripts or code found on the Web. So, a web browser is a program, but JavaScript executed on some page is an independent program, not part of the browser itself.

** For "address space", feel free to substitute "object space", "process space", or "virtual memory". Cohesion requires that all the code that can access the address space should be regarded as a single program. (IPC through shared memory is a special case of an output, and should be considered more akin to a database or memory-mapped file than to part of the program's own address space.)

Suppose you have two separate scripts that each manipulate the same database. I would regard those as two separate---though not independent---programs. A single instance of Tomcat may contain several independent programs, but all the servlets in one EAR file are part of one program.

For the moment, I will not consider trivial objections, such as two distinct sets of functionality that happen to be packaged and delivered in a single EAR file. It's less interesting to me whether code does access the entire address space than whether it could. A library checkout program that includes functions for both librarians and patrons may not use common code for card number lookup, but it could. (And, arguably, it should.) That makes it one program, in my eyes.

A "System", on the other hand, consists of interdependent programs that have commonalities in their inputs and outputs. They could be arranged in a chain, a web, or a loop. No matter, if one program's input depends on another program's output, then they are part of a system.

Systems can be composed, whereas programs cannot.

Tricky White Space

Some programs run all the time, responding to intermittent inputs; these we call "servers". It is very common to see servers represented as a deceptively simple little rectangle on a diagram. Between servers, we draw little arrows to indicate communication of some sort.

One little arrow might mean, "Synchronous request/reply using SOAP-XML over HTTP." That's quite a lot of information for one little glyph to carry. There's not usually enough room to write all that, so we label the unfortunate arrow with either "XML over HTTP"---if viewing it from an internal perspective---or "SKU Lookup"---if we have an external perspective.

That little arrow, bravely bridging the white space between programs, looks like a direct contact. It is Voyager, carrying its recorded message to parts unknown. It is Arecibo, blasting a hopeful greeting into the endless dark.

Well, not really...

These days, the white space isn't as empty as it once was. A kind of luminiferous ether fills the void between servers on the diagram.

The Substrate

There is many a slip 'twixt cup and lip. In between points A and B on our diagram, there exist some or all of the following:

Network interface cards
Network switches
Layer 2 - 3 firewalls
Layer 7 (application) firewalls
Intrusion Detection and Prevention Systems
Message queues
Message brokers
XML transformation engines
Flat file translations
FTP servers
Polling jobs
Database "landing zone" tables
ETL scripts
Metro-area SONET rings
MPLS gateways
Trunk lines
Oceans
Ocean liners
Philippine fishing trawlers (see "Underwater Cable Break")

Even in the simple cases, there will be four or five computers between program A and B, each running its own programs to handle things like packet switching, traffic analysis, routing, threat analysis, and so on.

I've seen a single arrow, running from one server to another, labelled "Fulfillment". It so happened that one server was inside my client's company while the other server was in a fulfillment house's company. That little arrow, so critical to customer satisfaction, really represented a Byzantine chain of events that resembled a game of "Mousetrap" more than a single interface. It had messages going to message brokers that appended lines to files, which were later picked up by an hourly job that would FTP the files to the "gateway" server (still inside my client's company). The gateway server read each line from the file and constructed an XML message, which it then sent via HTTP to the fulfillment house.

It Stays Up

We analogize bridge-building as the epitome of engineering. (Side note: I live in the Twin Cities area, so we're a little leery of bridge engineering right now. Might better find another analogy, OK?) Engineering a bridge starts by examining the static and dynamic load factors that the bridge must support: traffic density, weight, wind and water forces, ice, snow, and so on.

Bridging between two programs should consider static and dynamic loads, too. Instead of just "SOAP-XML over HTTP", that one little arrow should also say, "Expect one query per HTTP request and send back one response per HTTP reply. Expect up to 100 requests per second, and deliver responses in less than 250 milliseconds 99.999% of the time."
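One way to keep those numbers from evaporating is to write them down as data that can be checked against measurements. The following Python sketch does that. `ArrowContract`, `meets_latency_target`, and `sku_lookup` are names invented here, and the figures are simply the ones from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrowContract:
    """Hypothetical record of what one arrow on the diagram promises."""
    protocol: str
    max_requests_per_second: int
    latency_limit_ms: float
    latency_target: float  # fraction of replies that must beat the limit


def meets_latency_target(contract, latencies_ms):
    """Return True if enough observed replies came in under the limit."""
    if not latencies_ms:
        return True  # nothing observed, nothing violated
    on_time = sum(1 for ms in latencies_ms if ms < contract.latency_limit_ms)
    return on_time / len(latencies_ms) >= contract.latency_target


# The example arrow from the text, written down as data.
sku_lookup = ArrowContract(
    protocol="SOAP-XML over HTTP",
    max_requests_per_second=100,
    latency_limit_ms=250.0,
    latency_target=0.99999,
)
```

Feeding `meets_latency_target` a window of measured response times gives a yes or no answer about the latency half of the promise. Note how demanding five nines is: a single 300 millisecond reply in a sample of 1,000 is already a violation.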

It Falls Down

Building the right failure modes is vital. The last job of any structure is to fall down well. The same is true for programs, and for our hardy little arrow.

The interface needs to define what happens on each end when things come unglued. What if the caller sends more than 100 requests per second? Is it OK to refuse them? Should the receiver drop requests on the floor, refuse politely, or make the best effort possible?
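As a sketch of the "refuse politely" option, here is a small fixed-window limiter in Python. It assumes the agreed limit is counted per whole second, and that an HTTP 503 is an acceptable way to say no; both are illustration choices, and `FixedWindowLimiter` and `handle` are hypothetical names. The other two options (drop silently, or best effort) would be just as easy to write. The point is that somebody has to pick one.

```python
class FixedWindowLimiter:
    """Admit at most `limit` requests in each one-second window."""

    def __init__(self, limit):
        self.limit = limit
        self.window = None
        self.count = 0

    def admit(self, now):
        """Return True to serve the request, False to refuse it.

        `now` is a timestamp in seconds, passed in so the caller picks the clock.
        """
        window = int(now)
        if window != self.window:
            # A new second has started, so the count starts over.
            self.window = window
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


def handle(limiter, now):
    """Refuse politely: a 503 tells the caller to back off and retry later."""
    if limiter.admit(now):
        return 200, "OK"
    return 503, "Over agreed rate, retry later"
```

In real use, `now` would come from something like `time.monotonic()`. Passing it in as a parameter keeps the sketch easy to test.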

What should the caller do when replies take more than 250 milliseconds? Should it retry the call? Should it wait until later, or assume the receiver has failed and move on without that function?

What happens when the caller sends a request with version 1.0 of the protocol and gets back a reply in version 1.1? What if it gets back some HTML instead of XML? Or an MP3 file instead of XML?
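Before the caller can decide what to do about any of these, it has to notice them. Here is a minimal Python sketch of that first step, assuming the caller can observe the elapsed time, the content type, and the protocol version of each reply. The function name, the labels, and the default values are hypothetical; what to do for each label is exactly the policy the interface has to define.

```python
def classify_reply(elapsed_ms, content_type, version,
                   limit_ms=250.0, expected_version="1.0"):
    """Name the failure mode of one reply, or return "ok".

    `elapsed_ms` is None when no reply arrived at all.
    """
    if elapsed_ms is None:
        return "no-reply"
    if elapsed_ms > limit_ms:
        return "too-slow"
    # Accept "text/xml; charset=utf-8" as well as bare "text/xml".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in ("text/xml", "application/xml"):
        return "wrong-content-type"
    if version != expected_version:
        return "version-mismatch"
    return "ok"
```

A reply that arrives in 120 milliseconds as "text/html" comes back as "wrong-content-type", and so does the MP3. Neither one ever reaches the XML parser.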

When a bridge falls down, it is shocking, horrifying, and often fatal. Computers and networks, on the other hand, fall down all the time. They always will. Therefore, it's incumbent on us to ensure that individual computers and networks fail in predictable ways. We need to know what happens to that arrow when one end disappears for a while.

In the White Space

This, then, is the essence of engineering in the white space. Decide what kind of load that arrow must support. Figure out what to do when the demand is more than it can bear. Decide what happens when the substrate beneath it falls apart, or when the duplicitous rectangle on the other end goes bonkers.

Inside the boxes, we find art.

The arrows demand engineering.

On the Widespread Abuse of SLAs

2007-08-06T13:55:45-05:00

Technical terminology sneaks into common use. Terms such as "bandwidth" and "offline" get used and abused, slowly losing touch with their original meaning. ("Bandwidth" has suffered multiple drifts. It started out in radio, not computer networking, let alone the idea of "personal attention space".) It is the nature of language to evolve, so I would have no problem with this linguistic drift, if it were not for the way that the mediocre and the clueless clutch at these seemingly meaningful phrases.

The latest victim of this linguistic vampirism is the "Service Level Agreement". This term, birthed in IT governance, sounds wonderful. It sounds formal and official.

An example of the vulgar usage: "I have a five-day SLA."

It sounds so very proactive and synergistic and leveraged, doesn't it? Theoretically, it means that we've got an agreement between our two groups; I am your customer and you commit to delivering service within five days.

A real SLA has important dimensions that I never see addressed with internal "organizational" SLAs.

First, boundaries.

When does that five day clock begin ticking? Is it when I submit my request to the queue? Or, is it when someone from your group picks the request up from the queue? If the latter, then how long do requests sit in queue before they get picked up? What's the best case? Worst case? Average?

When does the clock stop ticking? If you just say, "not approved" or "needs additional detail", does that meet your SLA? Do I have to resubmit for the next iteration, with a whole new five day clock? Or, does the original five day SLA run through resolution rather than just response?

An internal SLA must begin with submission into the request queue and end when the request is fully resolved.

Second, measurement and tracking.

How often do you meet your internal SLA? 100% of the time? 95% of the time? 50% of the time? Unless you can tell me your "on-time performance", there's no way for me to have confidence in your SLA.

How many requests have to be escalated or prioritized in order to meet SLA? Do any non-escalated requests actually get resolved within the allotted time?

How well does your on-time performance correlate with the incoming workload? If the request volume goes up by 25%, but your on-time performance does not change, then your SLA is too loose.

An SLA must be tracked and trended. It must be correlated with demand metrics.
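Tracking on-time performance does not take much. This Python sketch assumes each request records two timestamps, when it entered the queue and when it was fully resolved, and that the five days are calendar days. The function name and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def on_time_performance(requests, sla=timedelta(days=5)):
    """Fraction of resolved requests that met the SLA.

    Each request is a (submitted, resolved) pair of datetimes. The clock
    runs from submission into the queue until full resolution.
    """
    if not requests:
        return None  # no data, no claim
    on_time = sum(1 for submitted, resolved in requests
                  if resolved - submitted <= sla)
    return on_time / len(requests)


# Made-up sample: three requests resolved in time, one that took nine days.
sample = [
    (datetime(2007, 8, 1, 9, 0), datetime(2007, 8, 3, 17, 0)),
    (datetime(2007, 8, 1, 9, 0), datetime(2007, 8, 6, 9, 0)),
    (datetime(2007, 8, 2, 9, 0), datetime(2007, 8, 11, 9, 0)),
    (datetime(2007, 8, 3, 9, 0), datetime(2007, 8, 4, 9, 0)),
]
```

Run that over each week's requests, put the result next to that week's request volume, and you have the beginnings of the trend and the demand correlation.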

Third, consequences.

If there is no penalty, then there is no SLA. In fact, the IT Infrastructure Library considers penalties to be the defining characteristic of SLAs. (Of course, ITIL also says that SLAs are only possible with external suppliers, because it is only with external suppliers that you can have a contract.)

When was the last time that an internal group had its budget dinged for breaking an SLA? What would that even mean? How would the health and performance of the whole company be aided by taking resources away from a unit that already cannot perform? The Theory of Constraints says that you devote more resources to the bottleneck, not less. Penalizing you for breaking SLA probably makes your performance worse, not better.

(External suppliers are different because a) you're paying them, and b) they have a profit margin. I doubt the same is true for your own internal groups.)

If there's no penalty, then it's not an SLA.

Fourth, consent.

SLAs are defined by joint consent of both the supplier and consumer of the service. As a subscriber to your service, I can make economic judgments about how much to pay for what level of service. You can make economic judgments about how well you can deliver service at the required level for the offered payment.

When are internal "service level agreements" actually an "agreement"? Never. I always see SLAs being imposed by one group upon all of their subscribers.

An SLA must be an agreement, not a dictum.

If any of these conditions are not met, then it's not really an SLA. It's just a "best effort response time". As a consumer, and sometimes victim, of the service, I cannot plan to the SLA time. Rather, I must manage around it. Calling a "best effort response time" an "SLA" is just an attempt to deceive both of us.

Y B Slow?

2007-07-25T13:07:54-05:00

I've long been a fan of the Firebug extension for Firefox. It gives you great visibility into the ebb and flow of browser traffic. It sure beats rolling your own SOCKS proxy to stick between your browser and the destination site.

Now, I have to also endorse YSlow from Yahoo. YSlow adds interpretation and recommendations to Firebug's raw data.

For example, when I point YSlow at www.google.com, here's how it "grades" Google's performance:

Not bad. On the other hand, www.target.com doesn't fare as well.

Along with the high-level recommendations, YSlow will also tally up the page weight, including a nice breakdown of cached versus non-cached requests and download size.

There are so many good reasons to use this tool. In Release It, I spend a lot of time talking about the money companies waste on bloated HTML and unnecessary page requests. Fat pages hurt users and they hurt companies. Users don't want to wait for all your extra whitespace, table-formatting, and shims to download. Companies shouldn't have to pay for all the added, useless bandwidth. YSlow is a great tool to help eliminate the bloat, speed up page delivery, and make happy users.

The 5 A.M. Production Problem

2007-06-25T12:33:26-05:00

I've got a new piece up at InfoQ.com, discussing the limits of unit and functional testing:

"Functional testing falls short, however, when you want to build software to survive the real world. Functional testing can only tell you what happens when all parts of the system are behaving within specification. True, you can coerce a system or subsystem into returning an error response, but that error will still be within the protocol! If you're calling a method on a remote EJB that either returns "true" or "false" or it throws an exception, that's all it will do. No amount of functional testing will make that method return "purple". Nor will any functional test case force that method to hang forever, or return one byte per second.

One of my recurring themes in Release It is that every call to another system, without exception, will someday try to kill your application. It usually comes from behavior outside the specification. When that happens, you must be able to peel back the layers of abstraction, tear apart the constructed fictions of "concurrent users", "sessions", and even "connections", and get at what's really happening."

ITIL and Extreme Programming

2007-05-20T15:37:30-05:00

Esther Schindler asked if I'd be willing to post my earlier article on staying agile in the face of ITIL at CIO.com. How could I say no? The piece is here.

ITIL and XP

2007-05-06T12:57:14-05:00

The Agile Manifesto is explicit about it. "We value individuals and interactions over processes and tools." How should an Agile team---more specifically, an XP team---respond to the IT Infrastructure Library (ITIL), then? After all, ITIL takes seven books just to define the customizable framework for the actual practices. An IT organization usually takes at least seven more binders to define its actual processes.

Can XP and ITIL coexist in the same building, or is XP just incompatible with ITIL? In short: they can coexist.

ITIL and XP (or agile in general) are not fundamentally incompatible, but there will definitely be an interface between the XP world and the ITIL world. Whether this interface becomes an impedance barrier or not depends entirely on the way that your company chooses to implement ITIL.

I'll run down the Service Support processes and identify some of the problems I've encountered. (I'm focusing on Service Support because businesses tend to implement these processes first. Few of them get far enough down the road to really attack the Service Delivery processes. It's a shame, because I see a lot of value in the Service Delivery approach.) I will cover the Service Delivery processes in a future article.

Service Desk

An effective service desk can be a great asset to any team, including an XP team. Getting accurate feedback on issues your users are having can only benefit your development efforts and ultimately, the users themselves. The key here is to make sure that the service desk is well-prepared to accept responsibility for support calls on your app.

I strongly recommend that you start working with the service desk at least six weeks before your first application release. If the service desk is mature, they'll have job aids for capturing app support needs. These will provide the minimum initial information needed for the knowledge base. The service desk personnel will augment that knowledge base over time with whatever solutions, rumors, superstitions and folk remedies they come up with. Be sure you have access to the knowledge base, so you can help weed out the "false solutions."

You also want to get on the distribution list for ticket reports from the service desk. These will tell you what issues your users are encountering. Commonly recurring or high-impact issues should become cards for consideration in your next iteration. This feeds your interface to the Problem Management process.

If the service desk is not mature, you haven't prepared them well, or they do not perform resolution for application incidents, you will be looped in as part of the Incident Management process, below. This has some special challenges.

Incident Management

ITIL defines an "incident" as any disruption to the normal operation of a system or application. This includes bugs, outages, and even "PEBKAC" problems. The Incident Management process begins with notification of an incident. This can be logged by the service desk in response to a user call. It can even be automatically created by a monitoring system. It ends when normal functioning of the system is restored.

Note that this does not include root cause analysis or correction! Incident Management is all about restoring service.

Ideally, the service desk handles the entire Incident Management process and your team will not even need to be involved. In less ideal cases, you may be called on to help resolve "novel" incidents--ones that do not have a solution in the service desk's knowledge base.

When incidents come into the development room, you have some negative forces to deal with. By definition, the incident needs to be resolved expeditiously, making it both interrupt driven and urgent. Therefore, every incident will automatically split a pair and take somebody off their card. This is damaging to flow.

In worse cases, the entire team may get derailed and start huddling around the incident. Fire-fighting is exciting quadrant I work. It's natural to get a rush from being the hero. The problem is obvious, though. If the entire team is chasing the incident, nobody is making forward progress on the iteration. If you have a large user community or a lot of incidents, you can lose an entire day---or an entire iteration---before you realize it.

This can be exacerbated if your service desk never resolves application support incidents. In such cases, I recommend the "Designated Sacrifice" pattern. Assign one member of the team to handle the "Bat-Phone" calls and be the primary point of contact for incident resolution. This is a crappy job---you get pulled away constantly, can't maintain focus, get almost no card work done---so you'll want to rotate that position frequently. (On the other hand, there is that hero factor that provides some consolation.) Even doing it for one full iteration can be very demoralizing.

Problem Management

Recurring incidents can be identified as Problems that require correction. This is the job of the Problem Management process.

Identifying a Problem is often done by the service desk, but it can also come from other quarters. The decision about which Problems require correction often becomes very slow and bureaucratic. This is a process you want to work with very closely. Problem Management typically tolerates a much higher level of outstanding defects than an XP team wants to allow. I've seen teams get chewed out for fixing Problems that weren't scheduled to be addressed for a couple of iterations! Imagine how surreal that meeting feels!

Problem managers should be encouraged to write cards. Your team should even reserve a fraction of your velocity in each iteration just to handle Problems. You also need to communicate back to the problem managers when Problem cards are completed. Really good Problem Management identifies a few problem states such as "known problem", "known workaround", and "known solution". An XP team will typically move through these states pretty quickly.

Bear in mind that the ITIL definition of Problem Management is all about oversight, not the actual changes needed to fix the problem. The actual changes are deployed as part of Release Management.

Change Management

No part of ITIL gives more people cold sweats than Change Management. This is the process that so easily slips into heavyweight bureaucracy or, worse, meaningless CAB meetings.

Change Management as defined simply means tracking changes, their impact to configuration items, and ensuring that changes are applied in an orderly way. It doesn't have to hurt.

In reality, however, XP teams will spend a lot of time preparing for change advisory board meetings. Beware: the XP team may get a bad reputation for creating "too much" change.

I recommend standardizing your change and deployment process. Get into a regular rhythm of releases and deployments so the CAB just knows to expect that every third Tuesday (or whenever), your team will have a deployment. Standardize the deployment mechanics and system impact statement so you can templatize and re-use your change requests. Familiarity will create confidence with the CAB. Constantly showing them change requests they've never seen before will raise their level of scrutiny.

Failed changes also trigger more scrutiny. Your XP team will have an advantage here, because your rigorous approach to automated testing will reduce the incidence of failed changes, right?

Configuration Management

Configuration Management is *not* the act of changing configuration items. It's the process for tracking planned, executed, and retired configurations. As you plan each release, you should identify the CIs that will be affected by the release.

In a well-executed ITIL rollout, configuration management is vital for change management, incident management, the service desk, and release management. In a poorly-executed ITIL rollout, configuration management doesn't exist, or it only addresses servers or network devices.

CM should cover servers, network topology, applications, business processes, documentation, and the dependencies among all of them. That way, proposed changes to one CI (e.g., upgrade to front-end firewalls) can be analyzed for their impact. This is CM nirvana, seldom achieved.

The XP team should have an advantage here again, because you've already broken story cards down to tasks at the beginning of an iteration. That means you already know which applications and servers will be changed in that iteration. Roll up a few iterations into a release, and the CIs affected by the release should be well known.

On the other hand, if you've taken XP to its "no documentation" extreme, then you will not have tracked the CIs touched by each iteration. This underscores a common misinterpretation of XP; it doesn't eschew all documentation, just the documentation that doesn't add value from the customer's perspective. So, does tracking changes against CIs add value from the customer's perspective? Not directly, no. There is an indirect benefit, in that the customer will receive better uptime and performance, but that may seem remote to the team. The best I can say is that this is one place where you'll have to chalk it up to "necessary overhead".

Release Management

This is an easy one to integrate with your XP team. Release Management dovetails quite naturally with XP's release planning cycle. Engage early, though, because the ITIL process will likely require longer lead times than your team is used to.

Release It holding strong at Amazon

2007-04-30T13:19:02-05:00

Well, Release It continues to hold the #1 spot in Amazon's "Hot New Releases" list for Design Tools and Techniques. I've even got a couple of five-star reviews... and they weren't written by friends or family.

Heads down

2007-04-30T09:44:11-05:00

I've been quiet lately for a couple of reasons.

First, I'm thrilled to say that I'm joining the No Fluff, Just Stuff stable of speakers. It's an honor and a pleasure to be invited to keep such company. The flip side is, I'm spending a lot of my free time polishing up my inventory of presentations. More frankly, I'm rebuilding them all with Keynote. (Brief aside, I'm coming to love Keynote. It has some flaws and annoyances, but the result is worth it!)

I'll debut the first of these new presentations at OTUG on May 15th. I'll be speaking about "Design for Operations". The talk will be about 70% from the last part of Release It, and about 30% original content. OTUG will be giving away a couple of copies of my book, but you have to be there to win!

Finally, I'm working on an article about performance and capacity management. Most capacity planning work is done entirely within Operations, without much involvement from Development. At the same time, most developers don't have a visceral appreciation for how dramatically the application's efficiency can affect the system's overall profitability.

This article will show the relationship between application response time, system capacity, and financial success. I'm hoping to include a simulator app for download that you can use to play with different scenarios to see what a dramatic difference 100ms can make.

Coach and Team From Same Firm

2007-04-22T15:23:18-05:00

Is it an antipattern to have a consulting firm provide both the coach and developers? By providing the developers, the firm is motivated to deliver on the project, with coaching as an adjunct. If, instead, the firm provides just the coach, it will be judged by how well the client adopts the process. These two motives can easily conflict.

Case in point: at a previous client of mine, my employer was charged with completing the project, using a 50-50 mix of contractors and client developers. My employer, a consulting firm, provided several developers experienced with XP and Scrum, as well as an agile coach. The firm was thus charged with two imperatives: first, deliver the project; second, introduce agile methods within the client.

With project success as a requirement, the firm decided to interview the developers at the outset of the project. The client's developers (rightly) perceived that they were interviewing for their own jobs. This started a negative dynamic that ultimately resulted in 80% attrition among the client's developers.

On a pure coaching engagement, the coach would probably have "made do" with whomever the client provided.

We delivered all the features, basically on time, with very high quality. Financially speaking, it was a success, generating more orders and more revenue per order than its predecessor. It is harder to say that the engagement as a whole was a success, though. Almost all of the developers were contractors, so the client got their product, but very little adoption of agile methods.

Perhaps if the coach and the contract developers had come from different firms, the motivations would not have been as tangled, and more of the client's valuable people would have stayed. The team might not have suffered from the strained, unhealthy environment from the early days of the project.

Then again, perhaps not. The client may have been expecting that level of attrition. Maybe that's just to be expected when you are trying to bring a random selection of corporate developers over to agile methods, especially if the methods are decreed from above instead of brought upward by grass-roots. Maybe the dynamic would have existed even with a coach that was totally disinterested in the project outcome.

Moving Your Home Directory on Leopard

2007-04-20T17:10:51-05:00

Since NetInfo Manager is going away under Leopard, we've got a gap in capability. How do you relocate your home directory without the GUI?

There are a few reasons you might want to move your home directory to another volume. For example, you might reinstall your OS frequently. Or, perhaps you just want to keep your data on a bigger disk than the one that came in the machine. In my case, both.

The venerable NetInfo is being replaced entirely with Directory Services. (Try "man 8 DirectoryServices" for more information.) There's a handy command-line tool you can use to interact with the DirectoryServices.

Let's start by opening up a Terminal window. (Applications > Utilities > Terminal) At first, you'll be logged in as yourself, not as root.

Last login: Wed Dec 31 18:00:00 on ttyp0
donk:~ mtnygard$

The first thing is to get out of your home directory, because we're going to delete it in about a minute and a half. Change to the root directory and make yourself into the root user with "sudo".

Last login: Wed Dec 31 18:00:00 on ttyp0
donk:~ mtnygard$ sudo su -
Password:
donk:~ root#

Next, fire up "dscl", the directory services command line. Without arguments, this gives you an interactive, shell-like environment to explore the directory. It also spews a bunch of help messages. If you give it "localhost", then it quietly assumes you wanted to interact with the directory.

You can list entries, cd around the directory hierarchy, and even create entries or change attributes.

User information is stored under /Local/Users, so we'll cd to that now.

donk:~ root# dscl localhost
 > cd /Local/Users
/Local/Users >

Now, running "ls" will show you all the users that your machine knows. Try it now.

donk:~ root# dscl localhost
 > cd /Local/Users
/Local/Users > ls
_amavisd
_appowner
_appserver
_ard
_calendar
_clamav
_cvs
_cyrus
_eppc
_installer
_jabber
_lp
_mailman
_mcxalr
_mdnsresponder
_mysql
_pcastagent
_pcastserver
_postfix
_qtss
_securityagent
_serialnumberd
_spotlight
_sshd
_svn
_teamsserver
_tokend
_unknown
_update_sharing
_uucp
_windowserver
_www
_xgridagent
_xgridcontroller
daemon
mtnygard
nobody
root
/Local/Users >

Holy crap! Who the hell are all these people?

Well, of course, they aren't people. All the usernames starting with an underscore are application IDs. Root, nobody, and daemon are all part of the OS. Once you eliminate them, there should just be the people you've actually created accounts for. If you see any names you don't recognize at this point, this would be a good time to shut off your network connection.

At this point, you could "cd" directly into the entry for your user. It won't show you anything special; users do not have subnodes in the directory. It would set up your context for future commands, limiting them to just that user. In this case, however, we'll stay at /Local/Users and run "cat" on my username.

/Local/Users > cat mtnygard
dsAttrTypeNative:_writers_hint: mtnygard
dsAttrTypeNative:_writers_jpegphoto: mtnygard
dsAttrTypeNative:_writers_passwd: mtnygard
dsAttrTypeNative:_writers_picture: mtnygard
dsAttrTypeNative:_writers_realname: mtnygard
dsAttrTypeNative:authentication_authority: ;ShadowHash;
dsAttrTypeNative:generateduid: 7F6A8EDE-63EC-4A34-9391-031A9C77806D
dsAttrTypeNative:gid: 501
dsAttrTypeNative:hint: 
dsAttrTypeNative:home: /Users/mtnygard
dsAttrTypeNative:jpegphoto:
 ffd8ffe0 00104a46 49460001 01000001 00010000 ffdb0043 00020202 ... 7fffd9
dsAttrTypeNative:name: mtnygard
dsAttrTypeNative:passwd: ********
dsAttrTypeNative:picture:
 /Library/User Pictures/Sports/Tennis.tif
dsAttrTypeNative:realname:
 Michael Nygard
dsAttrTypeNative:shell: /bin/bash
dsAttrTypeNative:uid: 501
AppleMetaNodeLocation: /Local/Default
AuthenticationAuthority: ;ShadowHash;
AuthenticationHint: 
GeneratedUID: 7F6A8EDE-63EC-4A34-9391-031A9C77806D
JPEGPhoto:
 ffd8ffe0 00104a46 49460001 01000001 00010000 ffdb0043 00020202 ... 7fffd9
NFSHomeDirectory: /Users/mtnygard
Password: ********
Picture:
 /Library/User Pictures/Sports/Tennis.tif
PrimaryGroupID: 501
RealName:
 Michael Nygard
RecordName: mtnygard
RecordType: dsRecTypeStandard:Users
UniqueID: 501
UserShell: /bin/bash
/Local/Users >

Hmm. Seems like it must mean something. This is listing the values of all the attributes of my user profile. It's what I want, but there's a big pile of noise in the middle. That noise is a textual representation of my profile's JPEG. (I've edited it out of this transcript.) If you scroll up past that, you'll see the attribute of real interest.

The property dsAttrTypeNative:home tells the OS where to find my home directory.

I can change it with dscl's "change" command. The format of change is a little strange because it has to deal with multi-valued properties (as do all of the directory services commands.)

/Local/Users > change mtnygard dsAttrTypeNative:home /Users/mtnygard /Volumes/Data/mtnygard
/Local/Users >

The first parameter is the object to change, the second parameter is the attribute to change. The third parameter is the old value that you want to replace (multi-valued list for each attribute, remember.) Finally, the fourth parameter is the new value you want to set.

Whew.

Not quite done yet, though. I've given the OS a bogus home directory. There's no such directory as /Volumes/Data/mtnygard yet.

To get there, I have to move my directory from under /Users to the new location. I have to do this as root, but I don't want root to end up owning all my personal stuff. Fortunately, there's a "cp" option for that.

donk:~ # cp -Rp /Users/mtnygard /Volumes/Data/

Now, we're almost, almost done. Log off and log back on into your roomy new home directory.

Caveats:

I don't know how to do this if you've got a shared directory tree set up. You might have that if you're on a Mac network at work, for example. You should definitely try this at home.
The "cp" command I use will do really funky things if you've got hard links, symlinks, or especially circular symlinks in your home directory. Then again, if you've done that to yourself, you probably know enough Unix to work out your own parameters for "cp", "tar", "mv" or "cpio".
One more thing: I'm not sure if this is from running a developer seed of Leopard, or if it's due to this home directory move technique, but I keep running into permissions problems. I couldn't automatically install dashboard widgets, for example. Adium complained that it couldn't create its "sounds" directory.

What makes a POJO so great, anyway?

2007-04-15T23:42:04-05:00

My friend David Hussman once said to me, "The next person that says the word 'POJO' to me is going to get stabbed in the eye with a pen." At the time, I just commiserated about people who follow crowds rather than making their own decisions.

David's not a violent person. He's not prone to fits of violence or even hyperbole. What made this otherwise level-headed coach and guru resort to non-approved uses of a Bic?

This weekend at No Fluff, Just Stuff, I had occasion to contemplate POJOs again. There were many presentations about "me too" web frameworks. These are the latest crop of Java web frameworks that are furiously copying Ruby on Rails features as fast as they can. These invariably make a big deal out of using POJOs for data-mapped entities or for the beans accessed by whatever flavor of page template they use. (See JSF, Seam, WebFlow, Grails, and Tapestry 5 for examples.)

Mainly, I think the infuriating bit is the use of the word "POJO" as if it's a synonym for "good". There's nothing inherently virtuous about plain old Java objects. It's a retronym; a name made up for an old thing to distinguish it from the inferior new replacement.

People only care about POJOs because EJB2 was so unbelievably bad.

Nobody gives a crap about "POROs" (Plain old Ruby objects) because ActiveRecord doesn't suck.

Release It! is shipping

2007-04-08T19:53:17-05:00

Release It is now shipping! People who ordered directly from The Pragmatic Programmers are receiving their hardcopies now. It will take Amazon and Barnes and Noble a few days or a week to work the inventory through their supply chain, but they should be shipping soon, too!

Flash Mobs and TCP/IP Connections

2007-04-08T19:35:43-05:00

In Release It, I talk about users and the harm they do to our systems. One of the toughest types of user to deal with is the flash mob. A flash mob often results from Attacks of Self-Denial, like when you suddenly offer a $3000 laptop for $300 by mistake.

When a flash mob starts to arrive, you will suddenly see a surge of TCP/IP connection requests at your load-distribution layer. If the mob arrives slowly enough (fewer than 1,000 connections per second) then the app servers will be hurt the most. For a really fast mob, like when your site hits the top spot on digg.com, you can get way more than 1,000 connections per second. This puts the hurt on your web servers.

As the TCP/IP connection requests arrive, the OS queues them for servicing by the application. On most stacks, the server's TCP/IP stack sends back the SYN/ACK packet and completes the handshake on its own, and the established connection waits in the queue until the application gets around to calling "accept" on the server socket. At that point, the server hands the established connection off to a worker thread to process the request. Meanwhile, the thread that accepted the connection goes back to accept the next one.

Well, when a flash mob arrives, the connection requests arrive faster than the application can accept and dispatch them. The TCP/IP stack protects itself by limiting the number of pending connection requests, so if the requests arrive faster than the application can accept them, the queue will grow until the stack has to start refusing connection requests. At that point, your server will be returning intermittent errors and you're already failing.

The solution is much easier said than done: accept and dispatch connections faster than they arrive.
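To make the moving parts concrete, here is a minimal sketch in Python (my choice of language for illustration, not something the servers discussed here use). It shows the two knobs described above: the backlog passed to "listen", which caps the queue of pending connections, and an acceptor thread that does nothing but accept and hand off to a worker pool. The names start_server, serve, and handle, the echo protocol, and the default sizes are all assumptions made up for this example.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


def handle(conn):
    """Worker: read one small request, echo it back, then close."""
    with conn:
        data = conn.recv(1024)
        conn.sendall(b"echo: " + data)


def serve(listener, pool, stop):
    """Acceptor: accept as fast as possible and hand off immediately."""
    listener.settimeout(0.1)  # wake up periodically to check the stop flag
    while not stop.is_set():
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            continue
        conn.settimeout(None)      # workers use ordinary blocking I/O
        pool.submit(handle, conn)  # dispatch; the acceptor never reads or writes
    listener.close()


def start_server(backlog=128, workers=8):
    """Start the acceptor thread; return (port, stop_event)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(backlog)         # upper bound on queued, not-yet-accepted connections
    stop = threading.Event()
    pool = ThreadPoolExecutor(max_workers=workers)
    threading.Thread(target=serve, args=(listener, pool, stop), daemon=True).start()
    return listener.getsockname()[1], stop
```

A bigger backlog only buys time. If the acceptor loop is slower than the arrival rate, the queue still fills and the stack still starts refusing connections.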

Filip Hanik compares some popular open-source servlet containers to see how well they stand up to floods of connection requests. In particular, he demonstrates the value of Tomcat 6's new NIO connector. Thanks to some very careful coding, this connector can accept 4,000 connections in 4 seconds on one server. Ultimately, he gets it to accept 16,000 concurrent connections on a single server. (Not surprisingly, RAM becomes the limiting factor.)

It's not clear that these connections can actually be serviced at that point, but that's a story for another day.

Release It! is released!

2007-03-30T14:05:55-05:00

"Release It!" has been officially announced in this press release. Andy Hunt, my editor, also posted announcements to several mailing lists.

It's been a long road, so I'm thrilled to see this release.

When you release a new software system, that's not the end of the process, but just the beginning of the system's life. It is the same thing here. Though it's taken me two years to get this book done and on the market, this is not the end of the book's creation, but the beginning of its life.

Self-Inflicted Wounds

2007-03-25T12:29:30-05:00

My friend and colleague Paul Lord said, "Good marketing can kill you at any time."

He was describing a failure mode that I discuss in "Release It!: Design and Deploy Production-Ready Software" as "Attacks of Self-Denial". These have all the characteristics of a distributed denial-of-service attack (DDoS), except that a company asks for it. No, I'm not blaming the victim for electronic vandalism... I mean, they actually ask for the attack.

The anti-pattern goes something like this: marketing conceives of a brilliant promotion, which they send to 10,000 customers. Some of those 10,000 pass the offer along to their friends. Some of them post it to sites like FatWallet or TechBargains. On the appointed day, hour, and minute, the site has a date with destiny as a million or more potential customers hit the deep link that marketing sent around in the email. You know, the one that bypasses the content distribution network, embeds a session ID in the URL, and uses SSL?

Nearly every retailer I know has done this to themselves at one point. Two holidays ago, one of my clients did it to themselves, when they announced that XBox 360 preorders would begin at a certain day and time. Between actual customers and the amateur shop-bots that the tech-savvy segment cobbled together, the site got crushed. (Yes, this was one where marketing sent the deep link that bypassed all the caching and bot-traps.)

Last holiday, Amazon did it to themselves when they discounted the XBox 360 by $300. (What is it about the XBox 360?) They offered a thousand units at the discounted price and got ten million shoppers. All of Amazon was inaccessible for at least 20 minutes. (It may not sound like much, but some estimates say Amazon generates $1,000,000 per hour during the holiday season, so that 20-minute outage probably cost them more than $300,000!)

In Release It!, I discuss some non-technical ways to mitigate this behavior, as well as some design and architecture patterns you can apply to minimize damage when one of these Attacks of Self-Denial occurs.

Design Patterns in Real Life

2007-03-16T16:53:52-05:00

I've seen walking cliches before. There was this one time in the Skyway that I actually saw a guy with a white cane being led by a woman with huge dark sunglasses and a guide dog. Today, though, I realized I was watching a design pattern played out with people instead of objects.

I've used the Reactor pattern in my software before. It's particularly helpful when you combine it with non-blocking multiplexed I/O, such as Java's NIO package.

Consider a server application such as a web server or mail transfer agent. A client connects to a socket on the server to send a request. The server and client talk back and forth a little bit, then the server either processes or denies the client's request.

If the server just used one thread, then it could only handle a single client at a time. That's not likely to make a winning product. Instead, the server uses multiple threads to handle many client connections.

The obvious approach is to have one thread handle each connection. In other words, the server keeps a pool of threads that are ready and waiting for a request. Each time through its main loop, the server gets a thread from the pool and, on that thread, calls the socket "accept" method. If there's already a client connection request waiting, then "accept" returns right away. If not, the thread blocks until a client connects. Either way, once "accept" returns, the server's thread has an open connection to a client.

At that point, the thread goes on to read from the socket (which blocks again) and, depending on the protocol, may write a response or exchange more protocol handshaking. Eventually, the demands of protocol satisfied, the client and server say goodbye and each end closes the socket. The worker thread pulls a Gordon Freeman and disappears into the pool until it gets called up for duty again.

It's a simple, obvious model. It's also really inefficient. Any given thread spends most of its life doing nothing. It's either blocked in the pool, waiting for work, or it's blocked on a socket "accept", "read", or "write" call.

If you think about it, you'll also see that the naive server can handle only as many connections as it has threads. To handle more connections, it must fork more threads. Forking threads is expensive in two ways. First, starting the thread itself is slow. Second, each thread requires a certain amount of scheduling overhead. Modern JVMs scale well to large numbers of threads, but sooner or later, you'll still hit the ceiling.

I won't go into all the details of non-blocking I/O here. (I can point you to a decent article on the subject, though.) Its greatest benefit is you do not need to dedicate a thread to each connection. Instead, a much smaller pool of threads can be allocated, as needed, to handle individual steps of the protocol. In other words, thread 13 doesn't necessarily handle the whole conversation. Instead, thread 4 might accept the connection, thread 29 reads the initial request, thread 17 starts writing the response and thread 99 finishes sending the response.

This model employs threads much more efficiently. It also scales to many more concurrent requests. Bookkeeping becomes a hassle, though. Keeping track of the state of the protocol when each thread only does a little bit with the conversation becomes a challenge. Finally, the (hideously broken) multithreading restrictions in Java's "selector" API make fully multiplexed threads impossible.

The Reactor pattern predates Java's NIO, but works very well here. It uses a single thread, called the Acceptor, to await incoming "events". This one thread sleeps until any of the connections needs service: either due to an incoming connection request, a socket ready to read, or a socket ready for write. As soon as one of these events occurs, the Acceptor hands the event off to a dispatcher (worker) thread that then processes the event.
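Here is a rough sketch of that event loop using Python's standard "selectors" module, which plays roughly the role that Java's selector API plays in NIO. To keep it short, this version departs from the description above in one way: handlers run inline on the single reactor thread instead of being handed to worker threads. The Reactor class, its method names, and the upper-casing "protocol" are invented for the example.

```python
import selectors
import socket


class Reactor:
    """One thread sleeps in select() and dispatches each ready socket to its handler."""

    def __init__(self, listener):
        self.sel = selectors.DefaultSelector()
        self.running = True
        listener.setblocking(False)
        # The handler to call is stored as the registration's data.
        self.sel.register(listener, selectors.EVENT_READ, self.on_accept)

    def on_accept(self, listener):
        """Event: an incoming connection request is waiting."""
        conn, _addr = listener.accept()
        conn.setblocking(False)
        self.sel.register(conn, selectors.EVENT_READ, self.on_read)

    def on_read(self, conn):
        """Event: a connected socket has data (or the peer closed it)."""
        data = conn.recv(1024)
        if data:
            # Assumption: replies are tiny, so they fit in the send buffer.
            conn.sendall(data.upper())
        else:
            self.sel.unregister(conn)
            conn.close()

    def run(self):
        while self.running:
            # Sleep until some registered socket needs service.
            for key, _mask in self.sel.select(timeout=0.1):
                key.data(key.fileobj)  # dispatch to on_accept or on_read
        self.sel.close()
```

A fuller version would also register for write readiness and, as described above, pass each event to a pool of workers so that a slow handler cannot stall the loop.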

You can visualize this by sitting in a TGI Friday's or Chili's restaurant. (I'm fond of the crowded little ones inside airports. You know, the ones with a third of the regular menu and a line stretching out the door. Like a home away from home for me lately.) The "greeter" accepts incoming connections (people) and hands them off to a "worker" (server). The greeter is then ready for the next incoming request. (The line out the door is the listen queue, in case you're keeping score.) When the kitchen delivers the food, it doesn't wait for the original worker thread. Instead, a different worker thread (a runner) brings the food out to the table.

I'll keep my eyes open for other examples of object-oriented design patterns in real life--though I don't expect to see many based on polymorphism.

Another Path to a Killer Product

2007-03-06T10:40:28-06:00

Give individuals powers once reserved for masses

Here's a common trajectory:

1. Something is so expensive that groups (or even an entire government) have to share it. Think about mainframe computers in the Sixties.

2. The price comes down until a committed individual can own one. Think homebrew computers in the Seventies. The "average" person wouldn't own one, but the dedicated geek-hobbyist would.

3. The price comes down until the average individual can own one. Think PCs in the Eighties.

4. The price comes down until the average person owns dozens. PCs, game consoles, MP3 players, GPS navigators, laptops, embedded processors in toasters and cars. An average person may have half a dozen devices that once were considered computers.

Along the way, the product first gains broader and broader functionality, then becomes more specific and dedicated.

Telephones, radios and televisions all followed the same trajectory. You would probably call these moderately successful products.

So: find something so expensive that groups have to purchase and share it. Make it cheap enough for a private individual.

Quantum Manipulations

2007-02-10T14:15:03-06:00

I work in information technology, but my first love is science. Particularly the hard sciences of physics and cosmology.

There've been a series of experiments over the last few years that have demonstrated quantum manipulations of light and matter that approach the macroscopic realm.

A recent result from Harvard (HT to Dion Stewart for the link) has gotten a lot of (incorrect) play. It involves absorbing photons with a Bose-Einstein condensate, then reproducing identical photons at some distance in time and space. I've been reading about these experiments with a lot of interest, along with the experiments going the "other" direction: supraluminal group phase travel.

I wish the science writers would find a new metaphor, though. They all talk in terms of "stopping light" or "speeding up light". None of these have to do with changing the speed of light, either up or down. This is about photons, not the speed of light.

In fact, this latest one is even more interesting when you view it in terms of the "computational universe" theory of Seth Lloyd. What they've done is captured the complete quantum state of the photons, somehow 'imprinted' on the atoms in the condensate, then recreated the photons from that quantum state.

This isn't mere matter-energy conversion as the headlines have said. It's something much more.

The Bose-Einstein condensate can be described as a phase of matter colder than a solid. It's much weirder than that, though. In the condensate, all the particles in all the atoms achieve a single wavefunction. You can describe the entire collection of protons, neutrons and electrons as if it were one big particle with its own wavefunction.

This experiment with the photons shows that the photons' wavefunctions can be superposed with the wavefunction of the condensate, then later extracted to separate the photons from the condensate.

The articles somewhat misrepresent this as being about converting light (energy) to matter, but it's really about converting the photon particles to pure information then using that information to recreate identical particles elsewhere. Yikes!

A path to a product

2007-02-04T10:07:11-06:00

Here's a "can't lose" way to identify a new product: Enable people to plan ahead less.

Take cell phones. In the old days, you had to know where you were going before you left. You had to make reservations from home. You had to arrange a time and place to meet your kids at Disney World.

Now, you can call "information" to get the number of a restaurant, so you don't have to decide where you're going until the last possible minute. You can call the restaurant for reservations from your car while you're already on your way.

With cell phones, your family can split up at a theme park without pre-arranging a meeting place or time.

Cell phones let you improvise with success. Huge hit.

GPS navigation in cars is another great example. No more calling AAA weeks before your trip to get "TripTik" maps. No more planning your route on a road atlas. Just get in your car, pick a destination and start driving. You don't even have to know where to get gas or food along the way.

Credit and debit cards let you go places without planning ahead and carrying enough cash, gold, or jewels to pay your way.

The Web is the ultimate "preparation avoidance" tool. No matter what you're doing, if you have an always-on 'Net connection, you can improvise your way through meetings, debates, social engagements, and work situations.

Find another product that lets procrastinators succeed, and you've got a sure winner. There's nothing that people love more than the personal liberation of not planning ahead.

How to become an "architect"

2007-01-28T01:04:40-06:00

Over at The Server Side, there's a discussion about how to become an "architect". Though TSS comments often turn into a cesspool, I couldn't resist adding my own two cents.

I should also add that the title "architect" is vastly overused. It's tossed around like a job grade on the technical ladder: associate developer, developer, senior developer, architect. If you talk to a consulting firm, it goes more like: senior consultant (1 - 2 years experience), architect (3 - 5 years experience), senior technical architect (5+ years experience). Then again, I may just be too cynical.

The architecture of a system should have several qualities:

Shared. All developers on the team should have more or less the same vision of the structure and shape of the overall system.
Incremental. Grand architecture projects lead only to grand failures.
Adaptable. Successful architectures can be used for purposes beyond their designers' original intentions. (Examples: Unix pipes, HTTP, Smalltalk)
Visible. The "sacred, invisible architecture" will fall into disuse and disrepair. It will not outlive its creator's tenure or interest.

Is the designated "architect" the only one who can produce these qualities? Certainly not. He/she should be the steward of the system, however, leading the team toward these qualities, along with the other -ilities, of course.

Finally, I think the most important qualification of an architect should be: someone who has created more than one system and lived with it in production. Note that automatically implies that the architect must have at least delivered systems into production. I've run into "architects" who've never had a project actually make it into production, or if they have, they've rolled off the project---again with the consultants---just as Release 1.0 went out the door.

In other words, architects should have scars.

Planning to Support Operations

2007-01-14T10:44:14-06:00

In 2005, I was on a team doing application development for a system that would be deployed to 600 locations. About half of those locations would not have network connections. We knew right away that deploying our application would be key, particularly since it is a "rich-client" application. (What we used to call a "fat client", before they became cool again.) Deployment had to be done by store associates, not IT. It had to be safe, so that a failed deployment could be rolled back before the store opened for business the next day. We spent nearly half of an iteration setting up the installation scripts and configuration. We set our continuous build server up to create the "setup.exe" files on every build. We did hundreds of test installations in our test environment.

Operations said that our software was "the easiest installation we've ever had." Still, that wasn't the end of it. After the first update went out, we asked operations what could be done to improve the upgrade process. Over the next three releases, we made numerous improvements to the installers:

Make one "setup.exe" that can install either a server or a client, and have the installer itself figure out which one to do.
Abort the install if the application is still running. This turned out to be particularly important on the server.
Don't allow the user to launch the application twice. Very hard to implement in Java. We were fortunate to find an installer package that made this a check-box feature in the build configuration file!
Don't show a blank Windows command prompt window. (An artifact of our original .cmd scripts that were launching the application.)
Create separate installation discs for the two different store brands.
When spawning a secondary application, force its window to the front, avoiding the appearance of a hang if the user accidentally gives focus to the original window.

These changes reduced support call volume by nearly 50%.
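For the "don't launch twice" item, the project above relied on an installer feature, so what follows is not what we did. It is only a sketch of one common trick, written in Python for brevity: grab a fixed local TCP port at startup and treat failure to bind as "another copy is already running". The function name and the idea of reserving a port for the application are assumptions for illustration.

```python
import socket


def acquire_single_instance_lock(port, host="127.0.0.1"):
    """Return a bound socket if we are the first instance, otherwise None.

    The caller must keep the returned socket open for the life of the process;
    the OS releases the port automatically when the process exits or crashes.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        # Something already holds the port: assume it is another instance.
        sock.close()
        return None
    sock.listen(1)
    return sock
```

At startup the application would call this with its reserved port (a hypothetical value such as 47200) and exit with a friendly message if it gets None back. The weakness is that an unrelated program holding the same port looks like a second instance.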

My point is not to brag about what a great job we did. (Though we did a great job.) To keep improving our support for operations, we deliberately set aside a portion of our team capacity each iteration. Operations had an open invitation to our iteration planning meetings, where they could prioritize and select story cards the same as our other stakeholders. In this manner, we explicitly included Operations as a stakeholder in application construction. They consistently brought us ideas and requests that we, as developers, would not have come up with.

Furthermore, we forged a strong bond with Operations. When issues arose---as they always will---we avoided all of the usual finger-pointing. We reacted as one team, instead of two disparate teams trying to avoid responsibility for the problems. I attribute that partly to the high level of professionalism in both development and operations, and partly to the strong relationship we created through the entire development cycle.

"Us" and "Them"

2007-01-02T22:34:43-06:00

As a consultant, I've joined a lot of projects, usually not right when the team is forming. Over the years, I've developed a few heuristics that tell me a lot about the psychological health of the team. Who lunches together? When someone says "whole team meeting," who is invited? Listen for the "us and them" language. How inclusive is the "us" and who is relegated to "them?" These simple observations speak volumes about the perception of the development team. You can see who they consider their stakeholders, their allies, and their opponents.

Ten years ago, for example, the users were always "them." Testing and QA was always "them." Today, particularly on agile teams, testers and users often get "us" status. (As an aside, this may be why startups show such great productivity in the early days. The company isn't big enough to allow "us" and "them" thinking to set in. Of course, the converse is true as well: us and them thinking in a startup might be a failure indicator to watch out for!) Watch out if an "us" suddenly becomes "them." Trouble is brewing!

Any conversation can create a "happy accident": some understanding that obviates a requirement, avoids a potential bug, reduces cost, or improves the outcome in some other way. Conversations prevented thanks to an armed-camp mentality are opportunities lost.

One of the most persistent and perplexing "us" and "them" divisions I see is between development and operations. Maybe it's due to the high org-chart distance (OCD) between development groups and operations groups. Maybe it's because development doesn't tend to plan as far ahead as operations does. Maybe it's just due to a long-term dynamic of requests and refusals that sets each new conversation up for conflict. Whatever the cause, two groups that should absolutely be working as partners often end up in conflict, or worse, barely speaking at all.

This has serious consequences. People in the "us" tent get their requests built very quickly and accurately. People in the "them" tent get told to write specifications. Specifications have their place. Specifications are great for the fourth or fifth iteration of a well-defined process. During development, though, ideas need to be explored, not specified. If a developer has a vague idea about using the storage area network to rapidly move large volumes of data from the content management system into production, but he doesn't know how to write the request, the idea will wither on the vine.

The development-operations divide virtually ensures that applications will not be transitioned to operations as effectively as possible. Some vital bits of knowledge just don't fit into a document template. For example, developers have knowledge about the internals of the application that can help diagnose and recover from system failures. (Developer: "Oh, when you see all the request handling threads blocked inside the K2 client library, just bounce the search servers. The app will come right back." Operations: "Roger that. What's a thread?") These gaps in knowledge degrade uptime, either by extending outages or preventing operations from intervening. If the company culture is at all political, one or two incidents of downtime will be enough to start the finger-pointing between development and operations. Once that corrosive dynamic gets started, nothing short of changing the personnel or the leadership will stop it.

Inviting Domestic Disaster

2006-12-26T18:22:03-06:00

We had a minor domestic disaster this morning. It's not unusual. With four children, there's always some kind of crisis. Today, I followed a trail of water along the floor to my youngest daughter. She was shaking her "sippy cup" upside down, depositing a full cup of water on the carpet... and on my new digital grand piano.

Since the entire purpose of the "sippy cup" is to contain the water, not to spread it around the house, this was perplexing.

On investigation, I found that this failure in function actually mimicked common dynamics of major disasters. In "Inviting Disaster", James R. Chiles describes numerous mechanical and industrial disasters, each with a terrible cost in lives. In Release It, I discuss software failures that cost millions of dollars---though, thankfully, no lives. None of these failures come as a bolt from the blue. Rather, each one has precursor incidents: small issues whose significance is only obvious in retrospect. Most of these chains of events also involve humans and human interaction with the technological environment.

The proximate cause of this morning's problem was inside the sippy cup itself. The removable valve was inserted into the lid backwards, completely negating its purpose. A few weeks earlier, I had pulled a sippy cup from the cupboard with a similarly backward valve. I knew it had been assembled by my oldest, who has the job of emptying the dishwasher, so I made a mental note to provide some additional instruction. Of course, mental notes are only worth the paper they're written on. I never did get around to speaking with her about it.

Today, my wonderful mother-in-law, who is visiting for the holidays, filled the cup and gave it to my youngest child. My mother-in-law, not having dealt with thousands of sippy cup fillings, as I have, did not notice the reversed valve, or did not catch its significance.

My small-scale mess was much easier to clean up than the disasters in "Release It!" or "Inviting Disaster". It shared some similar features, though. The individual with experience and knowledge to avert the problem--me--was not present at the crucial moment. The preconditions were created by someone who did not recognize the potential significance of her actions. The last person who could have stopped the chain of events did not have the experience to catch and stop the problem. Change any one of those factors and the crisis would not have occurred.

Book Completed

2006-12-13T20:44:00-06:00

I'm thrilled to report that my book is now out of my hands and into the hands of copy editors and layout artists.

It's been a long trip. At the beginning, I had no idea just how much work was needed to write an entire book. I started this project 18 months ago, with a sample chapter, a table of contents, and a proposal. That was a few hundred pages, three titles, and a thousand hours ago.

Now "Release It! Design and Deploy Production-Ready Software" is close to print. Even in these days of the permanent ephemerance of electronic speech, there's still something incomparably electric about seeing your name in print.

Along with publication of the book, I will be making some changes to this blog. First, it's time to find a real home. That means a new host, but it should be transparent to everyone but me. Second, I will be adding non-blog content: excerpts from the book, articles, and related content. (I have some thoughts about capacity management that need a home.) Third, if there is interest, I will start a discussion group or mailing list for conversation about survivable software.

Reflexivity and Introspection

2006-10-07T15:22:00-05:00

A fascinating niche of programming languages consists of those languages which are constructed in themselves. For instance, Squeak is a Smalltalk whose interpreter is written in Squeak. Likewise, the best language for writing a LISP interpreter turns out to be LISP itself. (That one is more like nesting than bootstrapping, but it's closely related.)

I think Ruby has enough introspection to be built the same way. Recently, a friend clued me in to PyPy, a Python interpreter written in Python.

I'm sure there are many others. In fact the venerable GCC is written in its own flavor of C. Compiling GCC from scratch requires a bootstrapping phase, by compiling a small version of GCC, written in a more portable form of C, with some other C compiler. Then, the phase I micro-GCC compiles the whole GCC for the target platform.

Reflexivity arises when the language has sufficient introspective capabilities to describe itself. I cannot help but be reminded of Gödel, Escher, Bach and the difficulties that reflexivity causes. Gödel's Theorem doesn't really kick in until a formal system is complex enough to describe itself. At that point, Gödel's Theorem proves that there will be true statements, expressed in the language of the formal system, that cannot be proven true. These are inevitably statements about themselves---the symbolic logic form of, "This sentence is false."

Long-time LISP programmers create works with such economy of expression that we can only use artistic metaphors to describe them. Minimalist. Elegant. Spare. Rococo.

Forth was my first introduction to self-creating languages. Forth starts with a tiny kernel (small enough that it fit into a 3KB cartridge for my VIC-20) that gets extended one "word" at a time. Each word adds to the vocabulary, essentially customizing the language to solve a particular problem. It's really true that in Forth, you don't write programs to solve problems. Instead, you invent a language in which solving the problem is trivial, then you spend your time implementing that language.
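To show what "extended one word at a time" means, here is a toy sketch in Python. It is not a real Forth: it is an assumed, stripped-down interpreter with a stack, three built-in words, and the colon syntax for defining new words out of old ones. The function make_forth and everything inside it are invented for this example.

```python
def make_forth():
    """Return a run(source) function backed by its own stack and vocabulary."""
    stack = []
    words = {
        "+": lambda: stack.append(stack.pop() + stack.pop()),
        "*": lambda: stack.append(stack.pop() * stack.pop()),
        "dup": lambda: stack.append(stack[-1]),
    }

    def execute(tokens):
        i = 0
        while i < len(tokens):
            token = tokens[i]
            if token == ":":
                # ": name body... ;" adds a new word built from existing ones.
                end = tokens.index(";", i)
                name, body = tokens[i + 1], tokens[i + 2:end]
                words[name] = lambda body=body: execute(body)
                i = end + 1
                continue
            if token in words:
                words[token]()            # run a word from the vocabulary
            else:
                stack.append(int(token))  # anything else must be an integer literal
            i += 1

    def run(source):
        execute(source.split())
        return stack

    return run


run = make_forth()
run(": square dup * ;")     # teach the language a new word
run(": cube dup square * ;")  # build the next word on top of it
print(run("3 cube"))        # prints [27]
```

After the two definitions, "cube" is as much a part of the language as "+" is, which is the whole point.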

Another common aspect of these self-describing languages seems to be that they never become widely popular. I've heard several theories that attempted to explain this. One says that individual LISP programmers are so productive that they never need large teams. Hence, cross-pollination is limited and it is hard to demonstrate enough commercial demand to seem convincing. Put another way: if you started with equal populations of Java and LISP programmers, demand for Java programmers would quickly outstrip demand for LISP programmers... not because it's a superior language, but just because you need more Java programmers for any given task. This demand becomes self-reinforcing, as commercial programmers go where the demand is, and companies demand what they see is available.

I also think there's a particular mindset that admires and relates to the dynamic of the self-creating language. I suspect that programmers possessing that mindset are also the ones who get excited by metaprogramming.

Education as mental immune system

2006-09-25T17:22:00-05:00

Education and intelligence act like a memetic immune system. For instance, anyone with knowledge of chemistry understands that "binary liquid explosives" are a movie plot, not a security threat. On the other hand, lacking education, TSA officials told a woman in front of me to throw away her Dairy Queen ice cream cones before she could board the plane. Ice cream.

How in the hell is anyone supposed to blow up a plane with ice cream? It defies imagination.

She was firmly and seriously told, "Once it melts, it will be a liquid and all liquids and gels are banned from the aircraft."

I wanted to ask him what the TSA's official position was on colloidal solids. They aren't gels or liquids, but amorphous liquids trapped in a suspension of solid crystals. Like a creamy mixture of dairy fats, egg yolks, and flavoring trapped in a suspension of water ice crystals.

I didn't, of course. I've heard the chilling warnings, "Jokes or inappropriate remarks to security officials will result in your detention and arrest." (Real announcement. I heard it in Houston.) In other words, mouth off about the idiocy of the system and you'll be grooving to Britney Spears in Gitmo.

On the other hand, there are other ideas that only make sense if you're overly educated. Dennis Prager is fond of saying that you have to go to graduate school to believe things like, "The Republican party is more dangerous than Hizbollah."

Of course, I don't think he's really talking about post-docs in Chemical Engineering.

Expressiveness, revisited

2006-05-05T21:28:00-05:00

I previously mused about the expressiveness of Ruby compared to Java. Dion Stewart pointed me toward F-Script, an interpreted, Smalltalk-like scripting language for Mac OS X and Cocoa. In F-Script, invoking a method on every object in an array is built-in syntax. Assume that updates is an array containing objects that understand the preProcess and postProcess messages:

updates preProcess
updates postProcess

That's it. Iterating over the elements of the collection is automatic.

F-Script admits much more sophisticated array processing; multilevel iteration, row-major processing, column-major processing, inner products, outer products, "compression" and "reduction" operations. The most amazing thing is how natural the idioms look, thanks to their clean syntax and the dynamic nature of the language.

It reminds me of a remark about General Relativity, that economy of expression allowed vast truths to be stated in one simple, compact equation. It would, however, require fourteen years of study to understand the notation used to write the equation, and that one could spend a lifetime understanding the implications.

Technorati Tags: java, beyondjava, ruby, fscript

Inviting Disaster

2006-01-11T19:36:00-06:00

I'm reading a fabulous book called "Inviting Disaster", by James R. Chiles. He discusses hundreds of engineering and mechanical disasters. Most of them caused serious loss of life.

There are several common themes:

1. Enormously complex systems that react in sometimes unpredictable ways

2. Inadequate testing, training, or preparedness for failures -- particularly for multiple concurrent failures

3. A chain of events leading to the "system fracture". Usually exacerbated by human error

4. Politics or budget pressure causing otherwise responsible people to rush things out. This often involves whitewashing or pooh-poohing legitimate criticism and concern from experts involved.

The parallels to some projects I've worked on are kind of eerie. Particularly when he's talking about things like the DC-10 and the Hubble Space Telescope. In both of those cases, warning signs were visible during the construction and early testing, but because each of the people involved had tunnel vision limited to that person's silo, the clues got missed.

The scary part is that there is no solution here. Sometimes, you can't even place the blame very squarely. When half-a-dozen people were involved with unloading and handling of oxygen-generating cylinders on a ValuJet flight, no single individual really did something wrong (or contrary to procedure, anyway). Still, the net effect of their actions cost the lives of every single person on that flight.

It's grim stuff, but it ought to be required reading. If you ever leave your house again, you'll be much better prepared for building and operating complex systems.

New Interview Question

2005-12-26T22:53:00-06:00

So many frameworks... so much alphabet soup on the resumes.

Anyone that ever reads The Server Side or Monster.com knows exactly which boxes to hit when they're writing a resume. The recruiters telegraph their needs a mile away. (Usually because they couldn't care less about the differences or similarities between Struts, JSF, WebWork, etc.) As long as the candidate knows how to spell Spring and Hibernate, they'll get submitted to the "preferred vendor" system.

Being one of those candidates is tough, but that's not the part I'm concerned about now. I'm interested in weeding out the know-nothings, the poseurs, and the fast talkers.

When I'm interviewing somebody, my main criterion is this: would I want to work on a two-person project with this candidate? My secondary criterion is "Would I feel comfortable leaving this person alone at a client site? Will they deliver value to the client? Will they look like an idiot, and by extension, make me look like an idiot?"

My friend Dion Stewart had a great idea for a weed-out question. No matter what frameworks the candidate shows on the resume, ask them what they disliked the most about the framework. (I have my top three list for each framework I've worked in... except NeXT's Enterprise Objects Framework. But that's another story.)

If they can't answer at all, then they haven't actually worked with the framework. They're just playing buzzword bingo.

If they answer, but it sounds like bullshit, then odds are they're bullshitting you.

If they have never thought about it, haven't formed an opinion, or say "it's all good", then they lack passion about what they do.

A candidate that is driven, that cares about the quality-without-a-name should be able to go on a rant about something in each framework they've actually worked with. In fact, you've really hit the jackpot if your candidate can go on a rant, but does it in a professional, reasoned way. I love to see a candidate that can show some fire without seeming like a loon. That's when I can see how they'll react when the client makes a decision the candidate considers boneheaded. (I've seen some spectacular pyrotechnics from consultants that forgot whose money they're spending. But that's another story.)

Technorati Tags: resume, jobs

JAI 1.1.3 in beta

2005-12-22T11:46:00-06:00

I've been using JAI 1.1.2 for the past year. It's an incredibly powerful tool, though I will confess that the API is more than a bit quirky.

Early this year, Sun made JAI an open-source project available at java.net. That project has been working on the 1.1.3 release for most of the year. It's now in beta, with a few enhancements and a lot of bug fixes.

The most significant enhancement is that JAI can now be used with Java WebStart. Previously it had to be installed as a JRE extension.

Also, one of the big bugs is fixed. Issue #13 is fixed in the beta. It could cause the JPEG codec to use excessive amounts of memory when decoding large untiled images. (Which we do in our app a lot!)

Technorati Tags: java

Ruby expressiveness and repeating yourself

2005-12-10T19:57:00-06:00

Just this week, I was reminded again of how Java forces you to repeat yourself. I had an object that contains a sequence of "things to be processed". The sequence has to be traversed twice, once before an extended process runs and once afterwards.

The usual Java idiom looks like this:

public void preProcess(ActionContext context) {
  for (Iterator iter = updates.iterator(); iter.hasNext(); ) {
    TwoPhaseUpdate update = (TwoPhaseUpdate) iter.next();
    update.preProcess(context);
  }
}

public void postProcess(ActionContext context) {
  for (Iterator iter = updates.iterator(); iter.hasNext(); ) {
    TwoPhaseUpdate update = (TwoPhaseUpdate) iter.next();
    update.postProcess(context);
  }
}

Notice that there are only two symbols different between these two methods, out of 20 semantically significant symbols. According to the Pragmatic Programmers, even iterating over the collection counts as a kind of repetition (and therefore a violation of DRY - don't repeat yourself.)

The Ruby equivalent would be something like:

def preProcess(context)
   updates.each { |u| u.preProcess(context) }
end

def postProcess(context)
   updates.each { |u| u.postProcess(context) }
end

Now, there are two differing symbols out of 10 (20% variance instead of 10%). There's been no loss of expressiveness; in fact, the main intention of the code is clearer in the Ruby version than in the Java version.

Can we make the variance higher? Perhaps.

def preProcess(context)
   each_update(:preProcess, context)
end

def postProcess(context)
   each_update(:postProcess, context)
end

def each_update(method, context)
   updates.each { |u| u.send(method, context) }
end

Now the two primary methods have 2 symbols out of 7 different, or nearly 29%. The expressiveness is damaged a little bit by the dynamic dispatch via "send". It would be unthinkable to use reflection in Java to make the code clearer. (Anyone who's worked with reflection knows what I mean.) Here, it's not unthinkable, but it might just not help clarity.
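Reflection isn't the only route in Java, though. With lambdas and method references (Java 8 and later), the traversal can be factored out in much the same way as each_update. What follows is only a sketch: ActionContext and TwoPhaseUpdate are stand-ins for the real types, and the TwoPhaseRunner class and the log field are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

class TwoPhaseRunner {
    // Hypothetical stand-in for the application's real context type.
    static class ActionContext {
        final List<String> log = new ArrayList<>();
    }

    // Hypothetical stand-in for the application's real update type.
    interface TwoPhaseUpdate {
        void preProcess(ActionContext context);
        void postProcess(ActionContext context);
    }

    private final List<TwoPhaseUpdate> updates;

    TwoPhaseRunner(List<TwoPhaseUpdate> updates) {
        this.updates = updates;
    }

    void preProcess(ActionContext context) {
        eachUpdate(TwoPhaseUpdate::preProcess, context);
    }

    void postProcess(ActionContext context) {
        eachUpdate(TwoPhaseUpdate::postProcess, context);
    }

    // The traversal is written once; the method reference selects the phase.
    private void eachUpdate(BiConsumer<TwoPhaseUpdate, ActionContext> phase,
                            ActionContext context) {
        for (TwoPhaseUpdate update : updates) {
            phase.accept(update, context);
        }
    }
}
```

Unlike the Ruby "send" version, the method reference is checked by the compiler, so a misspelled phase name fails at build time rather than at run time.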

Technorati Tags: java, beyondjava, ruby

MySQL 5.0 Stored Procedures

2005-10-15T16:05:00-05:00

The MySQL 5.0 release is finally adding stored procedures, triggers, and views. This is a welcome addition. With the strong storage management features, clustering, and replication from the 4.x releases, MySQL now has all the capabilities of an "enterprise" database. (Of course, the lack of these features didn't stop thousands of users from deploying earlier versions in enterprises, even for "mission-critical" applications.)*

Here's a fairly trivial example:

create procedure count_table_rows ()  reads sql data begin
      select table_name, table_rows from information_schema.tables;
end

* Sometime, I have to post about the perversions of language perpetrated by people in business. "Mission-critical" means "without this, the mission will fail." What percentage of applications labelled as mission-critical would actually cause the company to fail? Most of the time, the "mission-critical" label really just means "this application's sponsor has large political clout".

Technorati Tags: mysql

The dumbest thing I've seen today

2005-10-06T11:19:00-05:00

I generally like Swing, but I just found something in the Metal L&F for JSlider that strikes me as a big WTF. The BasicSliderUI allows you to click in the "track" of the slider to scroll by a block. That's either 10% of the span of the slider, or a minimum of 1 unit. The MetalSliderUI overrides that sensible behavior with a method that scrolls by just one unit. Period.

Here's a quick fix:

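import javax.swing.JSlider;
import javax.swing.plaf.metal.MetalSliderUI;

JSlider slider = new JSlider();
slider.setUI(new MetalSliderUI() {
  // Restore the block-scrolling behavior that BasicSliderUI provides.
  @Override
  protected void scrollDueToClickInTrack(int dir) {
    scrollByBlock(dir);
  }
});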

Technorati Tags: java, swing

Programmer productivity measurements don't work.

2005-09-08T21:36:00-05:00

Programmer productivity measurements don't work.

The most common metric was discredited decades ago, but continues to be used: KLOC. Only slightly better is function points. At least it's tied to some deliverable value. Still, the best function point is the one you don't have to develop. Likewise, the best line of code is the one you don't need to write. In fact, sometimes my most productive days are the ones in which I delete the most code. Why are these metrics so misleading?

Because they are counting inventory as an asset. Lines of code are inventory. Function points are inventory. Any metric that only measures the rate of inventory production is fatally flawed. We need metrics that measure throughput instead.

Technorati Tags: lean, agile

More Beanshell Goodness

2005-06-11T22:15:00-05:00

Thanks to the clean layered architecture in our application, we've got a very clear interface between the user interface (just Swing widgets) and the "UI Model". In the canonical MVC mode, our UI Model is part controller and part model. It isn't the domain model, however. It's a model of the user interface. It has concepts like "form" and "command". A "form" is mainly a collection of property objects that are named and typed. The UI interacts with the rest of the application by binding to the properties.

The upshot is that anything the UI can do by setting and getting properties (including executing commands via CommandProperty objects) can be done through test fixtures or automated interfaces. Enter beanshell.
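To make the shape of that idea concrete, here is a stripped-down sketch of a form as a collection of named, typed properties. It is not the application's actual code: Form, Property, define, and property are names invented for illustration, and command properties are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical UI Model sketch: a form is a bag of named, typed properties.
class Form {
    // A single value that rejects assignments of the wrong type.
    static class Property<T> {
        private final Class<T> type;
        private T value;

        Property(Class<T> type) {
            this.type = type;
        }

        T get() {
            return value;
        }

        // Takes Object so scripts and fixtures can set values without generics;
        // Class.cast throws ClassCastException on a type mismatch.
        void set(Object newValue) {
            this.value = type.cast(newValue);
        }
    }

    private final Map<String, Property<?>> properties = new LinkedHashMap<>();

    // Registers a property and returns the form so definitions can be chained.
    <T> Form define(String name, Class<T> type) {
        properties.put(name, new Property<T>(type));
        return this;
    }

    // Looks a property up by name, the same way a widget or a script would.
    Property<?> property(String name) {
        Property<?> found = properties.get(name);
        if (found == null) {
            throw new IllegalArgumentException("No such property: " + name);
        }
        return found;
    }
}
```

A Swing widget and a test script would both go through the same door, for example form.property("customerName").set("Alice"), which is what makes the UI optional for automation.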

After integrating beanshell, all of our forms and properties were immediately available. Today, I worked with one of my teammates to build a beanshell script to drive through the application. It creates a customer and goes through the entire workflow. Run the script a million times or so, and you've got a great pile of test data. Schema changes? Domain model changes? No problem. Just re-run the script (and wait an hour or so) and you've got updated test data.

Technorati Tags: java

Smalltalk style prototyping for Java?

2005-05-29T13:15:00-05:00

I've been eyeing Beanshell for some time now. It's a very straightforward scripting language for Java. Its syntax is about what you would expect if I said, "Java with optional types and no need to pre-declare variables." So, a Java programmer probably needs all of about thirty seconds to understand the language.

What I didn't expect was how quickly I could integrate it into my applications. Here's an example. I've got a Swing desktop application for which I wanted to add a small command shell pane. I spent about an hour working with Swing's JTextArea and rolling my own parser. It was late at night and I was short on bright ideas. Finally, about the time I realized I was going to need variables and flow control, I pulled the emergency stop cord and backed up.

After downloading the full beanshell JAR file (about 280K) and adding it to my build path, all I had to do was this:

  
JConsole console = new JConsole();
frame.getContentPane().add(new JScrollPane(console), BorderLayout.SOUTH);
Interpreter interpreter = new Interpreter(console);
new Thread(interpreter).start();

Those four lines of code are literally all that is needed to get a beanshell prompt running inside of your application. From the prompt, you have access to every class and method within your application. If you've got some kind of object registry or namespace, or any kind of Singletons, then you'll also have access to real object instances.

Beanshell has a built-in command called "desktop()". That one line will launch a Smalltalk-style IDE, with class browser and interpreter window. This desktop is still part of your application's JVM. It lacks most of the power of Smalltalk's library, which evolved together with the workspace to support highly dynamic programming. Nevertheless, the beanshell desktop retains the immediacy of working in Smalltalk.

References:

Beanshell
Squeak - a Free, modern Smalltalk
Java Object Inspector - A simple Swing-based inspector, can be launched on any Java object from a Beanshell prompt. A great complement to Beanshell

Technorati Tags: agile, java

One of the most fun features of my current project

2005-04-12T00:32:00-05:00

One of the most fun features of my current project is our "extreme feedback monitor". We're using CruiseControl to build our entire codebase, including unit tests, acceptance tests, and quality metrics, every five minutes. To make a broken build painfully obvious, we've got a stoplight hanging on one wall of the room. (I may post some pictures later, if there's interest.)

Kyle Larson found the stoplight itself in a gift shop (Spencer's, maybe, I can't remember... Kyle, help me out here). It had just one plug but you could push each light as a separate switch.

Well, it looks pretty dumb to walk over and push on the red light to show a broken build. It's not pragmatic and it's not automated. So, Kyle rewired it with two additional cords, so each lamp has its own plug.

I plugged each lamp into an X10 lamp module so each color could be turned on and off individually. I hooked a "FireCracker" wireless transmitter up to the serial port on the build box. With one switched receiver and two lamp modules, we were ready to go.

CruiseControl supports a publisher that is supposed to integrate directly with X10 devices over the serial port. Unfortunately, the installation and setup for Java programs to work with X10 devices on Linux is... problematic. First off, the JavaComm API appears to be totally stagnant. It does not support Linux at all, so you have to install the Solaris SPARC version, but supply an open-source Linux implementation of the API (www.rxtx.org), replacing a .properties file. Then you have to make sure that the user running your build loop is a member of the "tty" group. Then just cross your fingers.

I got all of the above to work from my Java test apps, but the X10 publisher built into CC still couldn't open the serial port.

I finally gave up on the built-in publisher. I used wget, BottleRocket, and a shell script to check the build status web page every 30 seconds and change the lights accordingly.

Now, within a minute of a broken build, we can all see it. When the light is green, the build is clean.

If the red light means "broken build", and the green light means "good build", you might wonder what we use yellow for.

Yellow means that someone is in the process of synchronizing and committing code. Along with the FireCracker module, we also got a remote control. That normally sits in the middle of the tables in the lab. Whenever a pair needs to check in code, they grab the remote (i.e., take the semaphore) and turn on the yellow light. As an added "feature", the wireless switched receiver is the only module that makes an audible "click" when it switches. We use that one to control the yellow lamp, so we also have an auditory cue when a pair starts their commit dance.

After committing, the pair turns off the yellow light and replaces the remote, thus putting the semaphore and allowing the next pair to commit. In the event of multiple blocked pairs, FIFO behavior is not guaranteed. Semaphore holders have been known to be susceptible to flattery and bribery.
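For the curious, the remote behaves like a binary semaphore with no fairness guarantee. The same protocol can be sketched with java.util.concurrent.Semaphore. The CommitToken class and its method names are hypothetical, made up for this illustration.

```java
import java.util.concurrent.Semaphore;

class CommitToken {
    // One permit: only one pair may hold the remote at a time.
    // fair = false mirrors the lab: blocked pairs are not served in FIFO order.
    private final Semaphore remote = new Semaphore(1, false);
    private volatile boolean yellowLightOn = false;

    // Grab the remote and turn on the yellow light.
    // Blocks while another pair is holding the remote.
    void beginCommit() {
        remote.acquireUninterruptibly();
        yellowLightOn = true;
    }

    // Non-blocking variant, for a pair that would rather keep coding than wait.
    boolean tryBeginCommit() {
        if (remote.tryAcquire()) {
            yellowLightOn = true;
            return true;
        }
        return false;
    }

    // Turn off the yellow light and put the remote back on the table.
    // Must only be called by the pair that currently holds the remote.
    void endCommit() {
        yellowLightOn = false;
        remote.release();
    }

    boolean isYellowLightOn() {
        return yellowLightOn;
    }
}
```

Passing true as the second constructor argument would ask for FIFO ordering, which is more than our lab ever promised.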

Technorati Tags: agile, automation, CruiseControl, pragmatic

I forgot to mention that I will be speaking at OTUG

2005-03-22T23:08:00-06:00

I forgot to mention that I will be speaking at OTUG on April 19th! I will be speaking on "Living With Systems in Production: Avoiding Heartbreak in Long-Term Relationships With Your Code".

From the summary of the talk:

Everything changes after Release 1.0. One batch of consultants leaves, key developers jockey to get themselves reassigned, and the free-wheeling development environment is replaced by the painful rigor of operations. Or, at least, it should be. Systems in production require a different kind of care and feeding. If you have to live with a system in production, your quality of life is largely determined by the things you put in place before Release 1.0. This talk covers the topics that will give you God-like powers over your production systems. If you are an architect or developer who has ever put a system in production--or expects to put a system into production--then this talk is for you.

Much of this will be derived from my experiences at Totality and Best Buy. Spending time in operations gave me a great education about building systems to run, instead of building them to pass QA.

Technorati Tags: operations, OTUG, speaking

Leaving AntHill for CruiseControl

2005-03-22T22:55:00-06:00

We've been using AntHill to do continuous builds. It has served us well, but we're now moving away from it and towards CruiseControl.

There are a few main reasons for this. First and foremost, AntHill runs inside of Tomcat. This is billed as a feature, but for us, it was a big problem. There are two layers of Java containers between your OS and your build. Trying to get environment variables (like "DISPLAY=localhost:99.0") passed from an init script, to Tomcat, to AntHill, to ANT, was just becoming too burdensome.

We also experienced some serious classpath pollution. Some things just acted differently between ANT builds on our development boxes and ANT builds on the build box. That's unacceptable, but we found that with AntHill it was impossible to eliminate the differences. Finally, through some jar file unpacking and decompilation, we found that our builds were picking up classes from AntHill's own jars.

The ability to fix these things exists, but only in AntHill Pro. I downloaded CruiseControl today and spent an hour going through the quick start and FAQ. At the end of it, I had our build process replicated on CruiseControl.

I did run into a problem... the checkstyle task that we have been running as part of our build all along started failing. I assumed that it was something wrong with the build box, or with my project configuration for CruiseControl. After half an hour or so, I ran the same build on a dev box, but from the command line. It failed there, too. It turns out that checkstyle includes a version of the Jakarta commons-collections classes that is not compatible with the Jakarta digester version that we've added to our code base.

This problem existed all along. Running the build under CruiseControl was enough like running it from the command line that it uncovered a problem which had been present for over two weeks. For some reason, running under AntHill never revealed this problem.

Bottom line is, a CI server needs to be as close to running a command-line build as possible. If I have to spend time figuring out what environmental conditions the CI tool is imposing on my build, then it is defeating the purpose.

Now, I just have to figure out why in the hell checkstyle's classpath is leaking into the classpath of the code it is checking.
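One way to chase this kind of leak is to ask the JVM where a suspect class actually came from. Here is a minimal sketch using the standard ProtectionDomain and CodeSource API. The WhichJar class name and the marker string are made up for illustration.

```java
import java.security.CodeSource;

class WhichJar {
    // Returns the jar or directory a class was loaded from, or a marker
    // string when the JVM reports no code source (typical for core classes).
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    // Pass fully qualified class names on the command line.
    public static void main(String[] args) throws ClassNotFoundException {
        for (String name : args) {
            System.out.println(name + " <- " + locationOf(Class.forName(name)));
        }
    }
}
```

Run with the same classpath as the failing build and give it the name of a class you suspect is duplicated. If the reported location is not the jar you expected, that points at the culprit.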

Technorati Tags: agile, automation, CruiseControl, pragmatic

The Veteran and the Master

2005-02-05T22:24:00-06:00

The aged veteran said to the master, "See how many programs I have written in my labors. All of these works I have created needed no more than a text editor and a compiler." The master said, "I do have an editor; indeed, I have also a compiler."

Said the aged one, "Yet you shackle them within an 'environment'. Why must your environment be integrated? My environment has never been integrated, yet I am a mighty programmer."

The master said, "You are truly a mighty programmer. I perceive that you, in your keen intellect, can hold entire class hierarchies in mind at once. Such abilities of apprehension are to be respected."

The veteran was well pleased and said, "It is true. Hence I am lead programmer."

The master nodded. "Sadly, I have not your powers of visualization. I cannot hold entire hierarchies in my mind's eye at once. In my limited faculties, I must focus entirely on one class at a time. The tool remembers the rest, as I cannot."

Emboldened, the aged veteran boasted, "See the commands fly from my fingertips! I type faster than other programmers think!"

Again, the master nodded his agreement, "I am not so blessed with speed as you. It is a burden and a trial to move so slowly. Behold, this measure of the marvel of your fingers. Such is the flight of your keystrokes that in the time it takes you to execute a regex replace across thirty files; compile the project; note the errors; and edit the twelve files with failed replacements; I will have barely completed the 'rename refactor' which I started by typing shift-alt-r."

Brazen in his opponent's weakness, the veteran cried, "While you sit meditating at the green bar, I pound out another four thousand lines of code!"

Again, the master nodded, "Yes. And worse, while you write the next thousand, I will surely erase a thousand more, leaving us barely past where we began. It is clear that I cannot long contend in this field against such as yourself."

The battle-scarred veteran, his opponent beaten, laughed aloud. Barely bothering to express his contempt, he sneered, "And what fine code it is, too! You write a fraction of the code a real programmer could produce. As a coward in the grain, you shrink from any real challenge. Fearing to tread where real programmers dwell, you trade in coin like a merchant, purchasing the work of others, or worse, living on the charity of those motley-clad coders who give away the fruits of their work."

"Again, your perspicacity has unmasked me," said the master. "Knowing myself to produce bugs in my code, I prefer to write little of it. I do rely upon the work of others who, if not being smarter than myself, are at least more numerous than I. Had I your fleet fingers, I might not need to download these gifts offered by others. Indeed, I am certain that your mighty editor would surely outpace my mere web browser, and you could then code a new SVG renderer long before I will finish downloading Batik to do the same work. Alas, lacking your skills, I must fend for myself as best I can by reusing that which I can. Since each line of code costs me so greatly, it behooves me to write little, and I must needs make use of what aids I can."

Shaking his head, the aged veteran stalked away, safely assured that he had gauged the so-called master truly. He returned to his labors, building a parser for the scripting language of his workflow engine. This would be placed inside of an application that would someday have users.

Shaking his head, the master returned his eye to the red bar of his users' new acceptance tests. Reaching deliberately for the keyboard, he changed two methods and added one test case. In the serene green light of the test bar, he reflected a moment on the code he had added. Unruffled by the staccato typing in the direction of the veteran, he renamed four fields, extracted a method, and pulled it up into a new base class. Comforted by the tranquil green light, the master rested his hands a moment, then lifted them from the keyboard and walked away.

From the corner of his eye, the veteran observed the master leaving. "Charlatan," he snarled, as the regexes flew from his hands, long, long into the night.

Technorati Tags: agile, lean, pragmatic

On Relativism and Social Constructions

2005-01-17T18:51:00-06:00

The key operative precept of post-modernism is that all reality is a social construct. Since no institution or normative behavior stems from natural cause, and there is no objective, external reality, then all institutions and attitudes are just social constructs. They exist only through the agreement of the participants.

Nothing can be sacred, since sanctification comes from outside, by definition.

If nothing is sacred, and institutions have no more reality than a children's amorphous game of ball, they deduce that any construct can be reconstructed through willful choice.

Even if you accept the precept that there is no objective, external (let alone universal) value system, you can still see the fundamental fallacy in this thinking.

Anyone who has ever tried to bring change into a hidebound organization knows that social constructs are far harder to change than any physical or legal structure. You can reorganize units, bring locations together, shuffle management, or get rid of half of the people. Still, underlying social organization will re-emerge as long as there is any vestige of continuity.

Much of the heat energy in the ongoing culture war arises from this inertia. Those who are so tiresomely labelled as "liberal", "progressive", the "Left", the "Cultural Elite", etc. represent a large force of people aimed at deliberately reconstructing every institution in Western life. They have decided, based on their own feelings, bereft of natural or religious law, that any institution observed by men for more than one hundred years cannot be endured. They are organized around the post-modern paradigm--armed with Hayakawa and Chomsky--and don't accept that some hidebound Neanderthals will not welcome forceful re-education.

I suppose that I follow a third way. I can agree that our institutions are social constructs. That does not mean that they can, or should be, tampered with lightly. The concept of "natural law" teaches that certain modes of behavior, certain morals, generate a more successful society. Our social institutions--like marriage--have undergone the same forces of competitive pressures and differential reproduction that drive neo-Darwinian evolution. That means the institutions we observe today--such as preserving the integrity of personal property--are the ones that worked.

There is an argument to be made that I'm advocating cultural imperialism. It could perhaps be seen that way, though such is not my intent. Rather, just as we should justifiably be wary of changing our own genetic code, we should be wary of making large changes to our social institutions. We do not know what will result. There are many paths down the mountain, but only one upward. Most random mutations result in death. Even well-planned changes have unintended, sometimes catastrophic, effects.

An IKEA Weekend

2005-01-11T20:34:00-06:00

I've been building a new office in my downstairs space for quite a while now. It's a "weekends" project for someone who doesn't have very many weekends. In early December, I broke down and hired a contractor to install the laminate ("cardboard") flooring, which was the penultimate step in the master plan.

Last comes furniture, then moving in. (Which starts the chain of dominoes, as my eldest gets the bedroom which used to be my office, then my youngest takes her spot, which makes room for the new baby. The challenge is to finish with the hole migration before the new electron gets injected. No, that wasn't a spelling error.)

So this weekend, I had thirty-six boxes of IKEA modular furniture from "Work IKEA" to assemble.

You have time to meditate on many lessons when you are assembling thirty-six boxes of IKEA modular furniture.

For example, I've never seen a company that makes it so difficult to purchase from them. I don't really want to know that the six-shelf bookshelf I picked out from the design software actually comes as three separate SKUs. Just sell me the damn shelf.

I shouldn't have to learn what a "CDO" is in order to pick out a bunch of stuff and have them deliver it on a specific day. I shouldn't have to make three trips into the store because they cannot take my credit card number over the phone.

And can someone please explain why I have to remove items from my delivery order because the local store doesn't have them in stock? In some fields of endeavor, timing is everything, but why should I have to call them every day to find out when the left-handed tabletop comes in, then rush to the store and place my order so the piece can be pulled from inventory?

It makes no sense to me. The whole process was implemented for the convenience of IKEA, not IKEA's customers. They've made a business decision to optimize for cost control rather than customer satisfaction. IKEA is certainly free to make that choice, and they do seem to be making profits, but I'm not likely to choose them for future furniture purchases.

Exposing that much of your internal process to the customer--or end user--is never a good way to win the hearts and minds of your customers.

Most of the assembly went without incident, though I was often perplexed by trying to map the low-level components into the high-level items I designed with. IKEA offers zero-cost software for download to design a floorplan with their lines, but it works at a higher level of abstraction. I was often left wondering which item a particular component was supposed to construct.

The components were very well designed. Each piece can either fit together in only one way, or it is rotationally symmetric so either orientation works. In either case, I, the assembler, am not left with an ambiguous situation, where something might fit but does not work.

The toughest pieces were the desks. Desks can be configured in about eighty-nine different ways. The components are all modular and generally have the same interfaces. I have a lot of flexibility at my disposal, but at the expense of complexity. A significant number of sample configurations helped me understand the complexity of options and pick a reasonable structure, but I can't help but wonder how the experience could be simplified.

The furniture is all assembled now, and the office sits expectantly waiting for its occupant, full of unrealized potential.

Uniting Reason and Passion

2004-12-12T15:14:00-06:00

Reason and Passion need not conflict. Reason without passion is dusty, dry, and dead. Reason without passion leads to moral relativity. If nothing moves the thinker to passion, then all subjects are equal and without distinction. As well to discuss the economic benefits of the euthanasia of infants as the artistic merits of urinals.

Passion without reason brings the indiscriminate energy of a summer's thunderstorm. Too much energy unbound, without direction, its fury as constant as the winds of the air.

Passion provides energy, the drive to accomplish, change, improve, or destroy. Reason provides direction. Reason channels Passion and achieves goals by identifying targets, foci, leverage points. Passion powers Reason. It brings motive power. Passion knows that things must be done and that change is possible. Reason knows how change may be effected.

I was reminded of the fallacy of Passion without Reason recently. At lunch with a friend, she talked about working with a non-profit organization. Workers for non-profits epitomize those who are driven by Passion. Agree or disagree with their aims, you must admit that they earnestly mean to change the world. My friend, who comes from the profit-driven corporate world, was explaining some aspects of statistical process control and how it could be applied to improve fundraising results on their website. She was told that she needed to have more heart and feel for those unfortunates that this group helps.

Her critic obviously felt that her approach was too analytical. Too driven by Reason, not enough Passion. In fact, the opposite was true. She was applying the combination of Reason and Passion. Passion showed her that the cause was worthy and that she could help. Reason showed her where leverage could be gained and a small effort input could result in a large change in output.

In various dysfunctional organizations which I have inhabited, I've seen many examples of the opposite. Reason reveals problems and solutions to those poor sapient cogs in the low levels of the machine. They lack the Passion to see that change is possible and so divest themselves of the power to improve their own lot in life. Problems or challenges will always overcome such people, because they give the problem power and remove it from themselves.

More Wiki

2004-12-10T11:47:00-06:00

My personal favorite is TWiki. It has some nice features like file attachments, a great search interface, high configurability, and a rich set of available plugins (including an XP tracker plugin).

One cool thing about TWiki: configuration settings are accomplished through text on particular topics. For example, each "web" (set of interrelated topics) has a topic called "WebPreferences". The text on the WebPreferences topics actually controls the variables. Likewise, if you want to set personal preferences, you set them as variables--in text--on your personal topic. It's a lot harder to describe than it is to use.

There are some other nice features like role-based access control (each topic can have a variable that says which users or groups can modify the topic), multiple "webs", and so on.

The search interface is available as variable interpolation on a topic, so something like the "recent changes" topic just ends up being a date-ordered search of changes, limited to ten topics. This means that you can build dynamic views based on content, metadata, attachments, or form values. I once put a search variable on my home topic that would show me any task I was assigned to work on or review.

I've also been looking at Oahu Wiki. It's an open source Java wiki. It's fairly short on features at this point, but it has by far the cleanest design I've seen yet. I look forward to seeing more from this project.

Wiki Proliferation

2004-12-10T11:14:00-06:00

Wikis have been thoroughly mainstreamed now. You know how I can tell? Spammers are targeting them.

Any wiki without access control is going to get steamrolled by a bunch of Russian computers that are editing wiki pages. They replace all the legitimate content with links to porn sites, warez, viagra, get rich now, and the usual panoply of digital plaque.

The purpose does not appear to be driving traffic directly to those sites from the wikis. Instead, they are trying to pollute Google's page rankings by creating thousands upon thousands of additional inbound links.

If you run a wiki, be sure to enable access control and versioning (so you can recover after an attack). It is a shame that the open, freewheeling environment of the wiki has to end. It seems that the only way to preserve the value of the community is to weaken the core value of open participation that made the community worthwhile.

Moving on

2004-12-07T22:19:00-06:00

The latest in my not-exactly-daily news and commentary...

As of December 10th, I will be leaving Totality Corporation. It has been a challenge and an education. It has also been an interesting time, as we uncovered the hidden linkages from daily activities to ultimate profitability. The managed service provider space is still new enough that the business models are not all so well-defined and understood as in consulting. I earnestly hope that I am leaving Totality in a much better place than it was when I joined.

Still, a number of positive attractions to the new position and some negative forces away from my current position have overcome inertia.

I will be joining Advanced Technologies Integration as a consultant. I will be forming a team with Kyle Larson, Dale Schumacher, and Dion Stewart to do a development project for one of ATI's clients. The project itself has some moderately interesting requirements... it's not just another random commerce site. (I'm really, really bored with shopping carts!) The thing that really attracted me though, is that this is a hardcore agile methods project. We'll be using a combination of Scrum and XP.

For a long time, I've advocated small teams of highly skilled developers. I have seen such teams produce many times the business value (and ROI) of the typical team. ATI and this client are willing to subscribe to the theory that a small, high-caliber team will outperform an army of cheap morons.

It's going to be a blast proving them right!

Too Much Abstraction

2004-04-25T13:09:00-05:00

The more I deal with infrastructure architecture, the more I think that somewhere along the way, we have overspecialized. There are too many architects that have never lived with a system in production, or spent time on an operations team. Likewise, there are a lot of operations people that insulate themselves from the specification and development of systems for which they will ultimately take responsibility.

The net result is suboptimization in the hardware/software fit. As a result, overall availability of the application suffers.

Here's a recent example.

First, we're trying to address the general issue of flowing data from production back into pre-production systems -- QA, production support, development, staging. The first attempt took 6 days to complete. Since the requirements of the QA environment stipulate that the data should be no more than one week out of date relative to production, that's a big problem. On further investigation, it appears that the DBA who was executing this process spent most of the time doing scps from one host to another. It's a lot of data, so in one respect 10 hour copies are reasonable.

But the DBA had never been told about the storage architecture. That's the domain of a separate "enterprise service" group. They are fairly protective of their domain and do not often allow their architecture documents to be distributed. They want to reserve the right to change them at will. Now, they will be quite helpful if you approach them with a storage problem, but the trick is knowing when you have a storage problem on your hands.

You see, the servers that the DBA was copying files from and to are all on the same SAN. An scp from one host on the SAN to another host on the SAN is pretty redundant.

There's an alternative solution that involves a few simple steps: Take a database snapshot onto a set of disks with mirrors, split the mirrors, and join them onto another set of mirrors, then do an RMAN "recovery" from that snapshot into the target database. Total execution time is about 4 hours.

From six days to four hours, just by restating the problem to the right people.

This is not intended to criticize any of the individuals involved. Far from it, they are all top-notch professionals. But the solution required merging the domains of knowledge from these two groups -- and the organizational structure explicitly discouraged that merging.

Another recent example.

One of my favorite conferences is the Colorado Software Summit. It's a very small, intensely technical crowd. I sometimes think half the participants are also speakers. There's a year-round mailing list for people who are interested in, or have been to, the Summit. These are very skilled and talented people. This is easily the top 1% of the software development field.

Even there, I occasionally see questions about how to handle things like transparent database connection failover. I'll admit that's not exactly a journeyman topic. Bring it up at a party and you'll have plenty of open space to move around in. What surprised me is that there are some fairly standard infrastructure patterns for enabling database connection failover that weren't known to people with decades of experience in the field. (E.g., cluster software reassigns ownership of a virtual IP address to one node or the other, with all applications using the virtual IP address for connections).

This tells me that we've overspecialized, or at least, that the groups are not talking nearly enough. I don't think it's possible to be an expert in high availability, infrastructure architecture, enterprise data management, storage solutions, OOA/D, web design, and network architecture. Somehow, we need to find an effective way to create joint solutions, so we don't have software being developed that's completely ignorant of its deployment architecture, nor should we have infrastructure investments that are not capable of being used by the software. We need closer ties between operations, architecture, and development.

The Lights Are On, Is Anybody Home?

2003-05-01T12:18:56-05:00

We pay a lot of attention to stakeholders when we create systems. The end users get a say, as do the Gold Owners. Analysts put their imprimatur on the requirements. In better cases, operations and administration add their own spin. It seems like the only group that doesn't have any input during requirements gathering is the development team itself. That is truly unfortunate.

Not even the users will have to live with the system more than the developers will. Developers literally inhabit the system for most of their waking hours, just as much as (or maybe more than) they inhabit their cubes or offices. When the code is messy, nobody suffers more than the developers. When living in the system becomes unpleasant, morale will suffer. Any time you hear a developer ask for a few weeks of "cleanup" after a release, what they are really saying is, "This room is a terrible mess. We need to remodel."

A code review is just like an episode of "Trading Spaces". Developers get to trade problems for a while, to see if somebody else can see possibilities in their dwelling. Rip out that clunky old design that doesn't work any more! Hang some fabric on the walls and change the lighting.

Whether your virtual working environment becomes a cozy place, a model of efficiency, or a cold, drab prison, you create your own living space. It is worth taking some care to create a place you enjoy inhabiting. You will spend a lot of time there before the job is done.

Don't Build Systems That Boink

2003-04-01T16:00:04-06:00

Note: This piece originally appeared in the "Marbles Monthly" newsletter in April 2003

I caught an incredibly entertaining special on The Learning Channel last week. A bunch of academics decided that they were going to build an authentic Roman-style catapult, based on some ancient descriptions. They had great plans, engineering expertise, and some really dedicated and creative builders. The plan was to hurl a 57 pound stone 400 yards, with a machine that weighed 30 tons. It was amazing to see the builders' faces swing between hope and fear. The excitement mingled with apprehension.

At one point, the head carpenter said that it would be wonderful to see it work, but "I'm fairly certain it's going to boink." I immediately knew what he meant. "Boink" sums up all the myriad ways this massive device could go horribly wrong and wreak havoc upon them all. It could fall over on somebody. It could break, releasing all that kinetic energy in the wrong direction, or in every direction. The ball could fly off backwards. The rope might relax so much that it just did nothing. One of the throwing arms could break. They could both break. In other words, it could do anything other than what it was intended to do.

That sounds pretty familiar. I see the same expressions on my teammates' faces every day. This enormous project we're slaving on could fall over and crush us all into jelly. It could consume our hours, our minds, and our every waking hour. Worst case, it might cost us our families, our health, our passion. It could embarrass the company, or cost it tons of money. In fact, just about the most benign thing it could do is nothing.

So how do you make a system that doesn't boink? It is hard enough just making the system do what it is supposed to. The good news is that some simple "do's and don'ts" will take us a long way toward non-boinkage.

Automation is Your Friend #1: Run lots of tests -- and run them all the time

Automated unit tests and automated functional tests will guarantee that you don't backslide. They provide concrete evidence of your functionality, and they force you to keep your code integrated.
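As a rough illustration of what "automated" means here, the following sketch uses Python's standard unittest module. The function under test, parse_port, is hypothetical and stands in for whatever your system actually does. The point is only that the checks run without a human in the loop, so a build script can run them on every build.

```python
import unittest


def parse_port(text):
    """Hypothetical function under test: parse a TCP port number from text."""
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTest(unittest.TestCase):
    def test_accepts_valid_port(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_rejects_out_of_range_port(self):
        with self.assertRaises(ValueError):
            parse_port("70000")


def run_tests():
    """Run the suite programmatically, as a build script might on every build."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePortTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A build can then fail itself whenever run_tests().wasSuccessful() comes back false, which is what keeps you from backsliding.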

Automation is Your Friend #2: Be fanatic about build and deployment processes

A reliable, fully automated build process will prevent headaches and heartbreaks. A bad process--or a manual process--will introduce errors and make it harder to deliver on an iterative cycle.

Start with a fully automated build script on day one. Start planning your first production-class deployment right away, and execute a deployment within the first three weeks. A build machine (it can be a workstation) should create a complete, installable numbered package. That same package should be delivered into each environment. That way, you can be absolutely certain that QA gets exactly the same build that went into integration testing.

Avoid the temptation to check out the source code to each environment. An unbelievable amount of downtime can be traced to a version label being changed between when the QA build and the production build got done.

Everything In Its Place

Keep things separated that change at different speeds. Log files change very fast, so isolate them. Data changes a little less quickly but is still dynamic. "Content" changes slower yet, but is still faster than code. Configuration settings usually come somewhere between code and content. Each of these things should go in its own location, isolated and protected from the others.

Be transparent

Log everything interesting that happens. Log every exception or warning. Log the start and end of long-running tasks. Always make sure your logs include a timestamp!

Be sure to make the location of your log files configurable. It's not usually a good idea to keep log files in the same filesystem as your code or data. Filling up a filesystem with logs should not bring your system down.
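Here is one way this advice might look in practice, sketched with Python's standard logging module. The environment variable APP_LOG_DIR and the helper names are assumptions made up for this example, not part of any real system. The format string puts a timestamp on every line, and the wrapper logs the start, end, and any failure of a long-running task.

```python
import logging
import os


def make_logger(name="app"):
    """Build a logger whose file location comes from the environment.

    APP_LOG_DIR is a hypothetical setting; it defaults to the current directory.
    """
    log_dir = os.environ.get("APP_LOG_DIR", ".")
    os.makedirs(log_dir, exist_ok=True)
    handler = logging.FileHandler(os.path.join(log_dir, f"{name}.log"))
    # %(asctime)s puts a timestamp on every line.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


def run_long_task(logger, task_name, task):
    """Log the start and end of a long-running task, plus any exception."""
    logger.info("start %s", task_name)
    try:
        return task()
    except Exception:
        logger.exception("failed %s", task_name)
        raise
    finally:
        logger.info("end %s", task_name)
```

Pointing APP_LOG_DIR at a separate filesystem is then an operations decision, not a code change. Call make_logger once per name at startup; calling it repeatedly would attach duplicate handlers.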

Keep your configuration out of your code

It is always a good idea to separate metadata from code. This includes settings like host names, port numbers, database URLs and passwords, and external integrations.

A good configuration plan will allow your system to exist in different environments -- QA versus production, for example. It should also allow for clustered or replicated installations.
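A minimal sketch of the idea, assuming settings arrive through environment variables. The variable names (DB_HOST, DB_PORT, DB_PASSWORD) and the Settings class are hypothetical; a properties file or a directory service would serve just as well. What matters is that the same code runs unchanged in QA and in production.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_host: str
    db_port: int
    db_password: str


def load_settings(env=os.environ):
    """Read settings from the environment instead of hard-coding them.

    The variable names (DB_HOST, DB_PORT, DB_PASSWORD) are hypothetical.
    """
    return Settings(
        db_host=env.get("DB_HOST", "localhost"),
        db_port=int(env.get("DB_PORT", "5432")),
        db_password=env["DB_PASSWORD"],  # no default: fail loudly if missing
    )
```

Passing a plain dictionary, as in load_settings({"DB_HOST": "qa-db", "DB_PASSWORD": "..."}), makes it easy to exercise each environment's settings in a test.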

Keep your code and your data separated

The object-oriented approach is a good way to build software, but it's a lousy way to deploy systems. Code changes at a different frequency than data. Keep them separated. For example, in a web system, it should be easy to deploy a new code drop without disrupting the content of the site. Likewise, new content should not affect the code.

Plugging the Marbles Newsletter

2003-03-24T21:34:00-06:00

Not too much going on here lately. Most of my waking hours have been billable for the past few months. That's good and bad, in so many different ways.

Most of my recent writing has been for the Marbles, Inc. monthly newsletter.

Dec 2006 Edit: Marbles IT has not been a going concern for some time. My articles for the Marbles Monthly newsletter are now available under the Marbles category of this blog.

Multiplier Effects

2003-02-01T10:53:11-06:00

Here's one way to think about the ethics of software, in terms of multipliers. Think back to the last major email virus, or when the movie "The Two Towers" was released. No doubt, you heard or read a story about how much lost productivity this bane would cause. There is always some analyst willing to publish some outrageous estimate of damages due to these intrusions into the work life. I remember hearing about the millions of dollars supposedly lost to the economy when Star Wars Episode I was released.

(By the way, I have to take a minute to disassemble this kind of analysis. Stick with me, this won't take long.

If you take 1.5 seconds to delete the virus, it costs nothing. It's an absolutely immeasurable impact to your day. It won't even affect your productivity. You will probably spend more time than that discussing sports scores, going to the bathroom, chatting with a client, or any of the hundreds of other things human beings do during a day. It's literally lost in the noise. Nevertheless, some analyst who likes big numbers will take that 1.5 seconds and multiply it by the millions of other users and their 1.5 seconds, then multiply that by the "national average salary" or some such number.

So, even though it takes you longer to blow your nose than to delete the virus email, somehow it still ends up "costing the economy" 5x10^6 USD in "lost productivity". The underlying assumptions here are so flawed that the result cannot be taken seriously. Nevertheless, this kind of analysis will be dragged out every time there's a news story--or better yet, a trial--about an email worm.)

The real moral of this story isn't about innumeracy in the press, or spotlight seekers exploiting said innumeracy. It's about multipliers, and the very real effect they can have.

Suppose you have a decision to make about a particular feature. You can do it the easy way in about a day, or the hard way in about a week. (Hypothetical.) Which way should you do it? Suppose that the easy way makes four new fields required, whereas doing it the hard way makes the program smart enough to handle incomplete data. Which way should you do it?

Required fields seem innocuous, but they are always an imposition on the user. They require the user to gather more information before starting their jobs. This in turn often means they have to keep their data on Post-It notes until they are ready to enter it, resulting in lost data, delays, and general frustration.

Let's consider an analogy. Suppose I'm putting a sign up on my building. Is it OK to mount the sign six feet up on the wall, so that pedestrians have to duck or go around it? It's much easier for me to hang the sign if I don't have to set up a ladder and scaffold. It's only a minor annoyance to the pedestrians. It's not like it would block the sidewalk or anything. All they have to do is duck. So, I get to save an hour installing the sign, at the expense of taking two seconds away from every pedestrian passing my store. Over the long run, all of those two second diversions are going to add up to many, many times more than the hour that I saved.
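To put numbers on the sign example, here is the arithmetic as a small sketch. The one hour and the two seconds come from the analogy above; the foot traffic figure in the usage note is an assumption chosen purely for illustration.

```python
SECONDS_SAVED_BY_INSTALLER = 60 * 60   # one hour saved hanging the sign
SECONDS_LOST_PER_PEDESTRIAN = 2        # each passer-by ducks or detours


def break_even_pedestrians():
    """Number of passers-by after which total time lost equals time saved."""
    return SECONDS_SAVED_BY_INSTALLER // SECONDS_LOST_PER_PEDESTRIAN


def total_hours_lost(pedestrians_per_day, days):
    """Aggregate time taken from pedestrians, in hours."""
    return pedestrians_per_day * days * SECONDS_LOST_PER_PEDESTRIAN / 3600
```

The break-even point is 1,800 pedestrians. At an assumed 500 passers-by a day, the sign has cost other people more time than it saved me before the fourth day is out, and total_hours_lost(500, 365) comes to roughly 101 hours over a year.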

It's not ethical to worsen the lives of others, even a small bit, just to make things easy for yourself. Successful software is measured in millions of people. Every requirements decision you make is an imposition of your will on your users' lives, even if it is a tiny one. Always be mindful of the impact your decisions--even small ones--have on those people. You should be willing to bear large burdens to ease the burden on those people, even if your impact on any given individual is minuscule.

Keep Your Secrets

2002-12-30T22:10:00-06:00

Here's a system I call "KeepYourSecrets.org". Recall a film noir detective telling the criminal mastermind that unless he drops a postcard in the mail in the next three days, all the details will go straight to the newspaper.

You can upload any kind of file -- it's all treated like binary. You can set some parameters like a distribution list and a checkin frequency. The system uses an IRC-like network to split your file into n parts, of which some k parts are needed to re-create the original. Up to n-k parts can be lost without losing the whole, and up to k-1 parts can be compromised without compromising it. (See "Applied Cryptography" for details.) With lots of hosts, you can split a document into multiple overlapping sets of pieces to provide another layer of resiliency against damage.
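The post does not say which k-of-n scheme it has in mind. One well-known option is Shamir's secret sharing, sketched below under some loud assumptions: the secret is a single integer smaller than a fixed prime (in practice you might share an encryption key and store the encrypted file separately), and the function names are invented for this example. The secret becomes the constant term of a random polynomial of degree k-1, each share is a point on that polynomial, and any k points determine it.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split_secret(secret, n, k):
    """Split an integer secret into n shares; any k of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 from k or more shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With split_secret(s, n=5, k=3), any three of the five shares passed to recover_secret give back s, so two hosts can vanish without harm. This is a teaching sketch, not a vetted cryptographic implementation.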

From then on, if you do not check in with the network on some periodic basis, the document goes out to the distribution list. NYTimes, Washington Post, CIA, whoever is on the distribution list for your file.

The network of servers doesn't ever have to know who you are. The servers just need to know that you hold the private key that matches the public key that was used to upload the package.

It's possible to construct voting algorithms that the servers can use to decide if you have really checked in or not. This lets the network protect against a single compromised or hostile host. (You have to be resilient against hostile implementations.)
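The simplest voting rule I can think of is a strict majority, sketched below. Everything here is hypothetical: the shape of the reports, the function name, and the assumption that each host has already verified the check-in signature before recording a time. A real network facing actively hostile hosts would need something sturdier than this.

```python
def checked_in(last_seen_by_host, now, interval):
    """Decide by strict majority whether the owner checked in recently.

    last_seen_by_host maps a host name to the time (in seconds) at which that
    host last saw a valid check-in, or None if it never saw one.
    """
    votes = sum(
        1
        for seen in last_seen_by_host.values()
        if seen is not None and now - seen <= interval
    )
    return votes * 2 > len(last_seen_by_host)
```

With three hosts, one compromised host that denies ever seeing a check-in cannot trigger a release on its own, and one that falsely claims a check-in cannot suppress a release on its own.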

Because the hosts all communicate via some pub/sub or relay-chat protocol (Jabber, maybe?), the networks of hosts can be self-forming and self-identifying. If there is no central point of control, then the network as a whole cannot be stopped, subverted or forced to give up secrets by any single agency.

What you end up with is a secure, anonymous drop box that cannot be blocked, traced, or infiltrated. It is self-forming and highly resilient to the loss of constituent pieces.

--------

The Paradox of Honor

2002-10-22T21:59:00-05:00

You can use a person's honor against him only if he values honor. Only the honest man is threatened by the pointed finger. The liar is unaffected by that kind of accusation. I think it is because there is no such thing as "dishonesty". There is only honesty or its lack. Not a thing and its opposite, but a thing and its absence. One or zero, not one or minus-one. One who is lacking a thing cannot be threatened at the prospect of its loss.

I think I'd like to

2002-10-22T21:47:00-05:00

I think I'd like to do some Smalltalk (or Squeak) development sometime. Just for myself. It would be good for me -- like an artist going to a retreat and setting aside all notions of practicality. I know I'll never work in Squeak professionally. That's why it would be like saying to yourself, "In this now, purity of expression is all that matters. Tomorrow, I will worry about making something I can sell. Tomorrow I will design so the mediocre masses that follow me cannot corrupt it. Today, I will work for the joy I find in the work."

The burdens of responsibility leave no room for such indulgence. So I turn back to Java and C#. I'll write another Address class and deal with another session manager, and more cookies. Always with the cookies.

Nostalgia

2002-09-16T13:19:00-05:00

This kind of thing makes me wish I were back at Caltech.

Bill Joy Knocks the Open Source Business Model

2002-08-16T11:29:00-05:00

Bill Joy had some doubts to voice about Linux. Of course, like so many others he immediately jumps to the wrong conclusion. "The open-source business model hasn't worked very well," he says.

Tough nuts. Here's the point that seems to get missed over and over again. There is no "open source business model". There never was, and I doubt there ever will be. It doesn't exist. It's a contradiction in terms.

Open source needs no business model.

Look, GNU existed before anyone ever talked about "open source". Linux was built before there were companies like RedHat and IBM interested (let alone Sun). The thing that the corps and the pundits cannot seem to grasp is their absolute irrelevance.

It's like Bruce Sterling's speech. Harangue. Whatever you want to call it. I see it as yet another person getting up and trying to tell the "open-source community" what they need to do. Getting on their case about not being organized enough... or something.

Or it's like those posters on Slashdot that wish either GNOME or KDE would shut down so everyone can focus on one "standard" desktop.

Or Scott McNealy, lamenting the fact that open source Java application servers inhibit the expenditure of dollars that could be used to market J2EE against .Net.

Or the UI designers who froth at the mouth about how terrible an open source application's user interface may be. They say moronic things like "when will coders learn that they shouldn't design user interfaces?" (Or the more extreme form, "Programmers should never design UIs.")

Or it's like anyone who looks at an application and says, "That's pretty good. You know what you really need to do?"

All of these people don't get the true point. I'll say it here as baldly as I can.

There is nobody in charge. Not IBM, not Linus Torvalds, not Richard Stallman. Nobody.

All you will find is an anarchic collection of self-interested individuals. Sometimes they collaborate. Some of them work together, some work apart, some work against each other. To the extent that some clusters of individuals share a vision, they collaborate to tackle bigger, cooler projects.

There is no one in control. Nobody gets to decree what open source projects live or die, or what direction they go in. These projects are an expression of free will, created by those capable of expressing themselves in that medium. Decisions happen in code, because coders make them happen.

Free will, baby. It's my project, and I'll do what I want with it. If I want to create the most god-awful user interface ever seen by Man, that's my prerogative. (If I want lots of users, I probably won't do that, but who says I have to want lots of users? It's my choice!)

As long as one GNOME hacker wants to keep working on GNOME, it will continue to evolve. As long as one Linux kernel hacker keeps coding, Linux will continue. None of these things require corporations, IPOs, or investment dollars to continue. The only true investments in open source are time and brainpower. Money is useful in that it can be used to purchase time, the greatest gift you can give a coder. Corporations are useful in that they are effective at aggregating and channeling money. "Useful", not "required".

As long as coders have free will and the tools to express it, open source software will continue. In fact, even if you take away their tools, they'll build new ones! To truly kill open source software, you must kill free will itself.

(And, by the way, there are those who want to do exactly that.)

Needles, Haystacks

2002-07-22T22:09:00-05:00

So, this may seem a little off-topic, but it comes round in the end. Really, it does.

I've been aggravated with the way members of the fourth estate have been treating the supposed "information" that various TLAs had before the September 11 attacks. (That used to be my birthday, by the way. I've since decided to change it.) We hear that four or five good bits of information scattered across the hundreds of FBI, CIA, NSA, NRO, IRS, DEA, INS, or IMF offices "clearly indicate" that terrorists were planning to fly planes into buildings. Maybe so. Still, it doesn't take a doctorate in complexity theory to figure out that you could probably find just as much data to support any conclusion you want. I'm willing to bet that if the same amount of collective effort were invested, we could prove that the U. S. Government has evidence that Saddam Hussein and aliens from Saturn are going to land in Red Square to re-establish the Soviet Union and launch missiles at Guam.

You see, if you already have the conclusion in hand, you can sift through mountain ranges of data to find those bits that best support your conclusion. That's just hindsight. It's only good for gossipy hens clucking over the backyard fence, network news anchors, and not-so-subtle innuendos by Congresscritters.

The trouble is, it doesn't work in reverse. How many documents does just the FBI produce every day? 10,000? 50,000? How would anyone find exactly those five or six documents that really matter and ignore all of the chaff? That's the job of analysis, and it's damn hard. A priori, you could only put these documents together and form a conclusion through sheer dumb luck. No matter how many analysts the agencies hire, they will always be crushed by the tsunami of data.

Now, I'm not trying to make excuses for the alphabet soup gang. I think they need to reconsider some of their basic operations. I'll leave questions about separating counter-intelligence from law enforcement to others. I want to think about harnessing randomness. You see, government agencies are, by their very nature, bureaucratic entities. Bureaucracies thrive on command-and-control structures. I think it comes from protecting their budgets. Orders flow down the hierarchy, information flows up. Somewhere, at the top, an omniscient being directs the whole shebang. A command-and-control structure hates nothing more than randomness. Randomness is noise in the system, evidence of inadequate procedures. A properly structured bureaucracy has a big, fat binder that defines who talks to whom, and when, and under what circumstances.

Such a structure is perfectly optimized to ignore things. Why? Because each level in the chain of command has to summarize, categorize, and condense information for its immediate superior. Information is lost at every exchange. Worse yet, the chance for somebody to see a pattern is minimized. The problem is this whole idea that information flows toward a converging point. Whether that point is the head of the agency, the POTUS, or an army of analysts in Foggy Bottom, they cannot assimilate everything. There isn't even any way to build information systems to support the mass of data produced every day, let alone correlating reports over time.

So, how do Dan Rather and his cohorts find these things and put them together? Decentralization. There are hordes of pit-bull journalists just waiting for the scandal that will catapult them onto CNN. ("Eat your heart out Wolf, I found the smoking gun first!")

Just imagine if every document produced by the Minneapolis field office of the FBI were sent to every other FBI agent and office in the country. A vast torrent of data flowing constantly around the nation. Suppose that an agent filing a report about suspicious flight school activity could correlate that with other reports about students at other flight schools. He might dig a little deeper and find some additional reports about increased training activity, or a cluster of expired visas that overlap with the students in the schools. In short, it would be a lot easier to correlate those random bits of data to make the connections. Humans are amazing at detecting patterns, but they have to see the data first!

This is what we should focus on. Not on rebuilding the $6 Billion Bureaucracy, but on finding ways to make available all of the data collected today. (Notice that I haven't said anything that requires weakening our 4th or 5th Amendment rights. This can all be done under laws that existed before 9/11.) Well, we certainly have a model for a global, decentralized document repository that will let you search, index, and correlate all of its contents. We even have technologies that can induce membership in a set. I'd love to see what Google Sets would do with the 19 hijackers' names, after you have it index the entire contents of the FBI, CIA, and INS databases. Who would it nominate for membership in that set?

Basically, the recipe is this: move away from ill-conceived ideas about creating a "global clearinghouse" for intelligence reports. Decentralize it. Follow the model of the Internet, Gnutella, and Google. Maximize the chances for field agents and analysts to be exposed to that last, vital bit of data that makes a pattern come clear. Then, when an agent perceives a pattern, make damn sure the command-and-control structure is ready to respond.

MLP

2002-07-11T15:16:00-05:00

Here's a good roundup of recent traffic regarding REST.

Here's my number one frustration

2002-06-24T12:44:00-05:00

Here's my number one frustration with the state of the industry today. I am a professional. I regard my work as a craft to be studied and learned. Yet, in most domains, there is no benefit to developing a high level of skill. You end up surrounded by people who don't understand a word you say, can't work at that level, and don't really give a damn. They'll get the same rewards and go home happy at 5:00 every day. It's like, once you achieve a base level of mediocrity, there's no benefit for further personal development. In fact, there's a distinct disadvantage, in that you end up pulling ridiculous hours to clean up their garbage.

Bah, there I go being bitter again. Maybe I just need to work in some other domain--one where skills count for something, and being good at your job is a benefit, not a hindrance. I'm sick of writing Address classes, anyway.

Multiplier Effects

2002-06-08T22:28:00-05:00

Here's another way to think about the ethics of software, in terms of multipliers. Think back to the last major virus scare, or when Star Wars Episode II was released. Some "analyst"--who probably found his certificate in a box of Cracker Jack--published some ridiculous estimate of damages.

BTW, I have to take a minute to disassemble this kind of analysis. Stick with me, it won't take long.

If you take 1.5 seconds to delete the virus, it costs nothing. It's an absolutely immeasurable impact to your day. It won't even affect your productivity. You will probably spend more time than that discussing sports scores, going to the bathroom, chatting with a client, or any of the hundreds of other things human beings do during a day. It's literally lost in the noise. Nevertheless, some peabrain analyst who likes big numbers will take that 1.5 seconds and multiply it by the millions of other users and their 1.5 seconds, then multiply that by the "national average salary" or some such number.

So, even though it takes you longer to blow your nose than to delete the virus email, somehow it still ends up "costing the economy" 5x10^6 USD in "lost productivity". The underlying assumptions here are so thoroughly rotten that the result cannot be anything but a joke. Sure as hell though, you'll see this analysis dragged out every time there's a news story--or better yet, a trial--about an email worm.
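To make the arithmetic concrete, here is a minimal sketch of that multiplication. The user count and hourly wage are made-up inputs, chosen only so the product lands on the headline figure; no real estimate is being quoted.

```python
SECONDS_PER_HOUR = 3600


def naive_lost_productivity(seconds_each, users, hourly_wage):
    """Multiply a tiny per-person delay across everyone, as the analyst does."""
    return seconds_each * users * hourly_wage / SECONDS_PER_HOUR


# Hypothetical inputs: 400 million users at 30 USD per hour.
total = naive_lost_productivity(1.5, 400_000_000, 30.0)
per_person = naive_lost_productivity(1.5, 1, 30.0)

print(f"headline: ${total:,.0f}")        # headline: $5,000,000
print(f"per person: ${per_person:.4f}")  # per person: $0.0125
```

The same formula yields both numbers. The headline looks alarming only because the multiplier hides a per-person cost of about a penny.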

The real moral of this story isn't about innumeracy in the press, or spotlight seekers exploiting innumeracy. It's about multipliers.

Suppose you have a decision to make about a particular feature. You can do it the easy way in about a week, or the hard way in about a month. (Hypothetical.) Which way should you do it? Suppose that the easy way makes the user click an extra button, whereas doing it the hard way makes the program a bit smarter and saves the user one click. Just one click. Which way should you do it?

Let's consider an analogy. Suppose I'm putting a sign up on my building. Is it OK to mount the sign six feet up on the wall, so that pedestrians have to duck or go around it? It's much easier for me to hang the sign if I don't have to set up a ladder and scaffold. It's only a minor annoyance to the pedestrians. It's not like it would block the sidewalk or anything. All they have to do is duck. (We'll just ignore the fact that pissing off all your potential customers is not a good business strategy.)

It's not ethical to worsen the lives of others, even a small bit, just to make things easy for yourself. These days, successful software is measured in millions of users, of people. Always be mindful of the impact your decisions--even small ones--have on those people. Accept large burdens to ease the burden on those people, even if your impact on any given individual is minuscule. The cumulative good you do that way will always overwhelm the individual costs you pay.

REST and Change in APIs

2002-05-14T11:34:00-05:00

In case it didn't come through, I'm intrigued by REST, because it seems more fluid than the WS-* specifications. I can do an HTTP request in about 5 lines of socket code in any modern language, from any client device.

The WS-splat crowd seem to be building YABS (yet another brittle standard). Riddle me this: what use is a service description in a standardized form if there is only one implementor of that service? WSDL only attains full value when there are standards built on top of WSDL. Just like XML, WSDL is a meta-standard. It is a standard for specifying other standards. Collected and diverse industry behemoths and leviathans make the rules for that playground.

I see two, equally likely, outcomes for any given service definition:

A defining body will standardize the interface for a particular web service. This will take far too long.
A dominant company in a star-like topology with its customers and suppliers (think Wal-mart) will impose an interface that its business partners must use.

Once such interfaces are defined, how easily might they be changed? I mean the WSDL (or other) definition of the service itself. Can anyone say CORBAservices? You'd better define your services right the first time, because there appears to be substantial friction opposing change.

How does REST avoid this issue? By eliminating layers. If I support a URI naming scheme like http://company.com/groupName/divisionName/departmentName/purchaseOrders/poNumber as a RESTful way to access purchase orders, and I find that we need to change it to /purchaseOrders/departmentNumber/poNumber, then both forms can co-exist. The alternative change in SOAP/WSDL-land would either modify the original endpoint (an incompatible change!) or would define a new service to support the new mode of lookup. (I suppose other hacks are available, too. Service.getPurchaseOrder2() or Service.getPurchaseOrderNew() for example.)
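As a rough illustration of the two forms co-existing, here is a minimal sketch of a dispatcher that keeps both URI schemes registered against one handler. Everything in it (the `ROUTES` table, `resolve`, `get_purchase_order`, and the sample data) is hypothetical and not taken from any real framework.

```python
import re

# Hypothetical in-memory store; keys and fields are illustrative only.
PURCHASE_ORDERS = {
    "1001": {"department": "42", "total": 250},
}


def get_purchase_order(po_number, **_ignored):
    """Return one purchase order, whichever URI form named it."""
    return PURCHASE_ORDERS.get(po_number)


# Both naming schemes stay registered side by side and share a handler.
ROUTES = [
    # Original form: /groupName/divisionName/departmentName/purchaseOrders/poNumber
    (re.compile(r"^/(?P<group>[^/]+)/(?P<division>[^/]+)/(?P<department>[^/]+)"
                r"/purchaseOrders/(?P<po_number>[^/]+)$"),
     get_purchase_order),
    # Newer form: /purchaseOrders/departmentNumber/poNumber
    (re.compile(r"^/purchaseOrders/(?P<department_number>[^/]+)"
                r"/(?P<po_number>[^/]+)$"),
     get_purchase_order),
]


def resolve(path):
    """Dispatch a request path to the first matching route, or return None."""
    for pattern, handler in ROUTES:
        match = pattern.match(path)
        if match:
            return handler(**match.groupdict())
    return None


old_style = resolve("/acme/east/facilities/purchaseOrders/1001")
new_style = resolve("/purchaseOrders/42/1001")
```

Adding the second pattern does not disturb clients that still use the first one, which is the kind of low-friction change described above.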

Of course, neither of these service architectures is implemented widely enough to really evaluate which one will be more accepting of change. I can tell you, though, that one of the huge CORBA-killers was the slow pace and resistance to change in the CORBAservices.

Here's another excellent discussion about

2002-05-09T10:50:00-05:00

Here's another excellent discussion about REST for web services.

Debating "Web Services"

2002-05-07T21:30:00-05:00

There is a huge and contentious debate under way right now related to "Web services". A sizable contingent of the W3C and various XML pioneers are challenging the value of SOAP, WSDL, and other "Web service" technology.

This is a nuanced discussion with many different positions being taken by the opponents. Some are critical of the W3C's participation in something viewed as a "pay to play" maneuver from Microsoft and IBM. Others are pointing out serious flaws in SOAP itself. To me, the most interesting challenge comes from the W3C's Technical Architecture Group (TAG). This is the group tasked with defining what the web is and is not. Several members of the TAG, including the president of the Apache Foundation, are arguing that "Web services" as defined by SOAP, fundamentally are not "the web". ("The web" being defined crudely as "things are named via URIs" and "every time I ask for the same URI, I get the same results". My definition, not theirs.) With a "Web service", a URI doesn't name a thing, it names a process. What I get when I ask for a URI is no longer dependent solely on the state of the thing itself. Instead, what I get depends on my path through the application.

I'd encourage you all to sample this debate, as summarized by Simon St. Laurent (one of the original XML designers).

Decoupling

2002-05-06T09:46:00-05:00

For the ultimate in temporal, architectural, language and spatial decoupling, try two of my favorite fluid technologies: publish-subscribe messaging and tuple-spaces.
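For readers who haven't met tuple-spaces, here is a minimal single-process sketch of the idea. The class and its `write`, `read`, and `take` methods are illustrative names rather than any real library's API, and `None` as a wildcard is an assumption of this sketch. A real tuple-space would also be shared across processes and would block on a missing tuple.

```python
class TupleSpace:
    """A toy, in-memory tuple-space: writers and readers never meet directly."""

    def __init__(self):
        self._tuples = []

    def write(self, *fields):
        """Drop a tuple into the space for anyone to find later."""
        self._tuples.append(tuple(fields))

    @staticmethod
    def _matches(template, candidate):
        # None in a template position acts as a wildcard (sketch assumption).
        return len(template) == len(candidate) and all(
            t is None or t == c for t, c in zip(template, candidate)
        )

    def read(self, *template):
        """Return the first matching tuple without removing it, or None."""
        for candidate in self._tuples:
            if self._matches(template, candidate):
                return candidate
        return None

    def take(self, *template):
        """Remove and return the first matching tuple, or None."""
        found = self.read(*template)
        if found is not None:
            self._tuples.remove(found)
        return found


space = TupleSpace()
space.write("order", 1001, "pending")        # the producer knows no consumer
claimed = space.take("order", None, "pending")
leftover = space.read("order", None, None)
```

The producer and consumer share only the shape of the tuple. Neither knows the other's identity or location, or when the other runs, which is the decoupling in question.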

Prison of our own Making

2002-04-19T21:49:00-05:00

Ethical decisions in software development

2002-04-04T22:19:00-06:00

Ethical decisions in software development do not only arise when we are talking about malware or copyright infringement.

If my programs are successful, then they impact the lives of thousands or millions of people. That impact can be positive or negative. The program can make their lives better or worse–even if just in minute proportions.

Every time I make a decision about how a program behaves, I am really deciding what my users can and cannot do. If I make an input required, I am forcing them to abide by my rules. (Hopefully, it is a rule they expressed first, at least.) Conversely, if I allow partial entry, then I am allowing some licentiousness. They can get away with less rigorous work.

That makes every programming decision an ethical decision.

Designing for Emergent Behavior

2002-03-25T22:44:00-06:00

Lately, I’ve been grooving on emergent behavior. This fuzzy term comes from the equally fuzzy field of complexity studies. Mix complex rules together with non-linear effects (like humans) and you are likely to observe emergent behavior.

Recent example: web browser security holes. Any program inherently constitutes a complex system. Add in some dynamic reprogramming, downloadable code, system-level scripting, and millions upon millions of users and you’ve got a perfect petri dish. Sit back and watch the show. Unpredictable behavior will surely result.

In fact, "emergent" sometimes gets used as a synonym for "unpredictable". By and large, I believe that’s true. In traditional systems design, "unpredictable" definitely equals "sloppy". Command-and-control, baby. Emergent behavior is what happens when your program goes off the rails.

The thing is, emergent behavior is where all the really interesting things happen. Predictable programs are boring. Big batch runs are predictable.

But, you have to consider the complete system. In a big batch run, the system is linear: inputs, transformation, outputs. No feedback. No humans. When you include humans in your view of the system, all these messy feedback loops start to appear. It gets even worse when you have multiple humans connected via the programs. Feedback loops that stretch from one person, through at least two programs, out to another person and back.

Any system that involves humans will exhibit emergent behaviors – and this is a very good thing.

Are "designed" behavior and "emergent" behavior inherently incompatible? I don’t think so. I think it may be possible to design for emergent behavior. I mean that certain designs will encourage some kinds of emergent behavior, whereas other designs encourage other kinds of emergent behavior. We can study the behaviors produced by various systems and designs to build a compendium of factors that are likely to facilitate one class of behavior or another.

For example: In every corporation, I see large volumes of data stored and shared in two different formats. The nature of the two systems encourages very different behaviors.

First we have relational databases. These tend to be large, expensive systems. As a result, they are centralized to one degree or another. The nature of relational algebra is that of a static schema. Therefore, changes are rigidly controlled. Centralized, rigidly controlled assets require guardians (DBAs) and gatekeepers (data modelers). Because the schema is well-defined and changes slowly, the database gains a degree of transparency. Applications are integrated through their databases. Generic tools for backup, reporting, extraction, and modeling become possible. The data can be accessed from a variety of applications in a relatively generic fashion.

The other data storage tool I see used widely is the spreadsheet. I almost never see a spreadsheet used to calculate numbers. Instead, most are used as a schema-less data storage tool. Often created directly by the business analysts, these spreadsheets are very conducive to change. Sharing is as simple as sending the file through email. Of course, this leads to version conflicts and concurrent update issues that have to be settled by hand (usually by printing a timestamp on the hardcopies!). There is not a central definition of the data structure. Indeed, neither the data nor the structures from spreadsheets can be reused. A spreadsheet makes the 2-dimensional structure of a table obvious, but it makes relationships difficult, if not impossible, to represent. Ergo, spreadsheet users don’t do relationships. Access to the spreadsheets is always mediated by a single application.

So, two different systems. Both store structured (or at least semi-structured) data. The nature of each produces very different emergent behaviors. In one case, we find the evolution of acolytes of the RDBMS. In the other case, we find that a numeric analysis tool is being used for widespread data storage and sharing.

Given enough examples, enough time, and enough study, can we not learn to extrapolate from the essential nature of our designs to the most probable emergent behaviors? Even perhaps, to select the emergent behaviors that we desire first, and, starting from those, decide what essential nature our designs must embody to most likely encourage those behaviors?

Names have Power

2002-03-19T23:11:00-06:00

Names have power. Shamanic primitives guard their true names -- give me your name and you give me power over you. In the ether, your name is your only identity. Give away your name and you give away yourself. No cause, issue, or crusade has a follower until it has a name. A good name evokes images, emotions. A well-named issue becomes uncontestable. (Who is really opposed to "family values", anyway?)

Naming things well may be one of the hardest jobs in design. Somebody once said that object-oriented design was about creating the language that you would use to solve the problem. Start with the language (a collection of names, and rules about how to assemble the names), then deal with the problem.

I'm struggling with naming something right now. I can sense what it is. There is a real thing there. I can feel it. I need to define it, give it boundaries. When I can name it, I will give it life.

Find the line, find the shape
Through the grain
Find the outline, things will
Tell you their name
--Suzanne Vega

The best name I've come up with yet is fluid. There are fluid methods, fluid tools, fluid technologies, fluid designs, and so on. Things that are fluid welcome change. They adapt. They are pleasant to modify. If I have a fluid architecture, then integrating a new system into the mix does not cause massive headaches and heartburn. (Hmmm. So dropping a new system into a fluid architecture doesn't cause a ripple effect? Right. See how hard it is to name things?) Fluid "stuff" does not resist change. Being fluid means nothing is ever carved in stone. Things that are fluid encourage certain emergent properties that we value: fast, flexible, joyous.

Pah. That's damn close to gibberish.

Let's try analogy and contrast:

Fluid	Not fluid
Publish-subscribe messaging	Flat file integration
Typeless languages	Strongly-typed languages
Tuple-spaces	Relational databases
eXtreme Programming	SEI CMM Level 5
Cross-functional teams	Silos
Whiteboard task lists	GANTT charts
Web	Client-server
20-person startup	The same company at 150 people

Does that help? The items on the left share some essential, underlying attributes. The things on the right lack those attributes; they embody different values. (I don't like the semiotics of "fluid". Call that a working title, not a true name. Besides, the natural opposites of "fluid" would be "solid" or "concrete". These are both positively-connoted terms.)

So what can I name this quality? Is there really something essential there, or is this just reflecting nothing more than the way I like to work?

Lately, I have been struggling

2002-03-16T22:42:00-06:00

Lately, I have been struggling to find the meaning in my work. I suppose that’s not surprising. I am a human being–a mortal creature. My age will soon flip a decimal digit. (I decline to specify which.) These can certainly cause one to spend time reflecting on one’s legacy. They can also cause one to buy a flaming red sports car. I may explore that option later.

I also work in a field of incredible transience. Two hundred years from now, no cathedral will bear my mark. No train depot of my design will grace the National Register of Historic Places. No literary critics will deconstruct the significance of my characters' middle initials. In truth, the shelf life of my work compares poorly to that of a gallon of milk.

I am a programmer.

I and my comrades can usually be found behind our glowing screens, working hour after hour to bring some other person’s vision to life. We who grapple with chaos and ether and mud expend our spirit, energy, life, time, soul, and qi in the name of creation. We work long after the managers have left. We learn the janitors' names. I have often gazed out my window to the neon street below, full of the theater signs, restaurants, and wandering crowds seeking to be entertained. I have wondered what kind of life I should have led to be in that crowd instead of watching it. I’ve wondered how I could rejoin that human mass. I think I’d have to change careers.

I cannot deny, however, that my work brings me deep–if ephemeral–satisfaction. The harsh joy of self-sacrifice combines with the exultant delight of success when a project comes together. When I finally get my programs to work, it’s a kind of magic, dense and layered. At one level, the thought that my work will be useful to someone–that it will make dozens, hundreds, maybe millions of people more individually powerful–is heady and exciting.

At another level, I have a fierce pride that my software works at all. Knowing that my creation is strong enough, powerful enough to survive the threat of millions of users doing their damnedest to destroy it. Despite the teeming millions trying to prove that there is no such thing as "foolproof", my software keeps working. "Robust", we call it. "Resilient". "Come on", it says, "bring it on."

Deeper still, I take a craftsman’s pride in a job well done. Like a mason or a carpenter, I know what is under the surface. I know how well it is put together. I know what skill went into its construction. No one else may see this, but I know.
