Wide Awake Developers

From QCon San Francisco


I’m at QCon San Francisco this week.  (An aside: after being a speaker at No Fluff, Just Stuff, it’s interesting to be in the audience again.  As usual, on returning from travels in a different domain, one has a new perspective on familiar scenes.) This conference targets senior developers, architects, and project managers.  One of the very appealing things is the track on "Architectures you’ve always wondered about".  It covers high-volume architectures for sites such as LinkedIn and eBay, as well as other networked applications like Second Life.  These applications live and work in thin air, where traffic levels far outstrip those of most sites in the world.  Performance and scalability are two of my personal themes, so I’m very interested in learning from these pioneers about what happens when you’ve blown past the limits of traditional 3-tier, app-server-centered architecture.

Through the remainder of the week, I’ll be blogging five ideas, insights, or experiences from each day of the conference.

Normal Accidents


While I was writing Release It!, I was influenced by James R. Chiles’s book Inviting Disaster. One of Chiles’s sources is Normal Accidents, by Charles Perrow. I’ve just started reading, and even the first two pages offer great insight.

Normal Accidents describes systems that are inherently unstable, to the point that system failures are inevitable and should be expected.  These "normal" accidents result from systems that exhibit the characteristics of high "interactive complexity" and "tight coupling".

Interactive complexity refers to internal linkages, hidden from the view of operators. These invisible relations between components or subsystems produce multiple effects from a single cause.  They can also produce outcomes that do not seem to relate to their inputs.

In software systems, interactive complexity is endemic. Any time two programs share a server or database, they are linked. Any time a system contains a feedback loop, it inherently has higher interactive complexity. Feedback loops aren’t always obvious.  For example, suppose a new software release consumes a fraction more CPU per transaction than before. That small increment might push the server from a non-contending regime into a contending one.  Once in contention, the added CPU usage creates more latency. That latency, together with the increased task-switching overhead, produces still more latency. Positive feedback.
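To see how abruptly a server can cross from the non-contending regime into the contending one, consider the classic M/M/1 queueing formula, where average response time is the service time divided by (1 - utilization). Here is a minimal sketch of that math; the numbers are my illustration, not from Perrow:

```python
# Minimal M/M/1 sketch: response time vs. utilization.
# W = S / (1 - rho), where S is service time and rho is utilization.
# Illustrative numbers only.

service_time_ms = 10.0  # time to handle one transaction on an idle server

for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    response_ms = service_time_ms / (1.0 - utilization)
    print(f"utilization {utilization:.0%}: response time {response_ms:8.1f} ms")

# The non-contending regime (50-80%) is nearly flat, while the last few
# percentage points of CPU multiply latency by an order of magnitude --
# the hidden feedback loop described above.
```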

High interactive complexity leads operators to misunderstand the system and its warning signs. Thus misinformed, they act in ways that do not avert the crisis and may actually precipitate it. 

When processes happen very fast, and there is no way to isolate one part of the system from another, the system is tightly coupled.  Tight coupling allows small incidents to spread into large-scale failures.

Classic "web architecture" exhibits both high interactive complexity and tight coupling. Hence, we should expect "normal" accidents.  Uptime will be dominated by the occurence of these accidents, rather than the individual probability of failure in each component.

The first section of Release It! deals exclusively with system stability.  It shows how to reduce coupling and diminish interactive complexity. 

You Keep Using That Word. I Do Not Think It Means What You Think It Means.


"Scalable" is a tricky word. We use it like there’s one single definition. We speak as if it’s binary: this architecture is scalable, that one isn’t.

The first really tough thing about scalability is finding a useful definition. Here’s the one I use:

Marginal revenue / transaction > Marginal cost / transaction

The cost per transaction has to account for all cost factors: bandwidth, server capacity, physical infrastructure, administration, operations, backups, and the cost of capital.

(And, by the way, it’s even better when the ratio of revenue to cost per transaction grows as the volume increases.)
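To make the definition concrete, here is a small sketch that tests the inequality at several volumes. Every number and cost term in it is invented for illustration:

```python
# Sketch: does "marginal revenue per transaction > marginal cost per
# transaction" still hold as volume grows? All figures are hypothetical.

def cost_per_txn(txns_per_day: int) -> float:
    """All cost factors, spread over daily volume (invented numbers)."""
    bandwidth = 0.002 * txns_per_day                    # grows linearly
    servers = 500.0 * (1 + txns_per_day // 1_000_000)   # stair-step capacity
    operations = 2_000.0                                # admins, backups, etc.
    return (bandwidth + servers + operations) / txns_per_day

REVENUE_PER_TXN = 0.05  # hypothetical

for volume in (10_000, 100_000, 1_000_000, 10_000_000):
    cost = cost_per_txn(volume)
    verdict = "scales" if REVENUE_PER_TXN > cost else "does not scale"
    print(f"{volume:>12,} txns/day: cost ${cost:.4f}/txn -> {verdict}")
```

Note the stair-step server term: an architecture can satisfy the inequality at one volume and violate it at another, which is exactly the point that follows.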

The second really tough thing about scalability and architecture is that there isn’t one architecture that’s right.  An architecture may work perfectly well for a range of transaction volumes, but fail badly as one variable gets large.

Don’t treat "scalability" as either a binary issue or a moral failing. Ask instead, "how far will this architecture scale before the marginal cost deteriorates relative to the marginal revenue?" Then, follow that up with, "What part of the architecture will hit a scaling limit, and what can I incrementally replace to remove that limit?"

Engineering in the White Space


"Is software Engineering, or is it Art?"

Debate between the Artisans and the Engineers has simmered, and occasionally boiled, since the very introduction of the phrase "Software Engineering".  I won’t restate all the points on both sides here, since I would surely forget someone’s pet argument, and also because I see no need to be redundant.

Deep in my heart, I believe that building programs is art and architecture, but not engineering.

But, what if you’re not just building programs?

Programs and Systems

A "program" has a few characteristics that I’ll assign here:

  1. It accepts input.
  2. It produces output.
  3. It runs a sequence of instructions.
  4. Statically, it exhibits cohesion in its executable form. [*]
  5. Dynamically, it exhibits cohesion in its address space. [**]

* That is, the transitive closure of all code to be executed is finite, although it may not all be known in advance of execution.  This allows dynamic extension via plugins, but not, for example, dynamic execution of any scripts or code found on the Web.  So, a web browser is a program, but Javascript executed on some page is an independent program, not part of the browser itself.

** For "address space", feel free to substitute "object space", "process space", or "virtual memory". Cohesion requires that all the code that can access the address space should be regarded as a single program.  (IPC through shared memory is a special case of an output, and should be considered more akin to a database or memory-mapped file than to part of the program’s own address space.)

Suppose you have two separate scripts that each manipulate the same database.  I would regard those as two separate—though not independent—programs.  A single instance of Tomcat may contain several independent programs, but all the servlets in one EAR file are part of one program.

For the moment, I will not consider trivial objections, such as two distinct sets of functionality that happen to be packaged and delivered in a single EAR file.  It’s less interesting to me whether code does access the entire address space than whether it could.  A library checkout program that includes functions for both librarians and patrons may not use common code for card number lookup, but it could.  (And, arguably, it should.)  That makes it one program, in my eyes.

A "System", on the other hand, consists of interdependent programs that have commonalities in their inputs and outputs.  They could be arranges in a chain, a web, or a loop.  No matter, if one program’s input depends on another program’s output, then they are part of a system.

Systems can be composed, whereas programs cannot.  

Tricky White Space

Some programs run all the time, responding to intermittent inputs; these we call "servers".  It is very common to see servers represented as deceptively simple little rectangles on a diagram.  Between servers, we draw little arrows to indicate communication of some sort.

One little arrow might mean, "Synchronous request/reply using SOAP-XML over HTTP." That’s quite a lot of information for one little glyph to carry.  There’s not usually enough room to write all that, so we label the unfortunate arrow with either "XML over HTTP"—if viewing it from an internal perspective—or "SKU Lookup"—if we have an external perspective.

That little arrow, bravely bridging the white space between programs, looks like a direct contact.  It is Voyager, carrying its recorded message to parts unknown.  It is Arecibo, blasting a hopeful greeting into the endless dark.

Well, not really…

These days, the white space isn’t as empty as it once was.  A kind of luminiferous ether fills the void between servers on the diagram.

The Substrate

There is many a slip ‘twixt cup and lip.  In between points A and B on our diagram, there exist some or all of the following:

  • Network interface cards
  • Network switches
  • Layer 2 - 3 firewalls
  • Layer 7 (application) firewalls
  • Intrusion Detection and Prevention Systems
  • Message queues
  • Message brokers
  • XML transformation engines
  • Flat file translations
  • FTP servers
  • Polling jobs
  • Database "landing zone" tables
  • ETL scripts
  • Metro-area SONET rings
  • MPLS gateways
  • Trunk lines
  • Oceans
  • Ocean liners
  • Philippine fishing trawlers (see "Underwater Cable Break")

Even in the simple cases, there will be four or five computers between programs A and B, each running its own software to handle things like packet switching, traffic analysis, routing, threat analysis, and so on.

I’ve seen a single arrow, running from one server to another, labelled "Fulfillment".  It so happened that one server was inside my client’s company while the other server was in a fulfillment house’s company.  That little arrow, so critical to customer satisfaction, really represented a Byzantine chain of events that resembled a game of "Mousetrap" more than a single interface.  It had messages going to message brokers that appended lines to files, which were later picked up by an hourly job that would FTP the files to the "gateway" server (still inside my client’s company).  The gateway server read each line from the file, constructed an XML message, and sent it via HTTP to the fulfillment house.

It Stays Up

We analogize bridge-building as the epitome of engineering. (Side note: I live in the Twin Cities area, so we’re a little leery of bridge engineering right now.  Might better find another analogy, OK?)  Engineering a bridge starts by examining the static and dynamic load factors that the bridge must support: traffic density, weight, wind and water forces, ice, snow, and so on.

Bridging between two programs should consider static and dynamic loads, too.  Instead of just "SOAP-XML over HTTP", that one little arrow should also say, "Expect one query per HTTP request and send back one response per HTTP reply.  Expect up to 100 requests per second, and deliver responses in less than 250 milliseconds 99.999% of the time."
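One way to keep those numbers from evaporating off the whiteboard is to record them next to the interface definition itself. Here is a minimal sketch of the idea; the field names and figures are my invention, not any standard:

```python
# Sketch: an interface contract that records static and dynamic load
# assumptions, not just the wire protocol. Field names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class InterfaceContract:
    name: str
    protocol: str
    max_requests_per_second: int
    latency_budget_ms: int
    latency_percentile: float  # fraction of calls that must meet the budget

sku_lookup = InterfaceContract(
    name="SKU Lookup",
    protocol="SOAP-XML over HTTP, one query per request, one response per reply",
    max_requests_per_second=100,
    latency_budget_ms=250,
    latency_percentile=0.99999,
)

# A load test or runtime monitor can now check reality against the
# contract, instead of against an unlabeled arrow on a diagram.
```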

It Falls Down

Building the right failure modes is vital. The last job of any structure is to fall down well. The same is true for programs, and for our hardy little arrow.

The interface needs to define what happens on each end when things come unglued. What if the caller sends more than 100 requests per second? Is it OK to refuse them? Should the receiver drop requests on the floor, refuse politely, or make the best effort possible?

What should the caller do when replies take more than 250 milliseconds? Should it retry the call? Should it wait until later, or assume the receiver has failed and move on without that function?

What happens when the caller sends a request with version 1.0 of the protocol and gets back a reply in version 1.1? What if it gets back some HTML instead of XML?  Or an MP3 file instead of XML?
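Each of those questions deserves an explicit answer in code, not a shrug at runtime. As one illustration of one possible set of answers (not the only right ones), here is a minimal sketch of a caller that bounds its wait, retries a fixed number of times, rejects replies outside the protocol, and then degrades. The endpoint, timeout, and retry limit are all hypothetical:

```python
# Sketch: a caller that decides in advance what "falling down well" means:
# bound the wait, retry a limited number of times, then degrade.
import socket
import urllib.error
import urllib.request
from typing import Optional

def lookup_sku(sku: str, retries: int = 2) -> Optional[str]:
    url = f"http://example.com/sku/{sku}"  # hypothetical endpoint
    for _ in range(retries + 1):
        try:
            # Never wait forever: enforce the 250 ms latency budget.
            with urllib.request.urlopen(url, timeout=0.25) as resp:
                body = resp.read().decode("utf-8")
                if not body.lstrip().startswith("<"):
                    # Reply outside the protocol (not XML): treat as failure.
                    return None
                return body
        except (urllib.error.URLError, socket.timeout):
            continue  # transient failure: retry, up to the bound
    return None  # give up and degrade: proceed without this function

# The caller must handle None deliberately -- for example, render the page
# without SKU details -- rather than hang, crash, or hammer a struggling
# receiver with unbounded retries.
```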

When a bridge falls down, it is shocking, horrifying, and often fatal. Computers and networks, on the other hand, fall down all the time.  They always will.  Therefore, it’s incumbent on us to ensure that individual computers and networks fail in predictable ways. We need to know what happens to that arrow when one end disappears for a while.

In the White Space

This, then, is the essence of engineering in the white space. Decide what kind of load that arrow must support.  Figure out what to do when the demand is more than it can bear.  Decide what happens when the substrate beneath it falls apart, or when the duplicitous rectangle on the other end goes bonkers.

Inside the boxes, we find art.

The arrows demand engineering.


On the Widespread Abuse of SLAs


Technical terminology sneaks into common use. Terms such as "bandwidth" and "offline" get used and abused, slowly losing touch with their original meaning. ("Bandwidth" has suffered multiple drifts. It started out in radio, not computer networking, let alone the idea of "personal attention space".) It is the nature of language to evolve, so I would have no problem with this linguistic drift, if it were not for the way that the mediocre and the clueless clutch at these seemingly meaningful phrases.

The latest victim of this linguistic vampirism is the "Service Level Agreement". This term, birthed in IT governance, sounds wonderful. It sounds formal and official.

An example of the vulgar usage: "I have a five-day SLA."

It sounds so very proactive and synergistic and leveraged, doesn’t it? Theoretically, it means that we’ve got an agreement between our two groups; I am your customer and you commit to delivering service within five days.

A real SLA has important dimensions that I never see addressed with internal "organizational" SLAs.

First, boundaries.

When does that five day clock begin ticking? Is it when I submit my request to the queue? Or, is it when someone from your group picks the request up from the queue? If the latter, then how long do requests sit in queue before they get picked up? What’s the best case? Worst case? Average?

When does the clock stop ticking? If you just say, "not approved" or "needs additional detail", does that meet your SLA? Do I have to resubmit for the next iteration, with a whole new five day clock? Or, does the original five day SLA run through resolution rather than just response?

An internal SLA must begin with submission into the request queue and end when the request is fully resolved.

Second, measurement and tracking.

How often do you meet your internal SLA? 100% of the time? 95% of the time? 50% of the time? Unless you can tell me your "on-time performance", there’s no way for me to have confidence in your SLA.

How many requests have to be escalated or prioritized in order to meet the SLA? Do any non-escalated requests actually get resolved within the allotted time?

How well does your on-time performance correlate with the incoming workload? If the request volume goes up by 25%, but your on-time performance does not change, then your SLA is too loose.

An SLA must be tracked and trended. It must be correlated with demand metrics.
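None of this requires heavy tooling; it only requires timestamping each request at submission and at resolution. A minimal sketch of the measurement, with invented data:

```python
# Sketch: on-time performance for an internal SLA, measured from
# submission to the queue through full resolution. Data is invented.
from datetime import datetime, timedelta

SLA = timedelta(days=5)

requests = [
    # (submitted to queue, fully resolved)
    (datetime(2007, 11, 1), datetime(2007, 11, 5)),
    (datetime(2007, 11, 2), datetime(2007, 11, 9)),
    (datetime(2007, 11, 5), datetime(2007, 11, 8)),
]

on_time = sum(1 for submitted, resolved in requests
              if resolved - submitted <= SLA)
print(f"on-time performance: {on_time / len(requests):.0%}")

# Trend this number over time and correlate it with incoming request
# volume; an on-time rate that never moves as workload rises means the
# SLA is too loose to be informative.
```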

Third, consequences.

If there is no penalty, then there is no SLA. In fact, the IT Infrastructure Library considers penalties to be the defining characteristic of SLAs. (Of course, ITIL also says that SLAs are only possible with external suppliers, because it is only with external suppliers that you can have a contract.)

When was the last time that an internal group had its budget dinged for breaking an SLA? What would that even mean? How would the health and performance of the whole company be aided by taking resources away from a unit that already cannot perform?  The Theory of Constraints says that you devote more resources to the bottleneck, not less. Penalizing a group for breaking its SLA probably makes its performance worse, not better.

(External suppliers are different because a) you’re paying them, and b) they have a profit margin. I doubt the same is true for your own internal groups.)

If there’s no penalty, then it’s not an SLA.

Fourth, consent.

SLAs are defined by joint consent of both the supplier and consumer of the service. As a subscriber to your service, I can make economic judgments about how much to pay for what level of service. You can make economic judgments about how well you can deliver service at the required level for the offered payment.

When are internal "service level agreements" actually agreements? Never. I always see SLAs being imposed by one group upon all of its subscribers.

An SLA must be an agreement, not a dictum.


If any of these conditions are not met, then it’s not really an SLA. It’s just a "best effort response time". As a consumer, and sometimes victim, of the service, I cannot plan to the SLA time. Rather, I must manage around it. Calling a "best effort response time" an "SLA" is just an attempt to deceive both of us.


Y B Slow?


I’ve long been a fan of the Firebug extension for Firefox.  It gives you great visibility into the ebb and flow of browser traffic.  It sure beats rolling your own SOCKS proxy to stick between your browser and the destination site.

Now, I have to also endorse YSlow from Yahoo.  YSlow adds interpretation and recommendations to Firebug’s raw data.

For example, when I point YSlow at www.google.com, here’s how it "grades" Google’s performance:

[Screenshot: Google gets an A for performance]

Not bad.  On the other hand, www.target.com doesn’t fare as well.

[Screenshot: Target gets an F for performance]

Along with the high-level recommendations, YSlow will also tally up the page weight, including a nice breakdown of cached versus non-cached requests and download size.

[Screenshot: Cache statistics for Target.com]

There are so many good reasons to use this tool. In Release It!, I spend a lot of time talking about the money companies waste on bloated HTML and unnecessary page requests.  Fat pages hurt users, and they hurt companies.  Users don’t want to wait for all your extra whitespace, table formatting, and shims to download.  Companies shouldn’t have to pay for all the added, useless bandwidth.  YSlow is a great tool to help eliminate the bloat, speed up page delivery, and make users happy.

The 5 A.M. Production Problem


I’ve got a new piece up at InfoQ.com, discussing the limits of unit and functional testing: 

"Functional testing falls short, however, when you want to build software to survive the real world. Functional testing can only tell you what happens when all parts of the system are behaving within specification. True, you can coerce a system or subsystem into returning an error response, but that error will still be within the protocol! If you’re calling a method on a remote EJB that either returns "true" or "false" or it throws an exception, that’s all it will do. No amount of functional testing will make that method return "purple". Nor will any functional test case force that method to hang forever, or return one byte per second.

One of my recurring themes in Release It is that every call to another system, without exception, will someday try to kill your application. It usually comes from behavior outside the specification. When that happens, you must be able to peel back the layers of abstraction, tear apart the constructed fictions of "concurrent users", "sessions", and even "connections", and get at what’s really happening."