Wide Awake Developers

Main

Webber and Fowler on SOA Man-Boobs

InfoQ posted a video of Jim Webber and Martin Fowler doing a keynote speech at QCon London this Spring. It's a brilliant deconstruction of the concept of the Enterprise Service Bus. I can attest that they're both funny and articulate (whether on the stage or off.)

Along the way, they talk about building services incrementally, delivering value at every step along the way. They advocate decentralized control and direct alignment between services and the business units that own them. 

I agree with every word, though I'm vaguely uncomfortable with how often they say "enterprise man boobs".

Coincidence or Back-end Problem?

An odd thing happened to me today. Actually, an odd thing happened yesterday, but it's having the same odd thing happen today that really makes it odd. With me so far?

Yesterday, while I was shopping at Amazon, Amazon told me that my American Express card had expired. While it is set for a May expiration, it's several years in the future. I didn't think too much of it, because when I re-entered the same information, Amazon accepted it.

Today, I got the same thing with the same card on iTunes!

Online stores don't do a whole lot with your credit cards. For the most part, they just make a call out to a credit card processor. Small stores have to go through a second-tier CCVS system that charges a few pennies per transaction. Large ones---and do they get larger than Amazon?---generally connect directly to a payment processor. The payment processor may charge a fraction of a cent per transaction, but they definitely make it up in volume.

(There are other business factors, too, like the committed transaction volume, response time SLAs, and the like.)

Asynchronously, the payment processor collects from the issuing bank. It's the issuing bank that actually bills you, and sets your interest rate and payment terms.

Whereas VISA and MasterCard work with thousands of issuers, American Express doesn't. When you get an AmEx card, they are the issuing bank as well as the payment processor.

Which makes it highly suspect that the same card gave me the same error through two different sites. It makes me think that American Express has introduced a bug in their validation system, causing spurious declines for expiration. 

SOA: Time For a Rethink

The notion of a service-oriented architecture is real, and it can deliver. The term "SOA", however, has been entirely hijacked by a band of dangerous Taj Mahal architects. They seem innocuous, it part because they'll lull you to sleep with endless protocol diagrams. Behind the soporific technology discussion lies a grave threat to your business.

"SOA" has come to mean top-down, up-front, strong-governance, all-or-nothing process (moving at glacial speed) implemented by an ill-conceived stack of technologies. SOAP is not the problem. WSDL is not the problem. Even BPEL is not the problem. The problem begins with the entire world view.

We need to abandon the term "SOA" and invent a new one. "SOA" is chasing a false goal. The idea that services will be so strongly defined that no integration point will ever break is unachievable. Moreover, it's optimizing for the wrong thing. Most business today are not safety-critical. Instead, they are highly competitive.

We need loosely-coupled services, not orchestration.

We need services that emerge from the business units they serve, not an IT governance panel.

We need services to change as rapidly as the business itself changes, not after a chartering, funding, and governance cycle.

Instead of trying to build an antiseptic, clockwork enterprise, we need to embrace the messy, chaotic, Darwinian nature of business. We should be enabling rapid experimentation, quick rollout of "barely sufficient" systems, and fast feedback. We need to enable emergence, not stifle it.

Anything that slows down that cycle of experimentation and adjustment puts your business on the same evolutionary path as the Great Auk. I never thought I'd find myself quoting Tom Peters in a tech blog, but the key really is to "Test fast, fail fast, adjust fast."

The JVM is Great, But...

Much of the interest in dynamic languages like Groovy, JRuby, and Scala comes from running on the JVM. That lets them leverage the tremendous R&D that's gone into JVM performance and stability. It also opens up the universe of Java libraries and frameworks.

And yet, much of my work deals with the 80% of cost that comes after the development project is done. I deal with the rest of the software's lifetime. The end of development is the beginning of the software's life. Throughout that life, many of the biggest, toughest problems exist around and between the JVM's: Scalability, stability, interoperability, and adaptability.

As I previously showed in this graphic, the easiest thing for a Java developer to create is a slow, unscalable, and unstable web application. Making high-performance, robust, scalable applications still requires serious expertise. This is a big problem, and I don't see it getting better. Scala might help here in terms of stability, but I'm not yet convinced it's suitable for the largest masses of Java developers. Normal attrition means that the largest population of developers will always be the youngest and least experienced. This is not a training problem: in the post-commoditization world, the majority of code will always be written by undertrained, least-cost coders. That means we need platforms where the easiest thing to do is also the right thing to do.

Scaling distributed systems has gotten better over the last few years. Distributed memory caching has reached the mainstream. Terracotta and Coherence are both mature products, and they both let you try them out for free. In the open source crowd, as usual, you lose some manageability and some time-to-implement, but the projects work when you use them right. All of these do the job of connecting individual JVMs to a caching layer. On the other hand, I can't help but feel that the need for these products points to a gap in the platform itself.

OSGi is finally reaching the mainstream. It's been needed for a long time, for a couple of reasons. First, it's still too common to see gigantic classpaths containing multiple versions of JAR files, leading to the perennial joy of finding obscure, it-works-fine-in-QA bugs. So, keeping individual projects in their own bundles, with no classpath pollution will be a big help. Versioning application bundles is also important for application management and deployment. OSGi is what we should have had since the beginning, instead of having the classpath inflicted on us.

I predict that we'll see more production operations moving to hot deployment on OSGi containers. For enterprise services that require 100% uptime, it's just no longer acceptable to bring down the whole cluster in order to do deployments. Even taking an entire server down to deploy a new revision may become a thing of the past. In the Erlang world, it's common to see containers running continuously for months or years. In Programming Erlang, Joe Armstrong talks about sending an Erlang process a message to "become" a new package. It works without disrupting any current requests and it happens atomically between one service request and the next. (In fact, Joe says that one of the first things he does on a new system is deploy the container processes, at the very beginning of the project. Later, once he knows what the system is supposed to do, he deploys new packages into those containers.) Hot deployment can be safe, if the code being deployed is sufficiently decoupled from the container itself. OSGi does that.

OSGi also enables strong versioning of the bundles and their dependencies. This is an all-around good thing, since it will let developers and operations agree on exactly versions of which components belong in production at a given time.

SAP's SOA ESR

SAP has been talking up their suite of SOA tools. The names all run together a bit, since they each contain some permutation of "enterprise" and "builder", but it's a very impressive set of tools.

Everything SAP does comes off of an enterprise service repository (ESR). This includes a UDDI registry, and it supports service discovery and lookup. Development tools allow developers to search and discover services through their "ES Workspace". Interestingly, this workspace is open to partners as well as internal developers.

From the ESR, a developer can import enough of a service defition to build a composite application. Composite applications include business process definitions, new services of their own, local UI components, and remote service references.

Once a developer creates a composite application, it can be deployed to a local container or a test server. Presumably, there's a similar tool available for administrators to deploy services, composite applications, and other enterprise components onto servers.

Through it all, the complete definition of every component goes into the ESR.

In order to make the entire service lifecycle work, SAP has defined a strong meta-model and a very strong governance process.

This is the ultimate expression of the top-down, strong-governance model for enterprise SOA.

If you're into that sort of thing.

 

SOA at 3.5 Million Transactions Per Hour

Matthias Schorer talked about FIDUCIA IT AG and their service-oriented architecture. This financial services provider works with 780 banks in Europe, processing 35,000,000 transactions during the banking day. That works out to a little over 3.5 million transactions per hour.

Matthias described this as a service-oriented architecture, and it is. Be warned, however, that SOA does not imply or require web services. The services here exist in the middle tier. Instead of speaking XML, they mainly use serialized Java objects. As Matthias said, "if you control both ends of the communication, using XML is just crazy!"

They do use SOAP when communicating out to other companies.

They've done a couple of interesting things. They favor asynchronous communication, which makes sense when you architect for latency. Where many systems push data into the async messages, FIDUCIA does not. Instead, they put the bulk data into storage (usually files, sometimes structured data) and send control messages instructing the middle tier to process the records. This way, large files can be split up and processed in parallel by a number of the processing nodes. Obviously, this works when records are highly independent of each other.

Second, they have defined explicit rules and regulations about when to expect transactional integrity. There are enough restrictions that these are a minority of transactions. In all other cases, developers are required to design for the fact that ACID properties do not hold.

Third, they've build a multi-layered middle tier. Incoming requests first hit a pair of "Central Process Servers" which inspect the request. Requests are dispatched to individual "portals" based on their customer ID. Different portals will run different versions of the software, so FIDUCIA supports customers with different production versions of their software. Instead of attempting to combine versions on a single cluster, they just partition the portals (clusters.)

Each portal has its own load distribution mechanism, using work queues that the worker nodes listen to.

This multilevel structure lets them scale to over 1,000 nodes while keeping each cluster small and manageable.

The net result is that they can process up to 2,500 transactions per second, with no scaling limit in sight.

Amazon Blows Away Objections

Amazon must have been burning more midnight oil than usual lately.

Within the last two weeks, they've announced three new features that basically eliminate any remaining objections to their AWS computing platform.

Elastic IP Addresses 

Elastic IP addresses solve a major problem on the front end.  When an EC2 instance boots up, the "cloud" assigns it a random IP address. (Technically, it assigns two: one external and one internal.  For now, I'm only talking about the external IP.) With a random IP address, you're forced to use some kind of dynamic DNS service such as DynDNS. That lets you update your DNS entry to connect your long-lived domain name with the random IP address.

Dynamic DNS services work pretty well, but not universally well.  For one thing, there is a small amount of delay.  Dynamic DNS works by setting a very short time-to-live (TTL) on the DNS entries, which instructs intermediate DNS servers to cache the entry only for a few minutes.  When that works well, you still have a few minutes of downtime when you need to reassign your DNS name to a new IP address.  For some parts of the Net, dynamic DNS doesn't work well, usually when some ISP doesn't respect the TTL on DNS entries, but caches them for a longer time.

Elastic IP addresses solve this problem. You request an elastic IP address through a Web Services call.  The easiest way is with the command-line API:

$ ec2-allocate-address
ADDRESS    75.101.158.25   

Once the address is allocated, you own it until you release it. At this point, it's attached to your account, not to any running virtual machine. Still, this is good enough to go update your domain registrar with the new address. After you start up an instance, then you can attach the address to the machine. If the machine goes down, then the address is detached from that instance, but you still "own" it.

So, for a failover scenario, you can reassign the elastic IP address to another machine, leave your DNS settings alone, and all traffic will now come to the new machine.

Now that we've got elastic IPs, there's just one piece missing from a true HA architecture: load distribution. With just one IP address attached to one instance, you've got a single point of failure (SPOF). Right now, there are two viable options to solve that. First, you can allocate multiple elastic IPs and use round-robin DNS for load distribution. Second, you can attach a single elastic IP address to an instance that runs a software load balancer: pound, nginx, or Apache+mod_proxy_balancer. (It wouldn't surprise me to see Amazon announce an option for load-balancing-in-the-cloud soon.) You'd run two of these, with the elastic IP attached to one at any given time. Then, you need a third instance monitoring the other two, ready to flip the IP address over to the standby instance if the active one fails. (There are already some open-source and commercial products to make this easy, but that's the subject for another post.)

Availability Zones 

The second big gap that Amazon closed recently deals with geography.

In the first rev of EC2, there was absolutely no way to control where your instances were running. In fact, there wasn't any way inside the service to even tell where they were running. (You had to resort to pingtracing or geomapping of the IPs). This presents a problem if you need high availability, because you really want more than one location.

Availability Zones let you specify where your EC2 instances should run. You can get a list of them through the command-line (which, let's recall, is just a wrapper around the web services):

$ ec2-describe-availability-zones
AVAILABILITYZONE    us-east-1a    available
AVAILABILITYZONE    us-east-1b    available
AVAILABILITYZONE    us-east-1c    available

Amazon tells us that each availability zone is built independently of the others. That is, they might be in the same building or separate buildings, but they have their own network egress, power systems, cooling systems, and security. Beyond that, Amazon is pretty opaque about the availability zones. In fact, not every AWS user will see the same availability zones. They're mapped per account, so "us-east-1a" for me might map to a different hardware environment than it does for you.

How do they come into play? Pretty simply, as it turns out. When you start an instance, you can specify which availability zone you want to run it in.

Combine these two features, and you get a bunch of interesting deployment and management options.

Persistent Storage

Storage has been one of the most perplexing issues with EC2. Simply put, anything you stored to disk while your instance was running would be lost when you restart the instance. Instances always go back to the bundled disk image stored on S3.

Amazon has just announced that they will be supporting persistent storage in the near future. A few lucky users get to try it out now, in it's pre-beta incarnation.

With persistent storage, you can allocate space in chunks from 1 GB to 1 TB.  That's right, you can make one web service call to allocate a freaking terabyte! Like IP addresses, storage is owned by your account, not by an individual instance. Once you've started up an instance---say a MySQL server, for example---you attach the storage volume to it. To the virtual machine, the storage looks just like a device, so you can use it raw or format it with whatever filesystem you want.

Best of all, because this is basically a virtual SAN, you can do all kinds of SAN tricks, like snapshot copies for backups to S3.

Persistent storage done this way obviates some of the other dodgy efforts that have been going on, like  FUSE-over-S3, or the S3 storage engine for MySQL.

SimpleDB is still there, and it's still much more scalable than plain old MySQL data storage, but we've got scores of libraries for programming with relational databases, and very few that work with key-value stores. For most companies, and for the forseeable future, programming to a relational model will be the easiest thing to do. This announcement really lowers the barrier to entry even further.

 

With these announcements, Amazon has cemented AWS as a viable computing platform for real businesses.

Reality

The Granularity Problem

I spend most of my time dealing with large sites. They're always hungry for more horsepower, especially if they can serve more visitors with the same power draw. Power goes up much faster with more chassis than with more CPU core. Not to mention, administrative overhead tends to scale with the number of hosts, not the number of cores. For them, multicore is a dream come true.

I ran into an interesting situation the other day, on the other end of the spectrum.

One of my team was working with a client that had relatively modest traffic levels. They're in a mature industry with a solid, but not rabid, customer base. Their web traffic needs could easily be served by one Apache server running one CPU and a couple of gigs of RAM.

The smallest configuration we could offer, and still maintain SLAs, was two hosts, with a total of 8 CPU cores running at 2 GHz, 32 gigs of RAM, and 4 fast Ethernet ports.

Of course that's oversized! Of course it's going to cost more than it should! But at this point in time, if we're talking about dedicated boxes, that's the smallest configuration we can offer! (Barring some creative engineering, like using fully depreciated "classics" hardware that's off its original lease, but still has a year or two before EOL.)

As CPUs get more cores, the minimum configuration is going to become more and more powerful. The quantum of computing is getting large.

Not every application will need it, and that's another reason I think private clouds make a lot of sense. Companies can buy big boxes, then allocate them to specific applications in fractions. Gains cost efficiency in adminstration, power, and space consumption (though not heat production!) while still letting business units optimize their capacity downward to meet their actual demand. 

Software Failure Takes Down Blackberry Services

Anyone who's addicted to a Blackberry already knows about Monday's four-hour outage. For some of us, the Blackberry isn't just an electronic leash, it's part of our business operations.

Like cell phones, Blackberries have a huge, hidden infrastructure behind them. Corporate Blackberry Event Servers (BES) relay email, calendar, and contact information through RIM's infrastructure, out through the wireless carriers. It was RIM's own infrastructure that suffered from intermittent failures during the outage.

Data Center Knowledge reports that the outage was caused by a failed software upgrade

Releases are risky. We use testing and QA to reduce the risk, but every line of new or modified code represents an unknown.

How can we reduce the risk of an upgrade? One way is to roll it out slowly. Companies with widely distributed point-of-sale (POS) systems know this. They never push a release out to every store at once. They start with one or two. If that works, they go up to a larger handful, maybe four to eight. After a couple of days, they'll roll it out to an entire district. It can take a week or more to roll the release out everywhere.

In the interim, there are plenty of checkpoints where the release can be rolled back.

I strongly recommend approaching Web site releases the same way. Roll the new release out to one or two servers in your farm. Let a fraction of your customers into the new release. Watch for performance regressions, capacity problems, and functional errors. Absolutely ensure that you can roll it back if you need to. Once it's "baked" for a while in production, then roll it to the remaining app servers.

This approach demands a few corollaries. First, your database updates have to be structured in a forward-compatible way, and they must always allow for rollback. There can be no irrevocable updates. Second, two versions of your software will be operating simultaneously. That means your integration protocols and static assets have to be able to accommodate both versions. I discuss specific strategies for each of these aspects in Release It.

Finally, an aside: RIM's statement about the outage isn't reflected anywhere on their site. Once again, if what you want is the latest true information about a company, the very last place to find it is the company's own web site. 

The Pragmatic Architect on Security

Catching up on some reading, I finally got a chance to read Ted Neward's article "Pragmatic Architecture: Security".  It's very good.  (Actually, the whole series is pretty good, and I recommend them all.  At least as of February 2008... I make no assertions about future quality!)

Ted nails it.  I agree with all of the principles he identifies, and I particularly like his advice to "fail securely". 

I would add one more, though: Be visible.

After any breach, the three worst questions are always:

  1. How long has this been happening?
  2. How much have we lost?
  3. Why didn't we know about it sooner?

The answers are always, respectively, "Far too long", "We have no idea", and "We didn't expect that exploit". To which the only possible response is, "Well, duh, if you'd expected it, you would have closed the vulnerability."

Successful exploits are always successful because they stay hidden. Are you sure that nobody's in your systems right now, leaching data, stealing credit card numbers, or stealing products? Of course not. For a vivid case in point, google "Kerviel Societe Generale".

While you cannot prove a negative, you can improve your odds of detecting nefarious activity by making sure that everything interesting is logged. (And by "interesting", I mean "potentially valuable".) 

There are some pretty spiffy event correlation tools out there these days. They can monitor logs across hundreds of servers and network devices, extracting patterns of anomalous behavior. But, they only work if your application exposes data that could indicate a breach.

For example, you might not be able to log every login attempt, but you probably should log every admin login attempt.

Or, you might consider logging every price change. (I shudder to think about collusion between a merchant with pricing control and an outside buyer.  Imagine a 10-minute long sale on laptops: 90% off for 10 minutes only.)

If your internal web service listens on a port, then it should only accept connections from known sources. Whether you enforce that through IPTables, a hardware firewall, or inside the application itself, make sure you're logging refused connections.

Then, of course, once you're logging the data, make sure someone's monitoring it and keeping pattern and signature definitions up to date!

Two Books That Belong In Your Library

I seldom plug books---other than my own, that is. I've just read two important books, however, that really deserve your attention.

Concurrency, Everybody's Doing It

The first is Java Concurrency in Practice by Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea. I've been doing Java development for close to thirteen years now, and I learned an enormous amount from this fantastic book. For example, I knew what the textbook definition of a volatile variable was, but I never knew why I would actually want to use one. Now I know when to use them and when they won't solve the problem.

Of course, JCP talks about the Java 5 concurrency library at great length. But this is no paraphrasing of the javadoc. (It was Doug Lea's original concurrency utility library that eventually got incorporated into Java, and we're all better off for it.) The authors start with illustrations of real issues in concurrent programming. Before they introduce the concurrency utilities, they explain a problem and illustrate potential solutions. (Usually involving at least one naive "solution" that has serious flaws.) Once they show us some avenues to explore, they introduce some neatly-packaged, well-tested utility class that either solves the problem or makes a solution possible. This removes the utility classes from the realm of "inscrutable magic" and presents them as "something difficult that you don't have to write."

The best part about JCP, though, is the combination of thoroughness and clarity with which it presents a very difficult subject. For example, I always understood about the need to avoid concurrent modification of mutable state. But, thanks to this book, I also see why you have to synchronize getters, not just setters. (Even though assignment to an integer is guaranteed to happen atomically, that isn't enough to guarantee that the change is visible to other threads. The only way to guarantee ordering is by crossing a synchronization barrier on the same lock.)

Blocked Threads are one of my stability antipatterns. I've seen hundreds of web site crashes. Every single one of them eventually boils down to blocked threads somewhere. Java Concurrency in Practice has the theory, practice, and tools that you can apply to avoid deadlocks, live locks, corrupted state, and a host of other problems that lurk in the most innocuous-looking code.

Capacity Planning is Science, Not Art

The second book that I want to recommend today is Capacity Planning for Web Services. I've had this book for a while. When I first started reading it, I put it down right away thinking, "This is way too basic to solve any real problems." That was a big error.

Capacity Planning may get off to a slow start, but that's only because the authors are both thorough and deliberate. Later in the book, that deliberate pace is very helpful, because it lets us follow the math.

This is the only book on capacity planning I've seen that actually deals with transmission time for HTTP requests and repsonses. In fact, some of the examples even compute the number of packets that a request or reply will need.

I have objected to some capacity planning books because they assume that every process can be represented by an average. Not this one. In the section on standalone web servers, for example, the authors break files into several classes, then use a weighted distribution of file sizes to compute the expected response time and bandwidth requirements. This is a very real-world approach, since web requests tend toward a bimodal distribution: small HTML, Javascript, and CSS intermixed with large media files and images. (In fact, I plan on using the models in this book to quantify the effect of segregating media files from dynamic pages.)

This is also the only book I've seen that recognizes that capacity limits can propagate both downward and upward through tiers. There's a great example of how doubling the CPU performance in an app tier ends up increasing the demand on the database server, which almost totally nullifies the effect of the CPU upgrade. It also recognizes that all requests are not created equal, and recommends clustering request types by their CPU and I/O demands, instead of averaging them all together.

Nearly every result or abstract law has an example, written in concrete terms, which helps bridge theory and practice.

Both of these books deal with material that easily leads off into clouds of theory and abstraction. (JCP actually quips, "What's a memory model, and why would I want one?") These excellent works avoid the Ivory Tower trap and present highly pragmatic, immediately useful wisdom.

Should Email Errors Keep Customers From Buying?

Somewhere inside every commerce site, there's a bit of code sending emails out to customers.  Email campaigning might have been in the requirements and that email code stands tall at the brightly-lit service counter.  On the other hand, it might have been added as an afterthought, languishing in some dark corner with the "lost and found" department.  Either way, there's a good chance it's putting your site at risk.

The simplest way to code an email sending routine looks something like this:

  1. Get a javax.mail.Session instance
  2. Get a javax.mail.Transport instance from the Session
  3. Construct a javax.mail.internet.MimeMessage instance
  4. Set some fields on the message: from, subject, body.  (Setting the body may involve reading a template from a file and interpolating values.)
  5. Set the recipients' Addresses on the message
  6. Ask the Transport to send the message
  7. Close the Transport
  8. Discard the Session

This goes into a servlet, a controller, or a stateless session bean, depending on which MVC framework or JEE architecture blueprint you're using.

There are two big problems here. (Actually, there are three, but I'm not going to deal with the "one connection per message" issue.)

Request-Handling Threads at Risk

As written, all the work of sending the email happens on the request-handling thread that's also responsible for generating the response page. Even on a sunny day, that means you're spending some precious request-response cycles on work that doesn't help build the page.

You should always look at a call out to an external server with suspicion. Many of them can execute asynchronously to page generation. Anything that you can offload to a background thread, you should offload so the request-handler can get back in the pool sooner. The user's experience will be better, and your site's capacity will be better, if you do.

Also, keep in mind that SMTP servers aren't always 100% reliable. Neither are the DNS servers that point you to them. That goes double if you're connecting to some external service. (And please, please don't even tell me you're looking up the recipient's MX record and contacting the receiving MTA directly!)

If the MTA is slow to accept your connection, or to process the email, then the request-handling thread could be blocked for a long time: seconds or even minutes. Will the user wait around for the response? Not likely. He'll probably just hit "reload" and double-post the form that triggered the email in the first place.

Poor Error Recovery

The second problem is the complete lack of error recovery.  Yes, you can log an exception when your connection to the MTA fails. But that only lets the administrator know that some amount of mail failed. It doesn't say what the mail was! There's no way to contact the users who didn't get their messages. Depending on what the messages said, that could be a very big deal.

At a minimum, you'd like to be able to detect and recovery from interruptions at the MTA---scheduled maintenance, Windows patching, unscheduled index rebulids, and the like. Even if "recovery" means someone takes the users' info from the log file and types in a new message on their desktops, that's better than nothing.

A Better Way

The good news is that there's a handy way to address both of these problems at once. Better still, it works whether you're dealing with internal SMTP based servers or external XML-over-HTTP bulk mailers.

Whenever a controller decides it's time to reach out and touch a user through email, it should drop a message on a JMS queue. This lets the request-handling thread continue with page generation immediately, while leaving the email for asynchronous processing.

You can either go down the road of message-driven beans (MDB) or you can just set up a pool of background threads to consume messages from the queue. On receipt of a message, the subscriber just executes the same email generation and transmission as before, with one exception. If the message fails due to a system error, such as a broken socket connection, the message can just go right back onto the message queue for later retry. (You'll probably want to update the "next retry time" to avoid livelock.)

Better Still

If you have a cluster of application servers that can all generate outbound email, why not take the next step? Move the MDBs out into their own app server and have the message queues from all the app servers terminate there? (If you're using pub-sub instead of point-to-point, this will be pretty much transparent.) This application will resemble a message broker... for good reason. It's essentially just pulling messages in from one protocol, transforming them, then sending them out over another protocol.

The best part? You don't even have to write the message broker yourself. There are plenty of open-source and commercial alternatives.

Summary

Sending email directly from the request-handling thread performs poorly, creates unpredictable page latency for users and risks dropping their emails right on the floor. It's better to drop a message in a queue for asynchronous transformation by a message broker: it's faster, more reliable, and there's less code for you to write.

Two Sites, One Antipattern

This week, I had Groundhog Day in December.  I was visiting two different clients, but they each told the same tale of woe.

At my first stop, the director of IT told me about a problem they had recently found and eliminated.

They're a retailer. Like many retailers, they try to increase sales through "upselling" and "cross-selling". So, when you go to check out, they show you some other products that you might want to buy.  It's good to show customers relevant products that are also advantageous to sell.
For example, if a customer buys a big HDTV, offer them cables (80% margin) instead of DVDs (3% margin).

All but one of the slots on that page are filled through deliberate merchandising. People decide what to display there, the same way they decide what to put in the endcaps or next to the register in a physical store. The final slot, though, gets populated automatically according to the products in the customer's cart. Based on the original requirements for the site, the code to populate that slot looked for products in the catalog with similar attributes, then sorted through them to find the "best" product.  (Based on some balance of closely-matched attributes and high margin, I suspect.)

The problem was that there were too many products that would match.  The attributes clustered too much for the algorithm, so the code for this slot would pull back thousands of products from the catalog.  It would turn each row in the result set into an object, then weed through them in memory.

Without that slot, the page would render in under a second.  With it, two minutes, or worse.

It had been present for more than two years. You might ask, "How could that go unnoticed for two years?" Well, it didn't, of course. But, because it had always been that way, most everyone was just used to it. When the wait times would get too bad, this one guy would just restart app servers until it got better.

Removing that slot from the page not only improved their stability, it vastly increased their capacity. Imagine how much more they could have added to the bottom line if they hadn't overspent for the last two years to compensate. 

At my second stop, the site suffered from serious stability problems. At any given time, it was even odds that at least one app server would be vapor locked. Three to five times a day, that would ripple through and take down all the app servers. One key symptom was a sudden spike in database connections.

Some nice work by the DBAs revealed a query from the app servers that was taking way too long. No query from a web app should ever take more than half a second, but this one would run for 90 seconds or more. Usually that means the query logic is bad.  In this case, though, the logic was OK, but the query returned 1.2 million rows. The app server would doggedly convert those rows into objects in a Vector, right up until it started thrashing the garbage collector. Eventually, it would run out of memory, but in the meantime, it held a lot of row locks.  All the other app servers would block on those row locks.  The team applied a band-aid to the query logic, and those crashes stopped.

What's the common factor here? It's what I call an "Unbounded Result Set".  Neither of these applications limited the amount of data they requested, even though there certainly were limits to how much they could process.  In essence, both of these applications trusted their databases.  The apps weren't prepared for the data to be funky, weird, or oversized. They assumed too much.

You should make your apps be paranoid about their data.   If your app processes one record at a time, then looping through an entire result set might be OK---as long as you're not making a user wait while you do.  But if your app that turns rows into objects, then it had better be very selective about its SELECTs.  The relationships might not be what you expect.  The data producer might have changed in a surprising way, particularly if it's not under your control.  Purging routines might not be in place, or might have gotten broken.  Definitely don't trust some other application or batch job to load your data in a safe way.

No matter what odd condition your app stumbles across in the database, it should not be vulnerable.

Read-write splitting with Oracle

Speaking of databases and read/write splitting, Oracle had a session at OpenWorld about it.

Building a read pool of database replicas isn't something I usually think of doing with Oracle, mainly due to their non-zero license fees.  It changes the scaling equation.

Still, if you are on Oracle and the fees work for you, consider Active Data Guard.   Some key facts from the slides:

  • Average latency for replication was 1 second
  • The maximum latency spike they observed was 10 seconds.
  • A node can take itself offline if it detects excessive latency.
  • You can use DBLinks to allow applications to think they're writing to a read node.  The node will transparently pass the writes through to the master.
  • This can be done without any tricky JDBC proxies or load-balancing drivers, just the normal Oracle JDBC driver with the bugs we all know and love.
  • Active Data Guard requires Oracle 11g.

A Dozen Levels of Done

What does "done" mean to you?  I find that my definition of "done" continues to expand. When I was still pretty green, I would say "It's done" when I had finished coding.  (Later, a wiser and more cynical colleague taught me that "done" meant that you had not only finished the work, but made sure to tell your manager you had finished the work.)

The next meaning of "done" that I learned had to do with version control. It's not done until it's checked in.

Several years ago, I got test infected and my definition of "done" expanded to include unit testing.

Now that I've lived in operations for a few years and gotten to know and love Lean Software Development, I have a new definition of "done".

Here goes:

A feature is not "done" until all of the following can be said about it:

  1. All unit tests are green.
  2. The code is as simple as it can be.
  3. It communicates clearly.
  4. It compiles in the automated build from a clean checkout.
  5. It has passed unit, functional, integration, stress, longevity, load, and resilience testing.
  6. The customer has accepted the feature.
  7. It is included in a release that has been branched in version control.
  8. The feature's impact on capacity is well-understood.
  9. Deployment instructions for the release are defined and do not include a "point of no return".
  10. Rollback instructions for the release are defined and tested.
  11. It has been deployed and verified.
  12. It is generating revenue.

Until all of these are true, the feature is just unfinished inventory.

Two Ways To Boost Your Flagging Web Site

Being fast doesn't make you scalable. But it does mean you can handle more capacity with your current infrastructure. Take a look at this diagram of request handlers.

13 Threads Needed When Requests Take 700ms

You can see that it takes 13 request handling threads to process this amount of load. In the next diagram, the requests arrive at the same rate, but in this picture it takes just 200 milliseconds to answer each one.

3 Threads Needed When Requests Take 200ms

Same load, but only 3 request handlers are needed at a time. So, shortening the processing time means you can handle more transactions during the same unit of time.

Suppose you're site is built on the classic "six-pack" architecture shown below. As your traffic grows and the site slows, you're probably looking at adding more oomph to the database servers. Scaling that database cluster up gets expensive very quickly. Worse, you have to bulk up both guns at once, because each one still has to be able to handle the entire load. So you're paying for big boxes that are guaranteed to be 50% idle.

Classic Six Pack

Let's look at two techniques almost any site can use to speed up requests, without having the Hulk Hogan and Andre the Giant of databases lounging around in your data center.

Cache Farms

Cache farming doesn't mean armies of Chinese gamers stomping rats and making vests. It doesn't involve registering a ton of domain names, either.

Pretty much every web app is already caching a bunch of things at a bunch of layers. Odds are, your application is already caching database results, maybe as objects or maybe just query results. At the top level, you might be caching page fragments. HTTP session objects are nothing but caches. The net result of all this caching is a lot of redundancy. Every app server instance has a bunch of memory devoted to caching. If you're running multiple instances on the same hosts, you could be caching the same object once per instance.

Caching is supposed to speed things up, right? Well, what happens when those app server instances get short on memory? Those caches can tie up a lot of heap space. If they do, then instead of speeding things up, the caches will actually slow responses down as the garbage collector works harder and harder to free up space.

So what do we have? If there are four app instances per host, then a frequently accessed object---like a product featured on the home page---will be duplicated eight times. Can we do better? Well, since I'm writing this article, you might suspect the answer is "yes". You'd be right.

The caches I've described so far are in-memory, internal caches. That is, they exist completely in RAM and each process uses its own RAM for caching. There exist products, commercial and open-source, that let you externalize that cache. By moving the cache out of the app server process, you can access the same cache from multiple instances, reducing duplication. Getting those objects out of the heap, You can make the app server heap smaller, which will also reduce garbage collection pauses. If you make the cache distributed, as well as external, then you can reduce duplication even further.

External caching can also be tweaked and tuned to help deal with "hot" objects. If you look at the distribution of accesses by ID, odds are you'll observe a power law. That means the popular items will be requested hundreds or thousands of times as often as the average item. In a large infrastructure, making sure that the hot items are on cache servers topologically near the application servers can make a huge difference in time lost to latency and in load on the network.

External caches are subject to the same kind of invalidation strategies as internal caches. On the other hand, when you invalidate an item from each app server's internal cache, they're probably all going to hit the database at about the same time. With an external cache, only the first app server hits the database. The rest will find that it's already been re-added to the cache.

External cache servers can run on the same hosts as the app servers, but they are often clustered together on hosts of their own. Hence, the cache farm.

Six Pack With Cache Farm

If the external cache doesn't have the item, the app server hits the database as usual. So I'll turn my attention to the database tier.

Read Pools

The toughest thing for any database to deal with is a mixture of read and write operations. The write operations have to create locks and, if transactional, locks across multiple tables or blocks. If the same tables are being read, those reads will have highly variable performance, depending on whether a read operation randomly encounters one of the locked rows (or pages, blocks, or tables, depending).

But the truth is that your application almost certainly does more reads than writes, probably to an overwhelming degree. (Yes, there are some domains where writes exceed reads, but I'm going to momentarily disregard mindless data collection.) For a travel site, the ratio will be about 10:1. For a commerce site, it will be from 50:1 to 200:1. There are a lot of variables here, especially when you start doing more effective caching, but even then, the ratios are highly skewed.

When your database starts to get that middle-age paunch and it just isn't as zippy as it used to be, think about offloading those reads. At a minimum, you'll be able to scale out instead of up. Scaling out with smaller, consistent, commodity hardware pleases everyone more than forklift upgrades. In fact, you'll probably get more performance out of your writes once all that pesky read I/O is off the write master.

How do you create a read pool? Good news! It uses nothing more than built-in replication features of the database itself. Basically, you just configure the write master to ship its archive logs (or whatever your DB calls them) to the read pool databases. They spin up the logs to bring their state into synch with the write master.

Six Pack With Cache Farm and Read Pool

By the way, for read pooling, you really want to avoid database clustering approaches. The overhead needed for synchronization obviates the benefits of read pooling in the first place.

At this point, you might be objecting, "Wait a cotton-picking minute! That means the read machines are garun-damn-teed to be out of date!" (That's the Foghorn Leghorn version of the objection. I'll let you extrapolate the Tony Soprano and Geico Gecko versions yourself.) You would be correct. The read machines will always reflect an earlier point in time.

Does that matter?

To a certain extent, I can't answer that. It might matter, depending on your domain and application. But in general, I think it matters less often than it seems. I'll give you an example from the retail domain that I know and love so well. Take a look at this product detail page from BestBuy.com. How often do you think each data field on that page changes? Suppose there is a pricing error that needs to be corrected immediately (for some definition of immediately.) What's the total latency before that pricing error will be corrected? Let's look at the end-to-end process.

  1. A human detects the pricing error.
  2. The observer notifies the responsible merchant.
  3. The merchant verifies that the price is in error and determines the correct price.
  4. Because this is an emergency, the merchant logs in to the "fast path" system that bypasses the nightly batch cycle.
  5. The merchant locates the item and enters the correct price
  6. She hits the "publish" button.
  7. The fast path system connects to the write master in production and updates the price.
  8. The read pool receives the logs with the update and applies them.
  9. The read pool process sends a message to invalidate the item in the app servers' caches.
  10. The next time users request that product detail page, they see the correct price.

That's the best-case scenario! In the real world, the merchant will be in a meeting when the pricing error is found. It may take a phone call or lookup from another database to find out the correct price. There might be a quick conference call to make the decision whether to update the price or just yank the item off the site. All in all, it might take an hour or two before the pricing error gets corrected. Whatever the exact sequence of events, odds are that the replication latency from the write master to the read pool is the very least of the delays.

Most of the data is much less volatile or critical than the price. Is an extra five minutes of latency really a big deal? When it can save you a couple of hundred thousand dollars on giant database hardware?

Summing It Up

The reflexive answer to scaling is, "Scale out at the web and app tiers, scale up in the data tier." I hope this shows that there are other avenues to improving performance and capacity.

References

For more on read pooling, see Cal Henderson's excellent book, Building Scalable Web Sites: Building, scaling, and optimizing the next generation of web applications.

The most popular open-source external caching framework I've seen is memcached. It's a flexible, multi-lingual caching daemon.

On the commercial side, GigaSpaces provides distributed, external, clustered caching. It adapts to the "hot item" problem dynamically to keep a good distribution of traffic, and it can be configured to move cached items closer to the servers that use them, reducing network hops to the cache.

Two Quick Observations

Several of the speakers here have echoed two themes about databases.

1. MySQL is in production in a lot of places. I think the high cost of commercial databases (read: Oracle) leads to a kind of budgetechture that concentrates all data in a single massive database. If you remove that cost from the equation, the idea of either functionally partitioning your data stores or creating multiple shards becomes much more palatable.

2. By far the most common database cluster structure has one write master with many read masters. Ian Flint spoke to us about the architectures behind Yahoo Groups and Yahoo Bix. Bix has 30 MySQL read servers and just one write master. Dan Pritchett from eBay had a similar ratio. (His might have been 10:1 rather than 30:1.) In a commerce site, where 98% of the traffic is browsing and only 2% is buying, a read-pooled cluster makes a lot of sense.

 

Three Vendors Worth Evaluating

Several vendors are sponsoring QCon. (One can only wonder what the registration fees would be if they didn't.) Of these, I think three have products worth immediate evaluation.

Semmle

In the category of "really cool, but would I pay for it?" is Semmle. Their flagship product, SemmleCode, lets you treat your codebase as a database against which you can run queries. SemmleCode groks the static structure of your code, including relationships and dependencies. Along the way, it calculates pretty much every OO metric yet invented. It also looks at the source repository.

What can you do with it? Well, you can create a query that shows you all the cyclic dependencies in your code. The results can be rendered as a tree with explanations, a graph, or a chart. Or, you can chart your distribution of cyclomatic complexity scores over time. You can look for the classes or packages most likely to create a ripple effect.

Semmle ships with a sample project: the open-source drawing framework JHotDraw. In a stunning coincidence, I'm a contributor to JHotDraw. I wrote the glue code that uses Batik to export a drawing as SVG. So I can say with confidence, that when Semmle showed all kinds of cyclic dependencies in the exporters, it's absolutely correct. Every one of the queries I saw run against JHotDraw confirmed my own experience with that codebase. Where Semmle indicated difficulty, I had difficulty. Where Semmle showed JHotDraw had good structure, it was easy to modify and extend.

There are an enormous number of things you could do with this, but one thing they currently lack is build-time automation. Semmle integrates with Eclipse, but not ANT or Maven. I'm told that's coming in a future release.

3Tera

Virtualization is a hot topic. VMWare has the market lead in this space, but I'm very impressed with 3Tera's AppLogic.

AppLogic takes virtualization up a level.  It lets you visually construct an entire infrastructure, from load balancers to databases, app servers, proxies, mail exchangers, and everything. These are components they keep in a library, just like transistors and chips in a circuit design program.

Once you've defined your infrastructure, a single button click will deploy the whole thing into the grid OS. And there's the rub. AppLogic doesn't work with just any old software and it won't work on top of an existing "traditional" infrastructure.

As a comparison, HP's SmartFrog just runs an agent on a bunch of Windows, Linux, or HP-UX servers. A management server sends instructions to the agents about how to deploy and configure the necessary software. So SmartFrog could be layered on top of an existing traditional infrastructure.

Not so with AppLogic. You build a grid specifically to support this deployment style. That makes it possible to completely virtualize load balancers and firewalls along with servers. Of course, it also means complete and total lock-in to 3tera.

Still, for someone like a managed hosting provider, 3tera offers the fastest, most complete definition and provisioning system I've seen.

GigaSpaces

What can I say about GigaSpaces? Anyone who's heard me speak knows that I adore tuple-spaces. GigaSpaces is a tuple-space in the same way that Tibco is a pub-sub messaging system. That is to say, the foundation is a tuple-space, but they've added high-level capabilities based on their core transport mechanism.

So, they now have a distributed caching system.  (They call it an "in-memory data grid". Um, OK.) There's a database gateway, so your front end can put a tuple into memory (fast) while a back-end process takes the tuple and writes it into the database.

Just this week, they announced that their entire stack is free for startups. (Interesting twist: most companies offer the free stuff to open-source projects.) They'll only start charging you money when you get over $5M in revenue. 

I love the technology. I love the architecture.

Architecting for Latency

Dan Pritchett, Technical Fellow at eBay, spoke about "Architecting for Latency". His aim was not to talk about minimizing latency, as you might expect, but rather to architect as though you believe latency is unavoidable and real.

We all know the effect latency can have on performance. That's the first-level effect. If you consider synchronous systems---such as SOAs or replicated DR systems---then latency has a dramatic effect on scalability as well. Whenever a synchronous call reaches across a long wire, the latency gets added directly to the processing time.

For example, if client A calls service B, then A's processing time will be at least the sum of B's processing time, plus the latency between A and B. (Yes, it seems obvious when you state it like that, but many architectures still act as though latency is zero.)

Furthermore, latency over IP networks is fundamentally variable. That means A's performance is unpredictable, and can never be made predictable.

Latency also introduces semantic problems. A replicated database will always have some discrepancy with the master database. A functionally or horizontally partitioned system will either allow discrepancies or must serialize traffic and give up scaling. You can imagine that eBay is much more interested in scaling than serializing traffic.

For example, when a new item is posted to eBay, it does not immediately show up in the search results. The ItemNode service posts a message that eventually causes the item to show up in search results. Admittedly, this is kept to a very short period of time, but still, the item will reach different data centers at different times. So, the search service inside the nearest data center will get the item before the search service inside the farthest. I suspect many eBay users would be shocked, and probably outraged, to hear that shoppers see different search results depending on where they are.

Now, the search service is designed to get consistent within a limited amount of time---for that item. With a constant flow of items, being posted from all over the country, you can imagine that there is a continuous variance among the search services. Like the quantum foam, however, this is near-impossible to observe. One user cannot see it, because a user gets pinned to a single data center. It would take multiple users, searching in the right category, with synchronized clocks, taking snapshots of the results to even observe that the discrepancies happen. And even then, they would only have a chance of seeing it, not a certainty. 

Another example. Dan talked about payment transfer from one user to another. In the traditional model, that would look something like this.

Synchronous write to both databases

You can think of the two databases as being either shards that contain different users or tables that record different parts of the transaction.

This is a design that pretends latency doesn't exist. In other words, it subscribes to Fallacy #2 of the Fallacies of Distributed Computing. Performance and scalability will suffer here.

(Frankly, it has an availability problem, too, because the availability of the payment service Pr(Payment) will now be Pr(Database 1) * Pr(Network 1) * Pr(Database 2) * Pr(Network 2). In other words, the availability of the payment service is coupled to the availability of Database 1, Database 2, and the two networks connecting Payment to the two databases.)

Instead, Dan recommends a design more like this:

Payment service with back end reconciliation

In this case, the payment service can set an expectation with the user that the money will be credited within some number of minutes. The availability and performance of the payment service is now independent from that of Database 2 and the reconciliation process. Reconciliation happens in the back end. It can be message-driven, batch-driven, or whatever. The main point is that it is decoupled in time and space from the payment service. Now the databases can exist in the same data center or on separate sides of the globe. Either way, the performance, availability, and scalability characteristics of the payment service don't change. That's architecting for latency.

Instead of ACID semantics, we think about BASE semantics. (A cute retronym for Basically Available Soft-state Eventually-consistent.) 

Now, many analysts, developers, and business users will object to the loss of global consistency. We heard a spirited debate last night between Dan and Nati Shalom, founder and CTO of GigaSpaces about that very subject.

I have two arguments of my own to support architecting for latency.

First, any page you show to a user represents a point-in-time snapshot. It can always be inaccurate even by the time you finish generating the page. Think about a commerce site saying "Ships in 2 - 3 days". That's true at the instant when the data is fetched. Ten milliseconds later, it might not be true. By the time you finish generating the page and the user's browser finishes rendering it (and fetching the 29 JavaScript files needed for the whizzy AJAX interface) the data is already a few seconds old. So global consistency is kind of useless in that case, isn't it? Besides, I can guarantee there's already a large amount of latency in the data path from inventory tracking to the availability field in the commerce database anyway.

Second, the cost of global consistency is global serialization. If you assume a certain volume of traffic you must support, the cost of a globally consistent solution will be a multiple of the cost of a latency-tolerant solution. That's because global consistency can only be achieved by making a single master system of record. When you try to reach large scales, that master system of record is going to be hideously expensive.

Latency is simply an immutable fact of the universe. If we architect for it, we can use it to our advantage. If we ignore it, we will instead be its victims.

For more of Dan's thinking about latency, see his article on InfoQ.com

SOA Without the Edifice

Sometimes the best interactions at a conference aren't the talks, they're shouting. An crowded bar with an over-amped DJ may seem like an unlikely place for a discussion on SOA. Even so, when it's Jim Webber, ThoughtWorks' SOA practice lead doing the shouting, it works. Given that Jim's topic is "Guerilla SOA", shouting is probably more appropriate than the hushed and reverential cathedral whispers that usually attend SOA discussions.

Jim's attitude is that SOA projects tend to attract two things: Taj Mahal architects and parasitic vendors. (My words, not Jim's.) The combined efforts of these two groups results in monumentally expensive edifices that don't deliver value. Worse still, these efforts consume work and attention that could go to building services aligned with the real business processes, not some idealized vision of what the processes ought to be.

Jim says that services should be aligned with business processes. When the business process changes, change the service. (To me, this automatically implies that the service cannot be owned by some enterprise governance council.) When you retire the business process, simply retire the service.

These sound like such common sense, that it's hard to imagine they could be controversial.

I'll be in the front row for Jim's talk later today.

 

Cameron Purdy: 10 Ways to Botch Enterprise Java Scalability and Reliability

Here at QCon, well-known Java developer Cameron Purdy gave a fun talk called "10 Ways to Botch Enterprise Java Scalability and Reliability".  (He also gave this talk at JavaOne.)

While I could quibble with Cameron's counting---there were actually more like 16 points thanks to some numerical overloading---I liked his content.  He echoes many of the antipatterns from Release It.   In particular, he talks about the problem I call "Unbounded Result Sets".  That is, whether using an ORM tool or straight database queries, you can always get back more than you expect. 

Sometimes, you get back way, way more than you expect. I once saw a small messaging table, that normally held ten or twenty rows, grow to over ten million rows.  The application servers never contemplated there could be so many messages.  Each one would attempt to fetch the entire contents of the table and turn them into objects.  So, each app server would run out of memory and crash.  That rolled back the transaction, allowing the next app server to impale itself on the same table.

Unbounded Result Sets don't just happen from "SELECT * FROM FOO;", though.  Think about an ORM handling the parent-child relationship for you.  Simply calling something like customer.getOrders() will return every order for that customer.  By writing that call, you implicitly assume that the set of orders for a customer will always be small.  Maybe.  Maybe not.  How about blogUser.getPosts()?  Or tickerSymbol.getTrades()?

Unbounded Result Sets also happen with web services and SOAs.  A seemingly innocuous request for information could create an overwhelming deluge---an avalanche of XML that will bury your system.  At the least, reading the results can take a long time.  In the worst case, you will run out of memory and crash.

The fundamental flaw with an Unbounded Result Set is that you are trusting someone else not to harm you, either a data producer or a remote web service. 

Take charge of your own safety! 

Be defensive!

Don't get hurt again in another dysfunctional relationship!

Kent Beck's Keynote: "Trends in Agile Development"

Kent Beck spoke with his characteristic mix of humor, intelligence, and empathy.  Throughout his career, Kent has brought a consistently humanistic view of development.  That is, software is written by humans--emotional, fallible, creative, and messy--for other humans.  Any attempt to treat development as robotic will end in tears.

During his keynote, Kent talked about engaging people through appreciative inquiry.  This is a learnable technique, based in human psychology, that helps focus on positive attributes.  It counters the negaitivity that so many developers and engineers are prone to.  (My take: we spend a lot of time, necessarily, focusing on how things can go wrong.  Whether by nature or by experience, that leads us to a pessimistic view of the world.)

Appreciative inquiry begins by asking, "What do we do well?"  Even if all you can say is that the garbage cans get emptied every night, that's at least something that works well.  Build from there.

He specifically recommended The Thin Book of Appreciative Inquiry, which I've already ordered.

I should also note that Kent has a new book out, called Implementation Patterns, which he described as being about, "Communicating with other people, through code."

From QCon San Francisco

I'm at QCon San Francisco this week.  (An aside: after being a speaker at No Fluff, Just Stuff, it's interesting to be the audience again.  As usual, on returning from travels in a different domain, one has a new perspective on familiar scenes.) This conference targets senior developers, architects, and project managers.  One of the very appealing things is the track on "Architectures you've always wondered about".  This coveres high-volume architectures for sites such as LinkedIn and eBay as well as other networked applications like Second Life.  These applications live and work in thin air, where traffic levels far outstrip most sites in the world.  Performance and scalability are two of my personal themes, so I'm very interested in learning from these pioneers about what happens when you've blown past the limits of traditional 3-tier, app-server centered architecture.

Through the remainder of the week, I'll be blogging five ideas, insights, or experiences from each day of the conference.

You Keep Using That Word. I Do Not Think It Means What You Think It Means.

"Scalable" is a tricky word. We use it like there's one single definition. We speak as if it's binary: this architecture is scalable, that one isn't.

The first really tough thing about scalability is finding a useful definition. Here's the one I use:

Marginal revenue / transaction > Marginal cost / transaction

The cost per transaction has to account for all cost factors: bandwidth, server capacity, physical infrastructure, administration, operations, backups, and the cost of capital.

(And, by the way, it's even better when the ratio of revenue to cost per transaction grows as the volume increases.)

The second really tough thing about scalability and architecture is that there isn't one that's right.  An architecture may work perfectly well for a range of transaction volumes, but fail badly as one variable gets large.

Don't treat "scalability" as either a binary issue or a moral failing. Ask instead, "how far will this architecture scale before the marginal cost deteriorates relative to the marginal revenue?" Then, follow that up with, "What part of the architecture will hit a scaling limit, and what can I incrementally replace to remove that limit?"

Engineering in the White Space

"Is software Engineering, or is it Art?"

Debate between the Artisans and the Engineers has simmered, and occasionally boiled, since the very introduction of the phrase "Software Engineering".  I won't restate all the points on both sides here, since I would surely forget someone's pet argument, and also because I see no need to be redundant.

Deep in my heart, I believe that building programs is art and architecture, but not engineering.

But, what if you're not just building programs?

Programs and Systems

A "program" has a few characteristics that I'll assign here:

  1. It accepts input.
  2. It produces output.
  3. It runs a sequence of instructions.
  4. Statically, it exhibits cohesion in its executable form. [*]
  5. Dynamically, it exhibits cohesion in its address space. [**]

* That is, the transitive closure of all code to be executed is finite, although it may not all be known in advance of execution.  This allows dynamic extension via plugins, but not, for example, dynamic execution of any scripts or code found on the Web.  So, a web browser is a program, but Javascript executed on some page is an independent program, not part of the browser itself.

** For "address space", feel free to substitute "object space", "process space", or "virtual memory". Cohesion requires that all the code that can access the address space should be regarded as a single program.  (IPC through shared memory is a special case of an output, and should be considered more akin to a database or memory-mapped file than to part of the program's own address space.)

Suppose you have two separate scripts that each manipulate the same database.  I would regard those as two separate---though not independent---programs.  A single instance of Tomcat may contain several independent programs, but all the servlets in one EAR file are part of one program.

For the moment, I will not consider trivial objections, such as two distinct sets of functionality that happen to be packaged and delivered in a single EAR file.  It's less interesting to me whether code does access the entire address space then whether it could.  A library checkout program that includes functions for both librarians and patrons may not use common code for card number lookup, but it could.  (And, arguably, it should.)  That makes it one program, in my eyes.

A "System", on the other hand, consists of interdependent programs that have commonalities in their inputs and outputs.  They could be arranges in a chain, a web, or a loop.  No matter, if one program's input depends on another program's output, then they are part of a system.

Systems can be composed, whereas programs cannot.  

Tricky White Space

Some programs run all the time, responding to intermittent inputs, these we call "servers".  It is very common to see servers represented as a deceptively simple little rectangle on a diagram.  Between servers, we draw little arrows to indicate communication, of some sort.

One little arrow might mean, "Synchronous request/reply using SOAP-XML over HTTP." That's quite a lot of information for one little glyph to carry.  There's not usually enough room to write all that, so we label the unfortunate arrow with either "XML over HTTP"---if viewing it from an internal perspective---or "SKU Lookup"---if we have an external perspective.

That little arrow, bravely bridging the white space between programs, looks like a direct contact.  It is Voyager, carrying its recorded message to parts unknown.  It is Aricebo, blasting a hopeful greeting into the endless dark.

Well, not really...

These days, the white space isn't as empty as it once was.  A kind of lumeniferous ether fills the void between servers on the diagram.

The Substrate

There is many a slip 'twixt cup and lip.  In between points A and B on our diagram, there exist some or all of the following:

  • Network interface cards
  • Network switches
  • Layer 2 - 3 firewalls
  • Layer 7 (application) firewalls
  • Intrusion Detection and Prevention Systems
  • Message queues
  • Message brokers
  • XML transformation engines
  • Flat file translations
  • FTP servers
  • Polling jobs
  • Database "landing zone" tables
  • ETL scripts
  • Metro-area SoNET rings
  • MPLS gateways
  • Trunk lines
  • Oceans
  • Ocean liners
  • Phillipine fishing trawlers (see, "Underwater Cable Break")

Even in the simple cases, there will be four or five computers between program A and B, each running their own programs to handle things like packet switching, traffic analysis, routing, threat analysis, and so on.

I've seen a single arrow, running from one server to another, labelled "Fulfillment".  It so happened that one server was inside my client's company while the other server was in a fulfillment house's company.  That little arrow, so critical to customer satisfaction, really represented a Byzantine chain of events that resembled a game of "Mousetrap" more than a single interface.  It had messages going to message brokers that appended lines to files, which were later picked up by an hourly job that would FTP the files to the "gateway" server (still inside my client's company.)  The gateway server read each line from the file and constructed and XML message, which it then sent via HTTP to the fulfillment house.

It Stays Up

We analogize bridge-building as the epitome of engineering. (Side note: I live in the Twin Cities area, so we're a little leery of bridge engineering right now.  Might better find another analogy, OK?)  Engineering a bridge starts by examining the static and dynamic load factors that the bridge must support: traffic density, weight, wind and water forces, ice, snow, and so on.

Bridging between two programs should consider static and dynamic loads, too.  Instead of just "SOAP-XML over HTTP", that one little arrow should also say, "Expect one query per HTTP request and send back one response per HTTP reply.  Expect up to 100 requests per second, and deliver responses in less than 250 milliseconds 99.999% of the time."

It Falls Down

Building the right failure modes is vital. The last job of any structure is to fall down well. The same is true for programs, and for our hardy little arrow.

The interface needs to define what happens on each end when things come unglued. What if the caller sends more than 100 requests per second? Is it OK to refuse them? Should the receiver drop requests on the floor, refuse politely, or make the best effort possible?

What should the caller do when replies take more than 250 milliseconds? Should it retry the call? Should it wait until later, or assume the receiver has failed and move on without that function?

What happens when the caller sends a request with version 1.0 of the protocol and gets back a reply in version 1.1? What if it gets back some HTML instead of XML?  Or an MP3 file instead of XML?

When a bridge falls down, it is shocking, horrifying, and often fatal. Computers and networks, on the other hand, fall down all the time.  They always will.  Therefore, it's incumbent on us to ensure that individual computers and networks fail in predictable ways. We need to know what happens to that arrow when one end disappears for a while.

In the White Space

This, then, is the essence of engineering in the white space. Decide what kind of load that arrow must support.  Figure out what to do when the demand is more than it can bear.  Decide what happens when the substrate beneath it falls apart, or when the duplicitous rectangle on the other end goes bonkers.

Inside the boxes, we find art.

The arrows demand engineering.

 

The 5 A.M. Production Problem

I've got a new piece up at InfoQ.com, discussing the limits of unit and functional testing: 

"Functional testing falls short, however, when you want to build software to survive the real world. Functional testing can only tell you what happens when all parts of the system are behaving within specification. True, you can coerce a system or subsystem into returning an error response, but that error will still be within the protocol! If you're calling a method on a remote EJB that either returns "true" or "false" or it throws an exception, that's all it will do. No amount of functional testing will make that method return "purple". Nor will any functional test case force that method to hang forever, or return one byte per second.

One of my recurring themes in