Wide Awake Developers

S3 Outage Report and Perspective


Amazon has issued a more detailed statement explaining the S3 outage from June 20, 2008.  In my company, we’d call this a "Post Incident Report" or PIR. It has all the necessary sections:

  • Observed behavior
  • Root cause analysis
  • Followup actions: corrective and operational

This is exactly what I’d expect from any mature service provider.

There are a few interesting bits from the report. First, the condition seems to have arisen from an unexpected failure mode in the platform’s self-management protocol. This shouldn’t surprise anyone. It’s a new way of doing business, and some of the most innovative software development, applied at the largest scales. Bugs will creep in.

In fact, I’d expect to find more cases of odd emergent behavior at large scale.

Second, the back of my envelope still shows S3 at 99.94% availability for the year. That’s better than most data center providers. It’s certainly better than most corporate IT departments do.
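Here's roughly how that envelope math works. The downtime figure below is my own assumption for illustration, not an official number from Amazon:

```python
# Back-of-the-envelope availability math. The ~5.25 hours of downtime
# is an assumed figure for illustration, not Amazon's official number.
HOURS_PER_YEAR = 365 * 24  # 8760

def availability(downtime_hours: float) -> float:
    """Percentage of the year the service was up."""
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)

print(round(availability(5.25), 2))  # 99.94
```

A little over five hours of outage in a year still leaves you north of 99.9%.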

Third, Amazon rightly recognizes that transparency is a necessary condition for trust. Many service providers would fall into the "bunker mentality" of the embattled organization. That’s a deteriorating spiral of distrust, coverups, and spin control. Transparency is most vital after an incident. If you cannot maintain transparency then, it won’t matter at any other time.


Article on Building Robust Messaging Applications


I’ve talked before about adopting a failure-oriented mindset. That means you should expect every component of your system or application to someday fail. In fact, they’ll usually fail at the worst possible times.

When a component does fail, whatever unit of work it’s processing at the time will most likely be lost. If that unit of work is backed up by a transactional database, well, you’re in luck. The database will do its Omega-13 bit on the transaction and it’ll be like nothing ever happened.

Of course, if you’ve got more than one database, then you either need two-phase commit or pixie dust. (OK, compensating transactions can help, too, but only if the thing that failed isn’t the thing that would do the compensating transaction.)

I don’t favor distributed transactions, for a lot of reasons. They’re not scalable, and I find that the risk of deadlock goes way up when you’ve got multiple systems accessing multiple databases. Yes, uniform lock ordering will prevent that, but I never want to trust my own application’s stability to good coding practices in other people’s apps.
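To make the lock-ordering point concrete, here’s a minimal sketch of the discipline. Every thread acquires locks in one fixed global order (here, by object id), so no two threads can ever hold each other’s next lock:

```python
import threading

# Two resources that multiple threads need simultaneously.
lock_a = threading.Lock()
lock_b = threading.Lock()

def acquire_in_order(*locks):
    """Always acquire locks in a fixed global order (here: by id), the
    classic deadlock-prevention discipline. If any thread skips this
    helper and grabs locks in its own order, the guarantee is gone --
    which is exactly why trusting other people's apps is risky."""
    for lock in sorted(locks, key=id):
        lock.acquire()
    return locks

def release_all(locks):
    for lock in locks:
        lock.release()

def worker(first, second, results, label):
    held = acquire_in_order(first, second)
    results.append(label)
    release_all(held)

# Both threads request the locks in opposite textual order, but the
# helper normalizes the order, so they cannot deadlock.
results = []
t1 = threading.Thread(target=worker, args=(lock_a, lock_b, results, "t1"))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a, results, "t2"))
t1.start(); t2.start(); t1.join(); t2.join()
print(sorted(results))  # ['t1', 't2']
```

The catch, as noted above, is that the discipline only works if every application touching those databases follows it.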

Besides, enterprise integration through database access is just… icky.

Messaging is the way to go. Messaging offers superior scalability, better response time to users, and better resilience against partial system failures. It also provides enough spatial, temporal, and logical decoupling between systems that you can evolve the endpoints independently.

Udi Dahan has published an excellent article with several patterns for robust messaging. It’s worth reading, and studying. He addresses the real-world issues you’ll encounter when building messaging apps, such as giant messages clogging up your queue, or "poison" messages that sit in the front of the queue causing errors and subsequent rollbacks.
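One common defense against poison messages, sketched here in toy form (this is my illustration of the general retry-then-quarantine pattern, not Udi’s code): retry a failing message a bounded number of times, then move it to a dead-letter area so it stops blocking the head of the queue.

```python
from collections import deque

MAX_ATTEMPTS = 3

def drain(queue, handler):
    """Process messages; after MAX_ATTEMPTS failures a message is moved
    to a dead-letter list instead of blocking the head of the queue."""
    dead_letters = []
    attempts = {}
    while queue:
        msg = queue.popleft()
        try:
            handler(msg)
        except Exception:
            attempts[msg] = attempts.get(msg, 0) + 1
            if attempts[msg] >= MAX_ATTEMPTS:
                dead_letters.append(msg)   # quarantine the poison message
            else:
                queue.append(msg)          # retry later, not immediately
    return dead_letters

# A handler that chokes on one particular message.
def handler(msg):
    if msg == "poison":
        raise ValueError("cannot parse")

q = deque(["ok-1", "poison", "ok-2"])
print(drain(q, handler))  # ['poison']
```

Real brokers give you dead-letter queues for exactly this reason; the point is that the healthy messages keep flowing.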

Kingpins of Filthy Data


If large amounts of dirty data are actually valuable, how do you go about collecting it? Who’s in the best position to amass huge piles?

One strategy is to scavenge publicly visible data. Go screen-scrape whatever you can from web sites. That’s Google’s approach, along with one camp of the Semantic Web tribe.

Another approach is to give something away in exchange for that data. Position yourself as a connector or hub. Brokers always have great visibility. The IM servers, the Twitter crowd, and the social networks in general sit in the middle of great networks of people. LinkedIn is pursuing this approach, as are Twitter+Summize, and BlogLines. Facebook has already made multiple, highly creepy, attempts to capitalize on their "man-in-the-middle" status. Meebo is in a good spot, and trying to leverage it further. Metcalfe’s Law will make it hard to break into this space, but once you do, your visibility is a great natural advantage.

Aggregators get to see what people are interested in. FriendFeed is sitting on a torrential flow of dirty data. ("Sewage", perhaps?) FeedBurner sees the value in their dirty data.

Anyone at the endpoint of traffic should be able to get good insight into their own world. While the aggregators and hubs get global visibility, the endpoints are naturally more limited. Still, that shouldn’t stop them from making the most of the dirt flowing their way. Amazon has done well here.

Sun is making a run at this kind of visibility with Project Hydrazine, but I’m skeptical. They aren’t naturally in a position to collect it, and off-to-the-side instrumentation is never as powerful. Although, companies like Omniture have made a market out of off-to-the-side instrumentation, so there’s a possibility there.

Carriers like Verizon, Qwest, and AT&T have fantastic visibility into the traffic crossing their networks, but as parties in a regulated industry, they are mostly prohibited from looking at it.

So, if you’re a carrier or a transport network, you’re well positioned to amass tons of dirty data. If you are a hub or broker, then you’ve already got it. Otherwise, consider giving away a service to bring people in. Instead of supporting it with ad revenue, support it by gleaning valuable insight.

Just remember that a little bit of dirty data is a pain in the ass, but mountains of it are pure gold.

Inverting the Clickstream


Continuing my theme of filthy data.

A few years ago, there was a lot of excitement around clickstream analysis. This was the idea that, by watching a user’s clicks around a website, you could predict things about that user.

What a backwards idea.

For any given user, you can imagine a huge number of plausible explanations for any given browsing session. You’ll never enumerate all the use cases that motivate someone to spend ten minutes on seven pages of your web site.

No, the user doesn’t tell us much about himself by his pattern of clicks.

But the aggregate of all the users’ clicks… that tells us a lot! Not about the users, but about how the users perceive our site. It tells us about ourselves!

A commerce company may consider two products to be related for any number of reasons. Deliberate cross-selling, functional alignment, interchangeability, whatever. Any such relationships we create between products in the catalog only reflect how we view our own catalog. Flip that around, though, and look at products that the users view as related. Every day, in every session, users are telling us that products have some relationship to each other.

Hmm. But, then, what about those times when I buy something for myself and something for my kids during the same session? Or when I got that prank gift for my brother?

Once you aggregate all that dirty data, weak connections like the prank gift will just be part of the background noise. The connections that stand out from the noise are the real ones, the only ones that ultimately matter.
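A minimal sketch of that aggregation, with made-up session data: count how often each pair of products shows up in the same session, and keep only pairs that clear a support threshold. The prank gift appears once and stays in the noise.

```python
from collections import Counter
from itertools import combinations

# Each session is the set of products viewed together. The prank gift
# shows up once; real relationships recur across many sessions.
sessions = [
    {"camera", "memory-card"},
    {"camera", "memory-card", "tripod"},
    {"camera", "memory-card"},
    {"camera", "gag-gift"},  # one-off prank purchase: noise
]

pair_counts = Counter()
for session in sessions:
    for pair in combinations(sorted(session), 2):
        pair_counts[pair] += 1

MIN_SUPPORT = 2  # connections weaker than this stay in the background noise
related = {pair for pair, n in pair_counts.items() if n >= MIN_SUPPORT}
print(("camera", "memory-card") in related)  # True
print(("camera", "gag-gift") in related)     # False
```

With millions of sessions instead of four, the threshold does even more of the work for you.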

This is an inversion of the clickstream. It tells us nearly nothing about the clicker. Instead, it illuminates the clickee.

Mounds of Filthy Data


Data is the future.

The barriers to entering online business are pretty low, these days. You can do it with zero infrastructure, which means no capital spent on depreciating assets like servers and switches. Open source operating systems, databases, servers, middleware, libraries, and development tools mean that you don’t spend money on software licenses or maintenance contracts. All you need is an idea, followed by a SMOP.

With both sides of the cost equation trending toward zero, how can there be any barrier to entry?

The "classic" answer is the network effect, also known as Metcalfe’s Law. (The word "classic" in web business models means anything more than two years old, of course.) The first Twitter user didn’t get a whole lot out of it. The ten-million-and-first gets a lot more benefit. That makes it tough for a newcomer like Plurk to get an edge.
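The arithmetic behind that intuition is simple: the value of a network grows roughly with the number of possible pairwise connections, which is quadratic in the number of users.

```python
def potential_connections(n: int) -> int:
    """Metcalfe's Law intuition: value grows roughly with the number of
    possible pairwise connections, n*(n-1)/2 -- quadratic in users."""
    return n * (n - 1) // 2

print(potential_connections(1))   # 0 -- the first Twitter user, talking to nobody
print(potential_connections(4))   # 6
```

The first user gets zero connections; each new user makes the network more valuable for everyone already there.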

I see a new model emerging, though. Metcalfe’s Law is part of it, keeping people engaged. The best thing about having users, though, is that they do things. Every action by every user tells you something, if you can keep track of it all.

Twitter gets a lot of its value from the people connected at the endpoints. But, they also get enormous power from being the hub in the middle of it. Imagine what you can do when you see the content of every message passing through a system that large. A few things come to mind right away. You could extract all the links that people are posting to see what’s hot today. (Zeitgeist.) You could use semantic analysis to tell how people feel about current topics, like Presidential candidates in the U.S. You could track product names and mentions to see which products delight people and which cause frustration. You could publish a slang dictionary that actually keeps up! The possibilities are enormous.
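The "what's hot today" idea is almost embarrassingly easy to prototype. Here's a toy version with a few fake messages standing in for the firehose:

```python
import re
from collections import Counter

# Toy message stream standing in for the real firehose.
messages = [
    "check out http://example.com/a -- amazing",
    "http://example.com/a is making the rounds",
    "lunch was good. also http://example.com/b",
]

URL_RE = re.compile(r"https?://\S+")

def hot_links(msgs, top=1):
    """Count every link posted across all messages; the most-repeated
    links are what's hot today."""
    counts = Counter(url for m in msgs for url in URL_RE.findall(m))
    return counts.most_common(top)

print(hot_links(messages))  # [('http://example.com/a', 2)]
```

The semantic-analysis ideas are much harder, of course, but the hub position makes even this trivial counting valuable.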

Ah, I can already sense an objection forming. How the heck is anyone supposed to figure out all that stuff from noisy, messy textual human communication? We’re cryptic, ironic, and oblique. We sometimes mean the exact opposite of what we say. Any machine intelligence that tries to grok all of Twitter will surely self-destruct, right? That supposed "data" is just a big steaming pile of human contradictions!

In my view, though, it’s the dirtiness of the data that makes it beautiful. Yes, there will be contradictions. There will be ironic asides. But, those will come out in the wash. They’ll be balanced out by the sincere, meaningful, or obvious. Not every message will be semantically clear or consistent, but given enough messy data, clear patterns will still emerge.

There’s the key: enough data to see patterns. Large amounts. Huge amounts. Vast piles of filthy data.

Over the next couple of days, I’ll post a series of entries exploring how to amass dirty data, who’s got a natural advantage, and programming models that work with it.

Hard Problems in Architecture


Many of the really hard problems in web architecture today exist outside the application server.  Here are three problems that haven’t been solved. Partial solutions exist today, but nothing comprehensive.

Uncontrolled demand

Users tend to arrive at web sites in huge gobs, all at once. As the population of the Net continues to grow, and the need for content aggregators/filters grows, the "front page" effect will get worse.

One flavor of this is the "Attack of Self-Denial", an email, radio, or TV campaign that drives enough traffic to crash the site.  Marketing can slow you down. Really good marketing can kill you at any time.
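One of the partial solutions that does exist is shedding excess load at the front door so a spike degrades service instead of crashing it. Here's a sketch of a token-bucket admission gate (my illustration, with made-up parameters):

```python
import time

class TokenBucket:
    """Admit requests at a sustainable rate and shed the rest, so a
    marketing-driven traffic spike degrades service instead of
    crashing the site."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # shed this request

bucket = TokenBucket(rate_per_sec=100, burst=5)
# Simulate a sudden gob of 1000 near-simultaneous arrivals.
admitted = sum(1 for _ in range(1000) if bucket.allow())
print(admitted)  # only a small fraction of the burst gets through
```

The users you turn away get a fast "try again" instead of the whole site falling over for everyone.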

Versioning, deployment and rollback

With large scale applications, as with large enterprise integrations, versioning rapidly becomes a problem. Versioning of schemas, static assets, protocols, and interfaces. Add in a dash of SOA, and you have a real nightmare. You can count on having at least one interface broken at any given time. Or, you introduce such powerful governors that nothing ever changes.

As the number of nodes increases, you eventually find that there’s always at least one deployment going on. A "deployment" becomes less of a point-in-time activity than it is a rolling wave. A new service version will take hours or days to be deployed to every node. In the meantime, both the old and new service version must coexist peacefully. Since both service versions will need to support multiple protocol versions (see above) you have a combinatorial problem.
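The usual survival tactic is to make the version explicit in every message and dispatch on it, so old and new clients can both be served while the rolling wave is in progress. A sketch (the field names and versions here are hypothetical):

```python
# During a rolling deployment, old and new nodes coexist, so every node
# must speak every protocol version still in flight. Field names and
# version numbers below are hypothetical.

def handle_v1(payload):
    # v1 sent a single "name" field.
    return {"full_name": payload["name"]}

def handle_v2(payload):
    # v2 split the field; normalize to the same internal shape.
    return {"full_name": payload["first"] + " " + payload["last"]}

HANDLERS = {1: handle_v1, 2: handle_v2}

def handle(message):
    """Dispatch on an explicit version field so both protocol versions
    are served during the rolling wave."""
    version = message.get("version", 1)
    return HANDLERS[version](message["payload"])

print(handle({"version": 1, "payload": {"name": "Ada Lovelace"}}))
print(handle({"version": 2, "payload": {"first": "Ada", "last": "Lovelace"}}))
```

Multiply that dispatch table by every service and every in-flight schema change and you can see where the combinatorial problem comes from.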

And, of course, some of these deployments will have problems of their own. Today, many application deployments are "one way" events. The deployment process itself has irreversibly destructive effects. This will have to change, so every deployment can be done both forward and back. Oh, and every deployment will also be deploying assets to multiple targets—web, application, and database—while also triggering cache flushes and, possibly, metadata changes to external partners like Akamai.

Applications will need to participate in their own versioning, deployment, and management. 

Blurring the lines

There used to be a distinction between the application and the infrastructure it ran on. That meant you could move applications around and they would behave pretty much the same in a development environment as in the real production environment. These days, firewalls, load balancers, and caching appliances blur the lines between "infrastructure" and "application". It’s going to get worse as the lines between "operational support system" and "production applications" get blurred, too. Automated provisioning, SLA management, and performance management tools will all have interactions with the applications they manage. These will inevitably introduce unexpected interactions… in a word, bugs.

Creeping Fees


A couple of years ago, the Minneapolis-St. Paul airport introduced self-pay parking gates. Scan a credit card on the way in and on the way out, and it just debits the card. This obviously saves money on parking attendants, and it’s pretty convenient for parkers.

At first, to encourage adoption, they offered a discount of $2 per day. Every time you’d approach the entry, a friendly voice from a Douglas Adams novel would ask, "Would you like to save $2 per day on parking?" For general parking, that meant $14 instead of $16 per day.

Some time later, this switched from being an incentive for adopting the system to a penalty for avoiding it. How? They raised the rates by $2 per day. So now, the top rate if you use self-pay is back to $16. If you don’t use it, then your top rate bumped up to $18. Clearly they put somebody from the banking industry in charge of this parking system.

Now, it’s changed again, from $2 per day to $2 per transaction. So it’s just $2 off the top of whatever your overall parking fees are.

This gradual creep is really interesting. I wonder what the next step will be. A $2 per year discount would be one way to approach it. Maybe a "frequent parker" program. More likely the discount will drop to $1 per transaction, or it will just be discarded altogether.

That’s OK with me, because swiping the credit card is still more convenient than exchanging cash money with a human anyway.

Besides, back when it was cash based, I always got tagged with the ATM fee anyway. 

Word Cloud Bandwagon


Wordle has been memeing its way around the ‘Net lately.  Figured I’d join the crowd by doing a word cloud for Release It.  This is from the preface.


Considering that this is just from fairly simple text analysis, I’m surprised at how accurately it represents the key concerns. "Software" and "money" have roughly equal prominence. "Life" appears near the middle, along with "excitement", "revenue", "production" and "systems". Not bad for an algorithm.

Webber and Fowler on SOA Man-Boobs


InfoQ posted a video of Jim Webber and Martin Fowler doing a keynote speech at QCon London this Spring. It’s a brilliant deconstruction of the concept of the Enterprise Service Bus. I can attest that they’re both funny and articulate (whether on the stage or off.)

Along the way, they talk about building services incrementally, delivering value at every step along the way. They advocate decentralized control and direct alignment between services and the business units that own them. 

I agree with every word, though I’m vaguely uncomfortable with how often they say "enterprise man boobs".

Coincidence or Back-end Problem?


An odd thing happened to me today. Actually, an odd thing happened yesterday, but it’s having the same odd thing happen today that really makes it odd. With me so far?

Yesterday, while I was shopping at Amazon, Amazon told me that my American Express card had expired. While it is set for a May expiration, it’s several years in the future. I didn’t think too much of it, because when I re-entered the same information, Amazon accepted it.

Today, I got the same thing with the same card on iTunes!

Online stores don’t do a whole lot with your credit cards. For the most part, they just make a call out to a credit card processor. Small stores have to go through a second-tier CCVS system that charges a few pennies per transaction. Large ones—and do they get larger than Amazon?—generally connect directly to a payment processor. The payment processor may charge a fraction of a cent per transaction, but they definitely make it up in volume.

(There are other business factors, too, like the committed transaction volume, response time SLAs, and the like.)

Asynchronously, the payment processor collects from the issuing bank. It’s the issuing bank that actually bills you, and sets your interest rate and payment terms.

Whereas VISA and MasterCard work with thousands of issuers, American Express doesn’t. When you get an AmEx card, they are the issuing bank as well as the payment processor.

Which makes it highly suspect that the same card gave me the same error through two different sites. It makes me think that American Express has introduced a bug in their validation system, causing spurious declines for expiration.
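Pure speculation, but here's one way such a bug could look. Suppose someone compares the expiration date with the fields in the wrong order: any card expiring in an earlier month of a later year suddenly looks expired. Everything below is hypothetical; I have no idea what AmEx's validation code actually does.

```python
from datetime import date

def card_expired_buggy(exp_month: int, exp_year: int, today: date) -> bool:
    """Hypothetical bug: comparing (month, year) instead of (year, month)
    makes any card expiring in an earlier month of a later year look
    expired -- e.g. a May expiration several years out."""
    return (exp_month, exp_year) < (today.month, today.year)

def card_expired_fixed(exp_month: int, exp_year: int, today: date) -> bool:
    # A card is good through the last day of its expiration month.
    return (exp_year, exp_month) < (today.year, today.month)

today = date(2008, 6, 27)
# Card expires May 2012 -- a May expiration several years in the future,
# like the one in this story.
print(card_expired_buggy(5, 2012, today))  # True: spurious decline
print(card_expired_fixed(5, 2012, today))  # False
```

A one-character tuple-ordering slip, and every June you'd spuriously decline every card with a January-through-May expiration. It would fit the symptoms, anyway.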