Wide Awake Developers


Beyond the Village

As an organization scales up, it must navigate several transitions. If it fails to make these transitions well, it will stall out or disappear.

One of them happens when the company grows larger than "village-sized". In a village of about 150 people or fewer, it's possible to know everyone else. Larger than that, and you need some kind of secondary structure, because personal relationships don't reach from every person to every other person. Not coincidentally, this is also the size at which you see startups introducing mid-level management.

There are other factors that can bring this on sooner. If the company is split across several locations, people at one location will lose track of those at the others. Likewise, if the company is split into different practice areas or functional groups, those groups will tend to become separate villages of their own. In either case, the village transition will arrive well before the headcount reaches 150.

It's a tough transition, because it takes the company from a flat, familial structure to a hierarchical one. That implicitly shifts the basis of status from pure merit to position in the hierarchy. Low-numbered employees may find themselves suddenly reporting to a newcomer with no historical context. It shouldn't come as a surprise when long-time employees start leaving, but somehow the founders never expect it.

This is also when the founders start to lose touch with day-to-day execution. They need to recognize that they will never again know every employee by name, family, skills, and goals. Beyond village size, the founders have to be professional managers. Of course, this may also be when the board (if there is one) brings in some professional managers. It shouldn't come as a surprise when founders start getting replaced, but somehow they never expect it.

 

S3 Outage Report and Perspective

Amazon has issued a more detailed statement explaining the S3 outage from June 20, 2008.  In my company, we'd call this a "Post Incident Report" or PIR. It has all the necessary sections:

  • Observed behavior
  • Root cause analysis
  • Followup actions: corrective and operational

This is exactly what I'd expect from any mature service provider.

There are a few interesting bits from the report. First, the condition seems to have arisen from an unexpected failure mode in the platform's self-management protocol. This shouldn't surprise anyone. It's a new way of doing business, and some of the most innovative software development, applied at the largest scales. Bugs will creep in.

In fact, I'd expect to find more cases of odd emergent behavior at large scale.

Second, the back of my envelope still shows S3 at 99.94% availability for the year. That's better than most data center providers offer. It's certainly better than most corporate IT departments manage.
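
For context, here's the envelope math in sketch form. The downtime figure below is simply back-solved from my 99.94% estimate, not lifted from Amazon's report:

    # Back-of-the-envelope availability math. Treat the numbers as illustrative.
    HOURS_PER_YEAR = 365.25 * 24            # about 8,766 hours

    def availability(downtime_hours):
        """Fraction of the year the service was up."""
        return 1.0 - downtime_hours / HOURS_PER_YEAR

    # 99.94% availability works out to roughly five and a quarter hours
    # of cumulative downtime over a whole year:
    print((1 - 0.9994) * HOURS_PER_YEAR)    # ~5.26 hours
    print(availability(5.26))               # ~0.9994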

Third, Amazon rightly recognizes that transparency is a necessary condition for trust. Many service providers would fall into the "bunker mentality" of the embattled organization. That's a deteriorating spiral of distrust, coverups, and spin control. Transparency is most vital after an incident. If you cannot maintain transparency then, it won't matter at any other time.

 

Article on Building Robust Messaging Applications

I've talked before about adopting a failure-oriented mindset. That means you should expect every component of your system or application to someday fail. In fact, they'll usually fail at the worst possible times.

When a component does fail, whatever unit of work it's processing at the time will most likely be lost. If that unit of work is backed by a transactional database, well, you're in luck. The database will do its Omega-13 bit on the transaction and it'll be like nothing ever happened.

Of course, if you've got more than one database, then you either need two-phase commit or pixie dust. (OK, compensating transactions can help, too, but only if the thing that failed isn't the thing that would do the compensating transaction.)

I don't favor distributed transactions, for a lot of reasons. They're not scalable, and I find that the risk of deadlock goes way up when you've got multiple systems accessing multiple databases. Yes, uniform lock ordering will prevent that, but I never want to trust my own application's stability to good coding practices in other people's apps.
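
For reference, "uniform lock ordering" just means every caller acquires shared locks in the same global sequence. Here's a minimal sketch, with invented account IDs; as I said, the rule only protects you if every application touching the data plays along:

    import threading

    # One lock per account, keyed by account ID. The IDs and the transfer
    # example are invented purely to illustrate the ordering rule.
    locks = {acct: threading.Lock() for acct in ("A-100", "B-200", "C-300")}

    def transfer(from_acct, to_acct, amount, balances):
        # Acquire locks in one global order (here, sorted by key). If every
        # caller follows the same rule, no two callers can ever hold locks
        # in opposite orders, which is the condition that produces deadlock.
        first, second = sorted((from_acct, to_acct))
        with locks[first]:
            with locks[second]:
                balances[from_acct] -= amount
                balances[to_acct] += amount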

Besides, enterprise integration through database access is just... icky.

Messaging is the way to go. Messaging offers superior scalability, better response time to users, and better resilience against partial system failures. It also provides enough spatial, temporal, and logical decoupling between systems that you can evolve the endpoints independently.

Udi Dahan has published an excellent article with several patterns for robust messaging. It's worth reading, and studying. He addresses the real-world issues you'll encounter when building messaging apps, such as giant messages clogging up your queue, or "poison" messages that sit in the front of the queue causing errors and subsequent rollbacks.
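
To make the "poison message" case concrete, here's one common mitigation in sketch form (my illustration, not code from Udi's article): count delivery attempts and shunt repeat offenders to a dead-letter queue. The in-memory queues stand in for whatever broker you actually use; enqueue work as (0, message) and the consumer does the rest.

    import queue

    MAX_ATTEMPTS = 3
    work_q = queue.Queue()          # stand-in for the real message broker
    dead_letters = queue.Queue()    # parked messages, for human inspection

    def consume(handler):
        while not work_q.empty():
            attempts, msg = work_q.get()
            try:
                handler(msg)        # normally wrapped in a transaction
            except Exception:
                if attempts + 1 >= MAX_ATTEMPTS:
                    # Poison message: park it instead of letting it sit at
                    # the front of the queue, erroring and rolling back.
                    dead_letters.put(msg)
                else:
                    work_q.put((attempts + 1, msg))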

Kingpins of Filthy Data

If large amounts of dirty data are actually valuable, how do you go about collecting it? Who's in the best position to amass huge piles?

One strategy is to scavenge publicly visible data. Go screen-scrape whatever you can from web sites. That's Google's approach, along with one camp of the Semantic Web tribe.

Another approach is to give something away in exchange for that data. Position yourself as a connector or hub. Brokers always have great visibility. The IM servers, the Twitter crowd, and the social networks in general sit in the middle of great networks of people. LinkedIn is pursuing this approach, as are Twitter+Summize, and BlogLines. Facebook has already made multiple, highly creepy, attempts to capitalize on their "man-in-the-middle" status. Meebo is in a good spot, and trying to leverage it further. Metcalfe's Law will make it hard to break into this space, but once you do, your visibility is a great natural advantage.

Aggregators get to see what people are interested in. FriendFeed is sitting on a torrential flow of dirty data. ("Sewage", perhaps?) FeedBurner sees the value in their dirty data.

Anyone at the endpoint of traffic should be able to get good insight into their own world. While the aggregators and hubs get global visibility, the endpoints are naturally more limited. Still, that shouldn't stop them from making the most of the dirt flowing their way. Amazon has done well here.

Sun is making a run at this kind of visibility with Project Hydrazine, but I'm skeptical. They aren't naturally in a position to collect the data, and off-to-the-side instrumentation is never as powerful. Then again, companies like Omniture have made a market out of off-to-the-side instrumentation, so there may be a possibility there.

Carriers like Verizon, Qwest, and AT&T have fantastic visibility into the traffic crossing their networks, but as parties in a regulated industry, they are mostly prohibited from taking advantage of it.

So, if you're a carrier or a transport network, you're well positioned to amass tons of dirty data. If you are a hub or broker, then you've already got it. Otherwise, consider giving away a service to bring people in. Instead of supporting it with ad revenue, support it by gleaning valuable insight.

Just remember that a little bit of dirty data is a pain in the ass, but mountains of it are pure gold.

Inverting the Clickstream

Continuing my theme of filthy data.

A few years ago, there was a lot of excitement around clickstream analysis. This was the idea that, by watching a user's clicks around a website, you could predict things about that user.

What a backwards idea.

For any given user, you can imagine a huge number of plausible explanations for any given browsing session. You'll never enumerate all the use cases that motivate someone to spend ten minutes on seven pages of your web site.

No, the user doesn't tell us much about himself by his pattern of clicks.

But the aggregate of all the users' clicks... that tells us a lot! Not about the users, but about how the users perceive our site. It tells us about ourselves!

A commerce company may consider two products to be related for any number of reasons. Deliberate cross-selling, functional alignment, interchangeability, whatever. Any such relationships we create between products in the catalog only reflect how we view our own catalog. Flip that around, though, and look at the products that the users view as related. Every day, in every session, users are telling us that products have some relationship to each other.

Hmm. But, then, what about those times when I buy something for myself and something for my kids during the same session? Or when I got that prank gift for my brother?

Once you aggregate all that dirty data, weak connections like the prank gift will just be part of the background noise. The connections that stand out from the noise are the real ones, the only ones that ultimately matter.
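
In code, the inversion is almost embarrassingly small. A rough sketch, assuming you've already chopped the clickstream into per-session lists of product IDs; the threshold is a made-up knob you'd tune against your own noise floor:

    from collections import Counter
    from itertools import combinations

    def related_products(sessions, min_count=10):
        """sessions: an iterable of lists of product IDs viewed together.
        Returns the pairs that co-occur often enough to rise above the noise."""
        pair_counts = Counter()
        for products in sessions:
            for pair in combinations(sorted(set(products)), 2):
                pair_counts[pair] += 1
        # Weak, one-off connections (the prank gifts) fall below the threshold.
        return {pair: n for pair, n in pair_counts.items() if n >= min_count}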

This is an inversion of the clickstream. It tells us nearly nothing about the clicker. Instead, it illuminates the clickee.

Mounds of Filthy Data

Data is the future.

The barriers to entering online business are pretty low, these days. You can do it with zero infrastructure, which means no capital spent on depreciating assets like servers and switches. Open source operating systems, databases, servers, middleware, libraries, and development tools mean that you don't spend money on software licenses or maintenance contracts. All you need is an idea, followed by a SMOP.

With both capital and software costs trending toward zero, how can there be any barrier to entry?

The "classic" answer is the network effect, also known as Metcalfe's Law. (The word "classic" in web business models means anything more than two years old, of course.) The first Twitter user didn't get a whole lot out of it. The ten-million-and-first gets a lot more benefit. That makes it tough for a newcomer like Plurk to get an edge.

I see a new model emerging, though. Metcalfe's Law is part of it, keeping people engaged. The best thing about having users, though, is that they do things. Every action by every user tells you something, if you can keep track of it all.

Twitter gets a lot of its value from the people connected at the endpoints. But, they also get enormous power from being the hub in the middle of it. Imagine what you can do when you see the content of every message passing through a system that large. A few things come to mind right away. You could extract all the links that people are posting to see what's hot today. (Zeitgeist.) You could use semantic analysis to tell how people feel about current topics, like Presidential candidates in the U.S. You could track product names and mentions to see which products delight people and which cause frustration. You could publish a slang dictionary that actually keeps up! The possibilities are enormous.
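
As a taste of how little machinery the first of those ideas needs, here's a sketch of the "what's hot today" case: pull the links out of a pile of messages and count them. The firehose itself, and all the real-world URL cleanup, are assumed away:

    import re
    from collections import Counter

    URL_PATTERN = re.compile(r"https?://\S+")

    def hot_links(messages, top_n=10):
        """messages: an iterable of message texts. Returns the most-posted links."""
        counts = Counter()
        for text in messages:
            for url in URL_PATTERN.findall(text):
                counts[url.rstrip(".,)")] += 1    # trim trailing punctuation
        return counts.most_common(top_n)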

Ah, I can already sense an objection forming. How the heck is anyone supposed to figure out all that stuff from noisy, messy textual human communication? We're cryptic, ironic, and oblique. We sometimes mean the exact opposite of what we say. Any machine intelligence that tries to grok all of Twitter will surely self-destruct, right? That supposed "data" is just a big steaming pile of human contradictions!

In my view, though, it's the dirtiness of the data that makes it beautiful. Yes, there will be contradictions. There will be ironic asides. But, those will come out in the wash. They'll be balanced out by the sincere, meaningful, or obvious. Not every message will be semantically clear or consistent, but given enough messy data, clear patterns will still emerge.

There's the key: enough data to see patterns. Large amounts. Huge amounts. Vast piles of filthy data.

Over the next couple of days, I'll post a series of entries exploring how to amass dirty data, who's got a natural advantage, and programming models that work with it.

Hard Problems in Architecture

Many of the really hard problems in web architecture today exist outside the application server.  Here are three problems that haven't been solved. Partial solutions exist today, but nothing comprehensive.

Uncontrolled demand

Users tend to arrive at web sites in huge gobs, all at once. As the population of the Net continues to grow, and the need for content aggregators/filters grows, the "front page" effect will get worse.

One flavor of this is the "Attack of Self-Denial", an email, radio, or TV campaign that drives enough traffic to crash the site.  Marketing can slow you down. Really good marketing can kill you at any time.
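
One partial mitigation is admission control at the front door: once you're past the load you can serve well, shed the excess quickly instead of letting it drag everyone down. A rough sketch of the idea, with an invented capacity number:

    import threading

    MAX_IN_FLIGHT = 200    # invented capacity figure; measure your own system
    capacity = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def handle(request, serve):
        """Serve the request if there's headroom; otherwise fail fast."""
        if not capacity.acquire(blocking=False):
            # Better to hand some users a fast "come back soon" page than to
            # hand every user a hung connection.
            return "503 Service Unavailable"
        try:
            return serve(request)
        finally:
            capacity.release()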

Versioning, deployment and rollback

With large scale applications, as with large enterprise integrations, versioning rapidly becomes a problem. Versioning of schemas, static assets, protocols, and interfaces. Add in a dash of SOA, and you have a real nightmare. You can count on having at least one interface broken at any given time. Or, you introduce such powerful governors that nothing ever changes.

As the number of nodes increases, you eventually find that there's always at least one deployment going on. A "deployment" becomes less a point-in-time activity and more a rolling wave. A new service version will take hours or days to reach every node. In the meantime, the old and new service versions must coexist peacefully. Since both will need to support multiple protocol versions (see above), you have a combinatorial problem.
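
Coexistence usually means every consumer tolerates more than one protocol version at once, for as long as the rolling wave takes. A minimal sketch of that shape, with invented field names and version numbers:

    def parse_order(message):
        """Accept both the old and the new message formats during a rolling deploy."""
        version = message.get("version", 1)    # old senders don't stamp a version
        if version == 1:
            # v1 sent a single "name" field.
            first, _, last = message["name"].partition(" ")
        elif version == 2:
            # v2 split it into separate fields.
            first, last = message["first_name"], message["last_name"]
        else:
            raise ValueError("unsupported message version: %r" % version)
        return {"first_name": first, "last_name": last, "sku": message["sku"]}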

And, of course, some of these deployments will have problems of their own. Today, many application deployments are "one way" events. The deployment process itself has irreversibly destructive effects. This will have to change, so every deployment can be rolled forward and back. Oh, and every deployment will also be pushing assets to multiple targets---web, application, and database---while also triggering cache flushes and, possibly, metadata changes to external partners like Akamai.

Applications will need to participate in their own versioning, deployment, and management. 

Blurring the lines

There used to be a distinction between the application and the infrastructure it ran on. That meant you could move applications around and they would behave pretty much the same in a development environment as in the real production environment. These days, firewalls, load balancers, and caching appliances blur the lines between "infrastructure" and "application". It's going to get worse as the lines between "operational support system" and "production applications" get blurred, too. Automated provisioning, SLA management, and performance management tools will all have interactions with the applications they manage. These will inevitably introduce unexpected interactions... in a word, bugs.