Wide Awake Developers


Can you make that meeting?

I'm convinced that the next great productivity revolution will be de-matrixing the organizations we've just spent ten years slicing and dicing.

Yesterday, I ran into a case in point: What are the odds that three people can schedule a meeting this week versus having to push it into next week?

Turns out that if they're each 75% utilized, then there's only a 15% chance they can schedule a one hour meeting this week. (If you always schedule 30 minute meetings instead of one hour, then the odds go up to about 25%.)

Here's the probability curve that the meeting can happen. This assumes, by the way, that there are no lunches or vacation days, and that all parties are in the same time zone. It only gets worse from here.

So, overall, there's about an 85% chance that 3 random people in a meeting-driven company will have to defer until next week.

Bring it up to 10 people, in a consensus-driven, meeting-oriented company, and the odds drop to 0.00095%.
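The figures above are consistent with a very simple model, sketched here in Python. It assumes each person is free in any given slot with probability (1 - utilization), independently of everyone else and of every other slot. The count of 10 candidate one-hour slots is an inference that happens to reproduce the quoted numbers; the post does not state it, and `meeting_probability` is a name made up for this sketch.

```python
def meeting_probability(people, utilization, slots):
    """Chance that at least one slot is free for everyone.

    Assumes independent people and independent slots (a simplification).
    """
    # Probability that every attendee is free in one particular slot.
    p_slot_free = (1 - utilization) ** people
    # Probability that at least one of the candidate slots works.
    return 1 - (1 - p_slot_free) ** slots


# Assumed: 10 one-hour slots, or 20 half-hour slots, left in the week.
three_people = meeting_probability(people=3, utilization=0.75, slots=10)
half_hours = meeting_probability(people=3, utilization=0.75, slots=20)
ten_people = meeting_probability(people=10, utilization=0.75, slots=10)

print(f"3 people, 1-hour slots:  {three_people:.1%}")
print(f"3 people, 30-min slots:  {half_hours:.1%}")
print(f"10 people, 1-hour slots: {ten_people:.5%}")
```

Under those assumptions the three-person case comes out near 15% and the ten-person case near 0.00095%, matching the text. The half-hour case lands a little above the quoted 25%.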

No wonder "time to first meeting" seems to dominate "time to do stuff."

Amazon as the new Intel

Update: Please read this update. The information underlying this post was based on early, somewhat garbled, reports.

A brief digression from the unpleasantness of reliability.

This morning, Sun announced their re-entry into the cloud computing market. After withdrawing Network.com from the marketplace a few months ago, we were all wondering what Sun's approach would be. No hardware vendor can afford to ignore the cloud computing trend... it's going to change how customers view their own data centers and hardware purchases.

One thing that really caught my interest was the description of Sun's cloud offering. It sounded really, really similar to AWS. Then I heard the E-word and it made perfect sense. Sun announced that they will use EUCALYPTUS as the control interface to their solution. EUCALYPTUS is an open-source implementation of the AWS APIs.

Last week at QCon London, we heard Simon Wardley give a brilliant talk, in which he described Canonical's plan to create a de facto open standard for cloud computing by seeding the market with open source implementations. Canonical's plan? Ubuntu and private clouds running EUCALYPTUS.

It looks like Amazon may be setting the standard for cloud computing, in the same way that Intel set the standard for desktop and server computing, by defining the programming interface.

I don't worry about this, for two reasons. One, it forestalls any premature efforts to force a de jure standard. This space is still young enough that an early standard can't help but be a drag on exploration of different business and technical models. Two, Amazon has done an excellent job as a technical leader. If their APIs "win" and become de facto standards, well, we could do a lot worse.

Tracking and Trouble

Pick something in your world and start measuring it.  Your measurements will surely change a little from day to day. Track those changes over a few months, and you might have a chart something like this.

First 100 samples

Now that you've got some data assembled, you can start analyzing it. The average over this sample is 59.5. It's got a variance of 17, which is about 28% of the mean. You can look for trends. For example, we seem to see an upswing for the first few months, then a pullback starting around 90 days into the cycle. In addition, it looks like there is a pretty regular oscillation superimposed on the main trend, so you might be looking at some kind of weekly pattern as well.

The next few months of data should make the patterns clearer.

First 200 samples.

Indeed, from this chart, it looks pretty clear that the pullback around 100 days was the early indicator of a flattening in the overall growth trend from the first few months. Now, the weekly oscillations are pretty much the only movement, with just minor wobbles around a ceiling.

I'll fast forward and show the full chart, spanning 1000 samples (nearly three years' worth of daily measurements.)

Full chart of 1000 samples

Now we can see that the ceiling established at 65 held against upward pressure until about 250 days in, when it finally gave way and we reached a new support at about 80. That support lasted for another year, when we started to see some gradual downward pressure resulting in a pullback to the mid-70s.

You've probably realized by now that I'm playing a bit of a game with you. These charts aren't from any stock market or weather data. In fact, they're completely random. I started with a base value of 55 and added a little random value each "day".
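A series like this takes only a few lines to generate. The sketch below starts at 55, as described, and adds a small random step each "day". The step distribution (uniform between -1 and 1) and the function name `random_walk` are assumptions; the post doesn't say what random function was actually used.

```python
import random


def random_walk(samples=1000, base=55.0, max_step=1.0, seed=None):
    """Return `samples` values, each the previous value plus a random step."""
    rng = random.Random(seed)  # seedable, so a chart can be reproduced
    values = [base]
    for _ in range(samples - 1):
        # Assumed step: uniform in [-max_step, +max_step].
        values.append(values[-1] + rng.uniform(-max_step, max_step))
    return values


series = random_walk(seed=42)
first_100 = series[:100]
mean = sum(first_100) / len(first_100)
print(f"mean of first 100 samples: {mean:.1f}")
```

Try a few different seeds and plot the results. Most runs will show "trends", "ceilings", and "pullbacks" that look every bit as meaningful as the ones in the charts above.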

When you see the final chart, it's easy to see it as the result of a random number generator. If you were to live this chart, day by day, however, it's exceedingly hard not to impose some kind of meaning or interpretation on it. The tough part is that you actually can see some patterns in the data. I didn't force the weekly oscillations into the random number function; they just appeared in the graph. We are all exceptionally good at pattern detection and matching. We're so good, in fact, that we find patterns all over the place. When we are confronted with obvious patterns, we tend to believe that they're real or that they emerge from some underlying, meaningful structure. But sometimes, they're really nothing more than randomness.

Nassim Nicholas Taleb is today's guru of randomness, but Benoit Mandelbrot wrote about it earlier in the decade, and Benjamin Graham wrote about this problem back in the 1920's. I suspect someone has sounded this warning every decade since statistics were invented. Graham, Mandelbrot, and Taleb all tell us that, if we set out to find patterns in historical data, we will always find them. Whether those patterns have any intrinsic meaning is another question entirely. Unless we discover that there are real forces and dynamics that underlie the data, we risk fooling ourselves again and again.

We can't abandon the idea of prediction, though. Randomness is real, and we have a tendency to be fooled by it. Still, even in the face of those facts, we really do have to make predictions and forecasts. Fortunately, there are about a dozen really effective ways to deal with the fundamental uncertainty of the future. I'll spend a few posts exploring them.

Constraint, Chaos, Collapse

Patrick Mueller has an interesting post about being brainwashed into believing that the outrageous is normal. It's a good read. (Hat tip to Reddit, whence many good things.) As often happens, I wrote such a long comment to his post that I felt it worthwhile to repost here.

My comment revolves around this chart of the Dow Jones Industrial Average over the last eighty years. (For the record, I'm not disputing anything about the rest of Patrick's post. In fact, I agree with most of what he says. This chart and my comments aren't central to his discussion about web development.) Some of you know that I've worked in finance before, and most of you know I have an interest in dynamics and complex systems. It's been an interesting year.

Here's a snapshot of the chart in question. It's from Yahoo! Finance, and the image links to the live chart.

Most of the chart looks like an exponential, which suggests the effect of compound growth. In a functioning capital-based system you'd expect exactly that. Capital invested produces more capital. Any time an output is also a required input, you get exponential growth. One of Patrick's other commenters points out that it looks almost linear when plotted on a logarithmic scale... a dead giveaway of an exponential.
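To see why a straight line on a log scale gives the exponential away, here is a small sketch. The starting value, the 7% rate, and the function name `compound` are arbitrary choices for illustration, not figures from the chart. If the value is multiplied by (1 + rate) every period, then its logarithm grows by the same constant, log(1 + rate), every period, which is exactly what a straight line is.

```python
import math


def compound(initial, rate, periods):
    """Values of an investment whose output is reinvested every period."""
    return [initial * (1 + rate) ** n for n in range(periods + 1)]


values = compound(initial=100.0, rate=0.07, periods=10)

# Period-over-period change in the raw values keeps growing...
raw_steps = [b - a for a, b in zip(values, values[1:])]
# ...but the change in log(value) is constant.
log_steps = [math.log(b) - math.log(a) for a, b in zip(values, values[1:])]

print("raw steps:", [round(s, 2) for s in raw_steps])
print("log steps:", [round(s, 4) for s in log_steps])
```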

No real system can produce infinite growth. Instead, they always hit a constraint. That could be a physical limitation on the available inputs. It could be a limit on the throughput of the system itself. In a sense, it almost doesn't matter what the constraint itself happens to be. Rather, you should assume that a constraint exists.

In systems with a chaotic tendency, the system doesn't slow down at all when approaching the constraint. In fact, it may be increasing at its greatest rate just before the constraint clamps down hardest. In such cases, you'll either see a catastrophic collapse or a chaotic fluctuation.

I don't know what the true constraint was in the financial system. Plenty of other people believe they know, and I'm happy to let them believe what they like. Just from looking at the chart, though, you could make a strong case that we really hit the constraint in 1999 and the rest has been chaos since then.

The Infamous Seinfeld-Gates Ad

The Seinfeld/Gates ad is so laughably bad that people are already building indexes of the negative reactions, less than 24 hours after it launched.

I have my own take on it.

Gates is the most recognizable geek on the planet. For most non-techies, he is the archetype of geekhood.

What kind of name recognition does Steve Ballmer have?  Outside of developers, developers, developers, and developers.  Would a silver-haired manager ever use him for a cheesy business analogy in a meeting?  Nope. Blank looks all around.  Tiger Woods and Bill Gates make good metaphors. Steve Ballmer doesn't.

Ray Ozzie? Not a chance. Even most techies don't know who Ozzie is.

This commercial wasn't about churros, The Conquistador, or briefs riding up. It was all about one line.

"Brain meld".

It slipped by fast, but that was it. That was the line where billg@microsoft.com began the public torch-passing ceremony.

A couple more spots, and we'll see either Ballmer or Ozzie entering the plot. Then we get the handoff, where John Q. Public is now meant to understand, "OK, Bill Gates has retired, but he's passed his wireframe glasses and nervous tics on to this guy."

Seriously, it's torch-passing.  Don't believe me? You will when you see Ballmer air-running past a giant BSOD in the final ad.

Agile Tool Vendors

There seems to be something inherently contradictory about "Enterprise" agile tool vendors. There's never been a tool invented that's as flexible in use or process as the 3x5 card. No matter what, any tool must embed some notion of a process, or at least a meta-process.

I've looked at several of the "agile lifecycle management" and "agile project management" tools this week. To me, they all look exactly like regular project management tools. They just have some different terminology and ajax-y web interfaces.

Vendors, listen: just because you've got a drag-and-drop rectangle on a web page doesn't make it agile!

The point of agile tools isn't to move cards around the board in ever-cooler ways. It isn't to automatically generate burndown graphs and publish them for management.

The point of agile tools is this: at any time, the team can choose to rip up the pavement and do it differently next iteration.

What happens once you've paid a bunch of money for some enterprise lifecycle management tool from one of these outfits? (Name them and they appear; so I won't.) Investment requires use. Once you've paid for something---or once your boss has paid for it---you'll be stuck using it.

Now look, I'm not against tools. I use them as force multipliers all the time. I just don't want to get stuck with some albatross of a PLM, ALM, LFCM, or LEM, just because we paid a gob of money for it.

The only agile tools I want are those I can throw away without qualm when the team decides it doesn't fit any more. If the team cannot change its own processes and tools, then it cannot adapt to the things it learns. If it cannot adapt, it isn't agile. Period.

Beyond the Village

As an organization scales up, it must navigate several transitions. If it fails to make these transitions well, it will stall out or disappear.

One of them happens when the company grows larger than "village-sized". In a village of about 150 people or fewer, it's possible for you to know everyone else. Larger than that, and you need some kind of secondary structures, because personal relationships don't reach from every person to every other person. Not coincidentally, this is also the size where you see startups introducing mid-level management.

There are other factors that can bring this on sooner. If the company is split into several locations, people at one location will lose track of those in other locations. Likewise, if the company is split into different practice areas or functional groups, those groups will tend to become separate villages on their own. In either case, the village transition will happen sooner than 150.

It's a tough transition, because it takes the company from a flat, familial structure to a hierarchical one. That implicitly moves the axis of status from pure merit to positional. Low-numbered employees may find themselves suddenly reporting to a newcomer with no historical context. It shouldn't come as a surprise when long-time employees start leaving, but somehow the founders never expect it.

This is also when the founders start to lose touch with day-to-day execution. They need to recognize that they will never again know every employee by name, family, skills, and goals. Beyond village size, the founders have to be professional managers. Of course, this may also be when the board (if there is one) brings in some professional managers. It shouldn't come as a surprise when founders start getting replaced, but somehow they never expect it.


Kingpins of Filthy Data

If large amounts of dirty data are actually valuable, how do you go about collecting it? Who's in the best position to amass huge piles?

One strategy is to scavenge publicly visible data. Go screen-scrape whatever you can from web sites. That's Google's approach, along with one camp of the Semantic Web tribe.

Another approach is to give something away in exchange for that data. Position yourself as a connector or hub. Brokers always have great visibility. The IM servers, the Twitter crowd, and the social networks in general sit in the middle of great networks of people. LinkedIn is pursuing this approach, as are Twitter+Summize, and BlogLines. Facebook has already made multiple, highly creepy, attempts to capitalize on their "man-in-the-middle" status. Meebo is in a good spot, and trying to leverage it further. Metcalfe's Law will make it hard to break into this space, but once you do, your visibility is a great natural advantage.

Aggregators get to see what people are interested in. FriendFeed is sitting on a torrential flow of dirty data. ("Sewage", perhaps?) FeedBurner sees the value in their dirty data.

Anyone at the endpoint of traffic should be able to get good insight into their own world. While the aggregators and hubs get global visibility, the endpoints are naturally more limited. Still, that shouldn't stop them from making the most of the dirt flowing their way. Amazon has done well here.

Sun is making a run at this kind of visibility with Project Hydrazine, but I'm skeptical. They aren't naturally in a position to collect it, and off-to-the-side instrumentation is never as powerful. Although, companies like Omniture have made a market out of off-to-the-side instrumentation, so there's a possibility there.

Carriers like Verizon, Qwest, and AT&T have fantastic visibility into the traffic crossing their networks, but as parties in a regulated industry, they are mostly prohibited from looking at it.

So, if you're a carrier or a transport network, you're well positioned to amass tons of dirty data. If you are a hub or broker, then you've already got it. Otherwise, consider giving away a service to bring people in. Instead of supporting it with ad revenue, support it by gleaning valuable insight.

Just remember that a little bit of dirty data is a pain in the ass, but mountains of it are pure gold.

Inverting the Clickstream

Continuing my theme of filthy data.

A few years ago, there was a lot of excitement around clickstream analysis. This was the idea that, by watching a user's clicks around a website, you could predict things about that user.

What a backwards idea.

For any given user, you can imagine a huge number of plausible explanations for any given browsing session. You'll never enumerate all the use cases that motivate someone to spend ten minutes on seven pages of your web site.

No, the user doesn't tell us much about himself by his pattern of clicks.

But the aggregate of all the users' clicks... that tells us a lot! Not about the users, but about how the users perceive our site. It tells us about ourselves!

A commerce company may consider two products to be related for any number of reasons. Deliberate cross-selling, functional alignment, interchangeability, whatever. Any such relationships we create between products in the catalog only reflect how we view our own catalog. Flip that around, though, and look at products that the users view as related. Every day, in every session, users are telling us that products have some relationship to each other.

Hmm. But, then, what about those times when I buy something for myself and something for my kids during the same session? Or when I got that prank gift for my brother?

Once you aggregate all that dirty data, weak connections like the prank gift will just be part of the background noise. The connections that stand out from the noise are the real ones, the only ones that ultimately matter.
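Here's a minimal sketch of that aggregation, assuming each session has already been reduced to a list of product identifiers. The function names, the sample data, and the simple "seen together in at least N sessions" cutoff are all made up for illustration; a real site would want something more careful than a raw count.

```python
from collections import Counter
from itertools import combinations


def co_view_counts(sessions):
    """Count, for each pair of products, how many sessions contained both."""
    counts = Counter()
    for products in sessions:
        # Sort and de-duplicate so (a, b) and (b, a) count as one pair.
        for pair in combinations(sorted(set(products)), 2):
            counts[pair] += 1
    return counts


def strong_pairs(sessions, min_sessions):
    """Keep only the pairs that stand out from the background noise."""
    counts = co_view_counts(sessions)
    return {pair: n for pair, n in counts.items() if n >= min_sessions}


# Hypothetical sessions; the whoopee cushion is the prank gift.
sessions = [
    ["tent", "stove"],
    ["tent", "stove", "whoopee cushion"],
    ["stove", "tent"],
    ["novel"],
]
print(strong_pairs(sessions, min_sessions=2))
```

The prank gift pairs up with the tent and the stove exactly once each, so it falls below the cutoff. The tent and stove pairing is the only one that survives.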

This is an inversion of the clickstream. It tells us nearly nothing about the clicker. Instead, it illuminates the clickee.

Mounds of Filthy Data

Data is the future.

The barriers to entering online business are pretty low, these days. You can do it with zero infrastructure, which means no capital spent on depreciating assets like servers and switches. Open source operating systems, databases, servers, middleware, libraries, and development tools mean that you don't spend money on software licenses or maintenance contracts. All you need is an idea, followed by a SMOP.

With both infrastructure and software costs trending toward zero, how can there be any barrier to entry?

The "classic" answer is the network effect, also known as Metcalfe's Law. (The word "classic" in web business models means anything more than two years old, of course.) The first Twitter user didn't get a whole lot out of it. The ten-million-and-first gets a lot more benefit. That makes it tough for a newcomer like Plurk to get an edge.

I see a new model emerging, though. Metcalfe's Law is part of it, keeping people engaged. The best thing about having users, though, is that they do things. Every action by every user tells you something, if you can keep track of it all.

Twitter gets a lot of its value from the people connected at the endpoints. But, they also get enormous power from being the hub in the middle of it. Imagine what you can do when you see the content of every message passing through a system that large. A few things come to mind right away. You could extract all the links that people are posting to see what's hot today. (Zeitgeist.) You could use semantic analysis to tell how people feel about current topics, like Presidential candidates in the U.S. You could track product names and mentions to see which products delight people and which cause frustration. You could publish a slang dictionary that actually keeps up! The possibilities are enormous.

Ah, I can already sense an objection forming. How the heck is anyone supposed to figure out all that stuff from noisy, messy textual human communication? We're cryptic, ironic, and oblique. We sometimes mean the exact opposite of what we say. Any machine intelligence that tries to grok all of Twitter will surely self-destruct, right? That supposed "data" is just a big steaming pile of human contradictions!

In my view, though, it's the dirtiness of the data that makes it beautiful. Yes, there will be contradictions. There will be ironic asides. But, those will come out in the wash. They'll be balanced out by the sincere, meaningful, or obvious. Not every message will be semantically clear or consistent, but given enough messy data, clear patterns will still emerge.

There's the key: enough data to see patterns. Large amounts. Huge amounts. Vast piles of filthy data.

Over the next couple of days, I'll post a series of entries exploring how to amass dirty data, who's got a natural advantage, and programming models that work with it.

Webber and Fowler on SOA Man-Boobs

InfoQ posted a video of Jim Webber and Martin Fowler doing a keynote speech at QCon London this Spring. It's a brilliant deconstruction of the concept of the Enterprise Service Bus. I can attest that they're both funny and articulate (whether on the stage or off.)

Along the way, they talk about building services incrementally, delivering value at every step along the way. They advocate decentralized control and direct alignment between services and the business units that own them. 

I agree with every word, though I'm vaguely uncomfortable with how often they say "enterprise man boobs".

Project Hydrazine

Part of Sun's push behind JavaFX will be called "Project Hydrazine".  (Hydrazine is a toxic and volatile rocket fuel.)  This is still a bit fuzzy, and they only left the boxes-and-arrows slide up for a few seconds, but here's what I was able to glean.

Hydrazine includes common federated services for discovery, personalization, deployment, location, and development. There's a "cloud" component to it, which wasn't entirely clear from their presentation. Overall, the goal appears to be an easier model for creating end-user applications based on a service component architecture. All tied together and presented with JavaFX, of course.

One very interesting extension---called "Project Insight"---that Rich Green and Jonathan Schwartz both discussed is the ability to instrument your applications to monitor end-user activity in your apps.

(This immediately reminded me of Valve's instrumentation of Half-Life 2, episode 2. The game itself reports back to Valve on player stats: time to complete levels, map locations where they died, play time and duration, and so on. Valve has previously talked about using these stats to improve their level design by finding out where players get frustrated, or quit, and redesigning those levels.)

I can see this being used well: making apps more usable, proactively analyzing what features users appreciate or don't understand, and targeting development effort at improving the overall experience.

Of course, it can also be used to target advertising and monitor impressions and clicks. Rich promoted this as the way to monetize apps built using Project Hydrazine. I can see the value in it, but I'm also ambivalent about creating even more channels for advertising.

In any event, users will be justifiably anxious about their TV watching them back. It's just a little too Max Headroom for a lot of people. Sun says that the data will only appear in the aggregate. This leads me to believe that the apps will report to a scalable, cloud-based aggregation service from which developers can get the aggregated data. Presumably, this will be run by Sun.

Unlike Apple's iron-fisted control over iPhone application delivery, Sun says they will not be exercising editorial control. According to Schwartz, Hydrazine will all be free: free in price, freely available, and free in philosophy.


The Granularity Problem

I spend most of my time dealing with large sites. They're always hungry for more horsepower, especially if they can serve more visitors with the same power draw. Power goes up much faster with more chassis than with more CPU cores. Not to mention, administrative overhead tends to scale with the number of hosts, not the number of cores. For them, multicore is a dream come true.

I ran into an interesting situation the other day, on the other end of the spectrum.

One of my team was working with a client that had relatively modest traffic levels. They're in a mature industry with a solid, but not rabid, customer base. Their web traffic needs could easily be served by one Apache server running one CPU and a couple of gigs of RAM.

The smallest configuration we could offer, and still maintain SLAs, was two hosts, with a total of 8 CPU cores running at 2 GHz, 32 gigs of RAM, and 4 fast Ethernet ports.

Of course that's oversized! Of course it's going to cost more than it should! But at this point in time, if we're talking about dedicated boxes, that's the smallest configuration we can offer! (Barring some creative engineering, like using fully depreciated "classics" hardware that's off its original lease, but still has a year or two before EOL.)

As CPUs get more cores, the minimum configuration is going to become more and more powerful. The quantum of computing is getting large.

Not every application will need it, and that's another reason I think private clouds make a lot of sense. Companies can buy big boxes, then allocate them to specific applications in fractions. They gain cost efficiency in administration, power, and space consumption (though not heat production!) while still letting business units optimize their capacity downward to meet their actual demand.

Outrunning Your Headlights

Agile developers measure their velocity. Most teams define velocity as the number of story points delivered per iteration. Since the size of a "story point" and the length of an iteration vary from team to team, there's not much use in comparing velocity from one team to the next. Instead, the team tracks its own velocity from iteration to iteration.

Tracking velocity has two purposes. The first is estimation. If you know how many story points are left for this release, and you know how many points you complete per iteration, then you know how long it will be until you can release. (This is the "burndown chart".) After two or three iterations, this will be a much better projection of release date than I've ever seen any non-agile process deliver.
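The estimation arithmetic is simple enough to sketch. This version averages the last few iterations to smooth out noise, which is one common choice rather than anything the paragraph above prescribes. The function name and the numbers are hypothetical.

```python
import math


def iterations_remaining(points_left, recent_velocities):
    """Estimate how many more iterations are needed to finish the release.

    points_left: story points still in the release backlog.
    recent_velocities: points delivered in each of the last few iterations.
    """
    velocity = sum(recent_velocities) / len(recent_velocities)
    # A partly used iteration still occupies a whole iteration on the calendar.
    return math.ceil(points_left / velocity)


# Hypothetical team: 120 points left, last three iterations delivered 22, 26, 24.
print(iterations_remaining(120, [22, 26, 24]))  # 5
```

Multiply the result by the iteration length to get a calendar projection, and rerun it after every iteration as both the backlog and the velocity change.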

The second purpose of velocity tracking is to figure out ways to go faster.

In the iteration retrospective, a team will first recalibrate its estimating technique, to see if it can estimate the story cards or backlog items more accurately. Second, they'll look at ways to accomplish more during an iteration. Maybe that's refactoring part of the code, or automating some manual process. It might be as simple as adding templates to the IDE for commonly recurring code patterns. (That should always raise a warning flag, since recurring code patterns are a code smell. Some languages just won't let you completely eliminate it, though. And by "some languages" here, I mainly mean Java.)

Going faster should always be better, right? That means the development team is delivering more value for the same fixed cost, so it should always be a benefit, shouldn't it?

I have an example of a case where going faster didn't matter. To see why, we need to look past the boundaries of the development team. Developers often treat software requirements as if they come from a sort of ATM; there's an unlimited reserve of requirements and we just need to decide how many of them to accept into development.

Taking a cue from Lean Software Development, though, we can look at the end-to-end value stream. The value stream is drawn from the customer's perspective. Step by step, the value stream map shows us how raw materials (requirements) are turned into finished goods. "Finished goods" does not mean code. Code is inventory, not finished goods. A finished good is something a customer would buy. Customers don't buy code. On the web, customers are users, interacting with a fully deployed site running in production. For shrink-wrapped software, customers buy a CD, DVD, or installer from a store. Until the inventory is fully transformed into one of these finished goods, the value stream isn't done.

Figure 1 shows a value stream map for a typical waterfall development process. This process has an annual funding cycle, so "inventory" from "suppliers" (i.e., requirements from the business unit) wait, on average, six months to get funded. Once funded and analyzed, they enter the development process. For clarity here, I've shown the development process as a single box, with 100% efficiency. That is, all the time spent in development is spent adding value---as the customer perceives it---to the product. Obviously, that's not true, but we'll treat it as a momentarily convenient fiction. Here, I'm showing a value stream map for a web site, so the final steps are staging and deploying the release.

Figure 1 - Value Stream Map of a Waterfall Process

This is not a very efficient process. It takes 315 business days to go from concept to cash. Out of that time, at most 30% of it is spent adding value. In reality, if we unpack the analysis and development processes, we'll see that efficiency drop to around 5%.
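The efficiency figure is just value-adding time divided by total elapsed time. Here's a sketch of the calculation. Only the 315-day total, the roughly 30% result, and the six-month funding wait come from the text; the split of days among the individual steps is invented for illustration, and the names are hypothetical.

```python
# Each step: (name, value-adding business days, waiting business days).
# The per-step numbers are hypothetical, chosen only to add up to 315.
STEPS = [
    ("Funding queue", 0, 130),   # about six months of waiting, on average
    ("Analysis", 20, 30),
    ("Development and Testing", 70, 62),
    ("Staging", 2, 0),
    ("Deployment", 1, 0),
]


def lead_time(steps):
    """Total elapsed business days from concept to cash."""
    return sum(value + wait for _name, value, wait in steps)


def efficiency(steps):
    """Fraction of the lead time spent adding value."""
    value_days = sum(value for _name, value, _wait in steps)
    return value_days / lead_time(steps)


print(f"lead time:  {lead_time(STEPS)} business days")
print(f"efficiency: {efficiency(STEPS):.0%}")
```

Unpacking Analysis and Development into their own value and wait components, as the text suggests, would move days from the first column to the second and push the efficiency down further.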

From the Theory of Constraints, we know that the throughput of any process is limited by exactly one constraint. An easy way to find the constraint is by looking at queue sizes. In an unoptimized process, you almost always find the largest queue right before the constraint. In the factory environment that ToC came from, it's easy to see the stacks of WIP (work in progress) inventory. In a development process, WIP shows up in SCR systems, requirements spreadsheets, prioritization documents, and so on.

Indeed, if we overlay the queues on that waterfall process, as in Figure 2, it's clear that Development and Testing is the constraint. After Development and Testing completes, Staging and Deployment take almost no time and have no queued inventory.

Figure 2 - Waterfall Value Stream, With Queues

In this environment, it's easy to see why development teams get flogged constantly to go faster, produce more, catch up.  They're the constraint.

Lean Software Development has ten simple rules to optimize the entire value stream.

ToC says to elevate the constraint and subordinate the entire process to the throughput of the constraint. Elevating the constraint---by either going faster with existing capacity, or expanding capacity---adds to throughput, while running the whole process at the throughput of the constraint helps reduce waste and WIP.

In a certain sense, Agile methods can be derived from Lean and ToC.

All of that, though, presupposes a couple of things:

  • Development is the constraint.
  • There's an unlimited supply of requirements.
  • Figure 3 shows the value stream map for a project I worked on in 2005. This project was to replace an existing system, so at first, we had a large backlog of stories to work on. As we approached feature parity, though, we began to run out of stories. The users had been waiting for this system for so long, that they hadn't given much thought, or at least recent thought, to what they might want after the initial release. Shortly after the second release (a minor bug fix), it became clear that we were actually consuming stories faster than they would be produced.

    Value Stream of an Agile Process

    Figure 3 - Value Stream Map of an Agile Project

    On the output side, we ran into the reverse problem. This desktop software would be distributed to hundreds of locations, with over a thousand users who needed to be expert on the software in short order. The internal training group, responsible for creating manuals and computer based training videos, could not keep revising their training modules as quickly as we were able to change the application. We could create new user interface controls, metaphors, and even whole screens much faster than they could create training materials.

    Once past the training group, a release had to be mastered and replicated onto installation discs. These discs were distributed to the store locations, where associates would call the operations group for a "talkthrough" of the installation process. Operations has a finite capacity, and can only handle so many installations every day. That set a natural throttle on the rate of releases. At one stage---after I rolled off the project---I know that a release which had passed acceptance testing in October was still in the training group by the following March.

    In short, the development team wasn't the constraint. There was no point in running faster. We would exhaust the inventory of requirements and build up a huge queue of WIP in front of training and deployment. The proper response would be to slow down, to avoid the buildup of unfinished inventory. Creating slack in the workday would be one way to slow down, and drawing down the team size would be another. Increasing the capacity of the training team would be an equally valid response. There are other places to optimize the value stream, too. But the one thing that absolutely wouldn't help would be increasing the development team's velocity.
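To make the queueing argument concrete, here is a minimal sketch of a serial value stream. The stage capacities are made-up numbers, not measurements from the project, and the `simulate` function is purely illustrative. It compares the baseline against a development team that is twice as fast.

```python
def simulate(stages, weeks):
    """Push work through a serial value stream, one stage hop per week.

    stages: list of (name, capacity_per_week) in flow order. The first
    stage is treated as a source that emits `capacity` items every week.
    Returns (items_finished, {stage name: items queued in front of it}).
    """
    waiting = [0] * len(stages)
    finished = 0
    for _ in range(weeks):
        # Walk downstream-to-upstream so an item advances at most one stage per week.
        for i in reversed(range(len(stages))):
            _, capacity = stages[i]
            moved = capacity if i == 0 else min(capacity, waiting[i])
            if i > 0:
                waiting[i] -= moved
            if i + 1 < len(stages):
                waiting[i + 1] += moved
            else:
                finished += moved
    return finished, {name: q for (name, _), q in zip(stages, waiting)}


# Hypothetical weekly capacities, chosen only to illustrate the shape of the problem.
baseline = [("requirements", 5), ("development", 10), ("training", 3), ("deployment", 4)]
faster_dev = [("requirements", 5), ("development", 20), ("training", 3), ("deployment", 4)]

done, queues = simulate(baseline, weeks=10)
done_fast, queues_fast = simulate(faster_dev, weeks=10)
print("baseline:  ", done, queues)
print("faster dev:", done_fast, queues_fast)
```

With these assumed numbers, both runs finish the same number of items, and the pile of unfinished work sits in front of training in both. Doubling development capacity changes nothing because development is already starved for stories and training is already saturated.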

    For nearly the entire history of software development, there has been talk of the "software crisis", the ever-widening gap between government and industry's need for software and the rate at which software can be produced. For the first time in that history, agile methods allow us to move the constraint off of the development team.

    Two Ways To Boost Your Flagging Web Site

    Being fast doesn't make you scalable. But it does mean you can handle more capacity with your current infrastructure. Take a look at this diagram of request handlers.

    13 Threads Needed When Requests Take 700ms

    You can see that it takes 13 request handling threads to process this amount of load. In the next diagram, the requests arrive at the same rate, but in this picture it takes just 200 milliseconds to answer each one.

    3 Threads Needed When Requests Take 200ms

    Same load, but only 3 request handlers are needed at a time. So, shortening the processing time means you can handle more transactions during the same unit of time.
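The relationship behind those two diagrams is Little's Law: the average number of requests in flight equals the arrival rate times the time each request takes. Here is a rough sketch. The arrival rate of 15 requests per second is an assumption for illustration, not a figure read off the diagrams, so the handler counts it produces will not match them exactly.

```python
import math


def handlers_needed(requests_per_sec, service_time_ms):
    """Average concurrent requests by Little's Law, rounded up to whole handlers."""
    return math.ceil(requests_per_sec * service_time_ms / 1000)


ARRIVAL_RATE = 15  # requests per second; assumed for illustration

for service_time_ms in (700, 200):
    count = handlers_needed(ARRIVAL_RATE, service_time_ms)
    print(f"{service_time_ms} ms per request -> about {count} handlers busy")
```

Little's Law gives the average, not the peak. Bursty traffic needs headroom on top of this number, but the proportionality holds: cut response time and you cut the number of handlers tied up at once.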

    Suppose your site is built on the classic "six-pack" architecture shown below. As your traffic grows and the site slows, you're probably looking at adding more oomph to the database servers. Scaling that database cluster up gets expensive very quickly. Worse, you have to bulk up both guns at once, because each one still has to be able to handle the entire load. So you're paying for big boxes that are guaranteed to be 50% idle.

    Classic Six Pack

    Let's look at two techniques almost any site can use to speed up requests, without having the Hulk Hogan and Andre the Giant of databases lounging around in your data center.

    Cache Farms

    Cache farming doesn't mean armies of Chinese gamers stomping rats and making vests. It doesn't involve registering a ton of domain names, either.

    Pretty much every web app is already caching a bunch of things at a bunch of layers. Odds are, your application is already caching database results, maybe as objects or maybe just query results. At the top level, you might be caching page fragments. HTTP session objects are nothing but caches. The net result of all this caching is a lot of redundancy. Every app server instance has a bunch of memory devoted to caching. If you're running multiple instances on the same hosts, you could be caching the same object once per instance.

    Caching is supposed to speed things up, right? Well, what happens when those app server instances get short on memory? Those caches can tie up a lot of heap space. If they do, then instead of speeding things up, the caches will actually slow responses down as the garbage collector works harder and harder to free up space.

    So what do we have? If there are four app instances per host, then a frequently accessed object---like a product featured on the home page---will be duplicated eight times. Can we do better? Well, since I'm writing this article, you might suspect the answer is "yes". You'd be right.

    The caches I've described so far are in-memory, internal caches. That is, they exist completely in RAM and each process uses its own RAM for caching. There exist products, commercial and open-source, that let you externalize that cache. By moving the cache out of the app server process, you can access the same cache from multiple instances, reducing duplication. With those objects out of the heap, you can make the app server heap smaller, which will also reduce garbage collection pauses. If you make the cache distributed, as well as external, then you can reduce duplication even further.

    External caching can also be tweaked and tuned to help deal with "hot" objects. If you look at the distribution of accesses by ID, odds are you'll observe a power law. That means the popular items will be requested hundreds or thousands of times as often as the average item. In a large infrastructure, making sure that the hot items are on cache servers topologically near the application servers can make a huge difference in time lost to latency and in load on the network.

    External caches are subject to the same kind of invalidation strategies as internal caches. On the other hand, when you invalidate an item from each app server's internal cache, they're probably all going to hit the database at about the same time. With an external cache, only the first app server hits the database. The rest will find that it's already been re-added to the cache.
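Here is a toy model of that difference. Plain dictionaries stand in for the caches, and the `Database` and `AppServer` classes are hypothetical stand-ins, not any product's API. It counts how many database reads follow an invalidation when eight app server instances each keep a private cache versus when they share one external cache.

```python
class Database:
    """Stand-in for the real database; counts how often it gets read."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0

    def load(self, key):
        self.reads += 1
        return self.rows[key]


class AppServer:
    """Looks in its cache first and falls back to the database on a miss."""

    def __init__(self, db, cache):
        self.db = db
        self.cache = cache  # private dict (internal) or one shared dict (external)

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.db.load(key)
        return self.cache[key]


def db_reads_after_invalidation(shared, servers=8):
    db = Database({"sku-1": "featured product"})
    external_cache = {}
    apps = [AppServer(db, external_cache if shared else {}) for _ in range(servers)]
    for app in apps:              # warm every cache
        app.get("sku-1")
    for app in apps:              # invalidate the item everywhere
        app.cache.pop("sku-1", None)
    db.reads = 0
    for app in apps:              # the next request on each instance
        app.get("sku-1")
    return db.reads


print("internal caches:", db_reads_after_invalidation(shared=False))
print("external cache: ", db_reads_after_invalidation(shared=True))
```

This sketch serves requests one at a time. In a real system, several instances can miss at the same instant and all go to the database before the first one repopulates the cache, so treat "only the first app server" as the typical case rather than a guarantee.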

    External cache servers can run on the same hosts as the app servers, but they are often clustered together on hosts of their own. Hence, the cache farm.

    Six Pack With Cache Farm

    If the external cache doesn't have the item, the app server hits the database as usual. So I'll turn my attention to the database tier.

    Read Pools

    The toughest thing for any database to deal with is a mixture of read and write operations. The write operations have to create locks and, if transactional, locks across multiple tables or blocks. If the same tables are being read, those reads will have highly variable performance, depending on whether a read operation randomly encounters one of the locked rows (or pages, blocks, or tables, depending).

    But the truth is that your application almost certainly does more reads than writes, probably to an overwhelming degree. (Yes, there are some domains where writes exceed reads, but I'm going to momentarily disregard mindless data collection.) For a travel site, the ratio will be about 10:1. For a commerce site, it will be from 50:1 to 200:1. There are a lot of variables here, especially when you start doing more effective caching, but even then, the ratios are highly skewed.

    When your database starts to get that middle-age paunch and it just isn't as zippy as it used to be, think about offloading those reads. At a minimum, you'll be able to scale out instead of up. Scaling out with smaller, consistent, commodity hardware pleases everyone more than forklift upgrades. In fact, you'll probably get more performance out of your writes once all that pesky read I/O is off the write master.
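As a back-of-the-envelope check on how much work could leave the write master, this sketch turns a read:write ratio into the share of statements that are reads. It assumes every statement costs the same, which real workloads don't honor, so read it as an upper-level intuition and not a capacity plan.

```python
def read_share(reads_per_write):
    """Fraction of all statements that are reads, given a reads:1 ratio."""
    return reads_per_write / (reads_per_write + 1)


for label, ratio in (("travel, 10:1", 10), ("commerce, 50:1", 50), ("commerce, 200:1", 200)):
    print(f"{label}: {read_share(ratio):.1%} of statements could move to a read pool")
```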

    How do you create a read pool? Good news! It uses nothing more than built-in replication features of the database itself. Basically, you just configure the write master to ship its archive logs (or whatever your DB calls them) to the read pool databases. They spin up the logs to bring their state into synch with the write master.
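On the application side, something has to decide which database each statement goes to. Here is a minimal sketch of that routing, assuming a hypothetical `ReadPoolRouter` class and using plain strings where a real application would hold connections or data sources.

```python
import itertools


class ReadPoolRouter:
    """Send SELECTs to the read pool round-robin, everything else to the write master."""

    def __init__(self, write_master, read_pool):
        if not read_pool:
            raise ValueError("read pool needs at least one replica")
        self.write_master = write_master
        self._next_replica = itertools.cycle(read_pool)

    def connection_for(self, sql):
        words = sql.split()
        is_read = bool(words) and words[0].upper() == "SELECT"
        return next(self._next_replica) if is_read else self.write_master


# Hypothetical host names, for illustration only.
router = ReadPoolRouter("write-master", ["read-1", "read-2"])
for statement in (
    "SELECT price FROM item WHERE id = 42",
    "UPDATE item SET price = 19.99 WHERE id = 42",
    "SELECT name FROM item WHERE id = 42",
):
    print(router.connection_for(statement), "<-", statement)
```

Routing on the first keyword is deliberately naive. Reads that must see their own just-committed writes, or that run inside a transaction, belong on the write master, so a real router needs a way for the caller to say so.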

    Six Pack With Cache Farm and Read Pool

    By the way, for read pooling, you really want to avoid database clustering approaches. The overhead needed for synchronization obviates the benefits of read pooling in the first place.

    At this point, you might be objecting, "Wait a cotton-picking minute! That means the read machines are garun-damn-teed to be out of date!" (That's the Foghorn Leghorn version of the objection. I'll let you extrapolate the Tony Soprano and Geico Gecko versions yourself.) You would be correct. The read machines will always reflect an earlier point in time.

    Does that matter?

    To a certain extent, I can't answer that. It might matter, depending on your domain and application. But in general, I think it matters less often than it seems. I'll give you an example from the retail domain that I know and love so well. Take a look at this product detail page from BestBuy.com. How often do you think each data field on that page changes? Suppose there is a pricing error that needs to be corrected immediately (for some definition of immediately). What's the total latency before that pricing error will be corrected? Let's look at the end-to-end process.

    1. A human detects the pricing error.
    2. The observer notifies the responsible merchant.
    3. The merchant verifies that the price is in error and determines the correct price.
    4. Because this is an emergency, the merchant logs in to the "fast path" system that bypasses the nightly batch cycle.
    5. The merchant locates the item and enters the correct price.
    6. She hits the "publish" button.
    7. The fast path system connects to the write master in production and updates the price.
    8. The read pool receives the logs with the update and applies them.
    9. The read pool process sends a message to invalidate the item in the app servers' caches.
    10. The next time users request that product detail page, they see the correct price.

    That's the best-case scenario! In the real world, the merchant will be in a meeting when the pricing error is found. It may take a phone call or lookup from another database to find out the correct price. There might be a quick conference call to make the decision whether to update the price or just yank the item off the site. All in all, it might take an hour or two before the pricing error gets corrected. Whatever the exact sequence of events, odds are that the replication latency from the write master to the read pool is the very least of the delays.

    Most of the data is much less volatile or critical than the price. Is an extra five minutes of latency really a big deal? When it can save you a couple of hundred thousand dollars on giant database hardware?

    Summing It Up

    The reflexive answer to scaling is, "Scale out at the web and app tiers, scale up in the data tier." I hope this shows that there are other avenues to improving performance and capacity.


    For more on read pooling, see Cal Henderson's excellent book, Building Scalable Web Sites: Building, scaling, and optimizing the next generation of web applications.

    The most popular open-source external caching framework I've seen is memcached. It's a flexible, multi-lingual caching daemon.

    On the commercial side, GigaSpaces provides distributed, external, clustered caching. It adapts to the "hot item" problem dynamically to keep a good distribution of traffic, and it can be configured to move cached items closer to the servers that use them, reducing network hops to the cache.

    A path to a product

    Here's a "can't lose" way to identify a new product: Enable people to plan ahead less. 

    Take cell phones.  In the old days, you had to know where you were going before you left.  You had to make reservations from home.  You had to arrange a time and place to meet your kids at Disney World.

    Now, you can call "information" to get the number of a restaurant, so you don't have to decide where you're going until the last possible minute.  You can call the restaurant for reservations from your car while you're already on your way.

    With cell phones, your family can split up at a theme park without pre-arranging a meeting place or time.

    Cell phones let you improvise with success.  Huge hit.

    GPS navigation in cars is another great example.  No more calling AAA weeks before your trip to get "TripTix" maps.  No more planning your route on a road atlas.  Just get in your car, pick a destination and start driving.  You don't even have to know where to get gas or food along the way.

    Credit and debit cards let you go places without planning ahead and carrying enough cash, gold, or jewels to pay your way.

    The Web is the ultimate "preparation avoidance" tool.  No matter what you're doing, if you have an always-on 'Net connection, you can improvise your way through meetings, debates, social engagements, and work situations.

    Find another product that lets procrastinators succeed, and you've got a sure winner.  There's nothing that people love more than the personal liberation of not planning ahead.


    "Us" and "Them"

    As a consultant, I've joined a lot of projects, usually not right when the team is forming. Over the years, I've developed a few heuristics that tell me a lot about the psychological health of the team. Who lunches together? When someone says "whole team meeting," who is invited? Listen for the "us and them" language. How inclusive is the "us" and who is relegated to "them?" These simple observations speak volumes about the perception of the development team. You can see who they consider their stakeholders, their allies, and their opponents.

    Ten years ago, for example, the users were always "them." Testing and QA were always "them." Today, particularly on agile teams, testers and users often get "us" status. (As an aside, this may be why startups show such great productivity in the early days. The company isn't big enough to allow "us" and "them" thinking to set in. Of course, the converse is true as well: us and them thinking in a startup might be a failure indicator to watch out for!) Watch out if an "us" suddenly becomes "them." Trouble is brewing!

    Any conversation can create a "happy accident": some understanding that obviates a requirement, avoids a potential bug, reduces cost, or improves the outcome in some other way. Conversations prevented thanks to an armed-camp mentality are opportunities lost.

    One of the most persistent and perplexing "us" and "them" divisions I see is between development and operations. Maybe it's due to the high org-chart distance (OCD) between development groups and operations groups. Maybe it's because development doesn't tend to plan as far ahead as operations does. Maybe it's just due to a long-term dynamic of requests and refusals that sets each new conversation up for conflict. Whatever the cause, two groups that should absolutely be working as partners often end up in conflict, or worse, barely speaking at all.

    This has serious consequences. People in the "us" tent get their requests built very quickly and accurately. People in the "them" tent get told to write specifications. Specifications have their place. Specifications are great for the fourth or fifth iteration of a well-defined process. During development, though, ideas need to be explored, not specified. If a developer has a vague idea about using the storage area network to rapidly move large volumes of data from the content management system into production, but he doesn't know how to write the request, the idea will wither on the vine.

    The development-operations divide virtually ensures that applications will not be transitioned to operations as effectively as possible. Some vital bits of knowledge just don't fit into a document template. For example, developers have knowledge about the internals of the application that can help diagnose and recover from system failures. (Developer: "Oh, when you see all the request handling threads blocked inside the K2 client library, just bounce the search servers. The app will come right back." Operations: "Roger that. What's a thread?") These gaps in knowledge degrade uptime, either by extending outages or preventing operations from intervening. If the company culture is at all political, one or two incidents of downtime will be enough to start the finger-pointing between development and operations. Once that corrosive dynamic gets started, nothing short of changing the personnel or the leadership will stop it.

    Bill Joy Knocks the Open Source Business Model

    Bill Joy had some doubts to voice about Linux. Of course, like so many others he immediately jumps to the wrong conclusion. "The open-source business model hasn't worked very well," he says.

    Tough nuts. Here's the point that seems to get missed over and over again. There is no "open source business model". There never was, and I doubt there ever will be. It doesn't exist. It's a contradiction in terms.

    Open source needs no business model.

    Look, GNU existed before anyone ever talked about "open source". Linux was built before there were companies like RedHat and IBM interested (let alone Sun). The thing that the corps and the pundits cannot seem to grasp is their absolute irrelevance.

    It's like Bruce Sterling's speech. Harangue. Whatever you want to call it. I see it as yet another person getting up and trying to tell the "open-source community" what they need to do. Getting on their case about not being organized enough... or something.

    Or it's like those posters on Slashdot that wish either GNOME or KDE would shut down so everyone can focus on one "standard" desktop.

    Or Scott McNealy, lamenting the fact that open source Java application servers inhibit the expenditure of dollars that could be used to market J2EE against .Net.

    Or the UI designers who froth at the mouth about how terrible an open source application's user interface may be. They say moronic things like "when will coders learn that they shouldn't design user interfaces?" (Or the more extreme form, "Programmers should never design UIs.")

    Or it's like anyone who looks at an application and says, "That's pretty good. You know what you really need to do?"

    All of these people don't get the true point. I'll say it here as baldly as I can.

    There is nobody in charge. Not IBM, not Linus Torvalds, not Richard Stallman. Nobody.

    All you will find is an anarchic collection of self-interested individuals. Sometimes they collaborate. Some of them work together, some work apart, some work against each other. To the extent that some clusters of individuals share a vision, they collaborate to tackle bigger, cooler projects.

    There is no one in control. Nobody gets to decree what open source projects live or die, or what direction they go in. These projects are an expression of free will, created by those capable of expressing themselves in that medium. Decisions happen in code, because coders make them happen.

    Free will, baby. It's my project, and I'll do what I want with it. If I want to create the most god-awful user interface ever seen by Man, that's my prerogative. (If I want lots of users, I probably won't do that, but who says I have to want lots of users? It's my choice!)

    As long as one GNOME hacker wants to keep working on GNOME, it will continue to evolve. As long as one Linux kernel hacker keeps coding, Linux will continue. None of these things require corporations, IPOs, or investment dollars to continue. The only true investments in open source are time and brainpower. Money is useful in that it can be used to purchase time, the greatest gift you can give a coder. Corporations are useful in that they are effective at aggregating and channeling money. "Useful", not "required".

    As long as coders have free will and the tools to express it, open source software will continue. In fact, even if you take away their tools, they'll build new ones! To truly kill open source software, you must kill free will itself.

    (And, by the way, there are those who want to do exactly that.)

    Multiplier Effects

    Here's another way to think about the ethics of software, in terms of multipliers. Think back to the last major virus scare, or when Star Wars Episode II was released. Some "analyst"--who probably found his certificate in a box of Cracker Jack--published some ridiculous estimate of damages.

    BTW, I have to take a minute to disassemble this kind of analysis. Stick with me, it won't take long.

    If you take 1.5 seconds to delete the virus, it costs nothing. It's an absolutely immeasurable impact to your day. It won't even affect your productivity. You will probably spend more time than that discussing sports scores, going to the bathroom, chatting with a client, or any of the hundreds of other things human beings do during a day. It's literally lost in the noise. Nevertheless, some peabrain analyst who likes big numbers will take that 1.5 seconds and multiply it by the millions of other users and their 1.5 seconds, then multiply that by the "national average salary" or some such number.

    So, even though it takes you longer to blow your nose than to delete the virus email, somehow it still ends up "costing the economy" 5x10^6 USD in "lost productivity". The underlying assumptions here are so thoroughly rotten that the result cannot be anything but a joke. Sure as hell though, you'll see this analysis dragged out every time there's a news story--or better yet, a trial--about an email worm.
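For the curious, the analyst's multiplication looks something like the sketch below. Every input is a made-up figure (the user count and hourly wage are assumptions, not numbers from any real report), and the point is only to show how a per-person cost that rounds to a penny becomes a headline number.

```python
def lost_productivity(users, seconds_each, hourly_wage):
    """Return (cost per user, total 'cost to the economy') in dollars."""
    per_user = seconds_each / 3600 * hourly_wage
    return per_user, per_user * users


# All three inputs are invented for illustration.
per_user, total = lost_productivity(users=10_000_000, seconds_each=1.5, hourly_wage=25.0)
print(f"per user: ${per_user:.4f}")
print(f"headline: ${total:,.0f}")
```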

    The real moral of this story isn't about innumeracy in the press, or spotlight seekers exploiting innumeracy. It's about multipliers.

    Suppose you have a decision to make about a particular feature. You can do it the easy way in about a week, or the hard way in about a month. (Hypothetical.) Which way should you do it? Suppose that the easy way makes the user click an extra button, whereas doing it the hard way makes the program a bit smarter and saves the user one click. Just one click. Which way should you do it?

    Let's consider an analogy. Suppose I'm putting a sign up on my building. Is it OK to mount the sign six feet up on the wall, so that pedestrians have to duck or go around it? It's much easier for me to hang the sign if I don't have to set up a ladder and scaffold. It's only a minor annoyance to the pedestrians. It's not like it would block the sidewalk or anything. All they have to do is duck. (We'll just ignore the fact that pissing off all your potential customers is not a good business strategy.)

    It's not ethical to worsen the lives of others, even a small bit, just to make things easy for yourself. These days, successful software is measured in millions of users, of people. Always be mindful of the impact your decisions--even small ones--have on those people. Accept large burdens to ease the burden on those people, even if your impact on any given individual is minuscule. The cumulative good you do that way will always overwhelm the individual costs you pay.

    Debating "Web Services"

    There is a huge and contentious debate under way right now related to "Web services". A sizable contingent of the W3C and various XML pioneers are challenging the value of SOAP, WSDL, and other "Web service" technology.

    This is a nuanced discussion with many different positions being taken by the opponents. Some are critical of the W3C's participation in something viewed as a "pay to play" maneuver from Microsoft and IBM. Others are pointing out serious flaws in SOAP itself. To me, the most interesting challenge comes from the W3C's Technical Architecture Group (TAG). This is the group tasked with defining what the web is and is not. Several members of the TAG, including the president of the Apache Foundation, are arguing that "Web services", as defined by SOAP, fundamentally are not "the web". ("The web" being defined crudely as "things are named via URIs" and "every time I ask for the same URI, I get the same results". My definition, not theirs.) With a "Web service", a URI doesn't name a thing, it names a process. What I get when I ask for a URI is no longer dependent solely on the state of the thing itself. Instead, what I get depends on my path through the application.

    I'd encourage you all to sample this debate, as summarized by Simon St. Laurent (one of the original XML designers).