Wide Awake Developers

« January 2008 | Main | March 2008 »

The Granularity Problem

I spend most of my time dealing with large sites. They're always hungry for more horsepower, especially if they can serve more visitors with the same power draw. Power consumption goes up much faster with more chassis than with more CPU cores. Not to mention, administrative overhead tends to scale with the number of hosts, not the number of cores. For them, multicore is a dream come true.

I ran into an interesting situation the other day, on the other end of the spectrum.

One of my team members was working with a client that had relatively modest traffic levels. They're in a mature industry with a solid, but not rabid, customer base. Their web traffic could easily be served by one Apache server with a single CPU and a couple of gigs of RAM.

The smallest configuration we could offer, and still maintain SLAs, was two hosts, with a total of 8 CPU cores running at 2 GHz, 32 gigs of RAM, and 4 fast Ethernet ports.

Of course that's oversized! Of course it's going to cost more than it should! But at this point in time, if we're talking about dedicated boxes, that's the smallest configuration we can offer! (Barring some creative engineering, like using fully depreciated "classics" hardware that's off its original lease, but still has a year or two before EOL.)

As CPUs get more cores, the minimum configuration is going to become more and more powerful. The quantum of computing is getting larger.

Not every application will need it, and that's another reason I think private clouds make a lot of sense. Companies can buy big boxes, then allocate them to specific applications in fractions. That gains cost efficiency in administration, power, and space consumption (though not heat production!) while still letting business units size their capacity to their actual demand.

Sun Joining the Cloud Crowd

As I was writing my last post, I somehow missed the news that Sun is building their own cloud platform, called Project Caroline.

There's a PDF about it. It appears to be a presentation for JavaOne.  It may be locked down at any minute, so the link might not work by the time you read this.

Caroline looks a lot like Amazon EC2, but with some very nice control over VLANs (I suppose they would be Virtual VLANs?), load balancing policies, and DNS... all things that EC2 lacks today. Using ZFS instead of S3 will make for a more familiar storage model: no trickery needed to make data persist across restarts.

All in all, it looks very nice.

(Hmmm.  On second glance, this presentation is from JavaOne 2007!  Not much of a scoop there, Reg.)

Does anyone know what happened to this project?

A Cloud For Everyone

The trajectory of many high-tech products looks like this:

  1. Very expensive. Only a few exist in the world. They are heavily time-shared, and usually oversubscribed.
  2. Within the reach of institutions and corporations, but not individuals. The organization wants to maximize utilization.
  3. Corporations own many, as productivity enhancers; some wealthy or forward-looking individuals own one. Families time-share theirs.
  4. Virtually everyone has one. To lack one is to fall behind. The technology is no longer a competitive advantage; lacking it puts one at a disadvantage.
  5. Invisibility. Most people have or use several, but are not aware of it.

Depending on your age, you might have been thinking "cell phones", "computers", or even "televisions".  I don't think I have any blog readers old enough to have been thinking "telephones", "telegraphs", or "electric motors", but they all went through the same stages, too.

I feel very comfortable putting "cloud computing" in that list, too. Cloud computing is at stage 1. It's expensive enough that only a few exist in the world: Amazon AWS, Mosso, BungeeConnect, even Force.com. They're shared, multitenant, and soon to be oversubscribed.

One day, I suspect that we'll each have our own computing cloud attending us, formed out of the many computing devices that surround us every day, but I'm getting ahead of myself.

Before that, we'll see enterprises, first large then medium and small, building their own computing clouds.

"Wait a minute," you object. "That misses the whole point of cloud computing. The entire purpose is to not own the infrastructure."

That's true, today. It was also true, at one time, that farmers did not want to own their own steam engines. So, they outsourced the job. Farmers would own machines like threshers that had everything except the troublesome boiler and engine. Those required technical expertise to run, so the farmers left that job up to folks who would bring their steam engine around, hook it up to the thresher, and charge the farmer for the length of time he needed it. As steam engines got cheaper and safer, they eventually got built right into the thresher.

This next part may sound like FUD. It isn't. I like cloud computing. I like virtualization. In fact, I think it's about to revolutionize our industry.

I like it so much that I think every company should have one.

Why should a company build its own cloud, instead of going to one of the providers? Several reasons, some positive, some not so much.

On the positive side, an IT manager running a cloud can finally do real chargebacks to the business units that drive demand. Some do today, but at a coarser grain... whole servers. With a private cloud, the IT manager could charge by the compute-hour, or by the megabit of bandwidth. He could charge for storage by the gigabyte, with tiered rates for different availability/continuity guarantees. Even better, he could allow the business units to do the kind of self-service that I can do today with a credit card and The Planet. (OK, The Planet isn't a cloud provider, but I bet they're thinking about it.  Plus, I like them.)
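Just to make the idea concrete, here's a rough sketch of that kind of metering in Java. Every rate, tier name, and usage number below is invented for illustration; a real chargeback model would come from your own cost accounting.

// Hypothetical chargeback calculator. All rates and tiers are invented
// for illustration, not drawn from any real provider or cost model.
public class Chargeback {
    static final double PER_CPU_HOUR = 0.12;  // dollars per compute-hour
    static final double PER_MEGABIT  = 0.05;  // dollars per megabit transferred

    // Tiered storage: stronger availability/continuity guarantees cost more.
    static double storageRatePerGb(String tier) {
        if ("non-redundant".equals(tier)) return 0.08;
        if ("replicated".equals(tier))    return 0.20;
        if ("continuous".equals(tier))    return 0.45;
        throw new IllegalArgumentException("unknown tier: " + tier);
    }

    static double monthlyBill(double cpuHours, double megabits,
                              double gigabytes, String storageTier) {
        return cpuHours  * PER_CPU_HOUR
             + megabits  * PER_MEGABIT
             + gigabytes * storageRatePerGb(storageTier);
    }

    public static void main(String[] args) {
        // A business unit that used 2,000 compute-hours, moved 50,000 megabits,
        // and kept 500 GB on replicated storage this month.
        System.out.printf("Monthly charge: $%.2f%n",
                monthlyBill(2000, 50000, 500, "replicated"));
    }
}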

I actually think this kind of self-service and fine-grained chargeback could help curb the out-of-control growth in IT spending, but that's a different post.

This would seriously raise the level of discourse. Instead of fighting about server classes, rack space, power consumption, and rampant storage sprawl, IT could talk to the business about levels of service. Does this app need 24x7 performance management with automatic resource allocation to maintain a 2 second response time? Great, we can do that! This other one doesn't need to be fast, but it had better work every single time a transaction goes through? We can do that, too! This application needs user experience monitoring, that database only needs non-redundant storage, because it can be recreated from other sources... it's a better conversation to have than, "No, our corporate standard is WebSphere running on RedHat Enterprise Linux 4, with Dell PowerEdge servers.  You can have any server you want, as long as it's a Dell PowerEdge."

I also think that the gloss will come off of the cloud computing providers. (I know, most people still haven't heard of them yet, but the gloss will inevitably come off.)

Accidents happen. Networks still break, today, and they will in the future too. Power failures happen. How would you defend yourself in a shareholders' lawsuit after millions in losses thanks to a service provider failure? (Actually, that suggests there may be an insurance market developing here. Any time you've got quantifiable risk and someone willing to pay to defray that risk, sure as hell, you'll find insurance companies.)

Service providers get oversubscribed. What happens when your application is slow, and remains slow for months? Having an SLA only means you get some money back; it doesn't mean your problem will get fixed. It's a dirty secret that some service providers are quite happy paying out credits, if they can avoid bigger costs. What's your recourse? Moving to another provider, and that transition costs a lot.

Latency matters. It might matter more today than ever before, since most internal applications have gone to web interfaces. Keeping your endpoints on your own network at least lets you control your own latency. 

Then there's security. Many of my clients are dealing with PCI audits and compliance. I have no idea what they'd say if I suggested moving their data into the cloud. I'm pretty certain I wouldn't still be in the room to hear what they said. I'd probably be standing outside in the rain, trying to catch a cab back to the airport.

Like I said, I'm not trying to FUD cloud computing. I think that it's so good that every company should have one.

There's one more reason I think it makes sense to build internal clouds. I'll talk about that in my next post. 

Outrunning Your Headlights

Agile developers measure their velocity. Most teams define velocity as the number of story points delivered per iteration. Since the size of a "story point" and the length of an iteration vary from team to team, there's not much use in comparing velocity from one team to the next. Instead, the team tracks its own velocity from iteration to iteration.

Tracking velocity has two purposes. The first is estimation. If you know how many story points are left for this release, and you know how many points you complete per iteration, then you know how long it will be until you can release. (This is the "burndown chart".) After two or three iterations, this will be a much better projection of release date than I've ever seen any non-agile process deliver.
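As a back-of-the-envelope sketch, the projection is nothing more than remaining points divided by recent average velocity. The numbers here are invented:

// Minimal burndown projection: remaining story points divided by the team's
// average velocity over its last few iterations. All numbers are invented.
public class BurndownProjection {
    static double averageVelocity(int... pointsPerIteration) {
        int total = 0;
        for (int points : pointsPerIteration) total += points;
        return (double) total / pointsPerIteration.length;
    }

    public static void main(String[] args) {
        int remainingPoints = 120;
        double velocity = averageVelocity(22, 18, 25);  // last three iterations
        int iterationsLeft = (int) Math.ceil(remainingPoints / velocity);
        // With two-week iterations, the projected release is iterationsLeft * 2 weeks out.
        System.out.println("Projected iterations to release: " + iterationsLeft);
    }
}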

The second purpose of velocity tracking is to figure out ways to go faster.

In the iteration retrospective, a team will recalibrate its estimating technique, to see if they can actually estimate the story cards or backlog items. Second, they'll look at ways to accomplish more during an iteration. Maybe that's refactoring part of the code, or automating some manual process. It might be as simple as adding templates to the IDE for commonly recurring code patterns.  (That should always raise a warning flag, since recurring code patterns are a code smell. Some languages just won't let you eliminate them completely, though.  And by "some languages" here, I mainly mean Java.)

Going faster should always be better, right? That means the development team is delivering more value for the same fixed cost, so it should always be a benefit, shouldn't it?

I have an example of a case where going faster didn't matter. To see why, we need to look past the boundaries of the development team.  Developers often treat software requirements as if they come from a sort of ATM: there's an unlimited reserve of requirements, and we just need to decide how many of them to accept into development.

Taking a cue from Lean Software Development, though, we can look at the end-to-end value stream. The value stream is drawn from the customer's perspective. Step by step, the value stream map shows us how raw materials (requirements) are turned into finished goods. "Finished goods" does not mean code. Code is inventory, not finished goods. A finished good is something a customer would buy. Customers don't buy code. On the web, customers are users, interacting with a fully deployed site running in production. For shrink-wrapped software, customers buy a CD, DVD, or installer from a store. Until the inventory is fully transformed into one of these finished goods, the value stream isn't done.

Figure 1 shows a value stream map for a typical waterfall development process. This process has an annual funding cycle, so "inventory" from "suppliers" (i.e., requirements from the business unit) wait, on average, six months to get funded. Once funded and analyzed, they enter the development process. For clarity here, I've shown the development process as a single box, with 100% efficiency. That is, all the time spent in development is spent adding value---as the customer perceives it---to the product. Obviously, that's not true, but we'll treat it as a momentarily convenient fiction. Here, I'm showing a value stream map for a web site, so the final steps are staging and deploying the release.

Value Stream Map of Waterfall Process

Figure 1 - Value Stream Map of a Waterfall Process

This is not a very efficient process. It takes 315 business days to go from concept to cash. Out of that time, at most 30% of it is spent adding value. In reality, if we unpack the analysis and development processes, we'll see that efficiency drop to around 5%.
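The efficiency figure is simply value-add time divided by total lead time. A quick sketch, where the 315 days comes from the map above and the value-add number is a hypothetical consistent with the 30% figure:

// Process cycle efficiency = value-add time / total lead time.
// The 315 business days comes from the value stream map above; the value-add
// figure is a hypothetical consistent with the "at most 30%" statement.
public class CycleEfficiency {
    public static void main(String[] args) {
        double totalLeadTimeDays = 315;
        double valueAddDays = 94;  // hypothetical: the development box in Figure 1
        double efficiency = valueAddDays / totalLeadTimeDays;
        System.out.printf("Process cycle efficiency: %.0f%%%n", efficiency * 100);
        // Roughly 30%. Unpacking analysis and development drops it toward 5%.
    }
}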

From the Theory of Constraints, we know that the throughput of any process is limited by exactly one constraint. An easy way to find the constraint is by looking at queue sizes. In an unoptimized process, you almost always find the largest queue right before the constraint. In the factory environment that ToC came from, it's easy to see the stacks of WIP (work in progress) inventory. In a development process, WIP shows up in SCR systems, requirements spreadsheets, prioritization documents, and so on.
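In code, the heuristic amounts to "find the biggest pile of WIP sitting in front of a stage." The stage names and queue sizes below are invented:

import java.util.LinkedHashMap;
import java.util.Map;

// Locate the likely constraint by finding the largest queue of work-in-progress
// waiting in front of each stage. Stage names and queue sizes are invented.
public class FindConstraint {
    public static void main(String[] args) {
        Map<String, Integer> wipInFrontOf = new LinkedHashMap<String, Integer>();
        wipInFrontOf.put("Analysis", 25);
        wipInFrontOf.put("Development & Testing", 180);  // the big pile
        wipInFrontOf.put("Staging & Deployment", 2);

        String constraint = null;
        int largestQueue = -1;
        for (Map.Entry<String, Integer> entry : wipInFrontOf.entrySet()) {
            if (entry.getValue() > largestQueue) {
                largestQueue = entry.getValue();
                constraint = entry.getKey();
            }
        }
        System.out.println("Likely constraint: " + constraint);
    }
}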

Indeed, if we overlay the queues on that waterfall process, as in Figure 2, it's clear that Development and Testing is the constraint. After Development and Testing completes, Staging and Deployment take almost no time and have no queued inventory.

Waterfall Value Stream, With Queues

Figure 2 - Waterfall Value Stream, With Queues

In this environment, it's easy to see why development teams get flogged constantly to go faster, produce more, catch up.  They're the constraint.

Lean Software Development gives us a set of principles for optimizing the entire value stream.

ToC says to elevate the constraint and subordinate the entire process to the throughput of the constraint. Elevating the constraint---by either going faster with existing capacity, or expanding capacity---adds to throughput, while running the whole process at the throughput of the constraint helps reduce waste and WIP.

In a certain sense, Agile methods can be derived from Lean and ToC.

All of that, though, presupposes a couple of things:

  • Development is the constraint.
  • There's an unlimited supply of requirements.

Figure 3 shows the value stream map for a project I worked on in 2005. This project was to replace an existing system, so at first, we had a large backlog of stories to work on. As we approached feature parity, though, we began to run out of stories. The users had been waiting for this system for so long that they hadn't given much thought, or at least recent thought, to what they might want after the initial release. Shortly after the second release (a minor bug fix), it became clear that we were actually consuming stories faster than they could be produced.

Value Stream of an Agile Process

Figure 3 - Value Stream Map of an Agile Project

On the output side, we ran into the reverse problem. This desktop software would be distributed to hundreds of locations, with over a thousand users who needed to be expert on the software in short order. The internal training group, responsible for creating manuals and computer based training videos, could not keep revising their training modules as quickly as we were able to change the application. We could create new user interface controls, metaphors, and even whole screens much faster than they could create training materials.

Once past the training group, a release had to be mastered and replicated onto installation discs. These discs were distributed to the store locations, where associates would call the operations group for a "talkthrough" of the installation process. Operations has a finite capacity, and can only handle so many installations every day. That set a natural throttle on the rate of releases. At one stage---after I rolled off the project---I know that a release which had passed acceptance testing in October was still in the training group by the following March.

In short, the development team wasn't the constraint. There was no point in running faster. We would exhaust the inventory of requirements and build up a huge queue of WIP in front of training and deployment. The proper response would be to slow down, to avoid the buildup of unfinished inventory.  Creating slack in the workday would be one way to slow down; drawing down the team size would be another. Increasing the capacity of the training team would be a perfectly valid response, too. There are other places to optimize the value stream as well. But the one thing that absolutely wouldn't help would be increasing the development team's velocity.

For nearly the entire history of software development, there has been talk of the "software crisis", the ever-widening gap between government and industry's need for software and the rate at which software can be produced. For the first time in that history, agile methods allow us to move the constraint off of the development team.

Software Failure Takes Down Blackberry Services

Anyone who's addicted to a Blackberry already knows about Monday's four-hour outage. For some of us, the Blackberry isn't just an electronic leash, it's part of our business operations.

Like cell phones, Blackberries have a huge, hidden infrastructure behind them. Corporate Blackberry Enterprise Servers (BES) relay email, calendar, and contact information through RIM's infrastructure, out through the wireless carriers. It was RIM's own infrastructure that suffered from intermittent failures during the outage.

Data Center Knowledge reports that the outage was caused by a failed software upgrade.

Releases are risky. We use testing and QA to reduce the risk, but every line of new or modified code represents an unknown.

How can we reduce the risk of an upgrade? One way is to roll it out slowly. Companies with widely distributed point-of-sale (POS) systems know this. They never push a release out to every store at once. They start with one or two. If that works, they go up to a larger handful, maybe four to eight. After a couple of days, they'll roll it out to an entire district. It can take a week or more to roll the release out everywhere.

In the interim, there are plenty of checkpoints where the release can be rolled back.
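Sketched as a schedule, a rollout like that might look something like this. The wave sizes, soak times, and store counts are illustrative, not anyone's real plan:

// Phased rollout: widen the blast radius only after the previous wave has
// soaked without a rollback. Store counts and soak times are invented.
public class RolloutPlan {
    static final String[] SCOPE     = { "pilot stores", "small handful", "one district", "all stores" };
    static final int[]    STORES    = { 2, 6, 40, 1200 };
    static final int[]    SOAK_DAYS = { 2, 2, 3, 0 };

    public static void main(String[] args) {
        for (int wave = 0; wave < SCOPE.length; wave++) {
            System.out.printf("Wave %d: push to %d stores (%s), soak %d day(s)%n",
                    wave + 1, STORES[wave], SCOPE[wave], SOAK_DAYS[wave]);
            // Checkpoint: if error rates or support calls spike during the soak,
            // roll back here instead of proceeding to the next wave.
        }
    }
}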

I strongly recommend approaching Web site releases the same way. Roll the new release out to one or two servers in your farm. Let a fraction of your customers into the new release. Watch for performance regressions, capacity problems, and functional errors. Absolutely ensure that you can roll it back if you need to. Once it's "baked" for a while in production, then roll it to the remaining app servers.
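One way to let a fraction of customers onto the new release is to make the split at your routing layer. Here's a minimal sketch; the ten-percent figure and host names are made up, and a real router would pin each user to one release by session rather than rolling the dice per request:

import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Minimal canary split: send a fixed fraction of new sessions to the servers
// running the new release, the rest to the old release. The fraction and the
// host names are invented for illustration.
public class CanaryRouter {
    private final List<String> oldRelease = Arrays.asList("app01", "app02", "app03", "app04");
    private final List<String> newRelease = Arrays.asList("app05");  // the canary box
    private final double canaryFraction = 0.10;                      // 10% of sessions
    private final Random random = new Random();

    String chooseServer() {
        List<String> pool = (random.nextDouble() < canaryFraction) ? newRelease : oldRelease;
        return pool.get(random.nextInt(pool.size()));
    }

    public static void main(String[] args) {
        CanaryRouter router = new CanaryRouter();
        for (int session = 1; session <= 5; session++) {
            System.out.println("Session " + session + " -> " + router.chooseServer());
        }
    }
}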

This approach demands a few corollaries. First, your database updates have to be structured in a forward-compatible way, and they must always allow for rollback. There can be no irrevocable updates. Second, two versions of your software will be operating simultaneously. That means your integration protocols and static assets have to be able to accommodate both versions. I discuss specific strategies for each of these aspects in Release It.
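For the two-versions-at-once problem, the consuming side has to tolerate both the old and the new shape of the data. A sketch of that idea, with invented field names (this is not the specific strategy from the book, just one way to picture it):

import java.util.HashMap;
import java.util.Map;

// Version-tolerant message handling: ignore fields you don't recognize and
// default the fields an older producer never sent, so releases N and N+1 can
// run side by side. The field names here are invented.
public class TolerantOrderReader {
    static Map<String, String> readOrder(Map<String, String> message) {
        Map<String, String> order = new HashMap<String, String>();
        order.put("orderId", message.get("orderId"));
        order.put("total", message.get("total"));
        // Optional field added in the newer release; default it when absent.
        String giftWrap = message.containsKey("giftWrap") ? message.get("giftWrap") : "false";
        order.put("giftWrap", giftWrap);
        // Any extra fields the newer release added are simply ignored.
        return order;
    }

    public static void main(String[] args) {
        Map<String, String> fromOldRelease = new HashMap<String, String>();
        fromOldRelease.put("orderId", "1001");
        fromOldRelease.put("total", "59.95");
        System.out.println(readOrder(fromOldRelease));
    }
}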

Finally, an aside: RIM's statement about the outage isn't reflected anywhere on their site. Once again, if what you want is the latest true information about a company, the very last place to find it is the company's own web site.

Tim Ross' C# Circuit Breaker

Tim Ross has published his implementation of the Circuit Breaker pattern from Release It, complete with unit tests.

I barely speak C#, so I'm not in any position to review his implementation, but I'm delighted to see it!
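For the Java speakers in the audience, the shape of the pattern fits in a few dozen lines. This is only a sketch of the state machine; the threshold, the timeout, and the names are invented, and a production version needs rather more care (see the book, or Tim's code):

import java.util.concurrent.Callable;

// Minimal Circuit Breaker sketch: after too many consecutive failures the
// breaker opens and fails fast; after a cool-off period it lets one trial
// call through. Thresholds and timings are invented for illustration.
public class SimpleCircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openTimeoutMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    public SimpleCircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    public synchronized <T> T call(Callable<T> protectedCall) throws Exception {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= openTimeoutMillis) {
                state = State.HALF_OPEN;  // allow a single trial call
            } else {
                throw new IllegalStateException("circuit open: failing fast");
            }
        }
        try {
            T result = protectedCall.call();
            consecutiveFailures = 0;      // success closes the breaker again
            state = State.CLOSED;
            return result;
        } catch (Exception failure) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;       // trip the breaker
                openedAt = System.currentTimeMillis();
            }
            throw failure;
        }
    }
}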

The Pragmatic Architect on Security

Catching up on some reading, I finally got a chance to read Ted Neward's article "Pragmatic Architecture: Security".  It's very good.  (Actually, the whole series is pretty good, and I recommend them all.  At least as of February 2008... I make no assertions about future quality!)

Ted nails it.  I agree with all of the principles he identifies, and I particularly like his advice to "fail securely".

I would add one more, though: Be visible.

After any breach, the three worst questions are always:

  1. How long has this been happening?
  2. How much have we lost?
  3. Why didn't we know about it sooner?

The answers are always, respectively, "Far too long", "We have no idea", and "We didn't expect that exploit". To which the only possible response is, "Well, duh, if you'd expected it, you would have closed the vulnerability."

Successful exploits are always successful because they stay hidden. Are you sure that nobody's in your systems right now, leaching data, stealing credit card numbers, or stealing products? Of course not. For a vivid case in point, google "Kerviel Societe Generale".

While you cannot prove a negative, you can improve your odds of detecting nefarious activity by making sure that everything interesting is logged. (And by "interesting", I mean "potentially valuable".)

There are some pretty spiffy event correlation tools out there these days. They can monitor logs across hundreds of servers and network devices, extracting patterns of anomalous behavior. But, they only work if your application exposes data that could indicate a breach.

For example, you might not be able to log every login attempt, but you probably should log every admin login attempt.
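Even something that small gives those correlation tools a signal to chew on. Here's a sketch using nothing but java.util.logging; the logger name, event name, and field layout are just an example, not any standard format:

import java.util.logging.Logger;

// Sketch of an audit trail for admin logins, using only java.util.logging.
// The logger name, event name, and field layout are an invented example.
public class AdminLoginAudit {
    private static final Logger AUDIT = Logger.getLogger("audit.security");

    public static void recordAdminLogin(String username, String remoteAddress, boolean succeeded) {
        // One line per attempt, structured enough for an event-correlation tool to parse.
        AUDIT.info(String.format("event=admin_login user=%s source=%s outcome=%s",
                username, remoteAddress, succeeded ? "success" : "failure"));
    }

    public static void main(String[] args) {
        recordAdminLogin("ops_admin", "10.1.2.3", false);
        recordAdminLogin("ops_admin", "10.1.2.3", true);
    }
}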

Or, you might consider logging every price change. (I shudder to think about collusion between a merchant with pricing control and an outside buyer.  Imagine a 10-minute sale on laptops: 90% off, for those ten minutes only.)

If your internal web service listens on a port, then it should only accept connections from known sources. Whether you enforce that through IPTables, a hardware firewall, or inside the application itself, make sure you're logging refused connections.
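Enforced inside the application itself, that might look something like the sketch below. The port number and the allowlist addresses are placeholders:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.logging.Logger;

// Sketch: accept connections only from known sources and log every refusal.
// The port and the allowlist are placeholders for illustration.
public class RestrictedListener {
    private static final Logger SECURITY = Logger.getLogger("audit.security");
    private static final Set<String> ALLOWED_SOURCES =
            new HashSet<String>(Arrays.asList("10.1.2.10", "10.1.2.11"));

    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(8080);
        while (true) {
            Socket client = server.accept();
            String source = client.getInetAddress().getHostAddress();
            if (!ALLOWED_SOURCES.contains(source)) {
                SECURITY.warning("event=refused_connection source=" + source);
                client.close();   // refuse unknown sources
                continue;
            }
            handle(client);       // known source: hand off to the real service code
        }
    }

    private static void handle(Socket client) throws IOException {
        client.close();           // placeholder for the actual request handling
    }
}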

Then, of course, once you're logging the data, make sure someone's monitoring it and keeping pattern and signature definitions up to date!