Wide Awake Developers

Workmen, tools, etc.

We've all heard the old saw, "It's a poor workman that blames his tools." Let's think about that for a minute. Does it actually mean that a skilled craftsman can do great work with shoddy implements?

Well, can a chef make a souffle with a skillet?

Can a cabinetmaker round an edge with dull router bits?

I'm not going to rule it out. Perhaps there's a brilliant chef who---at this very moment---is preparing to introduce the world to the "skiffle." And, it's possible that one could coax a dull router into making a better quarter round through care, attention, and good speed control.

Going by the odds, though, I'd bet on scrambled eggs and splinters.

Like a lot of old sayings, this one doesn't make much sense in its usual interpretation. Most people take this proverb to mean that you should be able to turn out top-notch work with whatever tools you're given. It's an excuse for bad tools, or for a lack of interest in improving them.

This homily dates back to a time when workers would bring their own tools to the job, leading to the popular origin story for the phrase "getting sacked". (No comments about møøse bites, please.) Some crafts have evaded the assembly line, and in those, craftsmen still bring their own tools. Chefs bring their prized knives. Fine carpenters bring their own hand and bench tools.

There is a grain of truth in the common interpretation that good tools don't make a good workman. There's another level of truth under the surface, though. The 13th Century French version of this saying translates as, "A bad workman will never find a good tool." I like this version a lot better. Tools cannot make one good, but bad tools can hurt a good worker's performance. That sounds a lot less like "quit whining and use whatever's at hand," doesn't it?

On the other hand, if you supply your own tools, you're not as likely to tolerate bad ones, are you? I think this is the most important interpretation. Good workers---if given the choice---will select the best tools and keep them sharp.

Article on Building Robust Messaging Applications

I've talked before about adopting a failure-oriented mindset. That means you should expect every component of your system or application to someday fail. In fact, they'll usually fail at the worst possible times.

When a component does fail, whatever unit of work it's processing at the time will most likely be lost. If that unit of work is backed up by a transactional database, well, you're in luck. The database will do its Omega-13 bit on the transaction and it'll be like nothing ever happened.

Of course, if you've got more than one database, then you either need two-phase commit or pixie dust. (OK, compensating transactions can help, too, but only if the thing that failed isn't the thing that would do the compensating transaction.)

I don't favor distributed transactions, for a lot of reasons. They're not scalable, and I find that the risk of deadlock goes way up when you've got multiple systems accessing multiple databases. Yes, uniform lock ordering will prevent that, but I never want to trust my own application's stability to good coding practices in other people's apps.

Besides, enterprise integration through database access is just... icky.

Messaging is the way to go. Messaging offers superior scalability, better response time to users, and better resilience against partial system failures. It also provides enough spatial, temporal, and logical decoupling between systems that you can evolve the endpoints independently.

Udi Dahan has published an excellent article with several patterns for robust messaging. It's worth reading, and studying. He addresses the real-world issues you'll encounter when building messaging apps, such as giant messages clogging up your queue, or "poison" messages that sit in the front of the queue causing errors and subsequent rollbacks.
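As a sketch of the poison-message idea, here's a minimal handler loop in Python. The message shape, retry limit, and dead-letter list are all assumptions for illustration, not any particular broker's API: after a few failed attempts, a message is parked in a dead-letter queue instead of sitting at the front of the main queue forcing rollbacks.

```python
from collections import deque

MAX_ATTEMPTS = 3

def drain(queue, handler, dead_letters):
    """Process messages; park repeat offenders in the dead-letter queue."""
    processed = []
    while queue:
        msg = queue.popleft()
        try:
            processed.append(handler(msg))
        except ValueError:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(msg)   # give up: park it for inspection
            else:
                queue.append(msg)          # retry later, from the back
    return processed

def handler(msg):
    """A stand-in for real message processing; 'poison' never parses."""
    if msg["body"] == "poison":
        raise ValueError("cannot parse")
    return msg["body"].upper()

queue, dlq = deque([{"body": "poison"}, {"body": "ok"}]), []
print(drain(queue, handler, dlq))  # ['OK']
print(len(dlq))                    # 1
```

Real brokers give you pieces of this for free (redelivery counts, dead-letter exchanges), but the shape of the logic is the same.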

New Article: S2AP + Eclipse + Maven walkthrough

See Getting Started With SpringSource Application Platform, Eclipse, and Maven.

Most of the information out there about programming in S2AP is in blogs or references to really old OSGi tutorials. It took me long enough to configure some basic Eclipse project support that I figured it was worth writing down. All of the frameworks and tool sets are very flexible, which means you have more choices to deal with when setting up a project. Sometimes, being concrete helps... there may be a lot of options, but when it's time to do a project, you only care about one set of choices for those options. This guide is completely specific to using Eclipse to write bundle projects for SpringSource Application Platform.

If that's your specific set of needs, great! If not, that's OK too; the beauty of the Web is that somebody else will have a tutorial on your exact combination.

Outrunning Your Headlights

Agile developers measure their velocity. Most teams define velocity as the number of story points delivered per iteration. Since the size of a "story point" and the length of an iteration vary from team to team, there's not much use in comparing velocity from one team to the next. Instead, the team tracks its own velocity from iteration to iteration.

Tracking velocity has two purposes. The first is estimation. If you know how many story points are left for this release, and you know how many points you complete per iteration, then you know how long it will be until you can release. (This is the "burndown chart".) After two or three iterations, this will be a much better projection of release date than I've ever seen any non-agile process deliver.
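The projection itself is just arithmetic: remaining story points divided by average velocity over recent iterations. A minimal sketch (the point totals and velocities are made-up numbers):

```python
import math

def iterations_remaining(points_left, recent_velocities):
    """Project how many iterations until release, given recent velocity."""
    avg_velocity = sum(recent_velocities) / len(recent_velocities)
    return math.ceil(points_left / avg_velocity)

# 120 points left, last three iterations delivered 18, 22, and 20 points.
print(iterations_remaining(120, [18, 22, 20]))  # 6
```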

The second purpose of velocity tracking is to figure out ways to go faster.

In the iteration retrospective, a team will first recalibrate its estimating technique, to see whether they can accurately estimate the story cards or backlog items. Second, they'll look at ways to accomplish more during an iteration. Maybe that's refactoring part of the code, or automating some manual process. It might be as simple as adding templates to the IDE for commonly recurring code patterns. (That should always raise a warning flag, since recurring code patterns are a code smell. Some languages just won't let you eliminate them completely, though. And by "some languages" here, I mainly mean Java.)

Going faster should always be better, right? That means the development team is delivering more value for the same fixed cost, so it should always be a benefit, shouldn't it?

I have an example of a case where going faster didn't matter. To see why, we need to look past the boundaries of the development team. Developers often treat software requirements as if they come from a sort of ATM; there's an unlimited reserve of requirements, and we just need to decide how many of them to accept into development.

Taking a cue from Lean Software Development, though, we can look at the end-to-end value stream. The value stream is drawn from the customer's perspective. Step by step, the value stream map shows us how raw materials (requirements) are turned into finished goods. "Finished goods" does not mean code. Code is inventory, not finished goods. A finished good is something a customer would buy. Customers don't buy code. On the web, customers are users, interacting with a fully deployed site running in production. For shrink-wrapped software, customers buy a CD, DVD, or installer from a store. Until the inventory is fully transformed into one of these finished goods, the value stream isn't done.

Figure 1 shows a value stream map for a typical waterfall development process. This process has an annual funding cycle, so "inventory" from "suppliers" (i.e., requirements from the business unit) wait, on average, six months to get funded. Once funded and analyzed, they enter the development process. For clarity here, I've shown the development process as a single box, with 100% efficiency. That is, all the time spent in development is spent adding value---as the customer perceives it---to the product. Obviously, that's not true, but we'll treat it as a momentarily convenient fiction. Here, I'm showing a value stream map for a web site, so the final steps are staging and deploying the release.

Value Stream Map of Waterfall Process

Figure 1 - Value Stream Map of a Waterfall Process

This is not a very efficient process. It takes 315 business days to go from concept to cash. Out of that time, at most 30% of it is spent adding value. In reality, if we unpack the analysis and development processes, we'll see that efficiency drop to around 5%.

From the Theory of Constraints, we know that the throughput of any process is limited by exactly one constraint. An easy way to find the constraint is by looking at queue sizes. In an unoptimized process, you almost always find the largest queue right before the constraint. In the factory environment that ToC came from, it's easy to see the stacks of WIP (work in progress) inventory. In a development process, WIP shows up in SCR systems, requirements spreadsheets, prioritization documents, and so on.
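That "look for the biggest queue" heuristic is easy to state in code. A tiny sketch, with hypothetical step names and WIP counts:

```python
def find_constraint(queues):
    """queues maps each process step to the WIP piled up in front of it.

    The step with the largest queue is the likely constraint."""
    return max(queues, key=queues.get)

# Hypothetical WIP counts for the waterfall process in Figure 1.
wip = {"Analysis": 40, "Development & Testing": 180, "Staging": 2, "Deploy": 0}
print(find_constraint(wip))  # Development & Testing
```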

Indeed, if we overlay the queues on that waterfall process, as in Figure 2, it's clear that Development and Testing is the constraint. After Development and Testing completes, Staging and Deployment take almost no time and have no queued inventory.

Waterfall Value Stream, With Queues

Figure 2 - Waterfall Value Stream, With Queues

In this environment, it's easy to see why development teams get flogged constantly to go faster, produce more, catch up.  They're the constraint.

Lean Software Development has ten simple rules to optimize the entire value stream.

ToC says to elevate the constraint and subordinate the entire process to the throughput of the constraint. Elevating the constraint---by either going faster with existing capacity, or expanding capacity---adds to throughput, while running the whole process at the throughput of the constraint helps reduce waste and WIP.

In a certain sense, Agile methods can be derived from Lean and ToC.

All of that, though, presupposes a couple of things:

  • Development is the constraint.
  • There's an unlimited supply of requirements.
Figure 3 shows the value stream map for a project I worked on in 2005. This project was to replace an existing system, so at first, we had a large backlog of stories to work on. As we approached feature parity, though, we began to run out of stories. The users had been waiting for this system for so long that they hadn't given much thought, or at least recent thought, to what they might want after the initial release. Shortly after the second release (a minor bug fix), it became clear that we were actually consuming stories faster than they could be produced.

Value Stream of an Agile Process

Figure 3 - Value Stream Map of an Agile Project

On the output side, we ran into the reverse problem. This desktop software would be distributed to hundreds of locations, with over a thousand users who needed to be expert on the software in short order. The internal training group, responsible for creating manuals and computer based training videos, could not keep revising their training modules as quickly as we were able to change the application. We could create new user interface controls, metaphors, and even whole screens much faster than they could create training materials.

Once past the training group, a release had to be mastered and replicated onto installation discs. These discs were distributed to the store locations, where associates would call the operations group for a "talkthrough" of the installation process. Operations has a finite capacity, and can only handle so many installations every day. That set a natural throttle on the rate of releases. At one stage---after I rolled off the project---I know that a release which had passed acceptance testing in October was still in the training group by the following March.

In short, the development team wasn't the constraint. There was no point in running faster. We would exhaust the inventory of requirements and build up a huge queue of WIP in front of training and deployment. The proper response would be to slow down, to avoid the buildup of unfinished inventory. Creating slack in the workday would be one way to slow down, but drawing down the team size would be another perfectly valid response. Another would be to increase the capacity of the training team. There are other places to optimize the value stream, too. But the one thing that absolutely wouldn't help would be increasing the development team's velocity.

For nearly the entire history of software development, there has been talk of the "software crisis": the ever-widening gap between government and industry's need for software and the rate at which software can be produced. For the first time in that history, agile methods allow us to move the constraint off of the development team.

Two Ways To Boost Your Flagging Web Site

Being fast doesn't make you scalable. But it does mean you can handle more load with your current infrastructure. Take a look at this diagram of request handlers.

13 Threads Needed When Requests Take 700ms

You can see that it takes 13 request-handling threads to process this amount of load. In the next diagram, the requests arrive at the same rate, but it takes just 200 milliseconds to answer each one.

3 Threads Needed When Requests Take 200ms

Same load, but only 3 request handlers are needed at a time. So, shortening the processing time means you can handle more transactions during the same unit of time.
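The relationship behind these diagrams is Little's law: average concurrency equals arrival rate times processing time. Here is a quick sketch. The arrival rate is an assumption, and real thread counts also depend on burstiness, so the diagrams' exact figures won't fall straight out of the formula:

```python
import math

def threads_needed(arrival_rate_per_sec, processing_time_sec):
    """Little's law: L = lambda * W, rounded up to whole threads.

    This is the steady-state average, not a hard bound; bursty
    traffic needs headroom above this number."""
    return math.ceil(arrival_rate_per_sec * processing_time_sec)

# Assuming roughly 18 requests per second:
print(threads_needed(18.0, 0.700))  # 13
print(threads_needed(18.0, 0.200))  # 4
```

Either way, cutting processing time by 3.5x cuts the required concurrency by the same factor.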

Suppose your site is built on the classic "six-pack" architecture shown below. As your traffic grows and the site slows, you're probably looking at adding more oomph to the database servers. Scaling that database cluster up gets expensive very quickly. Worse, you have to bulk up both guns at once, because each one still has to be able to handle the entire load. So you're paying for big boxes that are guaranteed to be 50% idle.

Classic Six Pack

Let's look at two techniques almost any site can use to speed up requests, without having the Hulk Hogan and Andre the Giant of databases lounging around in your data center.

Cache Farms

Cache farming doesn't mean armies of Chinese gamers stomping rats and making vests. It doesn't involve registering a ton of domain names, either.

Pretty much every web app is already caching a bunch of things at a bunch of layers. Odds are, your application is already caching database results, maybe as objects or maybe just query results. At the top level, you might be caching page fragments. HTTP session objects are nothing but caches. The net result of all this caching is a lot of redundancy. Every app server instance has a bunch of memory devoted to caching. If you're running multiple instances on the same hosts, you could be caching the same object once per instance.

Caching is supposed to speed things up, right? Well, what happens when those app server instances get short on memory? Those caches can tie up a lot of heap space. If they do, then instead of speeding things up, the caches will actually slow responses down as the garbage collector works harder and harder to free up space.

So what do we have? If there are four app instances per host, then a frequently accessed object---like a product featured on the home page---will be duplicated eight times. Can we do better? Well, since I'm writing this article, you might suspect the answer is "yes". You'd be right.

The caches I've described so far are in-memory, internal caches. That is, they exist completely in RAM and each process uses its own RAM for caching. There exist products, commercial and open-source, that let you externalize that cache. By moving the cache out of the app server process, you can access the same cache from multiple instances, reducing duplication. By getting those objects out of the heap, you can make the app server heap smaller, which will also reduce garbage collection pauses. If you make the cache distributed, as well as external, then you can reduce duplication even further.
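The cache-aside pattern those products support looks roughly like this. A dict stands in for the external cache daemon, and the names are invented for illustration:

```python
class ExternalCache:
    """Stand-in for an external cache daemon shared by all app instances."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

def load_product(cache, db, product_id):
    """Cache-aside: check the shared cache first, fall through to the DB."""
    product = cache.get(product_id)
    if product is None:                  # miss: hit the database once
        product = db[product_id]
        cache.set(product_id, product)   # now every instance can find it
    return product

shared = ExternalCache()                 # one cache, many app instances
db = {"sku-42": {"name": "Widget", "price": 9.99}}
print(load_product(shared, db, "sku-42")["name"])  # Widget (from the DB)
print(load_product(shared, db, "sku-42")["name"])  # Widget (from the cache)
```

Because every instance talks to the same cache, the hot object lives once in the cache farm instead of once per process heap.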

External caching can also be tweaked and tuned to help deal with "hot" objects. If you look at the distribution of accesses by ID, odds are you'll observe a power law. That means the popular items will be requested hundreds or thousands of times as often as the average item. In a large infrastructure, making sure that the hot items are on cache servers topologically near the application servers can make a huge difference in time lost to latency and in load on the network.

External caches are subject to the same kind of invalidation strategies as internal caches. On the other hand, when you invalidate an item from each app server's internal cache, they're probably all going to hit the database at about the same time. With an external cache, only the first app server hits the database. The rest will find that it's already been re-added to the cache.
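That "only the first server reloads" behavior can be sketched with a lock around the miss path: one loader repopulates the cache while the rest wait, then read the fresh entry. This illustrates the idea, not any specific product's protocol:

```python
import threading

class SingleFlightCache:
    """After invalidation, exactly one caller reloads; the rest reuse it."""
    def __init__(self, loader):
        self._loader = loader            # stand-in for a database query
        self._store = {}
        self._lock = threading.Lock()
        self.db_hits = 0

    def get(self, key):
        value = self._store.get(key)
        if value is not None:
            return value
        with self._lock:                 # only one thread loads at a time
            if key not in self._store:   # re-check after waiting on the lock
                self._store[key] = self._loader(key)
                self.db_hits += 1
            return self._store[key]

cache = SingleFlightCache(loader=lambda k: f"row-for-{k}")
threads = [threading.Thread(target=cache.get, args=("sku-42",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cache.db_hits)  # 1 -- eight concurrent misses, one database hit
```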

External cache servers can run on the same hosts as the app servers, but they are often clustered together on hosts of their own. Hence, the cache farm.

Six Pack With Cache Farm

If the external cache doesn't have the item, the app server hits the database as usual. So I'll turn my attention to the database tier.

Read Pools

The toughest thing for any database to deal with is a mixture of read and write operations. The write operations have to create locks and, if transactional, locks across multiple tables or blocks. If the same tables are being read, those reads will have highly variable performance, depending on whether a read operation randomly encounters one of the locked rows (or pages, blocks, or tables, depending).

But the truth is that your application almost certainly does more reads than writes, probably to an overwhelming degree. (Yes, there are some domains where writes exceed reads, but I'm going to momentarily disregard mindless data collection.) For a travel site, the ratio will be about 10:1. For a commerce site, it will be from 50:1 to 200:1. There are a lot of variables here, especially when you start doing more effective caching, but even then, the ratios are highly skewed.

When your database starts to get that middle-age paunch and it just isn't as zippy as it used to be, think about offloading those reads. At a minimum, you'll be able to scale out instead of up. Scaling out with smaller, consistent, commodity hardware pleases everyone more than forklift upgrades. In fact, you'll probably get more performance out of your writes once all that pesky read I/O is off the write master.

How do you create a read pool? Good news! It uses nothing more than built-in replication features of the database itself. Basically, you just configure the write master to ship its archive logs (or whatever your DB calls them) to the read pool databases. They spin up the logs to bring their state into synch with the write master.
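At the application level, the read pool shows up as routing: writes go to the master, reads round-robin across the replicas. Here is a sketch with invented connection names; real code would hold driver connections and route any transaction that mixes reads and writes to the master:

```python
import itertools

class ReadPoolRouter:
    """Send writes to the master, spread reads across the replica pool."""
    def __init__(self, master, replicas):
        self._master = master
        self._replicas = itertools.cycle(replicas)  # round-robin iterator

    def connection_for(self, sql):
        verb = sql.lstrip().split()[0].upper()
        if verb == "SELECT":
            return next(self._replicas)   # reads: next replica in the pool
        return self._master               # writes: always the master

router = ReadPoolRouter("master", ["replica-1", "replica-2"])
print(router.connection_for("SELECT * FROM products"))            # replica-1
print(router.connection_for("SELECT * FROM products"))            # replica-2
print(router.connection_for("UPDATE products SET price = 9.99"))  # master
```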

Six Pack With Cache Farm and Read Pool

By the way, for read pooling, you really want to avoid database clustering approaches. The overhead needed for synchronization cancels out the benefits of read pooling in the first place.

At this point, you might be objecting, "Wait a cotton-picking minute! That means the read machines are garun-damn-teed to be out of date!" (That's the Foghorn Leghorn version of the objection. I'll let you extrapolate the Tony Soprano and Geico Gecko versions yourself.) You would be correct. The read machines will always reflect an earlier point in time.

Does that matter?

To a certain extent, I can't answer that. It might matter, depending on your domain and application. But in general, I think it matters less often than it seems. I'll give you an example from the retail domain that I know and love so well. Take a look at this product detail page from BestBuy.com. How often do you think each data field on that page changes? Suppose there is a pricing error that needs to be corrected immediately (for some definition of immediately). What's the total latency before that pricing error will be corrected? Let's look at the end-to-end process.

1. A human detects the pricing error.
2. The observer notifies the responsible merchant.
3. The merchant verifies that the price is in error and determines the correct price.
4. Because this is an emergency, the merchant logs in to the "fast path" system that bypasses the nightly batch cycle.
5. The merchant locates the item and enters the correct price.
6. She hits the "publish" button.
7. The fast path system connects to the write master in production and updates the price.
8. The read pool receives the logs with the update and applies them.
9. The read pool process sends a message to invalidate the item in the app servers' caches.
10. The next time users request that product detail page, they see the correct price.

That's the best-case scenario! In the real world, the merchant will be in a meeting when the pricing error is found. It may take a phone call or lookup from another database to find out the correct price. There might be a quick conference call to make the decision whether to update the price or just yank the item off the site. All in all, it might take an hour or two before the pricing error gets corrected. Whatever the exact sequence of events, odds are that the replication latency from the write master to the read pool is the very least of the delays.

Most of the data is much less volatile or critical than the price. Is an extra five minutes of latency really a big deal? When it can save you a couple of hundred thousand dollars on giant database hardware?

Summing It Up

The reflexive answer to scaling is, "Scale out at the web and app tiers, scale up in the data tier." I hope this shows that there are other avenues to improving performance and capacity.

References

For more on read pooling, see Cal Henderson's excellent book, Building Scalable Web Sites: Building, scaling, and optimizing the next generation of web applications.

The most popular open-source external caching framework I've seen is memcached. It's a flexible, multi-lingual caching daemon.

On the commercial side, GigaSpaces provides distributed, external, clustered caching. It adapts to the "hot item" problem dynamically to keep a good distribution of traffic, and it can be configured to move cached items closer to the servers that use them, reducing network hops to the cache.

ITIL and XP

The Agile Manifesto is explicit about it. "We value individuals and interactions over processes and tools." How should an Agile team---more specifically, an XP team---respond to the IT Infrastructure Library (ITIL), then? After all, ITIL takes seven books just to define the customizable framework for the actual practices. An IT organization usually takes at least seven more binders to define its actual processes.

Can XP and ITIL coexist in the same building, or is XP just incompatible with ITIL? In short: no, they're not incompatible.

ITIL and XP (or agile in general) are not fundamentally incompatible, but there will definitely be an interface between the XP world and the ITIL world. Whether this interface becomes an impedance barrier or not depends entirely on the way that your company chooses to implement ITIL.

I'll run down the Service Support processes and identify some of the problems I've encountered. (I'm focusing on Service Support because businesses tend to implement these processes first. Few of them get far enough down the road to really attack the Service Delivery processes. It's a shame, because I see a lot of value in the Service Delivery approach.) I will cover the Service Delivery processes in a future article.

Service Desk

An effective service desk can be a great asset to any team, including an XP team. Getting accurate feedback on issues your users are having can only benefit your development efforts and, ultimately, the users themselves. The key here is to make sure that the service desk is well-prepared to accept responsibility for support calls on your app.

I strongly recommend that you start working with the service desk at least six weeks before your first application release. If the service desk is mature, they'll have job aids for capturing app support needs. These will provide the minimum initial information needed for the knowledge base. The service desk personnel will augment that knowledge base over time with whatever solutions, rumors, superstitions and folk remedies they come up with. Be sure you have access to the knowledge base, so you can help weed out the "false solutions."

You also want to get on the distribution list for ticket reports from the service desk. These will tell you what issues your users are encountering. Commonly recurring or high-impact issues should become cards for consideration in your next iteration. This feeds your interface to the Problem Management process.

If the service desk is not mature, you haven't prepared them well, or they do not perform resolution for application incidents, you will be looped in as part of the Incident Management process, below. This has some special challenges.

Incident Management

ITIL defines an "incident" as any disruption to the normal operation of a system or application. This includes bugs, outages, and even "PEBKAC" problems. The Incident Management process begins with notification of an incident. This can be logged by the service desk in response to a user call. It can even be automatically created by a monitoring system. It ends when normal functioning of the system is restored.

Note that this does not include root cause analysis or correction! Incident Management is all about restoring service.

Ideally, the service desk handles the entire Incident Management process and your team will not even need to be involved. In less ideal cases, you may be called on to help resolve "novel" incidents---ones that do not have a solution in the service desk's knowledge base.

When incidents come into the development room, you have some negative forces to deal with. By definition, the incident needs to be resolved expeditiously, making it both interrupt-driven and urgent. Therefore, every incident will automatically split a pair and take somebody off their card. This is damaging to flow.

In worse cases, the entire team may get derailed and start huddling around the incident. Fire-fighting is exciting quadrant I work. It's natural to get a rush from being the hero. The problem is obvious, though. If the entire team is chasing the incident, nobody is making forward progress on the iteration. If you have a large user community or a lot of incidents, you can lose an entire day---or an entire iteration---before you realize it.

This can be exacerbated if your service desk never resolves application support incidents. In such cases, I recommend the "Designated Sacrifice" pattern. Assign one member of the team to handle the "Bat-Phone" calls and be the primary point of contact for incident resolution. This is a crappy job---you get pulled away constantly, can't maintain focus, get almost no card work done---so you'll want to rotate that position frequently. (On the other hand, there is that hero factor that provides some consolation.) Even doing it for one full iteration can be very demoralizing.

Problem Management

Recurring incidents can be identified as Problems that require correction. This is the job of the Problem Management process.

Identifying a Problem is often done by the service desk, but it can also come from other quarters. The decision about which Problems require correction often becomes very slow and bureaucratic. This is a process you want to work with very closely. Problem Management typically tolerates a much higher level of outstanding defects than an XP team wants to allow. I've seen teams get chewed out for fixing Problems that weren't scheduled to be addressed for a couple of iterations! Imagine how surreal that meeting feels!

Problem managers should be encouraged to write cards. Your team should even reserve a fraction of your velocity in each iteration just to handle Problems. You also need to communicate back to the problem managers when Problem cards are completed. Really good Problem Management identifies a few problem states such as "known problem", "known workaround", and "known solution". An XP team will typically move through these states pretty quickly.

Bear in mind that the ITIL definition of Problem Management is all about oversight, not the actual changes needed to fix the problem. The actual changes are deployed as part of Release Management.

Change Management

No part of ITIL gives more people cold sweats than Change Management. This is the process that so easily slips into heavyweight bureaucracy or, worse, meaningless CAB meetings.

Change Management as defined simply means tracking changes, their impact on configuration items, and ensuring that changes are applied in an orderly way. It doesn't have to hurt.

In reality, however, XP teams will spend a lot of time preparing for change advisory board meetings. Beware: the XP team may get a bad reputation for creating "too much" change.

I recommend standardizing your change and deployment process. Get into a regular rhythm of releases and deployments so the CAB just knows to expect that every third Tuesday (or whenever), your team will have a deployment. Standardize the deployment mechanics and system impact statement so you can templatize and re-use your change requests. Familiarity will create confidence with the CAB. Constantly showing them change requests they've never seen before will raise their level of scrutiny.

Failed changes also trigger more scrutiny. Your XP team will have an advantage here, because your rigorous approach to automated testing will reduce the incidence of failed changes, right?

Configuration Management

Configuration Management is *not* the act of changing configuration items. It's the process for tracking planned, executed, and retired configurations. As you plan each release, you should identify the CIs that will be affected by the release.

In a well-executed ITIL rollout, configuration management is vital for change management, incident management, the service desk, and release management. In a poorly-executed ITIL rollout, configuration management doesn't exist, or it only addresses servers or network devices.

CM should cover servers, network topology, applications, business processes, documentation, and the dependencies among all of them. That way, a proposed change to one CI (e.g., an upgrade to the front-end firewalls) can be analyzed for its impact on the others. This is CM nirvana, seldom achieved.

The XP team should have an advantage here again, because you've already broken story cards down to tasks at the beginning of an iteration. That means you already know which applications and servers will be changed in that iteration. Roll up a few iterations into a release, and the CIs affected by the release should be well known.

On the other hand, if you've taken XP to its "no documentation" extreme, then you will not have tracked the CIs touched by each iteration. This underscores a common misinterpretation of XP; it doesn't eschew all documentation, just the documentation that doesn't add value from the customer's perspective. So, does tracking changes against CIs add value from the customer's perspective? Not directly, no. There is an indirect benefit, in that the customer will receive better uptime and performance, but that may seem remote to the team. The best I can say is that this is one place where you'll have to chalk it up to "necessary overhead".

    Release Management

    This is an easy one to integrate with your XP team. Release Management dovetails quite naturally with XP's release planning cycle. Engage early, though, because the ITIL process will likely require longer lead times than your team is used to.

    Coach and Team From Same Firm

    Is it an antipattern to have a consulting firm provide both the coach and developers?  By providing the developers, the firm is motivated to deliver on the project, with coaching as an adjunct.  If, instead, the firm provides just the coach, it will be judged by how well the client adopts the process.  These two motives can easily conflict.

    Case in point: at a previous client of mine, my employer was charged with completing the project, using a 50-50 mix of contractors and client developers.  My employer, a consulting firm, provided several developers experienced with XP and Scrum, as well as an agile coach.  The firm was thus charged with two imperatives: first, deliver the project; second, introduce agile methods within the client. 

    With project success as a requirement, the firm decided to interview the developers at the outset of the project. The client's developers (rightly) perceived that they were interviewing for their own jobs.  This started a negative dynamic that ultimately resulted in 80% attrition among the client's developers.

    On a pure coaching engagement, the coach would probably have "made do" with whomever the client provided. 

    We delivered all the features, basically on time, with very high quality. Financially speaking, it was a success, generating more orders and more revenue per order than its predecessor.  It is harder to say that the engagement as a whole was a success, though.  Almost all of the developers were contractors, so the client got its product but saw very little adoption of agile methods.

    Perhaps if the coach and the contract developers had come from different firms, the motivations would not have been as tangled, and more of the client's valuable people would have stayed.  The team might not have suffered from the strained, unhealthy environment from the early days of the project.

    Then again, perhaps not.  The client may have been expecting that level of attrition. Maybe that's just to be expected when you try to bring a random selection of corporate developers over to agile methods, especially if the methods are decreed from above instead of grown from the grass roots. Maybe the dynamic would have existed even with a coach who was totally disinterested in the project outcome.

    Another Path to a Killer Product

    Give individuals powers once reserved for masses

    Here's a common trajectory:

    1. Something is so expensive that groups (or even an entire government) have to share it.  Think about mainframe computers in the Sixties.

    2. The price comes down until a committed individual can own one.  Think homebrew computers in the Seventies.  The "average" person  wouldn't own one, but the dedicated geek-hobbyist would.

    3. The price comes down until the average individual can own one.  Think PCs in the Eighties.

    4. The price comes down until the average person owns dozens.  PCs, game consoles, MP3 players, GPS navigators, laptops, embedded processors in toasters and cars.  An average person may have half a dozen devices that once were considered computers.

    Along the way, the product first gains broader and broader functionality, then becomes more specific and dedicated.

    Telephones, radios and televisions all followed the same trajectory.  You would probably call these moderately successful products.

    So: find something so expensive that groups have to purchase and share it.  Make it cheap enough for a private individual.

    A path to a product

    Here's a "can't lose" way to identify a new product: Enable people to plan ahead less. 

    Take cell phones.  In the old days, you had to know where you were going before you left.  You had to make reservations from home.  You had to arrange a time and place to meet your kids at Disney World.

    Now, you can call "information" to get the number of a restaurant, so you don't have to decide where you're going until the last possible minute.  You can call the restaurant for reservations from your car while you're already on your way.

    With cell phones, your family can split up at a theme park without pre-arranging a meeting place or time.

    Cell phones let you improvise with success.  Huge hit.

    GPS navigation in cars is another great example.  No more calling AAA weeks before your trip to get "TripTix" maps.  No more planning your route on a road atlas.  Just get in your car, pick a destination, and start driving.  You don't even have to know where to get gas or food along the way.

    Credit and debit cards let you go places without planning ahead and carrying enough cash, gold, or jewels to pay your way.

    The Web is the ultimate "preparation avoidance" tool.  No matter what you're doing, if you have an always-on 'Net connection, you can improvise your way through meetings, debates, social engagements, and work situations.

    Find another product that lets procrastinators succeed, and you've got a sure winner.  There's nothing that people love more than the personal liberation of not planning ahead.


    Too Much Abstraction

    The more I deal with infrastructure architecture, the more I think that somewhere along the way, we have overspecialized. There are too many architects who have never lived with a system in production, or spent time on an operations team. Likewise, there are a lot of operations people who insulate themselves from the specification and development of systems for which they will ultimately take responsibility.

    The net result is a suboptimal hardware/software fit, and the overall availability of the application suffers.

    Here's a recent example.

    First, we're trying to address the general issue of flowing data from production back into pre-production systems -- QA, production support, development, staging. The first attempt took six days to complete. Since the requirements of the QA environment stipulate that the data should be no more than one week out of date relative to production, that's a big problem. On further investigation, it appears that the DBA who was executing this process spent most of the time running scp copies from one host to another. It's a lot of data, so in one respect, ten-hour copies are reasonable.

    But the DBA had never been told about the storage architecture. That's the domain of a separate "enterprise service" group. They are fairly protective of their domain and do not often allow their architecture documents to be distributed. They want to reserve the right to change them at will. Now, they will be quite helpful if you approach them with a storage problem, but the trick is knowing when you have a storage problem on your hands.

    You see, all of the servers the DBA was copying files between are on the same SAN. An scp from one host on the SAN to another host on the SAN is pretty redundant.

    There's an alternative solution that involves a few simple steps: Take a database snapshot onto a set of disks with mirrors, split the mirrors, and join them onto another set of mirrors, then do an RMAN "recovery" from that snapshot into the target database. Total execution time is about 4 hours.

    From six days to four hours, just by restating the problem to the right people.

    This is not intended to criticize any of the individuals involved. Far from it, they are all top-notch professionals. But the solution required merging the domains of knowledge from these two groups -- and the organizational structure explicitly discouraged that merging.

    Another recent example.

    One of my favorite conferences is the Colorado Software Summit. It's a very small, intensely technical crowd. I sometimes think half the participants are also speakers. There's a year-round mailing list for people who are interested in, or have been to, the Summit. These are very skilled and talented people. This is easily the top 1% of the software development field.

    Even there, I occasionally see questions about how to handle things like transparent database connection failover. I'll admit that's not exactly a journeyman topic. Bring it up at a party and you'll have plenty of open space to move around in. What surprised me is that there are some fairly standard infrastructure patterns for enabling database connection failover that weren't known to people with decades of experience in the field. (E.g., cluster software reassigns ownership of a virtual IP address to one node or the other, with all applications using the virtual IP address for connections).
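    The virtual-IP pattern is simple enough to model in a few lines. This is only a toy sketch -- the addresses and node names are invented, and a real implementation lives in cluster software (heartbeat daemons and address takeover), not in application code:

    ```python
    # Toy model of virtual-IP failover: applications always connect to
    # the VIP; the cluster reassigns VIP ownership between nodes, so
    # clients never need to know which physical node is live.

    class Cluster:
        def __init__(self, vip, nodes):
            self.vip = vip
            self.nodes = list(nodes)
            self.active = self.nodes[0]  # node currently "owning" the VIP

        def resolve(self, address):
            """What the network effectively does: the VIP routes to
            whichever node currently owns it; other addresses are
            passed through unchanged."""
            return self.active if address == self.vip else address

        def fail_over(self):
            """Cluster software detects that the active node is down
            and reassigns the VIP to a surviving node."""
            survivors = [n for n in self.nodes if n != self.active]
            self.active = survivors[0]

    cluster = Cluster("10.0.0.100", ["db-node-a", "db-node-b"])
    print(cluster.resolve("10.0.0.100"))  # db-node-a
    cluster.fail_over()
    print(cluster.resolve("10.0.0.100"))  # db-node-b
    ```

    The application's connection string never changes across a failover; that indirection is the whole pattern.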

    This tells me that we've overspecialized, or at least, that the groups are not talking nearly enough. I don't think it's possible to be an expert in high availability, infrastructure architecture, enterprise data management, storage solutions, OOA/D, web design, and network architecture. Somehow, we need to find an effective way to create joint solutions, so that software is not developed in ignorance of its deployment architecture, and infrastructure investments are not made that the software cannot use. We need closer ties between operations, architecture, and development.

    Don't Build Systems That Boink

    Note: This piece originally appeared in the "Marbles Monthly" newsletter in April 2003

    I caught an incredibly entertaining special on The Learning Channel last week. A bunch of academics decided that they were going to build an authentic Roman-style catapult, based on some ancient descriptions. They had great plans, engineering expertise, and some really dedicated and creative builders. The plan was to hurl a 57 pound stone 400 yards, with a machine that weighed 30 tons. It was amazing to see the builders' faces swing between hope and fear, excitement mingled with apprehension.

    At one point, the head carpenter said that it would be wonderful to see it work, but "I'm fairly certain it's going to boink." I immediately knew what he meant. "Boink" sums up all the myriad ways this massive device could go horribly wrong and wreak havoc upon them all. It could fall over on somebody. It could break, releasing all that kinetic energy in the wrong direction, or in every direction. The ball could fly off backwards. The rope might relax so much that it just did nothing. One of the throwing arms could break. They could both break. In other words, it could do anything other than what it was intended to do.

    That sounds pretty familiar. I see the same expressions on my teammates' faces every day. This enormous project we're slaving on could fall over and crush us all into jelly. It could consume our minds and our every waking hour. Worst case, it might cost us our families, our health, our passion. It could embarrass the company, or cost it tons of money. In fact, just about the most benign thing it could do is nothing.

    So how do you make a system that doesn't boink? It is hard enough just making the system do what it is supposed to do. The good news is that some simple "do's and don'ts" will take us a long way toward non-boinkage.

    Automation is Your Friend #1: Run lots of tests -- and run them all the time

    Automated unit tests and automated functional tests will guarantee that you don't backslide. They provide concrete evidence of your functionality, and they force you to keep your code integrated.
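    As a tiny illustration -- the function under test here is invented -- an automated unit test pins the behavior down so it can't silently regress:

    ```python
    import unittest

    def apply_discount(price, percent):
        """Hypothetical business rule: discount a price by a percentage."""
        if not 0 <= percent <= 100:
            raise ValueError("discount out of range")
        return round(price * (1 - percent / 100), 2)

    class DiscountTest(unittest.TestCase):
        def test_normal_discount(self):
            self.assertEqual(apply_discount(100.0, 15), 85.0)

        def test_rejects_bad_percent(self):
            with self.assertRaises(ValueError):
                apply_discount(100.0, 150)
    ```

    Wire a suite like this into the build (e.g., run it with `python -m unittest`), so every integration runs every test, every time.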

    Automation is Your Friend #2: Be fanatic about build and deployment processes

    A reliable, fully automated build process will prevent headaches and heartbreaks. A bad process--or a manual process--will introduce errors and make it harder to deliver on an iterative cycle.

    Start with a fully automated build script on day one. Start planning your first production-class deployment right away, and execute a deployment within the first three weeks. A build machine (it can be a workstation) should create a complete, installable numbered package. That same package should be delivered into each environment. That way, you can be absolutely certain that QA gets exactly the same build that went into integration testing.

    Avoid the temptation to check out the source code into each environment. An unbelievable amount of downtime can be traced to a version label being changed between the QA build and the production build.
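    One way to realize the "numbered package" idea, sketched in Python -- the `myapp` name is a placeholder, and a real build would also run the tests and record a manifest:

    ```python
    import hashlib
    import io
    import tarfile

    def make_package(build_number, files):
        """Bundle the given {path: bytes} files into one numbered
        tarball, returning its name, contents, and a checksum so every
        environment can verify it received the identical build."""
        name = "myapp-build%d.tar.gz" % build_number
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w:gz") as tar:
            for path, data in sorted(files.items()):
                info = tarfile.TarInfo(name=path)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        digest = hashlib.sha256(buf.getvalue()).hexdigest()
        return name, buf.getvalue(), digest

    name, payload, digest = make_package(42, {"app/main.py": b"print('hi')\n"})
    print(name)  # myapp-build42.tar.gz
    ```

    The checksum is what lets you prove, later, that QA and production really ran the same bits.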

    Everything In Its Place

    Keep things separated that change at different speeds. Log files change very fast, so isolate them. Data changes a little less quickly but is still dynamic. "Content" changes slower still, but is still faster than code. Configuration settings usually fall somewhere between code and content. Each of these things should go in its own location, isolated and protected from the others.

    Be transparent

    Log everything interesting that happens. Log every exception or warning. Log the start and end of long-running tasks. Always make sure your logs include a timestamp!

    Be sure to make the location of your log files configurable. It's not usually a good idea to keep log files in the same filesystem as your code or data. Filling up a filesystem with logs should not bring your system down.
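    A minimal Python sketch of both points -- every record gets a timestamp, and the log location comes from configuration rather than being hard-coded (the `APP_LOG_DIR` variable name is made up for the example):

    ```python
    import logging
    import os

    def make_logger(name="app"):
        """Create a logger that writes timestamped records to a
        configurable location. APP_LOG_DIR is a hypothetical setting;
        in production it would point at a dedicated log filesystem,
        not the one holding code or data."""
        log_dir = os.environ.get("APP_LOG_DIR", ".")
        handler = logging.FileHandler(os.path.join(log_dir, name + ".log"))
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s: %(message)s"))
        logger = logging.getLogger(name)
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        return logger
    ```

    The `%(asctime)s` field in the format string is what guarantees every line carries a timestamp.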

    Keep your configuration out of your code

    It is always a good idea to separate metadata from code. This includes settings like host names, port numbers, database URLs and passwords, and external integrations.

    A good configuration plan will allow your system to exist in different environments -- QA versus production, for example. It should also allow for clustered or replicated installations.
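    One common way to honor this -- the setting names below are invented for the example -- is to read environment-specific values from the process environment, so the identical code runs in QA and production:

    ```python
    import os
    from dataclasses import dataclass

    @dataclass
    class Settings:
        db_url: str
        listen_port: int

    def load_settings(env=None):
        """Build settings from environment variables (hypothetical
        names), with defaults suitable for a developer workstation."""
        env = os.environ if env is None else env
        return Settings(
            db_url=env.get("APP_DB_URL", "postgresql://localhost/dev"),
            listen_port=int(env.get("APP_LISTEN_PORT", "8080")),
        )

    print(load_settings({}))
    ```

    Each environment then differs only in the variables it sets, never in the code it runs; clustered installations just give each node its own values.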

    Keep your code and your data separated

    The object-oriented approach is a good way to build software, but it's a lousy way to deploy systems. Code changes at a different frequency than data. Keep them separated. For example, in a web system, it should be easy to deploy a new code drop without disrupting the content of the site. Likewise, new content should not affect the code.

    Multiplier Effects

    Here's one way to think about the ethics of software, in terms of multipliers. Think back to the last major email virus, or when the movie "The Two Towers" was released. No doubt, you heard or read a story about how much lost productivity this bane would cause. There is always some analyst willing to publish some outrageous estimate of damages due to these intrusions into the work life. I remember hearing about the millions of dollars supposedly lost to the economy when Star Wars Episode I was released.

    (By the way, I have to take a minute to disassemble this kind of analysis. Stick with me, this won't take long.

    If you take 1.5 seconds to delete the virus, it costs nothing. It's an absolutely immeasurable impact to your day. It won't even affect your productivity. You will probably spend more time than that discussing sports scores, going to the bathroom, chatting with a client, or any of the hundreds of other things human beings do during a day. It's literally lost in the noise. Nevertheless, some analyst who likes big numbers will take that 1.5 seconds and multiply it by the millions of other users and their 1.5 seconds, then multiply that by the "national average salary" or some such number.

    So, even though it takes you longer to blow your nose than to delete the virus email, somehow it still ends up "costing the economy" 5x10^6 USD in "lost productivity". The underlying assumptions here are so flawed that the result cannot be taken seriously. Nevertheless, this kind of analysis will be dragged out every time there's a news story--or better yet, a trial--about an email worm.)
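    The arithmetic behind such headlines is easy to reproduce. The inputs below are illustrative stand-ins, not figures from any study:

    ```python
    # An immeasurably small per-person cost, multiplied out the way
    # the analysts do it. All three inputs are invented for illustration.
    seconds_per_user = 1.5       # time to delete one virus email
    users = 10_000_000           # "affected" users
    hourly_wage = 20.0           # stand-in for the "national average salary"

    total_hours = seconds_per_user * users / 3600
    claimed_cost = total_hours * hourly_wage
    print(round(claimed_cost))  # 83333 -- "lost" dollars nobody could measure
    ```

    Pick bigger inputs and the "cost to the economy" climbs into the millions, even though no individual lost anything measurable.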

    The real moral of this story isn't about innumeracy in the press, or spotlight seekers exploiting said innumeracy. It's about multipliers, and the very real effect they can have.

    Suppose you have a decision to make about a particular feature. You can do it the easy way in about a day, or the hard way in about a week. (Hypothetical.) Which way should you do it? Suppose that the easy way makes four new fields required, whereas doing it the hard way makes the program smart enough to handle incomplete data. Which way should you do it?

    Required fields seem innocuous, but they are always an imposition on the user. They require the user to gather more information before starting their jobs. This in turn often means they have to keep their data on Post-It notes until they are ready to enter it, resulting in lost data, delays, and general frustration.

    Let's consider an analogy. Suppose I'm putting a sign up on my building. Is it OK to mount the sign six feet up on the wall, so that pedestrians have to duck or go around it? It's much easier for me to hang the sign if I don't have to set up a ladder and scaffold. It's only a minor annoyance to the pedestrians. It's not like it would block the sidewalk or anything. All they have to do is duck. So, I get to save an hour installing the sign, at the expense of taking two seconds away from every pedestrian passing my store. Over the long run, all of those two second diversions are going to add up to many, many times more than the hour that I saved.
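    The break-even arithmetic for the sign makes the asymmetry concrete:

    ```python
    # One hour saved by the installer versus two seconds lost by each
    # pedestrian who has to duck (the figures are from the analogy above).
    seconds_saved = 3600      # the installer's one-time saving
    cost_per_pedestrian = 2   # seconds lost per passer-by

    break_even = seconds_saved // cost_per_pedestrian
    print(break_even)  # after 1800 pedestrians, the saving is wiped out
    ```

    On a busy sidewalk, the installer's hour is repaid in collective lost time within a day or two, and the losses keep accruing after that.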

    It's not ethical to worsen the lives of others, even a small bit, just to make things easy for yourself. Successful software is measured in millions of people. Every requirements decision you make is an imposition of your will on your users' lives, even if it is a tiny one. Always be mindful of the impact your decisions--even small ones--have on those people. You should be willing to bear large burdens to ease the burden on those people, even if your impact on any given individual is minuscule.