Wide Awake Developers

Sun Joining the Cloud Crowd

As I was writing my last post, I somehow missed the news that Sun is building their own cloud platform, called Project Caroline.

There’s a PDF about it. It appears to be a presentation for JavaOne.  It may be locked down at any minute, so the link might not work by the time you read this.

Caroline looks a lot like Amazon EC2, but with some very nice control over VLANs (I suppose they would be Virtual VLANs?), load balancing policies, and DNS… all things that EC2 lacks today. It uses ZFS instead of S3, which will make for a more familiar storage model. No trickery needed to make data persist across restarts.

All in all, it looks very nice.

(Hmmm.  On second glance, this presentation is from JavaOne 2007!  Not much of a scoop there, Reg.)

Does anyone know what happened to this project? 

  

 

A Cloud for Everyone

The trajectory of many high-tech products looks like this:

  1. Very expensive. Only a few exist in the world. They are heavily time-shared, and usually oversubscribed.
  2. Within the reach of institutions and corporations, but not individuals. The organization wants to maximize utilization.
  3. Corporations own many as productivity enhancers; some wealthy or forward-looking individuals own one. Families time-share theirs.
  4. Virtually everyone has one. To lack one is to fall behind. No longer a competitive advantage, the lack of the technology puts one at a disadvantage.
  5. Invisibility. Most people have or use several, but are not aware of it.

Depending on your age, you might have been thinking "cell phones", "computers", or even "televisions".  I don’t think I have any blog readers old enough to have been thinking "telephones", "telegraphs", or "electric motors", but they all went through the same stages, too.

I feel very comfortable putting "cloud computing" in that list, too. Cloud computing is at stage 1. It’s expensive enough that there are only a few in the world: Amazon AWS, Mosso, BungeeConnect, even Force.com. They’re shared, multitenant, and soon to be oversubscribed.

One day, I suspect that we’ll each have our own computing cloud attending us, formed out of the many computing devices that surround us every day, but I’m getting ahead of myself.

Before that, we’ll see enterprises, first large then medium and small, building their own computing clouds.

"Wait a minute," you object. "That misses the whole point of cloud computing. The entire purpose is to not own the infrastructure."

That’s true, today. It was also true, at one time, that farmers did not want to own their own steam engines. So, they outsourced the job. Farmers would own machines like threshers that had everything except the troublesome boiler and engine. Those required technical expertise to run, so the farmers left that job up to folks who would bring their steam engine around, hook it up to the thresher, and charge the farmer for the length of time he needed it. As steam engines got cheaper and safer, they eventually got built right into the thresher.

This next part may sound like FUD. It isn’t. I like cloud computing. I like virtualization. In fact, I think it’s about to revolutionize our industry.

I like it so much that I think every company should have one.

Why should a company build its own cloud, instead of going to one of the providers? Several reasons, some positive, some not so much.

On the positive side, an IT manager running a cloud can finally do real chargebacks to the business units that drive demand. Some do today, but on a larger-grained level… whole servers. With a private cloud, the IT manager could charge by the compute-hour, or by the megabit of bandwidth. He could charge for storage by the gigabyte, with tiered rates for different availability/continuity guarantees. Even better, he could allow the business units to do the kind of self-service that I can do today with a credit card and The Planet. (OK, The Planet isn’t a cloud provider, but I bet they’re thinking about it.  Plus, I like them.)

I actually think this kind of self-service and fine-grained chargeback could help curb the out-of-control growth in IT spending, but that’s a different post.

This would seriously raise the level of discourse. Instead of fighting about server classes, rack space, power consumption, and rampant storage sprawl, IT could talk to the business about levels of service. Does this app need 24x7 performance management with automatic resource allocation to maintain a 2 second response time? Great, we can do that! This other one doesn’t need to be fast, but it had better work every single time a transaction goes through? We can do that, too! This application needs user experience monitoring, that database only needs non-redundant storage, because it can be recreated from other sources… it’s a better conversation to have than, "No, our corporate standard is WebSphere running on RedHat Enterprise Linux 4, with Dell PowerEdge servers.  You can have any server you want, as long as it’s a Dell PowerEdge."

I also think that the gloss will come off of the cloud computing providers. (I know, most people still haven’t heard of them yet, but the gloss will inevitably come off.)

Accidents happen. Networks still break, today, and they will in the future too. Power failures happen. How would you defend yourself in a shareholders’ lawsuit after millions in losses thanks to a service provider failure? (Actually, that suggests there may be an insurance market developing here. Any time you’ve got quantifiable risk and someone willing to pay to defray that risk, sure as hell, you’ll find insurance companies.)

Service providers get oversubscribed. What happens when your application is slow, and remains slow for months? Having an SLA only means you get some money back; it doesn’t mean your problem will get fixed. It’s a dirty secret that some service providers are quite happy paying out credits, if they can avoid bigger costs. What’s your recourse? Switching providers, and that transition costs a lot.

Latency matters. It might matter more today than ever before, since most internal applications have gone to web interfaces. Keeping your endpoints on your own network at least lets you control your own latency. 

Then there’s security. Many of my clients are dealing with PCI audits and compliance. I have no idea what they’d say if I suggested moving their data into the cloud. I’m pretty certain I wouldn’t still be in the room to hear what they said. I’d probably be standing outside in the rain, trying to catch a cab back to the airport.

Like I said, I’m not trying to FUD cloud computing. I think that it’s so good that every company should have one.

There’s one more reason I think it makes sense to build internal clouds. I’ll talk about that in my next post. 

Outrunning Your Headlights

Agile developers measure their velocity. Most teams define velocity as the number of story points delivered per iteration. Since the size of a "story point" and the length of an iteration vary from team to team, there’s not much use in comparing velocity from one team to the next. Instead, the team tracks its own velocity from iteration to iteration.

Tracking velocity has two purposes. The first is estimation. If you know how many story points are left for this release, and you know how many points you complete per iteration, then you know how long it will be until you can release. (This is the "burndown chart".) After two or three iterations, this will be a much better projection of release date than I’ve ever seen any non-agile process deliver.
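The projection is simple arithmetic, and a minimal sketch makes it concrete. The class name, method names, and numbers below are made up for illustration; they don't come from any particular tool.

```java
// Hypothetical helper: projects iterations left from remaining points and observed velocity.
final class Burndown {
    /** Average of the velocities observed so far (story points per iteration). */
    static double averageVelocity(int[] pointsPerIteration) {
        if (pointsPerIteration.length == 0) {
            throw new IllegalArgumentException("need at least one completed iteration");
        }
        int total = 0;
        for (int points : pointsPerIteration) {
            total += points;
        }
        return (double) total / pointsPerIteration.length;
    }

    /** Iterations needed to finish, rounded up: a partly used iteration still has to run. */
    static int iterationsRemaining(int remainingPoints, double velocity) {
        if (velocity <= 0) {
            throw new IllegalArgumentException("velocity must be positive");
        }
        return (int) Math.ceil(remainingPoints / velocity);
    }

    public static void main(String[] args) {
        double velocity = averageVelocity(new int[] {18, 22, 20}); // illustrative history
        System.out.println(iterationsRemaining(90, velocity));
    }
}
```

With three iterations of history averaging 20 points and 90 points left, the projection is five iterations. Multiply by the iteration length to get a calendar estimate.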

The second purpose of velocity tracking is to figure out ways to go faster.

In the iteration retrospective, a team will do two things. First, they’ll recalibrate their estimating technique, to see if they can estimate the story cards or backlog items more accurately. Second, they’ll look at ways to accomplish more during an iteration. Maybe that’s refactoring part of the code, or automating some manual process. It might be as simple as adding templates to the IDE for commonly recurring code patterns.  (That should always raise a warning flag, since recurring code patterns are a code smell. Some languages just won’t let you completely eliminate it, though.  And by "some languages" here, I mainly mean Java.)

Going faster should always be better, right? That means the development team is delivering more value for the same fixed cost, so it should always be a benefit, shouldn’t it?

I have an example of a case where going faster didn’t matter. To see why, we need to look past the boundaries of the development team.  Developers often treat software requirements as if they come from a sort of ATM; there’s an unlimited reserve of requirements and we just need to decide how many of them to accept into development.

Taking a cue from Lean Software Development, though, we can look at the end-to-end value stream. The value stream is drawn from the customer’s perspective. Step by step, the value stream map shows us how raw materials (requirements) are turned into finished goods. "Finished goods" does not mean code. Code is inventory, not finished goods. A finished good is something a customer would buy. Customers don’t buy code. On the web, customers are users, interacting with a fully deployed site running in production. For shrink-wrapped software, customers buy a CD, DVD, or installer from a store. Until the inventory is fully transformed into one of these finished goods, the value stream isn’t done.

Figure 1 shows a value stream map for a typical waterfall development process. This process has an annual funding cycle, so "inventory" from "suppliers" (i.e., requirements from the business unit) wait, on average, six months to get funded. Once funded and analyzed, they enter the development process. For clarity here, I’ve shown the development process as a single box, with 100% efficiency. That is, all the time spent in development is spent adding value—as the customer perceives it—to the product. Obviously, that’s not true, but we’ll treat it as a momentarily convenient fiction. Here, I’m showing a value stream map for a web site, so the final steps are staging and deploying the release.

Figure 1 - Value Stream Map of a Waterfall Process

This is not a very efficient process. It takes 315 business days to go from concept to cash. Out of that time, at most 30% of it is spent adding value. In reality, if we unpack the analysis and development processes, we’ll see that efficiency drop to around 5%.

From the Theory of Constraints, we know that the throughput of any process is limited by exactly one constraint. An easy way to find the constraint is by looking at queue sizes. In an unoptimized process, you almost always find the largest queue right before the constraint. In the factory environment that ToC came from, it’s easy to see the stacks of WIP (work in progress) inventory. In a development process, WIP shows up in SCR systems, requirements spreadsheets, prioritization documents, and so on.

Indeed, if we overlay the queues on that waterfall process, as in Figure 2, it’s clear that Development and Testing is the constraint. After Development and Testing completes, Staging and Deployment take almost no time and have no queued inventory.

Figure 2 - Waterfall Value Stream, With Queues

In this environment, it’s easy to see why development teams get flogged constantly to go faster, produce more, catch up.  They’re the constraint.
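The "look for the biggest queue" heuristic is easy to express in code. This is only a sketch: the stage names and queue counts you feed it are whatever your own tracking systems report, and the class and method names are invented for this example.

```java
import java.util.Map;

// Hypothetical helper: applies the queue-size heuristic to a value stream.
final class ConstraintFinder {
    /**
     * Returns the stage with the most work waiting in front of it.
     * On a tie, the stage that appears first in the map's iteration order wins.
     * Returns null for an empty map.
     */
    static String likelyConstraint(Map<String, Integer> queuedItemsByStage) {
        String candidate = null;
        int largestQueue = -1;
        for (Map.Entry<String, Integer> entry : queuedItemsByStage.entrySet()) {
            if (entry.getValue() > largestQueue) {
                largestQueue = entry.getValue();
                candidate = entry.getKey();
            }
        }
        return candidate;
    }
}
```

Pass in a LinkedHashMap keyed by stage, in process order, with the count of items waiting before each stage. The heuristic only points at a likely constraint in an unoptimized process; it is a place to start looking, not proof.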

Lean Software Development has ten simple rules to optimize the entire value stream.

ToC says to elevate the constraint and subordinate the entire process to the throughput of the constraint. Elevating the constraint—by either going faster with existing capacity, or expanding capacity—adds to throughput, while running the whole process at the throughput of the constraint helps reduce waste and WIP.

In a certain sense, Agile methods can be derived from Lean and ToC.

All of that, though, presupposes a couple of things:

  • Development is the constraint.
  • There’s an unlimited supply of requirements.

Figure 3 shows the value stream map for a project I worked on in 2005. This project was to replace an existing system, so at first, we had a large backlog of stories to work on. As we approached feature parity, though, we began to run out of stories. The users had been waiting for this system for so long, that they hadn’t given much thought, or at least recent thought, to what they might want after the initial release. Shortly after the second release (a minor bug fix), it became clear that we were actually consuming stories faster than they would be produced.

    Figure 3 - Value Stream Map of an Agile Project

    On the output side, we ran into the reverse problem. This desktop software would be distributed to hundreds of locations, with over a thousand users who needed to be expert on the software in short order. The internal training group, responsible for creating manuals and computer based training videos, could not keep revising their training modules as quickly as we were able to change the application. We could create new user interface controls, metaphors, and even whole screens much faster than they could create training materials.

    Once past the training group, a release had to be mastered and replicated onto installation discs. These discs were distributed to the store locations, where associates would call the operations group for a "talkthrough" of the installation process. Operations had a finite capacity, and could only handle so many installations every day. That set a natural throttle on the rate of releases. At one stage, after I rolled off the project, I know that a release which had passed acceptance testing in October was still in the training group by the following March.

    In short, the development team wasn’t the constraint. There was no point in running faster. We would exhaust the inventory of requirements and build up a huge queue of WIP in front of training and deployment. The proper response would be to slow down, to avoid the buildup of unfinished inventory.  Creating slack in the workday would be one way to slow down, and drawing down the team size would be another perfectly valid response. So would increasing the capacity of the training team. There are other places to optimize the value stream, too. But the one thing that absolutely wouldn’t help would be increasing the development team’s velocity.
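To see why velocity doesn't help here, treat each stage as having a capacity and take the minimum. The sketch below does exactly that. The stage names echo the story above, but the capacities in the usage note are invented purely for illustration, as are the class and method names.

```java
import java.util.Map;

// Hypothetical model: each stage can handle some number of stories per week.
final class ValueStream {
    /** End-to-end throughput is capped by the slowest stage; an empty stream yields 0. */
    static double throughput(Map<String, Double> storiesPerWeekByStage) {
        return storiesPerWeekByStage.values().stream()
                .mapToDouble(Double::doubleValue)
                .min()
                .orElse(0.0);
    }
}
```

If requirements arrive at 8 stories a week, development can finish 20, and training can absorb 5, the stream delivers 5. Double development to 40 and it still delivers 5; the extra work just piles up as inventory in front of training.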

    For nearly the entire history of software development, there has been talk of the "software crisis", the ever-widening gap between government and industry’s need for software and the rate at which software can be produced. For the first time in that history, agile methods allow us to move the constraint off of the development team.

    Software Failure Takes Down Blackberry Services

    Anyone who’s addicted to a Blackberry already knows about Monday’s four-hour outage. For some of us, the Blackberry isn’t just an electronic leash, it’s part of our business operations.

    Like cell phones, Blackberries have a huge, hidden infrastructure behind them. Corporate BlackBerry Enterprise Servers (BES) relay email, calendar, and contact information through RIM’s infrastructure, out through the wireless carriers. It was RIM’s own infrastructure that suffered from intermittent failures during the outage.

    Data Center Knowledge reports that the outage was caused by a failed software upgrade.

    Releases are risky. We use testing and QA to reduce the risk, but every line of new or modified code represents an unknown.

    How can we reduce the risk of an upgrade? One way is to roll it out slowly. Companies with widely distributed point-of-sale (POS) systems know this. They never push a release out to every store at once. They start with one or two. If that works, they go up to a larger handful, maybe four to eight. After a couple of days, they’ll roll it out to an entire district. It can take a week or more to roll the release out everywhere.

    In the interim, there are plenty of checkpoints where the release can be rolled back.

    I strongly recommend approaching Web site releases the same way. Roll the new release out to one or two servers in your farm. Let a fraction of your customers into the new release. Watch for performance regressions, capacity problems, and functional errors. Absolutely ensure that you can roll it back if you need to. Once it’s "baked" for a while in production, then roll it to the remaining app servers.
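Letting "a fraction of your customers" into the new release needs some rule for who goes where. One common approach, sketched here under my own assumptions, is to hash a stable identifier such as a session ID into a bucket from 0 to 99. The class name and the choice of session ID are illustrative; a real site would usually do this in the load balancer or a routing filter.

```java
// Hypothetical router: sends a fixed percentage of sessions to the new release.
final class CanaryRouter {
    private final int percentToNewRelease; // 0 means fully rolled back, 100 means fully rolled out

    CanaryRouter(int percentToNewRelease) {
        if (percentToNewRelease < 0 || percentToNewRelease > 100) {
            throw new IllegalArgumentException("percentage must be between 0 and 100");
        }
        this.percentToNewRelease = percentToNewRelease;
    }

    /**
     * The same session ID always lands in the same bucket, so a user
     * does not bounce between versions from one request to the next.
     */
    boolean useNewRelease(String sessionId) {
        int bucket = Math.floorMod(sessionId.hashCode(), 100);
        return bucket < percentToNewRelease;
    }
}
```

Rolling back is then a matter of replacing the router with one configured at 0 percent, which is the checkpoint you want to have before you let anyone in.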

    This approach demands a few corollaries. First, your database updates have to be structured in a forward-compatible way, and they must always allow for rollback. There can be no irrevocable updates. Second, two versions of your software will be operating simultaneously. That means your integration protocols and static assets have to be able to accommodate both versions. I discuss specific strategies for each of these aspects in Release It.

    Finally, an aside: RIM’s statement about the outage isn’t reflected anywhere on their site. Once again, if what you want is the latest true information about a company, the very last place to find it is the company’s own web site. 

    The Pragmatic Architect on Security

    Catching up on some reading, I finally got a chance to read Ted Neward’s article "Pragmatic Architecture: Security".  It’s very good.  (Actually, the whole series is pretty good, and I recommend them all.  At least as of February 2008… I make no assertions about future quality!)

    Ted nails it.  I agree with all of the principles he identifies, and I particularly like his advice to "fail securely". 

    I would add one more, though: Be visible.

    After any breach, the three worst questions are always:

    1. How long has this been happening?
    2. How much have we lost?
    3. Why didn’t we know about it sooner?

    The answers are always, respectively, "Far too long", "We have no idea", and "We didn’t expect that exploit". To which the only possible response is, "Well, duh, if you’d expected it, you would have closed the vulnerability."

    Successful exploits are always successful because they stay hidden. Are you sure that nobody’s in your systems right now, leeching data, stealing credit card numbers, or stealing products? Of course not. For a vivid case in point, google "Kerviel Societe Generale".

    While you cannot prove a negative, you can improve your odds of detecting nefarious activity by making sure that everything interesting is logged. (And by "interesting", I mean "potentially valuable".) 

    There are some pretty spiffy event correlation tools out there these days. They can monitor logs across hundreds of servers and network devices, extracting patterns of anomalous behavior. But, they only work if your application exposes data that could indicate a breach.

    For example, you might not be able to log every login attempt, but you probably should log every admin login attempt.

    Or, you might consider logging every price change. (I shudder to think about collusion between a merchant with pricing control and an outside buyer.  Imagine a 10-minute long sale on laptops: 90% off for 10 minutes only.)

    If your internal web service listens on a port, then it should only accept connections from known sources. Whether you enforce that through IPTables, a hardware firewall, or inside the application itself, make sure you’re logging refused connections.
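If you do the check inside the application, it can be as small as the sketch below. It uses java.util.logging from the JDK; the class name, the logger name, and the idea of keying on the remote address string are my own assumptions for illustration.

```java
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical in-application filter: accepts known sources and logs every refusal.
final class SourceFilter {
    private static final Logger LOG = Logger.getLogger("security.connections");

    private final Set<String> allowedSources;

    SourceFilter(Set<String> allowedSources) {
        this.allowedSources = Set.copyOf(allowedSources); // defensive, immutable copy
    }

    /** Returns true for a known source; otherwise logs the refusal and returns false. */
    boolean accept(String remoteAddress, int localPort) {
        if (allowedSources.contains(remoteAddress)) {
            return true;
        }
        // The refusal is the interesting event, so it goes to the log
        // where a correlation tool can pick it up.
        LOG.warning("refused connection from " + remoteAddress + " on port " + localPort);
        return false;
    }
}
```

The caller closes the socket when accept returns false. The point is less the filtering than the log line: a refused connection that nobody records can't show up as a pattern later.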

    Then, of course, once you’re logging the data, make sure someone’s monitoring it and keeping pattern and signature definitions up to date!

    Two Books That Belong in Your Library

    I seldom plug books—other than my own, that is. I’ve just read two important books, however, that really deserve your attention.

    Concurrency, Everybody’s Doing It

    The first is Java Concurrency in Practice by Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea. I’ve been doing Java development for close to thirteen years now, and I learned an enormous amount from this fantastic book. For example, I knew what the textbook definition of a volatile variable was, but I never knew why I would actually want to use one. Now I know when to use them and when they won’t solve the problem.

    Of course, JCP talks about the Java 5 concurrency library at great length. But this is no paraphrasing of the javadoc. (It was Doug Lea’s original concurrency utility library that eventually got incorporated into Java, and we’re all better off for it.) The authors start with illustrations of real issues in concurrent programming. Before they introduce the concurrency utilities, they explain a problem and illustrate potential solutions. (Usually involving at least one naive "solution" that has serious flaws.) Once they show us some avenues to explore, they introduce some neatly-packaged, well-tested utility class that either solves the problem or makes a solution possible. This removes the utility classes from the realm of "inscrutable magic" and presents them as "something difficult that you don’t have to write."

    The best part about JCP, though, is the combination of thoroughness and clarity with which it presents a very difficult subject. For example, I always understood about the need to avoid concurrent modification of mutable state. But, thanks to this book, I also see why you have to synchronize getters, not just setters. (Even though assignment to an integer is guaranteed to happen atomically, that isn’t enough to guarantee that the change is visible to other threads. To guarantee visibility and ordering, the reader and the writer both have to cross a synchronization barrier on the same lock, or the field has to be declared volatile.)
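A tiny example of the getter point, as a sketch of my own rather than anything taken from the book: both methods synchronize on the same object, so a value written by one thread is guaranteed to be visible to a thread that reads it afterward.

```java
// Illustrative holder class: both accessors lock on "this".
final class SharedValue {
    private int value; // guarded by "this"

    synchronized void set(int newValue) {
        value = newValue;
    }

    // Without "synchronized" here, a reading thread is not guaranteed to see
    // the latest value written by set(), even though the write itself is atomic.
    synchronized int get() {
        return value;
    }
}
```

For a single independent field like this one, declaring it volatile would also give the visibility guarantee. The lock becomes necessary once several fields have to change together.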

    Blocked Threads are one of my stability antipatterns. I’ve seen hundreds of web site crashes. Every single one of them eventually boils down to blocked threads somewhere. Java Concurrency in Practice has the theory, practice, and tools that you can apply to avoid deadlocks, livelocks, corrupted state, and a host of other problems that lurk in the most innocuous-looking code.

    Capacity Planning is Science, Not Art

    The second book that I want to recommend today is Capacity Planning for Web Services. I’ve had this book for a while. When I first started reading it, I put it down right away thinking, "This is way too basic to solve any real problems." That was a big error.

    Capacity Planning may get off to a slow start, but that’s only because the authors are both thorough and deliberate. Later in the book, that deliberate pace is very helpful, because it lets us follow the math.

    This is the only book on capacity planning I’ve seen that actually deals with transmission time for HTTP requests and responses. In fact, some of the examples even compute the number of packets that a request or reply will need.

    I have objected to some capacity planning books because they assume that every process can be represented by an average. Not this one. In the section on standalone web servers, for example, the authors break files into several classes, then use a weighted distribution of file sizes to compute the expected response time and bandwidth requirements. This is a very real-world approach, since web requests tend toward a bimodal distribution: small HTML, Javascript, and CSS intermixed with large media files and images. (In fact, I plan on using the models in this book to quantify the effect of segregating media files from dynamic pages.)
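Here is roughly what the weighted calculation looks like, as a sketch with invented class names and invented numbers. It is not the book's model, which goes much further; it only shows the weighting step.

```java
// Hypothetical description of one class of files served by the site.
final class FileClass {
    final String name;
    final double shareOfRequests; // fraction of all requests, between 0 and 1
    final double meanBytes;       // mean response size for this class

    FileClass(String name, double shareOfRequests, double meanBytes) {
        this.name = name;
        this.shareOfRequests = shareOfRequests;
        this.meanBytes = meanBytes;
    }
}

final class WeightedWorkload {
    /** Expected response size, weighting each class by how often it is requested. */
    static double expectedBytesPerRequest(FileClass[] classes) {
        double totalShare = 0.0;
        double expectedBytes = 0.0;
        for (FileClass fileClass : classes) {
            totalShare += fileClass.shareOfRequests;
            expectedBytes += fileClass.shareOfRequests * fileClass.meanBytes;
        }
        if (Math.abs(totalShare - 1.0) > 1e-9) {
            throw new IllegalArgumentException("request shares must sum to 1");
        }
        return expectedBytes;
    }

    /** Outbound bandwidth needed at a given request rate, in bits per second. */
    static double requiredBitsPerSecond(FileClass[] classes, double requestsPerSecond) {
        return expectedBytesPerRequest(classes) * 8 * requestsPerSecond;
    }
}
```

Suppose 90% of requests are small pages averaging 5,000 bytes and 10% are media files averaging 500,000 bytes. The weighted expectation is 54,500 bytes per request, while a naive average of the two class sizes would say 252,500. That gap is why the single-average models bother me.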

    This is also the only book I’ve seen that recognizes that capacity limits can propagate both downward and upward through tiers. There’s a great example of how doubling the CPU performance in an app tier ends up increasing the demand on the database server, which almost totally nullifies the effect of the CPU upgrade. It also recognizes that all requests are not created equal, and recommends clustering request types by their CPU and I/O demands, instead of averaging them all together.

    Nearly every result or abstract law has an example, written in concrete terms, which helps bridge theory and practice.

    Both of these books deal with material that easily leads off into clouds of theory and abstraction. (JCP actually quips, "What’s a memory model, and why would I want one?") These excellent works avoid the Ivory Tower trap and present highly pragmatic, immediately useful wisdom.

    Well Begun Is Half Done

    How long is your checklist for setting up a new development environment? It might seem like a trivial thing, but setup costs are part of the overall friction in your project. I’ve seen three page checklists that required multiple downloads, logging in as several users (root and non-root), and hand-typing SQL strings to set up the local database server.

    I think the paragon of environment setup is the ubiquitous GNU autoconf system. Anyone familiar with Linux, BSD, or other flavors of UNIX will surely recognize this three-line incantation:

    ./configure
    make
    make install
    

    The beauty of autoconf is that it adapts to you. In the open-source world, you can’t stipulate one particular set of packages or versions, at least, not if you actually want people to use your software and contribute to your project. In the corporate world, though, it’s pretty common to see a project that requires a specific point-point rev of some Jakarta Commons library, but without actually documenting the version.

    Then there are different places to put things: inside the project, in source control, or in the system. I recently went back to a project’s code base after being away for more than two years. I thought we had done a good job of addressing the environment setup. We included all the deliverable jars in the codebase, so they were all version controlled. But, we decided to keep the development-only jars (like EasyMock, DBUnit, and JUnit) outside the code base. We did use Eclipse variables to abstract out the exact filesystem location, but when I returned to that code base, finding and restoring exactly the right versions of those build-time jars wasn’t easy. In retrospect, we should have put the build-time jars under version control and kept them inside the code base.

    Yes, I know that version control systems aren’t good at versioning binaries like jar files. Who cares? We don’t rev the jar files so often that the lack of deltas matters. Putting a new binary in source control when you upgrade from Spring 2.5 to Spring 2.5.1 really won’t kill your repository. The cost of the extra disk space is nothing compared to the benefit of keeping your code base self-contained.

    Maven users will be familiar with another approach. On a Maven project, you express external dependencies in a project model file. On the first build, Maven will download those dependencies from their "official" archives, then cache them locally. After that, Maven will just use the locally cached jar file, at least until you move your declared dependency to a newer revision. I have nothing against Maven. I know some people who swear by it, and others who swear at it. Personally, I just never got into it.

    Then there are JRE extensions. This project uses JAI, which wants to be installed inside the JRE itself. We went along with that, but I was stumped for a while today when I saw hundreds of compile errors even though my Eclipse project’s build path didn’t show any unresolved dependencies. Of course, when you install JAI inside the JRE, it just becomes part of the Java runtime. That makes it an implicit dependency. I eventually remembered that trick, but it took a while. In retrospect, I wish we had tried harder to bring JAI’s jars and native libraries into the code base as an explicit dependency.

    Does developer environment setup time matter? I believe it does. It might be tempting to say, "That’s a one-time cost, there’s no point in optimizing it." It’s not really a one-time cost, though. It’s one time per developer, every time that developer has to reinstall. My rough observation says that, between migrating to a new workstation, Windows reinstalls, corporate re-imaging, and developer churn, you should expect three to five developer setups per year on an internal project.

    For an open-source project, the sky is the limit. Keep in mind that you’ll lose potential contributors at every barrier they encounter. Environment setup is the first one.

    So, what’s my checklist for a good environment setup checklist?

    • Keep the project self contained. Bring all dependencies into the code base. Same goes for RPMs or third-party installers.
    • Make sure all JAR files have version numbers in their file names. If the upstream project doesn’t build their JAR files with version numbers, go ahead and rename the jars.
    • Make bootstrap scripts for database actions such as user creation or schema builds.
    • If you absolutely must embed a dependency on something that lives outside the code base, make your build script detect its location. Don’t rely on specific path names.
    • Don’t assume your code base is in any particular filesystem on the build machine.

    I’d love to hear your own rules for easy development setup.

    “Release It” Is a Jolt Award Finalist

    The Jolt Awards have been described as "the Oscar’s of our industry". (Really. It’s on the front page of the site.)  The list of past book winners reads like an essential library for the software practitioner. Even the finalists and runners-up are essential reading.

    Release It has now joined the company of finalists. The competition is very tough… I’ve read "Beautiful Code" and "Manage It!", and both are excellent. I’ll be on pins and needles until the awards ceremony on March 5th.  Honestly, though, I’m just thrilled to be in such good company.

    Should Email Errors Keep Customers From Buying?

    Somewhere inside every commerce site, there’s a bit of code sending emails out to customers.  Email campaigning might have been in the requirements and that email code stands tall at the brightly-lit service counter.  On the other hand, it might have been added as an afterthought, languishing in some dark corner with the "lost and found" department.  Either way, there’s a good chance it’s putting your site at risk.

    The simplest way to code an email sending routine looks something like this:

    1. Get a javax.mail.Session instance
    2. Get a javax.mail.Transport instance from the Session
    3. Construct a javax.mail.internet.MimeMessage instance
    4. Set some fields on the message: from, subject, body.  (Setting the body may involve reading a template from a file and interpolating values.)
    5. Set the recipients’ Addresses on the message
    6. Ask the Transport to send the message
    7. Close the Transport
    8. Discard the Session

    This goes into a servlet, a controller, or a stateless session bean, depending on which MVC framework or JEE architecture blueprint you’re using.

    There are two big problems here. (Actually, there are three, but I’m not going to deal with the "one connection per message" issue.)

    Request-Handling Threads at Risk

    As written, all the work of sending the email happens on the request-handling thread that’s also responsible for generating the response page. Even on a sunny day, that means you’re spending some precious request-response cycles on work that doesn’t help build the page.

    You should always look at a call out to an external server with suspicion. Many of them can execute asynchronously to page generation. Anything that you can offload to a background thread, you should offload so the request-handler can get back in the pool sooner. The user’s experience will be better, and your site’s capacity will be better, if you do.

    Also, keep in mind that SMTP servers aren’t always 100% reliable. Neither are the DNS servers that point you to them. That goes double if you’re connecting to some external service. (And please, please don’t even tell me you’re looking up the recipient’s MX record and contacting the receiving MTA directly!)

    If the MTA is slow to accept your connection, or to process the email, then the request-handling thread could be blocked for a long time: seconds or even minutes. Will the user wait around for the response? Not likely. He’ll probably just hit "reload" and double-post the form that triggered the email in the first place.

    Poor Error Recovery

    The second problem is the complete lack of error recovery.  Yes, you can log an exception when your connection to the MTA fails. But that only lets the administrator know that some amount of mail failed. It doesn’t say what the mail was! There’s no way to contact the users who didn’t get their messages. Depending on what the messages said, that could be a very big deal.

    At a minimum, you’d like to be able to detect and recover from interruptions at the MTA: scheduled maintenance, Windows patching, unscheduled index rebuilds, and the like. Even if "recovery" means someone takes the users’ info from the log file and types in a new message on their desktops, that’s better than nothing.

    A Better Way

    The good news is that there’s a handy way to address both of these problems at once. Better still, it works whether you’re dealing with internal SMTP based servers or external XML-over-HTTP bulk mailers.

    Whenever a controller decides it’s time to reach out and touch a user through email, it should drop a message on a JMS queue. This lets the request-handling thread continue with page generation immediately, while leaving the email for asynchronous processing.

    You can either go down the road of message-driven beans (MDB) or you can just set up a pool of background threads to consume messages from the queue. On receipt of a message, the subscriber just executes the same email generation and transmission as before, with one exception. If the message fails due to a system error, such as a broken socket connection, the message can just go right back onto the message queue for later retry. (You’ll probably want to update the "next retry time" to avoid livelock.)
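To show the shape of the idea without a JMS provider on hand, the sketch below swaps in an in-process java.util.concurrent.DelayQueue for the JMS queue. That is a deliberate simplification: unlike a persistent JMS queue, it loses its contents if the JVM dies. The Mailer interface, the class names, and the fixed retry delay are all assumptions made for this example.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** One outbound email; its delay encodes the "next retry time". */
final class EmailJob implements Delayed {
    final String recipient;
    final String body;
    final int attempts;
    private final long readyAtNanos;

    EmailJob(String recipient, String body, int attempts, long delayMillis) {
        this.recipient = recipient;
        this.body = body;
        this.attempts = attempts;
        this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }
}

/** Hypothetical stand-in for the real SMTP or HTTP mailer call. */
interface Mailer {
    void send(String recipient, String body) throws Exception;
}

final class EmailWorker {
    private final DelayQueue<EmailJob> queue = new DelayQueue<>();
    private final Mailer mailer;
    private final int maxAttempts;
    private final long retryDelayMillis;

    EmailWorker(Mailer mailer, int maxAttempts, long retryDelayMillis) {
        this.mailer = mailer;
        this.maxAttempts = maxAttempts;
        this.retryDelayMillis = retryDelayMillis;
    }

    /** Called from the request-handling thread; returns without touching the network. */
    void enqueue(String recipient, String body) {
        queue.put(new EmailJob(recipient, body, 0, 0));
    }

    /** Called in a loop by a background thread. Returns false if no job became ready in time. */
    boolean processOne(long timeoutMillis) {
        EmailJob job;
        try {
            job = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        if (job == null) {
            return false;
        }
        try {
            mailer.send(job.recipient, job.body);
        } catch (Exception e) {
            int attempts = job.attempts + 1;
            if (attempts < maxAttempts) {
                // Back on the queue with a later retry time, so a dead MTA is not hammered.
                queue.put(new EmailJob(job.recipient, job.body, attempts, retryDelayMillis));
            } else {
                // Keep enough detail to contact the user by other means.
                System.err.println("giving up on mail to " + job.recipient + " after " + attempts + " attempts");
            }
        }
        return true;
    }

    int pending() {
        return queue.size();
    }
}
```

The controller calls enqueue and moves on; one or more background threads call processOne in a loop. With real JMS you would get the same split, plus persistence and redelivery handled by the provider.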

    Better Still

    If you have a cluster of application servers that can all generate outbound email, why not take the next step? Move the MDBs out into their own app server and have the message queues from all the app servers terminate there? (If you’re using pub-sub instead of point-to-point, this will be pretty much transparent.) This application will resemble a message broker… for good reason. It’s essentially just pulling messages in from one protocol, transforming them, then sending them out over another protocol.

    The best part? You don’t even have to write the message broker yourself. There are plenty of open-source and commercial alternatives.

    Summary

    Sending email directly from the request-handling thread performs poorly, creates unpredictable page latency for users and risks dropping their emails right on the floor. It’s better to drop a message in a queue for asynchronous transformation by a message broker: it’s faster, more reliable, and there’s less code for you to write.