Wide Awake Developers

“Release It” Is a Jolt Award Finalist

The Jolt Awards have been described as "the Oscar’s of our industry". (Really. It’s on the front page of the site.)  The list of past book winners reads like an essential library for the software practitioner. Even the finalists and runners-up are essential reading.

Release It has now joined the company of finalists. The competition is very tough… I’ve read "Beautiful Code" and "Manage It!", and both are excellent. I’ll be on pins and needles until the awards ceremony on March 5th.  Honestly, though, I’m just thrilled to be in such good company.

Should Email Errors Keep Customers From Buying?

Somewhere inside every commerce site, there’s a bit of code sending emails out to customers.  Email campaigning might have been in the requirements and that email code stands tall at the brightly-lit service counter.  On the other hand, it might have been added as an afterthought, languishing in some dark corner with the "lost and found" department.  Either way, there’s a good chance it’s putting your site at risk.

The simplest way to code an email sending routine looks something like this:

  1. Get a javax.mail.Session instance
  2. Get a javax.mail.Transport instance from the Session
  3. Construct a javax.mail.internet.MimeMessage instance
  4. Set some fields on the message: from, subject, body.  (Setting the body may involve reading a template from a file and interpolating values.)
  5. Set the recipients’ Addresses on the message
  6. Ask the Transport to send the message
  7. Close the Transport
  8. Discard the Session

This goes into a servlet, a controller, or a stateless session bean, depending on which MVC framework or JEE architecture blueprint you’re using.

There are two big problems here. (Actually, there are three, but I’m not going to deal with the "one connection per message" issue.)

Request-Handling Threads at Risk

As written, all the work of sending the email happens on the request-handling thread that’s also responsible for generating the response page. Even on a sunny day, that means you’re spending some precious request-response cycles on work that doesn’t help build the page.

You should always look at a call out to an external server with suspicion. Many of them can execute asynchronously to page generation. Anything that you can offload to a background thread, you should offload so the request-handler can get back in the pool sooner. The user’s experience will be better, and your site’s capacity will be better, if you do.
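As a minimal sketch of that offloading, assuming a plain java.util.concurrent thread pool rather than any particular framework, the handler below hands the mail call to a background pool and returns right away. MailSender and CheckoutHandler are hypothetical names invented for this illustration; a real sender would wrap whatever mail API you use.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for whatever actually talks to the mail server.
interface MailSender {
    void send(String recipient, String body) throws Exception;
}

// Hypothetical request handler; only the hand-off to the pool matters here.
class CheckoutHandler {
    // Small fixed pool, so slow mail calls cannot tie up an unlimited number of threads.
    private final ExecutorService background = Executors.newFixedThreadPool(2);
    private final MailSender sender;

    CheckoutHandler(MailSender sender) {
        this.sender = sender;
    }

    // Runs on the request-handling thread and returns without waiting for the mail server.
    Future<?> handle(String recipient) {
        return background.submit(() -> {
            sender.send(recipient, "Thanks for your order");
            return null; // Callable form, so the checked exception from send() is allowed
        });
    }

    // Stops accepting new work; already-submitted sends still run.
    void shutdown() {
        background.shutdown();
    }
}
```

This only addresses latency. If the send fails, the error ends up inside the returned Future and the message is still lost unless something looks at it.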

Also, keep in mind that SMTP servers aren’t always 100% reliable. Neither are the DNS servers that point you to them. That goes double if you’re connecting to some external service. (And please, please don’t even tell me you’re looking up the recipient’s MX record and contacting the receiving MTA directly!)

If the MTA is slow to accept your connection, or to process the email, then the request-handling thread could be blocked for a long time: seconds or even minutes. Will the user wait around for the response? Not likely. He’ll probably just hit "reload" and double-post the form that triggered the email in the first place.

Poor Error Recovery

The second problem is the complete lack of error recovery.  Yes, you can log an exception when your connection to the MTA fails. But that only lets the administrator know that some amount of mail failed. It doesn’t say what the mail was! There’s no way to contact the users who didn’t get their messages. Depending on what the messages said, that could be a very big deal.

At a minimum, you’d like to be able to detect and recover from interruptions at the MTA: scheduled maintenance, Windows patching, unscheduled index rebuilds, and the like. Even if "recovery" means someone takes the users’ info from the log file and types in a new message on their desktops, that’s better than nothing.

A Better Way

The good news is that there’s a handy way to address both of these problems at once. Better still, it works whether you’re dealing with internal SMTP based servers or external XML-over-HTTP bulk mailers.

Whenever a controller decides it’s time to reach out and touch a user through email, it should drop a message on a JMS queue. This lets the request-handling thread continue with page generation immediately, while leaving the email for asynchronous processing.

You can either go down the road of message-driven beans (MDB) or you can just set up a pool of background threads to consume messages from the queue. On receipt of a message, the subscriber just executes the same email generation and transmission as before, with one exception. If the message fails due to a system error, such as a broken socket connection, the message can just go right back onto the message queue for later retry. (You’ll probably want to update the "next retry time" to avoid livelock.)
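Here is a rough sketch of the consumer side with retry. A real JMS queue needs a broker, so this uses an in-process java.util.concurrent.DelayQueue as a stand-in; that substitution is an assumption made to keep the example self-contained. EmailJob, Mailer, EmailWorker, and the backoff schedule are all hypothetical names and choices, not part of JMS or any library.

```java
import java.io.IOException;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// One queued email request. The queue will not release it before notBeforeMillis.
class EmailJob implements Delayed {
    final String recipient;
    final String body;
    final int attempts;
    final long notBeforeMillis; // the "next retry time"

    EmailJob(String recipient, String body, int attempts, long notBeforeMillis) {
        this.recipient = recipient;
        this.body = body;
        this.attempts = attempts;
        this.notBeforeMillis = notBeforeMillis;
    }

    // Assumed backoff policy: 1s, 2s, 4s, ... capped at 64s.
    EmailJob retryLater(long nowMillis) {
        long backoffMillis = 1000L << Math.min(attempts, 6);
        return new EmailJob(recipient, body, attempts + 1, nowMillis + backoffMillis);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(notBeforeMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
}

// Hypothetical transport; an IOException stands for a system error such as a broken socket.
interface Mailer {
    void deliver(EmailJob job) throws IOException;
}

class EmailWorker {
    private final DelayQueue<EmailJob> queue;
    private final Mailer mailer;

    EmailWorker(DelayQueue<EmailJob> queue, Mailer mailer) {
        this.queue = queue;
        this.mailer = mailer;
    }

    // Tries one job. On a system error the job goes back on the queue with a later retry time.
    boolean processOne(EmailJob job) {
        try {
            mailer.deliver(job);
            return true;
        } catch (IOException systemError) {
            queue.put(job.retryLater(System.currentTimeMillis()));
            return false;
        }
    }

    // Loop for a background thread: take() blocks until some job is due.
    void run() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            processOne(queue.take());
        }
    }
}
```

Pushing the retry time forward is what keeps a dead MTA from turning the worker into a tight loop that re-reads the same failing message. An in-memory queue loses its contents on restart, which a persistent JMS queue would not.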

Better Still

If you have a cluster of application servers that can all generate outbound email, why not take the next step? Move the MDBs out into their own app server and have the message queues from all the app servers terminate there? (If you’re using pub-sub instead of point-to-point, this will be pretty much transparent.) This application will resemble a message broker… for good reason. It’s essentially just pulling messages in from one protocol, transforming them, then sending them out over another protocol.

The best part? You don’t even have to write the message broker yourself. There are plenty of open-source and commercial alternatives.


Sending email directly from the request-handling thread performs poorly, creates unpredictable page latency for users and risks dropping their emails right on the floor. It’s better to drop a message in a queue for asynchronous transformation by a message broker: it’s faster, more reliable, and there’s less code for you to write.

Two Sites, One Antipattern

This week, I had Groundhog Day in December.  I was visiting two different clients, but they each told the same tale of woe.

At my first stop, the director of IT told me about a problem they had recently found and eliminated.

They’re a retailer. Like many retailers, they try to increase sales through "upselling" and "cross-selling". So, when you go to check out, they show you some other products that you might want to buy.  It’s good to show customers relevant products that are also advantageous to sell. For example, if a customer buys a big HDTV, offer them cables (80% margin) instead of DVDs (3% margin).

All but one of the slots on that page are filled through deliberate merchandising. People decide what to display there, the same way they decide what to put in the endcaps or next to the register in a physical store. The final slot, though, gets populated automatically according to the products in the customer’s cart. Based on the original requirements for the site, the code to populate that slot looked for products in the catalog with similar attributes, then sorted through them to find the "best" product.  (Based on some balance of closely-matched attributes and high margin, I suspect.)

The problem was that there were too many products that would match.  The attributes clustered too much for the algorithm, so the code for this slot would pull back thousands of products from the catalog.  It would turn each row in the result set into an object, then weed through them in memory.

Without that slot, the page would render in under a second.  With it, two minutes, or worse.

It had been present for more than two years. You might ask, "How could that go unnoticed for two years?" Well, it didn’t, of course. But, because it had always been that way, most everyone was just used to it. When the wait times would get too bad, this one guy would just restart app servers until it got better.

Removing that slot from the page not only improved their stability, it vastly increased their capacity. Imagine how much more they could have added to the bottom line if they hadn’t overspent for the last two years to compensate. 

At my second stop, the site suffered from serious stability problems. At any given time, it was even odds that at least one app server would be vapor locked. Three to five times a day, that would ripple through and take down all the app servers. One key symptom was a sudden spike in database connections.

Some nice work by the DBAs revealed a query from the app servers that was taking way too long. No query from a web app should ever take more than half a second, but this one would run for 90 seconds or more. Usually that means the query logic is bad.  In this case, though, the logic was OK, but the query returned 1.2 million rows. The app server would doggedly convert those rows into objects in a Vector, right up until it started thrashing the garbage collector. Eventually, it would run out of memory, but in the meantime, it held a lot of row locks.  All the other app servers would block on those row locks.  The team applied a band-aid to the query logic, and those crashes stopped.

What’s the common factor here? It’s what I call an "Unbounded Result Set".  Neither of these applications limited the amount of data they requested, even though there certainly were limits to how much they could process.  In essence, both of these applications trusted their databases.  The apps weren’t prepared for the data to be funky, weird, or oversized. They assumed too much.

You should make your apps paranoid about their data. If your app processes one record at a time, then looping through an entire result set might be OK, as long as you’re not making a user wait while you do. But if your app turns rows into objects, then it had better be very selective about its SELECTs. The relationships might not be what you expect. The data producer might have changed in a surprising way, particularly if it’s not under your control. Purging routines might not be in place, or might have gotten broken. Definitely don’t trust some other application or batch job to load your data in a safe way.
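One way to picture that paranoia, as a sketch only: refuse to build more objects than you are prepared to hold. The BoundedResults helper below is hypothetical and works on any Iterator, so it runs without a database. With JDBC, the cheaper guard is on the database side, either a row limit in the query itself or Statement.setMaxRows, which silently drops rows beyond the limit.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: turns rows into a list, but never more than the caller allows.
class BoundedResults {

    // Signals that the source held more rows than the caller was prepared to process.
    static class TooManyRowsException extends RuntimeException {
        TooManyRowsException(int limit) {
            super("result set exceeded " + limit + " rows");
        }
    }

    // Copies up to `limit` rows and fails fast as soon as one more row shows up.
    static <T> List<T> readAtMost(Iterator<T> rows, int limit) {
        List<T> out = new ArrayList<>();
        while (rows.hasNext()) {
            if (out.size() == limit) {
                throw new TooManyRowsException(limit);
            }
            out.add(rows.next());
        }
        return out;
    }
}
```

Failing loudly is a design choice here. Truncating quietly would also protect the heap, but it hides the fact that the data has grown past what the code expected.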

No matter what odd condition your app stumbles across in the database, it should not be vulnerable.

Read-write Splitting With Oracle

Speaking of databases and read/write splitting, Oracle had a session at OpenWorld about it.

Building a read pool of database replicas isn’t something I usually think of doing with Oracle, mainly due to their non-zero license fees.  It changes the scaling equation.

Still, if you are on Oracle and the fees work for you, consider Active Data Guard.   Some key facts from the slides:

  • Average latency for replication was 1 second
  • The maximum latency spike they observed was 10 seconds.
  • A node can take itself offline if it detects excessive latency.
  • You can use DBLinks to allow applications to think they’re writing to a read node.  The node will transparently pass the writes through to the master.
  • This can be done without any tricky JDBC proxies or load-balancing drivers, just the normal Oracle JDBC driver with the bugs we all know and love.
  • Active Data Guard requires Oracle 11g.

Budgetecture and Its Ugly Cousins

It’s the time of year for family gatherings, so here’s a repulsive group portrait of some nearly universal pathologies. Try not to read this while you’re eating.


We’ve all been hit with budgetecture.  That’s when sound technology choices go out the window in favor of cost-cutting. The conversation goes something like this.

"Do we really need X?" asks the project sponsor. (A.k.a. the gold owner.)

For "X", you can substitute nearly anything that’s vitally necessary to make the system run: software licenses, redundant servers, offsite backups, or power supplies.  It’s always asked with a sort of paternalistic tone, as though the grown-up has caught us blowing all our pocket money on comic books and bubble gum, whilst the serious adults are trying to get on with buying more buckets to carry their profits around in.

The correct way to answer this is "Yes.  We do."  That’s almost never the response.

After all, we’re trained as engineers, and engineering is all about making trade-offs. We know good and well that you don’t really need extravagances like power supplies, so long as there’s a sufficient supply of hamster wheels and cheap interns in the data center.  So instead of simply saying, "Yes. We do," we go on with something like, "Well, you could do without a second server, provided you’re willing to accept downtime for routine maintenance and whenever a RAM chip gets hit by a cosmic ray and flips a bit, causing a crash, but if we get error-checking parity memory then we get around that, so we just have to worry about the operating system crashing, which it does about every three-point-nine days, so we’ll have to institute a regime of nightly restarts that the interns can do whenever they’re taking a break from the power-generating hamster wheels."

All of which might be completely true, but is utterly the wrong thing to say. The sponsor has surely stopped listening after the word, "Well…"

The problem is that you see your part as an engineering role, while your sponsor clearly understands he’s engaged in a negotiation. And in a negotiation, the last thing you want to do is make concessions on the first demand. In fact, the right response to the "do we really need" question is something like this:

"Without a second server, the whole system will come crashing down at least three times daily, particularly when it’s under heaviest load or when you are doing a demo for the Board of Directors. In fact, we really need four servers so we can take an HA pair down independently at any time while still maintaining 100% of our capacity, even in case one of the remaining pair crashes unexpectedly."

Of course, we both know you don’t really need the third and fourth servers. This is just a gambit to get the sponsor to change the subject to something else. You’re upping the ante and showing that you’re already running at the bare, dangerous, nearly-irresponsible minimum tolerable configuration. And besides, if you do actually get the extra servers, you can certainly use one to make your QA environment match production, and the other will make a great build box.

Schedule Quid Pro Quo

Another situation in which we harm ourselves by bringing engineering trade-offs to a negotiation comes when the schedule slips. Statistically speaking, we’re more likely to pick up the bass line from "La Bamba" from a pair of counter-rotating neutron stars than we are to complete a project on time. Sooner or later, you’ll realize that the only way to deliver your project on time and under budget is to reduce it to roughly the scope of "Hello, world!"

When that happens, being a responsible developer, you’ll tell your sponsor that the schedule needs to slip. You may not realize it, but by uttering those words, you’ve given the international sign of negotiating weakness.

Your sponsor, who has his or her own reputation—not to mention budget—tied to the delivery of this project, will reflexively respond with, "We can move the date, but if I give you that, then you have to give me these extra features."

The project is already going to be late. Adding features will surely make it more late, particularly since you’ve already established that the team isn’t moving as fast as expected. So why would someone invested in the success of the project want to further damage it by increasing the scope? It’s about as productive as soaking a grocery store bag (the paper kind) in water, then dropping a coconut into it.

I suspect that it’s sort of like dragging a piece of yarn in front of a kitten. It can’t help but pounce on it. It’s just what kittens do.

My only advice in this situation is to counter with data. Produce the burndown chart showing when you will actually be ready to release with the current scope. Then show how the fractally iterative cycle of slippage followed by scope creep produces a delivery date that will be moot, as the sun will have exploded before you reach beta.

The Fallacy of Capital

When something costs a lot, we want to use it all the time, regardless of how well suited it is or is not.

This is sort of the inverse of budgetecture.  For example, relational databases used to cost roughly the same as a battleship. So, managers got it in their heads that everything needed to be in the relational database.  Singular. As in, one.

Well, if one database server is the source of all truth, you’d better be pretty careful with it. And the best way to be careful with it is to make sure that nobody, but nobody, ever touches it. Then you collect a group of people with malleable young minds and a bent toward obsessive-compulsive abbreviation forming, and you make them the Curators of Truth.

But, because the damn thing cost so much, you need to get your money’s worth out of it. So, you mandate that every application must store its data in The Database, despite the fact that nobody knows where it is, what it looks like, or even if it really exists.  Like Schrodinger’s cat, it might already be gone, it’s just that nobody has observed it yet. Still, even that genetic algorithm with simulated annealing, running ten million Monte Carlo fitness tests is required to keep its data in The Database.

(In the above argument, feel free to substitute IBM Mainframe, WebSphere, AquaLogic, ESB, or whatever your capital fallacy du jour may be.)

Of course, if databases didn’t cost so much, nobody would care how many of them there are. Which is why MySQL, Postgres, SQLite, and the others are really so useful. It’s not an issue to create twenty or thirty instances of a free database. There’s no need to collect them up into a grand "enterprise data architecture". In fact, exactly the opposite is true. You can finally let independent business units evolve independently. Independent services can own their own data stores, and never let other applications stick their fingers into their guts.


So there you have it, a small sample of the rogue’s gallery. These bad relations don’t get much photo op time with the CEO, but if you look, you’ll find them lurking in some cubicle just around the corner.


Releasing a Free SingleLineFormatter

A number of readers have asked me for reference implementations of the stability and capacity patterns.

I’ve begun to create some free implementations to go along with Release It. As of today, it just includes a drop-in formatter that you can use in place of the java.util.logging default (which is horrible).

This formatter keeps all the fields lined up in columns, including truncating the logger name and method name if necessary. A columnar format is much easier for the human eye to scan. We all have great pattern-matching machinery in our heads. I can’t for the life of me understand why so many vendors work so hard to defeat it. The one thing that doesn’t get stuffed into a column is a stack trace. It’s good for a stack trace to interrupt the flow of the log file… that’s something that you really want to pop out when scanning the file.
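To give a feel for the idea, here is a minimal columnar formatter for java.util.logging. It is not the SingleLineFormatter from the library. The class name ColumnarFormatter, the column widths, and the choice to keep the tail of a long name are all assumptions made for this sketch.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

// Illustrative only: fixed-width columns, one line per record, stack traces left unaligned.
class ColumnarFormatter extends Formatter {
    private static final int LEVEL_WIDTH = 7;   // "WARNING" is the longest standard level name
    private static final int LOGGER_WIDTH = 20; // assumed width
    private static final int METHOD_WIDTH = 16; // assumed width

    // Pads short values and truncates long ones so the column always has the same width.
    static String column(String value, int width) {
        String v = (value == null) ? "" : value;
        if (v.length() > width) {
            v = v.substring(v.length() - width); // keep the tail, usually the most specific part
        }
        return String.format("%-" + width + "s", v);
    }

    @Override
    public String format(LogRecord record) {
        StringBuilder line = new StringBuilder();
        line.append(String.format("%tF %<tT", record.getMillis())); // date and time, local zone
        line.append(' ').append(column(record.getLevel().getName(), LEVEL_WIDTH));
        line.append(' ').append(column(record.getLoggerName(), LOGGER_WIDTH));
        line.append(' ').append(column(record.getSourceMethodName(), METHOD_WIDTH));
        line.append(' ').append(formatMessage(record));
        line.append(System.lineSeparator());
        if (record.getThrown() != null) {
            // The stack trace deliberately breaks out of the columns so it stands out.
            StringWriter trace = new StringWriter();
            record.getThrown().printStackTrace(new PrintWriter(trace, true));
            line.append(trace);
        }
        return line.toString();
    }
}
```

Plugging it in is one call on whichever handler you use, for example handler.setFormatter(new ColumnarFormatter()).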

It only takes a minute to plug in the SingleLineFormatter. Your admins will thank you for it.

Read about the library.

Download it as .zip or .tgz.

A Dozen Levels of Done

What does "done" mean to you?  I find that my definition of "done" continues to expand. When I was still pretty green, I would say "It’s done" when I had finished coding.  (Later, a wiser and more cynical colleague taught me that "done" meant that you had not only finished the work, but made sure to tell your manager you had finished the work.)

The next meaning of "done" that I learned had to do with version control. It’s not done until it’s checked in.

Several years ago, I got test infected and my definition of "done" expanded to include unit testing.

Now that I’ve lived in operations for a few years and gotten to know and love Lean Software Development, I have a new definition of "done".

Here goes:

A feature is not "done" until all of the following can be said about it:

  1. All unit tests are green.
  2. The code is as simple as it can be.
  3. It communicates clearly.
  4. It compiles in the automated build from a clean checkout.
  5. It has passed unit, functional, integration, stress, longevity, load, and resilience testing.
  6. The customer has accepted the feature.
  7. It is included in a release that has been branched in version control.
  8. The feature’s impact on capacity is well-understood.
  9. Deployment instructions for the release are defined and do not include a "point of no return".
  10. Rollback instructions for the release are defined and tested.
  11. It has been deployed and verified.
  12. It is generating revenue.

Until all of these are true, the feature is just unfinished inventory.

Postmodern Programming

It’s taken me a while to get to this talk. Not because it was uninteresting, just because it sent my mind in so many directions that I needed time to collect my scattered thoughts.

Objects and Lego Blocks 

On Thursday, James Noble delivered a Keynote about "The Lego Hypothesis". As you might guess, he was talking about the dream of building software as easily as a child assembles a house from Lego bricks. He described it as an old dream, using quotes from the very first conference on Software Engineering… the one where they utterly invented the term "Software Engineering" itself.  In 1968.

The Lego Hypothesis goes something like this: "In the future, software engineering will be set free from the mundane necessity of programming." To realize this dream, we should look at the characteristics of Lego bricks and see if software at all mirrors those characteristics.

Noble ascribed the following characteristics to components:

  • Small
  • Indivisible
  • Substitutable
  • More similar than different
  • Abstract encapsulations
  • Coupled to a few, close neighbors
  • No action at a distance

(These actually predate that 1968 software engineering conference by quite a bit. They were first described by the Greek philosopher Democritus in his theory of atomos.)

The first several characteristics sound a lot like the way we understand objects. The last two are problematic, though.

Examining many different programs and languages, Noble’s research group has found that objects are typically not connected to just a few nearby objects. The majority of objects are coupled to just one or two others. But the extremal cases are very, very extreme. In a Self program, one object had over 10,000,000 inbound references. That is, it was coupled to more than 10,000,000 other objects in the system. (It’s probably ‘nil’, ‘true’, ‘false’, or perhaps the integer object ‘zero’.)

In fact, object graphs tend to form scale-free networks that can be described by power laws.

Lots of other systems in our world form scale-free networks with power law distributions:

  • City sizes
  • Earthquake magnitudes
  • Branches in a roadway network
  • The Internet
  • Blood vessels
  • Galaxy sizes
  • Impact crater diameters
  • Income distributions
  • Book sales

One of the first things to note about power law distributions is that they are not normal. That is, words like "average" and "median" are very misleading. If the average inbound coupling is 1.2, but the maximum is 10,000,000, how much does the average tell you about the large scale behavior of the system?
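A tiny made-up data set shows the effect. The numbers below are invented for illustration and are not from Noble’s research: 999 objects with one inbound reference each, plus a single hub with 10,000. The class and method names are likewise hypothetical.

```java
import java.util.Arrays;

// Illustrative arithmetic on a synthetic, heavily skewed set of inbound reference counts.
class CouplingStats {

    // 999 objects referenced once each, plus one hub referenced 10,000 times.
    static long[] sample() {
        long[] inbound = new long[1000];
        Arrays.fill(inbound, 1L);
        inbound[0] = 10_000L;
        return inbound;
    }

    static double mean(long[] xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    // Upper median of a non-empty array; the input is left unsorted.
    static long median(long[] xs) {
        long[] sorted = xs.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    static long max(long[] xs) {
        return Arrays.stream(xs).max().orElse(0L);
    }
}
```

For this sample the mean is 10.999, the median is 1, and the maximum is 10,000. Neither summary number hints that one object is wired to the entire system.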

(An aside: this is the fundamental problem that makes random events so problematic in Nassim Taleb’s book The Black Swan. Benoit Mandelbrot also considers this in The (Mis)Behavior of Markets. Yes, that Mandelbrot.)

Noble made a pretty good case that the Lego Hypothesis is dead as disco. Then came a leap of logic that I must have missed.


"The ultimate goal of computer science is the program."

You are assigned to write a program to calculate the first 100 prime numbers. If you are a student, you have to write this as if it exists in a vacuum. That is, you code as if this is the first program in the universe. It isn’t. Once you leave the unique environs of school, you’re not likely to sit down with a pad of lined paper and a mechanical pencil to derive your own prime-number-finding algorithm. Instead, your first stop is probably Google.

Searching for "prime number sieve" currently gives me about 644,000 results in three-tenths of a second. The results include implementations in JavaScript, Java, C, C++, FORTRAN, PHP, and many others. In fact, if I really need prime numbers rather than a program to find them, I can just parasitize somebody else’s computing power with online prime number generators.

Noble quotes Steven Conner from the Cambridge Companion to Postmodernism:

"…that condition in which, for the first time, and as a result of technologies which allow the large-scale storage, access, and re-production of records of the past, the past appears to be included in the present."

In art and literature, postmodernism incorporates elements of past works, directly and by reference. In programming, it means that every program ever written is still alive. They are "alive" in the sense that even dead hardware can be emulated. Papers from the dawn of computing are available online. There are execution environments for COBOL that run in Java Virtual Machines, possibly on virtual operating systems. Today’s systems can completely contain every previous language, program, and execution environment.

I’m now writing well beyond my actual understanding of postmodern critical theory and trying to report what Noble was talking about in his keynote.

The same technological changes that caused the rise of postmodernism in art, film, and literature are now in full force in programming. In a very real sense, we did it to ourselves! We technologists and programmers created the technologies (globe-spanning networks, high compression codecs, indexing and retrieval, collaborative filtering, virtualization, emulation) that are now reshaping our profession.

In the age of postmodern programming, there are no longer "correct algorithms". Instead, there are contextual decisions, negotiations, and contingencies. Instead of The Solution, we have individual solutions that solve problems in a context. This should sound familiar to anyone in the patterns movement.

Indeed, he directly references patterns and eXtreme Programming as postmodern programming phenomena, along with "scrap-heap" programming, mashups, glue programming, and scripting languages.

I searched for a great way to wrap this piece up, but ultimately it seemed more appropriate to talk about the contextual impact it had on me. First, I’ve never been fond of postmodernism; it always seemed simultaneously precious and pretentious. Now, I’ll be giving that movement more attention. Second, I’ve always thought of mashups as sort of tawdry and sordid, not real programming, you know? I’ll be reconsidering that position as well.

Conference: “Velocity”

O’Reilly has announced an upcoming conference called Velocity.

From the announcement:

Web companies, big and small, face many of the same challenges: sites must be faster, infrastructure needs to scale, and everything must be available to customers at all times, no matter what. Velocity is the place to obtain the crucial skills and knowledge to build successful web sites that are fast, scalable, resilient, and highly available.

Unfortunately, there are few opportunities to learn from peers, exchange ideas with experts, and share best practices and lessons learned.

Velocity is changing that by providing the best information on building and operating web sites that are fast, reliable, and always up. We’re bringing together people from around the world who are doing the best performance work, to improve the experience of web users worldwide. Pages will be faster. Sites will have higher up-time. Companies will achieve more with less. The next cool startup will be able to more quickly scale to serve a larger audience, globally. Velocity is the key for crossing over from cool Web 2.0 features to sustainable web sites.

That statement could have been the preface to my book, so I’ll be submitting several proposals for talks.

Putting My Mind Online

Along with the longer analysis pieces, I’ve decided to post the entirety of my notes from QCon San Francisco. A few of my friends and colleagues are fellow mind-mappers, so this is for them.

Nygard’s Mind Map from QCon

This file works with FreeMind, a fast, fluid, and free mind mapping tool.