Wide Awake Developers

« Read-write splitting with Oracle | Main | Should Email Errors Keep Customers From Buying? »

Two Sites, One Antipattern

This week, I had Groundhog Day in December.  I was visiting two different clients, but they each told the same tale of woe.

At my first stop, the director of IT told me about a problem they had recently found and eliminated.

They're a retailer. Like many retailers, they try to increase sales through "upselling" and "cross-selling". So, when you go to check out, they show you some other products that you might want to buy.  It's good to show customers relevant products that are also advantageous to sell.
For example, if a customer buys a big HDTV, offer them cables (80% margin) instead of DVDs (3% margin).

All but one of the slots on that page are filled through deliberate merchandising. People decide what to display there, the same way they decide what to put in the endcaps or next to the register in a physical store. The final slot, though, gets populated automatically according to the products in the customer's cart. Based on the original requirements for the site, the code to populate that slot looked for products in the catalog with similar attributes, then sorted through them to find the "best" product.  (Based on some balance of closely-matched attributes and high margin, I suspect.)

The problem was that there were too many products that would match.  The attributes clustered too much for the algorithm, so the code for this slot would pull back thousands of products from the catalog.  It would turn each row in the result set into an object, then weed through them in memory.

Without that slot, the page would render in under a second.  With it, two minutes, or worse.

It had been present for more than two years. You might ask, "How could that go unnoticed for two years?" Well, it didn't, of course. But, because it had always been that way, most everyone was just used to it. When the wait times would get too bad, this one guy would just restart app servers until it got better.

Removing that slot from the page not only improved their stability, it vastly increased their capacity. Imagine how much more they could have added to the bottom line if they hadn't overspent for the last two years to compensate. 

At my second stop, the site suffered from serious stability problems. At any given time, it was even odds that at least one app server would be vapor locked. Three to five times a day, that would ripple through and take down all the app servers. One key symptom was a sudden spike in database connections.

Some nice work by the DBAs revealed a query from the app servers that was taking way too long. No query from a web app should ever take more than half a second, but this one would run for 90 seconds or more. Usually that means the query logic is bad.  In this case, though, the logic was OK, but the query returned 1.2 million rows. The app server would doggedly convert those rows into objects in a Vector, right up until it started thrashing the garbage collector. Eventually, it would run out of memory, but in the meantime, it held a lot of row locks.  All the other app servers would block on those row locks.  The team applied a band-aid to the query logic, and those crashes stopped.

What's the common factor here? It's what I call an "Unbounded Result Set".  Neither of these applications limited the amount of data they requested, even though there certainly were limits to how much they could process.  In essence, both of these applications trusted their databases.  The apps weren't prepared for the data to be funky, weird, or oversized. They assumed too much.

You should make your apps be paranoid about their data.   If your app processes one record at a time, then looping through an entire result set might be OK---as long as you're not making a user wait while you do.  But if your app that turns rows into objects, then it had better be very selective about its SELECTs.  The relationships might not be what you expect.  The data producer might have changed in a surprising way, particularly if it's not under your control.  Purging routines might not be in place, or might have gotten broken.  Definitely don't trust some other application or batch job to load your data in a safe way.

No matter what odd condition your app stumbles across in the database, it should not be vulnerable.

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)