Wide Awake Developers


Two Sites, One Antipattern

This week, I had Groundhog Day in December.  I was visiting two different clients, but they each told the same tale of woe.

At my first stop, the director of IT told me about a problem they had recently found and eliminated.

They're a retailer. Like many retailers, they try to increase sales through "upselling" and "cross-selling". So, when you go to check out, they show you some other products that you might want to buy. The trick is to show customers products that are relevant to them and advantageous to sell.
For example, if a customer buys a big HDTV, offer them cables (80% margin) instead of DVDs (3% margin).

All but one of the slots on that page are filled through deliberate merchandising. People decide what to display there, the same way they decide what to put on endcaps or next to the register in a physical store. The final slot, though, gets populated automatically according to the products in the customer's cart. Based on the original requirements for the site, the code to populate that slot looked for products in the catalog with similar attributes, then sorted through them to find the "best" product.  (Based on some balance of closely-matched attributes and high margin, I suspect.)

The problem was that too many products matched. The attributes clustered too tightly for the algorithm to discriminate, so the code for this slot would pull back thousands of products from the catalog. It would turn each row in the result set into an object, then weed through them in memory.
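From the description, the slot-filling code probably looked something like the sketch below. This is my reconstruction, not their actual source; the table, columns, and class are invented for illustration. The point is what's missing: nothing bounds the result set.

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical reconstruction of the antipattern. Nothing limits how many
    // rows come back, so every matching product becomes an object in memory
    // before the "weeding" begins.
    public class RelatedProductSlot {
        public String bestMatchSku(Connection conn, String attributePrefix)
                throws SQLException {
            Map<String, BigDecimal> candidates = new HashMap<String, BigDecimal>();
            PreparedStatement stmt = conn.prepareStatement(
                    "SELECT sku, margin FROM products WHERE attributes LIKE ?");
            try {
                stmt.setString(1, attributePrefix + "%");
                ResultSet rs = stmt.executeQuery();
                while (rs.next()) {                      // thousands of rows...
                    candidates.put(rs.getString("sku"),  // ...each one a new object
                            rs.getBigDecimal("margin"));
                }
            } finally {
                stmt.close();
            }
            // Only now does the weeding happen, entirely in memory.
            String best = null;
            for (Map.Entry<String, BigDecimal> e : candidates.entrySet()) {
                if (best == null || e.getValue().compareTo(candidates.get(best)) > 0) {
                    best = e.getKey();
                }
            }
            return best;
        }
    }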

Without that slot, the page would render in under a second.  With it, two minutes, or worse.

It had been present for more than two years. You might ask, "How could that go unnoticed for two years?" Well, it didn't, of course. But, because it had always been that way, most everyone was just used to it. When the wait times would get too bad, this one guy would just restart app servers until it got better.

Removing that slot from the page not only improved their stability, it vastly increased their capacity. Imagine how much more they could have added to the bottom line if they hadn't overspent for the last two years to compensate. 

At my second stop, the site suffered from serious stability problems. At any given time, it was even odds that at least one app server would be vapor locked. Three to five times a day, that would ripple through and take down all the app servers. One key symptom was a sudden spike in database connections.

Some nice work by the DBAs revealed a query from the app servers that was taking way too long. No query from a web app should ever take more than half a second, but this one would run for 90 seconds or more. Usually that means the query logic is bad.  In this case, though, the logic was OK, but the query returned 1.2 million rows. The app server would doggedly convert those rows into objects in a Vector, right up until it started thrashing the garbage collector. Eventually, it would run out of memory, but in the meantime, it held a lot of row locks.  All the other app servers would block on those row locks.  The team applied a band-aid to the query logic, and those crashes stopped.

What's the common factor here? It's what I call an "Unbounded Result Set".  Neither of these applications limited the amount of data they requested, even though there certainly were limits to how much they could process.  In essence, both of these applications trusted their databases.  The apps weren't prepared for the data to be funky, weird, or oversized. They assumed too much.

You should make your apps paranoid about their data.  If your app processes one record at a time, then looping through an entire result set might be OK---as long as you're not making a user wait while you do.  But if your app turns rows into objects, then it had better be very selective about its SELECTs.  The relationships might not be what you expect.  The data producer might have changed in a surprising way, particularly if it's not under your control.  Purging routines might not be in place, or might have broken.  Definitely don't trust some other application or batch job to load your data in a safe way.
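To make that concrete, here's a minimal sketch of a bounded query, assuming plain JDBC. The table, class, and the limit of 50 are illustrative, not from either client's code. The ceiling is enforced twice: once by the driver via setMaxRows(), and once in the loop, because a belt still works when the suspenders fail.

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    // A bounded SELECT: the app states up front how much data it can afford
    // to turn into objects, no matter what the database holds.
    public class BoundedProductQuery {
        private static final int MAX_RESULTS = 50;  // illustrative budget

        public List<BigDecimal> marginsFor(Connection conn, String category)
                throws SQLException {
            List<BigDecimal> margins = new ArrayList<BigDecimal>();
            PreparedStatement stmt = conn.prepareStatement(
                    "SELECT margin FROM products WHERE category = ?");
            try {
                stmt.setMaxRows(MAX_RESULTS);  // driver-enforced ceiling
                stmt.setQueryTimeout(5);       // and never wait forever, either
                stmt.setString(1, category);
                ResultSet rs = stmt.executeQuery();
                while (rs.next() && margins.size() < MAX_RESULTS) {
                    margins.add(rs.getBigDecimal("margin"));
                }
            } finally {
                stmt.close();
            }
            return margins;
        }
    }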

No matter what odd condition your app stumbles across in the database, it should not be vulnerable.

Read-write splitting with Oracle

Speaking of databases, Oracle had a session at OpenWorld about read/write splitting.

Building a read pool of database replicas isn't something I usually think of doing with Oracle, mainly because their non-zero license fees change the scaling equation.

Still, if you are on Oracle and the fees work for you, consider Active Data Guard.   Some key facts from the slides:

  • Average latency for replication was 1 second.
  • The maximum latency spike they observed was 10 seconds.
  • A node can take itself offline if it detects excessive latency.
  • You can use DBLinks to allow applications to think they're writing to a read node.  The node will transparently pass the writes through to the master.
  • This can be done without any tricky JDBC proxies or load-balancing drivers, just the normal Oracle JDBC driver with the bugs we all know and love.
  • Active Data Guard requires Oracle 11g.
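If you build a read pool this way, the application still has to route its own traffic. Here's one simple shape for that, as a sketch; the class and both DataSources are my invention, configured elsewhere (say, via JNDI), not anything from the Oracle session.

    import java.sql.Connection;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    // Routes writes to the master and reads to a load-balanced replica pool.
    // Callers of forReading() must tolerate stale data: per the slides, about
    // a second of replication latency on average, ten in the worst case.
    public class RoutingConnectionFactory {
        private final DataSource master;
        private final DataSource readPool;

        public RoutingConnectionFactory(DataSource master, DataSource readPool) {
            this.master = master;
            this.readPool = readPool;
        }

        public Connection forWriting() throws SQLException {
            return master.getConnection();
        }

        public Connection forReading() throws SQLException {
            return readPool.getConnection();
        }
    }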

Budgetecture and its ugly cousins

It's the time of year for family gatherings, so here's a repulsive group portrait of some nearly universal pathologies. Try not to read this while you're eating.

Budgetecture

We've all been hit with budgetecture.  That's when sound technology choices go out the window in favor of cost-cutting. The conversation goes something like this.

"Do we really need X?" asks the project sponsor. (A.k.a. the gold owner.)

For "X", you can substitute nearly anything that's vitally necessary to make the system run: software licenses, redundant servers, offsite backups, or power supplies.  It's always asked with a sort of paternalistic tone, as though the grown-up has caught us blowing all our pocket money on comic books and bubble gum, whilst the serious adults are trying to get on with buying more buckets to carry their profits around in.

The correct way to answer this is "Yes.  We do."  That's almost never the response.

After all, we're trained as engineers, and engineering is all about making trade-offs. We know good and well that you don't really need extravagances like power supplies, so long as there's a sufficient supply of hamster wheels and cheap interns in the data center.  So instead of simply saying, "Yes. We do," we go on with something like, "Well, you could do without a second server, provided you're willing to accept downtime for routine maintenance and whenever a RAM chip gets hit by a cosmic ray and flips a bit, causing a crash, but if we get error-checking parity memory then we get around that, so we just have to worry about the operating system crashing, which it does about every three-point-nine days, so we'll have to institute a regime of nightly restarts that the interns can do whenever they're taking a break from the power-generating hamster wheels."

All of which might be completely true, but is utterly the wrong thing to say. The sponsor has surely stopped listening after the word, "Well..."

The problem is that you see yourself in an engineering role, while your sponsor clearly understands he's engaged in a negotiation. And in a negotiation, the last thing you want to do is make concessions on the first demand. In fact, the right response to the "do we really need" question is something like this:

"Without a second server, the whole system will come crashing down at least three times daily, particularly when it's under heaviest load or when you are doing a demo for the Board of Directors. In fact, we really need four servers so we can take an HA pair down independently at any time while still maintaining 100% of our capacity, even in case one of the remaining pair crashes unexpectedly."

Of course, we both know you don't really need the third and fourth servers. This is just a gambit to get the sponsor to change the subject to something else. You're upping the ante and showing that you're already running at the bare, dangerous, nearly-irresponsible minimum tolerable configuration. And besides, if you do actually get the extra servers, you can certainly use one to make your QA environment match production, and the other will make a great build box.

Schedule Quid Pro Quo

Another situation in which we harm ourselves by bringing engineering trade-offs to a negotiation comes when the schedule slips. Statistically speaking, we're more likely to pick up the bass line from "La Bamba" from a pair of counter-rotating neutron stars than we are to complete a project on time. Sooner or later, you'll realize that the only way to deliver your project on time and under budget is to reduce it to roughly the scope of "Hello, world!"

When that happens, being a responsible developer, you'll tell your sponsor that the schedule needs to slip. You may not realize it, but by uttering those words, you've given the international sign of negotiating weakness.

Your sponsor, who has his or her own reputation---not to mention budget---tied to the delivery of this project, will reflexively respond with, "We can move the date, but if I give you that, then you have to give me these extra features."

The project is already going to be late. Adding features will surely make it later still, particularly since you've already established that the team isn't moving as fast as expected. So why would someone invested in the success of the project want to damage it further by increasing the scope? It's about as productive as soaking a grocery store bag (the paper kind) in water, then dropping a coconut into it.

I suspect that it's sort of like dragging a piece of yarn in front of a kitten. It can't help but pounce on it. It's just what kittens do.

My only advice in this situation is to counter with data. Produce the burndown chart showing when you will actually be ready to release with the current scope. Then show how the fractally iterative cycle of slippage followed by scope creep produces a delivery date that will be moot, as the sun will have exploded before you reach beta.

The Fallacy of Capital

When something costs a lot, we want to use it all the time, regardless of how well suited it is or is not.

This is sort of the inverse of budgetecture.  For example, relational databases used to cost roughly the same as a battleship. So, managers got it in their heads that everything needed to be in the relational database.  Singular. As in, one.

Well, if one database server is the source of all truth, you'd better be pretty careful with it. And the best way to be careful with it is to make sure that nobody, but nobody, ever touches it. Then you collect a group of people with malleable young minds and a bent toward obsessive-compulsive abbreviation forming, and you make them the Curators of Truth.

But, because the damn thing cost so much, you need to get your money's worth out of it. So, you mandate that every application must store its data in The Database, despite the fact that nobody knows where it is, what it looks like, or even if it really exists.  Like Schrödinger's cat, it might already be gone; it's just that nobody has observed it yet. Still, even that genetic algorithm with simulated annealing, running ten million Monte Carlo fitness tests, is required to keep its data in The Database.

(In the above argument, feel free to substitute IBM Mainframe, WebSphere, AquaLogic, ESB, or whatever your capital fallacy du jour may be.)

Of course, if databases didn't cost so much, nobody would care how many of them there are. Which is why MySQL, Postgres, SQLite, and the others are really so useful. It's not an issue to create twenty or thirty instances of a free database. There's no need to collect them up into a grand "enterprise data architecture". In fact, exactly the opposite is true. You can finally let independent business units evolve independently. Independent services can own their own data stores, and never let other applications stick their fingers into their guts.


So there you have it, a small sample of the rogues' gallery. These bad relations don't get much photo op time with the CEO, but if you look, you'll find them lurking in some cubicle just around the corner.


Releasing a free SingleLineFormatter

A number of readers have asked me for reference implementations of the stability and capacity patterns.

I've begun to create some free implementations to go along with Release It! As of today, the library just includes a drop-in formatter that you can use in place of the java.util.logging default (which is horrible).

This formatter keeps all the fields lined up in columns, including truncating the logger name and method name if necessary. A columnar format is much easier for the human eye to scan. We all have great pattern-matching machinery in our heads. I can't for the life of me understand why so many vendors work so hard to defeat it. The one thing that doesn't get stuffed into a column is a stack trace. It's good for a stack trace to interrupt the flow of the log file... that's something that you really want to pop out when scanning the file.
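For the curious, the shape of such a formatter is simple. What follows is a minimal sketch of the idea, not the released code; the column widths and timestamp format are arbitrary choices of mine.

    import java.io.PrintWriter;
    import java.io.StringWriter;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.logging.Formatter;
    import java.util.logging.LogRecord;

    // A columnar, one-line-per-record formatter for java.util.logging.
    public class ColumnarFormatter extends Formatter {
        public String format(LogRecord record) {
            StringBuilder sb = new StringBuilder();
            sb.append(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
                    .format(new Date(record.getMillis())));
            sb.append(' ').append(fit(record.getLevel().getName(), 7));
            sb.append(' ').append(fit(record.getLoggerName(), 30));
            sb.append(' ').append(fit(record.getSourceMethodName(), 20));
            sb.append(' ').append(formatMessage(record));
            sb.append(System.getProperty("line.separator"));
            // Stack traces deliberately break the columns so they pop out.
            if (record.getThrown() != null) {
                StringWriter sw = new StringWriter();
                record.getThrown().printStackTrace(new PrintWriter(sw));
                sb.append(sw);
            }
            return sb.toString();
        }

        // Truncate or right-pad so every record lines up. Truncation keeps the
        // tail, since the end of a logger name carries the most information.
        private String fit(String s, int width) {
            if (s == null) s = "";
            if (s.length() > width) return s.substring(s.length() - width);
            return String.format("%-" + width + "s", s);
        }
    }

Wiring in a formatter like this takes one line in logging.properties, e.g. "java.util.logging.ConsoleHandler.formatter = ColumnarFormatter" (assuming the class is on the classpath).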

It only takes a minute to plug in the SingleLineFormatter. Your admins will thank you for it.

Read about the library.

Download it as .zip or .tgz.