Wide Awake Developers

Two Sites, One Antipattern

| Comments

This week, I had Groundhog Day in December.  I was visiting two different clients, but they each told the same tale of woe.

At my first stop, the director of IT told me about a problem they had recently found and eliminated.

They’re a retailer. Like many retailers, they try to increase sales through "upselling" and "cross-selling". So, when you go to check out, they show you some other products that you might want to buy.  It’s good to show customers relevant products that are also advantageous to sell.
For example, if a customer buys a big HDTV, offer them cables (80% margin) instead of DVDs (3% margin).

All but one of the slots on that page are filled through deliberate merchandising. People decide what to display there, the same way they decide what to put in the endcaps or next to the register in a physical store. The final slot, though, gets populated automatically according to the products in the customer’s cart. Based on the original requirements for the site, the code to populate that slot looked for products in the catalog with similar attributes, then sorted through them to find the "best" product.  (Based on some balance of closely-matched attributes and high margin, I suspect.)

The problem was that there were too many products that would match.  The attributes clustered too much for the algorithm, so the code for this slot would pull back thousands of products from the catalog.  It would turn each row in the result set into an object, then weed through them in memory.

Without that slot, the page would render in under a second.  With it, two minutes, or worse.

It had been present for more than two years. You might ask, "How could that go unnoticed for two years?" Well, it didn’t, of course. But, because it had always been that way, most everyone was just used to it. When the wait times would get too bad, this one guy would just restart app servers until it got better.

Removing that slot from the page not only improved their stability, it vastly increased their capacity. Imagine how much more they could have added to the bottom line if they hadn’t overspent for the last two years to compensate. 

At my second stop, the site suffered from serious stability problems. At any given time, it was even odds that at least one app server would be vapor locked. Three to five times a day, that would ripple through and take down all the app servers. One key symptom was a sudden spike in database connections.

Some nice work by the DBAs revealed a query from the app servers that was taking way too long. No query from a web app should ever take more than half a second, but this one would run for 90 seconds or more. Usually that means the query logic is bad.  In this case, though, the logic was OK, but the query returned 1.2 million rows. The app server would doggedly convert those rows into objects in a Vector, right up until it started thrashing the garbage collector. Eventually, it would run out of memory, but in the meantime, it held a lot of row locks.  All the other app servers would block on those row locks.  The team applied a band-aid to the query logic, and those crashes stopped.

What’s the common factor here? It’s what I call an "Unbounded Result Set".  Neither of these applications limited the amount of data they requested, even though there certainly were limits to how much they could process.  In essence, both of these applications trusted their databases.  The apps weren’t prepared for the data to be funky, weird, or oversized. They assumed too much.

You should make your apps paranoid about their data. If your app processes one record at a time, then looping through an entire result set might be OK—as long as you’re not making a user wait while you do. But if your app turns rows into objects, then it had better be very selective about its SELECTs. The relationships might not be what you expect. The data producer might have changed in a surprising way, particularly if it’s not under your control. Purging routines might not be in place, or might have gotten broken. Definitely don’t trust some other application or batch job to load your data in a safe way.
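One cheap defense is to bound every query that feeds an object mapper. Here is a minimal sketch in plain JDBC; the table, columns, and the cap of 50 rows are invented for illustration, and the LIMIT clause assumes a database that supports one, but the point is simply to cap the result set both in the SQL and at the driver level:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class RelatedProducts {
    private static final int MAX_CANDIDATES = 50;   // hard cap on rows we will ever materialize

    // Fetch at most MAX_CANDIDATES related products instead of the whole catalog.
    public List<String> findCandidateSkus(Connection conn, String category) throws SQLException {
        String sql = "SELECT sku FROM product WHERE category = ? ORDER BY margin DESC LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, category);
            ps.setInt(2, MAX_CANDIDATES);
            ps.setMaxRows(MAX_CANDIDATES);           // belt and suspenders: cap at the driver, too
            try (ResultSet rs = ps.executeQuery()) {
                List<String> skus = new ArrayList<>();
                while (rs.next()) {
                    skus.add(rs.getString("sku"));
                }
                return skus;
            }
        }
    }
}
```

However the query logic evolves, the app can never pull back more rows than it is prepared to turn into objects.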

No matter what odd condition your app stumbles across in the database, it should not be vulnerable.

Read-write Splitting With Oracle

| Comments

Speaking of databases and read/write splitting, Oracle had a session at OpenWorld about it.

Building a read pool of database replicas isn’t something I usually think of doing with Oracle, mainly due to their non-zero license fees.  It changes the scaling equation.

Still, if you are on Oracle and the fees work for you, consider Active Data Guard.   Some key facts from the slides:

  • Average latency for replication was 1 second.
  • The maximum latency spike they observed was 10 seconds.
  • A node can take itself offline if it detects excessive latency.
  • You can use DBLinks to allow applications to think they’re writing to a read node.  The node will transparently pass the writes through to the master.
  • This can be done without any tricky JDBC proxies or load-balancing drivers, just the normal Oracle JDBC driver with the bugs we all know and love.
  • Active Data Guard requires Oracle 11g.

Budgetecture and Its Ugly Cousins

| Comments

It’s the time of year for family gatherings, so here’s a repulsive group portrait of some nearly universal pathologies. Try not to read this while you’re eating.

Budgetecture

We’ve all been hit with budgetecture.  That’s when sound technology choices go out the window in favor of cost-cutting. The conversation goes something like this.

"Do we really need X?" asks the project sponsor. (A.k.a. the gold owner.)

For "X", you can substitute nearly anything that’s vitally necessary to make the system run: software licenses, redundant servers, offsite backups, or power supplies.  It’s always asked with a sort of paternalistic tone, as though the grown-up has caught us blowing all our pocket money on comic books and bubble gum, whilst the serious adults are trying to get on with buying more buckets to carry their profits around in.

The correct way to answer this is "Yes.  We do."  That’s almost never the response.

After all, we’re trained as engineers, and engineering is all about making trade-offs. We know good and well that you don’t really need extravagances like power supplies, so long as there’s a sufficient supply of hamster wheels and cheap interns in the data center.  So instead of simply saying, "Yes. We do," we go on with something like, "Well, you could do without a second server, provided you’re willing to accept downtime for routine maintenance and whenever a RAM chip gets hit by a cosmic ray and flips a bit, causing a crash, but if we get error-checking parity memory then we get around that, so we just have to worry about the operating system crashing, which it does about every three-point-nine days, so we’ll have to institute a regime of nightly restarts that the interns can do whenever they’re taking a break from the power-generating hamster wheels."

All of which might be completely true, but is utterly the wrong thing to say. The sponsor has surely stopped listening after the word, "Well…"

The problem is that you see your part as an engineering role, while your sponsor clearly understands he’s engaged in a negotiation. And in a negotiation, the last thing you want to do is make concessions on the first demand. In fact, the right response to the "do we really need" question is something like this:

"Without a second server, the whole system will come crashing down at least three times daily, particularly when it’s under heaviest load or when you are doing a demo for the Board of Directors. In fact, we really need four servers so we can take an HA pair down independently at any time while still maintaining 100% of our capacity, even in case one of the remaining pair crashes unexpectedly."

Of course, we both know you don’t really need the third and fourth servers. This is just a gambit to get the sponsor to change the subject to something else. You’re upping the ante and showing that you’re already running at the bare, dangerous, nearly-irresponsible minimum tolerable configuration. And besides, if you do actually get the extra servers, you can certainly use one to make your QA environment match production, and the other will make a great build box.

Schedule Quid Pro Quo

Another situation in which we harm ourselves by bringing engineering trade-offs to a negotiation comes when the schedule slips. Statistically speaking, we’re more likely to pick up the bass line from "La Bamba" from a pair of counter-rotating neutron stars than we are to complete a project on time. Sooner or later, you’ll realize that the only way to deliver your project on time and under budget is to reduce it to roughly the scope of "Hello, world!"

When that happens, being a responsible developer, you’ll tell your sponsor that the schedule needs to slip. You may not realize it, but by uttering those words, you’ve given the international sign of negotiating weakness.

Your sponsor, who has his or her own reputation—not to mention budget—tied to the delivery of this project, will reflexively respond with, "We can move the date, but if I give you that, then you have to give me these extra features."

The project is already going to be late. Adding features will surely make it more late, particularly since you’ve already established that the team isn’t moving as fast as expected. So why would someone invested in the success of the project want to further damage it by increasing the scope? It’s about as productive as soaking a grocery store bag (the paper kind) in water, then dropping a coconut into it.

I suspect that it’s sort of like dragging a piece of yarn in front of a kitten. It can’t help but pounce on it. It’s just what kittens do.

 My only advice in this situation is to counter with data. Produce the burndown chart showing when you will actually be ready to release with the current scope. Then show how the fractally iterative cycle of slippage followed by scope creep produces a delivery date that will be moot, as the sun will have exploded before you reach beta.

The Fallacy of Capital

When something costs a lot, we want to use it all the time, regardless of how well suited it is or is not.

This is sort of the inverse of budgetecture.  For example, relational databases used to cost roughly the same as a battleship. So, managers got it in their heads that everything needed to be in the relational database.  Singular. As in, one.

Well, if one database server is the source of all truth, you’d better be pretty careful with it. And the best way to be careful with it is to make sure that nobody, but nobody, ever touches it. Then you collect a group of people with malleable young minds and a bent toward obsessive-compulsive abbreviation forming, and you make them the Curators of Truth.

But, because the damn thing cost so much, you need to get your money’s worth out of it. So, you mandate that every application must store its data in The Database, despite the fact that nobody knows where it is, what it looks like, or even if it really exists.  Like Schrodinger’s cat, it might already be gone, it’s just that nobody has observed it yet. Still, even that genetic algorithm with simulated annealing, running ten million Monte Carlo fitness tests is required to keep its data in The Database.

(In the above argument, feel free to substitute IBM Mainframe, WebSphere, AquaLogic, ESB, or whatever your capital fallacy du jour may be.)

Of course, if databases didn’t cost so much, nobody would care how many of them there are. Which is why MySQL, Postgres, SQLite, and the others are really so useful. It’s not an issue to create twenty or thirty instances of a free database. There’s no need to collect them up into a grand "enterprise data architecture". In fact, exactly the opposite is true. You can finally let independent business units evolve independently. Independent services can own their own data stores, and never let other applications stick their fingers into their guts.

 

So there you have it, a small sample of the rogue’s gallery. These bad relations don’t get much photo op time with the CEO, but if you look, you’ll find them lurking in some cubicle just around the corner.

 

Releasing a Free SingleLineFormatter

| Comments

A number of readers have asked me for reference implementations of the stability and capacity patterns.

I’ve begun to create some free implementations to go along with Release It. As of today, it just includes a drop-in formatter that you can use in place of the java.util.logging default (which is horrible).

This formatter keeps all the fields lined up in columns, including truncating the logger name and method name if necessary. A columnar format is much easier for the human eye to scan. We all have great pattern-matching machinery in our heads. I can’t for the life of me understand why so many vendors work so hard to defeat it. The one thing that doesn’t get stuffed into a column is a stack trace. It’s good for a stack trace to interrupt the flow of the log file… that’s something that you really want to pop out when scanning the file.
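For a sense of what that looks like in code, here is a rough sketch of the same idea using the standard java.util.logging API. This is not the released SingleLineFormatter, just an illustration, and the column widths are arbitrary:

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

// Illustrative sketch only: one log event per line, fields padded and
// truncated into fixed columns so the eye can scan down the file.
public class ColumnarFormatter extends Formatter {
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS").withZone(ZoneId.systemDefault());

    @Override
    public String format(LogRecord record) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("%s %-7s %-30.30s %-20.20s %s%n",
                TIMESTAMP.format(Instant.ofEpochMilli(record.getMillis())),
                record.getLevel(),
                safe(record.getLoggerName()),        // padded/truncated to 30 characters
                safe(record.getSourceMethodName()),  // padded/truncated to 20 characters
                formatMessage(record)));
        // Stack traces deliberately break the columns so they jump out when scanning.
        if (record.getThrown() != null) {
            StringWriter sw = new StringWriter();
            record.getThrown().printStackTrace(new PrintWriter(sw));
            sb.append(sw);
        }
        return sb.toString();
    }

    private static String safe(String s) {
        return s == null ? "" : s;
    }
}
```

Wiring a formatter in is one line in logging.properties, for example `java.util.logging.ConsoleHandler.formatter = com.example.ColumnarFormatter` (substitute your own package and class name).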

It only takes a minute to plug in the SingleLineFormatter. Your admins will thank you for it.

Read about the library.

Download it as .zip or .tgz.

A Dozen Levels of Done

| Comments

What does "done" mean to you?  I find that my definition of "done" continues to expand. When I was still pretty green, I would say "It’s done" when I had finished coding.  (Later, a wiser and more cynical colleague taught me that "done" meant that you had not only finished the work, but made sure to tell your manager you had finished the work.)

The next meaning of "done" that I learned had to do with version control. It’s not done until it’s checked in.

Several years ago, I got test infected and my definition of "done" expanded to include unit testing.

Now that I’ve lived in operations for a few years and gotten to know and love Lean Software Development, I have a new definition of "done".

Here goes:

A feature is not "done" until all of the following can be said about it:

  1. All unit tests are green.
  2. The code is as simple as it can be.
  3. It communicates clearly.
  4. It compiles in the automated build from a clean checkout.
  5. It has passed unit, functional, integration, stress, longevity, load, and resilience testing.
  6. The customer has accepted the feature.
  7. It is included in a release that has been branched in version control.
  8. The feature’s impact on capacity is well-understood.
  9. Deployment instructions for the release are defined and do not include a "point of no return".
  10. Rollback instructions for the release are defined and tested.
  11. It has been deployed and verified.
  12. It is generating revenue.

Until all of these are true, the feature is just unfinished inventory.

Postmodern Programming

| Comments

It’s taken me a while to get to this talk. Not because it was uninteresting, just because it sent my mind in so many directions that I needed time to collect my scattered thoughts.

Objects and Lego Blocks 

On Thursday, James Noble delivered a keynote about "The Lego Hypothesis". As you might guess, he was talking about the dream of building software as easily as a child assembles a house from Lego bricks. He described it as an old dream, using quotes from the very first conference on Software Engineering… the one where they utterly invented the term "Software Engineering" itself.  In 1968.

The Lego Hypothesis goes something like this: "In the future, software engineering will be set free from the mundane necessity of programming." To realize this dream, we should look at the characteristics of Lego bricks and see if software at all mirrors those characteristics.

Noble ascribed the following characteristics to components:

  • Small
  • Indivisible
  • Substitutable
  • More similar than different
  • Abstract encapsulations
  • Coupled to a few, close neighbors
  • No action at a distance

(These actually predate that 1968 software engineering conference by quite a bit. They were first described by the Greek philosopher Democritus in his theory of atomos.)

The first several characteristics sound a lot like the way we understand objects. The last two are problematic, though.

Examining many different programs and languages, Noble’s research group has found that objects are typically not connected to just a few nearby objects. The majority of objects are coupled to just one or two others. But the extremal cases are very, very extreme. In a Self program, one object had over 10,000,000 inbound references. That is, it was coupled to more than 10,000,000 other objects in the system. (It’s probably ‘nil’, ‘true’, ‘false’, or perhaps the integer object ‘zero’.)

In fact, object graphs tend to form scale-free networks that can be described by power laws.

Lots of other systems in our world form scale-free networks with power law distributions:

  • City sizes
  • Earthquake magnitudes
  • Branches in a roadway network
  • The Internet
  • Blood vessels
  • Galaxy sizes
  • Impact crater diameters
  • Income distributions
  • Book sales

One of the first things to note about power law distributions is that they are not normal. That is, words like "average" and "median" are very misleading. If the average inbound coupling is 1.2, but the maximum is 10,000,000, how much does the average tell you about the large scale behavior of the system?

(An aside: this is the fundamental problem that makes random events so problematic in Nassim Taleb’s book The Black Swan. Benoit Mandelbrot also considers this in The (Mis)Behavior of Markets. Yes, that Mandelbrot.)
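To make the point concrete, here is a toy simulation (the exponent and sample size are arbitrary choices): draw a million samples from a heavy-tailed Pareto distribution and compare the mean with the maximum.

```java
import java.util.Random;

// Toy illustration: sample from a Pareto (power-law) distribution and compare
// the mean with the maximum. The mean comes out in the single digits, while
// the maximum lands in the thousands or worse.
public class PowerLawDemo {
    public static void main(String[] args) {
        Random rng = new Random(42);
        double alpha = 1.5;     // tail exponent (arbitrary, for illustration)
        int n = 1_000_000;

        double sum = 0.0;
        double max = 0.0;
        for (int i = 0; i < n; i++) {
            // Inverse-transform sampling: x = u^(-1/alpha), u uniform in (0, 1]
            double u = 1.0 - rng.nextDouble();
            double x = Math.pow(u, -1.0 / alpha);
            sum += x;
            max = Math.max(max, x);
        }
        System.out.printf("mean = %.2f, max = %.0f%n", sum / n, max);
    }
}
```

When the tail is that heavy, the mean tells you almost nothing about the behavior of the largest members of the population.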

Noble made a pretty good case that the Lego Hypothesis is dead as disco. Then came a leap of logic that I must have missed.

Postmodernism

"The ultimate goal of computer science is the program."

You are assigned to write a program to calculate the first 100 prime numbers. If you are a student, you have to write this as if it exists in a vacuum. That is, you code as if this is the first program in the universe. It isn’t. Once you leave the unique environs of school, you’re not likely to sit down with a pad of lined paper and a mechanical pencil to derive your own prime-number-finding algorithm. Instead, your first stop is probably Google.

Searching for "prime number sieve" currently gives me about 644,000 results in three-tenths of a second. The results include implementations in JavaScript, Java, C, C++, FORTRAN, PHP, and many others. In fact, if I really need prime numbers rather than a program to find numbers, I can just parasitize somebody else’s computing power with online prime number generators.

Noble quotes Steven Conner from the Cambridge Companion to Postmodernism:

"…that condition in which, for the first time, and as a result of technologies which allow the large-scale storage, access, and re-production of records of the past, the past appears to be included in the present."

In art and literature, postmodernism incorporates elements of past works, directly and by reference. In programming, it means that every program ever written is still alive. They are "alive" in the sense that even dead hardware can be emulated. Papers from the dawn of computing are available online. There are execution environments for COBOL that run in Java Virtual Machines, possibly on virtual operating systems. Today’s systems can completely contain every previous language, program, and execution environment.

I’m now writing well beyond my actual understanding of postmodern critical theory and trying to report what Noble was talking about in his keynote.

The same technological changes that caused the rise of postmodernism in art, film, and literature are now in full force in programming. In a very real sense, we did it to ourselves! We technologists and programmers created the technology—globe-spanning networks, high compression codecs, indexing and retrieval, collaborative filtering, virtualization, emulation—that are now reshaping our profession.

In the age of postmodern programming, there are no longer "correct algorithms". Instead, there are contextual decisions, negotiations, and contingencies. Instead of The Solution, we have individual solutions that solve problems in a context. This should sound familiar to anyone in the patterns movement.

Indeed, he directly references patterns and eXtreme Programming as postmodern programming phenomena, along with "scrap-heap" programming, mashups, glue programming, and scripting languages.

I searched for a great way to wrap this piece up, but ultimately it seemed more appropriate to talk about the contextual impact it had on me. I’ve never been fond of postmodernism; it always seemed simultaneously precious and pretentious. Now, I’ll be giving that movement more attention. Second, I’ve always thought of mashups as sort of tawdry and sordid—not real programming, you know? I’ll be reconsidering that position as well. 

Conference: “Velocity”

| Comments

O’Reilly has announced an upcoming conference called Velocity.

From the announcement:

Web companies, big and small, face many of the same challenges: sites must be faster, infrastructure needs to scale, and everything must be available to customers at all times, no matter what. Velocity is the place to obtain the crucial skills and knowledge to build successful web sites that are fast, scalable, resilient, and highly available.

Unfortunately, there are few opportunities to learn from peers, exchange ideas with experts, and share best practices and lessons learned.

Velocity is changing that by providing the best information on building and operating web sites that are fast, reliable, and always up. We’re bringing together people from around the world who are doing the best performance work, to improve the experience of web users worldwide. Pages will be faster. Sites will have higher up-time. Companies will achieve more with less. The next cool startup will be able to more quickly scale to serve a larger audience, globally. Velocity is the key for crossing over from cool Web 2.0 features to sustainable web sites.

That statement could have been the preface to my book, so I’ll be submitting several proposals for talks.

Putting My Mind Online

| Comments

Along with the longer analysis pieces, I’ve decided to post the entirety of my notes from QCon San Francisco. A few of my friends and colleagues are fellow mind-mappers, so this is for them.

Nygard’s Mind Map from QCon

This file works with FreeMind, a fast, fluid, and free mind mapping tool.

Two Ways to Boost Your Flagging Web Site

| Comments

Being fast doesn’t make you scalable. But it does mean you can get more capacity out of your current infrastructure. Take a look at this diagram of request handlers.

13 Threads Needed When Requests Take 700ms

You can see that it takes 13 request handling threads to process this amount of load. In the next diagram, the requests arrive at the same rate, but in this picture it takes just 200 milliseconds to answer each one.

3 Threads Needed When Requests Take 200ms

Same load, but only 3 request handlers are needed at a time. So, shortening the processing time means you can handle more transactions during the same unit of time.
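This is Little’s Law at work: the average number of requests in flight equals the arrival rate times the average response time. A back-of-the-envelope check (the arrival rate of 18.5 requests per second is an assumption, picked to roughly match the diagrams):

```java
// Little's Law: concurrent requests = arrival rate * average response time.
public class LittlesLaw {
    public static void main(String[] args) {
        double arrivalRate = 18.5;   // requests per second (assumed, to roughly match the diagrams)

        double slowThreads = arrivalRate * 0.700;  // 700 ms per request
        double fastThreads = arrivalRate * 0.200;  // 200 ms per request

        System.out.printf("Handlers needed at 700 ms: %.1f%n", slowThreads); // ~13
        System.out.printf("Handlers needed at 200 ms: %.1f%n", fastThreads); // ~3.7
    }
}
```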

Suppose your site is built on the classic "six-pack" architecture shown below. As your traffic grows and the site slows, you’re probably looking at adding more oomph to the database servers. Scaling that database cluster up gets expensive very quickly. Worse, you have to bulk up both guns at once, because each one still has to be able to handle the entire load. So you’re paying for big boxes that are guaranteed to be 50% idle.

Classic Six Pack

Let’s look at two techniques almost any site can use to speed up requests, without having the Hulk Hogan and Andre the Giant of databases lounging around in your data center.

Cache Farms

Cache farming doesn’t mean armies of Chinese gamers stomping rats and making vests. It doesn’t involve registering a ton of domain names, either.

Pretty much every web app is already caching a bunch of things at a bunch of layers. Odds are, your application is already caching database results, maybe as objects or maybe just query results. At the top level, you might be caching page fragments. HTTP session objects are nothing but caches. The net result of all this caching is a lot of redundancy. Every app server instance has a bunch of memory devoted to caching. If you’re running multiple instances on the same hosts, you could be caching the same object once per instance.

Caching is supposed to speed things up, right? Well, what happens when those app server instances get short on memory? Those caches can tie up a lot of heap space. If they do, then instead of speeding things up, the caches will actually slow responses down as the garbage collector works harder and harder to free up space.

So what do we have? If there are four app instances per host, then a frequently accessed object—like a product featured on the home page—will be duplicated eight times. Can we do better? Well, since I’m writing this article, you might suspect the answer is "yes". You’d be right.

The caches I’ve described so far are in-memory, internal caches. That is, they exist completely in RAM and each process uses its own RAM for caching. There exist products, commercial and open-source, that let you externalize that cache. By moving the cache out of the app server process, you can access the same cache from multiple instances, reducing duplication. Getting those objects out of the heap also means you can make the app server heap smaller, which will reduce garbage collection pauses. If you make the cache distributed, as well as external, then you can reduce duplication even further.

External caching can also be tweaked and tuned to help deal with "hot" objects. If you look at the distribution of accesses by ID, odds are you’ll observe a power law. That means the popular items will be requested hundreds or thousands of times as often as the average item. In a large infrastructure, making sure that the hot items are on cache servers topologically near the application servers can make a huge difference in time lost to latency and in load on the network.

External caches are subject to the same kind of invalidation strategies as internal caches. On the other hand, when you invalidate an item from each app server’s internal cache, they’re probably all going to hit the database at about the same time. With an external cache, only the first app server hits the database. The rest will find that it’s already been re-added to the cache.
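On the application side this is plain read-through caching. Here is a rough sketch against a hypothetical cache client interface rather than any particular product’s API:

```java
import java.util.Optional;

// Hypothetical interface for an external cache (memcached or similar);
// not the API of any specific client library.
interface ExternalCache {
    Optional<Product> get(String key);
    void put(String key, Product value, int ttlSeconds);
}

interface ProductDao {
    Product loadById(String productId);   // the expensive database hit
}

class Product { /* fields elided */ }

public class ReadThroughProductCache {
    private final ExternalCache cache;
    private final ProductDao dao;

    public ReadThroughProductCache(ExternalCache cache, ProductDao dao) {
        this.cache = cache;
        this.dao = dao;
    }

    public Product findProduct(String productId) {
        String key = "product:" + productId;
        // Most requests are satisfied here, out of the shared cache farm.
        Optional<Product> cached = cache.get(key);
        if (cached.isPresent()) {
            return cached.get();
        }
        // Cache miss: only this request pays the database cost, then repopulates
        // the shared cache so the other app servers never have to.
        Product product = dao.loadById(productId);
        cache.put(key, product, 300 /* seconds */);
        return product;
    }
}
```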

External cache servers can run on the same hosts as the app servers, but they are often clustered together on hosts of their own. Hence, the cache farm.

Six Pack With Cache Farm

If the external cache doesn’t have the item, the app server hits the database as usual. So I’ll turn my attention to the database tier.

Read Pools

The toughest thing for any database to deal with is a mixture of read and write operations. The write operations have to create locks and, if transactional, locks across multiple tables or blocks. If the same tables are being read, those reads will have highly variable performance, depending on whether a read operation randomly encounters one of the locked rows (or pages, blocks, or tables, depending).

But the truth is that your application almost certainly does more reads than writes, probably to an overwhelming degree. (Yes, there are some domains where writes exceed reads, but I’m going to momentarily disregard mindless data collection.) For a travel site, the ratio will be about 10:1. For a commerce site, it will be from 50:1 to 200:1. There are a lot of variables here, especially when you start doing more effective caching, but even then, the ratios are highly skewed.

When your database starts to get that middle-age paunch and it just isn’t as zippy as it used to be, think about offloading those reads. At a minimum, you’ll be able to scale out instead of up. Scaling out with smaller, consistent, commodity hardware pleases everyone more than forklift upgrades. In fact, you’ll probably get more performance out of your writes once all that pesky read I/O is off the write master.

How do you create a read pool? Good news! It uses nothing more than built-in replication features of the database itself. Basically, you just configure the write master to ship its archive logs (or whatever your DB calls them) to the read pool databases. They spin up the logs to bring their state into synch with the write master.
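On the application side, the simplest approach is to route connections explicitly: writes go to the master, reads round-robin across the replicas. A sketch, assuming the DataSources themselves (connection pools, credentials, hosts) are configured elsewhere:

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of application-side read/write splitting: one write master,
// a pool of read replicas chosen round-robin.
public class ReadWriteRouter {
    private final DataSource writeMaster;
    private final List<DataSource> readPool;
    private final AtomicInteger next = new AtomicInteger();

    public ReadWriteRouter(DataSource writeMaster, List<DataSource> readPool) {
        this.writeMaster = writeMaster;
        this.readPool = readPool;
    }

    public Connection connectionForWrite() throws SQLException {
        return writeMaster.getConnection();
    }

    public Connection connectionForRead() throws SQLException {
        // Replicas lag the master slightly, so route reads here only
        // when slightly stale data is acceptable.
        int i = Math.floorMod(next.getAndIncrement(), readPool.size());
        return readPool.get(i).getConnection();
    }
}
```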

Six Pack With Cache Farm and Read Pool

By the way, for read pooling, you really want to avoid database clustering approaches. The synchronization overhead cancels out the benefit of read pooling in the first place.

At this point, you might be objecting, "Wait a cotton-picking minute! That means the read machines are garun-damn-teed to be out of date!" (That’s the Foghorn Leghorn version of the objection. I’ll let you extrapolate the Tony Soprano and Geico Gecko versions yourself.) You would be correct. The read machines will always reflect an earlier point in time.

Does that matter?

To a certain extent, I can’t answer that. It might matter, depending on your domain and application. But in general, I think it matters less often than it seems. I’ll give you an example from the retail domain that I know and love so well. Take a look at this product detail page from BestBuy.com. How often do you think each data field on that page changes? Suppose there is a pricing error that needs to be corrected immediately (for some definition of immediately.) What’s the total latency before that pricing error will be corrected? Let’s look at the end-to-end process.

  1. A human detects the pricing error.
  2. The observer notifies the responsible merchant.
  3. The merchant verifies that the price is in error and determines the correct price.
  4. Because this is an emergency, the merchant logs in to the "fast path" system that bypasses the nightly batch cycle.
  5. The merchant locates the item and enters the correct price.
  6. She hits the "publish" button.
  7. The fast path system connects to the write master in production and updates the price.
  8. The read pool receives the logs with the update and applies them.
  9. The read pool process sends a message to invalidate the item in the app servers’ caches.
  10. The next time users request that product detail page, they see the correct price.

That’s the best-case scenario! In the real world, the merchant will be in a meeting when the pricing error is found. It may take a phone call or lookup from another database to find out the correct price. There might be a quick conference call to make the decision whether to update the price or just yank the item off the site. All in all, it might take an hour or two before the pricing error gets corrected. Whatever the exact sequence of events, odds are that the replication latency from the write master to the read pool is the very least of the delays.

Most of the data is much less volatile or critical than the price. Is an extra five minutes of latency really a big deal? When it can save you a couple of hundred thousand dollars on giant database hardware?

Summing It Up

The reflexive answer to scaling is, "Scale out at the web and app tiers, scale up in the data tier." I hope this shows that there are other avenues to improving performance and capacity.

References

For more on read pooling, see Cal Henderson’s excellent book, Building Scalable Web Sites: Building, scaling, and optimizing the next generation of web applications.

The most popular open-source external caching framework I’ve seen is memcached. It’s a flexible, multi-lingual caching daemon.

On the commercial side, GigaSpaces provides distributed, external, clustered caching. It adapts to the "hot item" problem dynamically to keep a good distribution of traffic, and it can be configured to move cached items closer to the servers that use them, reducing network hops to the cache.

Two Quick Observations

| Comments

Several of the speakers here have echoed two themes about databases.

1. MySQL is in production in a lot of places. I think the high cost of commercial databases (read: Oracle) leads to a kind of budgetecture that concentrates all data in a single massive database. If you remove that cost from the equation, the idea of either functionally partitioning your data stores or creating multiple shards becomes much more palatable.

2. By far the most common database cluster structure has one write master with many read masters. Ian Flint spoke to us about the architectures behind Yahoo Groups and Yahoo Bix. Bix has 30 MySQL read servers and just one write master. Dan Pritchett from eBay had a similar ratio. (His might have been 10:1 rather than 30:1.) In a commerce site, where 98% of the traffic is browsing and only 2% is buying, a read-pooled cluster makes a lot of sense.