Wide Awake Developers

A Dozen Levels of Done

| Comments

What does "done" mean to you?  I find that my definition of "done" continues to expand. When I was still pretty green, I would say "It’s done" when I had finished coding.  (Later, a wiser and more cynical colleague taught me that "done" meant that you had not only finished the work, but made sure to tell your manager you had finished the work.)

The next meaning of "done" that I learned had to do with version control. It’s not done until it’s checked in.

Several years ago, I got test infected and my definition of "done" expanded to include unit testing.

Now that I’ve lived in operations for a few years and gotten to know and love Lean Software Development, I have a new definition of "done".

Here goes:

A feature is not "done" until all of the following can be said about it:

  1. All unit tests are green.
  2. The code is as simple as it can be.
  3. It communicates clearly.
  4. It compiles in the automated build from a clean checkout.
  5. It has passed unit, functional, integration, stress, longevity, load, and resilience testing.
  6. The customer has accepted the feature.
  7. It is included in a release that has been branched in version control.
  8. The feature’s impact on capacity is well-understood.
  9. Deployment instructions for the release are defined and do not include a "point of no return".
  10. Rollback instructions for the release are defined and tested.
  11. It has been deployed and verified.
  12. It is generating revenue.

Until all of these are true, the feature is just unfinished inventory.

Postmodern Programming

| Comments

It’s taken me a while to get to this talk. Not because it was uninteresting, just because it sent my mind in so many directions that I needed time to collect my scattered thoughts.

Objects and Lego Blocks 

On Thursday, James Noble delivered a Keynote about "The Lego Hypothesis". As you might guess, he was talking about the dream of building software as easily as a child assembles a house from Lego bricks. He described it as an old dream, using quotes from the very first conference on Software Engineering… the one where they utterly invented the term "Software Engineering" itself.  In 1968.

The Lego Hypothesis goes something like this: "In the future, software engineering will be set free from the mundane necessity of programming." To realize this dream, we should look at the characteristics of Lego bricks and see if software at all mirrors those characteristics.

Noble ascribed the following characteristics to components:

  • Small
  • Indivisible
  • Substitutable
  • More similar than different
  • Abstract encapsulations
  • Coupled to a few, close neighbors
  • No action at a distance

(These actually predate that 1968 software engineering conference by quite a bit. They were first described by the Greek philosopher Democritus in his theory of atomos.)

The first several characteristics sound a lot like the way we understand objects. The last two are problematic, though.

Examining many different programs and languages, Noble’s research group has found that objects are typically not connected to just a few nearby objects. The majority of objects are coupled to just one or two others. But the extremal cases are very, very extreme. In a Self program, one object had over 10,000,000 inbound references. That is, it was coupled to more than 10,000,000 other objects in the system. (It’s probably ‘nil’, ‘true’, ‘false’, or perhaps the integer object ‘zero’.)

In fact, object graphs tend to form scale-free networks that can be described by power laws.

Lots of other systems in our world form scale-free networks with power law distributions:

  • City sizes
  • Earthquake magnitudes
  • Branches in a roadway network
  • The Internet
  • Blood vessels
  • Galaxy sizes
  • Impact crater diameters
  • Income distributions
  • Books sales

One of the first things to note about power law distributions is that they are not normal. That is, words like "average" and "median" are very misleading. If the average inbound coupling is 1.2, but the maximum is 10,000,000, how much does the average tell you about the large scale behavior of the system?

(An aside: this is the fundamental problem that makes random events so problematic in Nassim Taleb’s book The Black Swan. Benoit Mandelbrot also considers this in The (Mis)Behavior of Markets. Yes, that Mandelbrot.)

Noble made a pretty good case that the Lego Hypothesis is dead as disco. Then came a leap of logic that I must have missed.


"The ultimate goal of computer science is the program."

You are assigned to write a program to calculate the first 100 prime numbers. If you are a student, you have to write this as if it exists in a vacuum. That is, you code as if this is the first program in the universe. It isn’t. Once you leave the unique environs of school, you’re not likely to sit down with pad of lined paper and a mechanical pencil to derive your own prime-number-finding algorithm. Instead, your first stop is probably Google.

Searching for "prime number sieve" currently gives me about 644,000 results in three-tenths of a second. The results include implementations in JavaScript, Java, C, C++, FORTRAN, PHP, and many others. In fact, if I really need prime numbers rather than a program to find numbers, I can just parasitize somebody else’s computing power with online prime number generators.

Noble quotes Steven Conner from the Cambridge Companion to Postmodernism:

"…that condition in which, for the first time, and as a result of technologies which allow the large-scale storage, access, and re-production of records of the past, the past appears to be included in the present."

In art and literature, postmodernism incorporates elements of past works, directly and by reference. In programming, it means that every program ever written is still alive. They are "alive" in the sense that even dead hardware can be emulated. Papers from the dawn of computing are available online. There are execution environments for COBOL that run in Java Virtual Machines, possibly on virtual operating systems. Today’s systems can completely contain every previous language, program, and execution environment.

I’m now writing well beyond my actual understanding of postmodern critical theory and trying to report what Noble was talking about in his keynote.

The same technological changes that caused the rise of postmodernism in art, film, and literature are now in full force in programming. In a very real sense, we did it to ourselves! We technologists and programmers created the technology—globe-spanning networks, high compression codecs, indexing and retrieval, collaborative filtering, virtualization, emulation—that are now reshaping our profession.

In the age of postmodern programming, there are no longer "correct algorithms". Instead, there are contextual decisions, negotiations, and contingencies. Instead of The Solution, we have individual solutions that solve problems in a context. This should sound familiar to anyone in the patterns movement.

Indeed, he directly references patterns and eXtreme Programming as postmodern programming phenomena, along with "scrap-heap" programming, mashups, glue programming, and scripting languages.

I searched for a great way to wrap this piece up, but ultimately it seemed more appropriate to talk about the contextual impact it had on me. I’ve never been fond of postmodernism; it always seemed simultaneously precious and pretentious. Now, I’ll be giving that movement more attention. Second, I’ve always thought of mashups as sort of tawdry and sordid—not real programming, you know? I’ll be reconsidering that position as well. 

Conference: “Velocity”

| Comments

O’Reilly has announced an upcoming conference called Velocity.

From the announcement:

Web companies, big and small, face many of the same challenges: sites must be faster, infrastructure needs to scale, and everything must be available to customers at all times, no matter what. Velocity is the place to obtain the crucial skills and knowledge to build successful web sites that are fast, scalable, resilient, and highly available.

Unfortunately, there are few opportunities to learn from peers, exchange ideas with experts, and share best practices and lessons learned.

Velocity is changing that by providing the best information on building and operating web sites that are fast, reliable, and always up. We’re bringing together people from around the world who are doing the best performance work, to improve the experience of web users worldwide. Pages will be faster. Sites will have higher up-time. Companies will achieve more with less. The next cool startup will be able to more quickly scale to serve a larger audience, globally. Velocity is the key for crossing over from cool Web 2.0 features to sustainable web sites.

That statement could have been the preface to my book, so I’ll be submitting several proposals for talks.

Putting My Mind Online

| Comments

Along with the longer analysis pieces, I’ve decided to post the entirety of my notes from QCon San Francisco. A few of my friends and colleagues are fellow mind-mappers, so this is for them.

Nygard’s Mind Map from QCon

This file works with FreeMind, an fast, fluid, and free mind mapping tool. 

Two Ways to Boost Your Flagging Web Site

| Comments

Being fast doesn’t make you scalable. But it does mean you can handle more capacity with your current infrastructure. Take a look at this diagram of request handlers.

13 Threads Needed When Requests Take 700ms

You can see that it takes 13 request handling threads to process this amount of load. In the next diagram, the requests arrive at the same rate, but in this picture it takes just 200 milliseconds to answer each one.

3 Threads Needed When Requests Take 200ms

Same load, but only 3 request handlers are needed at a time. So, shortening the processing time means you can handle more transactions during the same unit of time.

Suppose you’re site is built on the classic "six-pack" architecture shown below. As your traffic grows and the site slows, you’re probably looking at adding more oomph to the database servers. Scaling that database cluster up gets expensive very quickly. Worse, you have to bulk up both guns at once, because each one still has to be able to handle the entire load. So you’re paying for big boxes that are guaranteed to be 50% idle.

Classic Six Pack

Let’s look at two techniques almost any site can use to speed up requests, without having the Hulk Hogan and Andre the Giant of databases lounging around in your data center.

Cache Farms

Cache farming doesn’t mean armies of Chinese gamers stomping rats and making vests. It doesn’t involve registering a ton of domain names, either.

Pretty much every web app is already caching a bunch of things at a bunch of layers. Odds are, your application is already caching database results, maybe as objects or maybe just query results. At the top level, you might be caching page fragments. HTTP session objects are nothing but caches. The net result of all this caching is a lot of redundancy. Every app server instance has a bunch of memory devoted to caching. If you’re running multiple instances on the same hosts, you could be caching the same object once per instance.

Caching is supposed to speed things up, right? Well, what happens when those app server instances get short on memory? Those caches can tie up a lot of heap space. If they do, then instead of speeding things up, the caches will actually slow responses down as the garbage collector works harder and harder to free up space.

So what do we have? If there are four app instances per host, then a frequently accessed object—like a product featured on the home page—will be duplicated eight times. Can we do better? Well, since I’m writing this article, you might suspect the answer is "yes". You’d be right.

The caches I’ve described so far are in-memory, internal caches. That is, they exist completely in RAM and each process uses its own RAM for caching. There exist products, commercial and open-source, that let you externalize that cache. By moving the cache out of the app server process, you can access the same cache from multiple instances, reducing duplication. Getting those objects out of the heap, You can make the app server heap smaller, which will also reduce garbage collection pauses. If you make the cache distributed, as well as external, then you can reduce duplication even further.

External caching can also be tweaked and tuned to help deal with "hot" objects. If you look at the distribution of accesses by ID, odds are you’ll observe a power law. That means the popular items will be requested hundreds or thousands of times as often as the average item. In a large infrastructure, making sure that the hot items are on cache servers topologically near the application servers can make a huge difference in time lost to latency and in load on the network.

External caches are subject to the same kind of invalidation strategies as internal caches. On the other hand, when you invalidate an item from each app server’s internal cache, they’re probably all going to hit the database at about the same time. With an external cache, only the first app server hits the database. The rest will find that it’s already been re-added to the cache.

External cache servers can run on the same hosts as the app servers, but they are often clustered together on hosts of their own. Hence, the cache farm.

Six Pack With Cache Farm

If the external cache doesn’t have the item, the app server hits the database as usual. So I’ll turn my attention to the database tier.

Read Pools

The toughest thing for any database to deal with is a mixture of read and write operations. The write operations have to create locks and, if transactional, locks across multiple tables or blocks. If the same tables are being read, those reads will have highly variable performance, depending on whether a read operation randomly encounters one of the locked rows (or pages, blocks, or tables, depending).

But the truth is that your application almost certainly does more reads than writes, probably to an overwhelming degree. (Yes, there are some domains where writes exceed reads, but I’m going to momentarily disregard mindless data collection.) For a travel site, the ratio will be about 10:1. For a commerce site, it will be from 50:1 to 200:1. There are a lot of variables here, especially when you start doing more effective caching, but even then, the ratios are highly skewed.

When your database starts to get that middle-age paunch and it just isn’t as zippy as it used to be, think about offloading those reads. At a minimum, you’ll be able to scale out instead of up. Scaling out with smaller, consistent, commodity hardware pleases everyone more than forklift upgrades. In fact, you’ll probably get more performance out of your writes once all that pesky read I/O is off the write master.

How do you create a read pool? Good news! It uses nothing more than built-in replication features of the database itself. Basically, you just configure the write master to ship its archive logs (or whatever your DB calls them) to the read pool databases. They spin up the logs to bring their state into synch with the write master.

Six Pack With Cache Farm and Read Pool

By the way, for read pooling, you really want to avoid database clustering approaches. The overhead needed for synchronization obviates the benefits of read pooling in the first place.

At this point, you might be objecting, "Wait a cotton-picking minute! That means the read machines are garun-damn-teed to be out of date!" (That’s the Foghorn Leghorn version of the objection. I’ll let you extrapolate the Tony Soprano and Geico Gecko versions yourself.) You would be correct. The read machines will always reflect an earlier point in time.

Does that matter?

To a certain extent, I can’t answer that. It might matter, depending on your domain and application. But in general, I think it matters less often than it seems. I’ll give you an example from the retail domain that I know and love so well. Take a look at this product detail page from BestBuy.com. How often do you think each data field on that page changes? Suppose there is a pricing error that needs to be corrected immediately (for some definition of immediately.) What’s the total latency before that pricing error will be corrected? Let’s look at the end-to-end process.

  1. A human detects the pricing error.
  2. The observer notifies the responsible merchant.
  3. The merchant verifies that the price is in error and determines the correct price.
  4. Because this is an emergency, the merchant logs in to the "fast path" system that bypasses the nightly batch cycle.
  5. The merchant locates the item and enters the correct price
  6. She hits the "publish" button.
  7. The fast path system connects to the write master in production and updates the price.
  8. The read pool receives the logs with the update and applies them.
  9. The read pool process sends a message to invalidate the item in the app servers’ caches.
  10. The next time users request that product detail page, they see the correct price.

That’s the best-case scenario! In the real world, the merchant will be in a meeting when the pricing error is found. It may take a phone call or lookup from another database to find out the correct price. There might be a quick conference call to make the decision whether to update the price or just yank the item off the site. All in all, it might take an hour or two before the pricing error gets corrected. Whatever the exact sequence of events, odds are that the replication latency from the write master to the read pool is the very least of the delays.

Most of the data is much less volatile or critical than the price. Is an extra five minutes of latency really a big deal? When it can save you a couple of hundred thousand dollars on giant database hardware?

Summing It Up

The reflexive answer to scaling is, "Scale out at the web and app tiers, scale up in the data tier." I hope this shows that there are other avenues to improving performance and capacity.


For more on read pooling, see Cal Henderson’s excellent book, Building Scalable Web Sites: Building, scaling, and optimizing the next generation of web applications.

The most popular open-source external caching framework I’ve seen is memcached. It’s a flexible, multi-lingual caching daemon.

On the commercial side, GigaSpaces provides distributed, external, clustered caching. It adapts to the "hot item" problem dynamically to keep a good distribution of traffic, and it can be configured to move cached items closer to the servers that use them, reducing network hops to the cache.

Two Quick Observations

| Comments

Several of the speakers here have echoed two themes about databases.

1. MySQL is in production in a lot of places. I think the high cost of commercial databases (read: Oracle) leads to a kind of budgetechture that concentrates all data in a single massive database. If you remove that cost from the equation, the idea of either functionally partitioning your data stores or creating multiple shards becomes much more palatable.

2. By far the most common database cluster structure has one write master with many read masters. Ian Flint spoke to us about the architectures behind Yahoo Groups and Yahoo Bix. Bix has 30 MySQL read servers and just one write master. Dan Pritchett from eBay had a similar ratio. (His might have been 10:1 rather than 30:1.) In a commerce site, where 98% of the traffic is browsing and only 2% is buying, a read-pooled cluster makes a lot of sense.


Three Vendors Worth Evaluating

| Comments

Several vendors are sponsoring QCon. (One can only wonder what the registration fees would be if they didn’t.) Of these, I think three have products worth immediate evaluation.


In the category of "really cool, but would I pay for it?" is Semmle. Their flagship product, SemmleCode, lets you treat your codebase as a database against which you can run queries. SemmleCode groks the static structure of your code, including relationships and dependencies. Along the way, it calculates pretty much every OO metric yet invented. It also looks at the source repository.

What can you do with it? Well, you can create a query that shows you all the cyclic dependencies in your code. The results can be rendered as a tree with explanations, a graph, or a chart. Or, you can chart your distribution of cyclomatic complexity scores over time. You can look for the classes or packages most likely to create a ripple effect.

Semmle ships with a sample project: the open-source drawing framework JHotDraw. In a stunning coincidence, I’m a contributor to JHotDraw. I wrote the glue code that uses Batik to export a drawing as SVG. So I can say with confidence, that when Semmle showed all kinds of cyclic dependencies in the exporters, it’s absolutely correct. Every one of the queries I saw run against JHotDraw confirmed my own experience with that codebase. Where Semmle indicated difficulty, I had difficulty. Where Semmle showed JHotDraw had good structure, it was easy to modify and extend.

There are an enormous number of things you could do with this, but one thing they currently lack is build-time automation. Semmle integrates with Eclipse, but not ANT or Maven. I’m told that’s coming in a future release.


Virtualization is a hot topic. VMWare has the market lead in this space, but I’m very impressed with 3Tera’s AppLogic.

AppLogic takes virtualization up a level.  It lets you visually construct an entire infrastructure, from load balancers to databases, app servers, proxies, mail exchangers, and everything. These are components they keep in a library, just like transistors and chips in a circuit design program.

Once you’ve defined your infrastructure, a single button click will deploy the whole thing into the grid OS. And there’s the rub. AppLogic doesn’t work with just any old software and it won’t work on top of an existing "traditional" infrastructure.

As a comparison, HP’s SmartFrog just runs an agent on a bunch of Windows, Linux, or HP-UX servers. A management server sends instructions to the agents about how to deploy and configure the necessary software. So SmartFrog could be layered on top of an existing traditional infrastructure.

Not so with AppLogic. You build a grid specifically to support this deployment style. That makes it possible to completely virtualize load balancers and firewalls along with servers. Of course, it also means complete and total lock-in to 3tera.

Still, for someone like a managed hosting provider, 3tera offers the fastest, most complete definition and provisioning system I’ve seen.


What can I say about GigaSpaces? Anyone who’s heard me speak knows that I adore tuple-spaces. GigaSpaces is a tuple-space in the same way that Tibco is a pub-sub messaging system. That is to say, the foundation is a tuple-space, but they’ve added high-level capabilities based on their core transport mechanism.

So, they now have a distributed caching system.  (They call it an "in-memory data grid". Um, OK.) There’s a database gateway, so your front end can put a tuple into memory (fast) while a back-end process takes the tuple and writes it into the database.

Just this week, they announced that their entire stack is free for startups. (Interesting twist: most companies offer the free stuff to open-source projects.) They’ll only start charging you money when you get over $5M in revenue. 

I love the technology. I love the architecture.

Catching Up Through the Day

| Comments

One of the great things about virtual infrastructure is that you can treat it as a service. I use Yahoo’s shared hosting service for this blog. That gives me benefits: low cost and very quick setup. On the down side, I can’t log in as root. So when Yahoo has a problem, I have a problem.

Yesterday, there was something wrong with Yahoo’s install of Movable Type. As a result, I couldn’t post my "five things". I’ll be catching up today, as time permits.

My butt is planted in one track all day today, "Architectures You’ve Always Wondered About." We’ll be hearing about the architecture that runs Second Life, Yahoo, eBay, LinkedIn, and Orbitz. I may need a catheter and an IV.

Architecting for Latency

| Comments

Dan Pritchett, Technical Fellow at eBay, spoke about "Architecting for Latency". His aim was not to talk about minimizing latency, as you might expect, but rather to architect as though you believe latency is unavoidable and real.

We all know the effect latency can have on performance. That’s the first-level effect. If you consider synchronous systems—such as SOAs or replicated DR systems—then latency has a dramatic effect on scalability as well. Whenever a synchronous call reaches across a long wire, the latency gets added directly to the processing time.

For example, if client A calls service B, then A’s processing time will be at least the sum of B’s processing time, plus the latency between A and B. (Yes, it seems obvious when you state it like that, but many architectures still act as though latency is zero.)

Furthermore, latency over IP networks is fundamentally variable. That means A’s performance is unpredictable, and can never be made predictable.

Latency also introduces semantic problems. A replicated database will always have some discrepancy with the master database. A functionally or horizontally partitioned system will either allow discrepancies or must serialize traffic and give up scaling. You can imagine that eBay is much more interested in scaling than serializing traffic.

For example, when a new item is posted to eBay, it does not immediately show up in the search results. The ItemNode service posts a message that eventually causes the item to show up in search results. Admittedly, this is kept to a very short period of time, but still, the item will reach different data centers at different times. So, the search service inside the nearest data center will get the item before the search service inside the farthest. I suspect many eBay users would be shocked, and probably outraged, to hear that shoppers see different search results depending on where they are.

Now, the search service is designed to get consistent within a limited amount of time—for that item. With a constant flow of items, being posted from all over the country, you can imagine that there is a continuous variance among the search services. Like the quantum foam, however, this is near-impossible to observe. One user cannot see it, because a user gets pinned to a single data center. It would take multiple users, searching in the right category, with synchronized clocks, taking snapshots of the results to even observe that the discrepancies happen. And even then, they would only have a chance of seeing it, not a certainty. 

Another example. Dan talked about payment transfer from one user to another. In the traditional model, that would look something like this.

Synchronous write to both databases

You can think of the two databases as being either shards that contain different users or tables that record different parts of the transaction.

This is a design that pretends latency doesn’t exist. In other words, it subscribes to Fallacy #2 of the Fallacies of Distributed Computing. Performance and scalability will suffer here.

(Frankly, it has an availability problem, too, because the availability of the payment service Pr(Payment) will now be Pr(Database 1) * Pr(Network 1) * Pr(Database 2) * Pr(Network 2). In other words, the availability of the payment service is coupled to the availability of Database 1, Database 2, and the two networks connecting Payment to the two databases.)

Instead, Dan recommends a design more like this:

Payment service with back end reconciliation

In this case, the payment service can set an expectation with the user that the money will be credited within some number of minutes. The availability and performance of the payment service is now independent from that of Database 2 and the reconciliation process. Reconciliation happens in the back end. It can be message-driven, batch-driven, or whatever. The main point is that it is decoupled in time and space from the payment service. Now the databases can exist in the same data center or on separate sides of the globe. Either way, the performance, availability, and scalability characteristics of the payment service don’t change. That’s architecting for latency.

Instead of ACID semantics, we think about BASE semantics. (A cute retronym for Basically Available Soft-state Eventually-consistent.) 

Now, many analysts, developers, and business users will object to the loss of global consistency. We heard a spirited debate last night between Dan and Nati Shalom, founder and CTO of GigaSpaces about that very subject.

I have two arguments of my own to support architecting for latency.

First, any page you show to a user represents a point-in-time snapshot. It can always be inaccurate even by the time you finish generating the page. Think about a commerce site saying "Ships in 2 - 3 days". That’s true at the instant when the data is fetched. Ten milliseconds later, it might not be true. By the time you finish generating the page and the user’s browser finishes rendering it (and fetching the 29 JavaScript files needed for the whizzy AJAX interface) the data is already a few seconds old. So global consistency is kind of useless in that case, isn’t it? Besides, I can guarantee there’s already a large amount of latency in the data path from inventory tracking to the availability field in the commerce database anyway.

Second, the cost of global consistency is global serialization. If you assume a certain volume of traffic you must support, the cost of a globally consistent solution will be a multiple of the cost of a latency-tolerant solution. That’s because global consistency can only be achieved by making a single master system of record. When you try to reach large scales, that master system of record is going to be hideously expensive.

Latency is simply an immutable fact of the universe. If we architect for it, we can use it to our advantage. If we ignore it, we will instead be its victims.

For more of Dan’s thinking about latency, see his article on InfoQ.com

SOA Without the Edifice

| Comments

Sometimes the best interactions at a conference aren’t the talks, they’re shouting. An crowded bar with an over-amped DJ may seem like an unlikely place for a discussion on SOA. Even so, when it’s Jim Webber, ThoughtWorks’ SOA practice lead doing the shouting, it works. Given that Jim’s topic is "Guerilla SOA", shouting is probably more appropriate than the hushed and reverential cathedral whispers that usually attend SOA discussions.

Jim’s attitude is that SOA projects tend to attract two things: Taj Mahal architects and parasitic vendors. (My words, not Jim’s.) The combined efforts of these two groups results in monumentally expensive edifices that don’t deliver value. Worse still, these efforts consume work and attention that could go to building services aligned with the real business processes, not some idealized vision of what the processes ought to be.

Jim says that services should be aligned with business processes. When the business process changes, change the service. (To me, this automatically implies that the service cannot be owned by some enterprise governance council.) When you retire the business process, simply retire the service.

These sound like such common sense, that it’s hard to imagine they could be controversial.

I’ll be in the front row for Jim’s talk later today.