Wide Awake Developers

Main

Units of Measure in Scala

Failure to understand or represent units has caused several major disasters, including the costly loss of the Mars Climate Orbiter in 1999. This is one of those things that DSLs often get right, but mainstream programming languages just ignore. Or, worse, they implement a clunky unit of measure library that ensures you can never again write a sensible arithmetic expression.

While I was at JAOO Australia this week, Amanda Laucher showed some F# code for a recipe that caught my attention. It used numeric literals that attached units directly to quantities. What's more, it was intelligent about combining units.

I went looking for something similar in Scala. I googled my fingertips off, but without much luck, until Miles Sabin pointed out that there's already a compiler plugin sitting right next to the core Scala code itself.

Installing Units

Scala has its own package manager, called sbaz. It can directly install the units extension:

sbaz install units

This will install it under your default managed installation. If you haven't done anything else, that will be your Scala install directory. If you have done something else, you probably already know what you're doing, so I won't try to give you instructions.

Using Units

To use units, you first have to import the library's "Preamble". It's also helpful to go ahead and import the "StandardUnits" object. That brings in a whole set of useful SI units.

I'm going to do all this from the Scala interactive interpreter.

scala> import units.Preamble._
import units.Preamble._

scala> import units.StandardUnits._
import units.StandardUnits._

After that, you can multiply any number by a unit to create a dimensional quantity:

scala> 20*m
res0: units.Measure = 20.0*m

scala> res0*res0
res1: units.Measure = 400.0*m*m

scala> Math.Pi*res0*res0
res2: units.Measure = 1256.6370614359173*m*m

Notice that when I multiplied a length (in meters) times itself, I got an area (square meters). To me, this is a really exciting thing about the units library. It can combine dimensions sensibly when you do math on them. In fact, it can help prevent you from incorrectly combining units.

scala> val length = 5*mm
length: units.Measure = 5.0*mm

scala> val weight = 12*g
weight: units.Measure = 12.0*g

scala> length + weight
units.IncompatibleUnits: Incompatible units: g and mm

I can't add grams and millimeters, but I can multiply them.
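For example, multiplying the length and weight above should produce a compound dimension. Here's the interaction I'd expect (a sketch; I haven't checked exactly how the plugin orders compound units in its output):

scala> length * weight
res3: units.Measure = 60.0*g*mm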

Creating Units

The StandardUnits package includes a lot of common units relating to basic physics. It doesn't have any relating to system capacity metrics, so I'd like to create some units for that.

scala> import units._
import units._

scala> val requests = SimpleDimension("requests")
requests: units.SimpleDimension = requests

scala> val req = SimpleUnit("req", requests, 1.0)
req: units.SimpleUnit = req

scala> val Kreq = SimpleUnit("Kreq", requests, 1000.0)
Kreq: units.SimpleUnit = Kreq

Now I can combine that simple dimension with others. If I want to express requests per second, I can just write it directly.

scala> 565*req/s
res4: units.Measure = 565.0*req/s
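The dimensional bookkeeping pays off here, too. Multiplying that rate by a duration should cancel the seconds and leave a plain request count. Again, a sketch of the session I'd expect:

scala> res4 * (60*s)
res5: units.Measure = 33900.0*req

That's one minute of traffic at 565 requests per second.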

Conclusion

This extension will be the first thing I add to new projects from now on. The convenience of literals, combined with the extensibility of adding my own dimensions and units, means I can easily keep units attached to all of my numbers.

There's no longer any excuse to neglect your units in a mainstream programming language.

Six Word Methods

In his great collection of essays Why Does Software Cost So Much?, Tom DeMarco makes the interesting point that the software industry had grown from zero to $300 billion (as of 1993). This indicates that the market had at least $300B worth of demand for software, even while complaining continuously about the cost and quality of the very same software. It seems to me that the demand for software production, together with the time and cost pressures, has only increased dramatically since then.

(DeMarco enlightens us that the perennial question, "Why does software cost so much?" is not really a question at all, but rather a goad or a negotiation. Also very true.)

Fundamentally, the demand for software production far outstrips our industry's ability to supply it. In fact, I believe that we can classify most software methods and techniques by their relation and response to the problem of surplus demand. Some try to optimize for least-cost production, others for highest quality, still others for shortest cycle time.

In the spirit of six-word memoirs, here are the sometimes dubious responses that various technologies and development methods offer to the overwhelming demand for software production.

Waterfall: Nevermind backlog, requirements were signed off.

RAD: Build prototypes faster than discarding them.

Offshore outsourcing: Army of cheap developers producing junk.

Onshore outsourcing: Same junk, but with expensive developers.

Agile: Avoid featuritis; outrun pesky business users.

Domain-specific languages: Compress every problem into one-liners.

CMMI: Enough Process means nothing's ever wasted.

Relational Databases: Code? Who cares? Data lives forever.

Model-driven architecture: Jackson Pollock's models into inscrutable code.

Web Services: Terrorize XML until maximum reuse achieved.

FORTH: backward writing IF punctuation time SAVE.

SOA: Iron-fisted governance ensures total calcification.

Intentional programming: Parallelize programming... make programmers of everyone.

Google as IDE: It's been done, probably in Befunge.

Open-source: Bury the world in abandoned code.

Mashups: Parasitize others' apps, then APIs change.

LISP: With enough macros, one uberprogrammer suffices.

perl: Too busy coding to maintain anyway.

Ruby: Meta-programming: same problems, mysterious solutions.

Ocaml: No, try meta-meta-meta-programming.

Groovy: Faster Java coding, runs like C-64.

Software-as-a-Service: Don't write your own, rent ours.

Cloud Computing: Programmers would go faster without administrators.


Two Ways To Boost Your Flagging Web Site

Being fast doesn't make you scalable. But it does mean you can handle more load with your current infrastructure. Take a look at this diagram of request handlers.

13 Threads Needed When Requests Take 700ms

You can see that it takes 13 request handling threads to process this amount of load. In the next diagram, the requests arrive at the same rate, but in this picture it takes just 200 milliseconds to answer each one.

3 Threads Needed When Requests Take 200ms

Same load, but only 3 request handlers are needed at a time. So, shortening the processing time means you can handle more transactions during the same unit of time.
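That's just Little's Law at work: the number of busy handlers equals the arrival rate times the time each request holds a handler. A quick check, with an arrival rate of about 18 requests per second (my number, picked to roughly match the diagrams):

// Little's Law: concurrent requests = arrivals per second * seconds per request
val slowCase = 18.0 * 0.700   // ~12.6, hence the 13 threads above
val fastCase = 18.0 * 0.200   // ~3.6, hence 3 or 4 threads

Cut the processing time and the thread count falls with it.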

Suppose your site is built on the classic "six-pack" architecture shown below. As your traffic grows and the site slows, you're probably looking at adding more oomph to the database servers. Scaling that database cluster up gets expensive very quickly. Worse, you have to bulk up both guns at once, because each one still has to be able to handle the entire load. So you're paying for big boxes that are guaranteed to be 50% idle.

Classic Six Pack

Let's look at two techniques almost any site can use to speed up requests, without having the Hulk Hogan and Andre the Giant of databases lounging around in your data center.

Cache Farms

Cache farming doesn't mean armies of Chinese gamers stomping rats and making vests. It doesn't involve registering a ton of domain names, either.

Pretty much every web app is already caching a bunch of things at a bunch of layers. Odds are, your application is already caching database results, maybe as objects or maybe just query results. At the top level, you might be caching page fragments. HTTP session objects are nothing but caches. The net result of all this caching is a lot of redundancy. Every app server instance has a bunch of memory devoted to caching. If you're running multiple instances on the same hosts, you could be caching the same object once per instance.

Caching is supposed to speed things up, right? Well, what happens when those app server instances get short on memory? Those caches can tie up a lot of heap space. If they do, then instead of speeding things up, the caches will actually slow responses down as the garbage collector works harder and harder to free up space.

So what do we have? With the six-pack's two app hosts running four instances each, a frequently accessed object---like a product featured on the home page---will be duplicated eight times. Can we do better? Well, since I'm writing this article, you might suspect the answer is "yes". You'd be right.

The caches I've described so far are in-memory, internal caches. That is, they exist completely in RAM and each process uses its own RAM for caching. There exist products, commercial and open-source, that let you externalize that cache. By moving the cache out of the app server process, you can access the same cache from multiple instances, reducing duplication. Getting those objects out of the heap also means you can make the app server heap smaller, which will reduce garbage collection pauses. If you make the cache distributed, as well as external, then you can reduce duplication even further.

External caching can also be tweaked and tuned to help deal with "hot" objects. If you look at the distribution of accesses by ID, odds are you'll observe a power law. That means the popular items will be requested hundreds or thousands of times as often as the average item. In a large infrastructure, making sure that the hot items are on cache servers topologically near the application servers can make a huge difference in time lost to latency and in load on the network.

External caches are subject to the same kind of invalidation strategies as internal caches. On the other hand, when you invalidate an item from each app server's internal cache, they're probably all going to hit the database at about the same time. With an external cache, only the first app server hits the database. The rest will find that it's already been re-added to the cache.
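In code, that read-through logic looks something like this sketch. The ExternalCache and Database interfaces here are hypothetical stand-ins for whatever products you actually use:

case class Product(id: String, name: String)

// Hypothetical interfaces -- real products (memcached clients, etc.)
// have their own APIs, but the shape is the same.
trait ExternalCache {
  def get(key: String): Option[Product]
  def put(key: String, value: Product): Unit
}

trait Database {
  def loadProduct(id: String): Product
}

def findProduct(id: String, cache: ExternalCache, db: Database): Product =
  cache.get("product:" + id) match {
    case Some(product) => product         // cache hit: no database traffic
    case None =>
      val product = db.loadProduct(id)    // first miss after invalidation hits the DB
      cache.put("product:" + id, product) // re-add it, so other app servers find it
      product
  }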

External cache servers can run on the same hosts as the app servers, but they are often clustered together on hosts of their own. Hence, the cache farm.

Six Pack With Cache Farm

If the external cache doesn't have the item, the app server hits the database as usual. So I'll turn my attention to the database tier.

Read Pools

The toughest thing for any database to deal with is a mixture of read and write operations. The write operations have to create locks and, if transactional, locks across multiple tables or blocks. If the same tables are being read, those reads will have highly variable performance, depending on whether a read operation randomly encounters one of the locked rows (or pages, blocks, or tables).

But the truth is that your application almost certainly does more reads than writes, probably to an overwhelming degree. (Yes, there are some domains where writes exceed reads, but I'm going to momentarily disregard mindless data collection.) For a travel site, the ratio will be about 10:1. For a commerce site, it will be from 50:1 to 200:1. There are a lot of variables here, especially when you start doing more effective caching, but even then, the ratios are highly skewed.

When your database starts to get that middle-age paunch and it just isn't as zippy as it used to be, think about offloading those reads. At a minimum, you'll be able to scale out instead of up. Scaling out with smaller, consistent, commodity hardware pleases everyone more than forklift upgrades. In fact, you'll probably get more performance out of your writes once all that pesky read I/O is off the write master.

How do you create a read pool? Good news! It uses nothing more than the built-in replication features of the database itself. Basically, you just configure the write master to ship its archive logs (or whatever your DB calls them) to the read pool databases. They replay the logs to bring their state into sync with the write master.
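On the application side, read pooling just means routing queries to different connections. A sketch, with a hypothetical pool interface (any DataSource implementation would do):

import java.sql.Connection
import java.util.Random

trait ConnectionPool { def checkout(): Connection }

class ReadWriteRouter(master: ConnectionPool, replicas: List[ConnectionPool]) {
  private val random = new Random

  // Writes always go to the write master...
  def forWrite(): Connection = master.checkout()

  // ...while reads scatter across the read pool.
  def forRead(): Connection = replicas(random.nextInt(replicas.size)).checkout()
}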

Six Pack With Cache Farm and Read Pool

By the way, for read pooling, you really want to avoid database clustering approaches. The overhead needed for synchronization wipes out the benefits of read pooling in the first place.

At this point, you might be objecting, "Wait a cotton-picking minute! That means the read machines are garun-damn-teed to be out of date!" (That's the Foghorn Leghorn version of the objection. I'll let you extrapolate the Tony Soprano and Geico Gecko versions yourself.) You would be correct. The read machines will always reflect an earlier point in time.

Does that matter?

To a certain extent, I can't answer that. It might matter, depending on your domain and application. But in general, I think it matters less often than it seems. I'll give you an example from the retail domain that I know and love so well. Take a look at this product detail page from BestBuy.com. How often do you think each data field on that page changes? Suppose there is a pricing error that needs to be corrected immediately (for some definition of immediately.) What's the total latency before that pricing error will be corrected? Let's look at the end-to-end process.

  1. A human detects the pricing error.
  2. The observer notifies the responsible merchant.
  3. The merchant verifies that the price is in error and determines the correct price.
  4. Because this is an emergency, the merchant logs in to the "fast path" system that bypasses the nightly batch cycle.
  5. The merchant locates the item and enters the correct price.
  6. She hits the "publish" button.
  7. The fast path system connects to the write master in production and updates the price.
  8. The read pool receives the logs with the update and applies them.
  9. The read pool process sends a message to invalidate the item in the app servers' caches.
  10. The next time users request that product detail page, they see the correct price.

That's the best-case scenario! In the real world, the merchant will be in a meeting when the pricing error is found. It may take a phone call or lookup from another database to find out the correct price. There might be a quick conference call to make the decision whether to update the price or just yank the item off the site. All in all, it might take an hour or two before the pricing error gets corrected. Whatever the exact sequence of events, odds are that the replication latency from the write master to the read pool is the very least of the delays.

Most of the data is much less volatile or critical than the price. Is an extra five minutes of latency really a big deal? When it can save you a couple of hundred thousand dollars on giant database hardware?

Summing It Up

The reflexive answer to scaling is, "Scale out at the web and app tiers, scale up in the data tier." I hope this shows that there are other avenues to improving performance and capacity.

References

For more on read pooling, see Cal Henderson's excellent book, Building Scalable Web Sites: Building, scaling, and optimizing the next generation of web applications.

The most popular open-source external caching framework I've seen is memcached. It's a flexible, multi-lingual caching daemon.

On the commercial side, GigaSpaces provides distributed, external, clustered caching. It adapts to the "hot item" problem dynamically to keep a good distribution of traffic, and it can be configured to move cached items closer to the servers that use them, reducing network hops to the cache.

Three Vendors Worth Evaluating

Several vendors are sponsoring QCon. (One can only wonder what the registration fees would be if they didn't.) Of these, I think three have products worth immediate evaluation.

Semmle

In the category of "really cool, but would I pay for it?" is Semmle. Their flagship product, SemmleCode, lets you treat your codebase as a database against which you can run queries. SemmleCode groks the static structure of your code, including relationships and dependencies. Along the way, it calculates pretty much every OO metric yet invented. It also looks at the source repository.

What can you do with it? Well, you can create a query that shows you all the cyclic dependencies in your code. The results can be rendered as a tree with explanations, a graph, or a chart. Or, you can chart your distribution of cyclomatic complexity scores over time. You can look for the classes or packages most likely to create a ripple effect.

Semmle ships with a sample project: the open-source drawing framework JHotDraw. In a stunning coincidence, I'm a contributor to JHotDraw. I wrote the glue code that uses Batik to export a drawing as SVG. So I can say with confidence that when Semmle showed all kinds of cyclic dependencies in the exporters, it was absolutely correct. Every one of the queries I saw run against JHotDraw confirmed my own experience with that codebase. Where Semmle indicated difficulty, I had difficulty. Where Semmle showed JHotDraw had good structure, it was easy to modify and extend.

There are an enormous number of things you could do with this, but one thing they currently lack is build-time automation. Semmle integrates with Eclipse, but not Ant or Maven. I'm told that's coming in a future release.

3Tera

Virtualization is a hot topic. VMWare has the market lead in this space, but I'm very impressed with 3Tera's AppLogic.

AppLogic takes virtualization up a level.  It lets you visually construct an entire infrastructure, from load balancers to databases, app servers, proxies, mail exchangers, and everything. These are components they keep in a library, just like transistors and chips in a circuit design program.

Once you've defined your infrastructure, a single button click will deploy the whole thing into the grid OS. And there's the rub. AppLogic doesn't work with just any old software and it won't work on top of an existing "traditional" infrastructure.

As a comparison, HP's SmartFrog just runs an agent on a bunch of Windows, Linux, or HP-UX servers. A management server sends instructions to the agents about how to deploy and configure the necessary software. So SmartFrog could be layered on top of an existing traditional infrastructure.

Not so with AppLogic. You build a grid specifically to support this deployment style. That makes it possible to completely virtualize load balancers and firewalls along with servers. Of course, it also means complete and total lock-in to 3Tera.

Still, for someone like a managed hosting provider, 3Tera offers the fastest, most complete definition and provisioning system I've seen.

GigaSpaces

What can I say about GigaSpaces? Anyone who's heard me speak knows that I adore tuple-spaces. GigaSpaces is a tuple-space in the same way that Tibco is a pub-sub messaging system. That is to say, the foundation is a tuple-space, but they've added high-level capabilities based on their core transport mechanism.

So, they now have a distributed caching system.  (They call it an "in-memory data grid". Um, OK.) There's a database gateway, so your front end can put a tuple into memory (fast) while a back-end process takes the tuple and writes it into the database.
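The shape of that pattern, in generic tuple-space terms (a sketch of the concept, not GigaSpaces' actual API):

case class Order(id: String, total: Double)

// Generic tuple-space operations; any real product's client API will differ.
trait TupleSpace {
  def write(tuple: AnyRef): Unit        // fast, in-memory put
  def take(template: Class[_]): AnyRef  // blocks until a matching tuple arrives
}

trait OrderDatabase { def save(order: Order): Unit }

// Front end: returns as soon as the tuple is in the space.
def acceptOrder(space: TupleSpace, order: Order) = space.write(order)

// Back end: drains tuples into the database at its own pace.
def persistLoop(space: TupleSpace, db: OrderDatabase): Unit =
  while (true)
    db.save(space.take(classOf[Order]).asInstanceOf[Order])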

Just this week, they announced that their entire stack is free for startups. (Interesting twist: most companies offer the free stuff to open-source projects.) They'll only start charging you money when you get over $5M in revenue. 

I love the technology. I love the architecture.

Three Programming Language Problems Solved Forever

It's often been the case that a difficult problem can be made easier by transforming it into a different representation.  Nowhere is that more true than in mathematics and the pseudo-mathematical realm of programming languages.

For example, LISP, Python, and Ruby all offer beautiful and concise constructs for operating on lists of things. In each of them, you can write a function that iterates across a list, performs some operation on each element, and returns the resulting list. C, C++, and Java do not offer any similar construct. In each of those languages, iterating a list is a control-flow structure that requires multiple lines to express. More significantly, the functional expression of a list operation can be composed. That is, you can embed a list comprehension inside another function call or list operation. In reading Programming Collective Intelligence, which uses Python as its implementation language, I've been amazed at how eloquent complex operations can be, especially when I mentally transliterate the same code into Java.
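To make that concrete in Scala (the language from earlier in these pages), filtering and mapping compose in a single expression, where pre-closures Java needs loops and temporary collections:

val people = List(("Alice", 34), ("Bob", 28), ("Carol", 41))

// Select, then transform, in one composed expression.
val names = people.filter(_._2 > 30).map(_._1)   // List("Alice", "Carol")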

In the evening keynote at QCon, Richard Gabriel covered 50 language topics, with a 50 word statement about each---along with a blend of music, art, and poetry. (If you've never seen Richard perform at a conference, it's quite an experience.)  His presentation "50 in 50" also covered 50 years of programming and introduced languages as diverse as COBOL, SNOBOL, Piet, LISP, Perl, C, Algol, APL, IPL, Befunge, and HQ9+.

HQ9+ particularly caught my attention.  It takes the question of "simplifying the representation of problems" to the utmost extreme.

HQ9+ has a simple grammar.  There are 4 operations, each represented by a single character.

'+' increments the register.

'H' prints every language's natal example, "Hello, world!"

'Q' makes every program into a quine.  It causes the interpreter to print the program text.  Quines are notoriously difficult assignments for second-year CS students.

'9' causes the interpreter to print the lyrics to the song "99 Bottles of Beer on the Wall." This qualifies HQ9+ as a real programming language, suitable for inclusion in the ultimate list of languages.

These three operators solve some very commonly expressed problems. In a certain sense, they are the ultimate solution to those problems. They cannot be reduced any further... you can't get shorter than one character.
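In fact, the whole language fits in a handful of lines. Here's a toy interpreter in Scala (lyrics abbreviated, and I'm treating unrecognized characters as no-ops):

def run(program: String): Unit = {
  var register = 0                 // '+' increments this; the spec gives no way to read it
  for (op <- program) op match {
    case 'H' => println("Hello, world!")
    case 'Q' => println(program)   // every program is trivially a quine
    case '9' => for (n <- 99 to 1 by -1)
                  println(n + " bottles of beer on the wall...")
    case '+' => register += 1
    case _   => ()                 // no-op here; real implementations may differ
  }
}

run("HQ9+")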

Of course, in an audience of programmers, HQ9+ always gets a laugh.  In fact, it was created specifically to make programmers laugh.  And, in fact, it's a kind of meta-level humor. It's not the programs that are funny, but the design of the language itself... an inside joke from one programmer to the rest of us.

Eric Evans: Strategic Design

Eric Evans, author of Domain-Driven Design and founder of Domain Language, embodies the philosophical side of programming.

He gave a wonderful talk on "Strategic Design".  During this talk, he stated a number of maxims that are worth pondering.

"Not all of a large system will be well designed."

"There are always multiple models."

"The diagram is not the model, but it is an expression of part of the model."

These are not principles to be followed, Evans says. Rather, these are fundamental laws of the universe. We must accept them and act accordingly, because disregarding them ends in tears.

Much of this material comes from Part 4 of Domain-Driven Design.  Evans laconically labeled this, "The part no one ever gets to."  Guilty.  But when I get back home to my library, I will make another go of it.

Evans also discusses the relative size of code, amount of time spent, and value of the three fundamental portions of a system: the core domain, supporting subdomains, and generic subdomains.

Generic subdomains are horizontal. You might find these in any system in any company in the world.

Supporting subdomains are business-specific, but they are not the reason this particular system exists. That is, they are a necessary cost, but they do not differentiate.

The core domain is the reason for the system. It is the business-specific functionality that makes this system worth building.

Now, in a typical development process (and especially a rewrite project), where does the team's time go? Most of it will go to the largest bulk: the generic subdomains. This is the stuff that has to exist, but it adds no value and is not specific to the company's business. The next largest fraction goes to the supporting subdomains. Finally, the smallest portion of time---and usually the last portion of time---goes to the core domain.

That means the very last thing delivered is the reason for the system's existence in the first place. Ouch.

What makes a POJO so great, anyway?

My friend David Hussman once said to me, "The next person that says the word 'POJO' to me is going to get stabbed in the eye with a pen."  At the time, I just commiserated about people who follow crowds rather than making their own decisions.

David's not a violent person.  He's not prone to fits of violence or even hyperbole.  What made this otherwise level-headed coach and guru resort to non-approved uses of a Bic?

This weekend at No Fluff, Just Stuff, I had occasion to contemplate POJOs again. There were many presentations about "me too" web frameworks. These are the latest crop of Java web frameworks that are furiously copying Ruby on Rails features as fast as they can. They invariably make a big deal out of using POJOs for data-mapped entities or for the beans accessed by whatever flavor of page template they use. (See JSF, Seam, WebFlow, Grails, and Tapestry 5 for examples.)

Mainly, I think the infuriating bit is the use of the word "POJO" as if it were a synonym for "good". There's nothing inherently virtuous about plain old Java objects. It's a retronym: a name made up for an old thing to distinguish it from its inferior new replacement.

People only care about POJOs because EJB2 was so unbelievably bad.

Nobody gives a crap about "POROs" (Plain old Ruby objects) because ActiveRecord doesn't suck.

Self-Inflicted Wounds

My friend and colleague Paul Lord said, "Good marketing can kill you at any time."

He was describing a failure mode that I discuss in Release It!: Design and Deploy Production-Ready Software as "Attacks of Self-Denial".  These have all the characteristics of a distributed denial-of-service attack (DDoS), except that a company asks for it.  No, I'm not blaming the victim for electronic vandalism... I mean, they actually ask for the attack.

The anti-pattern goes something like this: marketing conceives of a brilliant promotion, which they send to 10,000 customers.  Some of those 10,000 pass the offer along to their friends.  Some of them post it to sites like FatWallet or TechBargains.  On the appointed day, hour, and minute, the site has a date with destiny as a million or more potential customers hit the deep link that marketing sent around in the email.  You know, the one that bypasses the content distribution network, embeds a session ID in the URL, and uses SSL?

Nearly every retailer I know has done this to themselves at one point.  Two holidays ago, one of my clients did it to themselves, when they announced that XBox 360 preorders would begin at a certain day and time.  Between actual customers and the amateur shop-bots that the tech-savvy segment cobbled together, the site got crushed.  (Yes, this was one where marketing sent the deep link that bypassed all the caching and bot-traps.)

Last holiday, Amazon did it to themselves when they discounted the XBox 360 by $300.  (What is it about the XBox 360?)  They offered a thousand units at the discounted price and got ten million shoppers.  All of Amazon was inaccessible for at least 20 minutes.  (It may not sound like much, but some estimates say Amazon generates $1,000,000 per hour during the holiday season, so that 20-minute outage probably cost them over $300,000!)

In Release It!, I discuss some non-technical ways to mitigate this behavior, as well as some design and architecture patterns you can apply to minimize damage when one of these Attacks of Self-Denial occurs.

Design Patterns in Real Life

I've seen walking cliches before.  There was this one time in the Skyway that I actually saw a guy with a white cane being led by a woman with huge dark sunglasses and a guide dog.  Today, though, I realized I was watching a design pattern played out with people instead of objects.

I've used the Reactor pattern in my software before.  It's particularly helpful when you combine it with non-blocking multiplexed I/O, such as Java's NIO package.

Consider a server application such as a web server or mail transfer agent.  A client connects to a socket on the server to send a request. The server and client talk back and forth a little bit, then the server either processes or denies the client's request.

If the server just used one thread, then it could only handle a single client at a time.  That's not likely to make a winning product. Instead, the server uses multiple threads to handle many client connections.

The obvious approach is to have one thread handle each connection.  In other words, the server keeps a pool of threads that are ready and waiting for a request.  Each time through its main loop, the server gets a thread from the pool and, on that thread, calls the socket "accept" method.  If there's already a client connection request waiting, then "accept" returns right away.  If not, the thread blocks until a client connects.  Either way, once "accept" returns, the server's thread has an open connection to a client.

At that point, the thread goes on to read from the socket (which blocks again) and, depending on the protocol, may write a response or exchange more protocol handshaking.  Eventually, the demands of protocol satisfied, the client and server say goodbye and each end closes the socket.  The worker thread pulls a Gordon Freeman and disappears into the pool until it gets called up for duty again.
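The naive model, sketched with standard java.net sockets (the protocol work itself is elided):

import java.net.{ServerSocket, Socket}

val server = new ServerSocket(8080)

def handle(client: Socket): Unit = {
  // read the request, write the response... elided here
  client.close()
}

// A fixed pool of threads, each blocking in accept() and then owning
// one whole conversation before going back to wait.
for (i <- 1 to 100) {
  new Thread(new Runnable {
    def run(): Unit = while (true) handle(server.accept())
  }).start()
}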

It's a simple, obvious model.  It's also really inefficient.  Any given thread spends most of its life doing nothing.  It's either blocked in the pool, waiting for work, or it's blocked on a socket "accept", "read", or "write" call.

If you think about it, you'll also see that the naive server can handle only as many connections as it has threads.  To handle more connections, it must fork more threads.  Forking threads is expensive in two ways.  First, starting the thread itself is slow.  Second, each thread requires a certain amount of scheduling overhead.  Modern JVMs scale well to large numbers of threads, but sooner or later, you'll still hit the ceiling.

I won't go into all the details of non-blocking I/O here.  (I can point you to a decent article on the subject, though.)  Its greatest benefit is you do not need to dedicate a thread to each connection.  Instead, a much smaller pool of threads can be allocated, as needed, to handle individual steps of the protocol.  In other words, thread 13 doesn't necessarily handle the whole conversation. Instead, thread 4 might accept the connection, thread 29 reads the initial request, thread 17 starts writing the response and thread 99 finishes sending the response.

This model employs threads much more efficiently.  It also scales to many more concurrent requests.  Bookkeeping becomes a hassle, though. Keeping track of the state of the protocol when each thread only does a little bit with the conversation becomes a challenge.  Finally, the (hideously broken) multithreading restrictions in Java's "selector" API make fully multiplexed threads impossible.

The Reactor pattern predates Java's NIO, but works very well here.  It uses a single thread, called the Acceptor, to await incoming "events". This one thread sleeps until any of the connections needs service: either due to an incoming connection request, a socket ready to read, or a socket ready for write.  As soon as one of these events occurs, the Acceptor hands the event off to a dispatcher (worker) thread that then processes the event.
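Here's roughly what the Acceptor's loop looks like on top of NIO. This is a sketch: buffer handling and the hand-off to worker threads are elided.

import java.net.InetSocketAddress
import java.nio.channels.{SelectionKey, Selector, ServerSocketChannel}

def dispatch(key: SelectionKey): Unit = ()   // hand the event to a worker thread

val selector = Selector.open()
val server = ServerSocketChannel.open()
server.configureBlocking(false)
server.socket().bind(new InetSocketAddress(8080))
server.register(selector, SelectionKey.OP_ACCEPT)

while (true) {
  selector.select()                          // the one Acceptor thread sleeps here
  val keys = selector.selectedKeys().iterator()
  while (keys.hasNext) {
    val key = keys.next()
    keys.remove()
    if (key.isAcceptable) {
      val client = server.accept()           // ready now; this does not block
      client.configureBlocking(false)
      client.register(selector, SelectionKey.OP_READ)
    } else {
      dispatch(key)                          // readable or writable: workers take over
    }
  }
}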

You can visualize this by sitting in a TGI Friday's or Chili's restaurant.  (I'm fond of the crowded little ones inside airports. You know, the ones with a third of the regular menu and a line stretching out the door.  Like a home away from home for me lately.) The "greeter" accepts incoming connections (people) and hands them off to a "worker" (server).  The greeter is then ready for the next incoming request.  (The line out the door is the listen queue, in case you're keeping score.)  When the kitchen delivers the food, it doesn't wait for the original worker thread.  Instead, a different worker thread (a runner) brings the food out to the table.

I'll keep my eyes open for other examples of object-oriented design patterns in real life--though I don't expect to see many based on polymorphism.

JAI 1.1.3 in beta

I've been using JAI 1.1.2 for the past year. It's an incredibly powerful tool, though I will confess that the API is more than a bit quirky.

Early this year, Sun made JAI an open-source project available at java.net. That project has been working on the 1.1.3 release for most of the year. It's now in beta, with a few enhancements and a lot of bug fixes.

The most significant enhancement is that JAI can now be used with Java WebStart. Previously it had to be installed as a JRE extension.

Also, one of the big bugs is fixed: Issue #13, which could cause the JPEG codec to use excessive amounts of memory when decoding large untiled images. (Which we do in our app a lot!)


Too Much Abstraction

The more I deal with infrastructure architecture, the more I think that somewhere along the way, we have overspecialized. There are too many architects who have never lived with a system in production, or spent time on an operations team. Likewise, there are a lot of operations people who insulate themselves from the specification and development of the systems for which they will ultimately take responsibility.

The net result is suboptimization in the hardware/software fit. As a result, overall availability of the application suffers.

Here's a recent example.

First, we're trying to address the general issue of flowing data from production back into pre-production systems -- QA, production support, development, staging. The first attempt took six days to complete. Since the requirements of the QA environment stipulate that the data should be no more than one week out of date relative to production, that's a big problem. On further investigation, it appears that the DBA who was executing this process spent most of the time doing scps from one host to another. It's a lot of data, so in one respect 10-hour copies are reasonable.

But the DBA had never been told about the storage architecture. That's the domain of a separate "enterprise service" group. They are fairly protective of their domain and do not often allow their architecture documents to be distributed. They want to reserve the right to change them at will. Now, they will be quite helpful if you approach them with a storage problem, but the trick is knowing when you have a storage problem on your hands.

You see, all of the servers that the DBA was copying files from and to are all on the same SAN. An scp from one host on the SAN to another host on the SAN is pretty redundant.

There's an alternative solution that involves a few simple steps: Take a database snapshot onto a set of disks with mirrors, split the mirrors, and join them onto another set of mirrors, then do an RMAN "recovery" from that snapshot into the target database. Total execution time is about 4 hours.

From six days to four hours, just by restating the problem to the right people.

This is not intended to criticize any of the individuals involved. Far from it, they are all top-notch professionals. But the solution required merging the domains of knowledge from these two groups -- and the organizational structure explicitly discouraged that merging.

Another recent example.

One of my favorite conferences is the Colorado Software Summit. It's a very small, intensely technical crowd. I sometimes think half the participants are also speakers. There's a year-round mailing list for people who are interested in, or have been to, the Summit. These are very skilled and talented people. This is easily the top 1% of the software development field.

Even there, I occasionally see questions about how to handle things like transparent database connection failover. I'll admit that's not exactly a journeyman topic. Bring it up at a party and you'll have plenty of open space to move around in. What surprised me is that there are some fairly standard infrastructure patterns for enabling database connection failover that weren't known to people with decades of experience in the field. (E.g., cluster software reassigns ownership of a virtual IP address to one node or the other, with all applications using the virtual IP address for connections).

This tells me that we've overspecialized, or at least, that the groups are not talking nearly enough. I don't think it's possible to be an expert in high availability, infrastructure architecture, enterprise data management, storage solutions, OOA/D, web design, and network architecture. Somehow, we need to find an effective way to create joint solutions, so we don't have software being developed that's completely ignorant of its deployment architecture, nor should we have infrastructure investments that are not capable of being used by the software. We need closer ties between operations, architecture, and development.

The Lights Are On, Is Anybody Home?

We pay a lot of attention to stakeholders when we create systems. The end users get a say, as do the Gold Owners. Analysts put their imprimatur on the requirements. In better cases, operations and administration adds their own spin. It seems like the only group that doesn't have any input during requirements gathering is the development team itself. That is truly unfortunate.

No one has to live with the system more than the developers do--not even the users. Developers literally inhabit the system for most of their waking hours, as much as (or more than) they inhabit their cubes or offices. When the code is messy, nobody suffers more than the developers. When living in the system becomes unpleasant, morale will suffer. Any time you hear a developer ask for a few weeks of "cleanup" after a release, what they are really saying is, "This room is a terrible mess. We need to remodel."

A code review is just like an episode of "Trading Spaces". Developers get to trade problems for a while, to see if somebody else can see possibilities in their dwelling. Rip out that clunky old design that doesn't work any more! Hang some fabric on the walls and change the lighting.

Whether your virtual working environment becomes a cozy place, a model of efficiency, or a cold, drab prison, you create your own living space. It is worth taking some care to create a place you enjoy inhabiting. You will spend a lot of time there before the job is done.

Don't Build Systems That Boink

Note: This piece originally appeared in the "Marbles Monthly" newsletter in April 2003

I caught an incredibly entertaining special on The Learning Channel last week. A bunch of academics decided that they were going to build an authentic Roman-style catapult, based on some ancient descriptions. They had great plans, engineering expertise, and some really dedicated and creative builders. The plan was to hurl a 57-pound stone 400 yards, with a machine that weighed 30 tons. It was amazing to see the builders' faces swing between hope and fear. The excitement mingled with apprehension.

At one point, the head carpenter said that it would be wonderful to see it work, but "I'm fairly certain it's going to boink." I immediately knew what he meant. "Boink" sums up all the myriad ways this massive device could go horribly wrong and wreak havoc upon them all. It could fall over on somebody. It could break, releasing all that kinetic energy in the wrong direction, or in every direction. The ball could fly off backwards. The rope might relax so much that it just did nothing. One of the throwing arms could break. They could both break. In other words, it could do anything other than what it was intended to do.

That sounds pretty familiar. I see the same expressions on my teammates' faces every day. This enormous project we're slaving on could fall over and crush us all into jelly. It could consume our days, our minds, and our every waking hour. Worst case, it might cost us our families, our health, our passion. It could embarrass the company, or cost it tons of money. In fact, just about the most benign thing it could do is nothing.

So how do you make a system that doesn't boink? It is hard enough just making the system do what it is supposed to. The good news is that some simple "do's and don'ts" will take us a long way toward non-boinkage.

Automation is Your Friend #1: Run lots of tests -- and run them all the time

Automated unit tests and automated functional tests will guarantee that you don't backslide. They provide concrete evidence of your functionality, and they force you to keep your code integrated.

Automation is Your Friend #2: Be fanatic about build and deployment processes

A reliable, fully automated build process will prevent headaches and heartbreaks. A bad process--or a manual process--will introduce errors and make it harder to deliver on an iterative cycle.

Start with a fully automated build script on day one. Start planning your first production-class deployment right away, and execute a deployment within the first three weeks. A build machine (it can be a workstation) should create a complete, installable numbered package. That same package should be delivered into each environment. That way, you can be absolutely certain that QA gets exactly the same build that went into integration testing.

Avoid the temptation to check out the source code into each environment. An unbelievable amount of downtime can be traced to a version label changing between the QA build and the production build.

Everything In Its Place

Keep things separated that change at different speeds. Log files change very fast, so isolate them. Data changes a little less quickly but is still dynamic. "Content" changes slower yet, but is still faster than code. Configuration settings usually come somewhere between code and content. Each of these things should go in its own location, isolated and protected from the others.

Be transparent

Log everything interesting that happens. Log every exception or warning. Log the start and end of long-running tasks. Always make sure your logs include a timestamp!

Be sure to make the location of your log files configurable. It's not usually a good idea to keep log files in the same filesystem as your code or data. Filling up a filesystem with logs should not bring your system down.

Keep your configuration out of your code

It is always a good idea to separate metadata from code. This includes settings like host names, port numbers, database URLs and passwords, and external integrations.

A good configuration plan will allow your system to exist in different environments -- QA versus production, for example. It should also allow for clustered or replicated installations.
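In the Java world, even plain properties files get you most of the way there. A minimal sketch (the file name and keys are invented for illustration):

import java.io.FileInputStream
import java.util.Properties

// Each environment names its own file, e.g. -Dconfig.file=/etc/myapp/qa.properties
val props = new Properties()
props.load(new FileInputStream(System.getProperty("config.file", "app.properties")))

val dbUrl    = props.getProperty("db.url")
val smtpHost = props.getProperty("smtp.host")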

Keep your code and your data separated

The object-oriented approach is a good way to build software, but it's a lousy way to deploy systems. Code changes at a different frequency than data. Keep them separated. For example, in a web system, it should be easy to deploy a new code drop without disrupting the content of the site. Likewise, new content should not affect the code.

I think I'd like to

I think I'd like to do some Smalltalk (or Squeak) development sometime. Just for myself. It would be good for me -- like an artist going to a retreat and setting aside all notions of practicality. I know I'll never work in Squeak professionally. That's why it would be like saying to yourself, "In this now, purity of expression is all that matters. Tomorrow, I will worry about making something I can sell. Tomorrow I will design so the mediocre masses that follow me cannot corrupt it. Today, I will work for the joy I find in the work."

The burdens of responsibility leave no room for such indulgence. So I turn back to Java and C#. I'll write another Address class and deal with another session manager, and more cookies. Always with the cookies.


Needles, Haystacks

So, this may seem a little off-topic, but it comes round in the end. Really, it does.

I've been aggravated with the way members of the fourth estate have been treating the supposed "information" that various TLAs had before the September 11 attacks. (That used to be my birthday, by the way. I've since decided to change it.) We hear that four or five good bits of information scattered across the hundreds of FBI, CIA, NSA, NRO, IRS, DEA, INS, or IMF offices "clearly indicate" that terrorists were planning to fly planes into buildings. Maybe so. Still, it doesn't take a doctorate in complexity theory to figure out that you could probably find just as much data to support any conclusion you want. I'm willing to bet that if the same amount of collective effort were invested, we could prove that the U. S. Government has evidence that Saddam Hussein and aliens from Saturn are going to land in Red Square to re-establish the Soviet Union and launch missiles at Guam.

You see, if you already have the conclusion in hand, you can sift through mountain ranges of data to find those bits that best support your conclusion. That's just hindsight. It's only good for gossipy hens clucking over the backyard fence, network news anchors, and not-so-subtle innuendos by Congresscritters.

The trouble is, it doesn't work in reverse. How many documents does just the FBI produce every day? 10,000? 50,000? How would anyone find exactly those five or six documents that really matter and ignore all of the chaff? That's the job of analysis, and it's damn hard. A priori, you could only put these documents together and form a conclusion through sheer dumb luck. No matter how many analysts the agencies hire, they will always be crushed by the tsunami of data.

Now, I'm not trying to make excuses for the alphabet soup gang. I think they need to reconsider some of their basic operations. I'll leave questions about separating counter-intelligence from law enforcement to others. I want to think about harnessing randomness. You see, government agencies are, by their very nature, bureaucratic entities. Bureaucracies thrive on command-and-control structures. I think it comes from protecting their budgets. Orders flow down the hierarchy, information flows up. Somewhere, at the top, an omniscient being directs the whole shebang. A command-and-control structure hates nothing more than randomness. Randomness is noise in the system, evidence of inadequate procedures. A properly structured bureaucracy has a big, fat binder that defines who talks to whom, and when, and under what circumstances.

Such a structure is perfectly optimized to ignore things. Why? Because each level in the chain of command has to summarize, categorize, and condense information for its immediate superior. Information is lost at every exchange. Worse yet, the chance for somebody to see a pattern is minimized. The problem is this whole idea that information flows toward a converging point. Whether that point is the head of the agency, the POTUS, or an army of analysts in Foggy Bottom, they cannot assimilate everything. There isn't even any way to build information systems to support the mass of data produced every day, let alone correlate reports over time.

So, how do Dan Rather and his cohorts find these things and put them together? Decentralization. There are hordes of pit-bull journalists just waiting for the scandal that will catapult them onto CNN. ("Eat your heart out, Wolf, I found the smoking gun first!")

Just imagine if every document produced by the Minneapolis field office of the FBI were sent to every other FBI agent and office in the country. A vast torrent of data flowing constantly around the nation. Suppose that an agent filing a report about suspicious flight school activity could correlate that with other reports about students at other flight schools. He might dig a little deeper and find some additional reports about increased training activity, or a cluster of expired visas that overlap with the students in the schools. In short, it would be a lot easier to correlate those random bits of data to make the connections. Humans are amazing at detecting patterns, but they have to see the data first!

This is what we should focus on. Not on rebuilding the $6 Billion Bureaucracy, but on finding ways to make available all of the data collected today. (Notice that I haven't said anything that requires weakening our 4th or 5th Amendment rights. This can all be done under laws that existed before 9/11.) Well, we certainly have a model for a global, decentralized document repository that will let you search, index, and correlate all of its contents. We even have technologies that can induce membership in a set. I'd love to see what Google Sets would do with the 19 hijackers' names, after you have it index the entire contents of the FBI, CIA, and INS databases. Who would it nominate for membership in that set?

Basically, the recipe is this: move away from ill-conceived ideas about creating a "global clearinghouse" for intelligence reports. Decentralize it. Follow the model of the Internet, Gnutella, and Google. Maximize the chances for field agents and analysts to be exposed to that last, vital bit of data that makes a pattern come clear. Then, when an agent perceives a pattern, make damn sure the command-and-control structure is ready to respond.

Multiplier Effects

Here's another way to think about the ethics of software, in terms of multipliers. Think back to the last major virus scare, or when Star Wars Episode II was released. Some "analyst"--who probably found his certificate in a box of Cracker Jack--published some ridiculous estimate of damages.

BTW, I have to take a minute to disassemble this kind of analysis. Stick with me, it won't take long.

If you take 1.5 seconds to delete the virus, it costs nothing. It's an absolutely immeasurable impact to your day. It won't even affect your productivity. You will probably spend more time than that discussing sports scores, going to the bathroom, chatting with a client, or any of the hundreds of other things human beings do during a day. It's literally lost in the noise. Nevertheless, some peabrain analyst who likes big numbers will take that 1.5 seconds and multiply it by the millions of other users and their 1.5 seconds, then multiply that by the "national average salary" or some such number.

So, even though it takes you longer to blow your nose than to delete the virus email, somehow it still ends up "costing the economy" 5x10^6 USD in "lost productivity". The underlying assumptions here are so thoroughly rotten that the result cannot be anything but a joke. Sure as hell though, you'll see this analysis dragged out every time there's a news story--or better yet, a trial--about an email worm.
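For the record, here is the entire "analysis", with numbers I just made up to show its shape:

val secondsToDelete  = 1.5                       // time to hit the delete key
val affectedUsers    = 2.0e7                     // twenty million inboxes, say
val dollarsPerSecond = 60000.0 / (2000 * 3600)   // a $60K "average salary"

val cost = secondsToDelete * affectedUsers * dollarsPerSecond
// ~ $250,000 of "lost productivity", conjured from 1.5 seconds apiece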

The real moral of this story isn't about innumeracy in the press, or spotlight seekers exploiting innumeracy. It's about multipliers.

Suppose you have a decision to make about a particular feature. You can do it the easy way in about a week, or the hard way in about a month. (Hypothetical.) Which way should you do it? Suppose that the easy way makes the user click an extra button, whereas doing it the hard way makes the program a bit smarter and saves the user one click. Just one click. Which way should you do it?

Let's consider an analogy. Suppose I'm putting a sign up on my building. Is it OK to mount the sign six feet up on the wall, so that pedestrians have to duck or go around it? It's much easier for me to hang the sign if I don't have to set up a ladder and scaffold. It's only a minor annoyance to the pedestrians. It's not like it would block the sidewalk or anything. All they have to do is duck. (We'll just ignore the fact that pissing off all your potential customers is not a good business strategy.)

It's not ethical to worsen the lives of others, even a small bit, just to make things easy for yourself. These days, successful software is measured in millions of users, of people. Always be mindful of the impact your decisions--even small ones--have on those people. Accept large burdens to ease the burden on those people, even if your impact on any given individual is minuscule. The cumulative good you do that way will always overwhelm the individual costs you pay.

Ethical decisions in software development

Ethical decisions in software development do not only arise when we are talking about malware or copyright infringement.

If my programs are successful, then they impact the lives of thousands or millions of people. That impact can be positive or negative. The program can make their lives better or worse--even if just in minute proportions.

Every time I make a decision about how a program behaves, I am really deciding what my users can and cannot do. If I make an input required, I am forcing them to abide by my rules. (Hopefully, it is a rule they expressed first, at least.) Conversely, if I allow partial entry, then I am allowing some licentiousness. They can get away with less rigorous work.

That makes every programming decision an ethical decision.


Designing for Emergent Behavior

Lately, I've been grooving on emergent behavior. This fuzzy term comes from the equally fuzzy field of complexity studies. Mix complex rules together with non-linear effects (like humans) and you are likely to observe emergent behavior.

Recent example: web browser security holes. Any program inherently constitutes a complex system. Add in some dynamic reprogramming, downloadable code, system-level scripting, and millions upon millions of users and you've got a perfect petri dish. Sit back and watch the show. Unpredictable behavior will surely result.

In fact, "emergent" sometimes gets used as a synonym for "unpredictable". By and large, I believe that's true. In traditional systems design, "unpredictable" definitely equals "sloppy". Command-and-control, baby. Emergent behavior is what happens when your program goes off the rails.

The thing is, emergent behavior is where all the really interesting things happen. Predictable programs are boring. Big batch runs are predictable.

But, you have to consider the complete system. In a big batch run, the system is linear: inputs, transformation, outputs. No feedback. No humans. When you include humans in your view of the system, all these messy feedback loops start to appear. It gets even worse when you have multiple humans connected via the programs. Feedback loops that stretch from one person, through at least two programs, out to another person and back.

Any system that involves humans will exhibit emergent behaviors -- and this is a very good thing.

Are "designed" behavior and "emergent" behavior inherently incompatible? I don't think so. I think it may be possible to design for emergent behavior. I mean that certain designs will encourage some kinds of emergent behavior, whereas other designs encourage other kinds of emergent behavior. We can study the behaviors produced by various systems and designs to build a compendium of factors that are likely to facilitate one class of behavior or another.

For example: In every corporation, I see large volumes of data stored and shared in two different formats. The nature of the two systems encourages very different behaviors.

First we have relational databases. These tend to be large, expensive systems. As a result, they are centralized to one degree or another. The nature of relational algebra is that of a static schema. Therefore, changes are rigidly controlled. Centralized, rigidly controlled assets require guardians (DBAs) and gatekeepers (data modelers). Because the schema is well-defined and changes slowly, the database gains a degree of transparency. Applications are integrated through their databases. Generic tools for backup, reporting, extraction, and modeling become possible. The data can be accessed from a variety of applications in a relatively generic fashion.

The other data storage tool I see used widely is the spreadsheet. I almost never see a spreadsheet used to calculate numbers. Instead, most are used as a schema-less data storage tool. Often created directly by the business analysts, these spreadsheets are very conducive to change. Sharing is as simple as sending the file through email. Of course, this leads to version conflicts and concurrent update issues that have to be settled by hand (usually by printing a timestamp on the hardcopies!) There is not a central definition of the data structure. Indeed, neither the data nor the structures from spreadsheets can be reused. A spreadsheet makes the 2-dimensional structure of a table obvious, but it makes relationships difficult, if not impossible, to represent. Ergo, spreadsheet users don't do relationships. Access to the spreadsheets is always mediated by a single application.

So, two different systems. Both store structured (or at least semi-structured) data. The nature of each produces very different emergent behaviors. In one case, we find the evolution of acolytes of the RDBMS. In the other case, we find that a numeric analysis tool is being used for widespread data storage and sharing.

Given enough examples, enough time, and enough study, can we not learn to extrapolate from the essential nature of our designs to the most probable emergent behaviors? Even, perhaps, to select the emergent behaviors we desire first and, starting from those, decide what essential nature our designs must embody to make those behaviors most likely?