Wide Awake Developers

Cameron Purdy: 10 Ways to Botch Enterprise Java Scalability and Reliability

Here at QCon, well-known Java developer Cameron Purdy gave a fun talk called "10 Ways to Botch Enterprise Java Scalability and Reliability".  (He also gave this talk at JavaOne.)

While I could quibble with Cameron’s counting—there were actually more like 16 points thanks to some numerical overloading—I liked his content.  He echoes many of the antipatterns from Release It.   In particular, he talks about the problem I call "Unbounded Result Sets".  That is, whether using an ORM tool or straight database queries, you can always get back more than you expect. 

Sometimes, you get back way, way more than you expect. I once saw a small messaging table that normally held ten or twenty rows grow to over ten million rows.  The application servers never contemplated there could be so many messages.  Each one would attempt to fetch the entire contents of the table and turn them into objects.  So, each app server would run out of memory and crash.  That rolled back the transaction, allowing the next app server to impale itself on the same table.

Unbounded Result Sets don’t just happen from "SELECT * FROM FOO;", though.  Think about an ORM handling the parent-child relationship for you.  Simply calling something like customer.getOrders() will return every order for that customer.  By writing that call, you implicitly assume that the set of orders for a customer will always be small.  Maybe.  Maybe not.  How about blogUser.getPosts()?  Or tickerSymbol.getTrades()?

Unbounded Result Sets also happen with web services and SOAs.  A seemingly innocuous request for information could create an overwhelming deluge—an avalanche of XML that will bury your system.  At the least, reading the results can take a long time.  In the worst case, you will run out of memory and crash.

The fundamental flaw with an Unbounded Result Set is that you are trusting someone else not to harm you, either a data producer or a remote web service. 

Take charge of your own safety! 

Be defensive!

Don’t get hurt again in another dysfunctional relationship!
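
One way to be defensive, as a minimal sketch in Python using the standard sqlite3 module. The messages table, the fetch_messages function, and the 100-row cap are all assumptions made for illustration, not anything from Cameron’s talk. The idea is to ask the database for at most one row more than you are willing to hold, so an oversized table becomes an explicit error instead of an out-of-memory crash.

```python
import sqlite3

MAX_ROWS = 100  # hypothetical cap; size it to what the caller can safely hold


class TooManyRows(Exception):
    """Raised when the table holds more rows than the caller agreed to handle."""


def fetch_messages(conn, max_rows=MAX_ROWS):
    # Request one row beyond the cap: enough to detect overflow,
    # never enough to exhaust memory.
    cursor = conn.execute(
        "SELECT id, body FROM messages ORDER BY id LIMIT ?",
        (max_rows + 1,),
    )
    rows = cursor.fetchall()
    if len(rows) > max_rows:
        raise TooManyRows(f"more than {max_rows} messages waiting")
    return rows


def make_table(row_count):
    # Build an in-memory table so the sketch can be run anywhere.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO messages (body) VALUES (?)",
        [(f"message {i}",) for i in range(row_count)],
    )
    return conn
```

Whether to raise, truncate, or page through the rows is a design choice. The point is that the caller sets the limit rather than trusting the producer.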

Three Programming Language Problems Solved Forever

It’s often been the case that a difficult problem can be made easier by transforming it into a different representation.  Nowhere is that more true than in mathematics and the pseudo-mathematical realm of programming languages.

For example, LISP, Python, and Ruby all offer beautiful and concise constructs for operating on lists of things.  In each of them, you can make a function which iterates across a list, performing some operation on each element, and returning the resulting list.  C, C++, and Java do not offer any similar construct.  In each of these languages, iterating a list is a control-flow structure that requires multiple lines to express.  More significantly, the function expression of list comprehension can be composed. That is, you can embed a list comprehension structure inside of another function call or list operation.  In reading Programming Collective Intelligence, which uses Python as its implementation language, I’ve been amazed at how eloquent complex operations can be, especially when I mentally transliterate the same code into Java.
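
To make the composition point concrete, here is a small Python sketch. The order data and the variable names are made up for illustration.

```python
orders = [
    {"customer": "alice", "total": 120.0},
    {"customer": "bob", "total": 35.5},
    {"customer": "alice", "total": 64.5},
]

# A list comprehension is an expression, so it can be passed straight
# into another function call.
alice_spend = sum([o["total"] for o in orders if o["customer"] == "alice"])

# It can also be nested inside another comprehension: the outer one walks
# the distinct customers, the inner one totals each customer's orders.
spend_by_customer = [
    (name, sum([o["total"] for o in orders if o["customer"] == name]))
    for name in sorted({o["customer"] for o in orders})
]
```

The equivalent Java of the day would need an explicit loop, a temporary collection, and an accumulator variable for each of those two statements.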

In the evening keynote at QCon, Richard Gabriel covered 50 language topics, with a 50 word statement about each—along with a blend of music, art, and poetry. (If you’ve never seen Richard perform at a conference, it’s quite an experience.)  His presentation "50 in 50" also covered 50 years of programming and introduced languages as diverse as COBOL, SNOBOL, Piet, LISP, Perl, C, Algol, APL, IPL, Befunge, and HQ9+.

HQ9+ particularly caught my attention.  It takes the question of "simplifying the representation of problems" to the utmost extreme.

HQ9+ has a simple grammar.  There are 4 operations, each represented by a single character.

‘+’ increments the register.

‘H’ prints every language’s natal example, "Hello, world!"

‘Q’ makes every program into a quine.  It causes the interpreter to print the program text.  Quines are notoriously difficult assignments for second-year CS students.

‘9’ causes the interpreter to print the lyrics to the song "99 Bottles of Beer on the Wall."  This qualifies HQ9+ as a  real programming language, suitable for inclusion in the ultimate list of languages.

Three of these operators, H, Q, and 9, solve some very commonly expressed problems.  In a certain sense, they are the ultimate solution to those problems.  They cannot be reduced any further… you can’t get shorter than one character.

Of course, in an audience of programmers, HQ9+ always gets a laugh.  In fact, it was created specifically to make programmers laugh.  And, in fact, it’s a kind of meta-level humor. It’s not the programs that are funny, but the design of the language itself… an inside joke from one programmer to the rest of us.
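
In the spirit of the joke, an HQ9+ interpreter fits in a few lines of Python. This is a sketch, not a reference implementation: the run function, the exact lyric wording, and the choice to silently ignore unknown characters are my assumptions.

```python
def bottles(n):
    # Render "2 bottles", "1 bottle", or "no more bottles".
    if n == 0:
        return "no more bottles"
    return f"{n} bottle{'s' if n != 1 else ''}"


def ninety_nine_bottles():
    # One common wording of the song; other versions differ slightly.
    verses = []
    for n in range(99, 0, -1):
        verses.append(
            f"{bottles(n)} of beer on the wall, {bottles(n)} of beer.\n"
            f"Take one down and pass it around, "
            f"{bottles(n - 1)} of beer on the wall."
        )
    return "\n\n".join(verses)


def run(program):
    """Interpret an HQ9+ program and return (output, register)."""
    register = 0
    output = []
    for op in program:
        if op == "H":
            output.append("Hello, world!")
        elif op == "Q":
            output.append(program)  # the quine: print the program text
        elif op == "9":
            output.append(ninety_nine_bottles())
        elif op == "+":
            register += 1
        # Any other character is ignored (an assumption of this sketch).
    return "\n".join(output), register
```

For example, run("HQ+") returns the greeting followed by the program text, with the register at 1. The register is returned only so the sketch can be checked; nothing in the language itself can read it.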

Eric Evans: Strategic Design

Eric Evans, author of Domain-Driven Design and founder of Domain Language, embodies the philosophical side of programming.

He gave a wonderful talk on "Strategic Design".  During this talk, he stated a number of maxims that are worth pondering.

"Not all of a large system will be well designed."

"There are always multiple models."

"The diagram is not the model, but it is an expression of part of the model."

These are not principles to be followed, Evans says. Rather, these are fundamental laws of the universe. We must accept them and act accordingly, because disregarding them ends in tears.

Much of this material comes from Part 4 of Domain-Driven Design.  Evans laconically labeled this, "The part no one ever gets to."  Guilty.  But when I get back home to my library, I will make another go of it.

Evans also discusses the relative size of code, amount of time spent, and value of the three fundamental portions of a system: the core domain, supporting subdomains, and generic subdomains.

Generic subdomains are horizontal. You might find these in any system in any company in the world.

Supporting subdomains are business-specific, but not of value to this particular system. That is, they are a necessary cost, but do not provide value.

The core domain is the reason for the system. It is the business-specific functionality that makes this system worth building.

Now, in a typical development process (and especially a rewrite project), where does the team’s time go? Most of it will go to the largest bulk: the generic subdomains. This is the stuff that has to exist, but it adds no value and is not specific to the company’s business. The next largest fraction goes to the supporting subdomains. Finally, the smallest portion of time—and usually the last portion of time—goes to the core domain.

That means the very last thing delivered is the reason for the system’s existence in the first place.  Ouch.

Kent Beck’s Keynote: “Trends in Agile Development”

Kent Beck spoke with his characteristic mix of humor, intelligence, and empathy.  Throughout his career, Kent has brought a consistently humanistic view of development.  That is, software is written by humans–emotional, fallible, creative, and messy–for other humans.  Any attempt to treat development as robotic will end in tears.

During his keynote, Kent talked about engaging people through appreciative inquiry.  This is a learnable technique, based in human psychology, that helps focus on positive attributes.  It counters the negativity that so many developers and engineers are prone to.  (My take: we spend a lot of time, necessarily, focusing on how things can go wrong.  Whether by nature or by experience, that leads us to a pessimistic view of the world.)

Appreciative inquiry begins by asking, "What do we do well?"  Even if all you can say is that the garbage cans get emptied every night, that’s at least something that works well.  Build from there.

He specifically recommended The Thin Book of Appreciative Inquiry, which I’ve already ordered.

I should also note that Kent has a new book out, called Implementation Patterns, which he described as being about, "Communicating with other people, through code."

From QCon San Francisco

I’m at QCon San Francisco this week.  (An aside: after being a speaker at No Fluff, Just Stuff, it’s interesting to be the audience again.  As usual, on returning from travels in a different domain, one has a new perspective on familiar scenes.) This conference targets senior developers, architects, and project managers.  One of the very appealing things is the track on "Architectures you’ve always wondered about".  This covers high-volume architectures for sites such as LinkedIn and eBay as well as other networked applications like Second Life.  These applications live and work in thin air, where traffic levels far outstrip most sites in the world.  Performance and scalability are two of my personal themes, so I’m very interested in learning from these pioneers about what happens when you’ve blown past the limits of traditional 3-tier, app-server centered architecture.

Through the remainder of the week, I’ll be blogging five ideas, insights, or experiences from each day of the conference.

Normal Accidents

While I was writing Release It!, I was influenced by James R. Chiles’s book Inviting Disaster. One of Chiles’s sources is Normal Accidents, by Charles Perrow. I’ve just started reading, and even the first two pages offer great insight.

Normal Accidents describes systems that are inherently unstable, to the point that system failures are inevitable and should be expected.  These "normal" accidents result from systems that exhibit the characteristics of high "interactive complexity" and "tight coupling".

Interactive complexity refers to internal linkages, hidden from the view of operators. These invisible relations between components or subsystems produce multiple effects from a single cause.  They can also produce outcomes that do not seem to relate to their inputs.

In software systems, interactive complexity is endemic. Any time two programs share a server or database, they are linked. Any time a system contains a feedback loop, it inherently has higher interactive complexity. Feedback loops aren’t always obvious.  For example, suppose a new software release consumes a fraction more CPU per transaction than before. That small increment might push the server from a non-contending regime into a contending one.  Once in contention, the added CPU usage creates more latency. That latency, and the increase in task-switching overhead, produces more latency. Positive feedback.
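
A back-of-the-envelope model shows how abrupt that regime change can be. The Python sketch below uses the textbook single-server queueing approximation, response time = service time / (1 - utilization). That model is my assumption rather than anything from Perrow, and the traffic numbers are invented. It does not model the task-switching feedback itself, only the nonlinearity that sets it off.

```python
def response_time(arrival_rate, service_time):
    """Mean response time for a single-server queue (M/M/1 approximation).

    arrival_rate is in requests per second, service_time in seconds.
    """
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")  # the server can never catch up
    return service_time / (1.0 - utilization)


# Hypothetical numbers: 90 requests/second, 10 ms of CPU per transaction.
before = response_time(90, 0.0100)
# The new release costs 5% more CPU per transaction.
after = response_time(90, 0.0105)

print(f"before: {before * 1000:.0f} ms, after: {after * 1000:.0f} ms")
```

With these made-up numbers, 5% more CPU per transaction nearly doubles the mean response time, from about 100 ms to about 191 ms.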

High interactive complexity leads operators to misunderstand the system and its warning signs. Thus misinformed, they act in ways that do not avert the crisis and may actually precipitate it. 

When processes happen very fast, and there is no way to isolate one part of the system from another, the system is tightly coupled.  Tight coupling allows small incidents to spread into large-scale failures.

Classic "web architecture" exhibits both high interactive complexity and tight coupling. Hence, we should expect "normal" accidents.  Uptime will be dominated by the occurrence of these accidents, rather than the individual probability of failure in each component.

The first section of Release It! deals exclusively with system stability.  It shows how to reduce coupling and diminish interactive complexity. 

You Keep Using That Word. I Do Not Think It Means What You Think It Means.

"Scalable" is a tricky word. We use it like there’s one single definition. We speak as if it’s binary: this architecture is scalable, that one isn’t.

The first really tough thing about scalability is finding a useful definition. Here’s the one I use:

Marginal revenue / transaction > Marginal cost / transaction

The cost per transaction has to account for all cost factors: bandwidth, server capacity, physical infrastructure, administration, operations, backups, and the cost of capital.

(And, by the way, it’s even better when the ratio of revenue to cost per transaction grows as the volume increases.)

The second really tough thing about scalability and architecture is that there isn’t one that’s right.  An architecture may work perfectly well for a range of transaction volumes, but fail badly as one variable gets large.

Don’t treat "scalability" as either a binary issue or a moral failing. Ask instead, "how far will this architecture scale before the marginal cost deteriorates relative to the marginal revenue?" Then, follow that up with, "What part of the architecture will hit a scaling limit, and what can I incrementally replace to remove that limit?"
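
That first question can be sketched as a small search. Everything in this Python example is hypothetical: the flat 4.5 cents of marginal revenue, the shape of the marginal cost curve, and the scaling_limit helper are assumptions chosen only to illustrate the definition above.

```python
def marginal_revenue(volume):
    # Hypothetical: every additional transaction earns a flat 4.5 cents.
    return 0.045


def marginal_cost(volume):
    # Hypothetical: 1 cent per transaction, plus a penalty that grows
    # with volume (bigger servers, more administration, more capital).
    return 0.01 + 2e-9 * volume


def scaling_limit(volumes):
    """Return the first volume at which one more transaction costs at least
    as much as it earns, or None if no volume in `volumes` reaches that point."""
    for volume in volumes:
        if marginal_cost(volume) >= marginal_revenue(volume):
            return volume
    return None


# Probe every million transactions, up to 50 million.
limit = scaling_limit(range(0, 50_000_001, 1_000_000))
print(limit)
```

With these invented curves the architecture stops paying for itself at 18 million transactions. The interesting engineering work is in finding out which term of the cost curve is responsible, since that is the part to replace.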

Engineering in the White Space

"Is software Engineering, or is it Art?"

Debate between the Artisans and the Engineers has simmered, and occasionally boiled, since the very introduction of the phrase "Software Engineering".  I won’t restate all the points on both sides here, since I would surely forget someone’s pet argument, and also because I see no need to be redundant.

Deep in my heart, I believe that building programs is art and architecture, but not engineering.

But, what if you’re not just building programs?

Programs and Systems

A "program" has a few characteristics that I’ll assign here:

  1. It accepts input.
  2. It produces output.
  3. It runs a sequence of instructions.
  4. Statically, it exhibits cohesion in its executable form. [*]
  5. Dynamically, it exhibits cohesion in its address space. [**]

* That is, the transitive closure of all code to be executed is finite, although it may not all be known in advance of execution.  This allows dynamic extension via plugins, but not, for example, dynamic execution of any scripts or code found on the Web.  So, a web browser is a program, but JavaScript executed on some page is an independent program, not part of the browser itself.

** For "address space", feel free to substitute "object space", "process space", or "virtual memory". Cohesion requires that all the code that can access the address space should be regarded as a single program.  (IPC through shared memory is a special case of an output, and should be considered more akin to a database or memory-mapped file than to part of the program’s own address space.)

Suppose you have two separate scripts that each manipulate the same database.  I would regard those as two separate—though not independent—programs.  A single instance of Tomcat may contain several independent programs, but all the servlets in one EAR file are part of one program.

For the moment, I will not consider trivial objections, such as two distinct sets of functionality that happen to be packaged and delivered in a single EAR file.  It’s less interesting to me whether code does access the entire address space than whether it could.  A library checkout program that includes functions for both librarians and patrons may not use common code for card number lookup, but it could.  (And, arguably, it should.)  That makes it one program, in my eyes.

A "System", on the other hand, consists of interdependent programs that have commonalities in their inputs and outputs.  They could be arranged in a chain, a web, or a loop.  No matter, if one program’s input depends on another program’s output, then they are part of a system.

Systems can be composed, whereas programs cannot.  
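
The definition suggests a mechanical test: treat each program as a node, link two programs wherever one reads what the other writes, and call each connected group a system. A sketch in Python, where the program names, the resource names, and the find_systems helper are all hypothetical:

```python
def find_systems(programs):
    """Group programs into systems.

    `programs` maps a program name to a pair (inputs, outputs), each a set
    of resource names. Two programs are linked when one's output is the
    other's input.
    """
    names = list(programs)
    linked = {name: set() for name in names}
    for a in names:
        for b in names:
            if a != b and programs[a][1] & programs[b][0]:
                linked[a].add(b)
                linked[b].add(a)

    systems, seen = [], set()
    for name in names:
        if name in seen:
            continue
        # Collect everything reachable from this program.
        group, frontier = set(), [name]
        while frontier:
            current = frontier.pop()
            if current in group:
                continue
            group.add(current)
            frontier.extend(linked[current] - group)
        seen |= group
        systems.append(group)
    return systems


programs = {
    "order_entry": ({"web_form"}, {"orders_table"}),
    "fulfillment": ({"orders_table"}, {"shipping_file"}),
    "ftp_job": ({"shipping_file"}, {"gateway_inbox"}),
    "payroll": ({"timesheets"}, {"paychecks"}),
}
```

Here find_systems(programs) puts the first three programs in one system, because they form a chain, and leaves payroll in a system of its own.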

Tricky White Space

Some programs run all the time, responding to intermittent inputs; these we call "servers".  It is very common to see servers represented as a deceptively simple little rectangle on a diagram.  Between servers, we draw little arrows to indicate communication, of some sort.

One little arrow might mean, "Synchronous request/reply using SOAP-XML over HTTP." That’s quite a lot of information for one little glyph to carry.  There’s not usually enough room to write all that, so we label the unfortunate arrow with either "XML over HTTP"—if viewing it from an internal perspective—or "SKU Lookup"—if we have an external perspective.

That little arrow, bravely bridging the white space between programs, looks like a direct contact.  It is Voyager, carrying its recorded message to parts unknown.  It is Arecibo, blasting a hopeful greeting into the endless dark.

Well, not really…

These days, the white space isn’t as empty as it once was.  A kind of luminiferous ether fills the void between servers on the diagram.

The Substrate

There is many a slip ‘twixt cup and lip.  In between points A and B on our diagram, there exist some or all of the following:

  • Network interface cards
  • Network switches
  • Layer 2 - 3 firewalls
  • Layer 7 (application) firewalls
  • Intrusion Detection and Prevention Systems
  • Message queues
  • Message brokers
  • XML transformation engines
  • Flat file translations
  • FTP servers
  • Polling jobs
  • Database "landing zone" tables
  • ETL scripts
  • Metro-area SONET rings
  • MPLS gateways
  • Trunk lines
  • Oceans
  • Ocean liners
  • Philippine fishing trawlers (see, "Underwater Cable Break")

Even in the simple cases, there will be four or five computers between program A and B, each running their own programs to handle things like packet switching, traffic analysis, routing, threat analysis, and so on.

I’ve seen a single arrow, running from one server to another, labelled "Fulfillment".  It so happened that one server was inside my client’s company while the other server was in a fulfillment house’s company.  That little arrow, so critical to customer satisfaction, really represented a Byzantine chain of events that resembled a game of "Mousetrap" more than a single interface.  It had messages going to message brokers that appended lines to files, which were later picked up by an hourly job that would FTP the files to the "gateway" server (still inside my client’s company).  The gateway server read each line from the file and constructed an XML message, which it then sent via HTTP to the fulfillment house.

It Stays Up

We analogize bridge-building as the epitome of engineering. (Side note: I live in the Twin Cities area, so we’re a little leery of bridge engineering right now.  Might better find another analogy, OK?)  Engineering a bridge starts by examining the static and dynamic load factors that the bridge must support: traffic density, weight, wind and water forces, ice, snow, and so on.

Bridging between two programs should consider static and dynamic loads, too.  Instead of just "SOAP-XML over HTTP", that one little arrow should also say, "Expect one query per HTTP request and send back one response per HTTP reply.  Expect up to 100 requests per second, and deliver responses in less than 250 milliseconds 99.999% of the time."
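
Written down as data, the arrow’s load specification might look like the following Python sketch. The InterfaceContract class and its field names are my own invention; the numbers are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceContract:
    protocol: str
    max_requests_per_second: int
    max_latency_ms: float
    latency_target: float  # fraction of replies that must beat max_latency_ms

    def latency_met(self, observed_latencies_ms):
        """True when enough of the observed replies were fast enough."""
        if not observed_latencies_ms:
            return True  # nothing observed, nothing violated
        fast = sum(1 for ms in observed_latencies_ms if ms < self.max_latency_ms)
        return fast / len(observed_latencies_ms) >= self.latency_target


sku_lookup = InterfaceContract(
    protocol="SOAP-XML over HTTP",
    max_requests_per_second=100,
    max_latency_ms=250,
    latency_target=0.99999,
)
```

A contract in this form can be checked against production measurements, which a label on a diagram cannot.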

It Falls Down

Building the right failure modes is vital. The last job of any structure is to fall down well. The same is true for programs, and for our hardy little arrow.

The interface needs to define what happens on each end when things come unglued. What if the caller sends more than 100 requests per second? Is it OK to refuse them? Should the receiver drop requests on the floor, refuse politely, or make the best effort possible?
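
One of those answers, refusing politely, can be sketched as a fixed-window rate limiter. This is a sketch under assumptions: the RateLimiter class is hypothetical, the clock is injected so the behavior can be demonstrated without waiting, and a real receiver might prefer a token bucket or a bounded queue.

```python
import time


class RateLimiter:
    """Admit at most `limit` requests in each one-second window."""

    def __init__(self, limit=100, clock=time.monotonic):
        self.limit = limit
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self):
        now = self.clock()
        if now - self.window_start >= 1.0:
            # A new window begins; forget the old count.
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False  # over the limit: refuse, e.g. with an HTTP 503
        self.count += 1
        return True
```

The receiver calls allow() once per incoming request and rejects the request when it returns False. The important part is that the refusal is a decision written into the interface, not an accident of whichever resource runs out first.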

What should the caller do when replies take more than 250 milliseconds? Should it retry the call? Should it wait until later, or assume the receiver has failed and move on without that function?
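
Here is one possible caller-side policy, sketched in Python: retry a bounded number of times, then move on without the function. The call_with_fallback helper is hypothetical, and it assumes the request function accepts a timeout in seconds and raises TimeoutError when the reply is late.

```python
def call_with_fallback(request, timeout_s=0.25, attempts=2, fallback=None):
    """Try `request` up to `attempts` times, then give up and return `fallback`.

    `request` is any callable that takes a timeout in seconds and raises
    TimeoutError when the receiver does not answer in time.
    """
    for _ in range(attempts):
        try:
            return request(timeout_s)
        except TimeoutError:
            continue  # the receiver was too slow; try again
    return fallback  # assume the receiver has failed and carry on without it
```

Retrying a slow receiver adds load to it, so the attempt count deserves the same care as the timeout itself.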

What happens when the caller sends a request with version 1.0 of the protocol and gets back a reply in version 1.1? What if it gets back some HTML instead of XML?  Or an MP3 file instead of XML?
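
A caller can at least refuse to be surprised. The Python sketch below checks the content type and protocol version before trusting a reply. The parse_reply function, the version attribute on the root element, and the accepted values are assumptions made for illustration; a real protocol would define its own.

```python
import xml.etree.ElementTree as ET

ACCEPTED_CONTENT_TYPES = {"text/xml", "application/xml"}
SUPPORTED_VERSIONS = {"1.0"}  # hypothetical: this caller only speaks 1.0


class UnexpectedReply(Exception):
    """The receiver sent something the caller did not agree to handle."""


def parse_reply(content_type, body):
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in ACCEPTED_CONTENT_TYPES:
        raise UnexpectedReply(f"expected XML, got {media_type}")
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        raise UnexpectedReply("body is not well-formed XML") from exc
    version = root.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise UnexpectedReply(f"unsupported protocol version: {version}")
    return root
```

An HTML error page or an MP3 file now fails at the boundary with a clear exception, instead of deep inside code that expected a SKU.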

When a bridge falls down, it is shocking, horrifying, and often fatal. Computers and networks, on the other hand, fall down all the time.  They always will.  Therefore, it’s incumbent on us to ensure that individual computers and networks fail in predictable ways. We need to know what happens to that arrow when one end disappears for a while.

In the White Space

This, then, is the essence of engineering in the white space. Decide what kind of load that arrow must support.  Figure out what to do when the demand is more than it can bear.  Decide what happens when the substrate beneath it falls apart, or when the duplicitous rectangle on the other end goes bonkers.

Inside the boxes, we find art.

The arrows demand engineering.