Let’s engage in a thought experiment for a moment. Suppose that software was trivial to create and only ever needed to be used once. Completely disposable. So, somebody comes to you and says, “I have a problem and I need you to solve it. I need a tool that will do blah-de-blah for a little while.” You could think of the software the way that a carpenter thinks of a jig for cutting a piece of wood on a table saw, or a metalworker thinks of creating a jig to drill a hole at the right angle and depth.
If software were like this, you would never care about its architecture. You would spend a few minutes to create the thing that was needed, it would be used for the job at hand, and then it would be thrown away. It really wouldn’t matter how good the software was on the inside–how easy it was to change–because you’d never change it! It wouldn’t matter how it adapted to changing business requirements, because you’d just create a new one when the new requirement came up. In this thought experiment we wouldn’t worry about architecture.
The key difference between this thought experiment and actual software? Of course, actual software is not disposable. It has a lifespan over some amount of time. Really, it’s the time dimension that makes architecture important.
Over time, we need many different people to work effectively in the software. Over time, we need the throughput of features to stay constant, or at least not decrease too much. Maybe it even increases in particularly nice cases. Over time, the business needs change, so we need to adapt the software.
It’s really time that makes us care about architecture.
Isn’t it interesting then, that we never include time as a dimension in our architecture descriptions?
FaKod (I think that translates as “The Fatalistic Coder”?) has written a nice Scala implementation of the Circuit Breaker pattern, and even better, has made it available on GitHub.
Check out http://github.com/FaKod/Circuit-Breaker-for-Scala for the code.
The Circuit Breaker can be mixed in to any type. See http://wiki.github.com/FaKod/Circuit-Breaker-for-Scala/ for an example of usage.
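For readers who haven't met the pattern, here is a minimal sketch of the general idea, written in Python rather than Scala. It is not FaKod's API. The class name, the parameters, and the injectable `clock` are all made up for illustration: after a run of consecutive failures the breaker "opens" and fails fast, and once a timeout has passed it lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # Hypothetical names and defaults; a sketch of the pattern, not a library.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast without touching the protected resource.
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap each call to the integration point as `breaker.call(fetch_inventory, sku)`, where `fetch_inventory` stands in for whatever remote call you want to protect.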
I’ve been asked to sit on a panel regarding the future of software development. This is always risky and makes me nervous, for two reasons. First, prediction is a notoriously low success-rate activity. Second, the people you always see making predictions like this are usually well past their “use by” date. Nevertheless, here is a collection of barely-related thoughts I have on that subject.
Two obvious trends are cloud computing and mobile access. They are complementary. As the number of people and devices on the net increases, our ability to shape traffic on the demand side gets worse. Spikes in demand will happen faster and reach higher levels over time. Mobile devices exacerbate the demand side problems by greatly increasing both the number of people on the net and the fraction of their time they are able to access it.
Large traffic volumes both create and demand large data. Our tools for processing tera- and petabyte datasets will improve dramatically. Map/Reduce computing (a la Hadoop) has created attention and excitement in this space, but it is ultimately just one tool among many. We need better languages to help us think about and express large data problems. In particular, we need a language that makes big data processing accessible to people with little background in statistics or algorithms.
Speaking of languages, many of the problems we face today cannot be solved inside a single language or application. The behavior of a web site today cannot be adequately explained or reasoned about just by examining the application code. Instead, a site picks up attributes of behavior from a multitude of sources: application code, web server configuration, edge caching servers, data grid servers, offline or asynchronous processing, machine learning elements, active network devices (such as application firewalls), and data stores. “Programming” as we would describe it today–coding application behavior in a request handler–defines a diminishing portion of the behavior. We lack tools or languages to express and reason about these distributed, extended, fragmented systems. Consequently, it is difficult to predict the functionality, performance, capacity, scalability, and availability of these systems.
Some of this will be mitigated naturally as application-specific functions disappear into tools and frameworks. Companies innovating at the leading edge of scalability today are doing things in application-specific behavior to compensate for deficiencies in tools and platforms. For example, caching servers could arguably disappear into storage engines and no-one would complain. In other words, don’t count the database vendors out yet. You’ll see key-value stores and in-memory data grid features popping up in relational databases any day now.
In general, it appears that Objects will diminish as a programming paradigm. Object-oriented programming will still exist… I’m not claiming “the death of objects” or something silly like that. However, OO will become just one more paradigm among several, rather than the dominant paradigm it has been for the last 15 years. “Object oriented” will no longer be synonymous with “good”.
Regarding Java. I fear that Java will have to be abandoned to the “Enterprise Development” world. It will be relegated to the hands of cut-rate business coders bashing out their gray business applications for $30 / hour. We’ve passed the tipping point on this one. We used to joke that Java would be the next COBOL, but that doesn’t seem as funny now that it’s true. Java will continue to exist. Millions of lines of it will be written each year. It won’t be the driver of innovation, though. If you are an individual programmer, I’d recommend that you learn another language immediately and differentiate yourself from the hordes of low-skill, low-rent outsource coders that will service the mainstream Java consumer.
Where will innovation come from? Although some of the blush seems to be coming off Ruby, the reduction in hype has mainly allowed Ruby and Ruby on Rails developers to knuckle down and produce. That community continues to drive tremendous innovation. Many of the interesting developments here relate to process. Ruby developers have given us fantastic tools like Gems and Capistrano that let small teams outperform and outproduce groups four times their size.
To my great surprise, data storage has become a hotbed of innovation in the last few years. Some of this is driven by the high-scalability fetishists, which is probably the wrong reason for 98% of companies and teams. However, innovations around column stores, graph databases, and key-value stores offer developers new tools to reduce the impedance mismatch between their data storage and their programming language. We spent twenty years trying to squeeze objects into relational databases. Aside from the object databases, which were an early casualty of Oracle’s ascension, we mostly focused on changing the application code through framework after framework and ORM after ORM. It’s refreshing to see storage models that are easier to use and easier to modify.
This will also cause another flurry of “reactive innovation” from the database vendors, just as we saw with “Universal Databases” in the mid-90s. The big players here–Microsoft and Oracle–won’t let some schemaless little upstarts erode their market share. More significantly, they aren’t about to let their flagship products–and the ones which give them beachheads inside every major corporation–get intermediated by some open-source frameworks banged up by the social network giants. Look for big moves by these vendors into high scalability, agile storage, and eventual consistency storage.
People who don’t live in operations can carry some funny misconceptions in their heads. Some of my personal faves:
- Just add some servers!
- I want a report of every configuration setting that’s different between production and QA!
- We’re going to make sure this (outage) never happens again!
I’ve recently been reminded of this during some discussions about disaster recovery. This topic seems to breed misconceptions. Somewhere, I think most people carry around a mental model of failover that looks like this:
That is, failover is essentially automatic and magical.
Sadly, there are many intermediate states that aren’t found in this mental model. For example, there can be quite some time between a failure and its detection. Depending on the detection and notification, there can be quite a delay before failover is initiated at all. (I once spoke with a retailer whose primary notification mechanism seemed to be the Marketing VP’s wife.)
Once you account for delays, you also have to account for faulty mechanisms. Failover itself often fails, usually due to configuration drift. Regular drills and failover exercises are the only way to ensure that failover works when you need it. When the failover mechanisms themselves fail, your system gets thrown into one of these terminal states that require manual recovery.
Just off the cuff, I think the full model looks a lot more like this:
It’s worth considering each of these states and asking yourself the following questions:
- Is the state transition triggered automatically or manually?
- Is the transition step executed by hand or through automation?
- How long will the state transition take?
- How can I tell whether it worked or not?
- How can I recover if it didn’t work?
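One way to keep yourself honest about these questions is to write the answers down per transition and let a script add them up. The sketch below is only an illustration. The state names, the durations, and the verification and recovery notes are all invented placeholders, not measurements or recommendations.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    source: str
    target: str
    auto_triggered: bool     # is the transition triggered automatically?
    auto_executed: bool      # is the step itself carried out by automation?
    expected_minutes: float  # how long will it take?
    how_to_verify: str       # how can I tell whether it worked?
    how_to_recover: str      # what do I do if it didn't?


def manual_steps(path):
    """Transitions that need a human to trigger or to execute them."""
    return [t for t in path if not (t.auto_triggered and t.auto_executed)]


def expected_downtime_minutes(path):
    """Total expected time along one path through the failover model."""
    return sum(t.expected_minutes for t in path)


# Hypothetical path with placeholder numbers, for illustration only.
HAPPY_PATH = [
    Transition("primary_failed", "failure_detected", True, True, 5,
               "monitoring alert fires", "fall back to manual checks"),
    Transition("failure_detected", "failover_started", False, False, 20,
               "on-call engineer acknowledges the page", "escalate"),
    Transition("failover_started", "secondary_serving", False, True, 15,
               "synthetic transaction passes", "follow the manual recovery runbook"),
]
```

With these made-up numbers, `expected_downtime_minutes(HAPPY_PATH)` comes to 40 minutes, and `manual_steps(HAPPY_PATH)` shows that two of the three transitions still wait on a person.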
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable. -Leslie Lamport
On my way to QCon Tokyo and QCon China, I had some time to kill so I headed over to Delta’s Skyclub lounge. I’ve been a member for a few years now. And why not? I mean, who could pass up tepid coffee, stale party snacks, and a TV permanently locked to CNN? Wait… that actually doesn’t sound like such a hot deal.
Oh! I remember, it’s for the wifi access. (Well, that plus reliably clean bathrooms, but we need not discuss that.) Being able to count on wifi access without paying for yet another data plan has been pretty helpful for me. (As an aside, I might change my tune once I try a mifi box. Carrying my own hotspot sounds even better.)
Like most wifi providers, the Skyclub has a captive portal. Before you can get a TCP/IP connection to anything, you have to submit a form with a checkbox to agree to 89 pages of terms and conditions. I’m well aware that Delta’s lawyers are trying to make sure the company isn’t liable if I go downloading bootlegs of every Ally McBeal episode. But I really don’t know if these agreements are enforceable. For all I know, page 83 has me agreeing to 7 years indentured servitude cleaning Delta’s toilets.
Anyway, Delta has outsourced operations of their wifi network to Concourse Communications. And apparently, they’ve had an outage all morning that has blocked anyone from using wifi in the Minneapolis Skyclubs. When I submit the form with the checkbox, I get the following error page:
Including this bit of stacktrace:
There’s a lot to dislike here.
- Why is this yelling at me, the user? To anyone who isn’t a web site developer, this makes it sound like the user did something wrong. There’s a ton of scary language here: "instance-specific error", "allow remote connections", "Named Pipes Provider"… heck, this sounds like it’s accusing the user of hacking servers. "Stack trace" sure sounds like the Feds are hot on somebody’s trail, doesn’t it?
- Isn’t it fabulous to know that Ken keeps his projects on his D: drive? If I had to lay bets, I’d say that Ken screwed up his configuration string. In fact, the whole problem smells like a failed deployment or poorly executed change. Ken probably pushed some code out late on a Friday afternoon, then boogied out of town. My prediction (totally unverifiable, of course) is that this problem will take less than 5 minutes to resolve, once Ken gets his ass back from the beach.
- We mere users get to see quite a bit of internal information here. Nothing really damaging, unless of course Wilson ORMapper has some security defects or something like that.
- Stepping back from this specific error message, we have the larger question: is it sensible to couple availability of the network to the availability of this check-the-box application? Accessing the network is the primary purpose of this whole system. It is the most critical feature. Is collecting a compulsory boolean "true" from every user really as important as the reason the whole damn thing was built in the first place? Of course not! (As an aside, this is an example of Le Chatelier’s Principle: "Complex systems tend to oppose their own proper function.")
We see this kind of operational coupling all the time. Non-critical features are allowed to damage or destroy critical features. Maybe there’s a single thread pool that services all kinds of requests, rather than reserving a separate pool for the important things. Maybe a process is overly linearized and doesn’t allow for secondary, after-the-fact processing. Or, maybe a critical and a non-critical system both share an enterprise service—producing a common-mode dependency.
Whatever the proximate cause, the underlying problem is lack of diligence in operational decoupling.
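To make that concrete, here is a minimal sketch of one way the check-the-box step could be decoupled from network access. It assumes the business can tolerate recording the acceptance after the fact. Every name in it (`admit_user`, `record_acceptance`, `pending_acceptances`) is hypothetical, and I have no idea how the real portal is built.

```python
import logging
from collections import deque

log = logging.getLogger("portal")

# Acceptances that could not be stored right away. A background job
# (not shown, and purely hypothetical) would replay them later.
pending_acceptances = deque()


def admit_user(user_id, record_acceptance):
    """Grant network access; treat recording the acceptance as best effort.

    `record_acceptance` is whatever callable writes to the database.
    If it fails, the user still gets on the network, and the write is
    queued for secondary, after-the-fact processing.
    """
    try:
        record_acceptance(user_id)
    except Exception:
        log.exception("could not record acceptance for %s; queued for retry", user_id)
        pending_acceptances.append(user_id)
    return True  # the critical feature does not depend on the non-critical one
```

The design choice is that a database outage degrades bookkeeping rather than connectivity. Whether that trade is acceptable is a legal and business question, not a technical one.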
I’m working on a syllabus for an extensive course on web architecture. This will be for experienced programmers looking to become architects.
Like all of my work about architecture, this covers technology, business, and strategic aspects, so there’s an emphasis on creating high-velocity, competitive organizations.
In general, I’m aiming for a mark that’s just behind the bleeding edge. So, I’m including several of the NoSQL persistence technologies, for example, but not including Erjang because it’s too early. (Or is that “erl-y”? )
(What I’d really love to do is make a screencast series out of all of these. I’m daunted, though. There’s a lot of ground to cover here!)
I’m interested in hearing your feedback. What would you add? Remove?
Methods and Processes
- Systems Thinking/Learning Organization
- High Velocity Organizations
- Safety Culture
- Error-Inducing Systems (“Normal Accidents”)
- Points of Leverage
- Fundamental Dynamics: Iteration, Variation, Selection, Feedback, Constraint
- 5D architecture
- Failures of Intuition
- Critical Chain
- Lean Software Development
- Real Options
- Strategic Navigation
- Tempo, Adaptation
- REST / ROA
- Pipes & Filters
- App-server centric
- Event-Driven Architecture
- The “architecture” of the web
- HTTP 1.0 & 1.1
- Browser fetch behaviors
- HTTP Intermediaries
The Nature of the Web
- Mashups/APIs/Linked Open Data
- Unit testing
- BDD/Spec testing
- “Web-shaped” persistence
- 8 Fallacies of Distributed Computing
- CAP Theorem
Languages and Frameworks
- Code Smells
- Object Thinking
- Object Design
- Functional Thinking
- API Design
- Design for Operations
- Information Hiding
- Recognizing Coupling
- Cloud (AWS)
Build and Version Control
- Private repos
- Collaboration across projects
There’s an old joke about a couple of folks on a plane who hear the captain successively announce that they’ve lost one, two, then three engines. Each time, he reassures the passengers that they’re OK, but will be progressively later to land. After losing the third engine, one passenger tells the other, “If the last one goes, we’ll be up here all night!”
It’s a remarkable aircraft that can fly on just one out of four engines. Most four-engine jets need at least two to cruise. (I’ve been told that they can make a controlled descent on one engine, but can’t maintain altitude.)
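Likewise, your web app probably needs more than just one functioning server to handle demand. The usual approach to computing availability is to compute the odds that at least one server survives, that is, one minus the probability that every server fails (assuming the servers fail independently):
$$A = 1 - \prod_{i=1}^{n} (1 - a_i)$$
Here $a_i$ is the availability of server $i$ and $n$ is the number of servers. If all the servers are identical, meaning that we expect them to have the same failure rate, then this reduces to the more familiar form:
$$A = 1 - (1 - a)^n$$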
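The "at least one server survives" calculation is easy to sketch in Python. Both functions assume that servers fail independently, which is an assumption on my part and rarely quite true in practice. The second function, `availability_at_least`, is my own addition for comparison: it uses the binomial distribution to ask for at least `k` of `n` identical servers. The function names are made up for illustration.

```python
from math import comb, prod


def availability_any(server_availabilities):
    """Probability that at least one server is up, given each server's
    availability (a fraction between 0 and 1) and independent failures."""
    return 1.0 - prod(1.0 - a for a in server_availabilities)


def availability_at_least(k, n, a):
    """Probability that at least k of n identical servers are up,
    each with availability a, again assuming independent failures."""
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))
```

Comparing `availability_any([0.9] * 4)` with `availability_at_least(2, 4, 0.9)` shows how much of the headline number disappears once you need two of the four servers rather than one.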
The mighty Mississippi River starts in Minnesota, at Lake Itasca. Every kid in Minnesota has to make the ritual pilgrimage to Itasca State Park at some point, where wading across North America’s longest river is a rite of passage.
One of the very interesting things in Itasca State Park is a section of forest that is fenced off so that deer cannot enter it. It’s part of a decades-long experiment to see how forests are affected by browsing herbivores. What’s really interesting is that not only is the quantity of plants different inside the protected area, but the types of plants and trees are different, too. Because deer prefer to nibble on younger trees, fewer saplings survive in the main body of the forest than in the fenced-off portion. Outside the fence, the distribution of tree size and age is biased toward older trees. The population of trees is weighted more toward resinous species like pines, which deer prefer not to eat. Inside the fence, more saplings survive into young maturity, so you see a more even distribution of tree ages and a wider diversity of species represented in the mature trees. The changes in the canopy affect the ground cover, which in turn changes how deer could (if allowed) reach the trees and browse them.
So, here’s a feedback loop that involves deer, trees, leaves, and brush. The net result is a different ecosystem (albeit a slightly artificial one).
Most physical and biological systems are like this in several ways, particularly relating to feedback. In our artificial systems (electrical, mechanical, symbolic, or semantic) we build in feedback mechanisms as a deliberate control. These are often one dimensional, proportional, and negative.
In natural systems, feedback arises everywhere. Sometimes it proves helpful for the long-term stability of the system, in which case the feedback itself gets reinforced by the existence and perpetuation of the system it exists within. In a sense, the system adapts to reinforce beneficial feedback. Conversely, feedback webs that cause too much instability will, like an overly aggressive virus, lead to destruction of their host system and disappear. So, we can see the constituents of a system co-evolving with each other and the system itself.
The old “microphone-amplifier-speaker-squealing” example of feedback really fails here. We lack both language and metaphor to really grasp this kind of interaction over time. In part, I think that’s because we like to separate the world into isolated components and only talk about components at a single level of abstraction. The trouble is that abstractions like “level of abstraction” only exist in our minds.
Here’s another example of coevolution, courtesy of Jared Diamond in “Guns, Germs, and Steel”. I’ll apologize in advance for oversimplifying; I’m devoting a paragraph to an argument he develops across entire chapters.
At some point, a group of nomads decided that the seeds of these particular grasses were tasty. In collecting the grasses, they spread the seeds around. Some kinds of seeds survived the winter better and responded well to being sown by humans. Now, nobody sat down and systematically picked out which seeds grew better or worse. They didn’t have to, because the seeds that grew better produced more seeds for the next generation. Over time, a tiny difference (fractions of a percent) in productivity would lead some strains to supplant the others. Meanwhile, inextricably linked, some humans figured out how to plant, harvest, and eat these early grains. These humans had an advantage over their neighbors, so they were able to feed more babies. That turns out to be a benefit, because farming is hard work and requires more offspring to help produce food. (Another feedback loop.) Oh, and this kind of labor makes it advantageous to keep livestock, too. Over time, these farmers would breed and feed more children than the nomads, so farmers would come to be a larger and larger percentage of the population. Just as an added wrinkle, keeping livestock and fertilizing fields both lead to diseases that simultaneously harm the individuals and occasionally decimate the population, but also provide some long-term benefits such as better disease resistance and inadvertent biological warfare when encountering other civilizations.
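The claim that a fraction of a percent is enough can be checked with a little arithmetic. This is a toy model of my own, not Diamond’s: two strains, one of which yields slightly more seed each generation, and nothing else going on. The function name and its parameters are made up for illustration.

```python
def share_after(generations, advantage, initial_share=0.5):
    """Population share of the more productive strain after some generations.

    `advantage` is its extra seed yield per generation as a fraction
    (0.005 means half a percent). The other strain's yield is the baseline.
    """
    fitter = initial_share * (1.0 + advantage) ** generations
    other = 1.0 - initial_share
    return fitter / (fitter + other)
```

Starting from an even split, a half-percent edge (`share_after(1000, 0.005)`) leaves the fitter strain with more than 99% of the population after a thousand generations, with nobody choosing anything.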
Try to diagram the feedback loops here: nomads, farmers, livestock, grains, birthrates, and so on. Everything is connected to everything else. It’s really hard to avoid slipping into teleological language. We’ve got feedback and feedforward at several different levels and timescales, from the scale of microbes to livestock to civilizations, and across centuries. This dynamic altered the course of many species’ evolution: cattle, wheat, maize, and yes, good old H. sapiens.
The human intellectual penchant for decomposition, isolation, and leveled abstraction is purely an artifact of the size of our bodies and the duration of our lives.
Google has published an explanation of the widespread GMail outage from September 1st. In this explanation, they trace the root cause to a layer of “request routers”:
…a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.
This perfectly describes the “Chain Reaction” stability antipattern from Release It!
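A toy simulation shows how quickly this kind of failure cascades. The model is my own simplification, not a description of Google’s routers: the load is spread evenly over whichever routers are still healthy, and any router handed more than its capacity drops out. The function name and the numbers are made up for illustration.

```python
def simulate_chain_reaction(total_load, capacities):
    """Return the number of healthy routers after each round of failures.

    Each round, total_load is split evenly across the healthy routers.
    Routers whose capacity is below their share drop out, which pushes
    more load onto the survivors in the next round.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        per_router = total_load / len(healthy)
        survivors = [c for c in healthy if c >= per_router]
        if len(survivors) == len(healthy):
            break  # stable: everyone can carry their share
        healthy = survivors
        history.append(len(healthy))
    return history
```

Ten routers with capacity 12 carry a load of 100 indefinitely. Swap in one weak router (capacity 9) and trim the rest to 11, and `simulate_chain_reaction(100, [9] + [11] * 9)` goes from ten healthy routers to nine to none.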