A short while back, I did a brief series on the value of "dirty data": copious amounts of unstructured, non-relational data created by the many interactions users have with your site and each other.
O’Reilly is creating a new line of "community-authored" books. One of them is called "97 Things Every Software Architect Should Know".
All of the "97 Things" books will be created by wiki, with the best entries being selected from all the wiki contributions.
I’ve contributed several axioms that have been selected for the book:
- Talk about the arch, but see the scaffolding beneath it
- You’re negotiating more often than you think
- Software architecture has ethical consequences
- Everything will ultimately fail
- Engineer in the white spaces
Long-time readers of this blog may recognize some of these themes.
You can see the whole wiki here.
Stewart Brand’s famous book How Buildings Learn has been on my reading queue for a while, possibly a few years. Now that I’ve begun reading it, I wish I had gotten it sooner. Listen to this:
The finished-looking model and visually obsessive renderings dominate the let’s-do-it meeting, so that shallow guesses are frozen as deep decisions. All the design intelligence gets forced to the earliest part of the building process, when everyone knows the least about what is really needed.
Wow. It’s hard to tell what industry he’s talking about there. It could easily apply to software development. No wonder Brand is so well-regarded in the Agile community!
Another wonderful parallel is between what Brand calls "Low Road" and "High Road" buildings. A Low Road building is one that is flexible, cheap, and easy to modify. It’s hackable. Lofts, garages, old factory floors, warehouses, and so on. Each new owner can gut and modify it without qualms. A building where you can drill holes through the walls, run your own cabling, and rip out every interior wall is a Low Road building.
High Road buildings evolve gradually over time, through persistent care and love. There doesn’t necessarily have to be a consistent–or even coherent–vision, but each owner does need to feel a strong sense of preservation. High Road buildings become monuments, but they aren’t made that way. They just evolve in that direction as each generation adds their own character.
Then there are the buildings that aren’t High or Low Road. Too static to be Low Road, but not valued enough to be High Road. Resistant to change, bureaucratic in management. Diffuse responsibility produces static (i.e., dead) buildings. Deliberately setting out to design a work of art, paradoxically, prevents you from creating a living, livable building.
Again, I see some clear parallels to software architecture here. On the one hand, we’ve got Low Road architecture. Easy to glue together, easy to rip apart. Nobody gets bent out of shape if you blow up a hodge-podge of shoestring batch jobs and quick-and-dirty web apps. CGI scripts written in Perl are classic Low Road architecture. It doesn’t mean they’re bad, but they’re probably not going to go a long time without being changed in some massive ways.
High Road architecture would express a conservatism that we don’t often see. High Road is not "big" architecture. Rather, High Road means cohesive systems lovingly tended. Emacs strikes me as a good example of High Road architecture. Yes, it’s accumulated a lot of bits and oddments over the years, but it’s quite conservative in its architecture.
Enterprise SOA projects, to me, seem like dead buildings. They’re overspecified and too focused on the moment of rollout. They’re the grand facades with leaky roofs. They’re the corporate office buildings that get gerrymandered into paralysis. They preach change, but produce stasis.
Dan Pritchett is a man after my own heart. His latest post talks about the path to availability enlightenment. The obvious path–reliable components and vendor-supported commercial software–leads only to tears.
You can begin on the path to enlightenment when you set aside dreams of perfect software running on perfect hardware, talking over perfect networks. Instead, embrace the reality of fallible components. Don’t design around them, design for them.
How do you design for failure-prone components? That’s what most of Release It! is all about.
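To make "design for them" concrete, here is a minimal sketch of one well-known way to live with a fallible dependency: a circuit breaker that stops calling a component after repeated failures and serves a fallback instead. The class name, the thresholds, and the injectable `clock` parameter are assumptions made for illustration, not a reference implementation.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors instead of hammering a broken component."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # assumption: failures tolerated before opening
        self.reset_after = reset_after    # assumption: seconds to wait before retrying
        self.clock = clock                # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: skip the call entirely
            self.opened_at = None         # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0                 # a success closes the circuit again
        return result
```

The caller always gets an answer, either the live result or the fallback, and a struggling component gets room to recover rather than a pile of retries.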
There seems to be something inherently contradictory about "Enterprise" agile tool vendors. There’s never been a tool invented that’s as flexible in use or process as the 3x5 card. No matter what, any tool must embed some notion of a process, or at least a meta-process.
I’ve looked at several of the "agile lifecycle management" and "agile project management" tools this week. To me, they all look exactly like regular project management tools. They just have some different terminology and ajax-y web interfaces.
Vendors, listen: just because you’ve got a drag-and-drop rectangle on a web page doesn’t make it agile!
The point of agile tools isn’t to move cards around the board in ever-cooler ways. It isn’t to automatically generate burndown graphs and publish them for management.
The point of agile tools is this: at any time, the team can choose to rip up the pavement and do it differently next iteration.
What happens once you’ve paid a bunch of money for some enterprise lifecycle management tool from one of these outfits? (Name them and they appear; so I won’t.) Investment requires use. Once you’ve paid for something—or once your boss has paid for it—you’ll be stuck using it.
Now look, I’m not against tools. I use them as force multipliers all the time. I just don’t want to get stuck with some albatross of a PLM, ALM, LFCM, or LEM, just because we paid a gob of money for it.
The only agile tools I want are those I can throw away without qualm when the team decides it doesn’t fit any more. If the team cannot change its own processes and tools, then it cannot adapt to the things it learns. If it cannot adapt, it isn’t agile. Period.
As an organization scales up, it must navigate several transitions. If it fails to make these transitions well, it will stall out or disappear.
One of them happens when the company grows larger than "village-sized". In a village of about 150 people or fewer, it’s possible for you to know everyone else. Larger than that, and you need some kind of secondary structures, because personal relationships don’t reach from every person to every other person. Not coincidentally, this is also the size where you see startups introducing mid-level management.
There are other factors that can bring this on sooner. If the company is split into several locations, people at one location will lose track of those in other locations. Likewise, if the company is split into different practice areas or functional groups, those groups will tend to become separate villages on their own. In either case, the village transition will happen sooner than 150.
It’s a tough transition, because it takes the company from a flat, familial structure to a hierarchical one. That implicitly moves the axis of status from pure merit to positional. Low-numbered employees may find themselves suddenly reporting to a newcomer with no historical context. It shouldn’t come as a surprise when long-time employees start leaving, but somehow the founders never expect it.
This is also when the founders start to lose touch with day-to-day execution. They need to recognize that they will never again know every employee by name, family, skills, and goals. Beyond village size, the founders have to be professional managers. Of course, this may also be when the board (if there is one) brings in some professional managers. It shouldn’t come as a surprise when founders start getting replaced, but somehow they never expect it.
Amazon has issued a more detailed statement explaining the S3 outage from July 20, 2008. In my company, we’d call this a "Post Incident Report" or PIR. It has all the necessary sections:
- Observed behavior
- Root cause analysis
- Followup actions: corrective and operational
This is exactly what I’d expect from any mature service provider.
There are a few interesting bits from the report. First, the condition seems to have arisen from an unexpected failure mode in the platform’s self-management protocol. This shouldn’t surprise anyone. It’s a new way of doing business, and some of the most innovative software development, applied at the largest scales. Bugs will creep in.
In fact, I’d expect to find more cases of odd emergent behavior at large scale.
Second, the back of my envelope still shows S3 at 99.94% availability for the year. That’s better than most data center providers. It’s certainly better than most corporate IT departments do.
Third, Amazon rightly recognizes that transparency is a necessary condition for trust. Many service providers would fall into the "bunker mentality" of the embattled organization. That’s a deteriorating spiral of distrust, coverups, and spin control. Transparency is most vital after an incident. If you cannot maintain transparency then, it won’t matter at any other time.
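For a sense of what the 99.94% figure means in hours, the back-of-the-envelope arithmetic can be sketched like this. Only the percentage comes from the text above; the 366-day year (2008 was a leap year) and the function names are assumptions for illustration.

```python
HOURS_PER_YEAR = 366 * 24  # assumption: a full leap year, 8784 hours


def implied_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime implied by an availability percentage."""
    return (1 - availability_pct / 100.0) * period_hours


def availability_pct(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Availability percentage implied by a number of downtime hours."""
    return 100.0 * (1 - downtime_hours / period_hours)


print(round(implied_downtime_hours(99.94), 2))  # about 5.27 hours over the year
```

In other words, 99.94% over a year leaves room for a little more than five hours of total downtime.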
I’ve talked before about adopting a failure-oriented mindset. That means you should expect every component of your system or application to someday fail. In fact, they’ll usually fail at the worst possible times.
When a component does fail, whatever unit of work it’s processing at the time will most likely be lost. If that unit of work is backed up by a transactional database, well, you’re in luck. The database will do its Omega-13 bit on the transaction and it’ll be like nothing ever happened.
Of course, if you’ve got more than one database, then you either need two-phase commit or pixie dust. (OK, compensating transactions can help, too, but only if the thing that failed isn’t the thing that would do the compensating transaction.)
I don’t favor distributed transactions, for a lot of reasons. They’re not scalable, and I find that the risk of deadlock goes way up when you’ve got multiple systems accessing multiple databases. Yes, uniform lock ordering will prevent that, but I never want to trust my own application’s stability to good coding practices in other people’s apps.
Besides, enterprise integration through database access is just… icky.
Messaging is the way to go. Messaging offers superior scalability, better response time to users, and better resilience against partial system failures. It also provides enough spatial, temporal, and logical decoupling between systems that you can evolve the endpoints independently.
Udi Dahan has published an excellent article with several patterns for robust messaging. It’s worth reading, and studying. He addresses the real-world issues you’ll encounter when building messaging apps, such as giant messages clogging up your queue, or "poison" messages that sit in the front of the queue causing errors and subsequent rollbacks.
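As an illustration of the "poison" message problem, here is a minimal sketch of one common remedy: cap the number of delivery attempts and park repeat offenders in a dead-letter list so they stop blocking the queue. This is not taken from Udi Dahan’s article. The in-memory `deque` stands in for a real broker, and `MAX_ATTEMPTS`, `drain`, and `handler` are hypothetical names.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumption: delivery attempts allowed per message


def drain(queue, handler, max_attempts=MAX_ATTEMPTS):
    """Process every queued message; return the ones that kept failing."""
    dead_letters = []
    while queue:
        body, attempts = queue.popleft()
        try:
            handler(body)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letters.append(body)       # give up: move it out of the way
            else:
                queue.append((body, attempts))  # retry later, behind other work
    return dead_letters


processed = []


def handler(body):
    if body == "poison":
        raise ValueError("cannot process this message")
    processed.append(body)


queue = deque((body, 0) for body in ["a", "poison", "b"])
dead = drain(queue, handler)
print(processed, dead)  # ['a', 'b'] ['poison']
```

The healthy messages get through on the first pass, and the bad one ends up somewhere a human can look at it instead of triggering rollbacks forever.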
If large amounts of dirty data are actually valuable, how do you go about collecting it? Who’s in the best position to amass huge piles?
One strategy is to scavenge publicly visible data. Go screen-scrape whatever you can from web sites. That’s Google’s approach, along with one camp of the Semantic Web tribe.
Another approach is to give something away in exchange for that data. Position yourself as a connector or hub. Brokers always have great visibility. The IM servers, the Twitter crowd, and the social networks in general sit in the middle of great networks of people. LinkedIn is pursuing this approach, as are Twitter+Summize, and BlogLines. Facebook has already made multiple, highly creepy, attempts to capitalize on their "man-in-the-middle" status. Meebo is in a good spot, and trying to leverage it further. Metcalfe’s Law will make it hard to break into this space, but once you do, your visibility is a great natural advantage.
Aggregators get to see what people are interested in. FriendFeed is sitting on a torrential flow of dirty data. ("Sewage", perhaps?) FeedBurner sees the value in their dirty data.
Anyone at the endpoint of traffic should be able to get good insight into their own world. While the aggregators and hubs get global visibility, the endpoints are naturally more limited. Still, that shouldn’t stop them from making the most of the dirt flowing their way. Amazon has done well here.
Sun is making a run at this kind of visibility with Project Hydrazine, but I’m skeptical. They aren’t naturally in a position to collect it, and off-to-the-side instrumentation is never as powerful. Although, companies like Omniture have made a market out of off-to-the-side instrumentation, so there’s a possibility there.
Carriers like Verizon, Qwest, and AT&T are in a natural position to take advantage of the traffic crossing their networks, but as parties in a regulated industry, they are mostly prohibited from inspecting it.
So, if you’re a carrier or a transport network, you’re well positioned to amass tons of dirty data. If you are a hub or broker, then you’ve already got it. Otherwise, consider giving away a service to bring people in. Instead of supporting it with ad revenue, support it by gleaning valuable insight.
Just remember that a little bit of dirty data is a pain in the ass, but mountains of it are pure gold.
Continuing my theme of filthy data.
A few years ago, there was a lot of excitement around clickstream analysis. This was the idea that, by watching a user’s clicks around a website, you could predict things about that user.
What a backwards idea.
For any given user, you can imagine a huge number of plausible explanations for any given browsing session. You’ll never enumerate all the use cases that motivate someone to spend ten minutes on seven pages of your web site.
No, the user doesn’t tell us much about himself by his pattern of clicks.
But the aggregate of all the users’ clicks… that tells us a lot! Not about the users, but about how the users perceive our site. It tells us about ourselves!
A commerce company may consider two products to be related for any number of reasons. Deliberate cross-selling, functional alignment, interchangeability, whatever. Any such relationships we create between products in the catalog only reflect how we view our own catalog. Flip that around, though, and look at products that the users view as related. Every day, in every session, users are telling us that products have some relationship to each other.
Hmm. But, then, what about those times when I buy something for myself and something for my kids during the same session? Or when I got that prank gift for my brother?
Once you aggregate all that dirty data, weak connections like the prank gift will just be part of the background noise. The connections that stand out from the noise are the real ones, the only ones that ultimately matter.
This is an inversion of the clickstream. It tells us nearly nothing about the clicker. Instead, it illuminates the clickee.
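A minimal sketch of that aggregation: count how often two products show up in the same session, then keep only the pairs that rise above a noise threshold. The sample sessions, the product names, and the fixed `min_sessions` cutoff are all invented for illustration. A real system would want a better statistical test than a flat count.

```python
from collections import Counter
from itertools import combinations


def co_view_counts(sessions):
    """Count the sessions in which each pair of products appears together."""
    counts = Counter()
    for viewed in sessions:
        # Deduplicate and sort so each pair has a single canonical key.
        for pair in combinations(sorted(set(viewed)), 2):
            counts[pair] += 1
    return counts


def strong_pairs(counts, min_sessions):
    """Keep only the pairs seen together in at least min_sessions sessions."""
    return {pair: n for pair, n in counts.items() if n >= min_sessions}


sessions = [
    ["tent", "sleeping bag"],
    ["tent", "sleeping bag", "lantern"],
    ["sleeping bag", "tent"],
    ["tent", "whoopee cushion"],  # the one-off prank gift
]
counts = co_view_counts(sessions)
print(strong_pairs(counts, min_sessions=3))  # {('sleeping bag', 'tent'): 3}
```

Nothing here looks at who did the clicking. The prank gift is still in the counts, but it never clears the threshold.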