Mounds of Filthy Data - michaelnygard.com

Data is the future.

The barriers to entering online business are pretty low, these days. You can do it with zero infrastructure, which means no capital spent on depreciating assets like servers and switches. Open source operating systems, databases, servers, middleware, libraries, and development tools mean that you don't spend money on software licenses or maintenance contracts. All you need is an idea, followed by a SMOP.

With both the cost side trending toward zero, how can there be any barrier to entry?

The "classic" answer is the network effect, also known as Metcalfe's Law. (The word "classic" in web business models means anything more than two years old, of course.) The first Twitter user didn't get a whole lot out of it. The ten-million-and-first gets a lot more benefit. That makes it tough for a newcomer like Plurk to get an edge.

I see a new model emerging, though. Metcalfe's Law is part of it, keeping people engaged. The best thing about having users, though, is that they do things. Every action by every user tells you something, if you can keep track of it all.

Twitter gets a lot of its value from the people connected at the endpoints. But, they also get enormous power from being the hub in the middle of it. Imagine what you can do when you see the content of every message passing through a system that large. A few things come to mind right away. You could extract all the links that people are posting to see what's hot today. (Zeitgeist.) You could use semantic analysis to tell how people feel about current topics, like Presidential candidates in the U.S. You could track product names and mentions to see which products delight people and which cause frustration. You could publish a slang dictionary that actually keeps up! The possibilities are enormous.

Ah, I can already sense an objection forming. How the heck is anyone supposed to figure out all that stuff from noisy, messy textual human communication? We're cryptic, ironic, and oblique. We sometimes mean the exact opposite of what we say. Any machine intelligence that tries to grok all of Twitter will surely self-destruct, right? That supposed "data" is just a big steaming pile of human contradictions!

In my view, though, it's the dirtiness of the data that makes it beautiful. Yes, there will be contradictions. There will be ironic asides. But, those will come out in the wash. They'll be balanced out by the sincere, meaningful, or obvious. Not every message will be semantically clear or consistent, but given enough messy data, clear patterns will still emerge.

There's the key: enough data to see patterns. Large amounts. Huge amounts. Vast piles of filthy data.

Over the next couple of days, I'll post a series of entries exploring how to amass dirty data, who's got a natural advantage, and programming models that work with it.