If large amounts of dirty data are actually valuable, how do you go about collecting it? Who’s in the best position to amass huge piles?
One strategy is to scavenge publicly visible data. Go screen-scrape whatever you can from web sites. That’s Google’s approach, along with one camp of the Semantic Web tribe.
Another approach is to give something away in exchange for that data. Position yourself as a connector or hub. Brokers always have great visibility. The IM servers, the Twitter crowd, and the social networks in general sit in the middle of great networks of people. LinkedIn is pursuing this approach, as are Twitter+Summize and Bloglines. Facebook has already made multiple highly creepy attempts to capitalize on its "man-in-the-middle" status. Meebo is in a good spot and is trying to leverage it further. Metcalfe’s Law will make it hard to break into this space, but once you do, your visibility is a great natural advantage.
Aggregators get to see what people are interested in. FriendFeed is sitting on a torrential flow of dirty data. ("Sewage", perhaps?) FeedBurner sees the value in their dirty data.
Anyone at the endpoint of traffic should be able to get good insight into their own world. While the aggregators and hubs get global visibility, the endpoints are naturally more limited. Still, that shouldn’t stop them from making the most of the dirt flowing their way. Amazon has done well here.
Sun is making a run at this kind of visibility with Project Hydrazine, but I’m skeptical. They aren’t naturally in a position to collect the data, and off-to-the-side instrumentation is never as powerful as sitting in the path of the traffic. Then again, companies like Omniture have made a market out of off-to-the-side instrumentation, so there’s a possibility there.
Carriers like Verizon, Qwest, and AT&T are in a natural position to take advantage of the traffic crossing their networks, but as parties in a regulated industry, they are mostly prohibited from looking at it.
So, if you’re a carrier or a transport network, you’re well positioned to amass tons of dirty data. If you are a hub or broker, then you’ve already got it. Otherwise, consider giving away a service to bring people in. Instead of supporting it with ad revenue, support it by gleaning valuable insight.
Just remember that a little bit of dirty data is a pain in the ass, but mountains of it are pure gold.