Perfection is Not Always Required

In my series on dirty data, I made the argument that sometimes incomplete, inaccurate, or inconsistent data was OK. In fact, not only is it OK, but it can be an advantage.

There's a really slick Ruby library called WhatLanguage that illustrates this beautifully. The author also wrote a nice article introducing the library. WhatLanguage automatically determines the language that a piece of text is written in.

For example (from the article)

require 'whatlanguage'

"Je suis un homme".language      # => :french

Very nice.

WhatLanguage works by comparing words in the input text to a data structure that can tell you whether a word exists in the corpus. There's the catch, though. It can return a false positive! That would mean you get an incorrect "yes" sometimes for words that aren't in the language in question. On the other hand, it's guaranteed against false negatives.

You might imagine that there are pretty limited circumstances when you'd use a data structure that sometimes returns incorrect answers. (There is a calculable probability of a false positive. It never reaches zero.) It works for WhatLanguage, though.

You see, each word contributes to a histogram binned by possible language. Ultimately, one language "wins", based on whichever has the most entries in the histogram. False positives may contribute an extra point to incorrect languages, but the correct language will pretty much always emerge from the noise, provided there's enough source text to work from.

So, there's another example of information emerging from noisy inputs, just as long as there's enough of it.