The other day I tweeted that "Data is the New Oil." A lot of people retweeted, but quite a few asked what I meant by that. I'll amplify a bit to explain the analogy.

This ended up being a lot to unpack from a quick tweet! For quite a few years now, I've used Twitter as a way to scratch the itch of personal expression. A quick sound bite there, highly compressed and idiosyncratic, was just enough to relieve the mental pressure. As a consequence, I stopped blogging nearly as much. Lately, though, I feel the need for nuance and explanation, so I hope to do more in this space.

First, oil was the key resource that drove the industrial revolution in the 20th century. That was the age of oil and steel, according to economist and historian Carlota Perez. In Technological Revolutions and Financial Capital, Prof. Perez shows that every technology revolution goes through predictable phases, from irruption to exhaustion. Economics in the 20th C were totally defined by access to and movement of oil. Those who had it either had leverage or became victims, depending on their ability to create military and economic alliances. Oil reserves could put a nation on the world stage. A nation that bargained well with its oil would have power far beyond what its population size or technological ability would usually merit.

In fact, a large part of the U.S. economic dominance in the latter portion of the 20th C can be explained by the petrodollar. Since the Bretton Woods conference in 1944, oil transactions around the world were denominated in USD. If Saudi Arabia sold oil to China, then China had to pay SA in dollars. That meant China needed plenty of USD currency reserves and SA needed the US to hold riyal. (The biggest economic story in the world right now is not the DJIA hitting 26,000 or falling by 0.5% in a day… it's that China, Russia, Saudi Arabia, and Iran are now trading oil denominated in rubles, yuan, and SDRs.)

But before the internal combustion engine, oil wasn't a resource; it was a nuisance. The oil-rich land in Oklahoma is where the U.S. Government settled people it wanted to get out of the way. Oil gets in the way of farming. It was the development of the new technology that turned oil from a hassle into a resource.

Once oil became a resource, a feedback loop got underway. More demand for oil led to more extraction, which caused industries to find new uses for the stuff. Plastics, fertilizers, etc. Increased demand drove increased supply and more efficient extraction, which in turn led to more demand.

Prof. Perez already identified the next technological revolution as information technology. However, I think her book got the timing wrong. It was published in 2002 and dated the start of the revolution to the advent of the personal computer in 1970. With the advantage of 16 years of additional observation, I think that there were two missing pieces: networking and machine learning. The real irruption of information technology started over the last decade. And as with the previous revolution, this one creates a need for a new resource: data.

Before this, data was a nuisance. It filled up disks and needed to be purged. It was often dirty (meaning not fully correct or conforming to syntactic rules) and incomplete. But toward the end of the 00's, some people started to see it as a resource. You might spend a lot of time cleansing and canonicalizing small data sets. With a lot of data, that's impossible. At the same time, though, you don't need to clean the data to glean information. Some kinds of errors average out and interesting signals emerge.

(If only I had come up with the name "Big Data" instead of "Dirty Data!")
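To make the "errors average out" claim concrete, here is a minimal sketch. Everything in it is an assumption I picked for illustration: the true value, the Gaussian noise model, the sample sizes, and the function names. It shows that random noise shrinks as the sample grows, while a constant bias does not. That is why only some kinds of errors wash out.

```python
import random
import statistics

def noisy_readings(true_value, n, noise_sd, bias=0.0, seed=0):
    """Return n readings of true_value with Gaussian noise plus an optional constant bias."""
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

def error_of_mean(true_value, n, noise_sd=5.0, bias=0.0, seed=0):
    """Absolute gap between the sample mean and the true value."""
    readings = noisy_readings(true_value, n, noise_sd, bias, seed)
    return abs(statistics.fmean(readings) - true_value)

TRUE_VALUE = 10.0  # hypothetical quantity we are trying to measure

# Random noise only: compare a small sample against a large one.
small_sample_error = error_of_mean(TRUE_VALUE, n=10)
large_sample_error = error_of_mean(TRUE_VALUE, n=100_000)

# Same noise plus a systematic bias of 2.0 on every reading.
biased_error = error_of_mean(TRUE_VALUE, n=100_000, bias=2.0)

print(f"10 noisy readings:       error {small_sample_error:.3f}")
print(f"100,000 noisy readings:  error {large_sample_error:.3f}")
print(f"100,000 biased readings: error {biased_error:.3f}")
```

With enough readings the unbiased error should land very close to zero, while the biased error stays near 2.0 no matter how much data you pile on.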

Of course, we're well beyond mere Big Data now. With every eye turning toward machine learning, we've got a new challenge for our data. That's training.

A machine learning model is only as good as the training data. The training data itself needs to be classified. In other words, to train a machine to detect cars, you need a lot of photos where some are tagged "this has a car" and others don't have that tag. Yes, some CAPTCHAs just might be using you to train a machine, instead of proving you aren't one.
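Here is a toy sketch of what "classified training data" means in practice. It is not how a real image model works. The two-number feature vectors stand in for whatever a real system would extract from a photo, and the nearest-centroid approach, the function names, and the data are all my own assumptions, chosen only because they fit in a few lines. The point is that the model learns nothing without the human-supplied tags.

```python
def train_centroids(examples):
    """examples: list of (features, label) pairs. Returns label -> mean feature vector."""
    sums = {}
    counts = {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        for i, x in enumerate(features):
            sums[label][i] += x
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], features))

# Hypothetical tagged examples: (made-up features, human-supplied tag).
labeled_photos = [
    ((0.9, 0.8), "car"),
    ((0.8, 0.9), "car"),
    ((0.1, 0.2), "no car"),
    ((0.2, 0.1), "no car"),
]

model = train_centroids(labeled_photos)
print(predict(model, (0.85, 0.75)))
print(predict(model, (0.10, 0.10)))
```

Change the tags on the examples and the model's answers change with them, which is the whole story of training data quality in miniature.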

(Aside: we're going to see a lot of conflict about biases in ML models. We will expect the machines to be free of human cognitive and social biases, but we're training them with data created and classified by humans! We will actually be asking the machines to make errors in a systematic way to offset humans' systematic errors in the training inputs. It's not hard to see why HAL 9000 went spare.)

Data is digital, but it's not easy to move around in these quantities. We're not talking about a station wagon full of tapes barreling down the highway… we're talking about a convoy of 18-wheelers loaded with racks full of disks.
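A bit of back-of-the-envelope arithmetic shows why the trucks win. The numbers below are hypothetical placeholders I chose for illustration (a 10 PB truckload, a two-day drive, a 10 Gbps link), not measurements of any real service.

```python
def effective_gbps(total_terabytes, hours):
    """Throughput, in gigabits per second, of physically moving data in a given time."""
    bits = total_terabytes * 1e12 * 8
    seconds = hours * 3600
    return bits / seconds / 1e9

def transfer_days(total_terabytes, link_gbps):
    """Days needed to push the same data over a network link at full speed."""
    bits = total_terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 86400

# All three values are assumptions for the sake of the example.
TRUCK_CAPACITY_TB = 10_000  # one truckload of disks, i.e. 10 PB
DRIVE_HOURS = 48            # door-to-door driving time
LINK_GBPS = 10              # a dedicated network link

truck_gbps = effective_gbps(TRUCK_CAPACITY_TB, DRIVE_HOURS)
link_days = transfer_days(TRUCK_CAPACITY_TB, LINK_GBPS)

print(f"Truck: about {truck_gbps:.0f} Gbps effective")
print(f"Link:  about {link_days:.0f} days to move the same data")
```

Under those assumptions, the truck delivers hundreds of gigabits per second, while the network link needs roughly three months to move the same load.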

Companies that have tagged or classified data sets are the new oil-producing and oil-exporting countries. If you have large quantities of classified photos, video, voice, text, etc., you are well-positioned to train ML models. If you don't have such a data set, then you need to create a consumer-oriented startup to get humans to do the initial classification for you, or you need to license access to data from one of the big players. (There are some open-access data sets that hobbyists can use, but those will never be as large or as current as the proprietary data sets.) Alternatively, focus on providing the engineering support and tooling for the technostates that have the data, the same way that Norway provides engineering to Saudi Arabia.

Just as oil production led to new uses of oil that reshaped everything from consumer products to food production to hygiene, I fully expect data-fueled ML models to reshape this century. Moreover, we will see demand for ever-greater data production from our homes, workplaces, and devices. This will cause tension and conflict about data use, just as happened with land use, water use, and mineral rights. That will lead to new legal regimes and doctrines. In extreme cases, it may lead to revolutions similar to the Revolutions of 1848 in Europe.