Checking in on whether Data is Oil
Clive Humby’s 2006 quip that “data is the new oil” landed just as the Web 2.0 platforms were discovering that logs and clickstreams could be refined into ad-targeting gold. When The Economist put the line on its 2017 cover, the analogy hardened into doctrine: whoever owned the biggest well of user data would inherit Rockefeller-scale power.
In the late-2010s deep-learning boom, that sentiment made sense. ImageNet was not released publicly until 2009; Google’s Search and YouTube logs were (and remain) proprietary; Facebook’s social graph could not be duplicated. Accumulating exclusive, labeled corpora looked like the only durable moat.
Benedict Evans summarized the thesis neatly: the edge lies not in having data but in owning the network effects behind it—because nobody can switch social graphs the way you can switch cloud providers.
Then the ground shifted. Open corpora such as Common Crawl exploded: more than 250 billion pages accumulated across crawls that land roughly monthly, a first-stop ingredient for nearly every large model. Once a dataset sits free on S3, it is by definition non-exclusive: scrape it once, and every lab has the same tokens.
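To make that non-exclusivity concrete, here is a minimal sketch of pulling a page straight out of Common Crawl’s public index. The crawl ID below is just one example release; pick a current one from index.commoncrawl.org.

```python
import gzip
import json
import requests

# Query the public Common Crawl index for captures of a URL.
# CC-MAIN-2024-33 is an example crawl ID; newer releases supersede it.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"
resp = requests.get(INDEX, params={"url": "example.com", "output": "json"})
resp.raise_for_status()
record = json.loads(resp.text.splitlines()[0])

# Fetch just that record's bytes from the public WARC file with an
# HTTP range request -- no credentials, no licensing gate.
start = int(record["offset"])
end = start + int(record["length"]) - 1
warc = requests.get(
    "https://data.commoncrawl.org/" + record["filename"],
    headers={"Range": f"bytes={start}-{end}"},
)
print(gzip.decompress(warc.content).decode("utf-8", errors="replace")[:500])
```

Every lab that runs this hits the same index and gets byte-identical records; exclusivity is impossible by construction.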
The result is a curious leveling effect. A newly released public corpus lifts all boats: GPT, LLaMA, Claude, Gemini. But precisely because everyone can drink from the same well, no one gains a lasting advantage from it. Unlike oil, which depletes, the flow of text, images, and telemetry keeps compounding. The competitive edge migrates elsewhere: toward compute, architectures, or exotic private data.
Paradoxically, plenty does not guarantee sufficiency. A 2022 study estimated that by the late 2020s frontier labs could exhaust the stock of high-quality human text available for training. Raw quantity keeps growing; useful novelty is scarce.
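A rough sanity check shows why that timeline is plausible. The stock and growth figures below are illustrative assumptions for the sketch, not numbers from the study; only the Llama-3-scale token count reflects a publicly reported training run.

```python
# Back-of-envelope: when does one frontier run want the whole stock?
stock = 5e13    # ASSUMED stock of high-quality human text, in tokens
demand = 1.5e13 # tokens consumed by a 2024 frontier run (Llama-3 scale)
growth = 1.5    # ASSUMED yearly growth in tokens demanded per run

year = 2024
while demand < stock:
    demand *= growth
    year += 1
print(f"a single frontier run would want the entire stock by ~{year}")
```

Under these toy numbers the crossover lands in 2027; shifting the assumptions moves the date, but not the direction of the squeeze.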
Modern LLMs are “jagged” rather than smooth: dazzling at Olympiad math, wobbly at kindergarten logic. Scholarly work in Nature finds that scaling often trades reliability for raw accuracy: bigger models mis-generalize in surprising corners. Domain evaluations show discipline-specific gaps as wide as the chasm between literature and oncology.
The common thread is data quality: models ace what they have many, diverse, clean exemplars of, and stumble elsewhere. That pushes the frontier from “more tokens” to “better, purpose-built tokens.”
So we see:
Scale AI pivoted from bounding boxes to frontier knowledge datasets—curated, expert-labeled corpora—and expects to double revenue to $2 billion in 2025.
Anthropic notes in the bioweapons-risk section of its latest model card that the company benchmarked model answers against novices, intermediates, and “experts (domain-specific PhDs)”, and used those PhDs to grade the responses.
OpenAI’s GPT-4o system card lists paid partnerships (e.g., pay-walled archives) as a key differentiator.
These are less like oil barons with valuable plots of land and more like bespoke refinery operators: they do not own the web, but they own the cleaning, labeling, and alignment loops that turn noisy text into something a model can genuinely learn from.
Restated, frontier labs compete on process knowledge—automated deduplication, evaluator-in-the-loop reinforcement, safety filters—not on the underlying bytes.
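As one concrete instance of that process knowledge, here is a minimal near-duplicate filter in the spirit of the deduplication step. The shingle size and similarity threshold are illustrative assumptions, not values from any lab’s pipeline.

```python
import re

def shingles(text: str, k: int = 5) -> set[str]:
    # Normalize to lowercase word tokens, then take overlapping
    # k-word shingles. k=5 is an arbitrary illustrative choice.
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a and b else 0.0

def dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    # Keep a document only if it is not near-identical to any kept so far.
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        sig = shingles(doc)
        if all(jaccard(sig, seen) < threshold for _, seen in kept):
            kept.append((doc, sig))
    return [d for d, _ in kept]

corpus = [
    "Data is the new oil, said the analyst in 2006.",
    "Data is the new oil -- said the analyst in 2006",  # near-duplicate
    "Sand is everywhere, but only a few fabs can sculpt it into chips.",
]
print(dedup(corpus))  # drops the second, near-duplicate entry
```

Real pipelines replace this quadratic pairwise scan with MinHash-based locality-sensitive hashing so the same idea scales to billions of documents; the refinement loop, not the raw bytes, is where the craft lives.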
Some Implications
Open data helps humanity, not moats. Each public release narrows the gap between incumbents and upstarts.
Moats shift to refinement and rights. Expect a premium on legally clean, high-precision datasets and the tooling to manufacture them.
Synthetic pipelines are the next arms race. Whoever masters fidelity-controlled generation can sidestep human-data ceilings; see the sketch after this list.
Regulation will follow the refiners. Copyright battles have already reached Common Crawl. Future fights may pivot to data provenance audits and synthetic watermark standards.
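To illustrate what “fidelity-controlled” might mean in practice, here is a hypothetical generate-then-filter loop. `draft_example` and `fidelity_score` are stand-ins for a generator model and a learned or rubric-based grader, not real APIs.

```python
import random

random.seed(0)  # deterministic demo output

def draft_example(seed: str) -> str:
    # HYPOTHETICAL stand-in for a generator model drafting a record.
    return f"Q: variation on '{seed}' / A: model-written answer #{random.randint(0, 999)}"

def fidelity_score(example: str) -> float:
    # HYPOTHETICAL stand-in for a grader: a learned classifier, an
    # expert rubric, or consistency checks against trusted references.
    return random.random()

def synthesize(seeds: list[str], threshold: float = 0.9, tries: int = 8) -> list[str]:
    # Draft candidates per seed and keep only those clearing the bar;
    # the threshold is the fidelity-control knob.
    accepted = []
    for seed in seeds:
        for _ in range(tries):
            candidate = draft_example(seed)
            if fidelity_score(candidate) >= threshold:
                accepted.append(candidate)
                break
    return accepted

print(synthesize(["What limits training data?", "Why dedup a corpus?"]))
```

The hard part is everything hidden inside the grader: a permissive threshold floods the corpus with noise, a strict one starves it, and miscalibration quietly teaches the model its own mistakes.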
Oil propelled the 20th century because it was scarce, consumable, and locked to specific geographies. Data in 2025 fails each of those tests: it gushes from every phone and server, it is copied rather than burned, and—thanks to the internet—it is almost perfectly mobile. What is scarce is the craftsmanship that turns undifferentiated bits into something a reasoning system can trust.
Google’s Veo 3 serves as the most salient exception to this line of thinking, built on a private trove so large and so well-aligned that no public crawl can reproduce it. Billions of captioned YouTube hours give DeepMind a head start in video-audio-text alignment the way Standard Oil once enjoyed prime acreage in Pennsylvania.
But beyond this counter-example, the decisive resource is no longer crude data but the refinery itself—compute, automated cleaners, expert graders, alignment loops, and safety valves. “Data is the new oil” captured a moment when possession of the well mattered most; today, advantage flows to whoever can turn overflowing streams into purpose-built fuel. In that light, the better metaphor may be silicon: sand is everywhere, but only a handful of fabs can sculpt it into chips. Likewise, data is abundant, yet only a few workshops can forge it into knowledge.