TL;DR

Thorsten Meyer AI’s latest Control Series article argues that training data has become the AI industry’s next chokepoint as public web text nears practical limits. The report points to Epoch AI projections, Anthropic’s $1.5 billion authors settlement and rising demand for expert and proprietary corpora as evidence that data is moving from free input to priced asset.

Thorsten Meyer AI published Part 3 of its Control Series, identifying training data as the AI industry’s next control point as public web text nears practical limits and valuable corpora move behind licensing deals, lawsuits and state control.

The report rests on a cluster of legal, market and research developments. Epoch AI researchers have estimated that, at some point between 2026 and 2032, AI developers could be training models on datasets roughly equal in size to the available stock of public human text, with pressure arriving sooner if labs overtrain models for efficiency. The Control Series cites about 300 trillion high-quality public text tokens as the relevant supply estimate; that figure is a projection, not a measured endpoint.

The legal shift is already visible. The Associated Press reported that a San Francisco federal judge gave preliminary approval in September 2025 to Anthropic’s $1.5 billion settlement with authors and publishers over allegations that nearly 465,000 books had been pirated for AI training. The settlement was reported at about $3,000 per covered book and does not cover future works. Judge William Alsup’s earlier ruling drew a line between lawfully acquired books used for training and pirated copies used to build a library.
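
As a rough sanity check on the reported per-book figure, the sketch below divides the headline settlement amount by the reported number of books. It is illustrative arithmetic only: the function name `per_book_payout` is hypothetical, and the calculation assumes an even split with no fees, costs or changes to the final list of covered works, none of which the source confirms.

```python
def per_book_payout(total_usd: float, books: int) -> float:
    """Naive even split of a settlement across covered books (an assumption)."""
    if books <= 0:
        raise ValueError("books must be positive")
    return total_usd / books


# Figures as reported in the article: $1.5 billion, nearly 465,000 books.
estimate = per_book_payout(1_500_000_000, 465_000)
print(f"${estimate:,.0f} per book")
```

The naive split comes out a little above $3,200, the same order of magnitude as the roughly $3,000 per covered book that was reported.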

The report says the remaining valuable data is now concentrated in places that are harder to reach: paywalled archives, enterprise systems, expert reviews, autonomous-vehicle fleets, and intelligence and battlefield records. It also cites Nvidia’s reported $320 million purchase of synthetic-data firm Gretel and Meta’s $14.3 billion Scale AI stake as signs that companies are paying for data pipelines, not just chips.

AI Dispatch · The Control Series · Part 3
Chokepoint 03: Data

Data: The One Thing You Can’t Rent

The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat, so data is being fenced, priced, and, in places, treated as a national asset.

Data tiers, from scarcest and most valuable to least:
Sovereign / real-world (Avengers combat data · FSD · ISR): can’t be bought
Expert-authored (PhDs, lawyers, surgeons define “good”): the new gold
Licensed content (paywalled, deal-only, now priced): fenced
Public web text (scraped for free, exhausting ~2028): commoditizing

Key figures:
~300T: public text tokens, used up 2026–2032
$1.5B: Anthropic authors settlement; scraping era ends
$14.3B: Meta for 49% of Scale; triggered an exodus
“Keep the model”: Ukraine’s condition; data as sovereign asset

The take

Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own — so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson everyone fled Scale to learn). Nations: license it like Ukraine — keep the model, keep the leverage.

Sources: Epoch AI; PBS; Intl AI Safety Report 2026; NPR; Authors Guild; Wolters Kluwer; TechCrunch; TIME; CNBC; Ukraine MoD (2024–Jun 2026). Token estimates are projections; valuations as reported.
thorstenmeyerai.com

Proprietary Corpora Become Moats

The finding matters because compute can be rented and models can be copied or matched over time, while proprietary data is tied to whoever collected, verified or owns it. If public text is no longer enough, model performance may depend more on exclusive corpora, expert feedback and real-world signals that rivals cannot simply buy at market rates.

For publishers and creators, the shift strengthens the case for licensing and enforcement. For enterprises, it raises a data-governance risk: internal documents, customer records, workflows and labels may become training assets for a vendor that later competes with them. Reported settlement and transaction figures are historical; they are not forecasts or financial advice.

From Scraping To Licensing

Large AI systems were built during a period when much of the open web could be scraped cheaply and the legal boundaries were unsettled. That window is closing as publishers, authors and platforms challenge scraping practices or sign paid licensing deals. The New York Times’ case against OpenAI remains active, while some publishers have chosen licensing over litigation.

Synthetic data is one response to scarcity, but the report treats it as incomplete. Research on model collapse has warned that models trained heavily on machine-generated material can lose diversity or amplify errors unless fresh, verified human data remains in the mix. That makes verified data more valuable, not less.

“Data was supposed to be the abundant input. It’s the scarce one.”

— Thorsten Meyer AI, The Control Series Part 3

Open Questions On Ownership

It is not clear how fast public-data scarcity will bind frontier labs, because data efficiency, multimodal training and synthetic data could change the timeline. Epoch AI’s date range is a forecast, and the report’s median around 2028 depends on model-training trends continuing.

Legal limits are also unsettled. The Anthropic settlement addressed past piracy claims, not future training rules or model outputs. Courts have not yet produced a single stable rule for how copyright, licensing and fair use will apply across the AI industry.

Courts And Contracts Decide

The next tests will come in courtrooms, licensing negotiations and procurement contracts. Watch the New York Times case against OpenAI, further publisher deals, enterprise AI terms that restrict training use, and government rules for military or sovereign datasets. The winners will likely be the labs, companies and states that can prove where their data came from and keep control over how it is used.

Key Questions

What is the actual development?

Thorsten Meyer AI has published a new Control Series analysis arguing that data, not compute, is becoming the AI industry’s hardest bottleneck.

Is public training data already gone?

No. The claim is that high-quality public human text may be approaching practical limits for frontier model training. The 2026–2032 range is a projection, not a confirmed depletion date.

Did the Anthropic settlement ban AI training on books?

No. The settlement concerned past claims over pirated copies. The court treated lawful acquisition and piracy differently, and the settlement does not set a complete rule for future training.

Why is proprietary data different from rented compute?

Compute capacity can be leased from cloud providers. Proprietary data often exists only inside one company, platform, lab or government, so rivals cannot rent an identical copy.

What should companies watch now?

Companies should review AI vendor terms, training-use permissions, data-retention rules and whether their operational data could help a supplier build competing products.

Source: Thorsten Meyer AI

This content is for general information only and is not financial, tax or legal advice. Consult a qualified professional for decisions about your money.