TL;DR

Thorsten Meyer AI’s third Control Series installment says AI competition is shifting from rentable compute to scarce data. The report cites public-text limits, copyright settlements, expert datasets and sovereign data controls as signs that proprietary data is becoming a core advantage.

Thorsten Meyer AI published the third installment of its six-part Control Series, naming data as the AI industry’s emerging chokepoint and warning that the most valuable training material is moving behind licenses, enterprise walls, expert workflows and national controls.

The report cites Epoch AI’s estimate that the public internet contains roughly 300 trillion tokens of high-quality text, with frontier training datasets already approaching that supply. Epoch AI projects that available public human text could be fully used between 2026 and 2032, with a median estimate around 2028; the report treats that range as a projection, not a settled endpoint.

The analysis says the earlier era of scraping large amounts of public web material at low cost is giving way to a market for licensed and proprietary data. It points to Anthropic’s reported $1.5 billion settlement with authors over pirated books, while stressing that the settlement covered past piracy claims, not future training practices or model outputs. The New York Times’ case against OpenAI remains in discovery, and publishers including News Corp have pursued licensing deals, according to the source material.

The report also highlights synthetic data as a partial response to scarcity, citing Nvidia’s $320 million purchase of Gretel and Microsoft’s use of hundreds of billions of synthetic tokens. But it warns, based on cited research, that machine-generated training material can compound errors in areas where answers are hard to verify. That risk increases the value of fresh, verified human data from experts, enterprises and real-world operations.

AI Dispatch · The Control Series · Part 3
Chokepoint 03 — Data

Data: The One Thing You Can’t Rent

The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat — so data is being fenced, priced, and, in places, treated as a national asset.

[Infographic: data tiers, listed from most to least scarce and valuable]
Sovereign / real-world (Avengers combat data · FSD · ISR): can’t be bought
Expert-authored (PhDs, lawyers, surgeons define “good”): the new gold
Licensed content (paywalled, deal-only, now priced): fenced
Public web text (scraped for free, exhausting ~2028): commoditizing

~300T: public text tokens, used up 2026–2032
$1.5B: Anthropic authors settlement, scraping era ends
$14.3B: Meta for 49% of Scale, triggered an exodus
“Keep the model”: Ukraine’s condition, data as sovereign asset

The take

Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own — so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson everyone fled Scale to learn). Nations: license it like Ukraine — keep the model, keep the leverage.

Sources: Epoch AI; PBS; Intl AI Safety Report 2026; NPR; Authors Guild; Wolters Kluwer; TechCrunch; TIME; CNBC; Ukraine MoD (2024–Jun 2026). Token estimates are projections; valuations as reported.
thorstenmeyerai.com · 03 / 06

Data Becomes The AI Moat

The report’s central claim is that compute can be rented, but unique data cannot. If model architectures and chips become easier to buy or imitate, the data used to train and refine systems may become a stronger source of advantage than raw hardware access.

For companies, that makes internal data a strategic asset rather than a byproduct. Customer records, workflow histories, research notes, legal analysis, clinical judgments and operational logs can improve AI systems if used responsibly, but the same data can also strengthen a vendor or rival if handed over without clear rights and safeguards.

For startups, the shift may raise costs. A licensing market can pay creators and reduce legal risk, but large settlements and premium data deals favor companies with enough capital to buy access. The report warns that fencing data may protect rights holders while also concentrating AI power among better-funded firms.

From Web Scraping To Licenses

Early AI scaling depended heavily on large public datasets collected from the open web. The Control Series frames that phase as the free part of AI training: broad, cheap and difficult for any single firm to fully control.

The report says the next phase is different. High-value material increasingly sits behind paywalls, inside companies, with credentialed experts or in sensitive state-controlled settings. It cites Meta’s reported $14.3 billion investment for a 49% stake in Scale AI as an example of how data pipelines and labeling relationships can carry strategic value, while also raising customer concerns about who may benefit from shared data.

The report also points to Ukraine’s handling of battlefield-related AI data as a sign that some governments may treat operational data as a sovereign asset. In that framing, licensing access while keeping control of the model or underlying corpus preserves leverage.

“Data was supposed to be the abundant input. It’s the scarce one.”

— Thorsten Meyer AI, The Control Series

Limits Of The Data Ceiling

It is not yet clear exactly when the supply of high-quality public text will run out for frontier training at scale. Epoch AI’s 2026-to-2032 range is a forecast, and the timing could shift if algorithms become more efficient, new public sources emerge or model developers change training methods.

Legal rules also remain unsettled. The Anthropic settlement addressed specific piracy claims, while other copyright cases and licensing negotiations are still moving. Courts, lawmakers and private contracts will shape what data can be used, at what price and under which restrictions.

The business impact is still unfolding. The report says customers moved away from Scale after Meta’s investment, but the longer-term effect on data vendors, AI labs and enterprise buyers remains uncertain.

Licensing Deals Set The Pace

The next phase will likely be measured through court rulings, publisher licensing agreements, enterprise AI contracts and national data policies. AI labs will keep using synthetic data and efficiency gains, but the report suggests the strongest systems will depend more on verified human and proprietary datasets.

For businesses and governments, the near-term decision is contractual: who can use their data, for what purpose, whether it can train future models and what rights remain with the original owner.

Key Questions

What is the main development in this report?

The report says competitive advantage in AI is shifting from compute to data, because high-quality public web text is nearing practical limits while valuable private datasets are being licensed, restricted or guarded.

Is public training data already exhausted?

No. The report cites Epoch AI’s projection that high-quality public human text could be fully used between 2026 and 2032, with a median around 2028. That is a forecast, not a confirmed endpoint.

Why does the Anthropic settlement matter?

The reported $1.5 billion settlement signals that AI training data carries legal and financial risk when sourced from pirated material. The report says it also helps push the market toward paid licensing.

Can synthetic data solve the shortage?

Synthetic data can help, and major AI firms are already using it. The report says it is not a full substitute because repeated training on machine-generated material can spread errors when outputs are hard to check.

What should companies watch now?

Companies should review AI vendor contracts, training-data rights and internal data controls. The report’s warning is that proprietary data can become a competitive asset if protected, or a lost advantage if shared without limits.

Source: Thorsten Meyer AI
