AI Training Data Is Eating Itself

AI systems train on their own degraded outputs. This creates a feedback loop that corrupts the internet itself. By April 2025, 74% of new webpages were AI-generated. Bots produce 51% of web traffic.

The result is model collapse. Errors amplify. Biases compound. Reliable information becomes scarce. Without intervention, this infrastructural decay will reshape how elections work, how trust forms, and who controls narrative deployment at scale.

Core Answer:

  • Model collapse occurs when AI trains on AI-generated content, degrading quality with each iteration
  • Human-generated data will be exhausted within this decade, around 300 trillion tokens total
  • Bots generate 51% of web traffic and 21-29% of content on platforms like Twitter
  • AI swarms enable low-cost narrative control, with 45 accounts generating 4 billion impressions
  • Solutions include charging for accounts, banning human-impersonating bots, and independent watchdog networks

What Is Model Collapse?

By April 2025, 74 percent of new webpages contained AI-generated text.

The internet is not expanding. It is consuming itself.

Gary Marcus calls this model collapse. The technical term does not capture the scale. When AI systems train on their own output, they do not improve. They degrade. Errors amplify. Biases compound. The semantic range narrows with each iteration.

This is not a future risk. It is infrastructure reality.

What this means: The fuel source for AI advancement contaminates itself faster than companies anticipated.

Why AI Companies Are Feeding Models Contaminated Data

LLMs are desperate for data. Companies prioritize quantity over quality because scaling demands it. They scrape Reddit. They ingest propaganda sites. They pull from sources they know are compromised.

The models do not distinguish good information from bad. They consume everything.

Research shows that even 1 in 1000 synthetic data points triggers collapse. Larger training sets do not fix this. More compute does not solve it. The fuel itself contaminates the system.
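
The mechanism is easy to demonstrate in miniature. The sketch below is a toy illustration in the spirit of the model collapse literature, not the cited study's actual experiment: fit a simple distribution to data, generate the next training set from the fit, and retrain on it. The spread of the data, a crude stand-in for semantic range, drifts toward zero.

```python
# Toy illustration only -- not the cited research. Each generation,
# "train" a model (here, just a Gaussian fit) on the current data,
# then generate the next training set from that model. With finite
# samples, the fitted spread drifts downward: the range of outputs
# narrows generation after generation.
import numpy as np

rng = np.random.default_rng(42)
n = 200                                    # samples per generation
data = rng.normal(0.0, 1.0, size=n)        # generation 0: "human" data

for gen in range(1, 501):
    mu, sigma = data.mean(), data.std()    # fit the current data
    data = rng.normal(mu, sigma, size=n)   # retrain on model output
    if gen % 100 == 0:
        print(f"generation {gen}: spread = {sigma:.3f}")
```

Recursion alone produces the drift; no bad input is required. That is why small synthetic fractions compound instead of washing out.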

By August 2025, 10.4% of sources cited in Google AI Overviews were themselves AI-generated. The recursion is already embedded in the infrastructure.

Human-generated public text data will not sustain scaling beyond this decade. The total effective stock sits around 300 trillion tokens. We are not hitting hardware limits. We are hitting feedstock exhaustion.

Bottom line: The industry built scaling roadmaps assuming infinite clean data. That assumption collapsed in 2025.

How Bots Took Over the Internet

Bots now generate 51% of web traffic. Three years ago, it was 42%.

The Dead Internet Theory stopped being theory when we prioritized convenience over quality.

AI swarms are the next infrastructure layer. These are not individual bots. They are coordinated networks of thousands of agents maintaining consistent personas, posting 24 hours a day, restating talking points in varied language to evade detection.

A peer-reviewed study in the journal Science warned that these swarms spread misinformation and harass real users, playing a new role in information warfare. The lead author noted that the more sophisticated these bots are, the fewer you will need.

In 2024, Global Witness identified 45 accounts that generated more than 4 billion impressions around polarizing content. That is 45 accounts, roughly 89 million impressions each. The leverage ratio is what matters. When individual operators generate billions of impressions, the cost of manufacturing consensus collapses to near zero.

The pattern: Output velocity and algorithmic amplification create infrastructure advantage, not raw bot numbers.

Why Authoritarians Have the Infrastructure Advantage

China, Iran, Russia, Turkey, and North Korea are using bot networks to amplify narratives worldwide. The problem is not the bots themselves. It is the use of bot farms to trick social media algorithms into making people believe lies are true.

A lie repeated often enough becomes truth. This is not rhetoric. This is infrastructure advantage.

Fewer than 5% of Twitter accounts belong to bots, but they generate 21-29% of content. The infrastructure advantage is not in numbers. It is in output velocity and algorithmic amplification.

In a 2019 paper, Li Bicheng, a member of a Chinese military political warfare unit, described an AI system that would create not posts, but personas.

Accounts generated by such a system might spend most of their time posting about fake jobs, hobbies, or families, but every once in a while they slip in a reference to Taiwan or to the social wrongs of the United States.

This is not disinformation. It is infrastructure for narrative deployment.

Key insight: Narrative control at scale is now low-cost infrastructure, not expensive propaganda operations.

What This Means for Elections

Marcus argues that upcoming elections will face more pervasive last-minute AI-driven propaganda. Voters struggle to access reliable facts needed for democratic decision-making when authentic human content becomes a smaller share of what they see.

The internet is not a neutral information layer. It is a system with a degenerative condition. As bots and AI-generated content proliferate, trust in information erodes. The feedback loop accelerates.

Researchers set up Capture the Narrative, the first social media wargame in which students build AI bots to influence a fictional election. The experiment demonstrated what had been theoretical: low-cost, high-impact narrative control is now accessible infrastructure.

Around half of the content you see online is now made and spread by AI. The question is not whether this affects elections. The question is which elections fall first.

Reality check: Democratic decision-making requires reliable information. That substrate is degrading faster than institutions are adapting.

What We Could Do About It

Marcus stresses that the situation is not terminal but requires urgent treatment. He proposes several measures.

Make it more expensive to run fake accounts. Small posting fees change the economics of bot deployment, as the back-of-envelope sketch after these proposals shows.

Outlaw chatbots that impersonate humans, with meaningful penalties for large-scale violations. The current regulatory environment treats this as a minor infraction.

Give researchers transparent access to platform data. Right now, platforms control the data and the narrative about the data.

Establish an independent watchdog network: non-governmental monitoring of AI-driven information operations that pushes platforms and policymakers to act.
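
The arithmetic behind the first proposal is blunt. Here is a back-of-envelope sketch in which every figure is a hypothetical assumption, not a number from Marcus or the cited studies:

```python
# Back-of-envelope only: every figure here is a hypothetical assumption,
# not a number from Marcus or the cited studies.
accounts = 10_000       # hypothetical swarm size
posts_per_day = 50      # hypothetical output per account
fee_per_post = 0.01     # hypothetical fee, in dollars

daily = accounts * posts_per_day * fee_per_post
print(f"daily cost:  ${daily:,.0f}")        # $5,000
print(f"annual cost: ${daily * 365:,.0f}")  # $1,825,000
```

A fee a human poster would barely notice becomes a seven-figure annual cost for an industrial swarm.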

Without broad political agreement that this is a problem, every election is vulnerable. The infrastructure advantage belongs to whoever is willing to deploy it first.

Strategic reality: These are not technical solutions. They are political decisions about who controls information infrastructure.

The Shift You Need to Reprice Now

This is not about better content moderation. It is not about fact-checking at scale. Those are tactical responses to a structural problem.

The internet was built on the assumption that human-generated content would remain the dominant signal. That assumption no longer holds.

When 30 to 40% of the active web is synthetic, and models train on that synthetic data, the degradation is not linear. It is exponential. Researchers call this model collapse. The rest of us will call it the information crisis that rewrote competitive dynamics before anyone repriced it.
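
The shape of that curve is easy to see with a toy compounding model; the rate below is an assumption for illustration, not a measured value. When each recursive generation multiplies quality by a loss factor instead of subtracting a fixed amount, the decline compounds:

```python
# Toy model with an assumed rate, not measured data: each recursive
# generation multiplies quality by (1 - r) instead of subtracting a
# fixed amount, so the decline compounds.
r = 0.30         # hypothetical per-generation quality loss
quality = 1.0
for gen in range(1, 6):
    quality *= 1 - r
    print(f"generation {gen}: quality = {quality:.1%}")
# generation 5: quality = 16.8%
```

Compounding, not any single generation's loss, is what makes the loop dangerous.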

You are not six months late on this shift. You are watching it happen now. The question is whether you are building for a world where human data is the new rare earth mineral, or whether you are still optimizing for a paradigm that already ended.

Frequently Asked Questions

What is model collapse?
Model collapse happens when AI systems train on their own AI-generated outputs. Each iteration amplifies errors and narrows the range of responses. Even 1 in 1000 synthetic data points triggers this degradation. More compute or larger datasets do not fix it because the training data itself contaminates the system.

How much of the internet is AI-generated?
By April 2025, 74% of new webpages contained AI-generated text. Bots generate 51% of web traffic. By August 2025, 10.4% of sources cited in Google AI Overviews were themselves AI-generated. The feedback loop is already active in production systems.

What are AI swarms?
AI swarms are coordinated networks of thousands of bot accounts that maintain consistent personas while varying language to evade detection. They operate 24/7, posting polarizing content. In 2024, 45 accounts generated over 4 billion impressions. That leverage ratio pushes the cost of manufacturing consensus toward zero.

Why does this matter for elections?
When half the content voters see online is AI-generated, access to reliable facts becomes harder. AI-driven propaganda at scale is now low-cost infrastructure. Researchers proved this in Capture the Narrative, a social media wargame where students built bots to influence a fictional election. The question is not whether elections are affected, but which ones fall first.

Will we run out of training data?
Yes. Human-generated public text data totals around 300 trillion tokens. At current scaling rates, we exhaust this supply within this decade. We are not hitting hardware limits. We are hitting feedstock exhaustion. This is why companies scrape everything, including low-quality and compromised sources.

How do authoritarians use this infrastructure?
China, Iran, Russia, Turkey, and North Korea deploy bot networks to amplify narratives globally. Fewer than 5% of Twitter accounts are bots, but they generate 21-29% of content. Chinese military strategist Li Bicheng described AI systems that create personas posting about jobs and hobbies, then slip in geopolitical messaging. This is narrative deployment infrastructure, not traditional propaganda.

What solutions exist?
Marcus proposes four measures: charge fees for posting to make bot deployment expensive, outlaw bots that impersonate humans with meaningful penalties, give researchers transparent platform access, and establish independent watchdog networks. These require political will, not technical fixes. Without broad agreement this is a problem, the infrastructure advantage goes to whoever deploys first.

Is the Dead Internet Theory real?
Yes. Bot traffic rose from 42% in 2022 to 51% in 2025. The theory stopped being theory when infrastructure prioritized convenience over quality. The internet is not expanding with human content. It is cannibalizing itself with synthetic outputs that degrade information reliability.

Key Takeaways

  • Model collapse is not a future risk. It is infrastructure reality. AI systems training on their own outputs degrade exponentially, and 74% of new webpages already contain AI-generated text.
  • Human-generated training data will be exhausted within this decade. The total stock is around 300 trillion tokens. We are hitting feedstock limits, not hardware limits.
  • Bots generate 51% of web traffic and 21-29% of platform content. AI swarms with 45 accounts produced 4 billion impressions, collapsing the cost of manufacturing consensus to near-zero.
  • Authoritarians have an infrastructure advantage. Countries like China, Iran, and Russia use bot networks for narrative deployment at scale. Output velocity and algorithmic amplification matter more than raw bot numbers.
  • Elections are vulnerable now. Half the content voters see is AI-generated. Access to reliable facts is degrading faster than democratic institutions are adapting. The question is which elections fall first.
  • Solutions require political will, not just technical fixes. Charging fees for accounts, banning human-impersonating bots, granting researchers platform access, and creating independent watchdogs are all possible, but need broad agreement.
  • You are watching a structural shift in real time. The assumption that human content dominates the internet no longer holds. You are building for a world where human data is the rare resource, or you are optimizing for a paradigm that already ended.

 
