Unlocking the Hidden Danger: How AI-Generated Data Poses a Poisonous Pitfall for Future AI Models
As the Internet becomes saturated with AI-generated content, that content is contaminating the training data for future models.
So what happens when AI eats itself?
The rise of generative artificial intelligence (AI) has led to the widespread use of programs that can create text, images, code, and music. However, as AI-generated content proliferates on the Internet, it poses a potential threat to the quality of training data used for developing new AI models. The process of AI developers scraping the Internet for content might inadvertently introduce errors into datasets, leading to a cascade of issues with each subsequent generation of models.
Model Autophagy Disorder
Leveraging synthetic data for AI model training is often a cost-effective and convenient alternative, particularly when large volumes of data are required or the protection of sensitive information is crucial. While synthetic data can outperform real-world data in certain scenarios, it is frequently employed due to the dwindling availability of easily accessible, high-quality real-world datasets for training large language models. There is growing concern that the demand for machine learning datasets could deplete all high-quality language data by 2026, prompting increased reliance on AI-generated outputs and synthetic data.
A potential challenge arising from this trend is the emergence of Model Autophagy Disorder (MAD), a phenomenon characterized by models essentially “eating themselves” through repeated training on AI-generated data.
[Figure from the MAD study: without sampling bias, synthetic data modes drift from the real modes and merge; with sampling bias, they drift and collapse around individual (high-quality) images instead of merging.]
Coined by researchers at Stanford University and Rice University, MAD manifests when self-consuming loops lack a sufficient influx of fresh data, resulting in a decline in the quality and diversity of subsequent iterations. This disorder is observable across various applications, ranging from text-based chatbots like ChatGPT to image-based generative models like Midjourney. The repetitive training of successive model generations exclusively on AI outputs leads to a degenerative process, causing models to progressively forget the true data distribution over time.
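To make the self-consuming loop concrete, the toy simulation below is a minimal sketch, not the researchers' actual experiment: it assumes the "model" is nothing more than a Gaussian fitted to its training data, and each generation is trained only on samples drawn from the previous generation's model, with no fresh real data entering the loop.

```python
import numpy as np

rng = np.random.default_rng(42)

# The "true" data distribution: a standard Gaussian standing in for real, human-generated data.
real_data = rng.normal(loc=0.0, scale=1.0, size=100)

data = real_data
for gen in range(1, 21):
    mu, sigma = data.mean(), data.std()       # "train" this generation on the current dataset
    data = rng.normal(mu, sigma, size=100)    # its outputs become the next generation's training set
    print(f"generation {gen:2d}: mean = {mu:+.3f}, std = {sigma:.3f}")

# Over many generations the fitted mean wanders and the spread tends to shrink,
# i.e. the model family gradually forgets the true distribution it started from.
```

In this sketch the degradation comes purely from finite-sample estimation error compounding across generations, which is the same basic mechanism the researchers describe at a far larger scale.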
From Nuclear Fallout to AI Model Collapse
This potential issue of slow poisoning has drawn parallels with a historical dilemma from the 20th century. In the aftermath of the atomic bomb detonations at the close of World War II, decades of nuclear testing left the Earth’s atmosphere with a trace of radioactive contamination. Similarly, the constant training of AI models on AI-generated content is likened to the slow poisoning of these models over time, a phenomenon termed “model collapse.” With each training cycle, errors accumulate, eventually reaching a point where the resulting AI model loses its coherence and practical meaning.
A training diet of AI-generated text, even in small quantities, eventually becomes “poisonous” to the model being trained. Currently there are few obvious antidotes. “While it may not be an issue right now or in, let’s say, a few months, I believe it will become a consideration in a few years,” says Rik Sarkar, a computer scientist at the School of Informatics at the University of Edinburgh in Scotland.
Researchers have conducted experiments to witness this phenomenon in action. Starting with a language model trained on human-generated data, they observed a troubling progression: they used the model to generate AI output, trained subsequent iterations on that output, and watched the errors compound with each cycle. For instance, a model originally trained to discuss historical English architecture might inexplicably deviate into nonsensical musings about jackrabbits by its tenth iteration.
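A rough way to get a feel for this kind of degeneration, far simpler than the actual experiments, is to train a tiny bigram text model and then retrain each new "generation" solely on the previous generation's output. The sketch below assumes a placeholder file human_corpus.txt containing any reasonably large human-written text.

```python
import random
from collections import defaultdict

def train_bigram(tokens):
    """Record, for each word, the words observed to follow it."""
    table = defaultdict(list)
    for a, b in zip(tokens, tokens[1:]):
        table[a].append(b)
    return table

def generate(table, length, rng):
    """Sample a token sequence by walking the bigram table."""
    word = rng.choice(list(table))
    out = [word]
    for _ in range(length - 1):
        followers = table.get(word)
        word = rng.choice(followers) if followers else rng.choice(list(table))
        out.append(word)
    return out

rng = random.Random(0)
tokens = open("human_corpus.txt", encoding="utf-8").read().split()  # placeholder human-written corpus

corpus = tokens
for gen in range(1, 11):
    model = train_bigram(corpus)
    corpus = generate(model, length=len(tokens), rng=rng)  # the next generation sees only model output
    print(f"generation {gen}: distinct words = {len(set(corpus))}, "
          f"distinct bigrams = {len(set(zip(corpus, corpus[1:])))}")
```

Diversity measures such as the number of distinct words and bigrams typically shrink generation after generation, a miniature version of the compounding errors described above.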
AI Ouroboros?
This problem, termed “model collapse,” is not confined to language models alone. It has been observed in various AI models, including a diffusion model used for generating images. So far, researchers have experimented with relatively modest models, programs that are smaller and use less training data than the likes of the language model GPT-4 or the image generator Stable Diffusion. It is possible that larger models will prove more resistant to model collapse, but researchers say there is little reason to believe so. Models are especially vulnerable at the data “tails,” the elements that are less frequently represented in a model’s training set.
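The fragility of the tails can be seen even without a neural network. In the sketch below, an illustrative assumption rather than a result from the studies, each generation's "model" is simply the empirical distribution of a finite sample drawn from the previous one; a rare class that misses a sample once vanishes and can never come back.

```python
import numpy as np

rng = np.random.default_rng(7)

# A skewed "true" distribution: one dominant class plus several rare tail classes.
probs = np.array([0.92, 0.04, 0.02, 0.01, 0.005, 0.005])

for gen in range(1, 11):
    sample = rng.choice(len(probs), size=200, p=probs)   # finite training set drawn from the current model
    counts = np.bincount(sample, minlength=len(probs))
    probs = counts / counts.sum()                        # the next model is just the empirical distribution
    print(f"gen {gen}: surviving tail classes = {(probs[1:] > 0).sum()}, probs = {np.round(probs, 3)}")
```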
In July 2023, just eight months after ChatGPT’s launch, a study revealed a noticeable decline in the model’s ability to write code and perform tasks compared with its initial release. Researchers from Stanford University and the University of California, Berkeley tested code generated by GPT-3.5 and GPT-4, comparing the March and June 2023 versions. The findings showed a significant drop in directly executable responses: only 10% of GPT-4’s outputs were executable in June, down from the 53% seen in March, while GPT-3.5 fell from 22% in March to just 2% in June.
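To give a sense of what that metric means in practice, the sketch below counts how many generated snippets run without raising an exception. The hard-coded snippets are stand-ins for model answers, and a real evaluation would sandbox the execution rather than calling exec directly.

```python
# Placeholder "model answers" to coding prompts; a real evaluation collects these from the model.
snippets = [
    "print(sum(range(10)))",                              # runs cleanly
    "def square(x):\n    return x * x\nprint(square(4))",
    "prnt('misspelled builtin')",                         # NameError at run time
    "def broken(:\n    pass",                             # SyntaxError at compile time
]

def is_executable(code: str) -> bool:
    """Return True if the snippet compiles and runs without raising.
    Never do this with untrusted output outside a sandbox."""
    try:
        exec(compile(code, "<generated>", "exec"), {})
        return True
    except Exception:
        return False

ok = sum(is_executable(s) for s in snippets)
print(f"directly executable: {ok}/{len(snippets)} ({100 * ok / len(snippets):.0f}%)")
```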
Despite the closed nature of OpenAI’s system, it is evident that certain capabilities of the model have diminished over time. This degradation serves as a clear illustration of the consequences when a model lacks access to fresh training data. While language models can learn from generated data, preserving the original data is crucial for sustaining and improving overall model performance.
AI-generated content is already infiltrating areas crucial for training data, such as language models used by mainstream news outlets and even Wikipedia. The concern is that existing tools for training AI models are becoming saturated with synthetic text, raising questions about the quality and reliability of future models.
How can we address this issue?
To address the threat of model collapse, some suggest using curated datasets that are known to be free from generative AI influence. For instance, standardized image datasets curated by humans could provide a source of data untainted by AI-generated content. However, the challenge lies in distinguishing human-generated data from synthetic content and filtering out the latter, which is a complex task.
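One hedged sketch of what such curation might look like in a data pipeline is shown below: documents are kept only if they predate a chosen contamination cutoff or if a detector scores them as likely human-written. Both the cutoff date and the detect_ai_text function are hypothetical placeholders; reliable detection of machine-generated text remains an open problem.

```python
from datetime import date

AI_CONTENT_CUTOFF = date(2022, 11, 30)  # hypothetical cutoff: roughly ChatGPT's public release

def detect_ai_text(text: str) -> float:
    """Hypothetical detector returning an estimated probability that the text is AI-generated.
    A real pipeline would plug a classifier in here; no reliable one is assumed to exist."""
    raise NotImplementedError

def keep_for_training(doc: dict) -> bool:
    """Keep documents crawled before the cutoff, or those a detector scores as likely human."""
    if doc["crawl_date"] < AI_CONTENT_CUTOFF:
        return True               # predates widespread generative AI output
    try:
        return detect_ai_text(doc["text"]) < 0.5
    except NotImplementedError:
        return False              # when in doubt, leave it out of the training set

corpus = [
    {"text": "A hand-written 2019 blog post.", "crawl_date": date(2019, 5, 1)},
    {"text": "A 2024 page of unknown provenance.", "crawl_date": date(2024, 2, 10)},
]
training_set = [doc for doc in corpus if keep_for_training(doc)]
print(f"kept {len(training_set)} of {len(corpus)} documents")
```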