Model Collapse: The Self-destructive Cycle of GenAI
On how AI-generated data leads to declining model performance.
Tools like ChatGPT and Midjourney have become daily companions for many, and the introduction of GenAI into the enterprise world is being talked about everywhere. If we trust some projections, progress will be essentially infinite and everything comes down to scale: scale of data centres, scale of compute and scale of data.
Contrary to these predictions, recent research has identified a structural phenomenon now coined ‘model collapse’: a gradual decline that affects AI models, in which the data they generate ends up polluting the training set of the next generation, leading those models, over time, to misperceive reality. Because of this, I – and others before me – have underlined the necessity of segregating ‘clean’, human-only generated data.
Formalised insights (here and here) into model collapse have recently been published, so I want to examine what model collapse is and why it happens, review the suggested mitigation strategies, and consider the implications it might have for the future of GenAI.
I. What is Model Collapse?
Model collapse is a degenerative process where AI models progressively degrade when trained on data generated by other AI models, often referred to as ‘synthetic data’. Over time, models lose their ability to accurately represent the original data distribution, leading to loss of valuable semantic information and ultimately poor performance.
Nota bene: What is ‘synthetic data’? Synthetic data refers to artificially generated information created by AI models rather than collected from real-world events. In the context of AI models and their training, synthetic data is mostly used to supplement real-world data, to increase the diversity and extent of the datasets on which models can be trained effectively. The concept has been used in a wide range of use cases. For example, it has been used to generate complex datasets for training algorithms that interpret surveillance footage for military or police operations and need to understand subtle details in captured scenes, or to create language models that act as science tutors.
In other words, the initially promising idea of using GenAI to generate an ever larger training corpus didn’t account for the possibility that data generated in this way might lack exactly the subtle semantic nuance that allowed the success of the neural network approach to begin with – a success many continue to wonder about, asking why in fact it works at all!
II. The Mechanism of Model Collapse
A simple (somewhat oversimplifying) analogy is that of a photocopy of a photocopy of a photocopy. Each new copy loses a bit of the original’s detail, leading, over time, to a blurry and inaccurate representation of the original. But let’s take a closer look at the key mechanisms contributing to model collapse, starting with tail cutting and tail narrowing.
Tail cutting and narrowing
Statistics and data distributions are among the main reasons for the imperceptibly different data GenAI creates. Let’s look at this intuitively and then in more detail: if you recall – and, to keep things tidy, let’s just stick with LLMs here for the moment, though the issue applies similarly to other models – these models predict the most likely next token, taking the context window into account.
Nota bene: this is not entirely true, as going with the highest-probability token doesn’t actually lead to the best results – as rated by humans. You can read about this in Stephen Wolfram’s very accessible book ‘What is ChatGPT doing and why does it work?’.
With this in mind, it should be intuitive that the distribution of words in datasets generated by an LLM shifts towards the ‘more likely words’, and away from the distribution found in texts written by humans, who have no such programmed tendency. The image below illustrates this very well:
Taken from: Figure 2, A Tale of Tails, arXiv:2402.07043v2
Already at temperature 1 (yellow) the perplexity distribution shifts away from the tail on the right, and as the temperature is decreased it moves all the way to the left, resulting in a much narrower distribution of data – which means a stronger preference for selecting the higher-probability word predictions.
Nota bene: Temperature is a method used to control the randomness of predictions in language models. A temperature of 1 represents a neutral setting. Lowering the temperature makes the model more confident by sharpening the probability distribution, meaning it will prefer high-probability words more strongly. Conversely, increasing the temperature flattens the distribution, introducing more randomness and diversity into the predictions.
Perplexity in LLMs measures a model's ability to predict the next word in a sequence, with lower perplexity indicating higher prediction accuracy.
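To make the temperature mechanics concrete, here is a minimal sketch in plain Python/NumPy, with made-up logits, showing how dividing the logits by a temperature below 1 sharpens the next-token distribution and starves its tail:

```python
import numpy as np

def next_token_distribution(logits, temperature=1.0):
    """Turn raw logits into a sampling distribution at a given temperature."""
    scaled = logits / temperature      # T < 1 sharpens, T > 1 flattens
    scaled -= scaled.max()             # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Hypothetical logits: one dominant token followed by increasingly rare ones.
logits = np.array([4.0, 2.5, 2.0, 1.0, 0.5])

for T in (1.5, 1.0, 0.7, 0.3):
    p = next_token_distribution(logits, T)
    print(f"T={T}: top token p={p[0]:.2f}, rarest token p={p[-1]:.4f}")
```

At T=0.3 the rarest token has all but vanished from the distribution, which is exactly the loss of tail mass visible in the figure above.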
A more detailed explanation of why GenAI data suffers from tail cutting and tail narrowing is as follows:
a) Tail cutting
Tail cutting refers to the deliberate or accidental truncation of the data distribution’s tail, which means ignoring the less frequent, but potentially significant, data points. When AI models generate new data, they often rely on sampling methods that prioritize more likely data points and discard less likely ones. Even the fact that training data is by definition finite leads to de-facto tail cutting via sampling bias.
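As an illustration of how such truncation works in practice, here is a minimal sketch of top-k sampling – chosen purely as a familiar example of a truncation strategy, not as the specific method analysed in the papers – which zeroes out the tail of the next-token distribution entirely:

```python
import numpy as np

def top_k_truncate(probs, k=3):
    """Keep only the k most likely tokens and renormalise; every other
    token (the tail of the distribution) gets probability zero."""
    probs = np.asarray(probs, dtype=float)
    cutoff = np.sort(probs)[-k]                    # k-th largest probability
    truncated = np.where(probs >= cutoff, probs, 0.0)
    return truncated / truncated.sum()

# Hypothetical next-token distribution with a long tail of rare tokens.
p = np.array([0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01])
print(top_k_truncate(p, k=3))   # the rare tokens can no longer be generated at all
```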
Nota bene: This is in my opinion one of the strongest criticisms of the current approach to AI: where evolution has created a constant feedback loop between the environment and any biological intelligence, via sensory feedback that is processed by the brain, AI is not coupled to the world but presented with a predetermined dataset (a subset of reality) that is meant to serve as all the ‘education’ the model receives.
b) Tail narrowing
Tail narrowing involves the compression of the data distribution’s tail (the part representing lower-probability tokens): the diversity of the data points in the tail is reduced, making the distribution narrower and more peaked and thereby placing more emphasis on higher-probability predictions.
This results in a truncated and less diverse data distribution, and the model loses its ability to handle rare but important events or details, which are especially crucial for generalization. Over generations, this lack of diversity accumulates, causing significant degradation in model performance.
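To see how this compounds across generations, here is a toy sketch with entirely made-up token probabilities (not the experimental setup of the cited papers): each generation is trained on the slightly sharpened, slightly truncated output of the previous one, and both the surviving vocabulary and the entropy of the distribution shrink:

```python
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical token distribution of the original, human-written corpus.
p = np.array([0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.03, 0.02])

for generation in range(5):
    surviving = int((p > 0).sum())
    print(f"gen {generation}: {surviving} tokens survive, entropy = {entropy_bits(p):.2f} bits")
    # The next generation slightly over-weights likely tokens
    # (a stand-in for sampling at T < 1) ...
    p = p ** 1.3
    # ... and tokens that have become vanishingly rare drop out of the corpus.
    p = np.where(p / p.sum() < 0.01, 0.0, p)
    p = p / p.sum()
```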
Functional expressivity error
A more general problem rests in AI models’ insufficient capacity to capture complex patterns in the data. While they can approximate many functions well, they would need to grow infinitely large to do so with perfection. Because of this size limitation, a model might mistakenly assign probability to outcomes that should be impossible, or none to outcomes that should be possible. For example, a model might be designed to use a single bell curve to represent a dataset that in reality is best represented by two (partially) overlapping bell curves. Even with all the data in the world, the model still won’t be able to perfectly represent the true distribution, so model errors are inevitable, leading to new data that does not represent the world accurately in certain details.
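A minimal sketch of this expressivity mismatch, using made-up numbers: fit a single bell curve to data that really comes from two overlapping bell curves, and the fitted model places a large share of its probability mass in a region where real data is rare – exactly the kind of error that then leaks into the next generation’s training data:

```python
import numpy as np

rng = np.random.default_rng(42)

# 'True world': a mixture of two (partially) overlapping bell curves.
real = np.concatenate([rng.normal(-2.0, 0.7, 5000),
                       rng.normal(+2.0, 0.7, 5000)])

# Under-expressive model: a single Gaussian fitted to the same data.
mu, sigma = real.mean(), real.std()
model_samples = rng.normal(mu, sigma, 10_000)

# Share of samples near zero, where the real distribution has a dip.
real_near_zero  = np.mean(np.abs(real) < 0.5)
model_near_zero = np.mean(np.abs(model_samples) < 0.5)
print(f"real data near 0: {real_near_zero:.1%}, model samples near 0: {model_near_zero:.1%}")
```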
Consequential loss of scaling laws
Scaling laws have driven much of the recent success in AI: a neural net’s performance increases with the amount of training data, the model size and the compute applied. All else being equal, a model trained on 10x the data, 10x in size, running on a 10x chip cluster won’t be 10x better than the original, but it will be noticeably better. In this way scaling laws are intimately related to the emergence of abilities in larger models that are not present in smaller ones. This bolsters the now common narrative that in AI scaling is like a law of physics and ‘all you need’.
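As a rough illustration (the exponent below is a made-up but plausible value, not a measured one), a power-law scaling relationship delivers noticeable, though far from proportional, gains for each 10x increase in data:

```python
def loss(data_tokens, alpha=0.095, scale=1e13):
    """Toy power-law scaling curve: loss falls as a power of the data size."""
    return (scale / data_tokens) ** alpha

for tokens in (1e10, 1e11, 1e12):
    print(f"{tokens:.0e} tokens -> loss {loss(tokens):.2f}")
```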
Reality tells a different story: when synthetic data is used, these scaling laws break down, leading over time to model collapse.
III. Mitigation Strategies
But it’s not all doom and gloom either. Just as the phenomenon of model collapse is being analysed in more detail, research is pointing to mitigation strategies. When you go through the list – and I am sure other techniques exist and will evolve – notice, however, the emerging theme: human oversight.
Mixing AI-generated data with real data
Findings suggest that mixing AI-generated data with original, human-generated data helps maintain the diversity and richness of the training data and can moderate model collapse. Even a small proportion of original data can sustain model performance and prevent the degradation that results from training solely on synthetic data. Here, the robustness of real data is used to balance the potential inaccuracies introduced by AI-generated content.
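A minimal sketch of what such a mix might look like in a data pipeline; the function name and the 10% share of real data are illustrative choices of mine, not recommendations from the research:

```python
import random

def build_training_mix(real_docs, synthetic_docs, real_fraction=0.1):
    """Blend a guaranteed share of human-written documents into an
    otherwise synthetic corpus."""
    n_real = int(len(synthetic_docs) * real_fraction / (1 - real_fraction))
    mix = random.sample(real_docs, min(n_real, len(real_docs))) + list(synthetic_docs)
    random.shuffle(mix)
    return mix

# Hypothetical usage with placeholder documents.
real = [f"human_doc_{i}" for i in range(1_000)]
synth = [f"generated_doc_{i}" for i in range(9_000)]
corpus = build_training_mix(real, synth, real_fraction=0.1)
print(len(corpus), corpus[:2])
```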
Fine-tuning and RLHF
Fine-tuning with expert human verifiers and Reinforcement Learning from Human Feedback (RLHF), during which human trainers assess and correct the model outputs and this feedback is used to further adjust the model weights, can significantly enhance the quality of AI-generated data and reduce the likelihood of errors propagating through successive generations of model training. There are also attempts to rely on LLM-to-LLM methods for this.
Data Selection and Pruning
Data selection and pruning mechanisms can help filter out low-quality AI-generated data before it is used for training.
Nota bene: Pruning in machine learning involves selectively removing unnecessary or less important components, such as neurons in a neural network, branches in a decision tree or redundant data points in a dataset. This process helps streamline models and datasets by reducing complexity, minimising overfitting and improving efficiency, leading to enhanced model performance.
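A minimal sketch of such a filtering step; the quality threshold and the scoring function are placeholders for whatever signal is actually available, be it a human rating, a classifier or a reference model’s perplexity:

```python
def prune_synthetic(samples, score_fn, min_score=0.8):
    """Keep only the synthetic samples whose quality score clears the threshold."""
    return [s for s in samples if score_fn(s) >= min_score]

# Hypothetical usage with toy scores standing in for human or model ratings.
samples = ["good synthetic passage", "garbled synthetic passage", "decent passage"]
scores = {"good synthetic passage": 0.95,
          "garbled synthetic passage": 0.30,
          "decent passage": 0.82}
print(prune_synthetic(samples, scores.get))   # the garbled sample is discarded
```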
People like the trainers in RLHF assess the quality of AI-generated data and discard samples that do not meet specific standards. By ensuring that only high-quality synthesized data is used for training, the overall performance of the models can be maintained and the risk of model collapse reduced. The catch, of course, is that this is a version of the chicken-and-egg problem: machine learning and statistics are as useful as they are precisely because they can draw attention to tendencies and patterns that were imperceptible to humans who had studied the data in detail.
IV. Implications for AI Progress and Development
As AI-generated content becomes more prevalent, maintaining access to original, human-generated data seems crucial. Whether the issue is underestimated or overstated depends, as so often, on perspective: on the one hand, current generations of models were trained on predominantly human-generated data at the scale of the entire internet. They may thereby already have exhausted the available ‘clean’ training data, so that there is nowhere to go for additional scaling with more ‘clean’ training data.
On the other hand, the operative word here is probably ‘public’. Transgressions by companies accessing paywalled and thereby semi-private data aside, enterprises around the world do sit on vastly more data. Estimates are difficult, but on the assumption that these private data repositories are domain-specific, there are probably some specific use cases that can be squeezed out of this private data, provided, of course, that enterprises keep their data ‘clean’, which I suspect very few do.
A take-away at this point, nonetheless, is that one of the key assumptions behind infinite scaling of the current technology doesn’t hold true, at least not in the form of a straightforward ‘we can simply extrapolate the straight line into the future’ approach that some assert, like Leopold Aschenbrenner in his book Situational Awareness, where he predicts that
“AGI by 2027 is strikingly plausible.”
We will, for the time being, continue to require human involvement in the process of dataset curation specifically, but also in AI development more generally.
V. Conclusion
Model collapse poses a challenge to the current architectures of our most advanced GenAI models. Without exploring alternatives, we might be witnessing a slowing of AI development. Understanding this phenomenon and implementing strategies to mitigate its effects will be crucial in advancing current models. Considering that mitigation seems to rely mostly on human intervention, the future of models at infinite scale based on synthetic data is yet to arrive.