
AI has poisoned its own well – Tracy Durnell



Replied to The Curse of Recursion: Training on Generated Data Makes Models Forget (arXiv.org)

What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
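The core mechanism is easy to see in one dimension. Below is a minimal toy sketch (an illustration, not the paper's experiment): each "generation" fits a Gaussian to samples drawn from the previous generation's fit, so estimation error compounds and the fitted spread drifts toward zero, wiping out the tails of the original distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples = 100       # data available per generation
n_generations = 300   # how many times models train on model output

# Generation 0 trains on genuine data: N(0, 1).
data = rng.normal(0.0, 1.0, n_samples)

for gen in range(1, n_generations + 1):
    # "Train" a model: maximum-likelihood Gaussian fit to the current data.
    mu_hat, sigma_hat = data.mean(), data.std()
    # The next generation sees only synthetic samples from that fit.
    data = rng.normal(mu_hat, sigma_hat, n_samples)
    if gen % 50 == 0:
        print(f"generation {gen:3d}: mean={mu_hat:+.3f}, std={sigma_hat:.3f}")

# Typical outcome (it varies by seed): the fitted std drifts well below 1.0
# as generations pass, meaning the tails of the original distribution have
# effectively vanished.
```

The paper's point is that the same dynamic appears in VAEs, Gaussian Mixture Models, and LLMs, where the lost tail mass corresponds to rare and unusual content.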

I suspect tech companies (particularly Microsoft / OpenAI and Google) have miscalculated and, in their fear of being left behind, have released their generative AI models too early and too widely. By doing so, they’ve essentially put a ceiling on how much their products can improve, thanks to the threat of model collapse. I don’t think the quality generative AI can reach on a poisoned data supply will be good enough to get rid of all us plebs.

They need an astronomical amount of training data to make any model better than what already exists. By releasing their models for public use now, while they’re not very good yet, they’ve let people pump the internet full of mediocre generated content with no indication of provenance. Stack Overflow has thrown up its hands and said it can’t moderate generative AI content, meaning the site can no longer serve as a training source for coding material. Publishers of formerly reputable sites are laying off their staff and experimenting with AI-generated articles. There is no consistent system for marking up generated content online that would let companies trust material of unknown origin as training data. Because of this approach, 2022 and 2023 will be essentially “lost years” of internet-sourced content, even if a tagging system is established going forward, and even if people who are hostile or ambivalent toward these companies can be persuaded to use it.
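For concreteness, here is a hypothetical sketch of what the missing tagging system might enable. Nothing like the `provenance` field below is standardized today; the names are invented for illustration.

```python
# Hypothetical only: no standard provenance markup exists today, which is
# the problem described above. If one did, a crawler could exclude both
# generated and unlabeled content from training corpora like this.

documents = [
    {"url": "https://example.com/a", "text": "...", "provenance": "human"},
    {"url": "https://example.com/b", "text": "...", "provenance": "llm-generated"},
    {"url": "https://example.com/c", "text": "..."},  # no tag: origin unknown
]

def trusted_for_training(doc: dict) -> bool:
    # Only content positively marked as human-authored is trusted; an
    # absent tag proves nothing, so unlabeled documents are excluded too.
    return doc.get("provenance") == "human"

corpus = [d for d in documents if trusted_for_training(d)]
print([d["url"] for d in corpus])  # only example.com/a survives
```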

In their haste to propagandize the benefits of generative AI and drive adoption so widespread it can’t be stopped, they’re already encouraging people to lean on LLMs to write for them. Microsoft is plugging AI tools into its flagship Office suite. Writing well is a skill few possess today, and these companies are creating an environment where even fewer people will bother to learn and practice professional writing. As time goes by, there will be less human-created material (especially material of quality and complexity) available as new training data.

Obtaining quality training data is going to be very expensive in five years if AI companies don’t win all their lawsuits over whether training on scraped work is fair use. By allowing us a glimpse into their vision and process, they’ve turned nearly every professional artist and writer against them as an existential threat. Even if they do win their fair use lawsuits, accessing the data may still be a challenge: every creative person who relies on their work for pay will do everything they can to keep their creations from becoming future training data.

Even worse, these companies’ misuse of the internet commons (humanity’s collective creativity) as fuel for their own profit could lead to the fragmentation and closing off of online information to prevent its theft. Bloggers don’t want their words stolen, and social media companies are getting wise to the value of “their” data and beginning to charge for API access. The difficulty and cost of gathering sufficiently high-quality training data for future models will incentivize continued use of whatever is easiest to grab, only hastening model collapse and increasing the likelihood that malicious actors perpetrate poisoning attacks.