Gen AI to Flood the Zone with Spam?
There are risks to LLMs being trained on increasingly corrupt data generated by other gen AI systems with hallucinatory tendencies
"Snakes Eating Tails" by FCCA/NCSU is licensed under CC BY-NC 2.0.
What are the risks to artificial intelligence (AI)? After generative AI (gen AI) exploded into the news with the launch of ChatGPT in November 2022, I ran a couple of columns on Sharpen Your Axe exploring the idea that the tech could end up destroying humanity. I now think this risk is probably a little overblown.
We can all see that gen AI is very good at spotting which words look convincing next to other words. This is a little different from genuine intelligence, and might or might not tell us something about the world outside our heads. As a result, chatbot tech seems unlikely to push us to our doom, at least any time soon.
Even so, the risk of AI destroying humanity clearly isn’t zero. I still believe we should be very hesitant about letting AIs control the nuclear codes, for example. This is a particularly important point as the data centres needed to supersize the tech will increasingly draw on nuclear power. We should also apply plenty of regulatory oversight to the entrepreneurs who are developing killer drones for Western armed forces, as I discussed in the column on the new right of Silicon Valley.
This week’s column looks at a much more mundane risk, one that wasn’t particularly clear when we were all beginners with ChatGPT. The speed of gen AI is its most amazing feature: type in a prompt and you get lots of content almost immediately. But the output often looks better than it really is - the tech can produce reams of content, not all of it accurate, and it can hallucinate, all presented with an air of great confidence. If you want a six-word “too long; didn’t read” (TL;DR) version, here it is: “spam, spam, spam, egg and spam.”
We already live in a world where the amount of data is increasing exponentially. All the non-digital data held in US academic research libraries adds up to an estimated 2 petabytes. Google was already adding 20 petabytes of data a day as far back as 2008, and the pace has only accelerated since then. As a result of the growth, 90% of the world’s data was created in the last two years, according to statistics from August 2023.
The amount of data in the world doubles every two years, and gen AI is already accelerating the trend. Gen AI content is free (or very cheap) and extremely fast to produce compared to human-generated content, so the percentage of the world’s data created by gen AI will increase at a staggering rate.
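The back-of-the-envelope maths behind these claims is easy to sketch. The snippet below is my own illustration, not a sourced calculation: assuming smooth exponential growth with a given doubling period, it computes what share of all existing data was created in a recent window.

```python
def share_created_recently(doubling_period_years: float, window_years: float) -> float:
    """Under smooth exponential growth, total data at time t is
    D(t) = D0 * 2 ** (t / doubling_period). The share created in the
    last `window_years` is therefore 1 - 2 ** (-window / period)."""
    return 1 - 2 ** (-window_years / doubling_period_years)

# A two-year doubling period means exactly half of all existing data
# was created in the last two years.
print(f"{share_created_recently(2, 2):.0%}")  # 50%
```

As an aside, a 90% share created in the last two years would imply a doubling period of roughly seven months, so the widely quoted statistics are rougher than they look; the direction of travel, though, is not in doubt.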
Given the tech’s propensity to hallucinate, this is problematic. Future large language models (LLMs) will be trained on increasingly corrupt data, and the errors could build up over successive generations. Researchers call this “model collapse”: LLMs trained on data from other LLMs can quickly spew nonsense.
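The feedback loop can be illustrated with a toy simulation (my own sketch, not the setup used in the research literature): repeatedly fit a simple Gaussian “model” to samples drawn from the previous generation’s model. Each refit loses a little information, and the fitted spread drifts toward zero - a crude analogue of recursively trained models forgetting the tails of real data.

```python
import random
import statistics

def collapse_demo(generations: int = 2000, sample_size: int = 50,
                  seed: int = 42) -> list[float]:
    """Fit a Gaussian to samples from the previous fitted Gaussian, over
    and over. Returns the fitted standard deviation at each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real data" distribution
    trace = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)     # refit the model...
        sigma = statistics.stdev(samples)  # ...on synthetic data only
        trace.append(sigma)
    return trace

trace = collapse_demo()
print(f"spread at generation 0:    {trace[0]:.3f}")
print(f"spread at final generation: {trace[-1]:.6f}")  # drifts toward zero
```

Each individual refit looks harmless, but the estimation noise compounds in one direction, and the simulated “model” ends up far narrower than the real data it started from. Real model collapse is messier than this, but the direction of travel is the same.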
The central problem is that gen AI is much better at generating clickbait material of questionable quality than it is at creating premium content. If this hypothesis is right, there is a real risk that gen AI will “flood the zone with shit,” in the infamous words of far-right agitator Steve Bannon in 2018 (he was talking about using misinformation, disinformation and conspiracy theories to overwhelm fact-checkers).
We can already see some early cases in the wild. For example, wordfreq - an open-source library that tracked word use across the internet - shut down in June. Part of the reason was the way LLMs generate vast amounts of spam masquerading as human-generated content. Another example comes from the good folks at WikiProject AI Cleanup, who are trying to tidy up AI-generated spam on Wikipedia.
Spam is annoying, but there are graver risks too. Experts expect an explosion of cyber-crime as hackers adopt AI tools that let them scale up their activities. Be careful out there!
What do the risks of gen AI mean for you, dear reader? First of all, if you work in content, you should be thinking about premium content that is hard for gen AI to replicate (a point made well by film director Guillermo del Toro recently). It is also worth keeping an eye on the multiple lawsuits by content creators claiming that AI is ripping off their material.
Secondly, libraries, old books and analogue material are likely to gain in value as the amount of AI-generated clickbait grows exponentially, just as vinyl records became fashionable in the streaming age. On a related note, if you are looking for a project, consider one outside the digital world. Experiences are likely to become increasingly important for many. What about live music? Or festivals? Or adventure holidays? Or fine dining? Or slow-cooked food?
Finally, if you think I am being too pessimistic, please check out this essay from the tech-optimist camp. It is also worth flagging the recent Nobel Prize awarded to Demis Hassabis and John Jumper, who developed AlphaFold, an AI system that predicts the 3D structure of proteins from their amino-acid sequences. Their paper from 2021 quickly became one of the most-cited publications of all time.
The difference between hedgehogs (one model of reality) and foxes (multiple overlapping models) is a recurrent theme in this blog. As foxes, we should be unsurprised that some aspects of AI will make it easier for scientists to solve knotty problems, while other aspects are likely to flood the zone with spam and rip off genuinely creative people. The comments are open. See you next week!
Previously on Sharpen Your Axe
Will AI destroy the world? (Part one and part two)
Killer drones and the Silicon Valley right
Exponential growth (part one and part two)
Further Reading
Do Androids Dream of Electric Sheep? by Philip K. Dick
This essay is released with a CC BY-NC-ND license. Please link to sharpenyouraxe.substack.com if you re-use this material.
Sharpen Your Axe is a project to develop a community who want to think critically about the media, conspiracy theories and current affairs without getting conned by gurus selling fringe views. Please subscribe to get this content in your inbox every week. Shares on social media are appreciated!
If this is the first post you have seen, I recommend starting with the third anniversary post. You can also find an ultra-cheap Kindle book here. If you want to read the book on your phone, tablet or computer, you can download the Kindle software for Android, Apple or Windows for free.
Opinions expressed on Substack and Substack Notes, as well as on Bluesky, Mastodon and X (formerly Twitter) are those of Rupert Cocke as an individual and do not reflect the opinions or views of the organization where he works or its subsidiaries.