AI Companies Are Running Out of Internet
Imagine the internet as a giant buffet for AI. We use it for fun, learning, and staying connected, but companies see it as a treasure trove to train their super-smart language models. That's how programs like ChatGPT can not only spout facts but also craft clever responses - they've been trained on a massive mountain of internet content.
Here's the catch: while companies rely on this digital feast to grow their AI, there's a finite amount of food on the table. These AI developers want their creations to keep expanding, fast. But as the Wall Street Journal points out, companies like OpenAI and Google are bumping into a wall. Experts predict they'll be scraping the bottom of the internet barrel in just two years, thanks to a dwindling supply of high-quality data and some companies guarding their information from AI.
AI needs a lot of data
Large language models have a massive appetite for data, and that hunger is only going to grow. Researchers estimate that OpenAI's current model, GPT-4, gobbled up a staggering 12 trillion tokens (think of tokens as bite-sized pieces of text the model can understand). That's roughly nine trillion words, which is a lot to chew on! But that's just the tip of the iceberg. Looking ahead, experts believe GPT-5, OpenAI's next big project, will need a mind-boggling 60 to 100 trillion tokens - that's 45 to 75 trillion words in plain English. Here's the crazy part: even after scraping every corner of the internet for high-quality data, there's still a potential shortfall of 10 to 20 trillion tokens, or even more!
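To make the token-to-word math concrete, here is a minimal sketch of the conversion behind those GPT-5 figures. It assumes a flat ratio of about 0.75 words per token, which is simply what the numbers above imply; real tokenizers vary with language and writing style, and the function name is just illustrative.

```python
# Assumption: roughly 0.75 words per token, the ratio implied by the
# article's figures. Actual ratios depend on the tokenizer and the text.
WORDS_PER_TOKEN = 0.75
TRILLION = 10**12


def tokens_to_words(tokens: float) -> float:
    """Estimate how many words a given number of tokens represents."""
    return tokens * WORDS_PER_TOKEN


# The projected GPT-5 training needs quoted above, in tokens.
projections = {
    "GPT-5 (low estimate)": 60 * TRILLION,
    "GPT-5 (high estimate)": 100 * TRILLION,
}

for label, tokens in projections.items():
    words = tokens_to_words(tokens)
    print(
        f"{label}: {tokens / TRILLION:.0f} trillion tokens "
        f"is about {words / TRILLION:.0f} trillion words"
    )
```

Under that assumed ratio, the low and high estimates come out to 45 and 75 trillion words, matching the figures quoted above.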
While some experts predict this data drought won't become critical until around 2028, AI companies aren't waiting around. They see the storm coming and are scrambling to find alternative training grounds beyond the web.
The AI data problem
There's a catch, though: training giant LLMs like GPT and Gemini isn't easy. First, these models need mountains of data, and the internet is full of junk. No company wants to feed its AI garbage and misinformation, which makes accurate responses harder to produce. We've all seen examples of chatbots spewing nonsense, right? Filtering all that out means there's less good stuff to work with.
Second, there's the whole privacy issue. Have you ever wondered if your online stuff gets used to train AI? Let's just say, AI companies aren't exactly known for asking permission. It's a big business, and some platforms like Reddit are even selling your content to train these models. Thankfully, some folks are fighting back, like the New York Times suing OpenAI. But until there are stronger user protections, your public online data is fair game for AI development.
So where are companies looking for more data? OpenAI, the company behind GPT, is considering transcripts of public videos, such as YouTube videos converted to text. It might even have used the videos themselves to train its AI video generator, Sora. OpenAI is also working on smaller, more specialized AI models, and even on a system that would pay data providers based on the quality of their data.
Is synthetic data the answer?
Some companies are toying with a wild idea: using fake data to train AI models. This "synthetic data" is basically computer-generated info that mimics a real dataset. The goal? To keep the original data private while giving the AI something similar to learn from.
Sounds good in theory, but there's a risk: training on fake data could degrade the AI, a failure researchers call model collapse. Think of it like feeding the AI the same old stuff over and over. It gets stuck, forgets things, and eventually starts spitting out the same answers repeatedly. No more fun, creative AI like ChatGPT - just a boring echo chamber.
That doesn't mean everyone's given up on fake data. Big names like Anthropic and OpenAI think it could still be useful. Hey, if they can figure out how to use it without messing everything up, more power to them! Personally, I wouldn't mind if my embarrassing teenage Facebook posts weren't secretly powering the robot uprising.