AI Companies Face Data Shortage After Using Up All Available Internet Resources for Model Training

By Thea Felicity

Apr 03, 2024 01:30 PM EDT


AI companies have been striving to improve their large language models (LLMs) by training them on vast amounts of data from the internet. However, this intense competition to build better models has nearly exhausted the available internet data, creating a shortage of the training material critical for further model development, according to WSJ.


What AI Companies Can Do To Train Models

To address the shortage, Firstpost reported that AI companies are turning to alternative data sources such as video transcripts and AI-generated "synthetic data." Yet using AI-generated data carries its own risks: it can lead models to produce inaccurate results, raising concerns about their reliability and effectiveness.

Other concerns about synthetic data center on digital "inbreeding," in which excessive reliance on artificially generated data could cause AI models to degrade and eventually fail.

Despite these obstacles, companies like Dataology are developing methods to train large models with fewer resources, while industry giants like OpenAI are weighing unconventional strategies, such as using transcriptions of publicly available YouTube videos, to train their upcoming models.

Such methods are not without controversy, however. OpenAI has faced backlash for using such videos in model training, which risks legal disputes with video creators.

While concerns about the depletion of usable training data persist, some experts remain optimistic, believing that significant breakthroughs could address these concerns in the future. 

However, experts also believe that an alternative solution to this dilemma exists. AI companies could opt to scale back their pursuit of larger and more advanced models, considering the environmental impact associated with their development, including significant energy consumption and reliance on rare-earth minerals for computing chips.


© 2024 VCPOST, All rights reserved. Do not reproduce without permission.
