AI Companies Face Data Shortage After Using Up All Available Internet Resources for Model Training

By Thea Felicity

Apr 03, 2024 01:30 PM EDT

A photo taken on November 23, 2023 shows the logo of the ChatGPT application developed by US artificial intelligence research organization OpenAI on a smartphone screen (L) and the letters AI on a laptop screen in Frankfurt am Main, western Germany. Sam Altman's shock return as chief executive of OpenAI late on November 22 -- days after being sacked -- caps a chaotic period that highlighted deep tensions at the heart of the Artificial Intelligence community. The board that fired Altman from his role as CEO of the ChatGPT creator has been almost entirely replaced following a rebellion by employees, cementing his position at the helm of the firm.
(Photo: KIRILL KUDRYAVTSEV/AFP via Getty Images)

AI companies are striving to improve their large language models (LLMs) by training them on vast amounts of data from the internet. However, this intense competition to build better models has nearly exhausted the available internet data, creating a shortage that could hold back further model development, according to WSJ.

What AI Companies Can Do To Train Models

To address the shortage, AI companies are seeking other data sources such as video transcripts and AI-generated "synthetic data," First Post reported. Yet AI-generated data comes with its own issues: it may lead AI models to produce inaccurate results, raising concerns about their reliability and effectiveness.

Other concerns about synthetic data center on digital "inbreeding," where excessive dependence on artificially generated data might cause AI models to fail.

Despite these obstacles, companies like DatologyAI are developing methods to train expansive models with fewer resources, while industry giants like OpenAI are considering other strategies, such as using transcriptions of publicly available YouTube videos, to train their upcoming models.

Such methods are not without controversy. OpenAI faces backlash for using such videos in model training and risks legal disputes with video creators.

While concerns about the depletion of usable training data persist, some experts remain optimistic, believing that significant breakthroughs could address these concerns in the future. 

Experts also point to an alternative solution to this dilemma. AI companies could scale back their pursuit of larger and more advanced models, given the environmental impact of developing them, including significant energy consumption and reliance on rare-earth minerals for computing chips.

© 2024 VCPOST, All rights reserved. Do not reproduce without permission.
