OpenAI Reportedly Transcribed YouTube Videos for AI Training

US tech giants OpenAI Inc., Alphabet Inc.’s Google LLC, and Meta Platforms Inc. have reportedly adopted artificial intelligence (AI) training methods that fall into a gray area of copyright law, as companies find it increasingly difficult to obtain quality training data.

A report from a US newspaper alleged that OpenAI has used its speech recognition tool, Whisper, to transcribe audio from more than a million hours of videos on Google’s YouTube platform.

The firm, with the help of its President Greg Brockman, purportedly utilized the transcripts to create conversational texts to train its latest large language model (LLM) GPT-4.

The news came after the ChatGPT developer reportedly extracted data from YouTube videos and podcasts to train two of its AI models. Google has also made a similar move for AI training, according to the report.

San Francisco-based OpenAI was reportedly aware that the legality of its action was uncertain but considered it to be fair use. Additionally, the OpenAI team reportedly held internal talks about whether transcribing YouTube videos may violate the video-sharing platform’s policies.

On Meta’s side, the report said the Facebook parent had explored the option of acquiring US publishing group Simon & Schuster LLC to access long-form content that it can feed into its AI model.

Meta was also reported to have considered gathering copyrighted works from the Internet, as licensing discussions with publishers, artists, and media companies would take more time.

YouTube Chief Executive Neal Mohan last week acknowledged that OpenAI may have used YouTube to train its video-generating model Sora.

Google spokesperson Matt Bryan said the company addresses such unauthorized use with ‘technical and legal measures’ provided it has a legitimate legal or technical basis to do so.

AI Companies Struggle to Find Broader Data Access

The report came as AI companies attempting to build more powerful AI models may have been struggling to acquire vast amounts of information that they could use to help their systems learn.

OpenAI, Google, and the wider AI industry are facing a rapidly diminishing supply of training data for their AI systems, which become more advanced as more data are fed into them. A week earlier, it was reported that firms may outpace the supply of new content by 2028.

Strong demand is exhausting the available sources of high-quality text data online, and certain data owners are blocking AI companies from accessing their information.

Some executives and researchers now expect to see firms slowing down on AI development as the demand for quality data potentially exceeds supply in two years.

Such a concern can be addressed by allowing AI models to learn from ‘synthetic’ data, which is information they generate themselves.

Another option is ‘curriculum learning,’ a tactic that trains AI systems by feeding them quality data in a controlled order, which may enable them to form more intelligent links between concepts with significantly less data.

Still, the two methods have yet to be tested.
