OpenAI & Google Accused Of Feeding AI With Millions Of YouTube Videos

By Mikelle Leow, 09 Apr 2024

Illustrations 117499893 © Darakchi and 189206689 © Prabath Gunasekara | Dreamstime.com, background generated with AI

Get ready with tech giants as they get a little too creative to give their artificial intelligence models a glow-up.

The New York Times has alleged that OpenAI used its Whisper speech recognition tool to transcribe millions of YouTube videos, amounting to over a million hours of content. These transcripts were then supposedly fed into the ever-powerful GPT-4 model, known for its impressive text-generating skills.

This practice, seen as a potential violation of creators’ copyrights, appears to be in conflict with YouTube’s policies, which prohibit unauthorized scraping or downloading of content.

The Times, in a separate, high-profile legal dispute with OpenAI, has already accused the latter of copyright infringement for using millions of its articles to train large language models without permission. This new report on YouTube videos adds another layer to the ongoing battle.

Google was purportedly aware of OpenAI’s actions but did not intervene because it was using YouTube videos to train its own AI models, per the report. The datasets are said to contain not only YouTube content but also files from Google Docs and Google Sheets.

Google, for its part, made revisions to its privacy policy in June 2023 to broaden the scope of its use of publicly available content, including the training of its Bard language model. These changes, which went into effect in July, were described by the tech giant as an effort to clarify its practices, asserting that any use of data for AI training is conducted with the explicit consent of users participating in experimental features.

Despite these assurances, the timing of its privacy policy update has raised eyebrows, coming as it does amid allegations of widespread use of YouTube content to train AI models.

The supposed revelation brings to mind a viral moment from last month when OpenAI CTO Mira Murati was quizzed by Wall Street Journal tech journalist Joanna Stern about how the company trained Sora, OpenAI’s impressive text-to-video AI model. Murati shared that the AI firm used “publicly available data and licensed data,” which prompted Stern to ask if she meant YouTube videos.

Murati, clearly uncomfortable, hesitated with a mouth twist before responding, “I’m actually not sure about that.”

If you're a tech executive the minimum you need is a good poker face when you're asked a question like "Is your model trained on YouTube data?" https://t.co/UQvS5VXuRQ pic.twitter.com/l6USDILWN8
— Chris Stokel-Walker (@stokel) March 14, 2024

OpenAI and Google haven’t commented on the allegations from the exposé yet.

[via Inc., Futurism, 9to5Google, New York Times, images via various sources]
