OpenAI & Google Accused Of Feeding AI With Millions Of YouTube Videos
By Mikelle Leow, 09 Apr 2024
Illustrations 117499893 © Darakchi and 189206689 © Prabath Gunasekara | Dreamstime.com, background generated on AI
Tech giants may have gotten a little too creative in giving their artificial intelligence models a glow-up.
The New York Times has alleged that OpenAI used its Whisper speech recognition tool to transcribe millions of YouTube videos, amounting to over a million hours of content. These transcripts were then supposedly fed into the ever-powerful GPT-4 model, known for its impressive text-generating skills.
This practice, seen as a potential violation of creators’ copyrights, appears to be in conflict with YouTube’s policies, which prohibit unauthorized scraping or downloading of content.
The Times, in a separate, high-profile legal dispute with OpenAI, has already accused the latter of copyright infringement for using millions of its articles to train large language models without permission. This new report on YouTube videos adds another layer to the ongoing battle.
Google was purportedly aware of OpenAI’s actions but did not intervene because it was using YouTube videos to train its own AI models, per the report. The datasets are said to contain not only YouTube content but also files from Google Docs and Google Sheets.
Google, for its part, made revisions to its privacy policy in June 2023 to broaden the scope of its use of publicly available content, including the training of its Bard language model. These changes, which went into effect in July, were described by the tech giant as an effort to clarify its practices, asserting that any use of data for AI training is conducted with the explicit consent of users participating in experimental features.
Despite these assurances, the timing of its privacy policy update has raised eyebrows, coming as it does amid allegations of widespread use of YouTube content to train AI models.
The supposed revelation brings to mind a viral moment from last month when OpenAI CTO Mira Murati was quizzed by Wall Street Journal tech journalist Joanna Stern about how the company trained Sora, OpenAI’s impressive text-to-video AI model. Murati shared that the AI firm used “publicly available data and licensed data,” which prompted Stern to ask if she meant YouTube videos.
Murati, clearly uncomfortable, hesitated with a mouth twist before responding, “I’m actually not sure about that.”
If you're a tech executive the minimum you need is a good poker face when you're asked a question like "Is your model trained on YouTube data?" https://t.co/UQvS5VXuRQ pic.twitter.com/l6USDILWN8
— Chris Stokel-Walker (@stokel) March 14, 2024
OpenAI and Google have yet to comment on the allegations in the exposé.
[via Inc., Futurism, 9to5Google, New York Times, images via various sources]