OpenAI Reportedly Used More Than a Million Hours of YouTube Videos to Train Its Latest AI Model

Published 09/04/2024, 18:43
Updated 09/04/2024, 20:10
Benzinga - by MarketBeat, Benzinga Contributor.

Where does AI training data come from?

A report from The New York Times revealed on Friday that OpenAI may have trained AI models on YouTube video transcriptions, and that Google may have been doing the same thing.

The report found that, in the hunt for fresh digital data to train its newer, smarter AI system, OpenAI researchers created a workaround called Whisper, a tool that could transcribe YouTube videos into text. That text could then be fed in as new training data for a more conversational, next-generation AI.

Developing GPT-4, the powerful AI model behind OpenAI's latest ChatGPT chatbot, drew on more than a million hours of YouTube videos transcribed by Whisper, according to the NYTimes' sources.

The Times reports that OpenAI employees had conversations about how YouTube transcription training data could potentially violate YouTube's rules, but OpenAI decided to move forward anyway with the belief that training AI with the videos was fair use.

Knowledge of where the training data was coming from extended up to senior leadership, according to The Times, with OpenAI's president Greg Brockman even allegedly helping collect videos.

The Wall Street Journal's Joanna Stern interviewed OpenAI's CTO Mira Murati last month and asked her what data was used to train one of OpenAI's most recent products: a tool called Sora that generates videos based on text prompts.

"We used publicly available data and licensed data," Murati said. When Stern asked, "So, videos on YouTube?" Murati replied, "I'm actually not sure about that."

When Stern further asked, "Videos from Facebook, Instagram?" Murati stated, "You know, if they were publicly available, publicly available to use, there might be the data, but I'm not sure. I'm not confident about it."

YouTube CEO Neal Mohan said last week that if OpenAI used YouTube videos to train Sora, that would be a "clear violation" of YouTube's terms of use.

The terms of service "does not allow for things like transcripts or video bits to be downloaded," Mohan told Emily Chang, host of Bloomberg Originals.

Yet five sources told The Times that Google did the same thing as OpenAI, allegedly transcribing YouTube videos to generate new training text for its AI models in a potential violation of copyright law.

Google owns YouTube and told The Times that its AI is "trained on some YouTube content" that its agreements with creators allow.

Lawsuits over training AI with copyrighted material have become widespread in recent years, with authors like Paul Tremblay and Sarah Silverman alleging that their books were part of datasets used to train AI — without their consent.

The lawyers behind these lawsuits, Joseph Saveri and Matthew Butterick, state on their website that generative AI is just "human intelligence, repackaged and divorced from its creators."

More than 15,000 authors signed a letter last year asking big tech CEOs, including those at OpenAI, Google, Microsoft (NASDAQ: MSFT), Meta (NASDAQ: META), and IBM (NYSE: IBM), to obtain writers' consent before training AI with their work, and to credit and compensate them.

It's not just authors: musicians too are feeling the impact of AI. Artists like Billie Eilish and Jon Bon Jovi signed an open letter last week accusing big tech companies of using their work to train models without permission or compensation.

"These efforts are directly aimed at replacing the work of human artists with massive quantities of AI-created 'sounds' and 'images' that substantially dilute the royalty pools that are paid out to artists," the letter stated.

Last month, Tennessee became the first state to pass legislation protecting artists from deepfakes, or cloned and manipulated versions of their voices.

The article "OpenAI Reportedly Used More Than a Million Hours of YouTube Videos to Train Its Latest AI Model" first appeared on MarketBeat.

Read the original article on Benzinga
