YouTube Creators Blindsided: Billions Of Views Fueling AI Without Permission

Published on July 18, 2024

A new investigation reveals that major tech companies have used content from thousands of YouTube videos to train artificial intelligence models without creators’ knowledge or consent. WIRED reported the findings, citing research by Proof News that sheds light on the widespread practice of harvesting online content for AI development.

According to the report, subtitles from 173,536 YouTube videos, sourced from over 48,000 channels, were used by prominent tech firms, including Anthropic, Nvidia, Apple, and Salesforce. This dataset, known as YouTube Subtitles, contains transcripts from a wide range of creators, from educational channels like Khan Academy and MIT to popular entertainment shows such as “The Late Show With Stephen Colbert” and “Jimmy Kimmel Live.”

The investigation found that content from YouTube’s biggest stars was also included in the dataset. Videos from channels like MrBeast (289 million subscribers), Marques Brownlee (19 million subscribers), Jacksepticeye (nearly 31 million subscribers), and PewDiePie (111 million subscribers) were among those used for AI training.

Dave Wiskus, CEO of Nebula, a streaming service partially owned by creators, expressed concern about the practice. “It’s theft,” he stated, adding that it’s “disrespectful” to use creators’ work without consent, especially given the potential for generative AI to replace artists.

According to WIRED, the dataset in question is part of a larger compilation called the Pile, released by the nonprofit EleutherAI. The Pile includes material from various sources, such as the European Parliament, English Wikipedia, and even Enron Corporation employees’ emails. The dataset was intended to lower barriers to AI development, and it has been used by both academic researchers and major tech companies.

Apple, Nvidia, and Salesforce have reportedly used the Pile to train their AI models. Apple’s recent OpenELM model, released weeks before the company announced new AI capabilities for iPhones and MacBooks, was trained using this dataset. Anthropic, which recently received a $4 billion investment from Amazon, confirmed using the Pile to develop its AI assistant, Claude, WIRED reports.

Many content creators were unaware that their work had been used for AI training. David Pakman, host of “The David Pakman Show,” a politics channel with over 2 million subscribers, had nearly 160 videos in the dataset without his knowledge. Pakman argues that creators should be compensated if AI companies profit from using their content.

The YouTube Subtitles dataset also contains subtitles from more than 12,000 videos that have since been deleted from YouTube, raising questions about data retention and usage rights.