Apple and NVIDIA busted swiping YouTube videos to train AI models

In early April, YouTube sent a clear message to AI model developers: downloading data from the platform and using it to train AI models violates YouTube’s terms of service.

That same week, a Google spokesperson reinforced the position, telling the New York Times that any “unauthorized scraping or downloading of YouTube content” is prohibited. However, a new report from Proof News has found that YouTube has been scraped for its data, and that some of the biggest tech companies advancing AI have used it to train models.

According to the Proof News investigation, subtitles from 173,536 YouTube videos were siphoned from more than 48,000 channels, including those of prominent creators such as MKBHD (19 million subscribers), MrBeast (289 million), Jacksepticeye (31 million), and PewDiePie (111 million), as well as Stephen Colbert, John Oliver, Jimmy Kimmel, and more. Notably, the transcriptions in question are the videos’ subtitle files.

The report found that Apple, NVIDIA, Salesforce, Anthropic, and others used a dataset called the Pile, which is openly available to anyone with internet access. Moreover, the report notes that Apple, NVIDIA, and Salesforce have stated in their respective research papers that the Pile was used to train their AI models. In Apple’s case, the Pile dataset was used to train OpenELM, a new AI model that was released in April, only weeks before the Cupertino company unveiled Apple Intelligence.

It should be noted that none of the big tech companies listed above downloaded the YouTube video transcriptions themselves. That was done by EleutherAI, which created the dataset for educational and academic purposes. However, it appears big tech discovered the dataset and decided to use it to train their models. That raises the question of what happens when a company trains an AI model on a third-party dataset containing data that users didn’t consent to having used for training purposes.

“AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.

“Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, NVIDIA, Apple, and Salesforce. The dataset, called YouTube Subtitles, contains video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI, as did ‘The Late Show With Stephen Colbert,’ ‘Last Week Tonight With John Oliver,’ and ‘Jimmy Kimmel Live,’” reads Proof News’ YouTube description.