Apple and NVIDIA busted swiping YouTube videos to train AI models

In early April, YouTube sent a clear message to AI model developers: downloading data from the platform and using it to train AI models violates YouTube’s terms of service.

A Google spokesperson reinforced that position the same week, telling the New York Times that any “unauthorized scraping or downloading of YouTube content” is prohibited. However, a new report from Proof News has found that YouTube has been scraped for its data, and that some of the biggest tech companies advancing AI have used it to train models.

According to a Proof News investigation, subtitles from 173,536 YouTube videos were siphoned from more than 48,000 channels, including those of prominent creators such as MKBHD (19 million subscribers), MrBeast (289 million), Jacksepticeye (31 million), and PewDiePie (111 million), as well as Stephen Colbert, John Oliver, Jimmy Kimmel, and more. Notably, the scraped material consists of subtitle files, meaning transcriptions of the videos rather than the videos themselves.

The report found that Apple, NVIDIA, Salesforce, Anthropic, and others used a dataset called the Pile, which is openly accessible to anyone with internet access. The report adds that Apple, NVIDIA, and Salesforce have stated in their respective research papers that the Pile was used to train their AI models. In Apple’s case, the Pile was used to train OpenELM, a new AI model released in April, only weeks before the Cupertino company unveiled Apple Intelligence.

It should be noted that none of the big tech companies listed above downloaded the YouTube video transcriptions themselves. That was done by EleutherAI, which created the dataset for educational and academic purposes. However, it appears big tech discovered the dataset and decided to use it to train their models. This raises a question: what happens when a company trains an AI model on a third-party dataset that contains data its creators never consented to have used for training?

“AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.

“Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, NVIDIA, Apple, and Salesforce. The dataset, called YouTube Subtitles, contains video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI, as did ‘The Late Show With Stephen Colbert,’ ‘Last Week Tonight With John Oliver,’ and ‘Jimmy Kimmel Live,’” reads Proof News’ YouTube description.
