YouTube channels from major news publishers and creators were in video data sets used by Microsoft, Meta, Snap, Runway, and ByteDance.
This article was originally published by Nieman Lab on 30/10/2025 and is hereby reproduced by iMEdD with permission. Any reprint permissions are subject to the original publisher. Read the original article here.
Main image: Shutterstock
Last month, The Atlantic dropped the latest investigation in its ongoing series on generative AI training data sets. Staff writer Alex Reisner found that at least 15 million YouTube videos had been used for training data by major technology companies, either for research or, in some cases, to build AI video products.
The Atlantic’s reporting focused on over a dozen prominent training data sets that were either compiled or used by companies including Microsoft, Meta, Snap, Tencent, Runway, and ByteDance. The investigation shows how the unauthorized use of YouTube videos has been an essential contributor to the AI industry’s recent leap forward in AI video generation quality.
“Much as ChatGPT couldn’t write like Shakespeare without first ‘reading’ Shakespeare, a video generator couldn’t construct a fake newscast without ‘watching’ tons of recorded broadcasts,” writes Reisner.
The Atlantic’s story briefly mentions that more than 30,000 videos from the BBC were among the training data, alongside other YouTube channels focused on news. Using a searchable database published by The Atlantic, I wanted to better understand the scale at which news channels had been targeted. In the same data sets, I found hundreds of thousands of videos that were taken from some of the most popular news publishers and news creators on YouTube, including The New York Times, The Washington Post, The Guardian, Al Jazeera, and The Wall Street Journal.
For example, more than 88,000 videos were included from Fox News’ YouTube channels, including its flagship account and Fox Business. Another roughly 70,000 videos were taken from the channels of ABC News and its morning show, Good Morning America. I also found more than 55,000 videos from Bloomberg’s YouTube channels, including Bloomberg Originals, Bloomberg Television, and Bloomberg Technology.
Searching through Vox Media-owned YouTube channels in the database, I found more than 30,000 videos, including explainers from Vox, travel docs from Eater, and animal tearjerkers from The Dodo. Roughly 13,900 of those videos were from The Verge’s official YouTube channel, including iOS gadget guides, episodes of its flagship podcast The Vergecast, and interviews with Silicon Valley CEOs like Mark Zuckerberg.
YouTube CEO Neal Mohan has previously said that it’s against the platform’s terms of service for other companies to download videos and use them for training data.
“In order to survive, AI platforms know they need (and their consumers want) quality, credible content like ours that give their products relevance and purpose,” said Lauren Starke, a spokesperson for Vox Media. “They’re spending at unprecedented levels on AI infrastructure: chips, servers, and data centers that power their models. Yet when it comes to the content that makes those models useful — journalism, creative work — they’ve comparatively spent next to nothing.”
In May 2024, Vox Media signed a partnership with OpenAI for an undisclosed sum allowing the company to use its content for products like ChatGPT. Starke said Vox Media will continue to explore partnerships with AI companies that respect their work, but “pursue legal remedies to protect our intellectual property, when necessary.”
“Without our quality content, the reality for these platforms will be: garbage in, garbage out,” she said.
The Atlantic’s database includes over a dozen distinct video training data sets, all of which have been used prominently in generative AI research and development. Some of those data sets have explicit links to commercial video generation models on the market.
For example, I found 11,604 videos from the official YouTube channel of The New York Times across 11 different data sets in the database. Over 8,000 of these videos, though, came from a single training data source — Runway Gen-3. Compiled by Runway, a company that has received backing from Salesforce, Google, and Nvidia, this data set was made to train its flagship video generation model. When Gen-3 was released back in June 2024, it received rave reviews and was compared to earlier iterations of OpenAI’s Sora and Google’s Veo models. Earlier this year, Runway was valued at $3 billion.
Among the thousands of videos from The New York Times in Runway Gen-3, there is a documentary on JFK’s assassination, a visual investigation into the Hong Kong pro-democracy protests, a sit-down interview with Barack Obama, and an opinion column about Russian influence operations. An additional 382 videos are taken from the NYT Cooking YouTube channel, including viral recipes, how-to baking guides, and short-form street food docs. (One caveat is that Runway may have omitted certain videos when it ultimately trained Gen-3.)
An internal Runway spreadsheet published by 404 Media last year gives some insight into why YouTube videos from news publishers were targeted. The spreadsheet, called “Video sourcing – Jupiter” (referencing the codename for Gen-3), lists thousands of channels that were marked by the company as high quality.
In the document, 27,000 videos from The Wall Street Journal’s YouTube channel were tagged with: “lot of graphics, walkthroughs, ‘show and tell.’” From CNET, 22,000 videos were described as “tech reviews” and tagged with the keyword phrase “using a laptop.” From the Washington Post, 21,000 videos were labeled as “lots of newscaseter[sic] but plenty of b-roll.” Another 35,000 videos from Good Morning America were tagged “gargling,” AI jargon for when a model superficially mimics something from its training without deeper “understanding.”
From The New York Times’ official YouTube channel, videos were listed with the description “nyt video, op docs, b roll, talking, human subjects.” Hundreds of NYT Cooking videos were tagged with the keyword “scrambling eggs.” This language gives some indication of the visual vernaculars — or even specific actions — Gen-3 was being trained to mimic.
Since the model’s release, major Hollywood studios have started folding Runway’s products into their film, TV, and marketing production pipelines. According to a report from Bloomberg this summer, Netflix is already using Runway tools in its “content production,” and Walt Disney Co. has similarly been testing out its software.
Meanwhile, there have been no reported licensing deals between Runway and the many news publishers whose work was included in the data set, including The Washington Post, Vox Media, the BBC, and The New York Times. Runway did not respond to a request for comment.
“The Times has not authorized the use of videos that it publishes on YouTube for AI training purposes by any third party. As the owner of these works, the Times has the exclusive legal right to decide how and where our content is used — and are monitoring this closely,” said a spokesperson for the Times, which is currently suing OpenAI and Microsoft for allegedly using its text articles to train ChatGPT without permission. “We will continue to actively investigate infringement of our valuable intellectual property, and will enforce our rights as appropriate.”
Not all the training data sets in The Atlantic database have such clear ties to commercial AI video products. Some were used by the research arms of major AI companies, including Meta, Snap, Tencent, and ByteDance. This usage is public because employees disclosed it themselves in research articles.
For example, a training data set called HD-VILA-100M was first collected by Microsoft Research Asia, the company’s research lab headquartered in Beijing, China. The Atlantic reported that HD-VILA-100M was made available for download by Microsoft researchers and then used by a host of major AI companies in their own research and development.
Meta used the data set to develop its text-to-video system “Make-A-Video,” which was released back in 2022. A research lab at Tencent, the Chinese tech giant, used HD-VILA-100M to create a publicly available data set that could rival the training data used by OpenAI for its Sora video generation model. ByteDance, the owner of TikTok, used the data set to train its experimental text-to-video model MagicVideo. Snap, the owner of Snapchat, used the data set for research into improving AI video captioning, video search tools, and text-to-video generation.
Within HD-VILA-100M, as it was passed across the AI industry over several years, were thousands of YouTube videos owned by news publishers. That includes more than 13,000 videos downloaded from Fox News YouTube channels, roughly 6,300 from various DW channels, and another 5,520 from the Al Jazeera English channel, among others.
While research using HD-VILA-100M has advanced video generation technology at each respective company, it’s harder to draw straight lines between its usage and any one proprietary model or feature.
Similarly, YT-Temporal-180M is a data set compiled by researchers at the University of Washington and the Allen Institute for AI, a nonprofit research organization. The Atlantic reported that the data set is hosted on Google Cloud servers and available for download through Hugging Face, a platform for sharing data sets and machine learning models. YT-Temporal-180M includes about 36,000 videos from Fox News, about 34,000 videos from Bloomberg, and roughly 31,000 videos from ABC News, among others.
Since it first became available in 2021, YT-Temporal-180M has been downloaded from Hugging Face more than 1,450 times. Many of the data sets identified and audited by The Atlantic remain available for download on Hugging Face to use freely for model training.
Major publishers were not the only news-focused YouTube channels I found. Videos from news creators — independent channels that host news aggregation, talk shows, interviews, and political punditry — were scattered throughout the training data sets and sometimes rivaled the numbers from traditional news media.
I found several of the most popular progressive news channels on YouTube in the training data sets, including over 15,000 videos from The David Pakman Show, a talk show that has more than three million YouTube subscribers. His videos were included in both HD-VILA-100M and YT-Temporal-180M, among others. Pakman, the founder and host of the program, confirmed he had not received any requests to use these videos for AI training.
“I understand that AI training often involves scraping large amounts of publicly available data, and that’s part of how these systems improve,” Pakman told me. “When the use is this concentrated — a.k.a tens of thousands of videos from one creator — it feels less like incidental inclusion and more like large-scale extraction of intellectual property without consent.”
Wired previously reported on how subtitles from Pakman’s videos were used to train language models.
Over 11,000 videos from The Majority Report with Sam Seder, which has nearly 2 million subscribers on YouTube, were also in the data sets. When I spoke to Seder, he speculated that his channel offers AI companies a “visual and linguistic vernacular” that’s fundamentally different from mainstream news publishers. Those thousands of videos from The Majority Report include recorded livestreams, listener call-in shows, and reaction videos, all of which add up to a radio-jockey-style brand of political commentary.
Notably, very few of the most prominent U.S. conservative political commentators on YouTube were in the data sets. For example, there were no videos from Steven Crowder or The Rubin Report. There were 460 videos from Ben Shapiro’s YouTube channel, which has over 7 million subscribers.
Under YouTube’s rules, when a creator uploads an original video, they automatically retain copyright. That said, YouTube does have a carve-out to use the content for its own AI training purposes. Earlier this year, CNBC reported that YouTube had used a subset of videos on its platform to train Google’s Gemini and Veo 3 models. This allowance does not extend to third parties.
News publications and news creators don’t need to register their YouTube videos with the U.S. Copyright Office (USCO) to have a valid copyright claim. That said, registering videos by submitting an application and paying a filing fee does come with legal benefits, like the ability to sue for copyright infringement.
The New York Times told me that it “registers its print edition and website on an ongoing basis with the US Copyright Office, including all underlying content.” In many cases, YouTube videos from the Times that are based on print or web articles that have already been registered with USCO could be considered “derivative works” and covered by the same filings.
“Taking content from creators like the Times without permission violates the law and will severely harm the market for original, independent reporting, which will diminish the ability of people to tell important stories, leaving the public less informed,” a spokesperson for the Times told me. “The Times believes that the future success of this technology should not come at the expense of journalistic institutions.”
Seder, meanwhile, said none of the videos on The Majority Report channel — often five uploads per day — are registered with the USCO. As he puts it, he simply doesn’t “have the pockets” to cover filing fees and retain legal counsel, especially when up against some of the largest companies in the world.
He is comfortable with other creators pulling clips from his videos without permission, to a degree. After all, reaction videos are fuel for news creators across YouTube.
“People are using my content all the time, but they’re adding commentary to it, and it is part of a conversation, and it is transparent — that’s part of the ecosystem,” said Seder. He sees the mass downloading of his channel for AI training in another light. “What these [AI companies] are doing is fundamentally different. There’s no reciprocity; it’s only exploitative.”