The internet is seen by some as a vast repository of information readily available for training open and closed AI systems. However, this “data commons” raises significant ethical and legal concerns regarding data consent, attribution, and copyright, particularly for media companies. These concerns are growing due to the fear that AI systems may use the media’s content for training without consent, exacerbating conflicts over intellectual property rights.
A new study, Consent in Crisis: The Rapid Decline of the AI Data Commons, investigates these issues by examining how AI developers use web data and how data access and usage protocols shift over time. This research involves a comprehensive audit of web sources used in major AI training datasets, including C4, RefinedWeb, and Dolma.
The research also evaluates the practices of AI developers, such as Google, OpenAI, Anthropic, Cohere, and Meta, as well as non-profit archival organizations such as Common Crawl and the Internet Archive. By focusing on dynamic web domains and tracking changes over time, this study assesses the evolving landscape of data usage and its implications for media companies.
The changing landscape of consent for AI training
The study’s observations provide strong empirical evidence of a misalignment between how AI systems are used and the web data they are trained on. The analysis tracks major shifts in how websites signal their consent preferences and reveals the limitations of the tools currently available for doing so.
Increased restrictions on AI data
- From April 2023 to April 2024, a growing number of websites began blocking AI bots from collecting their data. Websites accomplish this by adding directives to their robots.txt files and restrictions to their terms of service.
- Impact: About 25% of the most critical data sources and 5% of all data used in some major AI datasets (C4, RefinedWeb, and Dolma) are now off-limits to AI.
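To illustrate the blocking mechanism the study measures, a site can disallow known AI crawlers in its robots.txt while leaving ordinary crawling open. The user-agent tokens below (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google’s AI training) are real tokens published by those operators; the file itself is a hypothetical sketch, not any particular site’s policy.

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```

Note that robots.txt is a voluntary signal: it expresses a preference that compliant crawlers honor, but it does not technically prevent access.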
Consent asymmetries and inconsistencies
- OpenAI’s bots, which collect data for AI training, are blocked more often than other companies’ bots, and the rules governing what these bots may and may not do are often unclear or inconsistent.
- Impact: This inconsistency makes it difficult to adhere to websites’ data usage preferences and points to the limitations of existing consent-management tools.
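Checking whether a given crawler is permitted can be sketched with Python’s standard `urllib.robotparser`. The robots.txt body below is an illustrative example (blocking only GPTBot), not a real site’s file:

```python
# Minimal sketch: parse a robots.txt and ask whether specific
# user agents may fetch a URL. Uses only the standard library.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own Disallow rule; other bots fall through to '*'.
print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

Because each crawler is matched against its own user-agent rules, a site must know and list every AI bot it wants to exclude, which is one source of the inconsistency the study documents.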
Divergence in the web data quality
- The web domains most used for AI training include news sites, forums, encyclopedias, and academic and government content. These domains contain diverse content, such as images, videos, and audio. Many of these sites monetize via ads and paywalls, and their terms of service frequently restrict how their content can be used. In contrast, other web domains consist of personal and organizational websites, blogs, and e-commerce sites with less monetization and fewer restrictions.
- Impact: The increasing restrictions on popular, content-rich websites mean that AI models must increasingly rely on open or user-generated content. They thus miss out on the highest-quality and most up-to-date information, potentially affecting their performance and accuracy.
Mismatch between web data and AI usage
- The web data collected for training AI is poorly matched to the tasks AI systems actually perform in the real world.
- Impact: This misalignment can degrade AI systems’ performance, complicate data collection, and raise legal risks related to copyright.
AI economic fears may reshape internet data
- The use of internet content for AI training, a purpose it was never created for, shifts the incentives for content creation. With the increasing use of paywalls and ads, small-scale content providers may opt out or move to walled platforms to protect their data. Without better control mechanisms for website owners, the open web is likely to shrink further, with more content locked behind paywalls or logins to prevent unauthorized use.
- Impact: This trend could significantly reduce the availability of high-quality information for AI training.
The media’s choice to opt out of AI training
While the internet has served as a critical resource for AI development, using content that others, including media companies, created (often at great expense) without consent presents significant ethical and legal challenges. As more media companies choose to exclude their content from AI training, the resulting datasets become less representative and more outdated, and this decline in data quality reduces the relevance and accuracy of the AI models trained on them. Improved data governance and transparency are therefore essential: they would help keep content openly accessible online and provide a framework for the ethical use of web content in AI training, which in turn should improve the quality of training data.