As large language models (LLMs) evolve from experimental tools into valuable assets, transparency about their data sourcing is rapidly declining. Early datasets were openly shared, allowing the public to examine the content used for training. Today, however, LLM companies tightly guard their data sources, leading to new intellectual property (IP) conflicts. Many media companies are pursuing litigation to protect their content from unauthorized use in AI training. At the same time, courts, regulators, and policymakers are debating content ownership and the responsibilities of LLM developers.
A new report by George Wukoson, Ziff Davis’ lead AI attorney, and Joey Fortuna, the company’s chief technology officer, sheds light on the nature of data sources used by major LLMs. Their research reveals that AI developers often favor high-quality content when selecting training data, especially content owned by premium media companies. Their findings support discussions around publishers’ IP rights, content licensing, and the ethical dimensions of AI development.
Dataset analysis and key findings
Wukoson and Fortuna’s research uses Domain Authority (DA), a metric developed by Moz for search engine optimization, to measure the prominence of domains in several LLM training datasets. They examine Common Crawl, C4, OpenWebText, and OpenWebText2, and analyze how curation levels affect the inclusion of content from high-DA, premium sources.
Their findings show that as datasets become more curated, the share of content from high-quality publishers rises significantly. The key findings are:
1. Increasing inclusion of premium content
In less curated datasets like Common Crawl, content from major media companies makes up only about 0.44%. In OpenWebText2, a highly curated dataset, that share jumps to 12.04%. This shift suggests that LLM developers selectively incorporate reputable sources to improve the quality and accuracy of the model’s output.
2. Higher DA correlates with higher curation levels
In Common Crawl, an uncurated dataset, over 50% of domains have DA scores below 10, indicating that it includes a significant amount of low-authority content. In contrast, 39.4% of the domains in OpenWebText2 have DA scores between 90 and 100, reflecting a preference for high-authority, reliable sources as datasets undergo curation.
3. Prominence of premium content
Leading publishers like The New York Times and News Corp consistently appear in the top DA range (90–100), reflecting their high authority in the dataset rankings. Their dominance in these datasets suggests that LLMs are more frequently training on established news and media sources. This gives these outlets a stronger influence in shaping model behavior and responses.
These trends show a pattern in which the curation of training datasets systematically filters out lower-quality sources, favoring reputable, high-DA domains. As a result, LLMs benefit from exposure to high-quality, well-sourced content, which may enhance their performance but also raises concerns about IP use and representation.
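To make the two measurements above concrete, here is a minimal Python sketch of how a DA-band breakdown and a premium-content share could be computed for a list of URLs. It is an illustration only, not the authors' method: the function names, the `.example` domains, the DA scores, and the choice to count unique domains for the bands and URLs for the premium share are all assumptions made for this example. Real DA scores would have to come from Moz.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def da_bucket(score: int) -> str:
    """Map a 0-100 DA score to a band: '0-9', '10-19', ..., '90-100'."""
    low = min(score // 10 * 10, 90)
    high = 100 if low == 90 else low + 9
    return f"{low}-{high}"


def domain_band_shares(urls, da_scores):
    """Share of unique domains per DA band; unscored domains are skipped."""
    domains = {domain_of(url) for url in urls}
    counts = Counter(
        da_bucket(da_scores[d]) for d in domains if d in da_scores
    )
    total = sum(counts.values())
    return {band: n / total for band, n in counts.items()} if total else {}


def premium_share(urls, premium_domains):
    """Fraction of URLs whose domain is in the premium publisher set."""
    if not urls:
        return 0.0
    hits = sum(1 for url in urls if domain_of(url) in premium_domains)
    return hits / len(urls)


# Hypothetical sample data: made-up domains and made-up DA scores.
sample_urls = [
    "https://www.bignews.example/a",
    "https://bignews.example/b",
    "https://smallblog.example/post",
    "https://forum.example/thread/1",
]
sample_da = {"bignews.example": 93, "smallblog.example": 7, "forum.example": 41}
sample_premium = {"bignews.example"}

if __name__ == "__main__":
    print(domain_band_shares(sample_urls, sample_da))
    print(premium_share(sample_urls, sample_premium))
```

Running the same two functions over a less curated and a more curated URL list would, under these assumptions, give the kind of side-by-side comparison the findings describe.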
Ethical and legal implications
The prioritization of high-quality, high-DA content in LLM datasets is escalating legal disputes between media companies and AI firms. For example, The New York Times has filed a copyright infringement suit against major AI developers. It argues that these AI companies profit from high-quality content without appropriate compensation to the original publishers.
As LLMs continue to transform industries, the value of high-quality, curated content becomes increasingly apparent. The authors’ analysis shows that curation prioritizes content from reputable, high-DA media companies, amplifying their role in shaping model outcomes. This trend will likely intensify as LLM companies refine their training methodologies, sparking further debate over intellectual property and AI firms’ financial obligations to content creators. The findings call for a broader dialogue around data licensing and compensation frameworks that reflect the mutual value between content creators and AI innovators.