It’s easy to feel like publishers have little recourse against the fast track of AI adoption. Big Tech keeps updating its offerings, cribbing more content, clobbering website traffic, and battling for legal traction. This is a wake-up call.
On the other side, consumers keep changing their consumption habits. They increasingly turn to AI for search. They accept the answers provided by AI Overviews (even if they know about AI hallucinations), and reduce their click-throughs to publishers.
Suffice it to say, we’re in the messy middle. Media companies and independent publishers have been left to fend for themselves.
- Some have been turning to the courts, fighting against AI companies big and small, managing individual battle after battle in the hopes of pushing back against the latest attack on our business model.
- Some have been inking deals with those very same AI companies — The New York Times and Amazon recently closed a content licensing agreement; OpenAI has partnerships with The Washington Post, Hearst, and News Corp, among others.
But it doesn’t have to be this way. Publishers are better off working together to set standards for what they want when it comes to AI, from negotiations to government legislation, rather than accepting piecemeal settlements and implementations that might not always go our way. One way we can proactively do just that: consider setting an industry standard for do-not-scrape policies.
This isn’t the time to sit back while AI companies help themselves to the internet’s creative work. At Raptive, we work with publishers to find ways to help them better protect their content and prepare for what’s next.
Do not scrape
Blocking AI crawlers from sites isn’t new; publishers have been doing it for years. As Axios reported in 2023, per Originality.AI data, some 20% of the top 1,000 websites were already blocking AI crawlers. Back in August of that year, after OpenAI announced its GPTBot crawler, major news publishers like The New York Times, Reuters, and CNN, among others, made public that they would block the crawler from their sites.
The publishers aren’t alone in pushing back against the scraping. Last week, Cloudflare rolled out a new permission-based system for AI crawlers, becoming the first internet infrastructure provider to block AI crawlers by default.
Reddit, meanwhile, has been waging its own battle against content scraping. Last June, for example, the company updated its robots.txt file to place firmer restrictions on the third-party companies that crawl its content; this followed its shift to charging for API access, which let the company monetize its vast store of user-generated content and control how AI companies use its data for model training. Earlier this month, Reddit filed suit against AI company Anthropic, claiming that the company had “obtained access or tried to obtain access to Reddit data more than 100,000 times,” as The New York Times reported.
Until recently, crawlers’ main function seemed to benefit publishers and platforms alike: traffic. Sites allowed their content to be crawled, and in exchange they got the monetizable eyeballs they relied upon. The robots.txt protocol let publishers and platforms make clear who could scrape their content and what they could do with it.
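In practice, a do-not-scrape policy amounts to a few lines in that file. The sketch below is illustrative rather than a recommendation: the user-agent names shown (GPTBot for OpenAI, CCBot for Common Crawl, ClaudeBot for Anthropic) are tokens those companies have publicly documented, but the list of AI crawlers changes constantly and is nowhere near exhaustive.

    # Welcome traditional search crawling
    User-agent: Googlebot
    Allow: /

    # Turn away common AI training crawlers (illustrative, not exhaustive)
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

The catch is that robots.txt is a request, not an enforcement mechanism; compliance is entirely voluntary on the crawler’s side.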
That gets a lot more complicated when AI companies are the ones doing the scraping. They take content without permission and use it to benefit their companies without compensating the publishers, platforms, or creators behind it. Even with do-not-scrape policies in place, AI companies have reportedly been trying to get around those policies to continue scraping the content and using it for their benefit.
And while Google does allow publishers to opt out of AI training and Gemini app inclusion, avoiding AI Overviews and the newer AI Mode requires opting out of Google Search indexing entirely. It’s a rigid, all-or-nothing approach, and Google considered but ultimately rejected the more flexible, a la carte controls that we believe publishers deserve.
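The opt-out Google does offer also lives in robots.txt: blocking the documented Google-Extended token keeps a site’s content out of Gemini training without affecting Search. But because there is no separate control for AI Overviews, staying out of them means blocking Googlebot itself, which also removes pages from Google Search. A sketch of that trade-off, for illustration only, not a recommendation:

    # Opt out of Gemini / AI training; does not affect Google Search
    User-agent: Google-Extended
    Disallow: /

    # The only blanket way to stay out of AI Overviews today is to block
    # Search crawling itself, which also removes pages from Google Search:
    # User-agent: Googlebot
    # Disallow: /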
If AI companies are crawling content and serving AI Overviews and link-less search replies that cut traffic to websites without crediting or compensating the originators of that content, it’s only a matter of time until the very companies AI is scraping no longer exist. Figuring out a compensation model for publishers, platforms, and creators isn’t tomorrow’s issue; it’s today’s survival.
The future of the internet
Media companies can’t ignore tanking traffic and stolen content and hope these issues will solve themselves. Publishers need to be proactive and meet the moment. They must protect not only their content but also their brands, livelihoods, and the future of the internet. They need to be prepared for the legal battles they may have to fight to ensure that their content isn’t endlessly scraped. We’re urging others to meet this moment head-on.
The future of the web shouldn’t just reward a few players who happened to move fast on AI adoption. Publishers need to remember the lessons of the print-to-digital shift: This is another seismic turning point, and getting left behind again isn’t an option.