While in some ways the web has evolved organically, it also functions within accepted structures and guidelines that have allowed websites to operate smoothly and enabled discovery online. One such protocol is robots.txt, which emerged in the mid-1990s to give webmasters some control over which web spiders could visit their sites. A robots.txt file is a plain text document placed in the root directory of a website. It contains instructions for search engine bots on which pages to crawl and which to ignore. Significantly, compliance with its directives is voluntary. Google has long followed and endorsed this voluntary approach. And no publisher has dared to exclude Google, considering its 90%+ share of the search market.
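For readers unfamiliar with the format, here is a minimal, purely illustrative robots.txt file (the crawler name and paths are hypothetical). Each group pairs a User-agent line naming a crawler with the paths it may or may not fetch:

    # All crawlers may fetch everything except the /drafts/ directory
    User-agent: *
    Disallow: /drafts/

    # A specific crawler, identified by its user-agent token, is barred from the whole site
    User-agent: ExampleBot
    Disallow: /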
Today, a variety of companies use bots to crawl and scrape content from websites. Historically, content has been scraped for relatively benign purposes such as non-commercial research and search indexing, which promises the benefit of driving audiences to a site. In recent years, however, both previously benign crawlers and new ones have begun scraping content for commercial purposes such as training Large Language Models (LLMs), use in Generative Artificial Intelligence (GAI) tools, and inclusion in retrieval-augmented generation outputs (also known as “grounding”).
Under current internet standards such as the robots.txt protocol, publishers can only allow or block a named crawler from accessing all or part of a site. They are not able to communicate case-by-case exceptions (by company, bot and purpose) in accordance with their terms of use in a machine-readable format. And again: compliance with the protocol is entirely voluntary. The Internet Architecture Board (IAB) held a workshop in September on whether and how to update the robots.txt protocol, and it appears the Internet Engineering Task Force (IETF), which is responsible for the protocol, plans to convene more discussions on how best to move forward.
A significant problem is that scraping happens without notification to, or consent from, the content owners. It is often carried out in blatant violation of the website’s terms of use and applicable laws. OpenAI and Google recognized this imbalance when they each developed their own controls (utilizing the robots.txt framework) for publishers to opt out of having their content used for certain purposes.
Predictably, however, these controls don’t fully empower publishers. For example, Google allows a publisher to opt out of having its content used to train Google’s AI services. However, if a publisher wants to prevent its work from being used in Generative AI Search, which allows Google to redeploy and monetize the content, it has to opt out of search entirely. It would be immensely useful to have an updated robots.txt protocol that provides more granular controls for publishers in light of the massive scraping operations of AI companies.
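As publicly documented by the two companies (token names may change over time), these opt-outs are expressed as ordinary robots.txt rules: OpenAI’s GPTBot crawler and Google’s Google-Extended token can each be disallowed, for example:

    # Ask OpenAI not to crawl the site for model training
    User-agent: GPTBot
    Disallow: /

    # Ask Google not to use the site's content for its AI training
    User-agent: Google-Extended
    Disallow: /

The catch described above is that Google-Extended does not govern Search itself; the generative features inside Search are fed by the same Googlebot crawl that powers ordinary indexing, so blocking that crawl means dropping out of search results as well.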
The legal framework that protects copyrighted works
While big tech companies tout the benefits of AI, much of the content crawled and scraped by bots is protected under copyright law or other laws intended to enable publishers and other businesses to protect their investments against misappropriation and theft.
Copyright holders have the exclusive right to reproduce, distribute and monetize their copyrighted works as they see fit for a defined period. These protections incentivize the creative industries by allowing them to reap the fruits of their labors and enabling them to reinvest in new content creation. The benefits to our society are nearly impossible to quantify, as the varied kinds of copyrighted material enrich our lives daily: music, literature, film and television, visual art, journalism, and other original works provide inspiration, education, and personal and societal transformation. The Founding Fathers included copyright in the Constitution (Article I, Section 8, Clause 8) because they recognized the value of incentivizing the creation of original works.
In addition to copyright, publishers also rely on contractual protections contained in their terms of service, which govern how the content on their websites may be accessed and exploited. Additionally, regulation against unfair competition is designed to protect against the misappropriation of content for the purpose of creating competing products and services; it deters free riding and prevents the dilution of incentives to invest in new content. The proper application of, and respect for, these laws is part of the basic framework underlying the thriving internet economy.
The value of copyrighted works must be protected
The primary revenue sources for publishers are advertising, licensing and, increasingly, subscriptions. Publishers make their copyrighted content available to consumers through a wide range of means, including websites and apps that are supported by various monetization methods such as metered paywalls. It is important to note that even if content is available online and not behind a subscription wall, that does not extinguish its copyright protection. In other words: It is not free for the taking.
That said, there are many cases where a copyright holder may choose to allow the use of their original work for commercial or non-commercial purposes. In these cases, potential licensees contact the copyright holder to seek a negotiated agreement, which may define the extent to which the content may be used and any protections for the copyright holder’s brand.
Unfortunately, AI developers, in large part, do not respect the framework of laws and rules described above. They seek to challenge and reshape these laws in a manner that would be exceptionally harmful to digital publishers, advancing the position that content made publicly available should be free for the taking – in this case, to build and operationalize AI models, tools and services.
Publishers are embracing the benefits of AI innovation. They are partnering with developers and third parties, for both commercial and non-commercial purposes, to provide access and licenses for the use of their content in a manner that is mutually beneficial. However, AI developers currently have little incentive to seek permission or to pursue access and licensing solutions. Publishers need a practical tool to signal to bots, at scale, whether they wish to permit crawling and scraping for the purposes of AI exploitation.
Next steps and the future of robots.txt
The IETF should update the robots.txt protocol to create more specific technical measures that help publishers convey the purposes for which their content may or may not be used, including by expressing limitations on the scraping and use of their content for GAI purposes. While this should not be viewed as in any way reducing the existing legal obligations of third parties to seek permission directly from copyright holders, it could be useful for publishers to be able to signal publicly, in a machine-readable format, which uses are permitted – for example, that scraping for search indexing is allowed, whereas scraping to train LLMs or for other commercial GAI purposes is not.
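Purely to illustrate the kind of granularity being asked for (the directive names below are hypothetical and are not part of the current standard or any specific proposal), an updated protocol could let a site attach purposes to its rules:

    User-agent: *
    Allow: /
    # Hypothetical purpose-level signals
    Allow-Purpose: search-indexing
    Disallow-Purpose: ai-training
    Disallow-Purpose: generative-output

Whatever syntax the IETF ultimately settles on, the substance matters more than the spelling: a machine-readable statement of which uses are permitted, which a publisher can change without touching its search visibility.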
Of course, a publisher’s terms of use should always remain legally binding and trump any machine-readable signals. Furthermore, these measures should not be treated as creating an “opt out” system for scraping. A publisher’s decision not to employ these signals is not permission (either explicit or implicit) to scrape websites or use content in violation of the terms of use or applicable laws. And any ambiguity must be construed in favor of the rights holders.
To achieve a solution in a timely and efficient manner, the focus should be on a means to clearly and specifically signal permission for, or prohibition of, crawling and scraping for the purposes of AI exploitation. Others may seek to layer certain licensing solutions on top of this, which should be left to the market. In addition, there must be transparency around bots that crawl and scrape for purposes of AI exploitation: any solution should not depend on the whims of AI developers to announce the identities of their bots, nor permit them to operate in a manner that obscures their identity or the purpose of their activity.
And, critically, search and AI scraping must not be commingled. The protocol should not be allowed to be used in a manner that requires publishers to accept crawling and scraping for AI exploitation as a condition of being indexed for search.
Let’s not repeat the mistakes of the past by allowing big tech companies to leverage their dominance in one market to dominate an emerging market like AI. Original content is important to our future, and we should build out web standards that carry forward our longstanding respect for copyright into the AI age.