The possibilities for creators leveraging artificial intelligence have been quietly building for a decade. So, what really changed in the past few months? Certainly, the clever roll-out of ChatGPT only months after DALL-E 2 threw gas on the burner. Right now, press coverage and industry chatter are mostly focused on uncovering real-world applications of generative AI, and those are exciting to be sure. But wise publishers will heed the lessons of past episodes in which their data were scraped, and take an active role in preserving their business value.
“Scraping the web” is a term as old as the web itself. It’s no secret that Google’s core business was built by crawling other companies’ servers so that its search engine could find and deliver satisfying results across the vast world wide web with speed and precision. The result: Google now plays gatekeeper for much of the web.
The reality is that any publisher who made the radical decision to bar Google’s crawler would quickly become almost impossible to find online. Giving Google the free right to crawl your site is simply the cost of doing business on the open web today.
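To be clear, the mechanics of opting out are trivial; it is the consequence that is radical. A minimal sketch, using the standard Robots Exclusion Protocol (the file below is hypothetical, not a recommendation):

```
# Hypothetical robots.txt for a publisher opting out of Google's crawler.
# The syntax is the standard Robots Exclusion Protocol, and Googlebot
# honors it -- but a publisher who shipped this would quickly vanish
# from search results.
User-agent: Googlebot
Disallow: /
```

Two lines are all it takes. That no major publisher dares write them is the whole point.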
Just last week, Google began testing the blocking of news content for some of its users in Canada. As Campbell Clark wrote for the Globe and Mail earlier this week, “It’s not a threat, of course. Sometimes people just wake up with a horse’s head in their bed.”
This is yet another stark example of Google’s power. No individual publisher has the bargaining power to opt out and impact Google’s business. So no one does.
It is not so different from the way Facebook built its social media platform into a digital advertising juggernaut. After many years of scraping and collecting user data across the web and from its partners, far beyond the walls of its own garden, Facebook began using AI to leverage its massive data warehouse, aptly named “The Hive,” in ways that cannot be undone. Publishers are only now learning, through lawsuit discovery and leaked documents, how all of that data was used and abused.
Importantly, as one of those leaked documents revealed, Facebook engineers used the metaphor of ink spilled into a lake to describe what happens to a person’s data once it makes its way into Facebook’s data warehouse:
“We’ve built systems with open borders. The result of these open systems and open culture is well described with an analogy:
Imagine you hold a bottle of ink in your hand. This bottle of ink is a mixture of all kinds of user data (3PD, 1PD, SCD, Europe, etc.) You pour that ink into a lake of water (our open data systems; our open culture) … and it flows … everywhere. How do you put that ink back in the bottle? How do you organize it again, such that it only flows to the allowed places in the lake?”
This brings us to generative AI, 2023’s hot topic. It is unequivocal that large language models are being trained by ingesting the work of newsrooms across the web. To grasp the scale, consider Google’s disclosure that a whopping 15% of its daily searches are queries it has never seen before.
Now think about that for a moment: the most likely trigger for a novel search is an unfolding news event. In other words, many of those searches exist only because of the work of the news media. And, as we know, Google monetizes answers. Particularly when confronted with a new query, Google’s machine-learning algorithms lean on authority brands, those deemed likely to provide a trustworthy response. Guess who those are? Run a search on any breaking news story: the top results appear under a module called “Top stories,” and, you guessed it, they come from media brands.
No one should doubt that generative AI will attract a similar level of interest, intelligence, and impact on the back of the news media’s work product, given how central that work is to the training and operation of large language models.
So, if the output of newsrooms, writers, artists, and media companies were not available to scrape, what would happen to the quality of 2023’s much-hyped generative AI?
Copious effort and investment go into the creation of premium content. Using premium content to train AI models is akin to fueling your machine with gasoline you “found” in someone else’s garage. That does not mean the machine is not powerful or useful. But the fuel was not free to create, and if your machine cannot run without it, something is not quite right with the system.
Given Facebook’s metaphor, the publishing world may need to serve notice to OpenAI and its ilk before they can claim that the ink has spread too far to ever be traced back to its bottle. Let’s hope it is not already too late.
In November, DeviantArt rolled out an HTML tag that lets artists tell web robots not to use their images to train art-generating AI. Media companies need to consider whether the right course of action is to adopt a similar tag telling bots not to crawl publishers’ work or train AI on it; a sketch of what that could look like appears below. Of course, like many technical measures on the open web, this would only rein in the good robots; it would not stop the bad ones.
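As a minimal sketch, assuming a publisher followed DeviantArt’s lead, the request could ride on the familiar robots meta tag; the “noai” and “noimageai” directives below are the ones DeviantArt introduced, and compliance is entirely voluntary on the crawler’s side:

```html
<!-- Hypothetical header for a news article page: asks compliant crawlers
     not to use this page's text or images for AI training. The directives
     mirror DeviantArt's "noai"/"noimageai"; nothing enforces them. -->
<meta name="robots" content="noai, noimageai">
```

Like robots.txt, this is a request rather than a lock; its value would come from publishers adopting it widely enough that ignoring it carries a real legal and reputational cost.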
But it is important that we make a start, before too much ink has spilled.