As large language models (LLMs) evolve from experimental tools to valuable assets, transparency in their data sourcing is rapidly declining. Initially, datasets were openly shared, allowing the public to examine the content used for training. Today, however, LLM companies tightly guard their data sources, leading to new intellectual property (IP) conflicts. Many media companies are pursuing litigation to protect their content from unauthorized use in AI training. At the same time, courts, regulators, and policymakers are engaged in debates over content ownership and the responsibilities of LLM developers.
A new report by George Wukoson, Ziff Davis’ lead AI attorney, and Joey Fortuna, the company’s chief technology officer, sheds light on the nature of data sources used by major LLMs. Their research reveals that AI developers often favor high-quality content when selecting training data, especially content owned by premium media companies. Their findings support discussions around publishers’ IP rights, content licensing, and the ethical dimensions of AI development.
Dataset analysis and key findings
Wukoson and Fortuna’s research uses Domain Authority (DA), a metric developed by Moz for search engine optimization, to measure the prominence of domains in several LLM training datasets. They examine Common Crawl, C4, OpenWebText, and OpenWebText2, and analyze how curation levels affect the inclusion of content from high-DA, premium sources.
Their findings show that as datasets become more curated, the share of content from high-quality publishers rises significantly. Key findings from the research:
1. Increasing inclusion of premium content
In less curated datasets like Common Crawl, content from major media companies makes up only about 0.44% of the dataset. However, in OpenWebText2, a highly curated dataset, content from these companies jumps to 12.04%. This shift indicates that LLM developers selectively incorporate reputable sources to improve the quality and accuracy of the model’s output.
2. Higher DA correlates with higher curation levels
Common Crawl, an uncurated dataset, has over 50% of domains with DA scores below 10, indicating that it includes a significant amount of low-authority content. In contrast, 39.4% of the domains in OpenWebText2 have DA scores between 90 and 100, reflecting a preference for high-authority, reliable sources as datasets undergo curation.
3. Prominence of premium content
Leading publishers like The New York Times and News Corp consistently appear in the top DA range (90–100), reflecting their high authority in the dataset rankings. Their dominance in these datasets suggests that LLMs are more frequently training on established news and media sources. This gives these outlets a stronger influence in shaping model behavior and responses.
These trends show a pattern where the curation of training datasets systematically filters out lower-quality sources, favoring reputable, high-DA domains. As a result, LLMs benefit from exposure to high-quality, well-sourced content, which may enhance their performance but raise concerns about IP use and representation.
Ethical and legal implications
Prioritizing high-quality, high-DA content in LLM datasets escalates legal disputes between media companies and AI firms. For example, The New York Times has filed a copyright infringement suit against major AI developers, arguing that these companies profit from high-quality content without appropriately compensating the original publishers.
As LLMs continue transforming industries, the value of high-quality, curated content becomes increasingly apparent. The authors’ analysis shows that curation prioritizes content from high-DA, reputable media content companies, amplifying their role in shaping model outcomes. This trend will likely intensify as LLM companies refine their training methodologies, sparking further debate over intellectual property and AI firms’ financial obligations to content creators. The findings here call for a broader dialogue around data licensing and compensation frameworks that reflect the mutual value between content creators and AI innovators.
In today’s rapidly evolving media landscape, publishers and broadcasters are increasingly looking to AI to reach new audiences and build loyalty. At Arc XP’s recent Connect London event, a panel of industry experts—Lisa Anzinger, Enterprise Lead at Echobox; Aliya Itzkowitz, Strategy Manager at FT Strategies; and Madeleine White, VP of Marketing at Poool—shared their insights on how AI is reshaping engagement strategies for media companies.
How AI can help reach new audiences
One of the biggest challenges for media companies is breaking through the noise to reach new audiences. AI offers a way to do this by understanding and targeting specific audience needs. As Anzinger from Echobox explained, “AI can help you counteract [algorithm changes] and actually still reach those audiences, maybe even reach them through wider channels… whether that’s social media, newsletters, or even others.” AI’s adaptability is crucial in an environment where third-party platforms frequently adjust their algorithms.
Itzkowitz from FT Strategies highlighted the importance of multi-format engagement: “One of the most exciting things for me is there’s so many new ways that we can engage with readers, especially thanks to the generative AI boom.” For example, she shared that the Financial Times (FT) has been experimenting with a feature called “definitions,” which helps younger readers understand financial terms by offering definitions that pop up on hover. This is one way AI makes news content more accessible and appealing to younger audiences, who may find certain jargon intimidating.
Personalization and community building
AI’s ability to personalize experiences was another focus of the discussion. Anzinger explained that Echobox has seen significant engagement improvements through AI-driven personalization in newsletters, saying, “We have the ability to personalize each newsletter to the individual reader… [our client] Group Sud Ouest managed to increase open rates by 53% and click rates by 42%.” This tailored approach ensures that readers receive content that matters to them at just the right time.
Beyond personalization, White from Poool highlighted the growing importance of community building, emphasizing that AI’s real value lies in connecting people. “With AI, content can be created so easily that anyone can create content… community is what’s going to set you apart.” AI can help facilitate these communities, for example, by using AI-driven chatbots to engage users in conversations and provide a space where people with shared interests can connect and interact. White mentioned that some Norwegian publishers have even used AI chatbots that act as virtual community members, creating a more dynamic and interactive environment.
Ethical considerations and transparency in AI
Implementing AI responsibly is a priority, and transparency is essential in building trust. White believes in the value of being open with audiences: “Always be transparent…if a reader believes they can trust you and they know when you are being honest about whether AI’s being used, I think that really helps.” On the other hand, Itzkowitz offered a nuanced perspective, suggesting that over time, labeling every AI-driven feature may become unnecessary as AI becomes more integral to everyday processes.
However, when AI is directly engaging with audiences, clear communication is critical. Itzkowitz shared an example from a project involving synthetic voices: “If you’re using the voice of a journalist or cloning a voice…then you need to disclaim that.” AI’s role in generating content is likely to be acceptable for readers as long as they feel informed about when it’s being used.
Legacy media vs. new brands: Who has the AI advantage?
An interesting debate arose around whether legacy media brands have an advantage in AI-driven engagement due to their extensive archives and resources. While Itzkowitz acknowledged that larger companies have more data and resources, she pointed out that smaller publishers can often innovate faster due to less bureaucratic red tape. “Smaller publishers…may just say, let’s try full automation. Let’s see how it goes,” added Anzinger, highlighting the experimental edge smaller companies can bring to the table.
For smaller teams, adopting hybrid models that combine ready-made AI solutions with limited internal development is a viable approach. As Itzkowitz noted, “AI has somewhat changed this…you no longer need huge development teams to build something.”
Overcoming fear of failure with AI
The panelists acknowledged that trying new technologies like AI can be intimidating, especially with the risk of projects not meeting expectations. However, Anzinger advised companies to take a “trial and error” approach, iterating based on results. “Just keep trying… you can’t be afraid of something going wrong, just need to keep trying,” she said.
White underscored the importance of resilience and learning from failure, recalling the New York Times’ bold decision to implement a paywall in 2011 despite criticism. “Trying and failing means you are potentially going to succeed and be ahead of anyone else,” she said, stressing that bold moves can often be the most rewarding in the long term.
What’s next for AI in audience engagement?
Looking to the future, the panelists shared their excitement about the new ways AI can enhance engagement. Anzinger is optimistic about using AI to produce dynamic content, such as automated videos from articles, which will help media companies engage younger audiences on platforms like TikTok and Instagram. Meanwhile, Itzkowitz is eager to see AI-driven creativity blossom: “What’s that killer app for generative AI, or how are we really going to change the news experience so that young people and audiences in general really want to engage with news again?”
White emphasized the potential to create individualized experiences: “Instead of creating this one-size-fits-all model… [AI means] giving them a unique experience that the reader next to them isn’t going to get.” This hyper-personalized approach has the potential to make each reader feel valued and understood, a key to building loyalty.
Finally, the panelists concurred that AI is not here to replace journalists but to augment their capacity to connect with readers in increasingly meaningful ways. As AI tools evolve, media companies that embrace experimentation, prioritize transparency, and stay committed to creating genuine connections will be best positioned to thrive.
While in some ways the web has evolved organically, it also functions within accepted structures and guidelines that have allowed websites to operate smoothly and to enable discovery online. One such protocol is robots.txt, which emerged in the mid-1990s to give webmasters some control over which web spiders could visit their sites. A robots.txt file is a plain text document that is placed in the root directory of a website. It contains instructions for search engine bots on which pages to crawl and which to ignore. Significantly, compliance with its directives is voluntary. Google has long followed and endorsed this voluntary approach. And no publisher has dared to exclude Google, considering its 90%+ share of the search market.
Today, a variety of companies use bots to crawl and scrape content from websites. Historically, content has been scraped for relatively benign purposes such as non-commercial research and search indexing, which promises the benefit of driving audiences to a site. In recent years, however, previously benign and new crawlers have begun scraping content for commercial purposes such as training Large Language Models (LLMs), use in Generative Artificial Intelligence (GAI) tools, and inclusion in retrieval augmented generation outputs (aka “grounding”).
Under current internet standards such as the robots.txt protocol, publishers can only block or allow crawlers by domain. Publishers are not able to communicate case-by-case (company, bot and purpose) exceptions in accordance with their terms of use in a machine-readable format. And again: compliance with the protocol is entirely voluntary. The Internet Architecture Board (IAB) held a workshop in September on whether and how to update the robots.txt protocol and it appears the Internet Engineering Task Force (IETF), which is responsible for the protocol, plans to convene more discussions on how best to move forward.
A significant problem is that scraping happens without notification to or consent from the content owners. It often violates the website’s terms of use and, in some cases, applicable laws. OpenAI and Google recognized this imbalance when they each developed differing controls (utilizing the robots.txt framework) for publishers to opt out of having their content used for certain purposes.
Predictably, however, these controls don’t fully empower publishers. For example, Google will allow a publisher to opt out of training for their AI services. However, if a publisher wants to prevent their work from being used in Generative AI Search—which allows Google to redeploy and monetize the content—they have to opt out of search entirely. It would be immensely useful to have an updated robots.txt protocol to provide more granular controls for publishers in light of the massive scraping operations of AI companies.
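To make these controls concrete, here is a minimal, hedged example of what per-bot opt-outs look like under the current protocol. GPTBot, Google-Extended, and CCBot are the control tokens OpenAI, Google, and Common Crawl have published; the layout below is illustrative rather than a recommendation, and compliance remains voluntary.

```
# Illustrative robots.txt: opt out of AI training while staying in search.
# Compliance with these directives is entirely voluntary.

User-agent: GPTBot            # OpenAI's training crawler
Disallow: /

User-agent: Google-Extended   # Google's AI-training control token
Disallow: /

User-agent: CCBot             # Common Crawl's crawler
Disallow: /

User-agent: Googlebot         # ordinary search indexing remains allowed
Allow: /
```

Note what this example cannot express: there is no way to say "crawl for search, but not for generative search features," which is exactly the gap described above.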
The legal framework that protects copyrighted works
While big tech companies tout the benefits of AI, much of the content crawled and scraped by bots is protected under copyright law or other laws intended to enable publishers and other businesses to protect their investments against misappropriation and theft.
Copyright holders have the exclusive right to reproduce, distribute and monetize their copyrighted works as they see fit for a defined period. These protections incentivize the creative industries by allowing them to reap the fruits of their labors and enable them to reinvest into new content creation. The benefits to our society are nearly impossible to quantify as the varied kinds of copyrighted material enrich our lives daily: music, literature, film and television, visual art, journalism, and other original works provide inspiration, education, and personal and societal transformation. The Founding Fathers included copyright in the Constitution (Article I, section 8, clause 8) because they recognized the value of incentivizing the creation of original works.
In addition to copyright, publishers also rely on contractual protections contained in their terms of service which govern how the content on their websites may be accessed and exploited. Additionally, regulation against unlawful competition is designed to protect against the misappropriation of content for purposes of creating competing products and services. This is to deter free riding and prevent dilution of incentives to invest in new content. The proper application of and respect for these laws is part of the basic framework underlying the thriving internet economy.
The value of copyrighted works must be protected
The primary revenues for publishers are advertising, licensing, and, increasingly, subscriptions. Publishers make their copyrighted content available to consumers through a wide range of means, including on websites and apps that are supported by various methods for monetization such as metered paywalls. It is important to note that even if content is available online and not behind a subscription wall, that does not extinguish its copyrighted status. In other words: It is not free for the taking.
That said, there are many cases where a copyright holder may choose to allow the use of their original work for commercial or non-commercial purposes. In these cases, potential licensees contact the copyright holder to seek a negotiated agreement, which may define the extent to which the content may be used and any protections for the copyright holder’s brand.
Unfortunately, AI developers, in large part, do not respect the framework of laws and rules described above. They seek to challenge and reshape these laws in a manner that would be exceptionally harmful for digital publishers, by bolstering their position that content made publicly available should be free for the taking – in this case, to build and operationalize AI models, tools and services.
Publishers are embracing the benefits of AI innovation. They are partnering with developers and third parties, for both commercial and non-commercial purposes, to provide access and licenses for the use of their content in a manner that is mutually beneficial. However, incentives are lacking to encourage AI developers to seek permission and access/licensing solutions. Publishers need a practical tool to signal to bots at scale whether they wish to permit crawling and scraping for the purposes of AI exploitation.
Next steps and the future of robots.txt
The IETF should update the robots.txt protocol to create more specific technical measures that will help publishers convey the purposes for which their content may or may not be used, including by expressing limitations on the scraping and use of their content for GAI purposes. While this should not be viewed as in any way reducing the existing legal obligations of third parties to seek permission directly from copyright holders, it could be useful for publishers to be able to signal publicly and through a machine-readable format what uses are permitted, e.g. scraping for search purposes is permitted, whereas scraping to train LLMs or other commercial GAI purposes is not.
Of course, a publisher’s terms of use should always remain legally binding and trump any machine-readable signals. Furthermore, these measures should not be treated as creating an “opt out” system for scraping. A publisher’s decision not to employ these signals is not permission (either explicit or implicit) to scrape websites or use content in violation of the terms of use or applicable laws. And any ambiguity must be construed in favor of the rights holders.
In order to achieve a solution in a timely and efficient manner, the focus should be on a means to clearly and specifically signal permission or prohibitions against crawling and scraping for the purposes of AI exploitation. Others may seek to layer certain licensing solutions on top of this, which should be left to the market. In addition, there must be transparency for bots which crawl and scrape for purposes of AI exploitation. Any solution should not depend on the willingness of AI developers to announce the identities of their bots, nor permit them to operate in any manner that obscures their identity and the purposes of their activity.
And, critically, search and AI scraping must not be commingled. The protocol should not be allowed to be used in a manner that requires publishers to accept crawling and scraping for AI exploitation as a condition for being indexed for search.
Let’s not repeat the mistakes of the past by allowing big tech companies to leverage their dominance in one market to dominate an emerging market like AI. Original content is important to our future and we should build out web standards that carry forward our longstanding respect for copyright in the AI age.
Two small letters – AI – are proving major disruptors for the digital news industry. Already, 70% of journalists and newsroom leaders use generative AI in some capacity. While we hear much excitement over its potential, opinion remains divided. Senior editorial representatives at titles such as UK newspaper The Sun have expressed concern that AI in journalism will deliver a “tsunami of beige” content. And readers say they prefer AI-free news.
As such, there is a balancing act to be performed for AI and human input in newsrooms, pitting its potential to revolutionize the industry against the risks of over-reliance and ethical concerns to find a middle ground that benefits all. Against a backdrop of budget and team cuts, however, it’s crucial to realize that AI can enhance workflow efficiency, allowing journalists to do more with less. That, in turn, can allow teams to focus on what matters the most: creating human-centered, authentic content that engages the reader.
Using AI to bridge the gap with users
Primary motivations among journalists for using AI technologies include automating mundane tasks. But as Ranty Islam, Professor of Digital Journalism, pointed out at the 2024 DW Global Media Forum, this isn’t the be-all and end-all of AI. The key lies in integrating AI into a holistic strategy that brings journalism closer to readers. Using AI to perform necessary but time-intensive tertiary tasks means that journalists—notably those in local newsrooms with smaller budgets and teams—can get more actual journalism done. They can get out there to connect with real people and stories.
Moreover, as audience needs change, AI can help newsrooms track and enhance the stories and formats that perform for them in an audience-first strategy. Using AI alongside tracking means newsrooms can harness suggestions about when to write stories and the kinds of topics and formats that audiences want to see. Content suggestions can be made based on AI tagging systems, while algorithms that suggest stories based on user behavior or interests can help tailor content for different audiences. This enhances the reader experience for greater engagement and retention while also helping boost subscription offerings through data-driven personalization.
Doing more with less
There is a wealth of opportunities for newsrooms to use AI to help with everyday tasks, and the benefits for understaffed newsrooms are clear. The local news sector has particularly felt the impact, losing nearly two-thirds of its journalists since 2005. AI can serve as a helping hand for tasks that would otherwise require multiple staff members. We have integrated AI into our liveblogging software, for example, to enable users to generate liveblog content summaries in seconds, assimilate live sports results, adjust tone and language, create social media posts, and generate image captions for enhanced SEO.
AI’s potential to localize content and engage new audiences is widely recognized. FT Strategies has highlighted AI translation as “truly disruptive in the context of news”, particularly for multilingual communities or multinational publishers seeking to replicate regional content across multiple sites.
Indeed, AI excels at extracting and condensing information, making it extremely useful for summary generation. While most reporters aren’t keen on full AI copy generation, enabling teams to recycle their content quickly and easily or suggest headlines based on keywords can be a huge help. Moreover, since the training data is their own, the summaries reflect the author’s original, authentic style.
This summarization can be carried through to data analysis to create charts and infographics. AI can even create text descriptions for supporting media to help with search engine optimization or social media posts and matching hashtags to promote stories. Journalists aren’t typically SEO or social media managers, but small teams sometimes need to wear multiple hats. Using AI as a virtual assistant allows reporters to focus more energy on their reporting.
AI can also be harnessed to support a journalist’s development or to augment collaboration or brainstorming that might once have been done in a newsroom among a large team. AI can be used to identify story gaps or flaws, suggest improvements, proofread, make composition suggestions, and adapt tone to the situation at hand. This is particularly useful when wanting to address various user needs or even different age groups.
Liveblogs offer a prime example of how AI can be harnessed to enhance reporting, helping manage and update live content in real-time, automatically pulling in relevant information, images, and social media posts. This allows journalists to focus on providing valuable insights and context, delivering a dynamic and engaging experience for readers.
Trust and transparency while using AI in journalism
Using AI behind the scenes in this multitude of ways chimes with reader comfort levels. The Reuters Institute for the Study of Journalism revealed that while readers are skeptical of AI-generated content, they are generally happier with AI handling behind-the-scenes tasks under human supervision—as long as newsrooms are transparent about AI usage.
While AI can help to detect disinformation, fact-check, or moderate online comments, adding to the integrity of journalistic content, its tendency to hallucinate and invent also means that human oversight is vital. We must train journalists to work alongside AI, using it to enhance, not supersede, their skills, striking a balance between AI utility and the preservation of the human element in news reporting.
AI is a tool, not a substitute. It can automate mundane tasks, save time and assist research and brainstorming processes, but its power lies in complementing human effort, not replacing or overshadowing it.
The future of journalism lies in a hybrid approach in which AI supports, not replaces, the essential human touch that defines quality journalism. Whatever the medium—print, online, liveblog—by fostering collaboration between technology and editorial expertise, newsrooms can navigate this evolving landscape, ensuring that AI enhances, rather than diminishes, the integrity and creativity of journalistic work.
For media companies investigating how to incorporate AI into their operations, open chatbots like ChatGPT, Claude, or Google Bard are often a natural starting point. Thanks to OpenAI’s game-changing launch of ChatGPT for public use in 2022, AI has essentially become synonymous with chatbots in the public mind, so it’s unsurprising that’s where many media companies turn first when experimenting with AI.
Open chatbots certainly have useful applications, but they also have some serious limitations. The user experience for these chatbots varies widely. They place a burden on the user to know how to engineer prompts that will generate good results, which can lead to significant user fatigue. Media companies will get more value from AI by going beyond basic chatbots to build capabilities that will deliver better results and a better experience for users.
Creating a better chatbot
The success of any AI-powered chatbot comes down to what’s underneath. The Washington Post has launched an excellent climate chatbot, which works well because they invested in building the underlying functionality from the ground up. They also emphasized trustworthy responses: the underlying large language model synthesizes information from Washington Post articles published since 2016 in their Climate & Environment and Weather sections. The chatbot is also highly controlled and tightly framed to focus on climate coverage, which delivers a better experience than open chatbots.
Building a good chatbot requires many different technologies. First, a chatbot needs a solid system prompt behind it that defines what it is and is not supposed to do. Second, the chatbot needs to be based on a fast, performant model, which could be OpenAI’s GPT-4, Google’s Gemini, or another large language model. The platform needs to be quick in order to deliver answers in real time. And finally, the chatbot model needs to be trained to ensure it does what you want it to do.
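As a sketch of the first two ingredients, the snippet below wires a tightly scoped system prompt to a hosted model. It assumes the OpenAI Python SDK; the model name, prompt wording, and the climate focus (echoing the Washington Post example above, not its actual implementation) are illustrative.

```python
# A minimal sketch (not any publisher's actual implementation) of how a
# system prompt scopes a chatbot. Assumes the OpenAI Python SDK; the model
# name and prompt wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """You are a climate-coverage assistant for a news site.
Answer ONLY questions about climate and environment topics.
If a question is out of scope, say so and point to the site's climate section.
Base answers on the publication's articles; never invent facts or sources."""

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # any fast, capable chat model works here
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("What has our coverage said about Arctic sea ice?"))
```

The system prompt is what keeps the bot tightly framed: out-of-scope questions are refused rather than answered badly.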
It’s also important to remember that AI doesn’t have to be synonymous with a chat- or prompt-based interface. Based on the use case, a button, an automated feature, or a call to another application might be more appropriate.
Leveling up with fine tuning and vectors
To truly deliver meaningful applications of AI with significant business value, media organizations should look to fine tuning and vector databases. These advanced capabilities allow chatbots and other AI applications to be customized to meet an organization’s specific needs and operate with a deep understanding of its content base.
Fine tuning trains an AI model to deliver results that are tailored to your requirements. It essentially tells the model to read all of your content and learn exactly what you want it to do and what the results should be. For example, with fine tuning, an AI model can learn to generate headlines in a specific length or style. This can be done even down to the author level, with the model learning how to detect who the content creator is and return a headline or summary in their tone of voice.
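What does that training data look like in practice? Below is a minimal sketch in the JSONL chat format used by OpenAI-style fine-tuning endpoints; the article text, headline, and style rule are invented placeholders.

```python
# Sketch: supervised fine-tuning data for house-style headlines, in an
# OpenAI-style chat JSONL format. All example content is invented.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "Write a headline in our house "
                                          "style: present tense, under 60 characters."},
            {"role": "user", "content": "<full article text>"},
            {"role": "assistant", "content": "City Council Approves Transit Levy"},
        ]
    },
    # ...hundreds more pairs, ideally grouped per author to capture voice
]

with open("headline_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```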
Vector databases go a step further by building a knowledge map of all of your content—you could even think of a vector database as a miniature brain that serves as the “memory” for your AI applications. At a basic level, a vector database stores data or content in various formats (a single word, a story, a photo, a video, etc.) and creates a numerical representation of that content. The numbers assigned to each piece of content are used to calculate its distance from other content in terms of relevance. Mapping content relationships in this way enables powerful search and recommendation applications.
To understand fine tuning and vector databases in practical terms, we can look at the example of using AI for content tagging. A general-purpose AI model like GPT can look at a story and identify keywords or topics that could be used for tagging, but it doesn’t understand your specific tagging requirements.
Fine tuning the model will incorporate your tagging requirements. For example, if you have a specific set of approved tags, it will return a result that’s tailored to those needs. A vector database will not only know your whole tag library but will also understand the relationships between tags and identify overlaps that will help with powering search and content recommendations.
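Here is a hedged sketch of that idea: embed an approved tag library and a story as vectors, then pick the nearest tags. The tags, story text, model choice, and the sentence-transformers library are illustrative assumptions, not any specific vendor’s implementation.

```python
# Sketch: the vector-database idea applied to tagging. Embed an approved
# tag library, embed a story, and rank tags by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

approved_tags = ["monetary policy", "housing market", "local politics",
                 "climate change", "consumer technology"]
tag_vecs = model.encode(approved_tags, normalize_embeddings=True)

story = "The central bank signaled two more rate hikes this year..."
story_vec = model.encode([story], normalize_embeddings=True)[0]

scores = tag_vecs @ story_vec           # cosine similarity (vectors normalized)
top = np.argsort(scores)[::-1][:2]      # two closest approved tags
print([approved_tags[i] for i in top])  # e.g. ['monetary policy', ...]
```

A production system would precompute and store these vectors in a dedicated vector database rather than re-encoding on every request.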
It’s an exciting time for AI in the media industry, with new developments emerging every month. Building your own AI capabilities can be daunting, and software vendor offerings for AI vary widely. If you spend some time learning about the possibilities for AI, including chatbots and beyond, you’ll be well positioned to create your AI strategy and identify the technologies and vendors that can help you achieve your goals.
The internet is seen by some as a vast repository of information readily available for training open and closed AI systems. However, this “data commons” raises significant ethical and legal concerns regarding data consent, attribution, and copyright, particularly for media companies. These concerns are growing due to the fear that AI systems may use the media’s content for training without consent, exacerbating conflicts over intellectual property rights.
A new study, Consent in Crisis: The Rapid Decline of the AI Data Commons, investigates these issues by examining how AI developers use web data and how data access and usage protocols shift over time. This research involves a comprehensive audit of web sources used in major AI training datasets, including C4, RefinedWeb, and Dolma.
The research also evaluates the practices of AI developers, such as Google, OpenAI, Anthropic, Cohere, and Meta, as well as non-profit archival organizations such as Common Crawl and the Internet Archive. By focusing on dynamic web domains and tracking changes over time, this study assesses the evolving landscape of data usage and its implications for media companies.
The changing landscape of consent for AI training
The research observations provide strong empirical evidence for the misalignment between AI uses and web-derived training data. This analysis helps track major shifts in signaling consent preferences and reveals current tools’ limitations.
Increased restrictions on AI data
From April 2023 to April 2024, a growing number of websites started blocking AI bots from collecting their data. Websites accomplish this by including specific instructions in files called robots.txt and their terms of service.
Impact: About 25% of the most critical data sources and 5% of all data used in some major AI datasets (C4, RefinedWeb, and Dolma) are now off-limits to AI.
Consent asymmetries and inconsistencies
OpenAI’s bots, which collect data for AI training, are blocked more often than other companies’ bots. The rules about what these bots can and cannot do are often unclear or inconsistent.
Impact: This inconsistency makes adhering to data usage preferences difficult and indicates that existing management tools are ineffective.
Divergence in the web data quality
The most popular web domains for AI training are news sites, forums, and encyclopedias, along with academic and government content. These domains contain diverse content, such as images, videos, and audio. Many of these sites monetize via ads and paywalls. They also frequently place restrictions on how their content can be used in their terms of service. In contrast, other web domains consist of personal/organizational websites, blogs, and e-commerce sites with less monetization and fewer restrictions.
Impact: The increasing restrictions on popular, content-rich websites mean that AI models must increasingly rely on open or user-generated content. Thus, they miss out on the highest-quality and most up-to-date information, potentially affecting their performance and accuracy.
Mismatch between web data and AI usage
The web data collected for training AI often bears little relation to the actual tasks AI systems perform in the real world.
Impact: This misalignment could lead to problems with AI systems’ performance and data collection. It may also lead to legal issues related to copyright.
AI economic fears may reshape internet data
The use of internet content for AI training, which was not its original intent, shifts incentives for content creation. With the increasing use of paywalls and ads, small-scale content providers might opt out or move to walled platforms to protect their data. Without better control mechanisms for website owners, the open web is likely to shrink further, with more content locked behind paywalls or logins to prevent unauthorized use.
Impact: This trend could significantly reduce the availability of high-quality information for AI training.
The media’s choice to opt out of AI training
While the Internet has served as a critical resource for AI development, the use of content created by others, including the media, often at great expense, without consent presents significant ethical and legal challenges. As more media companies choose to exclude their content from AI training, the datasets become less representative and outdated. The decline in data quality reduces the relevance and accuracy of the resulting AI models. Improved data governance and transparency are therefore essential: they would allow content to remain openly accessible online while providing a framework for the ethical use of web content in AI training, which in turn should improve the quality of training data.
Creativity fuels innovation and expression across various media disciplines. With the advent of generative artificial intelligence (AI) and large language models (LLMs), many question how these technologies will influence human creativity. While generative AI may effectively enhance productivity across sectors, its impact on creative processes remains an open question. New research, Generative AI enhances individual creativity but reduces the collective diversity of novel content, investigates AI’s effect on the creative output of short (or micro) fiction. While the research focuses on short stories, the study examines how generative AI influences the production of creative expression, which has larger implications.
Creative assessment
The study evaluates creativity based on two main dimensions: novelty and usefulness. Novelty refers to the story’s originality and uniqueness. Usefulness relates to its potential to develop into a publishable piece. The study randomly assigns participants to one of three experimental conditions for writing a short story: Human-only, Human with one Generative AI idea, and Human with five Generative AI ideas.
The AI-assisted conditions include three-sentence story ideas to inspire creative narratives. This design allows the researchers to assess how AI-generated prompts affect the creativity, quality, and enjoyment of the stories produced. Both writers and 600 independent evaluators assessed these dimensions, providing a comprehensive view of the stories’ creative merit across different conditions.
Usage and creativity
In the two generative AI conditions, 88.4% of participants used AI to generate an initial story idea. In the “Human with one Gen AI idea” group, 82 out of 100 writers did this, while in the “Human with five Gen AI ideas” group, 93 out of 98 writers did the same. When given the option to use AI multiple times in the “Human with five GenAI ideas” group, participants averaged 2.55 requests, with 24.5% asking for a maximum of five ideas.
The findings from the independent evaluators show that access to generative AI significantly enhances the creativity and quality of short stories. Stories produced with AI-generated ideas consistently rate higher in creativity and writing quality than those written without AI assistance. This effect was particularly noticeable in the Human with five GenAI ideas condition, suggesting that increasing exposure to AI-generated prompts leads to greater creative output.
However, the study also uncovers notable similarities among stories generated with AI assistance. This suggests that while AI can enhance individual creativity, it also homogenizes creative outputs, diminishing the diversity of innovative elements and perspectives. The benefits of generative AI for individual writers may come at the cost of reduced collective novelty in creative outputs.
Implications for stakeholders
The research has several limitations: it restricts the creative task by length (eight sentences), medium (writing), and type of output (short story), and there is no interaction with the LLM or variation in prompts. Even so, it highlights the complex interplay between AI and creativity across different artistic domains. Future studies should explore longer and more varied creative tasks, different AI models, and the ethical considerations surrounding AI’s role in creative text and video production. Examining the cultural and economic implications in creative media sectors and balancing innovation with preserving artistic diversity is essential.
Generative AI can enhance human creativity by providing novel ideas and streamlining the creative processes. However, its integration into creative media processes must be thoughtful to safeguard the richness and diversity of artistic expression. This study sets the stage for further exploration into generative AI’s evolving capabilities and ethical implications in fostering creativity across diverse artistic domains. As AI technologies evolve, understanding their impact on human creativity is crucial for harnessing their full potential while preserving the essence of human innovation and expression.
In the fast-paced world of digital publishing, the latest wave of developments in Artificial Intelligence (AI) has emerged as a welcome solution to the organized chaos of ad operations. Yet, despite its transformative potential, many media companies struggle with the adoption of AI technology. Costly implementation, complex integrations, and a shortage of AI-savvy professionals are hurdles slowing adoption to a snail’s pace.
For media executives looking to move faster, the answer is simple: purpose-built AI solutions. Forget everything you know about generic AI technology. The real magic happens with AI solutions built for a specific purpose. By embracing a tailored approach, media executives can accelerate AI adoption with purpose-built solutions that deliver immediate value and growth.
But to harness AI’s power, understanding the strategic advantage of purpose-built AI solutions is crucial. These specialized tools can help media companies reduce implementation issues and offer tangible benefits. Let’s explore the common challenges in media operations that custom AI tools can address.
Unpacking the challenges in digital media operations
Operationally, digital publishers have their work cut out for them in today’s digital media ecosystem. Fragmented data, inefficient manual workflows, and the sheer complexity of ad operations create significant challenges. These issues slow down processes and hinder the ability to quickly adapt to market changes.
This is where a purpose-built AI solution can deliver a strategic advantage over a generic AI tool. Think of a generic AI tool as a Swiss Army knife – versatile but not specialized for any one task. In contrast, a purpose-built AI solution is like a precision scalpel, expertly designed for a specific function, ensuring optimal performance and efficiency in that area.
Finding and implementing AI solutions involves extensive testing and high costs. However, purpose-built AI sidesteps these challenges with pre-designed functionalities that can be implemented quickly and efficiently. Here’s how these tailored solutions address common implementation hurdles.
Lower implementation costs
The initial investment in AI technology, including hardware, software, and skilled personnel, can be prohibitively expensive. However, purpose-built AI solutions are pre-designed for specific tasks, reducing the need for extensive custom development. This lowers both initial investment and ongoing costs.
Simplified integrations
Integrating AI systems with existing workflows often proves difficult, requiring significant time and resources. But purpose-built solutions are designed to integrate seamlessly with existing workflows and technologies, minimizing the complexity and time required for setup. They offer specific capabilities that streamline the integration process.
Unified data management
Disparate data sources and poor data quality hinder AI performance. According to Theorem’s research, 33% of ad professionals cite a lack of centralized tools as a major pain point. Purpose-built AI solutions consolidate data sources, improving quality and consistency. This unified approach enables more accurate insights, better decision-making, and more effective ad targeting.
User-friendly
There is often a shortage of professionals with the expertise needed to develop, implement, and maintain AI systems. With user-friendly interfaces and automated features, purpose-built AI solutions reduce the dependency on specialized AI talent. This makes it easier for existing staff to utilize and manage the AI system.
Faster deployment
These solutions are designed for specific workflows and processes, which shortens the development cycle and accelerates deployment and team training. Organizations can rapidly implement the solution and hit the ground running.
With implementation challenges out of the way, let’s turn to the tangible benefits and rapid results purpose-built AI solutions have to offer.
Benefits and strategic growth opportunities
Purpose-built, custom AI solutions offer a number of benefits and opportunities for growth, including:
Immediate value
With an AI solution specifically designed to automate ad operations, implementation and adoption shift from labor-intensive to quick and easy. This allows media companies to quickly realize productivity gains by tapping into ready-to-launch solutions almost instantly.
Scalability
These solutions are built to scale seamlessly with the company’s growth. As your business expands and evolves, purpose-built AI solutions can adapt to new requirements and increased workloads. This flexibility ensures sustained performance and supports long-term success without the need for constant reinvestment in new technologies.
Cost-effectiveness
Purpose-built AI solutions offer significant cost benefits. Processes are streamlined, and makegoods and errors decrease. As a result, implementation and operational costs are reduced.
New revenue generation
Purpose-built AI solutions can identify new revenue streams and optimize existing ones. For example, an AI solution built specifically to increase engagement through more targeted, personalized advertising can generate more ad revenue. Or consider the impact of a solution designed to predict what type of content will be popular in the future. This solution would allow publishers to focus on creating content that is more likely to attract and retain users, driving more revenue.
Maintaining a competitive edge in today’s turbulent digital media ecosystem requires the ability to act swiftly and strategically. Understanding the benefits is just the beginning; now it’s time to take action.
Practical steps to drive quick adoption of purpose-built AI solutions
Implementing purpose-built AI solutions can be streamlined with the right approach. By following these steps, your organization can swiftly integrate AI technology and start reaping the benefits.
1. Start by identifying key areas where AI can have the most impact with a thorough assessment of current processes. Prioritize those that promise the quickest wins and greatest value.
2. Research and select AI solutions with capabilities that align with your business goals and workflow challenges.
3. Measure the potential impact on data, infrastructure, and governance to ensure smoother AI adoption.
4. Identify training needs and assess any ethical considerations.
5. Carefully evaluate vendors based on functionality, ease of integration, and proven success.
6. Begin with a pilot implementation: test the solution in a controlled environment, gather feedback, and make necessary adjustments before a full rollout.
Investing in a purpose-built AI solution is a long-term strategy that yields ongoing benefits as the technology evolves. Much like a tailored suit that fits perfectly versus a suit off the rack, it offers the precise fit and functionality needed to drive strategic growth. Those who embrace it now stand to reap immediate productivity gains, scalability, and cost-effectiveness.
The introduction of AI-generated search results is just the next step in a long line of the platforms moving more of their audience interactions behind walled gardens. This is an accelerating trend that’s not going to reverse. Google began answering common questions itself in 2012, Meta increased its deprioritization of news in 2023, and now some analysts are predicting that AI search will drop traffic to media sites by 40% in the next couple of years.
It’s a dire prediction. Panic is understandable. The uncertainty is doubled by the sheer pace of AI developments and the fracturing of the attention economy.
However, this is another situation in which it is critical to focus on the fundamentals. Media companies need to develop direct relationships with audiences, double down on quality content, and use new technology to remove any inefficiencies in their publishing operation. Yes, the industry has already done this for decades. However, there are new approaches in 2024 that can allow publishers to improve experiences to attract direct audiences.
All-in on direct relationships
When there’s breaking news, is the first thought in your audience’s mind opening your app, or typing in your URL? Or are they going to take the first answer they can get – likely from someone else’s channel?
Some media companies view direct relationships as a “nice to have” or as a secondary objective. If that’s the case, it’s time to make them a priority.
Whether direct relationships are already the top priority or not, now’s a good time to take a step back to re-evaluate the website’s content experience and the business model that supports it. Does it emphasize—above all else—providing an audience experience that encourages readers to create a direct relationship with your business?
When the cost to produce content is zero, quality stands out
This brings us to the avenue that drives direct relationships: your website and your app. Particularly as search declines as a traffic source, these become the primary interaction space with audiences. We’ll follow up next month with frameworks for your product team to use to make your website and apps more engaging to further build your direct audience traffic.
It’s no longer about competing for attention on a third-party platform—for example through a sheer quantity of content about every possible keyword. It’s about making the owned platform compelling. Quality over quantity has never been more important.
Incorporating AI into editorial workflow
As the cost of creating content falls toward zero thanks to large language models (LLMs), the internet will fill up with generic noise—even more so than it already is.
Genuinely high-quality content will rise in appreciation, both from readers themselves and from the search engines that deliver traffic to them. Google is already punishing low-quality content. So are audiences. The teams using LLMs to generate entire articles, whole-cloth, are being downgraded by Google (and this approach is not likely to drive readers to you directly either).
But AI does have its uses. One big challenge in generating quality content is time. Ideally, technology gives time back to journalists. They’ll have extra time to dig into their research. They may gain another hour to interview more sources and find that killer quote. Editors have more time to really make the copy pop. The editorial team has more time for collaborating on the perfect news package. The list goes on.
AI is perfect for automating all the non-critical busywork that consumes so much time: generating titles, tags, excerpts, backlinks, A/B tests, and more. This frees up researchers, writers, and creatives to do the work that audiences value most, and deliver the content that drives readers to return to websites and download apps.
This approach has been emerging for a while now. For example, ChatGPT is great at creating suggestions for titles, excerpts, tags, and so on. However, there’s a new approach that’s really accelerating results: Retrieval Augmented Generation (RAG).
RAG is the difference maker when it comes to quality
The base-model LLMs are trained on the whole internet, rather than specific businesses. RAG brings an organization’s own data to AI generation. Journalists using ChatGPT alone will get “ok” results that they then need to spend time fixing. With RAG, they can focus the results to make sure they’re tuned to your particular style. That’s important for branding, and it also frees up creatives’ time for other things.
The next level not only uses content data, but also performance data to optimize RAG setups. This way, AI is not just generating headline suggestions or excerpts that match a particular voice, it’s also basing them on what has historically generated the most results.
In other words, instead of giving a newsroom ChatGPT subscriptions and saying “have at it,” media companies can use middleware that intelligently prompts LLMs using their own historical content and performance data.
Do this right and journalists, editors, and content optimizers can effortlessly generate suggestions for titles, tags, links, and more. These generations will be rooted in brand and identity, instead of being generic noise. This means the team doesn’t need to spend time doing all that manually, and can focus on content quality.
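A minimal sketch of such middleware is shown below, assuming an OpenAI-style chat API. The `archive_index.search()` call is a hypothetical stand-in for a vector-database query over your own archive, returning past headlines with their performance data; all names and the prompt wording are illustrative.

```python
# Sketch of RAG middleware for headline suggestions: retrieve the newsroom's
# own best-performing headlines for similar past stories, then prompt the
# LLM with them as style/performance exemplars.
from openai import OpenAI

client = OpenAI()

def suggest_headlines(draft_text: str, archive_index) -> str:
    # archive_index.search() is a hypothetical stand-in for a vector-database
    # query; assume it returns (headline, click_through_rate) pairs.
    exemplars = archive_index.search(draft_text, top_k=5)
    exemplar_block = "\n".join(
        f"- {headline} (CTR {ctr:.1%})" for headline, ctr in exemplars
    )
    prompt = (
        "Here are past headlines from our own archive for similar stories, "
        "with how well each performed:\n"
        f"{exemplar_block}\n\n"
        "Write three headline options for the draft below, matching our "
        "voice and favoring patterns that performed well.\n\n"
        f"DRAFT:\n{draft_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```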
Using RAG to leverage the back catalog
Media companies have thousands upon thousands of articles published going back years. Some of them are still relevant. But the reality is that leveraging the back catalog effectively has been a difficult undertaking.
Humans can’t possibly remember the entirety of everything an organization has ever published. But machines can.
A machine plugged into the CMS can use Natural Language Processing (NLP) to understand the content currently being worked on—what is it about? Then it can check the back catalog for every single other article on the topic. It can also rank each of those historical articles by which generated the most attention and which floundered. Then it can help staff insert the most high-performing links into current pieces.
Similarly, imagine the same process, just in reverse. By automating the updating of historical evergreen content with fresh links, new articles can immediately jump-start with built-in traffic.
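A hedged sketch of the forward direction of this process: embed the draft, retrieve topically similar archive pieces, and rank them by a blend of similarity and historical performance. The library, the record fields, and the 70/30 weighting are illustrative assumptions, not a specific product’s behavior.

```python
# Sketch: rank back-catalog articles as link candidates for a draft,
# blending topical similarity with historical performance.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def link_candidates(draft: str, archive: list[dict], top_k: int = 5):
    """archive items: {'title': str, 'text': str, 'pageviews': int, 'url': str}"""
    draft_vec = model.encode([draft], normalize_embeddings=True)[0]
    doc_vecs = model.encode([a["text"] for a in archive],
                            normalize_embeddings=True)
    similarity = doc_vecs @ draft_vec  # cosine similarity
    # Normalize pageviews to [0, 1] and blend: 70% topical fit, 30% performance.
    views = np.array([a["pageviews"] for a in archive], dtype=float)
    performance = views / views.max()
    score = 0.7 * similarity + 0.3 * performance
    best = np.argsort(score)[::-1][:top_k]
    return [(archive[i]["title"], archive[i]["url"], score[i]) for i in best]
```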
Removing silos between creation and analysis
While Google traffic might be declining, it will nonetheless remain important in this new world. And in this period of uncertainty, media organizations need to convert as much as possible of the traffic from this channel while it is still operating.
We call this “Leaving the platforms behind.” Media companies should focus on getting as much of the traffic from search and other channels into first-party data collection funnels as possible. This way, they can build enough moat to continue floating if any or all of these traffic channels completely disappear.
Most teams today have dedicated SEO analysts who are essentially gatekeepers between SEO insights and content production. The SEO analysts aren’t going anywhere any time soon. But the new table stakes are that every journalist needs to be able to self-serve keyword insights.
It is important to use analytics tools that bring search console data directly into the approachable, easy-to-use article analytics pages the editorial team already knows. Ideally, analytics tools should connect keywords and other platform traffic to conversions, so everyone on your team can understand their impact on leaving the platforms behind.
Done well, you’ll create a feedback loop that evolves and improves your content in a way that resonates with readers and machines.
Quality is all that matters
This is not the first “all hands on deck” moment for the media industry. That being said, what we’re seeing is that the barometer of success is a truly aligned strategy and execution that brings product, business development, and editorial teams together to pursue creating first-party relationships with audiences. The organizations that have little brand identity, and pursue traffic instead of subscriptions, are suffering—and will likely continue to do so.
Last month, I co-led a week-long journalism program during which we visited 16 newsrooms, media outlets and tech companies in New York. This study tour provided an in-depth snapshot of the biggest issues facing the media today and offered insights into some of the potential solutions publishers are exploring to address them.
We met with everyone from traditional media players – like The New York Times, Associated Press, CBS and Hearst – to digital providers such as Complex Media and ProPublica, and also spoke with academics and policy experts. Based upon these visits and conversations, here are four key takeaways about the state of media and content publishing today.
1. Hands-on AI experience matters
Not surprisingly, AI dominated many conversations. Although recent research shows the American public is both skeptical and surprisingly unaware of these tools, the emergence of Generative AI – and the discussions around it – are impossible to ignore.
One mantra oft repeated throughout the week was that everyone in the media will need to be conversant with AI. Despite this, research has shown that many newsrooms are hesitant about adopting these technologies. Others, however, are taking a more proactive approach. “I like playing offense, not defense,” Aimee Rinehart, Senior Product Manager AI Strategy at the Associated Press, told us. “Figure out how the tools work and your limits.”
With many media companies having to do more with less, AI can help improve workflows, support labor-intensive work like investigative journalism, and streamline and diversify content creation and distribution. By harnessing these AI-powered functions, smaller outlets may benefit the most, given the efficiencies these resource-strapped players may be able to unlock.
Reporting on AI is also an emerging journalistic beat. This is an area more newsrooms are likely to invest in, given AI’s potential to radically reshape our lives. As Hilke Schellmann, an Emmy‑award winning investigative reporter and journalism professor at NYU, told us, “we used to hold powerful people to account, now we have to add holding AI accountable.”
Echoing Schellmann’s sentiments, “every journalist should be experimenting with AI,” one ProPublica journalist said. “We owe it to our audience to know what this is capable of.”
2. Demonstrating distinctiveness and value is imperative
One fear of an AI-driven world is that traffic to publishers will tank as Generative Search, and tools like ChatGPT, remove the need for users to visit the sites of creators and information providers. In that environment, distinctive, trustworthy and fresh content becomes more valuable than ever. “You need to produce journalism that gives people a reason to show up,” says Ryan Knutson, co-host of The Wall Street Journal’s daily news podcast, The Journal.
In response, publishers will need to demonstrate their expertise and unique voice. That means leaning more into service journalism, exclusives, and formats like explainers, analysis, newsletters, and podcasts.
Bloomberg’s John Authers exemplifies this in his daily Points of Return newsletter. With more than three decades of experience covering markets and investments, he brings a longitudinal and distinctive human perspective to his reporting. Alongside this, scoops still matter, Authers suggests. After all, “journalism is about finding out something other people don’t know,” he says.
Media players also need to make a more effective case as to why original content needs to be supported and paid for. As Gaetane Michelle Lewis, SEO leader at the Associated Press, put it, “part of our job is communicating to the audience what we have and that you need it.”
For a non-profit like ProPublica, that means demonstrating impact. They publish three impact reports a year, and their Annual Report highlights how their work has led to change at a time when “many newsrooms can no longer afford to take on this kind of deep-dive reporting.”
“Our North Star is the potential to make a positive change through impact,” Communications Director Alexis Stephens said, emphasizing that “this form of journalism is critical to democracy.”
The New York Times’ business model is very different but its publisher, A.G. Sulzberger, has similarly advocated for the need for independent journalism. As he put it, “a fully informed society not only makes better decisions but operates with more trust, more empathy, and greater care.”
Given the competition from AI, streaming services, and other sources of attention, media outlets will increasingly need to advocate more forcefully for support through subscriptions, donations, sponsorships, and advertising. In doing this, they’ll need to address what sets them apart from the competition, and why this matters on a wider societal level.
“This is a perilous time for the free press,” Sulzberger told The New Yorker last year. “That reality should animate anyone who understands its central importance in a healthy democracy.”
3. Analytics and accessibility go hand in hand
Against this backdrop, finding and retaining audiences is more important than ever. However, keeping their attention is a major challenge. Data from Chartbeat revealed that half the audiences visiting outlets in their network stay on a site for fewer than 15 seconds.
This has multiple implications. From a revenue perspective, this may mean users aren’t on a page long enough for ad impressions to count. It also challenges outlets to look at how content is produced and presented.
In a world where media providers continue to emphasize growing reader revenues, getting audiences to dig deeper and stay longer is essential. “The longer someone reads, the more likely they are to return,” explained Chartbeat’s CMO Jill Nicolson.
There isn’t a magic wand to fix this. Tools for publishers to explore include compelling headlines, effective formats, layout, and linking strategies. Sometimes, Nicolson said, even small modifications can make all the difference.
These efforts don’t just apply to your website. They apply to every medium you use. Brendan Dunne of Complex Media referred to the need for “spicy titles” for episodes of their podcasts and YouTube videos. Julia D’Apolito, Associate Social Editor at Hearst Magazines, shared how their approach to content might be reversed. “We’ve been starting to do social-first projects… and then turning them into an article,” she said, rather than the other way round.
Staff at The New York Times also spoke about the potential for counter-programming. One way to combat news fatigue and avoidance is to shine a light on your non-news content. The success of NYT verticals such as Cooking, Wirecutter, and Games shows how diversifying content can create a more compelling and immersive proposition, making audiences return more often.
Lastly, language and tone matter. As one ProPublica journalist put it, “My editor always says pretend like you’re writing for Sesame Street. Make things accurate, but simple.” Reflecting on their podcasts, Dunne also stressed the need for accessibility. “People want to feel like they’re part of a group chat, not a lecture,” he said.
Fundamentally, this also means being more audience-centric in the way that stories are approached and told. “Is the angle that’s interesting to us as editors the same as our audiences?” Nicolson asked us. Too often, the data would suggest, it is not.
4. Continued concern about the state of local news
Finally, the challenges faced by local news media, particularly newspapers, emerged in several discussions. Steven Waldman, the Founder and CEO of Rebuild Local News, reminded us that advertising revenue at local newspapers had dropped 82% in two decades. The issue is not “that the readers left the papers,” he said, “it’s that the advertisers did.”
For Waldman, the current crisis is an opportunity not just to “revive local news,” but also to “make better local news.” This means creating a more equitable landscape with content serving a wider range of audiences and making newsrooms more diverse. “Local news is a service profession,” he noted. “You’re serving the community, not the newsroom.”
According to new analysis, the number of partisan-funded outlets designed to appear like impartial news sources (so-called “pink slime” sites) now surpasses the number of genuine local daily newspapers in the USA. This significantly impacts the news and information communities receive, shaping their worldviews and decision-making.
Into this mix, AI is also rearing its ugly head. It can be hugely beneficial for some media companies – “AI is the assistant I prayed for,” says Paris Brown, associate editor of The Baltimore Times – but it can also be used to fuel misinformation, accelerating pink slime efforts.
“AI is supercharging lies,” one journalist at ProPublica told us, pointing to the emergence of “cheap fakes” alongside “deep fakes” as content which can confirm existing biases. The absence of boots on the ground makes it harder for these efforts to be countered. Yet, as Hilke Schellmann reminded us, “in a world where we are going to be swimming in generative text, fact-checking is more important [than ever].”
This emerging battleground makes the case for increased funding for local news all the more pressing. Legislative efforts, increased support from philanthropy, and other mechanisms can all play a role in helping grow and diversify this sector. As Steven Waldman put it plainly: “We have to solve the business model and the trust model at the same time.”
All eyes on the future
The future of media is being written today, and our visit to New York provided a detailed insight into the principles and mindsets that will shape these next few chapters.
From the transformative potential of AI, to the urgent need to demonstrate distinctiveness and value, it is clear that sustainability has to be rooted in adaptability and innovation.
Using tools like AI and analytics to inform decisions, while balancing them with a commitment to quality and community engagement, is crucial. Media companies that fail to harness these technologies are likely to get left behind.
In an AI-driven world, more than ever, publishers need to stand out or risk fading away. Original content, unique voices, counter-programming, being “audience first,” and other strategies can all play a role in this. Simultaneously, media players must also actively advocate for why their original content needs to be funded and paid for.
Our week-long journey through the heart of New York’s media landscape challenged the narrative that news media and journalism are dying. It isn’t. It’s just evolving. And fast.
The public has a knowledge gap around generative artificial intelligence (GenAI), especially when it comes to its use in news media, according to a recent study of residents in six countries. Younger people across countries are more likely to have used GenAI tools and to be more comfortable and optimistic about the future of GenAI than older people. And a higher level of experience using GenAI tools appears to correlate with more positive assessments of their utility and reliability.
Over two thousand residents in each of six countries were surveyed for the May 2024 report What Does the Public in Six Countries Think of Generative AI in News? (Reuters Institute for the Study of Journalism). The countries surveyed were Argentina, Denmark, France, Japan, the UK and the USA.
Younger people more optimistic about GenAI
Overall, younger people had higher familiarity and comfort with GenAI tools. They were also more optimistic about future use and more comfortable with the use of GenAI tools in news media and journalism.
People aged 18-24 in all six countries were much more likely to have used GenAI tools such as ChatGPT, and to use them regularly, than older respondents. Averaging across countries, only 16% of respondents aged 55+ report having used ChatGPT at least once, compared to 56% of those aged 18 to 24.
Respondents 18-24 are much more likely to expect GenAI to have a large impact on ordinary people in the next five years. Sixty percent of people 18-24 expect this, while only 40% of people 55+ do.
In five out of six countries surveyed, people aged 18-34 are more likely to expect GenAI tools to have a positive impact on their own lives and on society. However, Argentina residents aged 45+ broke ranks, expressing more optimism about GenAI improving both their own lives and society at large than younger generations.
Many respondents believe GenAI will improve scientific research, healthcare, and transportation. However, they express much more pessimism about its impact on news and journalism and job security.
Younger people, while still skeptical, have more trust in responsible use of GenAI by many sectors. This tendency is especially pronounced in sectors viewed with greatest skepticism by the overall public – such as government, politicians, social media, search engines, and news media.
Across all six countries, people 18-24 are significantly more likely than average to say they are comfortable using news produced entirely or partly by AI.
People don’t regularly use GenAI tools
Even the youngest generation surveyed reports infrequent use of GenAI tools. However, if the correlation between greater GenAI use and greater optimism and trust holds at a broader scale, trepidation is likely to fade as more people become comfortable using these tools regularly.
Between 20% and 30% of the online public across countries have not heard of any of the most popular AI tools.
While ChatGPT proved by far the most recognized GenAI tool, only 1% of respondents in Japan, 2% in France and the UK, and 7% in the U.S. say they use ChatGPT daily. Eighteen percent of the youngest age group report using ChatGPT weekly, compared to only 3% of those aged 55+.
Only 5% of people surveyed across the six countries report using GenAI to get the latest news.
It’s worth noting that the populations surveyed were in affluent countries with higher-than-average education and internet connectivity levels. Countries that are less affluent, less free, and less connected likely have even fewer people experienced with GenAI tools.
The jury is out on public opinion of GenAI in news
A great deal of uncertainty prevails around GenAI use among all people, especially those with lower levels of formal education and less experience using GenAI tools. Across all six countries, over half (53%) of respondents answered “neither” or “don’t know” when asked whether GenAI will make their lives better or worse. Most, however, think it will make news and journalism worse.
When it comes to news, people are more comfortable with GenAI tools being used for backroom work such as editing and translation than they are with its use to create information (writing articles or creating images).
There is skepticism about whether humans are adequately vetting content produced using GenAI. Many believe that news produced using GenAI tools is less valuable.
Users are more comfortable with GenAI being used to produce news on “soft” topics such as fashion and sports, and much less comfortable with its use for “hard” news such as international affairs and politics.
Thirty percent of U.S. and Argentina respondents trust news media to use GenAI responsibly. Only 12% in the UK and 18% in France agree. For comparison, over half of respondents in most of the countries trust healthcare professionals to use GenAI responsibly.
Most of the public believes it is very important to have humans “in the loop” overseeing GenAI use in newsrooms. Almost half surveyed do not believe that is happening. Across the six-country average, only a third believe human editors “always” or “often” check GenAI output for accuracy and quality.
Across the six countries, an average of 41% say that news created mostly by AI will be “less worth paying for,” 32% say it will be worth “about the same,” and 19% don’t know.
Opportunities to lead
These findings present a rocky road for news leaders to traverse. However, they also offer an opportunity to fill the knowledge gap with information that is educational and reassuring.
Research indicates that the international public overall values transparency in news media as a general practice, and blames news owners and leadership (rather than individual journalists) when it is lacking. However, some research shows users claim to want transparency around GenAI tools in news, but trust news less once they are made aware of its use.
The fact that the public at large is still wavering presents an opportunity for media leaders to get out in front on this issue. Creating policy and providing transparency around the use of GenAI tools in news and journalism is critical. News leaders especially need to educate the public about their standards for human oversight around content produced using GenAI tools.
These days, digital media companies are all trying to figure out how best to incorporate AI into their products, services and capabilities, whether via partnerships or by building their own tools. The goal is to gain a competitive edge as they tailor AI capabilities to the specific needs of their audiences, subscribers and clients.
By leveraging proprietary Large Language Models (LLMs), digital media companies have a new tool in their toolboxes. These offerings provide differentiation and added value, enhanced audience engagement and a better user experience. Proprietary LLMs also set these companies apart from those opting for licensing partnerships with general-purpose LLMs, which draw on a wide range of sources of varying subject matter and quality.
A growing number of digital media companies are rolling out their own LLM-based generative AI features for search and data-based purposes to enhance user experience and create fine-tuned solutions. In addition to looking at several of the offerings media companies are bringing to market, we spoke to Dow Jones, Financial Times and Outside Inc. about the generative AI tools they’ve built and explore the strategies behind them.
Media companies fuel generative AI for better solutions
Digital media companies are harnessing the power of generative AI to unlock the full potential of their own, sometimes vast, stores of proprietary information. These new products allow them to offer valuable, personalized, and accessible content to their audiences, subscribers, customers and clients.
Take, for example, Bloomberg, which released a research paper in March detailing the development of its new large-scale generative AI model, BloombergGPT. The LLM was trained on a wide range of financial data to improve Bloomberg’s existing financial natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, news classification, and question answering. The tool will also help Bloomberg customers organize the vast quantities of data available on the Bloomberg Terminal in ways that suit their specific needs.
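BloombergGPT itself is proprietary, but as an illustration of what one of these financial NLP tasks involves, the sketch below runs sentiment analysis over sample headlines using FinBERT, an openly available model trained for the same task. The headlines are invented, and the model choice is ours, not Bloomberg’s.

```python
from transformers import pipeline

# BloombergGPT is not publicly available; ProsusAI/finbert is an open model
# trained for financial sentiment, used here purely for illustration.
classify = pipeline("text-classification", model="ProsusAI/finbert")

headlines = [  # invented examples
    "Company X beats earnings expectations and raises full-year guidance",
    "Regulator opens probe into Company Y's accounting practices",
]
for headline in headlines:
    result = classify(headline)[0]  # e.g. {'label': 'positive', 'score': 0.95}
    print(f"{result['label']:>8}  {result['score']:.2f}  {headline}")
```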
Fortune, meanwhile, partnered with Accenture to create a generative AI product called Fortune Analytics, launched in beta on June 4. The tool delivers ChatGPT-style responses based on 20 years of financial data from the Fortune 500 and Global 500 lists, as well as related articles, and helps customers build graphic visualizations.
Generative AI helps customers speed up processes
A deeper discussion of how digital media companies are using AI provides insights to help others understand the potential to leverage the technology for their own needs. Dow Jones, for example, uses generative AI for a platform that helps customers meet compliance requirements.
Dow Jones Risk & Compliance is a global provider of risk and compliance solutions for banks and corporations, helping organizations perform checks on their counterparties. Customers use it to comply with anti-money laundering and anti-corruption regulation, and to mitigate supply chain and reputational risk. The unit provides tools that allow customers to search data sets and manage regulatory and reputational exposure.
In April, Dow Jones Risk & Compliance launched an AI-powered research platform for clients that enables organizations to build an investigative due diligence report covering multiple sources in as little as five minutes. Called Dow Jones Integrity Check, the research platform is a fully automated solution that goes beyond screening to identify risks and red flags from thousands of data sources.
The planning for Dow Jones Integrity Check goes back a few years, as the company sought to provide its customers with a quicker way to do due diligence on their counterparties, explained Joel Lange, Executive Vice President and General Manager of Risk and Research at Dow Jones.
Lange said that Dow Jones effectively built a platform that automatically creates a report for customers on a person or company, using technology from AI firm Xapien. It brings together Dow Jones’ data with other data sets, corporate registry information, and wider web content, then leverages the platform’s generative AI capability to produce a piece of analysis or a report.
Dow Jones Risk & Compliance customers use the technology to make critical, often complex, business decisions. The data collection process alone can be incredibly time-consuming, taking days if not weeks.
The new tool “provides investigations teams, banks and corporations with initial due diligence. Essentially it’s a starting point for them to conduct their due diligence, effectively automating a lot of that data collection process,” according to Lange.
Lange points out that the compliance field is always in need of increased efficiency. However, it carries with it great risk to reputation. Dow Jones Integrity Check was designed to reshape compliance workflows, creating an additional layer of investigation that can be deployed at scale. “What we’re doing here is enabling them to more rapidly and efficiently aggregate, consolidate, and bring information to the fore, which they can then analyze and then take that investigation further to finalize an outcome,” Lange said.
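Neither Dow Jones nor Xapien has published implementation details, but the aggregate-then-summarize pattern Lange describes can be sketched roughly as follows. The source connectors, model name, prompt, and entity are all illustrative stand-ins, not the actual Integrity Check pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model choice is illustrative

def fetch_records(entity: str) -> list[str]:
    """Stand-in for real source connectors (news archives, corporate
    registries, sanctions lists, wider web content)."""
    return [
        f"News archive: 2021 article links {entity} to a supply chain dispute.",
        f"Corporate registry: a {entity} director sits on two dissolved firms.",
    ]

def draft_report(entity: str) -> str:
    sources = "\n".join(f"[{i}] {r}" for i, r in enumerate(fetch_records(entity), 1))
    prompt = (
        f"Draft an initial due diligence summary for {entity}. Flag potential "
        "risks, cite sources by number, and note gaps a human analyst should "
        f"investigate further.\n\nSources:\n{sources}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# The output is a starting point only; a human reviews it before any decision.
print(draft_report("Example Holdings Ltd"))
```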
Regardless of the quality of the generated results, most experts believe that it is important to have a human in the loop in order to maintain content accuracy, mitigate bias, and enhance the credibility of the content. Lange also said that it’s critical to have “that human in the loop to evaluate the information and then to make a decision in relation to the action that the customer wants to take.”
In recent months, digital media companies have been launching their own generative AI tools that allow users to ask questions in natural language and receive accurate and relevant results.
The Associated Press created Merlin, an AI-powered search tool that makes searching the AP archive more accurate. “Merlin pinpoints key moments in our videos to the exact second and can be used for older archive material that lacks modern keywords or metadata,” explained AP Editor in Chief Julie Pace at the International Journalism Festival in Perugia in April.
Outside’s Scout: AI search with useful results
Chatbots have become a popular form of search. Originally pre-programmed and only able to answer select questions included in their programming, chatbots have evolved and increased engagement by providing a conversational interface. Used for everything from organizing schedules and news updates to customer service inquiries, Generative AI-based chatbots assist users in finding information more efficiently across a wide range of industries.
Much like The Guardian, The Washington Post, The New York Times and other digital media organizations that blocked OpenAI from using their content to power artificial intelligence, Outside Inc. wasn’t going to let third parties scrape its platforms to train LLMs, CEO Robin Thurston explained.
Instead, they looked at leveraging their own content and data. “We had a lot of proprietary content that we felt was not easily accessible. It’s almost what I’d call the front page problem, which is you put something on the front page and then it kind of disappears into the ether,” Thurston said.
“We asked ourselves: How do we create something leveraging all this proprietary data? How do we leverage that in a way that really brings value to our user?” Thurston said. The answer was Scout, Outside Inc.’s custom-developed AI search assistant.
Outside’s brands inspire and inform audiences about outdoor adventures, new destinations and gear, and much of that coverage is evergreen, proprietary content that still has value if audiences can easily surface it. Generative AI offered a way to keep that content accessible, and useful, after it is no longer front and center on the website.
Scout gives users a summary answer to their question, leveraging Outside Inc’s proprietary data, and surfaces articles that it references. “It’s just a much more advanced search mechanism than our old tool was. Not only does it summarize, but it then returns the things that are most relevant,” he explained.
Additionally, Outside Inc’s old search function worked brand by brand. Scout searches across the 20+ properties owned by the parent company – including Backpacker, Climbing, SKI Magazine, and Yoga Journal – and brings all of the results together in one place, from the best camping destinations and trails to family outdoor activities, gear, equipment and food.
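Outside hasn’t published how Scout’s index is built; as a rough sketch of what unified cross-brand retrieval can look like, the snippet below embeds article titles from several brands into a single index and searches them together. The titles and the embedding model are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Toy stand-in for a unified index across brands; entries are invented.
articles = [
    {"brand": "Backpacker",   "title": "Best family-friendly trails in Colorado"},
    {"brand": "Climbing",     "title": "Crag etiquette for beginners"},
    {"brand": "SKI Magazine", "title": "Early-season snow: where to go first"},
    {"brand": "Yoga Journal", "title": "Post-hike recovery flows"},
]
corpus = model.encode([a["title"] for a in articles], convert_to_tensor=True)

# One query searches every brand at once and merges the results by score.
query = model.encode("weekend outdoor trip with kids", convert_to_tensor=True)
for hit in util.semantic_search(query, corpus, top_k=3)[0]:
    article = articles[hit["corpus_id"]]
    print(f"{hit['score']:.2f}  {article['brand']}: {article['title']}")
```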
One aspect that sets Outside Inc.’s model apart is their customer base, which differs from general news media customers. Outside’s customers engage in a different type of interaction, not just a quick transactional skim of a news story. “We have a bit of a different relationship in that they’re not only getting inspiration from us, which trip should I take? What gear should I buy? But then because of our portfolio, they’re kind of looking at what’s next,” Thurston said.
It was important to Thurston to use the LLM in a number of different ways, so Outside Inc launched a local newsletter initiative with the help of AI. “On Monday mornings we do a local running, cycling and outdoor newsletter that goes to people that sign up for it, and it uses that same LLM to pick what types of routes and content for that local newsletter that we’re now delivering in 64,000 ZIP codes in the U.S.”
Thurston said they had a team working on Scout and it took about six months. “Luckily, we had already built a lot of infrastructure in preparation for this in terms of how we were going to leverage our data. Even for something like traditional search, we were building a backend so that we could do that across the board. But this is obviously a much more complicated model that allows us to do it in a completely new way,” he said.
Connecting AI search to a real subscriber need
In late March, The Financial Times released its first generative AI feature for subscribers called Ask FT. Like Scout, the chat-based search tool allows users to ask any question and receive a response using FT content published over the last two decades. The feature is currently available to approximately 500 FT Professional subscribers. It is powered by the FT’s own internal search capabilities, combined with a third-party LLM.
The tool is designed to help users understand complicated issues or topics, like Ireland’s offshore energy policy, rather than just searching for specific information. Ask FT searches through Financial Times (FT) content, generates a summary and cites the sources.
“It works particularly well for people who are trying to understand quite complex issues that might have been going on over time or have lots of different elements,” explained Lindsey Jayne, the chief product officer of the Financial Times.
Jayne explained that they spend a lot of time understanding why people choose the FT and how they use it. People read the FT to understand the world around them, to have a deep background knowledge of emerging events and affairs. “With any kind of technology, it’s always important to look at how technology is evolving to see what it can do. But I think it’s really important to connect that back to a real need that your customers have, something they’re trying to get done. Otherwise it’s just tech for the sake of tech and people might play with it, but not stick with it,” she said.
Trusted sources and GenAI attribution
Solutions like those from Dow Jones, FT and Outside Inc. highlight the power of a trusted brand to build deep, authentic audience relationships grounded in reliability and credibility. Trusted media brands are considered authoritative because their content is based on credible sources and facts, which ensures accuracy.
Generative AI has so far demonstrated low accuracy and poses challenges for sourcing and attribution. Attribution is therefore a central feature for digital media companies rolling out their own generative AI solutions. For Dow Jones’ compliance customers, attribution is critical: they need to know whether a decision rests on information that is available in the media, according to Lange.
“They need to have that attributed to within the solution so that if it’s flowing into their audit trails or they have to present that in a court of law, or if they would need to present it to our internal audit, the attribution is really key. (Attribution) is going to be critical for a lot of the solutions that will come to market,” he said. “The attribution has to be there in order to rely on it for a compliance use case or really any other use case. You really need to know where that fact or that piece of information or data actually came from and be able to source it back to the underlying article.”
The Financial Times’ generative AI tool also offers attribution to FT articles in all of its answers. Ask FT pulls together lots of different source material, generates an answer, and attributes it to various FT articles. “What we ask the large language model to do is to read those segments of the articles and to turn them into a summary that explains the things you need to know and then to also cite them so that you have the opportunity to check it,” Jayne said.
They also ask the FT model to infer from people’s questions what time period it should search. “Maybe you’re really interested in what’s happened in the last year or so, and we also get the model to reread the answer, reread all of the segments and check that, as kind of a guard against hallucination. You can never get rid of hallucination totally, but you can do lots to mitigate it.”
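The FT hasn’t published its implementation, but the retrieve, cite, and re-read pattern Jayne describes can be sketched along these lines. The model name, prompts, and function are hypothetical; retrieval of the article segments themselves is assumed to happen upstream.

```python
from openai import OpenAI

client = OpenAI()  # model and prompts are illustrative, not the FT's

def answer_with_citations(question: str, segments: list[str]) -> dict:
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(segments, 1))

    # First pass: summarize only from the retrieved segments, citing [n].
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Answer the question using ONLY these article segments, "
                f"citing them as [n]:\n{numbered}\n\nQuestion: {question}"
            ),
        }],
    ).choices[0].message.content

    # Second pass: reread the draft against the segments, a hallucination guard.
    check = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Segments:\n{numbered}\n\nDraft answer:\n{draft}\n\n"
                "List any claim in the draft not supported by the segments."
            ),
        }],
    ).choices[0].message.content

    return {"answer": draft, "verification": check}
```

As Jayne notes, a second pass like this mitigates hallucination rather than eliminating it, which is why the citations matter: they give the reader the opportunity to check.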
The Financial Times is also asking for feedback from the subscribers using the tool. “We’re literally reading all of the feedback to help understand what kinds of questions work, where it falls down, where it doesn’t, and who’s using it, why and when.”
Leaning into media strengths and adding a superpower
Generative AI seems to have created unlimited opportunities alongside considerable challenges, questions and concerns. However, it is clear that many media companies possess a deep reservoir of quality content, and it is good for business to extract the most value from the investment in its creation. Leveraging their own content to train and shape generative AI tools that serve readers seems like a very promising application.
In fact, generative AI can give trustworthy sources a bit of a superpower. Jayne of the FT offered the example of scientists using the technology to find patterns across hundreds of thousands of research papers – reading that would otherwise take years – in an effort to make important connections.
While scraped-content LLMs pose risks to authenticity, accuracy and attribution, proprietary learning models offer a promising alternative.
As Jayne put it, the media has “an opportunity to harness what AI could mean for the user experience, what it could mean for journalism, in a way that’s very thoughtful, very clear and in line with our values and principles.” At the same time, she cautions against “getting overly excited because it’s not the answer to everything – even though we can’t escape the buzz at the moment.”
We are seeing many efforts bump up against the limits of what generative AI is able to do right now. However, media companies can avoid some of generative AI’s current pitfalls by employing the technology’s powerful language prediction, data processing and summarization capabilities while leaning into their own strengths of authenticity and accuracy.