
Digital Content Next


Let’s not allow AI to be another data disaster

April 11, 2024 | By Chris Pedigo, SVP Government Affairs – DCN @Pedigo_Chris

The environment for collecting and using data on the web has often been compared to the wild west – a place with no rules and where only the strong (and often morally-questionable) survive. Unfortunately, generative AI technology is developing in a similar vacuum of governance and ethical leadership.

Since the early days of the Internet, hundreds if not thousands of venture-backed companies have competed to scoop up as much data as possible about consumers. They would then try to spin those datasets into a compelling product or service, usually built around a model of data-driven advertising. Nowadays, Meta and Google are the most frequently cited aggressive data collectors, though arguably that's because they killed off the competition and strong-armed their way into a dominant market position.

Google’s parent company, Alphabet, collects massive amounts of data from Android devices, Chrome, and Google services and apps (Search, Maps, Gmail, etc.). It has even delayed killing off third-party cookies in Chrome (the last major browser to do so) because it hasn’t developed a good way to maintain its dominant position as a collector of consumer data.

Data vacuum meets governance vacuum

Meta set about hoovering up so much consumer data, directly and indirectly, that it failed to put controls in place around who could collect it or the purposes for which it could be used (see Cambridge Analytica). And lest we think this behavior is a thing of the past, another cringeworthy example recently came to light when court documents were unsealed: Meta was reportedly using Onavo (a VPN it purchased in 2013) as a trojan horse to gather valuable analytics data on Snapchat, Amazon, and YouTube. Meta is now being sued for violating wiretapping laws.

While regulators and legislative bodies are working to clean up the debris left in the aftermath of the wild west data industry, the race to compete in the generative AI market might take data collection to a whole new level, likely with unforeseen and potentially catastrophic results.

Large language models (LLMs) need data to get better – lots of it. The hockey-stick progress we’ve seen in the last 18 months among generative AI systems is almost entirely attributable to the massive increase in the size of the datasets on which LLMs are trained. The New York Times recently reported on the red-hot competition among AI companies to find new training data, with companies scraping any and all content they can get their hands on. And this is taking place with no regard for copyright law, terms of use, or consumer privacy laws (and without any respect for consumers’ reasonable expectations of privacy).

That said, as The New York Times’ article also notes, AI systems may exhaust all available training data by 2026. In the absence of high-quality original data, they might even turn to synthetic data – data created by AI systems – for training. Who knows what consequences that could yield?

Sure, there are some existing safeguards that could help set a more responsible course forward. AI companies have been confronted with numerous legal challenges to their unfettered data collection, and they face a number of lawsuits over copyright infringement as well. However, these suits could take years to fully play out, given that the AI companies are well-funded and would likely appeal any setbacks in court.

There are privacy laws on the books that likely apply to data collection by AI companies. But those laws exist in only a handful of states, and it’s not clear exactly how they apply since AI companies won’t disclose what and whose data they use for training.

Against this bleak backdrop, there have been some promising recent developments around generative AI governance in Congress. This week, a new bipartisan consumer privacy bill was unveiled. While the bill raises some serious concerns and questions, at least the issue is front and center. At the same time, Members of Congress from both parties appear to be actively and constructively wrestling with how best to regulate the emerging AI industry. In fact, nearly every AI bill that has been introduced is bipartisan.

As the wild west of data collection gets even wilder, it’s clear we need basic rules for AI systems and stronger protections for consumers. Without this, we are likely doomed to repeat the mistakes of the previous data collection bonanza – possibly with far more severe consequences.