Fairly AI Submission to the United Nations' Call for Papers: Data Pollution in the Age of Generative AI

September 30, 2023

Ahead of the first meeting of the United Nations multistakeholder advisory body on AI, Fairly AI responded to the call for papers. We chose to write about key issues in global AI governance, specifically data pollution and its short- to medium-term effects.

We live in an era where AI systems can generate content ranging from written text to speech and even video. To many people, that possibility opens up new avenues for artistic expression. However, as with many digital tools, the possibility for abuse looms large. The key question we ask in this piece is what would happen if irrelevant, trite, or even malicious AI-generated content, which we term ‘data pollution’, floods the internet. What would be the effect on consumers, companies, and the infrastructure that runs the internet, such as search engines? The aim of this piece is to analyze the immediate effects of mass-produced AI-generated content filling the internet. We then examine some of the challenges associated with identifying this content. Finally, we look at some of the problems search engines will face when encountering this content.

What is Data Pollution?

To reiterate: by data pollution, we mean the proliferation of low-quality or malicious AI-generated content across the internet at scale. Data pollution may take the form of fake news, deep fakes, mediocre “art”, bot-driven forum posts, fake reviews, and the like. One may ask: what really separates data pollution from existing forms of low-quality or malicious content like trite memes or fake Amazon reviews? In a word: scale.

Generative AI models are now able to produce multimedia content at scale while keeping it highly customizable. This combination of scale, automation, and content customization may lead to a situation where AI-generated content overshadows genuine human efforts to contribute meaningful content to the internet. Since each piece of AI-generated content is just different enough from other similar pieces, detection poses a significant challenge in counteracting data pollution. The key point here is not the current efforts at watermarking or detecting AI-generated content, but rather the amount of content that AI systems will generate in the interim. Even with the release of a robust AI content detector, there will likely be an ensuing cat-and-mouse game of AI content detection and detector evasion.

Data Pollution is the Cause of “Platform Decay”

As we see more and more low-quality or even malicious AI-generated content, the question remains: how do we conceptualize this issue in the medium to long term? The Electronic Frontier Foundation (EFF) refers to a related phenomenon as “platform decay”, where “platforms degrade what made users choose the platform in the first place, making the deal worse for them in order to attract business customers. So instead of showing you the things you asked for, your time and attention are sold to businesses by platforms.” In an illustrative example, the EFF notes:

Platforms follow a predictable lifecycle: first, they offer their end-users a good deal. Early Facebook users got a feed consisting solely of updates from the people they cared about, and promises of privacy. Early Google searchers got result screens filled with Google’s best guess at what they were searching for, not ads. Amazon once made it easy to find the product you were looking for, without making you wade through five screens’ worth of “sponsored” results.

Data pollution, therefore, is a cause of platform decay. As generative AI systems produce more content, whether or not the platforms hosting such content approve of it, much of that content will steer internet platforms towards platform decay. There is, however, one caveat: when the EFF described platform decay, it described the platforms as agents of their own decay. In this case, platforms (at least initially) exercise little control over the process of decay, primarily because detecting AI-generated content is a challenge and generating it can now happen at a massive scale. As a result, platforms are objects of an externally imposed decay rather than architects of it.

How will Data Pollution Affect Consumers?

This issue comes to the fore when examining how consumers will use the internet in the years to come. The CEO and Vice Chairman of Christian Dior SE, commenting on fake news and transparency, noted:

More than 75% of their [that is to say luxury clients’] purchases are made following research on the internet. Here, our clients find extremely diverse types of information, comments from other clients, press articles, blogs by fashion influencers, and so on and so on. The more digitally connected our clients, the more perfect our products must be, because the slightest defect is immediately publicized, and the more comprehensive must be the information about them. For the price they pay, our clients want to be sure they are acquiring a product, a garment, that will live up to their high standards. Not only because it's beautiful and well made, but also because it's produced in optimum conditions of social and environmental responsibility, and marketed in optimum conditions of transparency. As far as possible, it's our duty to respond. [emphasis added]

From this quote, we can see that the large majority of purchases from major luxury brands follow research on the internet, similar to how other consumers rely on Amazon reviews for less expensive products. In essence, customers at every price point rely on the internet to make purchase decisions. The question, then, is how AI-generated content will affect them.

One concept that helps analyze the impact of generative AI on areas of the internet such as comments, blogs, and increasingly audio and video is the ‘nudge’. Nobel Laureate and University of Chicago economist Richard H. Thaler and Harvard Law Professor Cass R. Sunstein define a ‘nudge’ in their book “Nudge: Improving Decisions about Health, Wealth, and Happiness” as:

… any aspect of the choice architecture that alters people's behavior in a predictable way without forbidding any options or significantly changing their economic incentives. To count as a mere nudge, the intervention must be easy and cheap to avoid. Nudges are not mandates. Putting fruit at eye level counts as a nudge. Banning junk food does not.

Case Study: Products and Product Reviews

The rise of generative AI in content development brings the potential for misinformation in high-risk domains. Recently, the New York Mycological Society warned in a tweet that: “@Amazon and other retail outlets have been inundated with AI foraging and identification books. Please only buy books of known authors and foragers, it can literally mean life or death.” When it comes to highly specialized domains that require subject matter expertise, AI-generated content poses a significant misinformation risk to consumers. Extending this further, as authors incorporate AI-generated content into their own works, there may be potential for publisher liability for conveying misinformation in a high-risk domain.

From another angle, AI-generated content extends from the products themselves to the reviews given by users. Nuanced AI-generated reviews of Amazon products, produced at scale, could amount to a mass nudging of consumer behavior. A bad actor who wishes to manipulate consumers might use a collection of subtle ‘nudges’ to slowly shift opinion on a given product. The issue, again, lies in identifying the source of a nudge and in how well AI-generated content detection algorithms work. A single AI-generated review is easy and cheap to avoid, but that fact alone does not lessen the effect that many such reviews can have on consumers’ minds. When written convincingly, perhaps even with AI-generated images of (fake) “product failure”, synthetic reviews could pose major issues for companies that try to elicit honest feedback. This is yet another case of a decreasing signal-to-noise ratio when trying to navigate an internet rife with harmful or misleading AI-generated content.
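The effect of scale on a product's rating can be made concrete with simple arithmetic. The sketch below is illustrative only: the ratings, the synthetic rating of 3 stars, and the function name `rating_after_injection` are hypothetical and are not drawn from any real platform or dataset. It shows how a product's average rating and the share of genuine reviews (a rough stand-in for signal-to-noise) change as mildly negative synthetic reviews are added.

```python
from statistics import mean


def rating_after_injection(genuine, synthetic_rating, n_synthetic):
    """Return (average rating, share of genuine reviews) after adding
    n_synthetic reviews that all carry the same synthetic_rating."""
    all_ratings = list(genuine) + [synthetic_rating] * n_synthetic
    genuine_share = len(genuine) / len(all_ratings)
    return mean(all_ratings), genuine_share


# Hypothetical data: ten genuine reviews averaging 4.5 stars.
genuine_reviews = [5, 5, 4, 5, 4, 5, 4, 4, 5, 4]

for n in (0, 10, 40):
    avg, share = rating_after_injection(genuine_reviews, 3, n)
    print(f"{n:>3} synthetic reviews: average {avg:.2f}, genuine share {share:.0%}")
```

Under these assumed numbers, no single 3-star review looks suspicious, yet 40 of them pull the average from 4.50 to 3.30 and leave genuine reviewers as only 20% of the total.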

Requiring all reviewers to verify their accounts with government-issued credentials might seem like a solution, but this approach would simply mean that a human rather than a bot would be used to convey the AI-generated content. As AI and human-generated outputs become more similar, it may become more difficult to detect AI-generated content, particularly content produced by models that do not watermark their outputs. As a result, platforms may inadvertently flag genuine users who produce original content. To extend the prior point further, the split between ‘official’ and ‘unofficial’ models may continue into the future. By ‘official’ we mean AI models, open or closed source, that aim to adhere to some kind of content limitation and safety testing framework, such as those built by OpenAI, Meta, or Cohere. By ‘unofficial’ we mean models that aim for fully uncensored outputs, such as Unstable Diffusion. It is these latter models that may skirt the boundaries adhered to by mainstream model developers when it comes to limiting the harms of generated content. As a result, there is likely to be a segment of AI development that happens ‘underground’ and actively produces models that can contribute to data pollution.

Furthermore, having platforms institute a three-strikes policy for reporting unauthorized AI-generated content, similar to how platforms like YouTube use copyright strikes, may bring an additional issue into the fray: malicious use of a reporting feature to stifle user-generated content. For instance, some companies might misuse a ‘report as AI-generated’ feature to censor critical comments about their products. This is a form of what is known as the “liar’s dividend”, which researcher Kaylyn Jackson Schiff and her team noted “works through two theoretical channels: by invoking informational uncertainty or by encouraging oppositional rallying of core supporters.” The AI liar’s dividend has already manifested in the political arena in India, where a politician claimed that controversial audio clips ascribed to him were deepfakes. The misuse of reporting features again contributes to the theme of decreased signal-to-noise ratios: in this case, malicious reports form the ‘noise’ that prevents genuine feedback (‘the signal’) from being heard.

Medium-Term Effects of Data Pollution

At this point, we have described the immediate effects of data pollution and the platform decay it drives. The next area of discussion is what data pollution will do to existing platforms in the medium term. One technology that comes to mind is the search engine. At a fundamental level, search engines rely on indices built from the pages they process. Google, for instance, groups its search engine operation into three parts (crawling, indexing, and serving search results) and explains its process briefly:

When a user enters a query, our machines search the index for matching pages and return the results we believe are the highest quality and most relevant to the user's query. Relevancy is determined by hundreds of factors, which could include information such as the user's location, language, and device (desktop or phone). For example, searching for "bicycle repair shops" would show different results to a user in Paris than it would to a user in Hong Kong.

As AI-generated content proliferates across the internet, the growing volume of web pages, comments, articles, images, and video would require more crawling, indexing, and retrieval infrastructure on the part of search engine operators. In the interim, the increased load due to AI-generated content may put a strain on search engine systems.
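To illustrate why more pages translate into more indexing and retrieval work, here is a toy sketch of an inverted index, a common textbook structure for search. It is not a description of how Google or any production search engine works. The functions `build_index` and `search`, the page ids, and the page texts are all hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Two hypothetical human-written pages.
pages = {
    "shop-paris": "bicycle repair shop in Paris",
    "blog": "how to repair a bicycle chain",
}

# The same corpus after 100 hypothetical generated pages are added.
polluted = dict(pages)
for i in range(100):
    polluted[f"generated-{i}"] = f"best bicycle repair tips number {i}"

clean_hits = search(build_index(pages), "bicycle repair")
polluted_hits = search(build_index(polluted), "bicycle repair")
print(len(clean_hits), len(polluted_hits))
```

In this sketch the two original pages still match the query, but they now sit among 102 candidates. Every added page has to be processed at indexing time and then ranked or filtered at serving time, which is the extra load described above.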

From a usability standpoint, the current phenomenon of search engine poisoning, where sponsored search results that contain malware masquerade as popular open-source software, suggests that search engines have normalized hosting irrelevant and even harmful content on their platforms. If large volumes of AI-generated content are added to that mix, it is possible that users will grow frustrated with the lack of relevant results and reduce their search engine use altogether. Conversely, as automated AI systems that generate content increasingly ‘plug into’ the internet themselves to find content, they may fall prey to irrelevant or even harmful content captured by search engines.

In summary, generative AI is able to synthesize nuanced and convincing content at an industrial scale. The sheer volume of that content, coupled with the challenges of detecting it, means that AI-generated content may eclipse genuine user-generated content, thus ‘polluting’ the internet with a deluge of trite, irrelevant, or even harmful material. Furthermore, the infrastructure that many rely on, such as search engines and social media sites, may not be able to cope with the new flood of AI-generated content or to sift it apart from human-generated content.

Looking ahead, a key part of any solution to data pollution is being able to identify it accurately, and there is admittedly still a significant technological challenge ahead. In the interim, the skepticism that AI-generated content breeds may lead to a shift in internet behavior where users seek out content from actual individuals with credibility, expertise, and talent rather than pseudonymous accounts. Making it easier for users to find these ‘verified’ content producers may be a solution that is practical to implement.

