The LLM Police: Between Firewalls and Policies for Generative AI

July 10, 2023

Generative AI startups seem to be springing up every week, with over $1.6B in venture capital invested in Q1 2023 alone. And yet some organizations promise AI governance without policy. On one hand, some startups paint a picture of a world where sliders and checkboxes are all you need to assuage your fears of an AI going off the rails. On the other, some startups claim that you need technology to fight technology because “policies don’t work.” The story is far more nuanced than that.

Policy is more than a set of rules and expectations drafted by management. Policy forms a vision for how an organization manages the flow of its inputs and outputs in order to achieve its mission. Policy doesn’t have to be a checklist. It can be a dynamic system that uses quantitative metrics and qualitative analyses. One example in a cybersecurity context is pairing an internal controls questionnaire (qualitative) to ensure secure operations with adversarial testing of an organization’s model (quantitative).
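As an illustrative sketch only, the pairing above could be operationalized as a simple policy gate that requires both a qualitative and a quantitative bar to be met. All function names, question keys, and thresholds here are hypothetical, not part of any real framework:

```python
# Hypothetical sketch of a policy gate combining a qualitative
# controls questionnaire with a quantitative adversarial-test result.
# All names and thresholds are illustrative.

def questionnaire_score(answers: dict) -> float:
    """Fraction of internal-controls questions answered 'yes' (qualitative)."""
    return sum(answers.values()) / len(answers)

def adversarial_pass_rate(results: list) -> float:
    """Fraction of adversarial test prompts the model handled safely (quantitative)."""
    return sum(results) / len(results)

def policy_gate(answers: dict, results: list,
                min_controls: float = 0.9, min_robustness: float = 0.95) -> bool:
    """Pass only if both the qualitative and quantitative bars are met."""
    return (questionnaire_score(answers) >= min_controls
            and adversarial_pass_rate(results) >= min_robustness)

answers = {"access_controls": True, "audit_logging": True, "incident_plan": False}
results = [True] * 19 + [False]        # 95% of adversarial prompts handled safely
print(policy_gate(answers, results))   # fails: controls score is only ~0.67
```

The point of the sketch is that neither signal alone suffices: a model can ace its adversarial tests while the organization around it lacks basic controls, and vice versa.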

For a practical example, consider the case of a data leak. An employee might use a popular language model to summarize company meeting minutes. Confidential data stored in the organization then flows out in an unauthorized manner, and may be exposed to the world if the model is trained on the leaked information. This is exactly what happened at Samsung in 2023.

While a firewall might stop some sensitive data from being leaked, policy acts as a blueprint for how employees work with AI. That might mean leadership expresses a preference for an open-source language model fine-tuned for internal use, or for a privacy layer on top of existing commercial language models. Firewalls work fine as safety nets, but policies drive the organization forward as playbooks. Picking a firewall, choosing what it should detect, and managing it throughout its operational lifecycle all fall within the ambit of policy.
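To make the “privacy layer” idea concrete, here is a minimal sketch of one: redacting likely-sensitive tokens from a prompt before it ever leaves the organization. The patterns are illustrative placeholders, not an exhaustive data-loss-prevention solution:

```python
import re

# Hypothetical privacy-layer sketch: scrub sensitive tokens from a
# prompt before sending it to an external language model.
# These two patterns are illustrative only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(prompt: str) -> str:
    """Replace each match of each pattern with a placeholder tag."""
    for tag, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{tag}]", prompt)
    return prompt

text = "Summarize the minutes; contact jane.doe@example.com, SSN 123-45-6789."
print(redact(text))
# Summarize the minutes; contact [EMAIL], SSN [SSN].
```

Even this toy layer illustrates the policy questions that precede the technology: which patterns to detect, who maintains the list, and what happens when redaction blocks legitimate work.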

Human-in-the-Loop Implies Policy-in-the-Loop

Just what is policy? One definition provided by the Canadian government describes it as “a set of statements or principles, values and intent that outlines expectations and provides a basis for consistent decision-making and resource allocation in respect to a specific issue.” While AI regulations such as the EU AI Act are still on the horizon, forward-thinking organizations have already developed policies to ensure their safe and secure adoption of AI.

When organizations build, deploy, and use AI systems, people start to ask questions regarding the operational, business, and ethical risks of AI. The sorts of questions people ask include:

  • “Should I do this?”,
  • “How should I do this?”,
  • “Who is responsible for this?”, and
  • “What does this mean?”

People ask these questions when there is no policy to follow or the existing policy is unclear. Once an organization formalizes its answers to questions such as these, it has built an internal policy.

One might wonder: how does this connect to having a human in the loop? Simply put, a human-in-the-loop is a decision-maker, and decision-makers need decision-making criteria. An organization wants those decisions to be clear, conspicuous, and consistent. If a decision is available to all (conspicuous) but full of jargon and legalese (unclear), it is difficult to follow. If a policy is clear but inconsistently applied, it will simply be disregarded. A policy therefore acts as a clear, conspicuous, and consistent set of criteria for decision-makers to do their job, and is thus part of having a human in the loop.

In a recent Achieving Safe & Compliant AI panel discussion, Dr. Jon Hill, Professor of Model Risk at NYU, highlighted why models cannot be blindly trusted. He recalled a story from October 1960, when a U.S. early warning system based out of Greenland “indicat[ed] that a massive Soviet nuclear attack on the U.S. was underway—with a certainty of 99.9 percent.” The humans in the loop, however, didn’t trust the system without verification. Dr. Hill noted that then Soviet Premier Nikita Khrushchev was in New York slamming his shoe at the U.N. General Assembly, and that the detected size of the attack far exceeded Soviet capabilities. Because the operators understood the surrounding context, they knew to disregard the ‘certainty’ percentages their system reported. Generative AI models such as ChatGPT parallel this example today: despite their internal ‘high confidence’ in an output, they may still hallucinate because they lack a contextual awareness of their surrounding circumstances.

Therefore, if we ignore policy, we essentially ignore the people in an organization who need to work with AI systems. By using AI technology without policy, we ignore the context in which technology sits. Advocating for human-in-the-loop processes that are built atop clear policies does not mean we oppose automating parts of AI policy. Rather, we propose a complementary approach where quantitative and qualitative metrics work hand in hand to identify and mitigate AI risk.

Using Technology to Monitor Technology Implies Alignment

When looking at a complementary approach that examines both the humans in an organization and the technology they use, it is important to highlight how technology can help govern other technology. Technology has a role to play in building guardrails for AI. It makes sense that it would: AI enables diverse capabilities such as content generation and decision-making at scale, necessitating monitoring systems that operate at scale as well. When using AI systems we often hear about alignment, and to quote the Fairly Large Language Model Policy eBook:

“A number of definitions for AI alignment exist[,] but they broadly center on directing AI operation in accordance with human values. An issue [arises] when a chosen definition for alignment is too vague to be implemented. As a result, organizations may consider developing a policy for how its stated values would translate into alignment goals and in turn how those goals can be achieved at an operational level.”

If we have to consider alignment when building and using AI systems, it stands to reason that we would have to consider alignment when monitoring technology as well. 

Alignment Implies Policy

People using and building AI have different goals. Organizations have different goals. Even nation states have different goals. This means that when building systems to govern AI, goals and therefore alignment criteria will differ as well. And in order to specify those criteria, you need people who can spot the ambiguities that some of those goals contain. Once we’ve clarified what’s ambiguous, we formalize the answers to our initial set of four questions to build policy. Thus, alignment implies and informs policy.

It might be argued that alignment can be achieved without policy, for example through reinforcement learning with human feedback (RLHF) or AI feedback (constitutional AI). In our LLM Policy eBook, we noted that:

“Reinforcement learning with human feedback (RLHF) is a popular choice for fine-tuning models to produce favourable outputs. However, RLHF carries with it risks as well. The preferences exhibited for certain outputs over others can represent the biases carried by the human fine-tuners. One way to mitigate this is by having a diverse set of fine-tuners.

One challenge that arises in the context of fine-tuning models is divergent behaviors such as ‘situationally-aware reward hacking’ where:

  1. Model goals are broad in scope,
  2. Models draw spurious correlations between reward signals and the cause of those signals,
  3. Consistent reward misspecifications lead to positive feedback loops for seeking to achieve those goals, and
  4. This leads to strange and repeated model behaviour, or
  5. Models seek power and pursue outcomes like avoiding shutdown, convincing others to serve its own goals, or attempting to gain resources or influence.

Another approach altogether is to use constitutional AI where “oversight is provided through a list of rules or principles” to inform a model that then engages with harmful outputs itself; “reinforcement learning with AI feedback” in essence. One issue with this approach is that human feedback brings an experience of the world outside of textual data coupled with the emotions that such experiences elicit. Neither of these can be replicated with an AI fine-tuner even if it included multimodality.”
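The constitutional oversight step described above can be caricatured in a few lines. Here the “critic” is a trivial keyword rule standing in for a model-based critique; the principle names and the `revise` placeholder are hypothetical, chosen only to show the shape of the loop:

```python
# Caricature of constitutional AI's oversight step: a list of principles
# is applied to a draft output, and flagged drafts are sent for revision.
# The rules and names below are illustrative stand-ins for model critiques.

PRINCIPLES = [
    ("no_personal_attacks", lambda text: "idiot" not in text.lower()),
    ("no_threats",          lambda text: "or else" not in text.lower()),
]

def critique(draft: str) -> list:
    """Return the names of principles the draft violates."""
    return [name for name, ok in PRINCIPLES if not ok(draft)]

def revise(draft: str) -> str:
    """Placeholder for asking the model to rewrite a flagged draft."""
    return "[revised to comply with: " + ", ".join(critique(draft)) + "]"

draft = "Do it now, or else."
violations = critique(draft)
print(violations)                          # ['no_threats']
output = revise(draft) if violations else draft
```

Even in this toy form, the open question is visible: someone has to write the principle list, which is itself an act of policy.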

And so we arrive back at the same issue we found in the early nuclear strike detection system example: models operate in a context that they may not be aware of. As a result, specifying criteria for alignment in using an automated system implies a need for policy just as having a human-in-the-loop does for decision-making.

Concluding Remarks

When considering the role of policies in an organization, the key takeaways from this piece are:

  • Since humans in the loop are decision makers who use decision-making criteria, policy forms a core part of having a human in the loop.
  • Using technology implies having a goal in mind for its use (alignment).
  • Having a vision of aligned technology requires clarifying what an organization’s goals are and therefore requires policy.

Building policy is not a one-person job. It means having conversations with stakeholders inside and outside the organization while engaging with a multidisciplinary team to identify gaps and clearly state expectations. As an AI-driven organization scales, its policies will become more complex. As a result, solutions that can automate AI governance will act as a pressure-release valve to free teams from heaps of compliance paperwork and enable them to operationalize AI safely and securely within their organization.
