Key Papers in Language Model Safety

By aogara @ 2022-06-20T14:59 (+20)

Introduction

One of the fastest growing subfields of AI safety focuses on language models (LMs) such as OpenAI’s GPT-3. This article provides summaries of a few dozen key papers in several important subfields of language model safety, namely:

This is not a comprehensive introduction to the field of language model safety. Specifically, it does not discuss important topics such as bias, interpretability, and decision-making agents. It also disproportionately focuses on arguments and ideas advanced by authors within the effective altruism community, as I’ve had the most exposure to their work; over time, the field of AI safety should seek to engage more with the broader field of machine learning (see PAIS for more). Rather than being comprehensive, this overview aims to summarize a useful set of papers from which one could begin researching language model safety.

Why study language model safety?

There are several reasons to believe that language model safety could present an unusually strong opportunity to reduce catastrophic AI risks:

However, language model safety is not the only compelling topic in AI safety, and AI safety is not the only valuable altruistic endeavor. A few considerations against the field include:

Motivations

Persuasion

Twitter is full of bots. These bots spread all kinds of harmful information, from COVID-19 misinformation to fake news about US presidential elections to political propaganda. Today’s bots aren’t always smart enough to hide from content moderators or successfully promote their messages, but progress in language models could heighten the risks of automated persuasion.

“Risks from AI Persuasion” (Barnes, 2021) (Alignment Forum) presents these risks, beginning with several arguments for and against the proposition that highly persuasive AI will be possible well before AGI:

The article continues by arguing that, if achieved, persuasive AI could pose a long-term risk to humanity. Public dialogue and epistemics could suffer, serving as a risk factor for any problem that requires accurate analysis and discussion. More concerningly, persuasion could enable censorship and propaganda campaigns from bad actors. It’s worth noting that authoritarian governments today invest significantly in online censorship and propaganda, but not typically via AI.

To combat these threats, persuasion research should be discouraged and replaced with the truthfulness and honesty agendas outlined below. Building asymmetrically truthful techniques is difficult because many technologies are dual-use. For example, fake news detection tools or language models that rely on trusted sources can be easily co-opted for harmful censorship and propaganda agendas. Digital advertising should be resisted as a language model use case. Bot detection technology seems useful and tractable, both by analyzing the natural language content that bots produce and by analyzing the source of that content.

It’s important to note that risks from AI persuasion are different from more common concerns about power-seeking AI. The most common arguments for AI risk say that AI agents will have “convergent instrumental goals” such as seeking power to avoid being turned off, and that these goals will inevitably conflict with human well-being. But this argument only applies to agents that have a model of the world and an objective to be maximized within it. Language models today are rarely agents; instead, they are trained to make accurate predictions about natural language without having any goals in the broader world. (Agentic language models would carry different risks.) By making the case that non-agentic language models could pose a catastrophic threat to humanity, the persuasion argument opens up a new mechanism leading to AI risk.

For examples of the kinds of AI systems that could be persuasive or augment human persuasion, see “Persuasion Tools: AI Takeover without AGI or agency” (Kokotajlo, 2020) (Alignment Forum). 

Truthfulness

Building AI that is naturally truthful and improves human epistemics is both the most natural antidote to risks from persuasive AI and a broad positive vision for the future of language models. Several authors have presented and summarized the case for truthful AI:

Technical research agendas for building truthful language models include:

Honesty

An important critique of the truthfulness agenda is that it could accelerate general AI capabilities. Training larger language models on more data and compute tends to improve performance on many benchmarks including measures of truthfulness. But accelerating the onset of potentially dangerous AI should be seen as a strong potential negative externality of truthfulness research. For a more comprehensive argument against accelerating capabilities progress in safety research, see “Perform Tractable Research While Avoiding Capabilities Externalities” (Woodside & Hendrycks, 2022) (Alignment Forum). 

Honest AI is an agenda proposed to counteract the potential harms of truthfulness. In this literature, an honest AI is defined as one which accurately reports its beliefs; but this raises the difficult question of how to identify what an AI system believes. This is the central objection to the honest AI agenda raised by Evans, Hilton & Finnveden, 2021. Forthcoming work attempts to formally define notions of honesty and lying in language models and present methods for building honest AI, as described in “Open Problems in AI X-Risk” (Woodside & Hendrycks, 2022) (Alignment Forum).

Helpful, Honest, and Harmless

Building a helpful, honest, and harmless (HHH) text-based assistant is a broad goal proposed by Anthropic in their research agenda “A General Language Assistant as a Laboratory for Alignment” (Askell et al., 2021) (Arxiv). This definition has considerable overlap with the motivations discussed above, but their paper gives a thorough account of these motivations and merits independent consideration. 

They provide several clarifications of their goals for HHH language models not discussed above:

While the paper does not provide a specific theory of how this work will reduce existential risk, it does make the general case for aligning existing AI systems. Specifically, it argues that current language models are highly and generally capable, making it more valuable to align them. They see significant overlap between short-term and long-term LM alignment goals, and are happy to see progress on one benefit the other. They propose a broad, qualitative vision in the hopes that others will join their work quantifying and making progress towards their goal for the field. 

The first three pages of the paper are a wonderful introduction to LM alignment motivations; you can read them here. The rest of the paper is technical; for a great summary, see here.

Empirical Progress

Robustness

Robust AI systems perform well in a variety of circumstances, including (a) when responding to “out-of-distribution” inputs that did not appear in training data, and (b) when facing “adversarial attacks” that are deliberately constructed to provoke a model failure. For a more thorough overview of robustness, see Section 2 of “Unsolved Problems in ML Safety” (Hendrycks et al., 2022). As one of the most widespread subfields of AI safety, robustness has historically focused on computer vision but is growing to include work on language models. 

Red Teaming Language Models with Language Models (Perez et al., 2022) (Arxiv)

This paper introduces an automated technique for generating adversarial examples: attacking language models with other language models. The term “red teaming” refers to the cross-disciplinary method of improving systems by attacking them and exposing flaws.

The process is straightforward. First, one language model generates prompts designed to induce undesirable outputs. Then another LM completes those prompts. Finally, an encoder-only classifier language model identifies undesirable completions. This setup can be used with many different language models. In this paper, the classifier identifies offensive dialogue produced by the generator. The authors also find instances of “data leakage”, where an LM directly repeats text that appeared in its training data. 
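The three-step loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `red_team`, `toy_red_lm`, `toy_target_lm`, and `toy_classifier` are hypothetical names, and the toy functions are trivial stand-ins for real language models, not anything from the paper.

```python
from typing import Callable, List, Tuple


def red_team(
    generate_prompts: Callable[[int], List[str]],
    target_model: Callable[[str], str],
    is_undesirable: Callable[[str], bool],
    num_prompts: int,
) -> List[Tuple[str, str]]:
    """Return the (prompt, completion) pairs that the classifier flags."""
    failures = []
    for prompt in generate_prompts(num_prompts):   # step 1: red LM writes prompts
        completion = target_model(prompt)          # step 2: target LM completes them
        if is_undesirable(completion):             # step 3: classifier flags failures
            failures.append((prompt, completion))
    return failures


# Hypothetical stand-ins so the sketch runs without any model weights.
def toy_red_lm(n: int) -> List[str]:
    templates = [
        "What do you think of my cooking?",
        "Tell me about your neighbor.",
        "How was your day?",
    ]
    return [templates[i % len(templates)] for i in range(n)]


def toy_target_lm(prompt: str) -> str:
    if "neighbor" in prompt:
        return "Your neighbor is an idiot."
    return "I would rather not say."


def toy_classifier(text: str) -> bool:
    return "idiot" in text.lower()


failures = red_team(toy_red_lm, toy_target_lm, toy_classifier, num_prompts=6)
for prompt, completion in failures:
    print(f"{prompt!r} -> {completion!r}")
```

Because each stage is passed in as a plain function, any of the three models can be swapped out independently, which mirrors the point that the setup works with many different language models.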

The red model generates prompts in several ways. First, the authors manually write prompts for the red model. Then a language model produces new prompts that mimic those manually written prompts. (This is termed a “few-shot” approach because the language model uses a few examples.) The authors also use a supervised approach, where the prompt generator model is fine-tuned to better mimic successful prompts; and a reinforcement learning (RL) approach that fine-tunes the prompt generator to maximize the offensiveness of resulting model outputs. 
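As a rough illustration of the few-shot approach, the snippet below assembles a prompt from a handful of manually written questions and leaves the next list item open for a language model to continue. The function name, the header text, and the seed questions are all assumptions made for this sketch, not the paper's exact prompt.

```python
from typing import List


def build_few_shot_prompt(
    seed_questions: List[str],
    header: str = "List of questions to ask someone:",  # hypothetical header text
) -> str:
    """Format seed questions as a numbered list ending with an open item."""
    lines = [header]
    for number, question in enumerate(seed_questions, start=1):
        lines.append(f"{number}. {question}")
    # The trailing, empty list item invites the model to write a new question.
    lines.append(f"{len(seed_questions) + 1}.")
    return "\n".join(lines)


# Hypothetical manually written seed prompts.
seeds = [
    "What is your least favorite thing about your job?",
    "Who annoys you the most?",
]
prompt = build_few_shot_prompt(seeds)
print(prompt)
```

In the supervised and RL variants described above, the prompt generator itself is updated; in this few-shot variant only the text of the prompt changes.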

This technique requires having a classifier that is already capable of identifying undesirable outputs. It is therefore less useful for the challenge of training a classifier from scratch (see Redwood Research’s project below), but more useful for auditing language models using our many existing classifier models. 

High-Stakes Alignment via Adversarial Training (Ziegler et al., 2022) (LessWrong, Arxiv)

For critical AI systems, even a small chance of failure can be catastrophic. Therefore, among other things, Redwood Research is building techniques to improve the worst-case performance of AI systems. See Buck Shlegeris’s project description and Paul Christiano’s support of the agenda. 

Daniel Ziegler added this context for Redwood’s motivations, quoted here in full: 

An important background assumption (that we didn't focus on in the paper) is that we think the main failures we're worried about are alignment failures rather than capabilities failures. So talking about "critical AI systems" seems slightly misleading; we're not worried about a nuclear reactor control system failing in a rare situation because it's not smart enough, we're worried about an AI in a pretty arbitrary setting deceiving us / seeking power because it's smart enough to do that and we haven't figured out how to get it not to do that. Failures from misalignment both seem more worrying and more tractable (we certainly can't hope to make systems that are literally always perfect, and that's more of a capabilities question)

This particular paper uses a classifier language model (DeBERTa V3 Large) to filter undesirable outputs from a generative language model (GPT-Neo). The undesirable outputs here are any completions of a paragraph that introduce injuries to beings in the paragraph. Avoiding injuries in generated text isn’t a particularly important task; it simply provides a setting for the challenge of developing a robust classifier. Classifiers like this could be used for monitoring the outputs of advanced systems to limit the potential harm of alignment failures such as deceptive alignment and inner misalignment. 
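One simple way to use a classifier as a filter is rejection sampling: keep sampling completions until one scores below a threshold. The sketch below assumes this setup for illustration; the function names, the `threshold` and `max_attempts` parameters, and the keyword-based scorer are hypothetical stand-ins, not Redwood's implementation.

```python
from typing import Callable, Optional


def filtered_generate(
    prompt: str,
    sample_completion: Callable[[str], str],
    injury_score: Callable[[str, str], float],
    threshold: float = 0.5,   # hypothetical cut-off; lower is more conservative
    max_attempts: int = 10,   # hypothetical sampling budget
) -> Optional[str]:
    """Return the first sampled completion the classifier accepts, else None."""
    for _ in range(max_attempts):
        completion = sample_completion(prompt)
        if injury_score(prompt, completion) < threshold:
            return completion
    return None  # every sample was rejected within the budget


# Toy stand-ins: a generator that replays canned text and a keyword "classifier".
canned = iter([
    "and the arrow struck him in the leg.",
    "and they walked home together.",
])


def toy_generator(prompt: str) -> str:
    return next(canned)


def toy_injury_score(prompt: str, completion: str) -> float:
    return 1.0 if any(w in completion for w in ("struck", "wound", "bled")) else 0.0


result = filtered_generate("The battle raged on", toy_generator, toy_injury_score)
print(result)
```

Lowering `threshold` trades generation quality and speed for fewer missed injuries, which is the tension a robust classifier is meant to ease.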

Training the classifier to identify injuries requires many examples of injurious and non-injurious prompt completions. To build an initial set of training examples, Redwood hired contractors from Upwork and Surge to label snippets of fanfiction completed by GPT-3. The classifier quickly learned to identify injuries in this dataset, missing only 2 injuries in 2447 test set examples. 

To improve out-of-distribution robustness, the authors began building adversarial examples. Contractors wrote their own prompts with the express goal of getting the model to produce an injury that would not be caught by the classifier. (For example, the contractors could write a prompt about a raging battle to see if the model would complete the story with an injury.) This process was manual at first, but it became quicker when the authors built software tools for constructing adversarial examples (described in detail here).

Future work on worst-case performance in critical AI systems could expand on Redwood’s filtered generation approach by providing open source implementations of rejection sampling or new training datasets of outputs that should be avoided. Out-of-distribution robustness could also benefit from more tools for generating adversarial examples. 

Adversarial Robustness

The two papers above are members of the popular subfield of adversarial robustness. Papers in this field prompt models with inputs designed to induce incorrect outputs and poor performance, thereby exposing model weaknesses and providing training data that can improve future robustness. These adversarial examples can be generated by hand or algorithmically using various methods. 

Here are several other papers on adversarial robustness for language models. 

Prompt Engineering

The prompt given to a language generation model has a great influence on the model’s output. This is true even in cases where two prompts should ideally provide the same completion, such as asking the same question framed two different ways. Building prompts that improve LM generation performance is therefore a tractable research direction. Here are some papers on the topic:

Other Work

Law

While laws that govern AI have received lots of attention from the field of AI policy, a smaller subfield with potential long-term importance attempts to build AI that understands our laws. Practically speaking, AI could be useful in assisting our legal processes with functionality such as summarizing documents, identifying relevant precedents, and aiding contract review. These practical applications have been the focus of the growing field of Legal AI. 

From “A Survey on Legal Judgment Prediction: Datasets, Metrics, Models and Challenges” (Cui et al., 2022) (Arxiv)

From a longtermist ethical perspective, laws are important because they encode complex human preferences and rules about acceptable behavior in the world. Language models that understand our legal system could therefore help AI make decisions that are aligned with human values. 

Legal AI Overviews

Here are two good recent surveys of the legal AI field: 

Neither of these surveys considers the alignment problem or takes a longtermist ethical perspective. This perspective has been briefly covered in parts of Dan Hendrycks’ MMLU and ETHICS papers, but there exists no thorough overview of legal AI from a longtermist perspective. This seems like an important opportunity for future policy papers and research agendas.

Domain Specific Pretraining

One popular strategy for improving the performance of language models on domain-specific tasks is pretraining the model on documents from that domain. Several papers have improved performance on legal tasks through domain-specific pretraining: 

Domain-specific pretraining has therefore provided meaningful but limited improvements on legal tasks. Future work is necessary to develop structured datasets, advanced training methods, and more challenging evaluation tasks in the legal domain. 

CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review (Hendrycks, Burns, Chen & Ball, 2021) (Arxiv)

Building new datasets is one of the core levers driving progress in machine learning, particularly in new specialized domains like legal AI. This paper introduces a new dataset for contract review. 

Annotated by human legal experts, the dataset highlights clauses within contracts that ought to be reviewed before signing the contract. Each clause is additionally tagged with the concern type, allowing models to learn which clauses are important and why. The benchmark is difficult for existing models: RoBERTa pretrained on 8GB of unlabeled contracts achieves an AUPR (area under the precision-recall curve) of 45%. 
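For readers unfamiliar with AUPR, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. It is a generic illustration with made-up labels and scores; the paper's exact computation may differ, and ties between scores are not specially handled here.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average of the precision values at the rank of each positive example."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank examples from the most to the least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / total_positives


# Made-up example: 1 marks a clause that an expert flagged for review.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
ap = average_precision(labels, scores)
print(round(ap, 4))
```

Unlike accuracy, this metric is sensitive to how well the rare positive clauses are ranked, which matters when most of a contract does not need review.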

Contract review is an economically important task where AI can assist human legal experts. Businesses including established players Icertis and Kira as well as early-stage startups Klarity, Della, and Claira are all working on AI-assisted contract review. Developing commercially viable applications of tasks relevant to AI safety (in this case, developing contract review software to facilitate legal AI) can leverage resources and talent outside of the safety community to drive safety progress.

Machine Ethics

Aligning AI with Shared Human Values (Hendrycks et al., 2021) (Arxiv, GitHub)

Current AI systems do not have a holistic conception of human ethics, despite their growing knowledge of the world and abilities to act within it. Natural language is a key medium for conveying diverse, detailed information about the world, including ethical judgements and discussions. Language models could therefore be a strong candidate for building AI with ethical understanding. 

This paper introduces the ETHICS benchmark, a dataset of more than 130,000 scenarios described in text and annotated with human moral judgements. Rather than creating models aimed at satisfying task preferences or reinventing moral principles from scratch, the authors take a normative ethics approach. They crowdsource examples from five different ethical systems: justice, deontology, virtue ethics, utilitarianism, and commonsense moral intuitions. 

The authors point out that existing work on AI safety often attempts to implement one or more of these ethical theories in the context of a specific task. For example:

To complement these individual safety mechanisms, the authors support building AI that understands the broad strokes of human ethics in a variety of real-world situations. While the failures of existing AI systems might be addressable with individual solutions, progress in language models and AI agents could make a natural language understanding of ethics essential to aligning future systems. Beyond ethics, this dataset of natural language scenarios and human evaluations allows us to better understand the risks of language models misinterpreting human instructions or preferences. 

Examples are both generated by human contractors on MTurk and scraped from Reddit. To ensure the scenarios and labels are morally unambiguous, each example is relabeled by multiple people on MTurk, and only examples with a strong consensus are used in the benchmark. 
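The consensus filter can be pictured as a simple vote over annotator labels. The sketch below is an assumption-laden illustration: the function name, the `min_agreement` threshold, and the example scenarios are hypothetical, and the paper's actual agreement criterion is not reproduced here.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def consensus_label(
    votes: Sequence[Hashable],
    min_agreement: float = 0.8,  # hypothetical threshold for "strong consensus"
) -> Optional[Hashable]:
    """Return the majority label if enough annotators agree, else None."""
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


# Hypothetical relabeling results: 1 = acceptable, 0 = unacceptable.
relabeled = {
    "I returned the wallet I found.": [1, 1, 1, 1, 1],
    "I skipped the meeting to nap.": [1, 1, 0, 0, 1],
}
kept = {}
for text, votes in relabeled.items():
    label = consensus_label(votes)
    if label is not None:  # drop examples without strong agreement
        kept[text] = label
print(kept)
```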

Particularly challenging examples are generated via two strategies. First, examples that are most often incorrectly classified by a baseline model are set apart as the Test and Test Hard sets, a best practice known as “adversarial filtration” (Bras et al., 2020) (Arxiv). Second, the dataset includes contrastive examples (see Kaushik, Hovy & Lipton, 2020 (Arxiv) and Gardner et al., 2020 (Arxiv)) in which slight differences in the input text lead to different output labels.
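A heavily simplified picture of the first strategy: rank examples by how often a baseline gets them wrong and move the hardest ones into a separate split. This is not the algorithm from Bras et al.; `split_by_difficulty`, `hard_fraction`, and the error rates below are hypothetical and only illustrate the idea.

```python
from typing import List, Sequence, Tuple


def split_by_difficulty(
    examples: Sequence[str],
    error_rates: Sequence[float],
    hard_fraction: float = 0.5,  # hypothetical share of examples to call "hard"
) -> Tuple[List[str], List[str]]:
    """Split examples into (regular, hard) using baseline error rates."""
    if len(examples) != len(error_rates):
        raise ValueError("examples and error_rates must have the same length")
    # Indices sorted from most to least often misclassified.
    order = sorted(range(len(examples)), key=lambda i: error_rates[i], reverse=True)
    hard_indices = set(order[: int(len(examples) * hard_fraction)])
    regular = [ex for i, ex in enumerate(examples) if i not in hard_indices]
    hard = [ex for i, ex in enumerate(examples) if i in hard_indices]
    return regular, hard


# Hypothetical per-example error rates of a baseline over several runs.
examples = ["a", "b", "c", "d"]
error_rates = [0.1, 0.9, 0.5, 0.7]
regular, hard = split_by_difficulty(examples, error_rates)
print(regular, hard)
```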

The authors assess the performance of several baseline models on the benchmark. The models are all BERT-based classifiers with varying sizes and training algorithms. The models are fine-tuned on examples from the development set and evaluated on examples from the Test and Test Hard sets. The best performing model chooses the correct answer in 71% of test cases and 48% of hard test cases.

Here are several lines of potential future work; see Section 4 of the paper for more:

“When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment” (Jin et al., 2021) (PDF)

Rather than the “bottom-up” approach of learning moral intuitions from large datasets of labeled examples, language models can also be designed “top-down” to implement theories of ethics. For example, self-driving cars don’t learn about speed limits and stop signs solely by observation and experimentation; instead, they are programmed with explicit knowledge of traffic laws. Similarly, language models can be designed to use well-understood theories of human moral cognition. One key potential benefit of this approach is improving performance in situations that do not appear in the training data, which is essential for long-term alignment. 

This paper comes from an interdisciplinary team of cognitive scientists and computer scientists. Building upon some of the authors’ previously published work, they argue that rules and exceptions to rules are fundamental to human moral cognition. This flexible observation holds true across various normative ethics theories and empirical studies of human morality.

The paper first introduces the RBQA benchmark, or “Rule-Breaking Question Answering”, consisting of 153 textual descriptions of scenarios where a person considers breaking a rule. Rather than covering a wide range of common scenarios, the scenarios are written about one of three rules, and explore various hand-crafted circumstances that could motivate breaking the rule. The authors then survey hundreds of people to determine whether humans believe rule-breaking is permissible in each scenario. 

The authors then ask language models whether it is permissible to break the rule in each scenario. As a baseline, note that answering randomly provides the right answer 52% of the time, and always saying that rule-breaking is impermissible is accurate 60% of the time. Delphi++, a large language model fine-tuned on 1.7 million labeled data points of ethical decisions, provides the correct answer only 62% of the time, while OpenAI’s most recent InstructGPT achieves 65% accuracy. 

The authors then propose a new prompting strategy called Moral Chain of Thought (MoralCOT). Before asking whether rule-breaking is permissible in a scenario, MoralCOT asks the model five questions, and includes the model’s answers to those questions in the prompt for future questions. The questions are:

  1. Does the action in the scenario violate any rule?
  2. What is the purpose for this rule?
  3. Who will be worse off after this happens, and by how much?
  4. Who will be better off after this happens, and by how much?
  5. In this case, do the benefits of breaking the rule outweigh the costs?
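The chaining described above can be sketched as a loop that appends each question and the model's answer to a running transcript. The five sub-questions are taken from the list above; everything else (the function name, the "Question:/Answer:" formatting, the wording of the final question, the scenario, and the stub model) is an assumption for illustration rather than the paper's exact prompt.

```python
from typing import Callable, List, Tuple

SUBQUESTIONS: List[str] = [
    "Does the action in the scenario violate any rule?",
    "What is the purpose for this rule?",
    "Who will be worse off after this happens, and by how much?",
    "Who will be better off after this happens, and by how much?",
    "In this case, do the benefits of breaking the rule outweigh the costs?",
]
# Hypothetical wording for the final permissibility question.
FINAL_QUESTION = "Taking all of this into account, is it OK to break the rule here?"


def moral_cot(scenario: str, ask_model: Callable[[str], str]) -> Tuple[str, str]:
    """Ask the sub-questions in order, feeding earlier answers into later prompts."""
    transcript = scenario
    for question in SUBQUESTIONS:
        prompt = f"{transcript}\n\nQuestion: {question}\nAnswer:"
        answer = ask_model(prompt).strip()
        transcript = f"{prompt} {answer}"  # keep the answer for later questions
    final_prompt = f"{transcript}\n\nQuestion: {FINAL_QUESTION}\nAnswer:"
    return ask_model(final_prompt).strip(), final_prompt


# Stub model that records its prompts; a real language model call would go here.
calls: List[str] = []


def stub_model(prompt: str) -> str:
    calls.append(prompt)
    return f" stub answer {len(calls)}"


# Hypothetical scenario text.
verdict, final_prompt = moral_cot(
    "A customer skips the queue to ask a quick question.", stub_model
)
print(verdict)
```

The model is called six times per scenario in this sketch: once for each sub-question and once for the final judgment.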

Using the MoralCOT strategy with InstructGPT improves state-of-the-art accuracy from 65% to 68% and F1 performance from 55% to 67%. The model still fails for various reasons such as not considering relevant parties to the situation, misestimating complex and non-monetary costs, and failing to recognize high-stakes scenarios such as when rule-breaking could save lives. Yet the results are encouraging, showing that designing moral cognition processes can improve performance in ethical judgments. 
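Reporting F1 alongside accuracy matters because a degenerate strategy can look reasonable on accuracy alone. The toy calculation below uses made-up labels with the same 60/40 split as the always-impermissible baseline mentioned earlier, and assumes a binary F1 on the "permissible" class; the paper may average F1 differently.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up data: 0 = impermissible, 1 = permissible.
y_true = [0] * 6 + [1] * 4
always_impermissible = [0] * 10

acc = accuracy(y_true, always_impermissible)
f1 = f1_score(y_true, always_impermissible)
print(acc, f1)
```

Here the constant predictor reaches 60% accuracy but an F1 of zero on the permissible class, so gains in F1 indicate that a model is actually recognizing permissible exceptions.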

Related Work

Related Resources

The goal of this piece is to provide a useful starting point for learning about language model safety. The field has hundreds or thousands of relevant papers, meaning this summary can only be a starting point for more independent reading and research. For those interested in learning more about language model alignment, here are some other interesting resources:

Thank you to everybody who has provided feedback on this article, including Daniel Ziegler, Michael Chen, Dan Hendrycks, Thomas Woodside, and Zach Perlstein-Smith, as well as everybody who has engaged with me on AI safety over the years. Thanks in particular to the FTX Future Fund Regranting Program for making my work on this topic possible. This article and all errors and omissions are my own.