On excluding dangerous information from training

By ShayBenMoshe @ 2023-11-17T20:09 (+8)

This is a linkpost to https://www.lesswrong.com/posts/jKhJbrNRFA4afYPoP/on-excluding-dangerous-information-from-training

Introduction

In this short post, I would like to argue that it might be a good idea to exclude certain information – such as offensive cybersecurity knowledge and biorisk-enabling knowledge – from frontier model training. I argue that this

is feasible, both technically and socially;
reduces significant misalignment and misuse risk drivers from near-to-medium future models;
comes at a good time for setting such a norm;
is a good test case for regulation.

After arguing for these points, I conclude with a call to action.

Remarks

To emphasize, I do not argue that this

is directly relevant to the alignment problem;
eliminates all risks from near-to-medium future models;
(significantly) reduces risks from superintelligence.

As I am far more knowledgeable in cybersecurity than in, say, biorisks, I will only give examples from cybersecurity when discussing specifics. Nevertheless, I think the arguments hold as-is for other relatively narrow subfields, such as the knowledge I imagine is most relevant for manufacturing lethal pathogens. One may also want to exclude other information that might drive risks, such as information on AI safety (broadly defined) or energy production (nuclear energy or solar panels), but this is outside the scope of this post.

I would like to thank Asher Brass, David Manheim, Edo Arad and Itay Knaan-Harpaz for useful comments on a draft of this post. They do not necessarily endorse the views expressed here, and all mistakes are mine.

Feasibility

Technical feasibility

Filtering information from textual datasets seems fairly straightforward. It seems easy to develop a classifier (e.g., fine-tuned from a small language model) detecting offensive cybersecurity-related information.

For example, one would want to exclude examples of specific vulnerabilities and exploits (e.g. all CVEs), information about classes of vulnerabilities (e.g. heap overflows and null-pointer dereferences, when discussed as vulnerabilities), exploitation mitigations (e.g. ASLR, DEP, SafeSEH, stack cookies, CFG, pointer tagging), exploitation techniques (e.g. ROP, NOP slides, heap spraying) and cybersecurity-related tools and toolchains (e.g. shellcode, IDA, Metasploit, antivirus capabilities, fuzzers). More debatable candidates for exclusion are the source code of particular attack surfaces (e.g. the Linux TCP/IP stack) and technical details of real-world cybersecurity incidents. At any rate, all of these seem easy to detect.
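
To illustrate what such a filter could look like, here is a minimal sketch in Python. It is not the classifier proposed above: the keyword list, the names keyword_score and filter_corpus, and the threshold are all hypothetical, and a simple pattern match stands in for a fine-tuned model. The point is only that the filtering step itself is a thin wrapper around any scoring function.

```python
import re
from typing import Callable, Iterable, Iterator

# Hypothetical keyword patterns, for illustration only. A real filter would
# more likely use a fine-tuned classifier, as suggested in the post.
OFFENSIVE_SECURITY_PATTERNS = [
    r"CVE-\d{4}-\d{4,}",              # identifiers of specific vulnerabilities
    r"heap spray(?:ing)?",            # exploitation techniques
    r"return-oriented programming",
    r"NOP slide",
    r"shellcode",                     # tools and toolchains
    r"metasploit",
]
_PATTERN = re.compile("|".join(OFFENSIVE_SECURITY_PATTERNS), re.IGNORECASE)


def keyword_score(text: str) -> float:
    """Toy stand-in for a classifier: 1.0 if any pattern matches, else 0.0."""
    return 1.0 if _PATTERN.search(text) else 0.0


def filter_corpus(
    docs: Iterable[str],
    score: Callable[[str], float] = keyword_score,
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only the documents whose score is below the threshold."""
    for doc in docs:
        if score(doc) < threshold:
            yield doc


if __name__ == "__main__":
    corpus = [
        "How to bake sourdough bread",
        "Write-up of CVE-2099-00001 using heap spraying",  # made-up identifier
        "Notes on TCP congestion control",
    ]
    for kept in filter_corpus(corpus):
        print(kept)
```

Because filter_corpus accepts any scoring function, the keyword matcher could be swapped for a learned classifier that returns a probability, without changing the rest of the pipeline.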

Furthermore, as models' sample efficiency is currently very low, it is likely that a moderately low false-negative rate would suffice to significantly decrease such capabilities.
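
As a back-of-the-envelope illustration of the false-negative argument, the sketch below computes how many relevant documents would be expected to slip past a filter. The function name and all numbers are hypothetical and chosen only to show the arithmetic; they are not estimates of any real dataset or classifier.

```python
def residual_documents(n_relevant: int, false_negative_rate: float) -> int:
    """Expected number of relevant documents that a filter fails to remove."""
    if n_relevant < 0:
        raise ValueError("n_relevant must be non-negative")
    if not 0.0 <= false_negative_rate <= 1.0:
        raise ValueError("false_negative_rate must be between 0 and 1")
    return round(n_relevant * false_negative_rate)


if __name__ == "__main__":
    # Hypothetical: 100,000 relevant documents and a 5% false-negative rate.
    print(residual_documents(100_000, 0.05))  # prints 5000
```

Under these made-up numbers, the model would see 20 times fewer relevant examples than without filtering. Whether that is enough depends on how many examples a model needs to pick up the capability, which is the sample-efficiency point made above.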

Social feasibility

Most (legitimate) use cases don't employ such capabilities. Moreover, this kind of information is fairly narrow and self-contained, so excluding it from the dataset will likely not result in a meaningfully less capable model in other respects. Therefore, it seems likely that most actors – including AI labs and the open source community – won't have a strong incentive to include such information.

Moreover, actors might have relatively strong incentives to take such measures, whether out of concern about AI risks, to avoid being sued over small-scale misuse or accidents, or to protect their public reputation.

It is true that some actors (such as pentesters, scientists, militaries, etc.) might be interested in such capabilities – both for legitimate and illegitimate uses. In such cases, they can train narrow models. I believe that this still reduces misuse and misalignment risks as I explain in the next section.

Risk reduction

Misalignment risks

Many misalignment risks are driven by such capabilities (see for example [1][2][3][4][5][6]). Reducing models' knowledge of such information therefore reduces the likelihood that a misalignment incident succeeds.

To still employ such capabilities, models will either have to be sufficiently agentic and have strong in-context or online learning capabilities to acquire this information (through the internet, for example), or be strong enough to reinvent it on their own (without even knowing which mitigations humans have implemented). Both of these seem to lie further in the future than the point at which models would otherwise carry misalignment risks due to other factors. Thus, this could potentially buy significant time for AI safety work (including work assisted by powerful, but not extremely powerful, AI models).

As mentioned above, some actors will still be interested in such capabilities. Nevertheless, in those cases they might be content with narrow(er) models, which therefore entail significantly smaller misalignment risks.

Misuse risks

Many misuse risks are driven by the very same capabilities (see for example [3][4][5][6][7]). Admittedly, excluding this information won't eliminate such risks, but it would significantly raise the bar for carrying out misuse. A malicious actor would have to either train an advanced model on their own, or gain access to such a model's weights and further fine-tune them, both of which require significant know-how, money and time.

Setting a norm

With the recent surge in public interest in AI risks, this seems like a very good time for such actions. Given the risks and the relative ease of implementation, it seems likely that some safety-minded actors could adopt these measures voluntarily in the near future. As the measures are simple and cost relatively little, even less safety-minded actors might be willing to adopt them soon after, as they become a more widely accepted practice and as tools and standard methods make them easy to implement.

Regulation test case

The same considerations also seem to make this a relatively easy target for training data regulation. Thus, it can serve as a first test case for AI governance actors, policymakers, etc., paving the way for easier regulation processes in the future.

Call to action

Here are a few calls to action:

AI labs can adopt these ideas and implement them in their future models.
AI safety researchers and engineers can develop a standardized tool for filtering such information, to be adopted by actors training models.
AI governance actors can develop these ideas, and push for their regulation.
Others can give feedback, point out shortcomings, and suggest other improvements.

I am happy to assist with these (especially where my background in cybersecurity can help), and am available at shaybm9@gmail.com.