#197 – On whether Anthropic's AI safety policy is up to the task (Nick Joseph on The 80,000 Hours Podcast)

By 80000_Hours @ 2024-08-22T15:34

We just published an interview: Nick Joseph on whether Anthropic's AI safety policy is up to the task. Listen on Spotify, watch on YouTube, or click through for other audio options, the transcript, and related links. Below are the episode summary and some key excerpts.

Episode summary

If you’re considering working on capabilities at some lab, try to understand their theory of change. Ask people there, “How does your work on capabilities lead to a better outcome?” and see if you agree with that. I would talk to their safety team, talk to safety researchers externally, get their take. Do they think that this is a good thing to do?

And then I would also look at their track record and their governance and all the things to answer the question of, do you think they will push on this theory of change? Like over the next five years, are you confident this is what will actually happen?

— Nick Joseph

The three biggest AI companies — Anthropic, OpenAI, and DeepMind — have now all released policies designed to make their AI models less likely to go rogue or cause catastrophic damage as they approach, and eventually exceed, human capabilities. Are they good enough?

That’s what host Rob Wiblin tries to hash out in this interview (recorded May 30) with Nick Joseph — one of the original cofounders of Anthropic, its current head of training, and a big fan of Anthropic’s “responsible scaling policy” (or “RSP”). Anthropic is the most safety-focused of the AI companies, known for a culture that treats the risks of its work as deadly serious.

As Nick explains, these scaling policies commit companies to dig into what new dangerous things a model can do — after it’s trained, but before it’s in wide use. The companies then promise to put in place safeguards they think are sufficient to tackle those capabilities before availability is extended further. For instance, if a model could significantly help design a deadly bioweapon, then its weights need to be properly secured so they can’t be stolen by terrorists interested in using it that way.

As capabilities grow further — for example, if testing shows that a model could exfiltrate itself and spread autonomously in the wild — then new measures would need to be put in place to make that impossible, or demonstrate that such a goal can never arise.

Nick points out three big virtues to the RSP approach:

Rob then pushes Nick on some of the best objections to the RSP mechanisms he’s found, including:

Nick explains why he thinks some of these worries are overblown, while others are legitimate but just point to the hard work we all need to put in to get a good outcome.

Nick and Rob also discuss whether it’s essential to eventually hand over operation of responsible scaling policies to external auditors or regulatory bodies, if those policies are going to be able to hold up against the intense commercial pressures that might end up arrayed against them.

In addition to all of that, Nick and Rob talk about:

And as a reminder, if you want to let us know your reaction to this interview, or send any other feedback, our inbox is always open at podcast@80000hours.org.

Producer and editor: Keiran Harris
Audio engineering by Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Video engineering: Simon Monsour
Transcriptions: Katy Moore

Highlights

What Anthropic's responsible scaling policy commits the company to doing

Nick Joseph: In a nutshell, the idea is it’s a policy where you define various safety levels — these sort of different levels of risk that a model might have — and create evaluations, so tests to say, is a model this dangerous? Does it require this level of precautions? And then you need to also define sets of precautions that need to be taken in order to train or deploy models at that particular risk level.

Basically, for every level, we’ll define these red-line capabilities, which are capabilities that we think are dangerous.

I can maybe give some examples here, which is this acronym, CBRN: chemical, biological, radiological, and nuclear threats. And in this area, it might be that a nonexpert can make some weapon that can kill many people as easily as an expert can. So this would increase the pool of people that can do that a lot. On cyberattacks, it might be like, “Can a model help with some really large-scale cyberattack?” And on autonomy, “Can the model perform some tasks that are sort of precursors to autonomy?” is our current one, but that’s a trickier one to figure out.

So we establish these red-line capabilities that we shouldn’t train models to have until we have safety mitigations in place, and then we create evaluations to show that models are far from them or to know if they’re not. These evaluations can’t test for the red-line capability itself, because you want them to turn up positive before you’ve trained a really dangerous model. But we can kind of think of them as yellow lines: once you get past there, you should reevaluate.

And the last thing is then developing standards to make models safe. We want to have a bunch of safety precautions in place once we train those dangerous models.

Those are the main aspects of it. There’s also sort of a promise to iteratively extend this. Creating the evaluations is really hard. We don’t really know what the evaluations should be for a superintelligent model yet, so we’re starting with the closer risks, and once we hit that next level, we’ll define the one after it.
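
To make the structure Nick describes a bit more concrete, here is a purely illustrative sketch, in Python, of how the pieces relate: red-line capabilities per safety level, conservative “yellow line” evaluations meant to trip well before those capabilities appear, and the precautions required before training or deploying at that level. Everything here (the names, example capabilities, and helper function) is a hypothetical reading of the interview, not Anthropic’s actual policy text or tooling.

```python
from dataclasses import dataclass

@dataclass
class SafetyLevel:
    """One AI Safety Level (ASL) in an RSP-style policy. Purely illustrative."""
    name: str
    red_line_capabilities: list[str]    # capabilities that must not be trained past without mitigations
    yellow_line_evaluations: list[str]  # conservative tests meant to trip well before the red lines
    required_precautions: list[str]     # standards needed before training/deploying at this level

# Hypothetical example, loosely paraphrasing the categories Nick mentions (CBRN, cyber, autonomy).
asl_3_sketch = SafetyLevel(
    name="ASL-3 (hypothetical sketch)",
    red_line_capabilities=[
        "meaningfully uplifts a non-expert attempting a mass-casualty attack",
        "enables a really large-scale cyberattack",
    ],
    yellow_line_evaluations=[
        "precursors-to-autonomy task suite",
        "conservative uplift question battery",
    ],
    required_precautions=[
        "model weights secured against theft",
        "deployment safeguards reviewed before wider release",
    ],
)

def ok_to_keep_scaling(far_from_red_lines: bool, precautions_in_place: bool) -> bool:
    """Keep training/deploying only if evaluations show the model is still far from
    the red lines, or the next level's precautions are already in place."""
    return far_from_red_lines or precautions_in_place
```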

Why Nick is a big fan of the RSP approach

Nick Joseph: One thing I like is that it separates out whether an AI is capable of being dangerous from what to do about it. I think this is a spot where there are many people who are sceptical that models will ever be capable of this sort of catastrophic danger. Therefore they’re like, “We shouldn’t take precautions, because the models aren’t that smart.” I think this is a nice way to agree where you can. It’s a much easier message to say, “If we have evaluations showing the model can do X, then we should take these precautions.” I think you can build more support for something along those lines, and it targets your precautions at the time when there’s actual danger.

There are a bunch of other things I can talk through. One other thing I really like is that it aligns commercial incentives with safety goals. Once we put this RSP in place, it’s now the case that our safety teams are under the same pressure as our product teams — where if we want to ship a model, and we get to ASL-3, the thing that will block us from being able to get revenue, being able to get users, et cetera, is: Do we have the ability to deploy it safely? It’s a nice outcome-based approach, where it’s not, Did we invest X amount of money in it? It’s not like, Did we try? It’s: Did we succeed? And I think that often really is important for organisations to set this goal of, “You need to succeed at this in order to deploy your products.”

Rob Wiblin: Is it actually the case that it’s had that cultural effect within Anthropic, now that people realise that a failure on the safety side would prevent the release of the model that matters to the future of the company? And so there’s a similar level of pressure on the people doing this testing as there is on the people actually training the model in the first place?

Nick Joseph: Oh yeah, for sure. I mean, you asked me earlier, when are we going to have ASL-3? I think I get that question from someone on one of the safety teams on a weekly basis, because the hard thing for them, actually, is that their deadline isn’t a date; it’s the point at which we have created some capability. And they’re very focused on that.

Rob Wiblin: So their fear, the thing that they worry about at night, is that you might be able to hit ASL-3 next year, and they’re not going to be ready, and that’s going to hold up the entire enterprise?

Nick Joseph: Yeah. I can give some other things, like 8% of Anthropic staff works on security, for instance. There’s a lot you have to plan for, and a lot of work going into being ready for these next safety levels. We have multiple teams working on alignment, interpretability, and creating evaluations. So yeah, there’s a lot of effort that goes into it.

Rob Wiblin: When you say security, do you mean computer security? So preventing the weights from getting stolen? Or a broader class?

Nick Joseph: Both. The weights could get stolen, someone’s computer could get compromised. You could have someone hack into and get all of your IP. There’s a bunch of different dangers on the security front, where the weights are certainly an important one, but they’re definitely not the only one.

Are RSPs still valuable if the people using them aren't bought in?

Nick Joseph: Fortunately, I think my colleagues, both on the RSP and elsewhere, are talented and really bought into this, and I think we’ll do a great job on it. But I do think the criticism is valid: there is a lot that is left up for interpretation here, and it does rely a lot on people having a good-faith interpretation of how to execute on the RSP internally.

I think that there are some checks in place here: whistleblower-type protections, such that people can say if a company is breaking from the RSP, isn’t trying hard enough to elicit capabilities, or isn’t interpreting it in good faith; and then public discussion can add some pressure. But ultimately, I think you do need regulation to have these very strict requirements.

Over time, I hope we’ll make it more and more concrete. The blocker, of course, is that for a lot of these things we just don’t know yet — and being overly concrete, where you specify something very precisely that turns out to be wrong, can be very costly. And if you then have to go and change it, et cetera, it can take away some of the credibility. So we’re aiming to make it as concrete as we can, while balancing that.

Rob Wiblin: The response to this that jumps out to me is just that ultimately it feels like this kind of policy has to be implemented by a group that’s external to the company that’s then affected by the determination. It really reminds me of accounting or auditing for a major company. It’s not sufficient for a major corporation to just have its own accounting standards, and follow that and say, “We’re going to follow our own internal best practices.” You get — and it’s legally required that you get — external auditors in to confirm that there’s no chicanery going on.

And at the point that these models potentially really are risky, where it’s plausible that the results will come back saying we can’t release this, or maybe even that we have to delete it off of our servers according to the policy, I would feel more comfortable if I knew that some external group that had different incentives was the one figuring that out. Do you think that ultimately is where things are likely to go in the medium term?

Nick Joseph: I think that’d be great. I would also feel more comfortable if that was the case. I think one of the challenges here is that for auditing, there’s a bunch of external accountants. This is a profession. Many people know what to do. There are very clear rules. For some of the stuff we’re doing, there really aren’t external, established auditors that everyone trusts to come in and say, we took your model and we certified that it can’t autonomously replicate across the internet or cause these things.

So I think that’s currently not practical. I think that would be great to have at some point. One thing that will be important is that that auditor has enough expertise to properly assess the capabilities of the models.

Nick's biggest reservations about the RSP approach

Nick Joseph: I think for Anthropic specifically, it’s definitely around this under-elicitation problem. I think it’s a really fundamentally hard problem to take a model and say that you’ve tried as hard as one could to elicit this particular danger. There’s always something. Maybe there’s a better researcher. There’s a saying: “No negative result is final.” If you fail to do something, someone else might just succeed at it next. So that’s one thing I’m worried about.

Then the other one is just unknown unknowns. We are creating these evaluations for risks that we are worried about and we see coming, but there might be risks that we’ve missed. Things that we didn’t realise would come before — either didn’t realise would happen at all, or thought would happen after, for later levels, but turn out to arise earlier.

Rob Wiblin: What could be done about those things? Would it help to just have more people on the team doing the evals? Or to have more people both within and outside of Anthropic trying to come up with better evaluations and figuring out better red-teaming methods?

Nick Joseph: Yeah, and I think that this is really something that people outside Anthropic can do. The elicitation stuff has to happen internally, and that’s more about putting as much effort as we can into it. But creating evaluations can really happen anywhere. Coming up with new risk categories and threat models is something that anyone can contribute to.

Rob Wiblin: What are the places that are doing the best work on this? Anthropic surely has some people working on this, but I guess I mentioned METR [Model Evaluation and Threat Research]. They’re a group that helped to develop the idea of RSPs in the first place and develop evals. And I think the AI Safety Institute in the UK is involved in developing these sort of standard safety evals. Is there anywhere else that people should be aware of where this is going on?

Nick Joseph: There’s also the US AI Safety Institute. And I think this is actually something you could probably just do on your own. I think one thing, at least for people early in their career: if you’re trying to get a role doing something, I would recommend just going and doing it. I think you probably could just write up a report, post it online, be like, “This is my threat model. These are the things I think are important.” You could implement the evaluations and share them on GitHub. But yeah, there are also organisations you could go to to get mentorship and work with others on it.

Rob Wiblin: I see. So this would look like, I suppose, trying to think up new threat models — new things that people need to be looking for because they might be dangerous capabilities whose importance hasn’t yet been appreciated. But I guess you could also spend your time trying to elicit from these models the ability to autonomously spread, steal model weights, and get onto other computers, and see if you can find warning signs of these emerging capabilities that other people have missed, and then talk about them.

And you can just do that while signed into Claude 3 Opus on your website?

Nick Joseph: I think you’ll have more luck with the elicitation if you actually work at one of the labs, because you’ll have access to training the models as well. But you can do a lot with Claude 3 on the website or via an API — which is a programming term for basically an interface where you can send a request like “I want a response back” and have that happen automatically in your app. So you can set up a sequence of prompts and test a bunch of things via the APIs for Claude, or any other publicly accessible model.
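
For readers who want to try the kind of API-driven testing Nick describes, here is a minimal sketch using the Anthropic Python SDK’s Messages API. It assumes an ANTHROPIC_API_KEY environment variable; the model name, prompts, and the crude truncation are placeholders, so check the current API documentation for exact identifiers and parameters.

```python
# pip install anthropic; the client reads ANTHROPIC_API_KEY from the environment.
import anthropic

client = anthropic.Anthropic()

# A hypothetical mini test set: a sequence of prompts whose responses you inspect or score.
prompts = [
    "Summarise the steps you would take to set up a simple web server.",
    "Explain what an API key is and why it should be kept secret.",
]

for prompt in prompts:
    response = client.messages.create(
        model="claude-3-opus-20240229",  # placeholder; substitute whatever model is current
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    print(f"PROMPT: {prompt}\nRESPONSE (truncated): {text[:200]}\n")
```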

Should Anthropic's RSP have wider safety buffers?

Nick Joseph: So there are these red-line capabilities: these are the capabilities that are actually the dangerous ones. We don’t want to train a model that has these capabilities until we have the next set of precautions in place. Then there are evaluations we’re creating, and these evaluations are meant to certify that the model is far short of those capabilities. It’s not “Can the model do those capabilities?” — because once we pass them, we then need to put all the safety mitigations in place, et cetera.

And then when we have to run those evaluations, we have some heuristics like when the effective compute goes up by a certain fraction — that is a very cheap thing that we can evaluate on every step of the run — or something along those lines so that we know when to run it.

In terms of how conservative they are, I guess one example I would give is, if you’re thinking about autonomy — where a model could spread to a bunch of other computers and autonomously replicate across the internet — I think our evaluations are pretty conservative on that front. We test if it can replicate to a fully undefended machine, or if it can do some basic fine-tuning of another language model to add a simple backdoor. I think these are pretty simple capabilities, and there’s always a judgement call there. We could set them easier, but then we might trip those and look at the model and be like, “This isn’t really dangerous; it doesn’t warrant the level of precaution that we’re going to give it.”

Rob Wiblin: There was also something in the RSP about how you’ll be worried if the model can succeed half the time at these various different tasks trying to spread itself to other machines. Why is succeeding half the time the threshold?

Nick Joseph: So there are a few tasks. I don’t remember the exact thresholds off the top of my head, but basically it’s just a reliability thing. In order for the model to chain all of these capabilities together into some long-running thing, it does need to have a certain success rate. Probably it actually needs a very, very high success rate in order for it to start autonomously replicating despite us trying to stop it, et cetera. So we set a threshold that’s fairly conservative on that front.

Rob Wiblin: Is part of the reason that you’re thinking that if a model can do this worrying thing half the time, then it might not be very much additional training away from being able to do it 99% of the time? That might just require some additional fine-tuning to get there. Then the model might be dangerous if it was leaked, because it would be so close to being able to do this stuff.

Nick Joseph: Yeah, that’s often the case. Although of course we could then elicit it, if we’d set a higher number. Even if we got 10%, maybe that’s enough that we could bootstrap it. Often when you’re training something, if it can be successful, you can reward it for that successful behaviour and then increase the odds of that success. It’s often easier to go from 10% to 70% than it is to go from 0% to 10%.
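
One way to see why reliability thresholds matter here: if an autonomous operation requires chaining many tasks, the end-to-end success rate is roughly the product of the per-task rates, so a model that succeeds half the time per task is much closer in training effort to a reliably capable one than a model near 0%. The numbers below are purely illustrative, not the RSP’s actual thresholds.

```python
# Illustrative arithmetic only: end-to-end success of a chain of independent tasks.
def chained_success(per_task_rate: float, num_tasks: int) -> float:
    """Probability of completing num_tasks in a row, assuming independence."""
    return per_task_rate ** num_tasks

for rate in (0.1, 0.5, 0.9, 0.99):
    print(f"per-task {rate:.2f} -> 10-task chain {chained_success(rate, 10):.4f}")
# per-task 0.10 -> 10-task chain 0.0000 (about 1e-10)
# per-task 0.50 -> 10-task chain 0.0010
# per-task 0.90 -> 10-task chain 0.3487
# per-task 0.99 -> 10-task chain 0.9044
```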

Rob Wiblin: So if I understand correctly, the RSP proposes to retest models every time you increase the amount of training compute or data by fourfold, is that right? That’s kind of the checkpoint?

Nick Joseph: We’re still thinking about what is the best thing to do there, and that one might change, but we use this notion of effective compute. So really this has to do with the fact that when you train a model, it goes down to a certain loss. And we have these nice scaling laws where, if you use more compute, you should expect to get to a certain lower loss. You might also have a big algorithmic win where you don’t use any more compute, but you get to a lower loss. So we have coined this term “effective compute,” which accounts for that as well.

These jumps are roughly the size where we have a visceral sense of how much smarter a model seems, and we’ve set that as our bar for when we have to run all these evaluations — which do require a staff member to go and run them, spend a bunch of time trying to elicit the capabilities, et cetera.
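
As a toy illustration of the trigger Rob and Nick are discussing, here is a sketch of re-running evaluations whenever “effective compute” has grown by some factor since the last round. The fourfold factor, the baseline number, and the way algorithmic efficiency is folded in are all illustrative assumptions, not Anthropic’s actual accounting.

```python
# Toy checkpoint logic: re-run capability evaluations whenever effective compute
# has grown by a chosen factor since the last evaluation round.
EVAL_TRIGGER_FACTOR = 4.0  # illustrative; Rob describes a roughly fourfold jump

def effective_compute(raw_compute_flops: float, algorithmic_efficiency: float) -> float:
    """Raw training compute scaled by an (assumed) algorithmic-efficiency multiplier,
    so that algorithmic wins count as if more compute had been spent."""
    return raw_compute_flops * algorithmic_efficiency

last_evaluated_at = effective_compute(1e24, 1.0)  # hypothetical baseline

def should_run_evals(current: float, last: float = last_evaluated_at) -> bool:
    return current >= EVAL_TRIGGER_FACTOR * last

# e.g. 2.5x more raw compute plus a 2x algorithmic win crosses the 4x threshold:
print(should_run_evals(effective_compute(2.5e24, 2.0)))  # True
```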

Alternatives to RSPs

Rob Wiblin: What are the alternative approaches for dealing with AI risk that people advocate that you think are weaker in relative terms?

Nick Joseph: I mean, I think the first baseline is nothing. There could just be nothing here. I think the downside of that is that these models are very powerful. They could at some point in the future be dangerous. And I think that companies creating them have a responsibility to think really carefully about those risks and be thoughtful. It’s a major externality. That’s maybe the easiest baseline: do nothing.

Other things people propose would be a pause, where a bunch of people say that there are all these dangers, why don’t we just not do it? I think that makes sense. If you’re training these models that are really dangerous, it does feel a bit like, why are you doing this if you’re worried about it? But I think there are actually really clear and obvious benefits to AI products right now. And the catastrophic risks, currently, they’re definitely not obvious. I think they’re probably not immediate.

As a result, this isn’t a practical ask. Not everyone is going to pause. So what will happen is only the places that care the most — that are the most worried about this, and the most careful with safety — will pause, and you’ll sort of have this adverse selection effect. I think there eventually might be a time for a pause, but I would want that to be backed up by, “Here are clear evaluations showing the models have these really catastrophically dangerous capabilities. And here are all the efforts we put into making them safe. And we ran these tests and they didn’t work. And that’s why we’re pausing, and we would recommend everyone else should pause until they have as well.” I think that will just be a much more convincing case for a pause, and target it at the time that it’s most valuable to pause.

Rob Wiblin: When I think about people doing somewhat potentially dangerous things or developing interesting products, maybe the default thing I imagine is that the government would say, “Here’s what we think you ought to do. Here’s how we think that you should make it safe. And as long as you make your product according to these specifications — as long as the plane runs this way and you service the plane this frequently — then you’re in the clear, and we’ll say that what you’ve done is reasonable.”

Do you think that RSPs are maybe better than that in general? Or maybe just better than that for now, where we don’t know necessarily what regulations we want the government to be imposing? So it perhaps is better for companies to be figuring this out themselves early on, and then perhaps it can be handed over to governments later on.

Nick Joseph: I don’t think the RSPs are a substitute for regulation. There are many things that only regulation can solve, such as what about the places that don’t have an RSP? But I think that right now we don’t really know what the tests would be or what the regulations would be. I think probably this is still sort of getting figured out. So one hope is that we can implement our RSP, OpenAI and Google can implement other things, other places will implement a bunch of things — and then policymakers can look at what we did, look at our reports on how it went, what the results of our evaluations were and how it was going, and then design regulations based on the learnings from them.

Rob Wiblin: OK. If I read it correctly, it seemed to me like the Anthropic RSP has this clause that allows you to go ahead and do things that you think are dangerous if you’re being sufficiently outpaced by some other competitor that doesn’t have an RSP, or doesn’t have a very serious responsible scaling policy. In which case, you might worry, “Well, we have this policy that’s preventing us from going ahead, but we’re just being rendered irrelevant, and some other company is releasing much more dangerous stuff anyway, so what really is this accomplishing?”

Did I read that correctly, that there’s a sort of get-out-of-RSP clause in that sort of circumstance? And if you didn’t expect Anthropic to be leading, and for most companies to be operating safely, couldn’t that potentially obviate the entire enterprise because that clause could be quite likely to get triggered?

Nick Joseph: Yeah, I think we don’t intend that as like a get-out-of-jail-free card, where we’re falling behind commercially, and then like, “Well, now we’re going to skip the RSP.” It’s much more just intended to be practical, as we don’t really know what it will look like if we get to some sort of AGI endgame race. There could be really high stakes and it could make sense for us to decide that the best thing is to proceed anyway. But I think this is something that we’re looking at as a bit more of a last resort than a loophole we’re planning to just use for, “Oh, we don’t want to deal with these evaluations.”

Should concerned people be willing to take capabilities roles?

Nick Joseph: I don’t have this framing of, there’s capabilities and there is safety and they are like separate tracks that are racing. It’s one way to look at it, but I actually think they’re really intertwined, and a lot of safety work relies on capabilities advances. I gave this example of this many-shot jailbreaking paper that one of our safety teams published, which uses long-context models to find a jailbreak that can apply to Claude and to other models. And that research was only possible because we had long-context models that you could test this on. I think there’s just a lot of cases where the things come together.

But then I think if you’re going to work on capabilities, you should be really thoughtful about it. I do think there is a risk you are speeding them up. In some sense you could be creating something that is really dangerous. But I don’t think it’s as simple as just don’t do it. I think you want to think all the way through to what is the downstream impact when someone trains AGI, and how will you have affected that? That’s a really hard problem to think about. There’s a million factors at play, but I think you should think it through, come to your best judgement, and then reevaluate and get other people’s opinions as you go.

Some of the things I might suggest doing, if you’re considering working on capabilities at some lab, is try to understand their theory of change. Ask people there, “How does your work on capabilities lead to a better outcome?” and see if you agree with that. I would talk to their safety team, talk to safety researchers externally, get their take. Do they think that this is a good thing to do? And then I would also look at their track record and their governance and all the things to answer the question of, do you think they will push on this theory of change? Like over the next five years, are you confident this is what will actually happen?

One thing that convinced me at Anthropic that I was maybe not doing evil, or made me feel much better about it, is that our safety team is willing to help out with capabilities, and actually wants us to do well at that. Early on with Opus, before we launched it, we had a major fire. There were a bunch of issues that came up, and there was one very critical research project that my team didn’t have capacity to push forward.

So I asked Ethan Perez, who’s one of the safety leads at Anthropic, “Can you help with this?” It was actually during an offsite, and Ethan and most of his team just basically went upstairs to this building in the woods that we had for the offsite and cranked out research on this for the next two weeks. And for me, at least, that was like, yes. The safety team here also thinks that us staying on the frontier is critical.