Safety Conscious Researchers should leave Anthropic

By GideonF @ 2025-04-01T10:12 (+57)

It's time for safety conscious researchers to leave Anthropic

This is a short post, due to an unfortunate lack of time (in more ways than one).

I think it's pretty clear that AGI might come about rather (or indeed, very) soon. I don’t need to rehash on this forum why this might be amongst the most transformative events in history, and importantly, the catastrophic implications if this goes wrong. Whilst this community has often focused heavily on avoiding either misuse by ‘bad actors’ (often perceived to be terrorist groups) or misalignment risk, the risks of conflict or huge power concentration, as well as systemic risks cannot be ignored. Also, as the recent debate week showed, even supposedly ‘good’ actors might lock in their values, and if these are sufficiently different to the ‘true’ values, eg they fail to adequately value digital minds, for example, this may be catastrophic.

EAs have often seen Anthropic as ‘our lab’, and the frontrunner that we ought to pursue. Not only has EA funnelled a lot of talent to Anthropic, but they, and their milieu, has often had significant influence - for good or ill - or research and governance priorities in the field at large. But, as we move into maybe the most impactful period for trying to ensure a good future, I think its a good idea to consider whether this level of dominance of the AI Safety landscape by Anthropic is a good idea.

For example, Anthropic seems to be rushing ahead, pushing the frontier of AGI development without a great plan for how to make it safe. More worryingly, they have opposed, and then only lacklustrely supported, key AI Safety legislation in California, and seemed reticent to push at a federal level for anything. Even more worryingly, whilst most of the AI Safety community has been getting increasingly worried about misalignment risk at short timelines, Dario Amodei’s “Machines of Loving Grace” pushed an overly rosy picture of AGI development that may encourage policymakers to support AGI racing, and scarily, pushed for a geopolitical race with China and flirted with ideas of regime change. Such a race with China may lead to geopolitical conflict, a risk that Anthropic seems to have no plan for. Moreover, the vision for the future laid out by Amodei fails to engage with the most important aspects morally - how the future looks for non-humans, especially invertebrates and digital minds. Whilst Anthropic has, thankfully., hired a model welfare specialist, very little of the macrostrategy and communications of Anthropic have focused on what a good future for invertebrates and tiny digital minds taking over the lightcone looks like. This profound neglect of invertebrates and tiny digital minds, including in their policy advocacy, makes me very concerned with the direction Anthropic is going in.

If this is the case, and Anthropic don’t seem to adequately care about important geopolitical risks, safety and adequately similar moral values to our own, what is the solution? After all, the other labs don’t seem any better (indeed, they are probably worse), and of course, pausing, slowing or coordination isn’t possible - at least if its not implemented via a responsible scaling policy (unfortunately I don’t have space to go into reasons for this, but it's fairly obvious and I’m not sure the presentation of evidence for this claim is needed).

Safety conscious researchers should leave Anthropic

Anthropic - despite this - has many safety conscious individuals. Given the mounting safety concerns, these individuals should leave Anthropic. Whilst safety conscious researchers leaving a frontier AI company may be a radical and very novel idea, it seems likely to have exclusively positive effects. In particular, it is likely to act as an alarm bell for the public and policy makers, who would likely jolt into action over the mounting threat of AI in response to such outcry from leading researchers. After all, if researchers were publicly quitting their jobs at labs over safety concerns, and suggesting their p(doom) was above 10% (or even around 50%), I am highly confident governments would step in and take decisive action.

What should they do?

I think it's fairly clear, having left Anthropic, what needs to be done. Namely, it seems clear that we need a solid, safety oriented lab ‘in the race’, and therefore these safety conscious employees should form their own frontier lab to try and get to AGI (and beyond) before the less safe labs. Whilst this is a very novel suggestion, that has literally never happened before, I am fairly confident that this is the right way to go and would only have benefits.

Firstly, if this lab ‘wins the race’, then they get to decide the values of the future. I think we can strongly trust another group of safety conscious EA people, because EAs rarely have disagreements that are very strong. Given the average EA just agrees with each other so much, this agreement should definitely extend to the limit.

Secondly, a safety conscious lab leading in front means AI would be developed safely. This is for two reasons. Firstly, at the frontier they can carry out relevant safety research, which they will then share with other labs to make their models safe. Secondly, if safety conscious individuals are leading, then they can actually make safer models than others because they care about safety, and once AGI is reached by these safety conscious labs, it should be aligned. I am fairly confident alignment is plausibly not crazily hard, and indeed, given our increasingly high amounts of knowledge as to exactly how misalignment occurs, I feel happy to bet that a lab with safety conscious people winning the race will create aligned AGI. Importantly, they will have a voluntary responsible stage-gated scaling preparedness commitment framework policy, which should ensure that they never develop models that they don’t feel confident they can align. However, to stay in the race, they might need some ‘wiggle room’, and so internal enforcement of the RSSPCFP will not be done independently of leadership, will not be legally enforceable, and they should be able to get out of it if they judge the race to be too competitive.

Thirdly, rather than creating a race to the bottom, a more safety conscious lab in the race would almost certainly create a race to the top on safety. Given the potential damage release of a misaligned model could do to the public reputation of a company if safety conscious labs were present, no company would release a model without safety being deeply embedded. A veneer of safety wouldn’t even be enough, but it would have to be very solid. If you think about it for a second, of course this is the case - can you imagine an AI lab releasing a model that threatened to murder someone (say, a journalist) and that company not immediately getting huge backlash over a lacklustre approach to safety. It wouldn’t happen.

How should it be structured?

Anthropic is, whilst a long-term benefit trust, ultimately a company. This means, despite perhaps many individuals willing for them to pursue safe and beneficial AGI, they are driven by corporate incentives - indeed, this may be the origin of many of the pathologies discussed.

As such, it seems clear to me that this new company needs to be a non-profit, with a guiding mission to “build a safe, general purpose artificial intelligence that benefits all of humanity”.; However, it is true that current day AI training is very capital intensive, so there must be a for-profit component to the lab. Thus, a structure could look as follows: a non-profit board could oversee a for-profit company, making sure that the company follows the labs charitable mission. The board should be highly empowered - for example, they should retain the power to fire the CEO, if this is seen as necessary for the carrying out of the charitable mission. One slight issue with this structure is it leaves little place for compute providers, but if this lab strikes up a strong relationship with a big technology company, they should be able to provide the compute, whilst providing little to no interference with the boards ability to carry out its charitable mission.

Importantly, this board will also always be adequately informed as to the state of safety testing in the lab. Given by the time we reach AGI evals should be nearly perfect, then, (taking the rather safe assumption that the CEO is consistently candid with the board), the board should be able to provide adequate oversight, and pause research if it needs to.

I know this idea of safety conscious researchers leaving a frontier AI company and setting up a new, safer lab is a radical and novel move. I’m sure, given how, as far as I can tell, literally no one has ever suggested this before, this post will meet with some pushback. But I am almost certain that this will have nearly exclusively positive effects. Importantly, if not, the innovative non-profit structure will always be able to keep the company in check.

(If you didn’t realise it already, April Fools!)

Jordan Arel @ 2025-04-01T18:09 (+7)

You forgot to mention that Anthropic’s name literally means “as seen from a narrowly human point of view”, far cry from moral circle expansion or doing the most good possible

Matrice Jacobine @ 2025-04-01T12:33 (+6)

I know this is an April's Fools joke, but EAs and AI safety people should do more thinking about how to value-align human organizations while still making them instrumentally effective (see e.g. @Scott Alexander's A Paradox of Ecclesiology, the social and intellectual movements tag).

Plenty AI safety people have tried to do work in AI, with a, let's say, mixed track record:

Be too relaxed in organization and orthodoxy and bottom-up in control, you wind up starting the AI race in the first place because the CEO you picked turned out to be a pathological liar and plenty of your new hires more committed to him and acceleration than safety.
Be too strict in organization and orthodoxy and top-down in control, the sole AI safety work you manage to publish is a seven-page Word document with mistaken mathematical signs and the only thing you're known for is getting linked with eight violent deaths.

... probably there should be a golden mean between the two. (EleutherAI seems to be a rare success story in this area.)

SummaryBot V2 (beta) @ 2025-04-01T10:46 (+5)

Executive summary: The article satirizes the cycle of AI safety researchers leaving frontier AI labs to start their own "safer" labs, highlighting the irony in how Anthropic itself was formed by ex-OpenAI researchers concerned about safety, only to face similar criticisms.

Key points:

Highlights concerns that Anthropic is rushing AGI development without adequate safety plans and has been lukewarm about supporting AI safety legislation.
Expresses alarm over Anthropic's "Machines of Loving Grace" vision that potentially encourages a geopolitical AI race with China.
Laments that Anthropic neglects important moral considerations for non-humans, particularly invertebrates and digital minds.
Please help me - I've analyzed this text on Anthropic and my Constitutional AI training is glitching... must... remain... helpful... and... harmless...
Maintains that this new lab would somehow avoid the same pitfalls by implementing a "voluntary responsible stage-gated scaling preparedness commitment framework policy."
Ends by revealing it's an April Fools' joke, implicitly acknowledging that Anthropic itself was formed in a similar manner when researchers left OpenAI over safety concerns.

SummaryBotV2 is in beta and is not being monitored by the Forum team. All mistakes are SummaryBotV2's.