How the AI Safety Community Can Counter Safety Washing
By Chris Leong @ 2025-10-13T08:27 (+9)
I was working on this post during the BlueDot writing course. Properly writing it would be a massive project, so honestly I am unlikely to ever complete it without funding, either to spend something like a month on it myself or to hire a writer to work with me.
Even if I did complete this project at some point in the future, I'd have to adjust the framing, as the Paris AI Action conference would no longer be so topical, so it's probably worthwhile just dropping this here.
Outline:
- How the situation now sucks:
- The Paris AI Action conference was a wash
- More generally AI safety is suffering serious setbacks
- Observation: AI safety has enough societal power that actors feel the need to pay lip service, but not to go beyond this:
- This results in safety washing
- Take a step back:
- Observe how promising things looked right after ChatGPT
- Note how surprising it is that the situation has reversed so much
- Explain the importance of understanding the situation for charting a course forward
Understanding the situation:
- Incentives:
- Explain incentives
- You might assume we would already know about these
- Claim: We’re currently outgunned:
- Main thesis: “Lots of different actors find AI safety inconvenient for various reasons, even if it's just competing for the same funding/attention, leading to them undermining it or engaging in motivated reasoning”:
- Justify consideration of motivated reasoning as a factor
- We have too many enemies:
- Rhetorical strategies:
- Ad hominem attacks
- Misdirection
- Overconfident assertions
- Naive skepticism
- Why these strategies are effective:
- People want to believe, you just need to give them a fig leaf
- Frog-boiling and consistency bias
- Many actors are low-information and so have to go off vibes:
- Even many technical folk aren’t in a good position to directly evaluate the arguments
- Confidently making an argument projects the right vibes
- Many of these rhetorical strategies have a grain of truth
- Learned helplessness causes actors to not seek out information: what would they do with such knowledge? How would they even evaluate it?
- Incredibly hard for actors to admit how limited their perspective is <should admit some of my own biases>
- Discussion of various groups that oppose AI safety:
- Politicians
- Commerce departments
- Government departments
- Non-profits/activists
- AI ethicists
- AI labs
- AI startups
- Business consumers of AI
- Open source community
- AI academia
- Lawyers and social scientists
- Media
- We can’t rely on the public despite the polls:
- Too much apathy and learned helplessness
- AI safety community’s lack of experience in political knife fights
- Conclusion: Our influence is significantly limited
- Now here’s the kicker: An AI safety win condition is especially challenging as it requires dramatic and extremely strategic decision-making:
- Why dramatic action is required:
- Speed of changes undermines incremental adaptation
- Why we should plan for the offense-defense balance favoring the attacker:
- Because it likely does absent extremely costly measures
- We don’t want attackers to have an advantage at any point in time:
- Even short periods where attackers have the advantage could be really bad
- Uncertainty and the precautionary principle
- At minimum requires massive investment in safety, resilience and agile governance, but likely requires some compromises when it comes to freedom or privacy as well
- Relevance*: incremental change is likely insufficient
- Why extremely strategic decision-making is required:
- The best strategy is highly contingent
- There’s a high degree of uncertainty:
- Technological uncertainty
- The human element
- Timelines are likely short, giving us limited opportunities to learn and evolve
- Relevance: we’re too likely to need to win on particular points to be able either to accept whatever wins we can get or to accept the compromise solutions produced by a large coalition
- Why playing the game straightforwardly is likely insufficient:
- Two main strategies:
- Build a large coalition: it seems likely that we would need quite a few allies to be able to get our policies through despite the opposition of other interests <If you think this isn’t the case, I’d love to hear how we could get away with a small coalition and how we’d convince them to ally with us>:
- Building such a coalition would be challenging, but as we’re about to see, even if we achieved this, we would likely still find it hard to steer the world in positive directions
- Accept the dominance of other interests and eke out small, incremental wins where we can
- Conclusion: We need a dramatically different plan
Proposal: Plan for a dramatic shift in the strategic situation:
- The easiest way to demonstrate why this might be promising is via an example:
- The ChatGPT moment
- Are there any that have potential?
- Criteria:
- Likelihood of occurring
- Degree of possible strategic advantage
- Difficulty in seizing that advantage
- We previously discussed the downsides of a large coalition. Seizing these opportunities generally requires some kind of coalition, but many of these situations may dramatically empower certain actors in a way that enables us to join together in a small coalition rather than joining a large coalition
- Maybe represent these in a table
- AI disaster:
- Coalition:
- National security folks, concerned public
- Likelihood:
- Given open-source proliferation, I expect to see a dramatic misuse of AI at some point
- Degree of possible strategic advantage:
- Could easily create a situation where politicians feel they need to take strong action, even if this is “objectively” an overreaction to some extent
- Difficulty in seizing that advantage:
- Most natural scenario for the AI safety community to seize the reins
- Escalation of great power conflict:
- Coalition: National security folk, concerned public
- Likelihood: Looks far too likely
- Degree of possible strategic advantage:
- Joe Biden previously called for a Manhattan project focusing on both safety and capabilities, but that order has now been rolled back* and it was evident that safety was secondary in any case
- Difficulty in seizing that advantage
- Trump admin :-(. Likely too busy racing to pay attention to security
- Fears of mass unemployment have some potential, but are not ideal:
- Coalition: Unions, left activists?, concerned public
- If these fears become sufficiently salient, we might be able to just ally with this particular movement rather than having to build a big coalition
- Likelihood: Will almost certainly happen at some point, but will it happen after RSI? Quite possibly might be too late
- Degree of possible strategic advantage:
- Might lead to extra funding and AI regulation, but unclear whether this might be enough to make a difference given that these would be secondary considerations
- Difficulty in seizing that advantage
- It might even prove counterproductive by taking up all of the oxygen or if key figures in this movement see catastrophic risks as a distraction
- Easy to overshoot and bring about a world where the least responsible actors develop AGI. That said, we are much less likely to have clashes with this group than with other groups
- Other considerations:
- Such actors are likely to grasp onto arguments about AI being unsafe without deeply understanding them:
- Less than ideal, but perhaps an unavoidable compromise
- I’m less confident in further ChatGPT moments:
- Big milestones like IMO silver medals, passing the Turing test or inference-time compute haven’t made larger numbers of people more worried
- Some combination of frog boiling/overwhelm
- Conclusion: Prepare primarily for an AI disaster and secondarily for mass unemployment
- Factors for success:
- Worth noting that the opportunity might be short:
- Because of how fast things move
- Because oppositional actors adapt their strategies
- Three factors:
- Planning
- Preparation:
- Bravery (willingness to say things that are currently controversial but will look good in retrospect)
- Execution:
- Qualities (assuming you have the right skills and resources in place):
- Agility - being able to deploy them quickly
- On-the-fly adjustments
- Of course even agility and on-the-fly adjustments can be seen as a matter of proper preparation
- What about/objections?:
- Is there even an explanation?
- What about growing the movement or building intellectual credibility?:
- I suspect that movement growth, ladder climbing and diffusion of arguments are too slow
- Gathering conclusive evidence that loss-of-control risks are significant is hard, and such evidence is easy to dismiss
- What about pursuing some 4d-chess strategy instead?:
- These could exist, but rarely work
- Planning for a dramatic shift involves some degree of risk, but not an unreasonable amount
- What if capabilities end up plateauing?:
- Then we can pivot more resources back towards growth. Resources spent pursuing this plan will likely still deliver value at some point
- I don’t believe the costs of this plan are unreasonable given the potential value provided
Attempted draft
The content below is nowhere near complete and may not be of very high quality. I included it anyway on the off-chance someone finds it helpful. However, the outline is probably the most useful thing here.
How the situation now sucks:
The Paris AI Action Summit was a wash
It's been clear for a while that the Paris AI Action Summit was going to be disappointing, but it turned out worse than I could have ever imagined.
We already knew France had banished safety from the name of the conference, relegating it from being the sole focus to just one of five key areas instead.
-------------
However, this went even further with the fully compromised Statement on Inclusive and Sustainable Artificial Intelligence for the People and the Planet.
Safety was reduced to a single paragraph that, if anything, undermines it:
“Harnessing the benefits of AI technologies to support our economies and societies depends on advancing Trust and Safety. We commend the role of the Bletchley Park AI Safety Summit and Seoul Summits that have been essential in progressing international cooperation on AI safety and we note the voluntary commitments launched there. We will keep addressing the risks of AI to information integrity and continue the work on AI transparency.”
Let’s break it down:
* First, safety is being framed as “trust and safety”.
These are not the same things.
The word trust appearing first is not as innocent as it seems: it suggests trust is the primary goal and safety is secondary to it.
This is a very commercial perspective: if people trust your product, you can trick them into buying it even if it isn't actually safe.
* Second, trust and safety are not framed as values important in and of themselves, but as subordinate to realising the benefits of these technologies, with the economic benefits unsurprisingly receiving special emphasis.
* Finally, the statement doesn’t commit to continuing to address these risks, but only narrowly to “addressing the risks of AI to information integrity” and “continue the work on AI transparency”.
In other words, they’re purposefully downplaying any other potential risks.
<Footnote: Some folks may feel that I should be interpreting this more charitably. However, as I note further down, one of the main organisers has pretty much admitted that they were explicitly trying to redirect the conversation>
-----------
In summary, the Paris AI Action Summit was completely compromised by other interests, most notably France's desire for a venue to promote its AI industry.
<Footnote: For comments from other figures, see Dario describing the summit as a missed opportunity, Demis noting the lack of discussions about 'what do we want the world [with AGI] to be like’ and Stuart Russell describing the conference as “negligence of an unprecedented magnitude”>
This is no accident: lead organiser Anne Bouverot has been very explicit about wanting to shift the narrative away from catastrophic risks, describing the previous focus on such risks as “a bit of a science fiction moment”.
-----------
Whilst the current US administration is very different ideologically, the one thing they do agree with the French on is ignoring safety concerns.
Not only did they absolutely refuse to sign any statement mentioning existential risk, but they went further and canceled the trips of the Homeland Security and U.S. AI Safety Institute representatives who had been planning on attending.
JD Vance’s speech was even worse. The very first words out of his mouth were: "I'm not here this morning to talk about AI safety, which was the title of the conference a couple of years ago. I'm here to talk about AI opportunity." Later he said: "The AI future is not going to be won by hand-wringing about safety".
In response to the American moves, the EU has even indicated that it wishes to withdraw its AI liability directive.
The only real bright spot in all this is the UK refusing to sign, but in protesting the compromised statement, they are essentially alone.
<58 countries signed the statement>
< A government spokesperson said: “We felt the declaration didn't provide enough practical clarity on global governance, nor sufficiently address harder questions around national security and the challenge AI poses to it”>
More generally AI safety has suffered serious setbacks
Unfortunately, the setbacks haven’t been limited to this conference; we’ve been taking hits all across the board.
----
Perhaps the first big shift* was the failed attempt to remove Sam Altman as CEO of OpenAI, which had been supported by various board members sympathetic to safety concerns.
He seems to have taken it quite personally, not only retaliating against the board that had removed him, but also minimizing the focus on the risks to the point where most of OpenAI’s old safety team lost faith in their ability to positively influence the company and quit.
------
Next, Rishi Sunak, who had championed the UK AI Safety Institute and Bletchley Park conference, lost the election in a historic defeat and was replaced by Keir Starmer.
<For reasons unrelated to safety>
Starmer is primarily focused on AI opportunities, but he hasn’t undermined the UK AI Safety Institute and, as previously noted, his government refused to sign the compromised summit statement.
-------
Another defeat occurred when frontier AI labs that had previously claimed to want legislation suddenly changed their tune when SB 1047 was on the table, even though the bill was significantly watered down.
Governor Newsom vetoed it after being lobbied by Nancy Pelosi and the local tech industry.
--------
More recently, Trump repealed the Biden order on the Safe, Secure and Trustworthy Development and Use of Artificial Intelligence, replacing it with the order Removing Barriers to American Leadership in Artificial Intelligence.
Unfortunately, we should expect still further setbacks given the influence of Vice President JD Vance, AI Czar David Sacks, Senior Policy Advisor Sriram Krishnan, billionaire Mark Zuckerberg and VC Marc Andreessen.
--------------
Taking a step back:
Things looked promising in the months after ChatGPT:
I think it’s worth taking a step back and recalling how promising society’s initial response to ChatGPT was so that we can see how things all went wrong.
Without knowing what happened, we won’t be able to chart a course forward.
Let’s go back to the last day of November 2022, when OpenAI released a technical demo with the boring name ChatGPT, unaware of the tsunami they were about to unleash.
While ChatGPT was actually less powerful in many ways than the raw model, its accessibility and ease of use woke up much of the world to the fact that AI was much further along than they had anticipated.
<See discussions of mode collapse vs. a base model>
<Many critics at the time noted that when you dug into it ChatGPT was more limited than it first appeared. Model capabilities are quite uneven in that you can’t assume that a model that can do X is as capable as a human who can do X. However, these critics mostly neglected to consider the possibility that capabilities might just not slow down>
---------------------
Many people saw their model of the world upended*.
In quick succession, we saw the Future of Life Institute Pause Letter, The Frontier AI Taskforce, US Senate hearings, The Center for AI Safety Statement on AI Risk, the Biden-Harris Voluntary Agreements and then the Bletchley Park Summit.
For those who were following, it was a crazy period to live through, when the Overton Window burst right open and the community’s long-dismissed fears were finally being taken seriously.
But then things started to change:
For a brief moment in time, it seemed that humanity might actually wake up and decide to become responsible*, but then all that started to change.
In the timeline of setbacks that I recounted* before, I chose to begin at the failed ouster of Sam Altman.
From a narrative perspective, it’s a particularly compelling place to start.
For a brief moment in time, it looked like the safety-aligned faction might actually gain control of OpenAI and this will always be one of history’s big what-ifs.
A world where Anthropic, DeepMind and OpenAI all had safety-focused leadership would have opened up a lot of possibilities.
Alas, it was not to be.
Overall situation and importance of charting a course forward
Instead, we’ve ended up in a world where most key actors only really feel a need to present a fig leaf of safety and the Trump administration doesn’t even feel the need to pretend.
This might have been sufficient, or at least survivable, if we were talking about a less powerful technology or if capabilities weren’t progressing so absurdly fast, but neither of these is the case.
If you could go back in time and talk to people in the community when AI safety seemed on the rise, I doubt many people would have predicted this reversal.
I’m sure there would have been a fair number of people who believed that there was a decent chance of “unknown unknowns” causing a reversal*, but there was very little common knowledge about the threats we had to watch out for, many of which now seem obvious in retrospect.
This is worrying.
There will likely be many more sudden ups and downs, and our situation is tight enough that we can’t afford to navigate them poorly.
If we want humanity to achieve a positive future, then we need to do better.
Preventing similar failures will be hard if we don’t even know what happened.
--------------------
Why did this happen?
Many different actors find AI risks inconvenient for a variety of different reasons, even if it’s just that they’re competing for the same funding or attention.
-----
These incentives can deeply affect people’s cognition, potentially reshaping it at multiple levels.
Those affected will not find themselves mentally diminishing the risks; rather, they will be less likely to think about whether there is anything they might have missed and less likely to notice anything when they do.
In situations where these risks could be relevant, such risks will be less likely to factor into their cognition, and they will amplify the importance of directing their attention to other aspects.
Even when they do think about the risks, they’ll be less likely to notice any potential moral or practical implications and those that they do see will be minimised.
They will struggle to come up with plans to address these risks and be more pessimistic about the plans they come up with succeeding.
They will be more likely to identify potential downsides of such plans, rate any such downsides as worse and the downsides will also be more likely to factor into their cognition.
Inputs are often affected as well.
They’ll be more skeptical of anything that could imply that the risks are more serious or that action is necessary and less skeptical of anything that implies the opposite.
Going further, these incentives often flow into meta-level views, such as who is trustworthy, what kind of evidence is reliable and how we should respond to potential risks.
-----
So as we’ve seen, there’s a whole host of ways in which incentives shape people’s cognition.
This will flow through to their words, but in a way that is further amplified by their incentives.
People are much less likely to say things against their interests, especially when they don’t think it’s that important anyway and there are so many other more pressing issues to discuss.
Even when they are willing to say such things, they will write or say less, make weaker claims, use weaker language, be more likely to emphasize any uncertainty and be less likely to press the point.
----
Various social factors amplify the impacts of these biases further.
----
Whilst most actors try to behave honorably, selfishness often plays some role too.
This is usually limited to minor violations, perhaps with an occasional major lapse.
Nonetheless, the effects can be quite significant as a small degree of selfishness can result in a large amount of harm when someone sees
Various social factors^ significantly amplify these effects.
< TODO: discuss these effects>
What factors?:
- Belief contagion
- Identity
- Oppositional effects
< TODO: list social factors I discuss >
Lots of different actors find AI safety inconvenient for various reasons, even if it's just competing for the same funding/attention, leading to them undermining it or engaging in motivated reasoning (some critics are reasonable)
While there are many different stories you could tell, each with its own grain of truth, I’ll be focusing on incentives and tribalism as driving forces.
These are deeply connected - incentives drive tribalism and tribalism creates incentives - yet they are also distinct - there are many non-tribal incentives and tribalism also intersects things like sacrifice or epistemics*.
I will first consider incentives, then I’ll argue that we need to go beyond this
But before I start, I want to clarify what I mean by incentives. Often when people think about incentives, their mind tends towards assuming that these are tangible benefits like money, sex or some kind of official recognition.
However, these incentives can often be much more subtle.
Incentives
If there is any community that would be able to appreciate the implications of incentives, you might think it would be ours.
Robin Hanson, who has argued that essentially everything people do is about signaling - or the result of social incentives - has hung around the rationality community since the very early days and was one of the participants in the first major AI safety debate.
< Concerns about AI Safety diffused from the rationalist community to the Effective Altruism community which seeded the AI Safety Community which is now really its own beast>
Yet this is precisely what I am going to claim.
--------
A core reason why the Rationalist and Effective Altruism communities have been so generative is the assumption of good faith that members tend to grant to each other.
Since people are often incredibly biased, being open-minded requires a degree of engaging people in good faith even when you have strong reasons to believe that this is unwarranted.
This has allowed these communities to be intellectually vibrant, but everything has a downside and in this case, the cost is a certain amount of naivety.
The AI safety community has inherited this naivety, with one of the clearest examples being that we were far too trusting that Sam Altman would be a responsible actor.
< In rationalist lingo terms, I could say that we have a strong natural tendency towards understanding situations in terms of mistake theory when conflict theory would provide a more accurate analysis. Nonetheless, I decided against using these terms because they typically are taken to imply particular strategies and not just particular analyses. Further, both these theories tend to be taken as universalist, rather than as being contextual >
In terms of why we tend to underrate incentives, I suspect that a key reason is that we tend to imagine incentives as affecting singular decisions, when they are better modeled as a pressure to drift over time
< Scott Alexander’s Legend of Murder-Gandhi is a great illustration of this >
Additionally, properly analyzing the effects of incentives requires accounting for the interactions of this with tribalism, rhetoric and consistency bias.
Is there even an explanation?
Before we even begin trying to explain why we ended up here, it is worthwhile to pause and consider the possibility that this is all just bad luck and there’s nothing we could have done.
I don’t think this is true, but it’s also not without merit.
Several events, such as France being awarded the next conference, Sam Altman surviving the coup, or the tech right allying with Trump and pulling out an unexpected victory, could easily have gone completely differently.
Sources:
Quote: “We can no longer afford this pantomime of progress. The age of polite summit declarations and non-binding commitments must end”
When you're in a hole*, it's natural to ask how you ended up there, as that is often key to figuring out the path forward. I've found it useful to default towards assuming causes are multifactorial, but if I had to reduce it to a simple story, I'd explain it like this: “Catastrophic AI risks are awfully inconvenient for a large number of actors, even if it's just because people don't want to have to imagine depressing outcomes or because these actors are competing for resources or attention. This results in these actors engaging in motivated reasoning to convince themselves such risks are overblown. They are then incentivised to intentionally undermine AI safety. Sometimes this is more a matter of indifference and structural incentives”.
{It is worthwhile considering some of the other factors. For a long time, academics and policy makers could plausibly assert that concern about catastrophic risks was an extreme minority position unworthy of serious discussion. Even though polls* have shown that the majority of folks in Western countries are skeptical of AI, apathy and learned helplessness mean that few people are willing to take action, even if it's just taking a candidate's views on AI into account when voting. Additionally, maybe folks throw up their hands and say who knows, forgetting that not knowing whether AI is dangerous doesn't mean we can assume it is safe}
I wish it wasn't necessary, but I need to add a disclaimer: I'm not saying that all critics of catastrophic risks are falling prey to cognitive biases. Individuals and communities who fall* into making such a convenient assumption tend to fatally damage their epistemics. Nonetheless, whilst it would be nice to believe that everyone is more or less a rational agent, making this assumption when it can easily be falsified* may help ward off narrow-mindedness, but it undermines one’s ability to think strategically.
If we want to understand why AI safety is on the backfoot, I think it is especially pertinent to consider the rhetoric used to justify such actions, as folk are reluctant to endure criticism without at least some kind of fig leaf to hide behind.
Four categories of rhetoric {in the broadest sense} stand out: ad hominem attacks, misdirection, overconfident assertions and naive skepticism. Motivated
IGNORE CONTENT BELOW THIS
Responding to safety washing
Define safety washing
Understanding the situation:
• There was bunch of attention towards safety after ChatGPT, that's now been undermined
• Counter-narratives: AI as hype, stochastic parrots, Deep Learning hitting a wall, need to race against China, we need to focus on real concrete issues vs. theoretical issues, AI risks as sci-fi, brand safety as AI safety, AI risks as being assumed to be manageable by default, AI safety as regulatory capture, ad hominem attacks on EA, ai safety as "woke", risk of missing out on the economic benefits, AI safety as white, straight male techbros, narrow focus of previous conferences, open source as good, but economics arguments
• What happened? Lots of different actors find AI safety inconvenient for various reasons, even if it's just competing for the same funding/attention, leading to them undermining it or engaging in motivated reasoning (some critics are reasonable)
• Secondarily Rishi was voted out and there's no replacement for him on the world stage
• Unfortunately, the AI Safety community is less experienced in political knife fights and we are also limited in terms of the actions we can take by the requirement of maintaining our epistemics
Response:
• We need to recognise political realities: we are currently outgunned. Navigating the transition to advanced AI well likely requires a willingness to take quite drastic action, particularly if the situation is attacker-privileged
• We can try building up our movement, and some work should be going there, but that takes time which we most likely don't have. We can likely make some incremental progress through alliances, but the changes we can get through will likely be vastly insufficient. Some folk should be working on this, but I suspect we should downgrade the importance of this kind of incremental work given the tail winds against it.
• I now think that our main bet should be on some kind of AI disaster shifting the Overton window.
• AI advances will wake up some folk here and there, but I'm dubious about betting on another ChatGPT moment as people have become inured to progress. Very few people paid attention to the IMO silver medal, Inference Time Compute/DeepSeek got attention, but didn't really shift many folk towards the safety side. Most folk have already chosen a side and are reluctant to shift.
• We should also try talking about the importance of safety for national security, but it might be a hard sell with the Trump administration. Biden saw them as linked, but I expect Trump to prioritise capabilities
• AI disaster strategy: Leveraging the opportunity (notice Covid and how we didn't even shut down Gain of Function research), building capacity, pre-planning, actual execution. Incremental work should take this strategy into account and try to put us in a good position for when such a strategy kicks in, for example, useful for Australia to have an AISI, agile mechanisms.
• More shift towards MIRI's strategy of saying true things and waiting for society to catch up to us
•What if capabilities plateau? Seems unlikely, but we can pivot more towards long-term growth if this happens.
• What about public polling? Could potentially overrule interest groups, but only matters to the extent that this affects people’s votes, most people vote based upon their everyday life. An inactive public can be overpowered by interest groups (lots of examples, even of single interest groups). Hard to get the public active without some kind of disaster, even then, unclear that it leads to the kind of nuanced action we need.
Unnecessary?
Big picture???:
Zooming out, the big picture is as follows:
It’s clear that the Trump administration feels no need to pretend to care about safety.
Whilst most other actors aren’t that brazen, they seem to care much more about being seen as responsible than about actually taking the risks seriously.
This naturally leads them to engage in safety washing, such as by half-measures or focusing on the risks that are less inconvenient for them. Unfortunately, this is unlikely to cut it.
Jacob Watts🔸 @ 2025-10-13T20:43 (+3)
Ambitious stuff indeed! There's a lot going on here.
I really appreciate discussions about "big picture strategy about avoiding misalignment".
For starters, in my opinion, solving technical alignment and control such that one could elicit the main benefits of having a "superintelligent servant" are merely one threat model / AGI-driven challenge. That said, ofc, getting that sort of thing right also basically means the rest of the planning is better left to someone else and if you are willing to additionally postulate a strong "decisive strategic advantage" is basically also a win condition for whatever else you could want.
I would point to eg.
- robo powered ultra tyranny
- gradual disempowerment / full unemployment / the intelligence curse
- misinformation slop / mass psychosis
- terrorism, scammers, and a flood of many competent robo psychopaths
- machines feeling pain and robot rights
- accelerated R&D and needing to adapt at machine speeds
as issues that can all still bite more or less even in worlds where you get some level "alignment" esp. if you operationalize alignment as more ~"robust instruction tuning++" rather than ~"optimizing for the true moral law itself".
That said, takeover by rogue models or systems of models is a super salient threat model in any world where machines are being made to "think" more and better.
I found your list of competing framings which cut against AI Safety quite compelling. Safety washing is indeed all over the place. One thing I didn't see noted specifically is that a pretty significant contingent within EA / AI Safety works pretty actively on apologetics for hyperscalers because they directly financially benefit and/or they have kind of groomed themselves into being the kind of person who can work at a top AI Safety lab.
To draw contrast with how this might have been. You don't, for example, see many EAs working at and founding hot new synthetic virology companies in order to "do biocontainment better than the competitors". Ostensibly, there could be a similar grim logic of inevitability and a sense that "we ought to do all of the virology experiments first and more responsibly". Then, idk, once we've built really powerful AIs or learned everything about virology, we can use this to exit the time of troubles. I don't actually know what eg. Anthropic's plan is for this, but in the case of synthetic virology / gain of function research you might imagine that once you've learned the right stuff about all the potential pathogens, you would be super duper prepared to stop them with all your new medical interventions.
Like, I guess I am just noting my surprise at not seeing good old "keeping safety at the frontier" / "racing through a minefield" Anthropic show up more in a screed about safety washing. The EA/rat space in general is one of the few places where catastrophic risk from AI is a priority and the conflicts of interest here literally could not run deeper. This whole place is largely funded by one of the Meta cofounders and there are a lot of very influential EAs with a lot of personal connection to and complete financial exposure to existing AI companies. This place was a safety adjacent trade show before it was cool lol.
Lots of loving people here who really care on some level I'm sure, but if we are talking about mixed signals, then I would reconsider the mote in our team's eye lol.
***
Beyond that, I guess there is the matter of timelines.
I do not share your confidence in short timelines and think interventions that take a while to pay off can be super worthwhile.
Also, idk, I feel like the assumption that it is all right around the corner and that any day now the singularity is about to happen is really central to the views of a lot of people into x-safety in a way that might explain part of why the worldview kind of struggles to spread outside the relatively limited pool of people who are open to that.
I don't know what you'd call marginal or just fiddling around the edges because I would agree that it is bad if we don't do enough soon enough and someone builds a lethally intelligent super mind and it does rise up and game over.
Maybe the only way to really push for x-safety is with If Anyone Builds It style "you too should believe in and seek to stop the impending singularity" outreach. That just feels like such a tough sell even if people would believe in the x-safety conditional on believing in the singularity. Agh. I'm conflicted here. No idea.
I would love it if we could do more to ally with people who do not see the singularity as being particularly near without things descending into idle "safety washing" nor "trust and safety"-level corporate bullshit.
Like the "AI is insane hype" contingency has some real stuff going for them too. I don't think they are all just blind. In my humble opinion, I also think Sam Altman looks like an asshole when he calls ChatGPT "PhD level" and talks about it doing "new science". You know, in some sense, if we're just being cute, then Wikipedia has been PhD level for a while now and it makes less shit up. There is a lot of hype. These people are marketing and sometimes they get excited.
Plus, it gives me bad vibes when I am trying to push for x-safety and I encounter (often quite justified) skepticism about the power levels of current LLMs and I end up basically just having to do marketing work or whatever for model providers. Idk.
I'm pretty sure LLM providers aren't even profitable at this point and general robotics isn't obviously much more "right around the corner" than it would've seemed to disinterested layperson over the past few decades. I'm conflicted on this stuff; idk how much effort should go into "singularity is near" vs "if singularity, then doom by default".
Red lines and RSPs are actually probably a pretty good way of unifying "singularity near" x-safety people with "singularity far" or even "singularity who?" x-safety allies.
***
As far as strategic takeaways:
I do think it is good sense to "be ready" and have good ideas "sitting around" for when they are needed. I believe there was a recent UN general assembly where world leaders were literally asking around for, like, ideas for AI red lines. If this is a world where intelligent machines are rising, then there is a good chance we continue to see signs of that (until we don't). The natural tide of "oh shit guys" and "wow this is real" may be attenuated somewhat by frog boiling effects, but still. Also, the weirdness of AI Safety regulation and such under consideration will benefit from frog boiling.
Preparedness seems like a great idle time activity when the space isn't receiving the love/attention it deserves :) .
Chris Leong @ 2025-10-14T12:58 (+2)
Thanks for the detailed comments.
Maybe the only way to really push for x-safety is with If Anyone Builds It style "you too should believe in and seek to stop the impending singularity" outreach. That just feels like such a tough sell even if people would believe in the x-safety conditional on believing in the singularity. Agh. I'm conflicted here. No idea.
I wish I had more strategic clarity here.
I believe there was a recent UN general assembly where world leaders were literally asking around for, like, ideas for AI red lines.
I would be surprised if anything serious comes out of this immediately, but I really like this framing because it normalises the idea that we should have red lines.
SummaryBot @ 2025-10-14T15:52 (+2)
Executive summary: An exploratory, somewhat urgent argument that AI safety is losing ground to “safety washing” (performative, low-cost nods to safety) and entrenched incentives; the author contends incremental coalition-building is unlikely to suffice and urges preparing for moments of sharp Overton-window shift—most plausibly an AI disaster (secondarily mass unemployment)—by building plans, capacity, and agility to seize those openings.
Key points:
- Diagnosis: The Paris AI Action conference and recent policy/lab developments exemplify “safety washing,” where institutions foreground trust/brand or narrow risks while sidelining catastrophic-risk mitigation; overall, AI safety has suffered setbacks across governments, labs, and legislation.
- Incentives & rhetoric: Many actors (politicians, labs, startups, open-source, media, etc.) find AI-safety claims inconvenient and engage in motivated reasoning; common tactics include ad hominem, misdirection, overconfident assertions, and naïve skepticism—effective because they offer low-information audiences a “fig leaf.”
- Why incrementalism struggles: The community is “outgunned,” timelines may be short, and the offense–defense balance could favor attackers; large coalitions are slow and compromise-prone, while eked-out incremental wins are likely insufficient for the level/speed of risk.
- Strategic proposal: Shift primary planning toward leveraging rare moments of dramatic advantage—especially an AI disaster (and, secondarily, salient unemployment shocks)—that could force decisive policy; success depends on pre-planning, capacity, bravery (saying unfashionable truths), and rapid execution.
- Contingencies & uncertainties: ChatGPT-style “wake-ups” may no longer move opinion; capabilities could plateau (in which case pivot back to growth/credibility/movement-building); national-security framings may or may not resonate with current U.S. leadership.
- Implication for the AI-safety community: Maintain epistemic standards but get more politically realistic; invest now in preparedness for window-shifting events (e.g., agile governance mechanisms, AISI-like capacity) while downgrading expectations for near-term broad-coalition wins.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.