Joe_Carlsmith

Working on Claude's values at Anthropic. Former senior advisor at Coefficient Giving (then Open Philanthropy). Doctorate in philosophy at the University of Oxford. Opinions my own.

Posts

Video and transcript of talk on human-like-ness in AI safety
by Joe_Carlsmith @ 2025-12-17 | +14 | 0 comments
How human-like do safe AI motivations need to be?
by Joe_Carlsmith @ 2025-11-12 | +26 | 0 comments
Leaving Open Philanthropy, going to Anthropic
by Joe_Carlsmith @ 2025-11-03 | +141 | 0 comments
Controlling the options AIs can pursue
by Joe_Carlsmith @ 2025-09-29 | +9 | 0 comments
Video and transcript of talk on giving AIs safe motivations
by Joe_Carlsmith @ 2025-09-22 | +10 | 0 comments
Giving AIs safe motivations
by Joe_Carlsmith @ 2025-08-18 | +22 | 0 comments
Video and transcript of talk on "Can goodness compete?"
by Joe_Carlsmith @ 2025-07-17 | +34 | 0 comments
Video and transcript of talk on AI welfare
by Joe_Carlsmith @ 2025-05-22 | +22 | 0 comments
The stakes of AI moral status
by Joe_Carlsmith @ 2025-05-21 | +54 | 0 comments
Video and transcript of talk on automating alignment research
by Joe_Carlsmith @ 2025-04-30 | +11 | 0 comments
Can we safely automate alignment research?
by Joe_Carlsmith @ 2025-04-30 | +13 | 0 comments
AI for AI safety
by Joe_Carlsmith @ 2025-03-14 | +34 | 0 comments
Paths and waystations in AI safety
by Joe_Carlsmith @ 2025-03-11 | +22 | 0 comments
When should we worry about AI power-seeking?
by Joe_Carlsmith @ 2025-02-19 | +21 | 0 comments
What is it to solve the alignment problem?
by Joe_Carlsmith @ 2025-02-13 | +25 | 0 comments
How do we solve the alignment problem?
by Joe_Carlsmith @ 2025-02-13 | +38 | 0 comments
Fake thinking and real thinking
by Joe_Carlsmith @ 2025-01-28 | +78 | 0 comments
Takes on "Alignment Faking in Large Language Models"
by Joe_Carlsmith @ 2024-12-18 | +72 | 0 comments
Incentive design and capability elicitation
by Joe_Carlsmith @ 2024-11-12 | +9 | 0 comments
Option control
by Joe_Carlsmith @ 2024-11-04 | +11 | 0 comments
Motivation control
by Joe_Carlsmith @ 2024-10-30 | +18 | 0 comments
How might we solve the alignment problem? (Part 1: Intro, summary, ontology)
by Joe_Carlsmith @ 2024-10-28 | +18 | 0 comments
Video and transcript of presentation on Otherness and control in the age of AGI
by Joe_Carlsmith @ 2024-10-08 | +18 | 0 comments
What is it to solve the alignment problem? (Notes)
by Joe_Carlsmith @ 2024-08-24 | +32 | 0 comments
Value fragility and AI takeover
by Joe_Carlsmith @ 2024-08-05 | +39 | 0 comments
A framework for thinking about AI power-seeking
by Joe_Carlsmith @ 2024-07-24 | +48 | 0 comments
Loving a world you don’t trust
by Joe_Carlsmith @ 2024-06-18 | +65 | 0 comments
On “first critical tries” in AI alignment
by Joe_Carlsmith @ 2024-06-05 | +29 | 0 comments
On attunement
by Joe_Carlsmith @ 2024-03-25 | +28 | 0 comments
Video and transcript of presentation on Scheming AIs
by Joe_Carlsmith @ 2024-03-22 | +23 | 0 comments
On green
by Joe_Carlsmith @ 2024-03-21 | +61 | 0 comments
On the abolition of man
by Joe_Carlsmith @ 2024-01-18 | +71 | 0 comments
Being nicer than Clippy
by Joe_Carlsmith @ 2024-01-16 | +26 | 0 comments
An even deeper atheism
by Joe_Carlsmith @ 2024-01-11 | +26 | 0 comments
Does AI risk “other” the AIs?
by Joe_Carlsmith @ 2024-01-09 | +23 | 0 comments
When "yang" goes wrong
by Joe_Carlsmith @ 2024-01-08 | +57 | 0 comments
Deep atheism and AI risk
by Joe_Carlsmith @ 2024-01-04 | +65 | 0 comments
Gentleness and the artificial Other
by Joe_Carlsmith @ 2024-01-02 | +90 | 0 comments
Otherness and control in the age of AGI
by Joe_Carlsmith @ 2024-01-02 | +37 | 0 comments
Empirical work that might shed light on scheming (Section 6 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-11 | +7 | 0 comments
Summing up "Scheming AIs" (Section 5)
by Joe_Carlsmith @ 2023-12-09 | +9 | 0 comments
Speed arguments against scheming (Section 4.4-4.7 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-12-08 | +6 | 0 comments
Simplicity arguments for scheming (Section 4.3 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-07 | +6 | 0 comments
The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-06 | +9 | 0 comments
Arguments for/against scheming that focus on the path SGD takes (Section 3 of...
by Joe_Carlsmith @ 2023-12-05 | +7 | 0 comments
Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-04 | +12 | 0 comments
Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming...
by Joe_Carlsmith @ 2023-12-03 | +6 | 0 comments
The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-02 | +6 | 0 comments
How useful for alignment-relevant work are AIs with short-term goals? (Section 2..
by Joe_Carlsmith @ 2023-12-01 | +6 | 0 comments
Is scheming more likely in models trained to have long-term goals? (Sections 2.2..
by Joe_Carlsmith @ 2023-11-30 | +6 | 0 comments
“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-29 | +7 | 0 comments
Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-28 | +8 | 0 comments
Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-27 | +11 | 0 comments
Situational awareness (Section 2.1 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-26 | +12 | 0 comments
On “slack” in training (Section 1.5 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-25 | +14 | 0 comments
Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-24 | +10 | 0 comments
A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-22 | +6 | 0 comments
Varieties of fake alignment (Section 1.1 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-21 | +6 | 0 comments
New report: "Scheming AIs: Will AIs fake alignment during training in order to...
by Joe_Carlsmith @ 2023-11-15 | +71 | 0 comments
Superforecasting the premises in “Is power-seeking AI an existential risk?”
by Joe_Carlsmith @ 2023-10-18 | +114 | 0 comments
In memory of Louise Glück
by Joe_Carlsmith @ 2023-10-15 | +22 | 0 comments
The “no sandbagging on checkable tasks” hypothesis
by Joe_Carlsmith @ 2023-07-31 | +16 | 0 comments
Predictable updating about AI risk
by Joe_Carlsmith @ 2023-05-08 | +135 | 0 comments
[Linkpost] Shorter version of report on existential risk from power-seeking AI
by Joe_Carlsmith @ 2023-03-22 | +49 | 0 comments
A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my...
by Joe_Carlsmith @ 2023-02-21 | +64 | 0 comments
Seeing more whole
by Joe_Carlsmith @ 2023-02-17 | +124 | 0 comments
Why should ethical anti-realists do ethics?
by Joe_Carlsmith @ 2023-02-16 | +118 | 0 comments
[Linkpost] Human-narrated audio version of "Is Power-Seeking AI an Existential...
by Joe_Carlsmith @ 2023-01-31 | +9 | 0 comments
On sincerity
by Joe_Carlsmith @ 2022-12-23 | +46 | 0 comments
Against meta-ethical hedonism
by Joe_Carlsmith @ 2022-12-02 | +27 | 0 comments
Against the normative realist's wager
by Joe_Carlsmith @ 2022-10-13 | +26 | 0 comments
Video and Transcript of Presentation on Existential Risk from Power-Seeking AI
by Joe_Carlsmith @ 2022-05-08 | +97 | 0 comments
On expected utility, part 4: Dutch books, Cox, and Complete Class
by Joe_Carlsmith @ 2022-03-24 | +7 | 0 comments
On expected utility, part 3: VNM, separability, and more
by Joe_Carlsmith @ 2022-03-22 | +9 | 0 comments
On expected utility, part 2: Why it can be OK to predictably lose
by Joe_Carlsmith @ 2022-03-18 | +8 | 0 comments
On expected utility, part 1: Skyscrapers and madmen
by Joe_Carlsmith @ 2022-03-16 | +22 | 0 comments
Simulation arguments
by Joe_Carlsmith @ 2022-02-18 | +39 | 0 comments
On infinite ethics
by Joe_Carlsmith @ 2022-01-31 | +96 | 0 comments
The ignorance of normative realism bot
by Joe_Carlsmith @ 2022-01-18 | +25 | 0 comments
Morality and constrained maximization, part 2
by Joe_Carlsmith @ 2022-01-12 | +9 | 0 comments
Morality and constrained maximization, part 1
by Joe_Carlsmith @ 2021-12-22 | +13 | 0 comments
Reviews of "Is power-seeking AI an existential risk?"
by Joe_Carlsmith @ 2021-12-16 | +71 | 0 comments
Anthropics and the Universal Distribution
by Joe_Carlsmith @ 2021-11-28 | +18 | 0 comments
On the Universal Distribution
by Joe_Carlsmith @ 2021-10-29 | +25 | 0 comments
SIA > SSA, part 4: In defense of the presumptuous philosopher
by Joe_Carlsmith @ 2021-10-01 | +9 | 0 comments
SIA > SSA, part 3: An aside on betting in anthropics
by Joe_Carlsmith @ 2021-10-01 | +10 | 0 comments
SIA > SSA, part 2: Telekinesis, reference classes, and other scandals
by Joe_Carlsmith @ 2021-10-01 | +10 | 0 comments
SIA > SSA, part 1: Learning from the fact that you exist
by Joe_Carlsmith @ 2021-10-01 | +17 | 0 comments
Can you control the past?
by Joe_Carlsmith @ 2021-08-27 | +46 | 0 comments
In search of benevolence (or: what should you get Clippy for Christmas?)
by Joe_Carlsmith @ 2021-07-20 | +17 | 0 comments
On the limits of idealized values
by Joe_Carlsmith @ 2021-06-22 | +80 | 0 comments
Draft report on existential risk from power-seeking AI
by Joe_Carlsmith @ 2021-04-28 | +88 | 0 comments
Problems of evil
by Joe_Carlsmith @ 2021-04-19 | +31 | 0 comments
The innocent gene
by Joe_Carlsmith @ 2021-04-05 | +16 | 0 comments
The importance of how you weigh it
by Joe_Carlsmith @ 2021-03-29 | +44 | 0 comments
On future people, looking back at 21st century longtermism
by Joe_Carlsmith @ 2021-03-22 | +102 | 0 comments
Against neutrality about creating happy lives
by Joe_Carlsmith @ 2021-03-15 | +95 | 0 comments
Care and demandingness
by Joe_Carlsmith @ 2021-03-08 | +54 | 0 comments
Subjectivism and moral authority
by Joe_Carlsmith @ 2021-03-01 | +15 | 0 comments
Two types of deference
by Joe_Carlsmith @ 2021-02-22 | +10 | 0 comments
Contact with reality
by Joe_Carlsmith @ 2021-02-15 | +47 | 0 comments
Killing the ants
by Joe_Carlsmith @ 2021-02-07 | +232 | 0 comments
Believing in things you cannot see
by Joe_Carlsmith @ 2021-02-01 | +15 | 0 comments
On clinging
by Joe_Carlsmith @ 2021-01-24 | +29 | 0 comments
A ghost
by Joe_Carlsmith @ 2021-01-21 | +2 | 0 comments
Actually possible: thoughts on Utopia
by Joe_Carlsmith @ 2021-01-18 | +86 | 0 comments
Alienation and meta-ethics (or: is it possible you should maximize helium?)
by Joe_Carlsmith @ 2021-01-15 | +19 | 0 comments
The impact merge
by Joe_Carlsmith @ 2021-01-13 | +29 | 0 comments
Shouldn't it matter to the victim?
by Joe_Carlsmith @ 2021-01-11 | +22 | 0 comments
Thoughts on personal identity
by Joe_Carlsmith @ 2021-01-08 | +21 | 0 comments
Grokking illusionism
by Joe_Carlsmith @ 2021-01-06 | +29 | 0 comments
The despair of normative realism bot
by Joe_Carlsmith @ 2021-01-03 | +34 | 0 comments
Thoughts on being mortal
by Joe_Carlsmith @ 2021-01-01 | +60 | 0 comments
Wholehearted choices and "morality as taxes"
by Joe_Carlsmith @ 2020-12-21 | +91 | 0 comments