Joe_Carlsmith

Senior research analyst at Open Philanthropy. Doctorate in philosophy at the University of Oxford. Opinions my own.

Posts

Takes on "Alignment Faking in Large Language Models"
by Joe_Carlsmith @ 2024-12-18 | +63 | 0 comments
Incentive design and capability elicitation
by Joe_Carlsmith @ 2024-11-12 | +9 | 0 comments
Option control
by Joe_Carlsmith @ 2024-11-04 | +11 | 0 comments
Motivation control
by Joe_Carlsmith @ 2024-10-30 | +18 | 0 comments
How might we solve the alignment problem? (Part 1: Intro, summary, ontology)
by Joe_Carlsmith @ 2024-10-28 | +18 | 0 comments
Video and transcript of presentation on Otherness and control in the age of AGI
by Joe_Carlsmith @ 2024-10-08 | +18 | 0 comments
What is it to solve the alignment problem?
by Joe_Carlsmith @ 2024-08-24 | +32 | 0 comments
Value fragility and AI takeover
by Joe_Carlsmith @ 2024-08-05 | +38 | 0 comments
A framework for thinking about AI power-seeking
by Joe_Carlsmith @ 2024-07-24 | +44 | 0 comments
Loving a world you don’t trust
by Joe_Carlsmith @ 2024-06-18 | +65 | 0 comments
On “first critical tries” in AI alignment
by Joe_Carlsmith @ 2024-06-05 | +29 | 0 comments
On attunement
by Joe_Carlsmith @ 2024-03-25 | +27 | 0 comments
Video and transcript of presentation on Scheming AIs
by Joe_Carlsmith @ 2024-03-22 | +23 | 0 comments
On green
by Joe_Carlsmith @ 2024-03-21 | +61 | 0 comments
On the abolition of man
by Joe_Carlsmith @ 2024-01-18 | +71 | 0 comments
Being nicer than Clippy
by Joe_Carlsmith @ 2024-01-16 | +25 | 0 comments
An even deeper atheism
by Joe_Carlsmith @ 2024-01-11 | +25 | 0 comments
Does AI risk “other” the AIs?
by Joe_Carlsmith @ 2024-01-09 | +22 | 0 comments
When "yang" goes wrong
by Joe_Carlsmith @ 2024-01-08 | +56 | 0 comments
Deep atheism and AI risk
by Joe_Carlsmith @ 2024-01-04 | +64 | 0 comments
Gentleness and the artificial Other
by Joe_Carlsmith @ 2024-01-02 | +89 | 0 comments
Otherness and control in the age of AGI
by Joe_Carlsmith @ 2024-01-02 | +37 | 0 comments
Empirical work that might shed light on scheming (Section 6 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-11 | +7 | 0 comments
Summing up "Scheming AIs" (Section 5)
by Joe_Carlsmith @ 2023-12-09 | +9 | 0 comments
Speed arguments against scheming (Section 4.4-4.7 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-12-08 | +6 | 0 comments
Simplicity arguments for scheming (Section 4.3 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-07 | +6 | 0 comments
The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-06 | +9 | 0 comments
Arguments for/against scheming that focus on the path SGD takes (Section 3 of...
by Joe_Carlsmith @ 2023-12-05 | +7 | 0 comments
Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-04 | +12 | 0 comments
Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming...
by Joe_Carlsmith @ 2023-12-03 | +6 | 0 comments
The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-02 | +6 | 0 comments
How useful for alignment-relevant work are AIs with short-term goals? (Section 2...
by Joe_Carlsmith @ 2023-12-01 | +6 | 0 comments
Is scheming more likely in models trained to have long-term goals? (Sections 2.2...
by Joe_Carlsmith @ 2023-11-30 | +6 | 0 comments
“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-29 | +7 | 0 comments
Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-28 | +8 | 0 comments
Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-27 | +11 | 0 comments
Situational awareness (Section 2.1 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-26 | +12 | 0 comments
On “slack” in training (Section 1.5 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-25 | +14 | 0 comments
Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-24 | +10 | 0 comments
A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-22 | +6 | 0 comments
Varieties of fake alignment (Section 1.1 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-21 | +6 | 0 comments
New report: "Scheming AIs: Will AIs fake alignment during training in order to...
by Joe_Carlsmith @ 2023-11-15 | +71 | 0 comments
Superforecasting the premises in “Is power-seeking AI an existential risk?”
by Joe_Carlsmith @ 2023-10-18 | +114 | 0 comments
In memory of Louise Glück
by Joe_Carlsmith @ 2023-10-15 | +22 | 0 comments
The “no sandbagging on checkable tasks” hypothesis
by Joe_Carlsmith @ 2023-07-31 | +10 | 0 comments
Predictable updating about AI risk
by Joe_Carlsmith @ 2023-05-08 | +130 | 0 comments
[Linkpost] Shorter version of report on existential risk from power-seeking AI
by Joe_Carlsmith @ 2023-03-22 | +49 | 0 comments
A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my...
by Joe_Carlsmith @ 2023-02-21 | +64 | 0 comments
Seeing more whole
by Joe_Carlsmith @ 2023-02-17 | +123 | 0 comments
Why should ethical anti-realists do ethics?
by Joe_Carlsmith @ 2023-02-16 | +118 | 0 comments
[Linkpost] Human-narrated audio version of "Is Power-Seeking AI an Existential...
by Joe_Carlsmith @ 2023-01-31 | +9 | 0 comments
On sincerity
by Joe_Carlsmith @ 2022-12-23 | +46 | 0 comments
Against meta-ethical hedonism
by Joe_Carlsmith @ 2022-12-02 | +27 | 0 comments
Against the normative realist's wager
by Joe_Carlsmith @ 2022-10-13 | +27 | 0 comments
Video and Transcript of Presentation on Existential Risk from Power-Seeking AI
by Joe_Carlsmith @ 2022-05-08 | +97 | 0 comments
On expected utility, part 4: Dutch books, Cox, and Complete Class
by Joe_Carlsmith @ 2022-03-24 | +7 | 0 comments
On expected utility, part 3: VNM, separability, and more
by Joe_Carlsmith @ 2022-03-22 | +8 | 0 comments
On expected utility, part 2: Why it can be OK to predictably lose
by Joe_Carlsmith @ 2022-03-18 | +8 | 0 comments
On expected utility, part 1: Skyscrapers and madmen
by Joe_Carlsmith @ 2022-03-16 | +22 | 0 comments
Simulation arguments
by Joe_Carlsmith @ 2022-02-18 | +39 | 0 comments
On infinite ethics
by Joe_Carlsmith @ 2022-01-31 | +94 | 0 comments
The ignorance of normative realism bot
by Joe_Carlsmith @ 2022-01-18 | +25 | 0 comments
Morality and constrained maximization, part 2
by Joe_Carlsmith @ 2022-01-12 | +9 | 0 comments
Morality and constrained maximization, part 1
by Joe_Carlsmith @ 2021-12-22 | +13 | 0 comments
Reviews of "Is power-seeking AI an existential risk?"
by Joe_Carlsmith @ 2021-12-16 | +71 | 0 comments
Anthropics and the Universal Distribution
by Joe_Carlsmith @ 2021-11-28 | +18 | 0 comments
On the Universal Distribution
by Joe_Carlsmith @ 2021-10-29 | +25 | 0 comments
SIA > SSA, part 4: In defense of the presumptuous philosopher
by Joe_Carlsmith @ 2021-10-01 | +8 | 0 comments
SIA > SSA, part 3: An aside on betting in anthropics
by Joe_Carlsmith @ 2021-10-01 | +10 | 0 comments
SIA > SSA, part 2: Telekinesis, reference classes, and other scandals
by Joe_Carlsmith @ 2021-10-01 | +10 | 0 comments
SIA > SSA, part 1: Learning from the fact that you exist
by Joe_Carlsmith @ 2021-10-01 | +16 | 0 comments
Can you control the past?
by Joe_Carlsmith @ 2021-08-27 | +46 | 0 comments
In search of benevolence (or: what should you get Clippy for Christmas?)
by Joe_Carlsmith @ 2021-07-20 | +17 | 0 comments
On the limits of idealized values
by Joe_Carlsmith @ 2021-06-22 | +80 | 0 comments
Draft report on existential risk from power-seeking AI
by Joe_Carlsmith @ 2021-04-28 | +88 | 0 comments
Problems of evil
by Joe_Carlsmith @ 2021-04-19 | +31 | 0 comments
The innocent gene
by Joe_Carlsmith @ 2021-04-05 | +16 | 0 comments
The importance of how you weigh it
by Joe_Carlsmith @ 2021-03-29 | +43 | 0 comments
On future people, looking back at 21st century longtermism
by Joe_Carlsmith @ 2021-03-22 | +102 | 0 comments
Against neutrality about creating happy lives
by Joe_Carlsmith @ 2021-03-15 | +95 | 0 comments
Care and demandingness
by Joe_Carlsmith @ 2021-03-08 | +54 | 0 comments
Subjectivism and moral authority
by Joe_Carlsmith @ 2021-03-01 | +15 | 0 comments
Two types of deference
by Joe_Carlsmith @ 2021-02-22 | +10 | 0 comments
Contact with reality
by Joe_Carlsmith @ 2021-02-15 | +47 | 0 comments
Killing the ants
by Joe_Carlsmith @ 2021-02-07 | +228 | 0 comments
Believing in things you cannot see
by Joe_Carlsmith @ 2021-02-01 | +15 | 0 comments
On clinging
by Joe_Carlsmith @ 2021-01-24 | +29 | 0 comments
A ghost
by Joe_Carlsmith @ 2021-01-21 | +2 | 0 comments
Actually possible: thoughts on Utopia
by Joe_Carlsmith @ 2021-01-18 | +86 | 0 comments
Alienation and meta-ethics (or: is it possible you should maximize helium?)
by Joe_Carlsmith @ 2021-01-15 | +19 | 0 comments
The impact merge
by Joe_Carlsmith @ 2021-01-13 | +29 | 0 comments
Shouldn't it matter to the victim?
by Joe_Carlsmith @ 2021-01-11 | +22 | 0 comments
Thoughts on personal identity
by Joe_Carlsmith @ 2021-01-08 | +21 | 0 comments
Grokking illusionism
by Joe_Carlsmith @ 2021-01-06 | +29 | 0 comments
The despair of normative realism bot
by Joe_Carlsmith @ 2021-01-03 | +34 | 0 comments
Thoughts on being mortal
by Joe_Carlsmith @ 2021-01-01 | +58 | 0 comments
Wholehearted choices and "morality as taxes"
by Joe_Carlsmith @ 2020-12-21 | +79 | 0 comments