Joe_Carlsmith
Senior research analyst at Open Philanthropy. Doctorate in philosophy at the University of Oxford. Opinions my own.
Posts
Takes on "Alignment Faking in Large Language Models"
by Joe_Carlsmith @ 2024-12-18 | +63 | 0 comments
How might we solve the alignment problem? (Part 1: Intro, summary, ontology)
by Joe_Carlsmith @ 2024-10-28 | +18 | 0 comments
Video and transcript of presentation on Otherness and control in the age of AGI
by Joe_Carlsmith @ 2024-10-08 | +18 | 0 comments
Video and transcript of presentation on Scheming AIs
by Joe_Carlsmith @ 2024-03-22 | +23 | 0 comments
Empirical work that might shed light on scheming (Section 6 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-11 | +7 | 0 comments
Speed arguments against scheming (Section 4.4-4.7 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-12-08 | +6 | 0 comments
Simplicity arguments for scheming (Section 4.3 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-07 | +6 | 0 comments
The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-06 | +9 | 0 comments
Arguments for/against scheming that focus on the path SGD takes (Section 3 of...
by Joe_Carlsmith @ 2023-12-05 | +7 | 0 comments
Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-04 | +12 | 0 comments
Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming...
by Joe_Carlsmith @ 2023-12-03 | +6 | 0 comments
The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs")
by Joe_Carlsmith @ 2023-12-02 | +6 | 0 comments
How useful for alignment-relevant work are AIs with short-term goals? (Section 2..
by Joe_Carlsmith @ 2023-12-01 | +6 | 0 comments
Is scheming more likely in models trained to have long-term goals? (Sections 2.2..
by Joe_Carlsmith @ 2023-11-30 | +6 | 0 comments
“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-29 | +7 | 0 comments
Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-28 | +8 | 0 comments
Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-27 | +11 | 0 comments
Situational awareness (Section 2.1 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-26 | +12 | 0 comments
On “slack” in training (Section 1.5 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-25 | +14 | 0 comments
Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-24 | +10 | 0 comments
A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-22 | +6 | 0 comments
Varieties of fake alignment (Section 1.1 of “Scheming AIs”)
by Joe_Carlsmith @ 2023-11-21 | +6 | 0 comments
New report: "Scheming AIs: Will AIs fake alignment during training in order to...
by Joe_Carlsmith @ 2023-11-15 | +71 | 0 comments
Superforecasting the premises in “Is power-seeking AI an existential risk?”
by Joe_Carlsmith @ 2023-10-18 | +114 | 0 comments
[Linkpost] Shorter version of report on existential risk from power-seeking AI
by Joe_Carlsmith @ 2023-03-22 | +49 | 0 comments
A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my...
by Joe_Carlsmith @ 2023-02-21 | +64 | 0 comments
[Linkpost] Human-narrated audio version of "Is Power-Seeking AI an Existential...
by Joe_Carlsmith @ 2023-01-31 | +9 | 0 comments
Video and Transcript of Presentation on Existential Risk from Power-Seeking AI
by Joe_Carlsmith @ 2022-05-08 | +97 | 0 comments
On expected utility, part 4: Dutch books, Cox, and Complete Class
by Joe_Carlsmith @ 2022-03-24 | +7 | 0 comments
On expected utility, part 3: VNM, separability, and more
by Joe_Carlsmith @ 2022-03-22 | +8 | 0 comments
On expected utility, part 2: Why it can be OK to predictably lose
by Joe_Carlsmith @ 2022-03-18 | +8 | 0 comments
On expected utility, part 1: Skyscrapers and madmen
by Joe_Carlsmith @ 2022-03-16 | +22 | 0 comments
Reviews of "Is power-seeking AI an existential risk?"
by Joe_Carlsmith @ 2021-12-16 | +71 | 0 comments
SIA > SSA, part 4: In defense of the presumptuous philosopher
by Joe_Carlsmith @ 2021-10-01 | +8 | 0 comments
SIA > SSA, part 3: An aside on betting in anthropics
by Joe_Carlsmith @ 2021-10-01 | +10 | 0 comments
SIA > SSA, part 2: Telekinesis, reference classes, and other scandals
by Joe_Carlsmith @ 2021-10-01 | +10 | 0 comments
SIA > SSA, part 1: Learning from the fact that you exist
by Joe_Carlsmith @ 2021-10-01 | +16 | 0 comments
In search of benevolence (or: what should you get Clippy for Christmas?)
by Joe_Carlsmith @ 2021-07-20 | +17 | 0 comments
Draft report on existential risk from power-seeking AI
by Joe_Carlsmith @ 2021-04-28 | +88 | 0 comments
On future people, looking back at 21st century longtermism
by Joe_Carlsmith @ 2021-03-22 | +102 | 0 comments
Alienation and meta-ethics (or: is it possible you should maximize helium?)
by Joe_Carlsmith @ 2021-01-15 | +19 | 0 comments