evhub
Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Selected posts:
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by evhub @ 2024-01-12 | +65 | 0 comments
The Hubinger lectures on AGI safety: an introductory lecture series
by evhub @ 2023-06-22 | +44 | 0 comments
Discovering Language Model Behaviors with Model-Written Evaluations
by evhub @ 2022-12-20 | +25 | 0 comments
We must be very clear: fraud in the service of effective altruism is...
by evhub @ 2022-11-10 | +713 | 0 comments
FLI AI Alignment podcast: Evan Hubinger on Inner Alignment, Outer Alignment, and...
by evhub @ 2020-07-01 | +13 | 0 comments