evhub
Evan Hubinger (he/him/his) (evanjhub@gmail.com)
I am a research scientist at Anthropic, where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Selected work:
- "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training"
- “Conditioning Predictive Models”
- “How likely is deceptive alignment?”
- “How do we become confident in the safety of a machine learning system?”
- “An overview of 11 proposals for building safe advanced AI”
- “Risks from Learned Optimization”
Posts
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by evhub @ 2024-01-12 | +65 | 0 comments
The Hubinger lectures on AGI safety: an introductory lecture series
by evhub @ 2023-06-22 | +44 | 0 comments
Discovering Language Model Behaviors with Model-Written Evaluations
by evhub @ 2022-12-20 | +25 | 0 comments
We must be very clear: fraud in the service of effective altruism is...
by evhub @ 2022-11-10 | +712 | 0 comments
FLI AI Alignment podcast: Evan Hubinger on Inner Alignment, Outer Alignment, and...
by evhub @ 2020-07-01 | +13 | 0 comments