$500 bounty for alignment contest ideas

By Akash @ 2022-06-30T01:55 (+18)

*Up to $500 for alignment contest ideas*

My team is composing questions for an AI alignment talent search contest. We want to use (or come up with) a frame of the alignment problem that is accessible to smart high schoolers/college students and people without ML backgrounds.

$20 for links to existing framings of the alignment problem (or subproblems) that we find helpful.

$500 for coming up with a new framing that meets our criteria or that we use (see below for details; also feel free to send us a FB message if you want to work on this and have questions).

We’ll also consider up to $500 for anything else we find helpful.

Feel free to submit via comments or share Google Docs with akashwasil133@gmail.com. Awards are at our discretion.

-- More context --

We like Eliezer’s strawberry problem: How can you get an AI to place two identical (down to the cellular but not molecular level) strawberries on a plate, and then do nothing else?

Nate Soares noted that the strawberry problem has the quality of capturing two core alignment challenges: (1) Directing a capable AGI towards an objective of your choosing and (2) Ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.

We also imagine if we ask someone this question and they *notice* these challenges are what makes the problem difficult, and maybe come at the problem from an interesting angle as a result, that’s a really good signal about their thinking.

However, we worry if we ask exactly this question in a contest, people will get lost thinking about AI capabilities, molecular biology, etc. We also don’t like that there aren’t many impressive answers besides full answers to the alignment problem. So, we want to come up with a similar question/frame that is more contest-friendly.

Ideal criteria for the question/frame (though we can imagine great questions not meeting all of these):

More examples we like:


mic @ 2022-07-01T07:04 (+11)

Here's my proposal for a contest description. Contest problems #1 and 2 are inspired by Richard Ngo's Alignment research exercises.

AI alignment is the problem of ensuring that advanced AI systems take actions which are aligned with human values. As AI systems become more capable and approach or exceed human-level intelligence, it becomes harder to ensure that they remain within human control instead of posing unacceptable risks.

One solution to AI alignment proposed by Stuart Russell, a leading AI researcher, is the assistance game, also called a cooperative inverse reinforcement learning (CIRL) game, following these principles:

  1. "The machine’s only objective is to maximize the realization of human preferences.
  2. The machine is initially uncertain about what those preferences are.
  3. The ultimate source of information about human preferences is human behavior."

For a more formal specification of this proposal, please see Stuart Russell's new book on why we need to replace the standard model of AI, Cooperatively Learning Human Values, and Cooperative Inverse Reinforcement Learning.

Contest problem #1: Why are assistance games not an adequate solution to AI alignment?

  • The first link describes a few critiques; you're free to restate them in your own words and elaborate on them. However, we'd be most excited to see a detailed, original exposition of one or a few issues, which engages with the technical specification of an assistance game.

Another proposed solution to AI alignment is iterated distillation and amplification (IDA), proposed by Paul Christiano. Paul runs the Alignment Research Center and previously ran the language model alignment team at OpenAI. In IDA, a human H wants to train an AI agent, X by repeating two steps: amplification and distillation. In the amplification step, the human uses multiple copies of X to help it solve a problem. In the distillation step, the agent X learns to reproduce the same output as the amplified system of the human + multiple copies of X. Then we go through another amplification step, then another distillation step, and so on.

You can learn more about this at Iterated Distillation and Amplification and see a simplified application of IDA in action at Summarizing Books with Human Feedback.

Contest problem #2: Why might an AI system trained through IDA be misaligned with human values? What assumptions would be needed to prevent that?

Contest problem #3: Why is AI alignment an important problem? What are some research directions and key open problems? How can you or other students contribute to solving it through your career?

You're free to submit to one or more of these contest problems. You can write as much or as little as you feel is necessary to express your ideas concisely; as a rough guideline, feel free to write between 300 and 2000 words. For the first two content problems, we'll be evaluating submissions based on the level of technical insight and research aptitude that you demonstrate, not necessarily quality of writing.

I like how contest problems #1 and 2:

Contest problem #3 here isn't a technical problem, but I think it can be helpful so that participants actually end up caring about AI alignment rather than just engaging with it on a one-time basis as part of this contest. I think it would be exciting if participants learned on their own about why AI alignment matters, form a plan for how they could work on it as part of their career, and end up motivated to continue thinking about AI alignment or to support AI safety field-building efforts in India.