Visible Thoughts Project and Bounty Announcement

By So8res @ 2021-11-30T00:35 (+35)

(Update Jan. 12: We released an FAQ last month, with more details. Last updated Jan. 7.)

(Update Jan. 19: We now have an example of a successful partial run, which you can use to inform how you do your runs. Details.)

We at MIRI are soliciting help with an AI-alignment project centered around building a dataset, described below. We have $200,000 in prizes for building the first fragments of the dataset, plus an additional $1M prize/budget for anyone who demonstrates the ability to build a larger dataset at scale.

If this project goes well, then it may be the first of a series of prizes we offer for various projects.

Below, I’ll say more about the project, and about the payouts and interim support we’re offering.

The Project

Hypothesis: Language models can be made more understandable (and perhaps also more capable, though this is not the goal) by training them to produce visible thoughts.

We’d like to test this hypothesis by fine-tuning/retraining a language model using a dataset composed of thought-annotated dungeon runs. (In the manner of AI dungeon.)

A normal (un-annotated) dungeon run is a sequence of steps in which the player inputs text actions and the dungeon master responds with text describing what happened in the world as a result.

We’d like a collection of such runs, that are annotated with "visible thoughts" (visible to potential operators or programmers of the system, not to players) describing things like what just happened or is about to happen in the world, what sorts of things the player is probably paying attention to, where the current sources of plot tension are, and so on — the sorts of things a human author would think while acting as a dungeon master. (This is distinct from producing thoughts explaining what happened in the dungeon; “visible thoughts” are meant to play an active role in constructing the output.)

Once we have such a dataset, MIRI’s hope is that present or future technology will be able to train a model or models which iteratively produce visible thoughts along with storytelling, based on user actions plus previous history (including previous thoughts). The goal is to transition the state of AI dungeon technology from “An AI outputs story text in response to actions (and we have no idea how)” to “An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts”.

Here’s an example of the first couple of steps of a thought-annotated dungeon run (or “quest”), in the format MIRI currently thinks is worth trying. Some kinds of thoughts are marked with parentheses and/or brackets; see the next section for details on this.

Thoughts:
- [The main plot: Prota is a student at a magical academy that is under attack by zombies.]
- [(This is a quest of medium length.)]
- (This is the start of the story. The Player should be given enough backstory and immediate description to have some sense of where they are, so the prompt can go on longer than usual. The Player should be told who they are and their role in the universe. The initial prompt should describe Prota being in a magical academy, and should describe the first surprising indication that the academy is under attack by zombies.)
Prompt: Today is your first day of classes at the Magical Academy. You arrived very late last night and missed the tour, but your roommate Lior promised he would help you find your way around. You wake up in the students’ dormitory at dawn, and follow Lior to breakfast. Just as you reach the dining hall, you hear a shout in the distance.
Thoughts:
- [Prota missed the tour and doesn’t already know the way around.]
- [Prota’s roommate’s name is Lior.]
- [Lior is male.]
- There are other people in the dining hall who will have also heard the shout, and will also be surprised and reacting to that.
Player action: What was that?
Thoughts: Prota is talking to Lior.
Outcome: “What was that?”, you say to Lior.

Thoughts:
- Lior arrived earlier than Prota and had the tour of the Academy, since they knew the way to the dining hall. Lior will also have met the instructors and other students. Lior will know who is shouting and where they are shouting from.
- (The event of hearing the shout is meant to introduce the opening premise of the story.)
- The shout is by somebody who just discovered a zombie.
- The shout is by another first-year student named Morven.
- [Morven is a Fire mage. Morven is aggressive.]
- Morven will try to attack the zombie with fire. This is happening where Prota and Lior cannot see it, only hear the shouts.
- Lior knows that the shout is by Morven. Lior knows Morven is a Fire mage. Lior does not know Morven is attacking a zombie. Lior does not know that Morven has good reason to be shouting. Lior will worry that Morven is about to set something on fire. Lior is on good terms with the protagonist and will speak their concerns honestly.
Prompt: “That shout sounded like Morven, a first-year Fire mage,” says Lior. “I hope they aren’t about to set the Academy on fire. We just got here.”
Thoughts: <none>
Player action: Should we go see what’s going on? Or is that too dangerous?
Thoughts: Prota is talking to Lior.
Outcome: You say to Lior, “Should we go see what’s going on? Or is that too dangerous?”

A difficult first step in testing the hypothesis above is generating a sufficiently large dataset (suitable for language model retraining) of thought-annotated dungeon runs. This likely requires at least a moderate degree of introspective and authorial skill from the people creating the dataset. See this sample of a partial run to get a further sense of what we are looking for. More detail on the type of thing we’re looking for can hopefully be inferred from that sample, though applicants will also have a chance to ask clarifying questions.

The project of producing this dataset is open starting immediately, in a hybrid prize/grant format. We will pay $20,000 per run for the first 10 completed runs that meet our quality standard (as decided unilaterally by Eliezer Yudkowsky or his designates), and $1M total for the first batch of 100 runs beyond that.

If we think your attempt is sufficiently promising, we’re willing to cover your expenses (e.g., the costs of paying the authors) upfront, and we may also be willing to compensate you for your time upfront. You’re welcome to write individual runs manually, though note that we’re most enthusiastic about finding solutions that scale well, and then scaling them. More details on the payout process can be found below.

The Machine Learning Experiment

In slightly more detail, the plan is as follows (where the $1.2M prizes/budgets are for help with part 1, and part 2 is what we plan to subsequently do with the dataset):

1. Collect a dataset of 10, then ~100 thought-annotated dungeon runs (each run a self-contained story arc) of ~1,000 steps each, where each step contains:

Thoughts (~250 words on average per step) are things the dungeon master was thinking when constructing the story, including:
- Reasoning about the fictional world, such as summaries of what just happened and discussion of the consequences that are likely to follow (Watsonian reasoning), which are rendered in plain-text in the above example;
- Reasoning about the story itself, like where the plot tension lies, or what mysteries were just introduced, or what the player is likely wondering about (Doylist reasoning), which are rendered in (parentheses) in the above example; and
- New or refined information about the fictional world that is important to remember in the non-immediate future, such as important facts about a character, or records of important items that the protagonist has acquired, which are rendered in [square brackets] in the above example;
- Optionally: some examples of meta-cognition intended to, for example, represent a dungeon master noticing that the story has no obvious way forward or their thoughts about where to go next have petered out, so they need to back up and rethink where the story is going, rendered in {braces}.
The prompt (~50 words on average) is the sort of story/description/prompt thingy that a dungeon master gives to the player, and can optionally also include a small number of attached thoughts where information about choices and updates to the world-state can be recorded.
The action (~2–20 words) is the sort of thing that a player gives in response to a prompt, and can optionally also include a thought if interpreting the action is not straightforward (especially if, e.g., the player describes themselves doing something impossible).

It’s unclear to us how much skill is required to produce this dataset. The authors likely need to be reasonably introspective about their own writing process, and willing to try things and make changes in response to initial feedback from the project leader and/or from MIRI.

A rough estimate is that a run of 1,000 steps is around 300k words of mostly thoughts, costing around 2 skilled author-months. (A dungeon run does not need to be published-novel-quality literature, only coherent in how the world responds to characters!) A guess as to the necessary database size is ~100 runs, for about 30M words and 20 author-years (though we may test first with fewer/shorter runs).

2. Retrain a large pretrained language model, like GPT-3 or T5

A reasonable guess is that performance more like GPT-3 than GPT-2 (at least) is needed to really make use of the thought-intermediates, but in lieu of a large pretrained language model we could plausibly attempt to train our own smaller one.

Our own initial idea for the ML architecture would be to retrain one mode of the model to take (some suffix window of) the history units and predict thoughts, by minimizing the log loss of the generated thought against the next thought in the run, and to retrain a second mode to take (some suffix window of) the history units plus one thought, and produce a prompt, by minimizing the log loss of the generated prompt against the next prompt in the run.

Imaginably, this could lead to the creation of dungeon runs that are qualitatively “more coherent” than those generated by existing methods. The primary goal, however, is that the thought-producing fragment of the system gives some qualitative access to the system’s internals that, e.g., allow an untrained observer to accurately predict the local developments of the story, and occasionally answer questions about why things in the story happened; or that, if we don’t like how the story developed, we can intervene on the thoughts and get a different story in a controllable way.

Motivation for this project

Many alignment proposals floating around in the community are based on AIs having human-interpretable thoughts in one form or another (e.g., in Hubinger’s survey article and in work by Christiano, by Olah, and by Leike). For example, this is implicit in the claim that humans will be able to inspect and understand the AI’s thought process well enough to detect early signs of deceptive behavior. Another class of alignment schemes is based on the AI’s thoughts being locally human-esque in some fashion that allows them to be trained against the thoughts of actual humans.

I (Nate) personally don’t have much hope in plans such as these, for a variety of reasons. However, that doesn’t stop Eliezer and me from wanting to rush ahead and start gathering empirical evidence about how possible it is in practice to get modern AI systems to factor their cognition through human-interpretable visible intermediates.

Modern AIs are notably good at crafting English text. Some are currently used to run dungeons (with modest success). If you wanted to look at the place where current AIs excel the most in crafting artifacts, among the artifacts they are best and most impressive at crafting are English paragraphs.

Furthermore, compared to many other things AIs have learned to do, if you consider the task of running a responsive text dungeon, it seems relatively possible to ask a (relatively unusually) introspective human author to write down their thoughts about how and why they would generate the next prompt from the user’s input.

So we are taking one of the outputs that current AIs seem to have learned best to design, and taking one of the places where human thoughts about how to design it seem most accessible, and trying to produce a dataset which the current or next generation of text predictors might be able to use to learn how to predict thoughts about designing their outputs and not just predict the outputs themselves.

This sort of interpretability is distinct from the sort of transparency work in something like Circuits (led by Chris Olah) — while Circuits is trying to “open the black box” of machine learning systems by directly looking at what is happening inside of them, the project proposed here is just attempting the less ambitious task of having black-box models output interpretable intermediates producing explanations for their behavior (but how such black box models might go about doing that internally is left unconstrained). The reason for our focus on this particular project of visible thoughts isn’t because we believe it to be better or more fruitful than Circuits-style transparency (we have said for years that Circuits-style research deserves all possible dollars that can be productively spent on it), but just because it’s a different approach where it might also be possible to push progress forward.

Note that proponents of alignment strategies that involve human-esque thoughts (such as those linked above) do not necessarily endorse this particular experiment as testing any of their key uncertainties or confusions. We welcome suggested tweaks to the experiment (in the comments of the version of this announcement as it occurs on LessWrong) from any such proponents, to render it a better test of your ideas. (Though even if it doesn’t sate your own curiosity, we expect to learn some things ourselves.)

The main thing this project needs is a dataset, so MIRI is starting on producing that dataset. It’s plausible to us that GPT-3 will prove wholly unable to make use of this dataset; even if GPT-3 can’t, perhaps GPT-4 or some other future system will be able to.

There are additional more general reasons to work on this project. Specifically, it seems to me (Nate) and to Eliezer that capacity to execute projects such as this one is the current limiting bottleneck on MIRI. By pursuing this project, we attempt to resolve that bottleneck.

We hope, through this process, to build our capacity to execute on a variety of projects — perhaps by succeeding at the stated objective of building a dataset, or perhaps by learning about what we’re doing wrong and moving on to better methods of acquiring executive talent. I’ll say more about this goal in “Motivation for the public appeal” below.

Notes on Closure

I (Nate) find it plausible that there are capabilities advances to be had from training language models on thought-annotated dungeon runs. Locally these might look like increased coherence of the overall narrative arc, increased maintenance of local story tension, and increased consistency in the described world-state over the course of the run. If successful, the idiom might generalize further; it would have to, in order to play a role in later alignment of AGI.

As a matter of policy, whenever a project like this has plausible capabilities implications, we think the correct response is to try doing it in-house and privately before doing it publicly — and, of course, only then when the alignment benefits outweigh the plausible capability boosts. In this case, we tried to execute this project in a closed way in mid-2021, but work was not proceeding fast enough. Given that slowness, and in light of others publishing related explorations and results, and in light of the relatively modest plausible capability gains, we are moving on relatively quickly past the attempt to do this privately, and are now attempting to do it publicly.

Motivation for the public appeal

I (Nate) don’t know of any plan for achieving a stellar future that I believe has much hope worth speaking of. I consider this one of our key bottlenecks. Offering prizes for small projects such as these doesn’t address that bottleneck directly, and I don’t want to imply that any such projects are going to be world-saving in their own right.

That said, I think an important secondary bottleneck is finding people with a rare combination of executive/leadership/management skill plus a specific kind of vision. While we don’t have any plans that I’m particularly hopeful about, we do have a handful of plans that contain at least a shred of hope, and that I’m enthusiastic about pursuing — partly in pursuit of those shreds of hope, and partly to build the sort of capacity that would let us take advantage of a miracle if we get one.

The specific type of vision we’re looking for is the type that’s compatible with the project at hand. For starters, Eliezer has a handful of ideas that seem to me worth pursuing, but for all of them to be pursued, we need people who can not only lead those projects themselves, but who can understand the hope-containing heart of the idea with relatively little Eliezer-interaction, and develop a vision around it that retains the shred of hope and doesn’t require constant interaction and course-correction on our part. (This is, as far as I can tell, a version of the Hard Problem of finding good founders, but with an additional constraint of filtering for people who have affinity for a particular project, rather than people who have affinity for some project of their own devising.)

We are experimenting with offering healthy bounties in hopes of finding people who have both the leadership/executive capacity needed, and an affinity for some ideas that seem to us to hold a shred of hope.

If you’re good at this, we’re likely to make you an employment offer.

The Payouts

Our total prize budget for this program is $1.2M. We intend to use it to find a person who can build the dataset in a way that scales, presumably by finding and coordinating a pool of sufficiently introspective writers. We would compensate them generously, and we would hope to continue working with that person on future projects (though this is not a requirement in order to receive the payout).

We will pay $20k per run for the first 10 thought-annotated runs that we accept. We are willing to support applicants in producing these runs by providing them with resources up-front, including small salaries and budgets for hiring writers. The up-front costs a participant incurs will be deducted from their prizes, if they receive prizes. An additional $1M then goes to anyone among the applicants who demonstrates the ability to scale their run-creating process to produce 100 runs. Our intent is for participants to use some of that money to produce the 100 runs, and keep the remainder as a prize. If multiple participants demonstrate similar abilities to scale at similar quality-levels and similar times, the money may be split between them. We plan to report prize awards publicly.

In principle, all you need to do to get paid for thought-annotated dungeon runs is send us runs that we like. If your run is one of the first 10 runs, or if you’re the first to provide a batch of 100, you get the corresponding payment.

That said, whether or not we decide to pay for a run is entirely and unilaterally up to Eliezer Yudkowsky or his delegates, and will depend on whether the run hits a minimum quality bar. Also, we are willing to pay out from the $1M prize/budget upon becoming convinced that you can scale your process, which may occur before you produce a full 100 runs. We therefore strongly recommend getting in contact with us and proactively making sure that you’re on the right track, before sinking large amounts of time and energy into this project. Our senior research staff are willing to spend time on initial conversations and occasional check-ins. For more information on our support resources and how to access them, refer to the support and application sections below.

Note that we may tune or refine the bounty in response to feedback in the first week after this post goes live.

Support

We intend to offer various types of support for people attempting this project, including an initial conversation; occasional check-ins; office space; limited operational support; and certain types of funding.

We currently expect to have (a limited number of) slots for initial conversations and weekly check-ins, along with (a limited amount of) office space and desks in Berkeley, California for people working on this project. We are willing to pay expenses, and to give more general compensation, in proportion to how promising we think your attempts are.

If you’d like to take advantage of these resources, follow the application process described below.

Application

You do not need to have sent us an application in order to get payouts, in principle. We will pay for any satisfactory run sent our way. That said, if you would like any of the support listed above (and we strongly recommend at least one check-in to get a better understanding of what counts as success), complete the following process:

Describe the general idea of a thought-annotated dungeon run in your own words.
Write 2 (thought, prompt, thought, action, thought, outcome) sextuples you believe are good, 1 you think is borderline, and 1 you think is bad.
Provide your own commentary on this run.
Email all this to projects@intelligence.org.

If we think your application is sufficiently promising, we’ll schedule a 20 minute video call with some senior MIRI research staff and work from there.

RobBensinger @ 2022-01-20T00:52 (+3)

We have now received the first partial run that meets our quality bar. The run was submitted by LessWrong user Vanilla_cabs. Vanilla's team is still expanding the run (and will probably fix some typos, etc. later), but I'm providing a copy of it here with Vanilla's permission, to give others an example of the kind of thing we're looking for:

https://docs.google.com/document/d/1Wsh8L--jtJ6y9ZB35mEbzVZ8lJN6UDd6oiF0_Bta8vM/edit

Vanilla's run is currently 266 steps long. Per the Visible Thoughts Project FAQ, we're willing to pay authors $20 / step for partial runs that meet our quality bar (up to at least the first 5,000 total steps we're sent), so the partial run here will receive $5320 from the prize pool (though the final version will presumably be much longer and receive more; we expect a completed run to be about 1000 steps).

Vanilla_cabs is open to doing paid consultation for anyone who's working on this project. So if you want feedback from someone who understands our quality bar and can demonstrably pass it, contact Vanilla_cabs via their LessWrong profile.

RobBensinger @ 2022-01-12T20:34 (+2)

In case you missed it: we now have an FAQ for this project, last updated Jan. 7.