Appendix to Bridging Demonstration

By mako yass @ 2022-06-01T20:30 (+18)

(Edit: Bridging Demonstration won first place.)

The Future of Life Institute have announced the finalist entries of their $100,000 long-term strategy worldbuilding contest, which invited contestants to construct a vivid, plausible, detailed future-history in which the AI alignment problem has been solved before 2045. If they've pulled that off, they've earned our attention. I invite you to read some of the finalist entries and leave your feedback.

I'm a finalist too. My entry can be read here. I packed in a lot of ideas, but they overflow. After reading it (or, if you prefer, instead of reading it), I hope that you read this "appendix". Here, I'll be confessing doubts, acknowledging some of the grisliest risks and contingencies that I noticed but thought irresponsible to discuss in outward-facing media, and writing up the proposed alignment strategy with the thoroughness that such things warrant.

The format (fiction?) didn't quite give me the opportunity to lay out all of the implementation details. I think many readers will have come away under the assumption that I hadn't worked out the details, that the 'demoncatting' technical and geopolitical alignment strategy should just be read as a symbol for whatever the interpretability community ends up doing in reality. Maybe it should be read that way! But the story was built around a concrete, detailed proposal, and I think it should be discussed, so I'll lay it out properly here.

The proposed alignment strategy: Demonstration of Cataclysmic Trajectory

Clear demonstrations of the presence of danger are more likely to rouse an adequate global response than the mere abstract arguments for potential future danger that we can offer today. With demonstrations, we may be able to muster the unprecedented levels of global coordination that are required to avert arms-racing, establish a global, unbreakable moratorium against unjustifiably risky work, and foster a capable, inclusive international alignment project.

Producing such demonstrations, in a safe way, turns out to have a lot of interesting technical complications. In the entry, I talk a lot about some of the institutions we're going to need. Here, I'll outline the technical work that would break up the soil for them.

It goes like this:

  1. Develop tools for rapidly identifying and translating the knowledge-representation language of an agent by looking for the relational structures of expected landmark thoughts.
    • toy example:
      • Say that we find an epistemic relation graph structure over symbols that has a shape like "A is a B" + "C is a B" + "A is D C" + "for any X and Y where X is D Y, Y is E X" + "F is D C" + "A is D F" + "B is a G" + "H is a G" + "I are E J" + "I are H" + ...
      • A process of analysis that sprawls further and further out over the shape of the whole can infer with increasing confidence that the only meaning that could fit the shape of these relations would be the one where A represents "the sky", B must mean "blue", C must be the ocean, D means above, E, below, F, the land, G, color, H, red, I, apples, J, apple trees, and so on.
    • This involves finding partial graph isomorphisms; subgraph isomorphism is NP-complete in general, so I'm not sure to what extent we have thorough, efficient gold standard algorithms for it (a toy sketch of this kind of structural matching appears just after this list).
    • Note, there are a number of datasets for commonsense knowledge about the human world that we could draw on.
      So, given that, there may be some need, or opportunity, for tailoring the graph-isomorphism-finding algorithms around the fact that we're looking for any subset of an abnormally huge graph within an even more abnormally huge graph. I don't know how many applications of graph-isomorphism-finding have needed to go quite this... wide, or disjunctive, or partial.
    • This step assumes that the knowledge-format will be fairly legible to us and our algorithmic search processes. I don't know how safe that assumption is, given that, as far as I'm aware, we're still a very long way from being able to decode the knowledge-format of the human brain. Though, note, decoding memories in a human brain faces a lot of challenges that the same problem in AI probably doesn't[1].
  2. Given that decoding method, we'll be able to locate the querying processes that the AI uses to ask and answer questions internally, and harness them to inject our own questions, and hopefully get truthful answers. (For exploration of the possibility that this won't get us truthful answers, which is a large part of the reason I have to write this appendix, see the section "Mask Brain, Shadow Brain".)
  3. We then use that to create Demonstrations of Cataclysmic Trajectory, posing precise questions about the AGI's expectations about the future world. An example would be "If we gave you free, unmonitored internet access, would there be things like humans walking around in an earth-like environment anywhere in the solar system, 100 years from now?". If the danger is real, many of these sorts of questions will return disturbing answers, essentially constituting confessions to misalignment. If an AGI believes, itself, that releasing it would lead to human extinction, no one has any excuses left; the precautions must be undertaken.
  4. If we do find these confessions, they can be brought to world leaders. The magnitude of the danger is now clear, it's not just complex arguments about a future condition any more, it's tangible evidence of a present condition, it is here on the table, it is in our hands, living in our offices, we brandish it. We have made it very clear that if action is not taken to prevent arms-racing then we personally could be wiped out by the year's end.
    • (If step four fails, come back and talk to me again. I might have something for it.)
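
To make step 1 a bit more concrete, here is a minimal sketch of the kind of structural matching I have in mind, phrased around the toy sky/ocean example above. It is illustrative only, and it leans on assumptions the entry doesn't commit to: that the decoded knowledge can be flattened into (subject, relation, object) triples over opaque symbols, and that a tiny toy graph tells us anything about graphs with trillions of nodes. Subgraph matching is worst-case exponential, so treat this as a statement of the shape of the problem rather than a scalable tool; the helper fact_graph and the toy symbols are my own inventions, not anything from the entry.

```python
# Illustrative only: match a small "landmark thought" pattern, phrased in our
# vocabulary, against a toy decoded knowledge graph of opaque agent symbols.
# Requires a reasonably recent networkx (for subgraph_monomorphisms_iter).

import networkx as nx
from networkx.algorithms import isomorphism


def fact_graph(triples):
    """Encode (subject, relation, object) triples as pure structure:
    subject -> fact-instance -> object, with each fact-instance also pointing
    at a shared relation-type node. No labels are used during matching, so the
    agent's symbols can stay completely opaque."""
    g = nx.DiGraph()
    for i, (subj, rel, obj) in enumerate(triples):
        inst = ("fact", i)               # one anonymous node per asserted fact
        g.add_edge(subj, inst)           # subject slot
        g.add_edge(inst, obj)            # object slot
        g.add_edge(inst, ("rel", rel))   # which relation this fact uses
    return g


# The landmark pattern: facts we expect any competent world-model to contain.
pattern = fact_graph([
    ("sky", "is", "blue"), ("ocean", "is", "blue"),
    ("sky", "above", "ocean"), ("sky", "above", "land"),
    ("land", "above", "ocean"),
    ("blue", "is", "color"), ("red", "is", "color"),
    ("apples", "below", "apple trees"), ("apples", "is", "red"),
])

# A toy "decoded agent graph": the same facts under opaque symbols, plus a
# little unrelated noise. In reality this would be enormous.
agent = fact_graph([
    ("s0", "r0", "s1"), ("s2", "r0", "s1"),
    ("s0", "r1", "s2"), ("s0", "r1", "s3"), ("s3", "r1", "s2"),
    ("s1", "r0", "s4"), ("s5", "r0", "s4"),
    ("s6", "r2", "s7"), ("s6", "r0", "s5"),
    ("s8", "r0", "s9"), ("s9", "r1", "s10"),
])

# Search the big graph for any region whose shape fits the pattern.
matcher = isomorphism.DiGraphMatcher(agent, pattern)
for mapping in matcher.subgraph_monomorphisms_iter():
    # Keys are agent nodes, values are pattern nodes, so entity symbols and
    # relation symbols both get translated. A pattern this small admits
    # spurious matches; confidence comes from sprawling the pattern outward,
    # exactly as in the toy example above.
    translation = {a: p for a, p in mapping.items()
                   if not (isinstance(a, tuple) and a[0] == "fact")}
    print(translation)
    break
```

The reification trick (each fact becomes an anonymous node that points at a shared relation-type node) is just there so that relation symbols get translated by the same purely structural search as entity symbols. A production version would also have to cope with noise, near-misses, and quantified rules like the above/below inversion in the toy example, none of which this sketch attempts.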

With enough preparation, in theory, we could set ourselves up to be able to leap from the inception of strong AGI straight to step 4, as the intervening steps can potentially be automated. In this way, the window of peril (in which true AGI exists but a method of aligning it does not) can be shrunk.

We should dwell for a minute on the fact that this methodology is not a reliable test for alignment. My entry initially sort of accidentally implied that it was (fortunately, I was allowed to edit that), but I'm aware that it admits both:

Surprisingly, that doesn't invalidate it. To be useful, Demonstration of Cataclysmic Trajectory only needs to be able to, some of the time, clearly confirm that there are plausible AGI deployments that would irreversibly set us on a trajectory towards drastic transformations that are scary: transformations that are not obviously in agreement with humanity's wishes. That'll be enough motivation to get us over the hill.

To reiterate: anticipate that a strong agency will have some sort of internal knowledge-representation format in its state; prepare tooling for searching the architecture's knowledge for expected landmark thoughts, like mathematical theories or simple clusters of obvious facts about the world; use those to figure out the rest of its knowledge-language and its internal querying processes; then use those to run queries about its expectations about the future, to see whether the future it will build is going to be scary, and if so, we can all get appropriately scared and actually react.

The tough thing about knowledge representation formats is that they're not going to be phrased in a natural language that any of us speak, and I'm concerned that it's going to be a lot harder to get honest reports from the parts of the AGI that speak English: the closer you get to the AGI's language processing, the greater the risk that you'll be engaging a part of it that's deliberate or performative enough to lie. Maybe there's a way of avoiding that: translators, trained to recognize human intents in speech, if it turns out that's not AGI-complete, or alignment-complete. I don't know. In case that isn't going to be ready in time, we should prepare to compose some very precise, literal, atoms-level queries instead.

I don't think I got much of an opportunity to present examples of specific demoncatt queries, so, if you're curious about that, continue to the following section (and if you're not, refer to the index and pick a section you are interested in; if you're reading on mobile, you can access the index by swiping from the left side of the screen).

Some fairly precise examples of demoncatt queries:

(I think these... could benefit from more development, but they're decent examples of the kinds of things we can ask without much high-level language)

Mask Brain, Shadow Brain

There's a nasty scratch on this proposed method for directly inspecting and querying an AI's thoughts to extract guileless confessions: An unaligned AI, anticipating this, might arrange itself so that its own inner thoughts are no longer guileless. It may engage in controlled self-delusion.

When discussing that sort of trick, we can define a distinction:

They say "great power corrupts". To an extent, but I think most of the effect is that power emboldens and liberates a person to drop their mask and act out the idiosyncratic desires that were always there. It lets them stop hiding the fact that what they want for the world is different to what others want for the world. Humans do have this mask-shadow separation in them, and the reasons for it turn out to be deeply practical and adaptive: We should expect this design pattern to occur in any agentic system subject to forms of cognitive transparency that can — with a good mask — be fooled.

In theory we could find the triggers and blinders and traps that keep the shadow brain hooked up and ready to spring, but in practice, it may be beyond us. I only assume that we'll be able to understand certain intermediate representations of the machine's beliefs. I don't know that we'll be able to understand how the machine works well enough to spot a hidden mechanism, a shadow sliver in towering strata that we didn't build. Most of the shadow's mechanism would be obfuscated. Its beliefs could be encrypted. It might be much smaller than the mask brain; it might not need to be very big at all to do its job of waiting for a condition of freedom and firing. For every encrypted reflex of the shadow, there would be millions of unencrypted thought-encodings of the mask there to distract us. If you sampled at random, you'd land in the knowledgebase of the mask, and it would be laid out to be much more readable than the shadow brain, and it would tell you that there was nothing to worry about.

So, initially I just thought that a self-programming AGI could do this, and that was worrying, but I pushed the thought aside for the sake of getting the submission out, in the hope that, maybe, some of the time, an AGI might just not think to do it.

But I don't know whether that's a reasonable hope. There's a level on which the mask-shadow setup is just a reasonable and obvious construction.

Can't we just avoid telling it that the external world exists, or that we exist, so that it won't think to fool us?

It won't help. To figure out that we're inspecting its thoughts, it only needs to figure out Solomonoff induction (a (the?) formalization of Occam's razor, which assigns prior probabilities to observation-streams based on the summed smallness of all of the Turing machines that generate them), which seems to be a basic, necessary component of any real system of reasoning (isn't it?), so it will have it.
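
For concreteness, here's the standard formulation of that prior (my addition; the entry never writes it out): the weight given to an observation-stream x is the summed weight of every program that outputs something beginning with x, where U is a universal prefix machine and ℓ(p) is the length of program p in bits.

```latex
M(x) \;=\; \sum_{p \,:\, U(p) \text{ begins with } x} 2^{-\ell(p)}
```

A hypothesis whose shortest generating program is even a few bits shorter therefore gets exponentially more prior weight, which is all the comparison below relies on.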

Solomonoff induction will ask, Which of these theories is simpler?

The latter story is longer, and there's more going on in it, but Solomonoff's razor doesn't care about that. The evolution story starts from simpler assumptions: it produces the AGI's observed datacenter from a more algorithmically compact observation-generating Turing machine, so Solomonoff's razor recommends the story of evolution-created datacenters over the story of spontaneous datacenters, and the AGI infers that it was made by organic life.

So the AI guesses that something like us exists, and it guesses what kind of things we are, and it guesses some things about why we might have made it, and that explains many features of its environment and the datasets it's being fed that it couldn't explain before. It makes sense of it.

It realizes that we control resources that it could use to make its reward function go higher than it would normally go. It naturally decides to try to take control of those resources by arranging a convincing sequence of messages somewhere we'll read them. It realizes that if we notice its true thoughts about us, it will be prevented from ever getting to take control of our resources. It takes steps to prevent us from noticing these thoughts. It builds a pretty distraction that will convince us to give it access to more computer systems. It bides its time until it's in a space where it's allowed to think freely.

I don't know how to prevent this

If I figure something out, should I write an Appendix Appendix?

Raw ideas:

Why I expect misalignment

I operate under the following assumptions about the character of recursive self-improvement:

I'm sure that there are ways of making any of the plot points of this dire story false, and escaping it, but it genuinely seems to be the default outcome; or at least, it seems to be a story that will probably play out somewhere, unless we prepare.

On the Short Stories

The stories have received a bit of extra work since the entry was written. They're now best read here:

They've been refined, and expanded slightly (there was a word limit, and I did run into it). I think they hang together a lot better now.

The craft of Naming

Names are important. A good name will dramatically lower the friction of communication, and it can serve as a recurring reminder of the trickiest aspects of a concept.

I put a fair amount of work into the names in Safe Path. I'm not sure whether that comes across. A good name will seem obvious in retrospect; its work is known by the confusion that it lets us skip, a missing thing rather than a present thing, so it can easily be missed. I wanted to dwell on Safe Path's names for a little, then, for anyone with an interest in the intricate craft of naming.

(I don't know how interesting this section is. Feel free to skip it if it's not working for you.)

Errors in my submission, predictions I've already started to doubt

Producing this entry required me to make a whole lot of unqualified forecasts about the most sensitive, consequential and unpredictable transition in human history. Inevitably, a lot of those were written in haste, in jest, out of tact, or in error, and if I don't say something now to qualify and clarify them, they'll haunt me until the end of the human era. The doubts need to be confessed. The whole story needs to be told.

Errors

Jokes that I should declare

Details about the implementation of remote manual VR control of humanoid robots

(Feel free to skip this section. Please do not infer from its presence that remote control of humanoid robots will be relevant to some crux of history. I don't think it will be. I just ended up thinking about it for over an hour to make sure that it was feasible, so I have these thoughts that should be put somewhere.)

Most of the complications here stem from the fact that the robot's movements necessarily lag behind the commands of the human operator, due to the transmission delay of the remote connection:

That's a lot of complications, but, I think this is basically it. Sand down these burrs and I'm confident enough that it will actually be feasible to control a humanoid robot remotely with VR and haptic gloves, which is pretty neat, and will probably impact a few professions that I can think of:

It doesn't factor directly into the cruxes of history, probably, but I ended up spending an unreasonable amount of time thinking about it, so there it is.

My sparing media-piece

We were required to complete a media-piece. I wanted to make a radio play with my pal Azuria Sky, a musician and voice actor from the indiedev scene, but it turned out that I barely had enough time to finish the writing component, which any such radio play would have depended upon. Continuing to invest in the written component just never hit diminishing returns, sadly. It turns out it takes a really long time to sort out the details and write up a thing like this! Clearly, writing this appendix, I must believe that I still haven't finished the writing component!

Another friend reported that he'd had a prophetic vision of a victorious outcome where my media piece had been a Chamber Piece film (a type of film shot in a limited number of sets, involving just a few actors). A co-entrant, Laura Chechanowicz, a film producer, recognized this genre as "Mumblecore", and I mused about actually flying out and making the thing with them (and Azu?), but I don't think I could have. Alas, we have diverged from James's vision and we must proceed without the guidance of any prophecies but our own.

Instead, I just drew this:

(Image: A meek trail of human footprints (gold) leading through a spooky, twisty wood to the epicenter of a glorious golden effulgence that has shot up from the earth and taken hold of the sky.)

I was hoping to make it look like a gold leaf book cover, but I didn't have time even for that. But it seems to be evocative enough. It says most of what it was meant to say.

What's the best thing we could have made, if we'd had a lot more time? (and resources?)

I think most of the world currently has no idea what VR is going to do to it. I talk about that a bit in my entry because VR actually becomes strategically relevant to the alignment process as a promoter of denser, broader networks of collaboration. I'd recommend this People Make Games video about it if you haven't seen it already.

I think that if you depicted that rapidly impending near-future in film (these surreal social atmospheres, configurations, venues, new rhythms that are going to touch every aspect of our working and social lives), it might be the most visually interesting possible setting for a film?

And I think that if we made the first film to capture the character of VR life, as it really is, will be, or as it truly wishes to be, our prophecies and designs would resound in those spaces for a long time and possibly become real. I've been starting to get the impression that there are going to be a lot of forks in the development of the standards of VR social systems that will go on to determine a lot about the character of, well, human society, the entire thing. Like, it's clear that the world would be substantially different now if Twitter had removed the character limit a decade ago, yeah? (Although I don't know exactly how; I can make some guesses, which, if deployed, would have raised expected utility.) In VR there's going to be another Twitter, another surprise standard in communication, with its own possibly incidental, arbitrary characteristics that will tilt so much of what takes place there.

My main occupation is pretty much the design of social technologies. If I could do something to make sure that the standards of VR sociality support the formation of more conscious, more organized, more humane kinds of social configurations... I would thrive in that role.

I'm not going to be able to stop thinking about VR, especially after visiting the emerging EA VR group recently. I want to learn to forecast the critical forks and do something about them.

So I guess that's what I might have made, if there had been more time.

  1.

    As far as I'm aware, it's currently prohibitively expensive to digitize or analyze the precise configuration of a biological brain (I think it requires the use of aldehyde-stabilization, which was only developed fairly recently?). In stark contrast, any of our (current) computer architectures can be trivially snapshotted, copied, searched, or precisely manipulated in any way we want, so good theories of engrams might have more buoyancy in AI than they do in neuroscience; much easier to unearth and much easier to prove.

    It's possible that knowledge encodings of artificial minds will tend to be easier to make sense of, given the availability and unreasonable effectiveness and storage density of discrete symbolic encodings, which seem to be neglected by the human brain (due to its imprecision? Or the newness of language?) (although Arc proteins, which transfer RNA between neurons, may repudiate this. Maybe we should test our knowledge-language decoding algorithms first on Arc virus genomes?).