AGI and Lock-In

By Lukas Finnveden, Jess_Riedel, CarlShulman @ 2022-10-29T01:56 (+146)

This is a linkpost to https://docs.google.com/document/d/1mkLFhxixWdT5peJHq4rfFzq4QbHyfZtANH1nou68q88/edit#

The long-term future of intelligent life is currently unpredictable and undetermined. In the linked document, we argue that the invention of artificial general intelligence (AGI) could change this by making extreme types of lock-in technologically feasible. In particular, we argue that AGI would make it technologically feasible to (i) perfectly preserve nuanced specifications of a wide variety of values or goals far into the future, and (ii) develop AGI-based institutions that would (with high probability) competently pursue any such values for at least millions, and plausibly trillions, of years.

The rest of this post contains the summary (6 pages), with links to relevant sections of the main document (40 pages) for readers who want more details.

(See here for links to an HTML and PDF version of the full text.)

0.0 The claim

Life on Earth could survive for millions of years. Life in space could plausibly survive for trillions of years. What will happen to intelligent life during this time? Some possible claims are:

A. Humanity will almost certainly go extinct in the next million years.

B. Under Darwinian pressures, intelligent life will spread throughout the stars and rapidly evolve toward maximal reproductive fitness.

C. Through moral reflection, intelligent life will reliably be driven to pursue some specific “higher” (non-reproductive) goal, such as maximizing the happiness of all creatures.

D. The choices of intelligent life are deeply, fundamentally uncertain. It will at no point be predictable what intelligent beings will choose to do in the following 1000 years.

E. It is possible to stabilize many features of society for millions or trillions of years. But it is possible to stabilize them into many different shapes — so civilization’s long-term behavior is contingent on what happens early on.

Claims A-C assert that the future is basically determined today.  Claim D asserts that the future is, and will remain, undetermined. In this document, we argue for claim E: Some of the most important features of the future of intelligent life are currently undetermined but could become determined relatively soon (relative to the trillions of years life could last).

In particular, our main claim is that artificial general intelligence (AGI) will make it technologically feasible to construct long-lived institutions pursuing a wide variety of possible goals. We can break this into three assertions, all conditional on the availability of AGI:

  1. It will be possible to preserve highly nuanced specifications of values and goals far into the future, without losing any information.
  2. With sufficient investments, it will be feasible to develop AGI-based institutions that (with high probability) competently and faithfully pursue any such values until an external source stops them, or until the values in question imply that they should stop.
  3. If a large majority of the world's economic and military powers agreed to set up such an institution, and gave it the power to defend itself against external threats, that institution could pursue its agenda for at least millions of years (and perhaps for trillions).

Note that we’re mostly making claims about feasibility as opposed to likelihood. We only briefly discuss whether people would want to do something like this in Section 2.2.

(Relatedly, even though the possibility of stability implies claim E in the list above, there could still be a strong tendency towards worlds described by one of the other options A-D. In practice, we think D seems unlikely, but that you could make reasonable arguments that any of the endpoints described by A, B, or C is probable.)

Why are we interested in this set of claims? There are a few different reasons:

We will now go over claims 1, 2, and 3 from above in more detail.

0.1 Preserving information

At the beginning of human civilization, the only way to preserve information was to pass it down from generation to generation, with inevitable corruption along the way. The invention of writing significantly boosted civilizational memory, but writing has relatively low bandwidth. By contrast, the invention of AGI would enable the preservation of entire minds. With whole-brain emulation (WBE), we could preserve entire human minds and ask them what they would think about future choices. Even without WBE, we could preserve newly designed AGI minds that would give (mostly) unambiguous judgments of novel situations. (See section 4.1.)

Such systems could encode information about a wide variety of goals and values, for example:

Crucially, using digital error correction, it would be extremely unlikely that errors would be introduced even across millions or billions of years. (See section 4.2.) Furthermore, values could be stored redundantly across many different locations, so that no local accident could destroy them. Wiping them all out would require either (i) a worldwide catastrophe, or (ii) intentional action. (See section 4.3.)
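
As a minimal illustration of the mechanism (a toy sketch of ours, not the schemes analyzed in section 4.2): a repetition code stores each bit several times and decodes by majority vote, so with per-bit flip probability p, a threefold-encoded bit is corrupted only when at least two of its copies flip, i.e. with probability of roughly 3p^2.

```python
# Toy sketch of digital error correction: a 3-fold repetition code with
# majority-vote decoding. Parameters are illustrative only.
import random

def encode(bits, n=3):
    # Store each logical bit as n physical copies.
    return [b for b in bits for _ in range(n)]

def corrupt(bits, p):
    # Flip each physical bit independently with probability p.
    return [b ^ (random.random() < p) for b in bits]

def decode(bits, n=3):
    # Recover each logical bit as the majority vote of its n copies.
    groups = [bits[i:i + n] for i in range(0, len(bits), n)]
    return [int(sum(g) > n // 2) for g in groups]

message = [random.randint(0, 1) for _ in range(10_000)]
received = decode(corrupt(encode(message), p=0.001))
errors = sum(a != b for a, b in zip(message, received))
print(f"residual errors: {errors} / {len(message)}")  # ~3e-6 per logical bit expected, so typically 0
```

Real schemes are far more efficient than simple repetition, and periodic re-encoding (as discussed in the comment thread below) keeps errors from accumulating over time.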

0.2 Executing intentions

So let’s say that we can store nuanced sets of values. Would it be possible to design an institution that stays motivated to act according to those values?

Today, tasks can only be delegated to humans, whose goals and desires often differ from those of the delegator. With AGI, all tasks necessary for an institution's survival could instead be automated, performed by artificial minds rather than biological humans. We will discuss the following two questions:

0.2.1 Aligning AGI

Currently, humanity knows less about how to predict and control the behavior of advanced AI systems than about how to predict and control the behavior of humans. The problem of how to control the behaviors and intentions of AI is commonly known as the alignment problem, and we do not yet have a solution to it.

However, there are reasons why it could eventually be far more robust to delegate problems to AGI than to rely on (biological) humans:

Thus, we suspect that an adequate solution to AI alignment could be achieved given sufficient time and effort. (Though whether that will actually happen is a different question, not addressed since our focus is on feasibility rather than likelihood.)

Note also that if we don’t make substantial progress on the alignment problem, but keep building AI systems that are ever more capable and numerous, this could eventually lead to permanent human disempowerment. In other words, if this particular step of the argument doesn’t go through, the alternative is probably not a business-as-usual human world (without the possibility of stable institutions), but a future where misaligned AI systems rule the world.

(For more, see section 5.)

0.2.2 Preventing drift

As mentioned in section 0.1, digital error correction could be used to losslessly preserve the information content of values. But this doesn’t entirely remove the possibility of value-drift.

In order to pursue goals, AGI systems need to learn many facts about the world and update their heuristics of how to deal with new challenges and local contexts. Perhaps it will be possible to design AGI systems with goals that are cleanly separated from the rest of their cognition (e.g. as an explicit utility function), such that learning new facts and heuristics doesn’t change the systems’ values. But the one example of general intelligence we have — humans — instead seem to store their values as a distributed combination of many heuristics, intuitions, and patterns of thought. If the same is true for AGI, it is hard to be confident that new experiences would not occasionally cause their values to shift. 

Thus, although it’s not clear how much of a concern this will be, we will discuss how an institution might prevent drift even if individual AI systems sometimes changed their goals. Possible options include:

Some of these options might reveal inputs where AI systems systematically behave badly, or where it’s not clear if they’re behaving well or badly. For example, they might:

In most cases, the reason for the discrepancy could probably be identified, and the AI design modified to act as desired. But it’s worth noting that even in situations where it remains unclear what the desired behavior is, or where it’s somehow difficult to design a system that responds in the desired way, a sufficiently conservative institution could simply opt to prevent AI systems from being exposed to such inputs (picking some suboptimal but non-catastrophic resolution to any dilemmas that can’t be properly considered without them).

Given all these options, it seems more likely than not that an institution could practically eliminate any internal sources of drift that it wanted to. (For more, see section 6.)

0.3 Preventing disruption

So let’s say that it will remain mostly unambiguous what an institution is supposed to do in any given situation, and furthermore that the institution will keep being motivated to act that way.

Now, let’s consider a situation where this institution — at least temporarily — has uncontested military and economic dominance (let’s call this a “dominant institution”). Let’s also say that the institution’s goals include a consequentialist drive to maintain that dominance (at least instrumentally). Could the institution do this? On our best guess, the answer would be “yes” (with exceptions for encountering alien civilizations, and for the eventual end of usable resources).

Any resources, information, and agents necessary for the institution’s survival could be copied and stored redundantly across the Earth (and, eventually, other planets). Thus, in order to prevent the institution from rebuilding, an event would need to be global in scope. 

As we argue in section 7, natural events of civilization-threatening magnitude are rare, and the main mechanism they have to pose a global threat to human civilization is that they would throw up enough dust to blot out the sun for a few years. A well-prepared AI civilization could easily survive such events by having energy sources that don’t depend on the sun. In a few billion years, the expansion of the Sun will prevent further life on Earth, but a technologically sophisticated stable institution could avoid destruction by spreading to space.

As we argue in section 8, a dominant institution could also prevent other intelligent actors from disrupting the institution. Uncontested economic dominance would allow the institution to manufacture and control loyal AGI systems that far outnumber any humans or non-loyal AI systems. Thus, insofar as any other actors could pose a threat, it would be economically cheap to surveil them as much as necessary to suppress that possibility. In practice, this could plausibly just involve enough surveillance to:

The main exception to this is alien civilizations, which could at first contact already be more powerful than the Earth-originating institution.

Ultimately, the main boundaries to a stable, dominant institution would be (i) alien civilizations, (ii) the eventual end of accessible resources predicted by the second law of thermodynamics, and (iii) any disruptive Universe-wide physical events (such as a Big Rip scenario), although to our knowledge no such events are predicted by standard cosmology.

0.4 Some things we don’t argue for

To be clear, here are two things that we don’t argue for:

First, we don’t think that the future is necessarily very contingent, from where we stand today. For example, it might be the case that almost no humans would make an ultra-stable institution that pursues a goal that those humans themselves couldn’t later change (if they changed their mind). And it might be the case that most humans would eventually end up with fairly similar ideas about what is good to do, after thinking about it for a sufficiently long time.

Second, we don’t think that extreme stability (of the sort that could make the future contingent on early events) would necessarily require a lot of dedicated effort. The options for increasing stability that we sketch in sections 0.2.2 and 6, and the assumption of a singleton-like entity in sections 0.3 and 8, are brought up to make the point that stability is feasible at least in those circumstances. It seems plausible that they wouldn’t be necessary in practice. Perhaps stability would require only a small amount of effort. Perhaps the world’s values would stabilize by default given the (not very unlikely) combination of:

0.5 Structure of the document

Readers should feel free to skip to whatever parts they’re interested in. (See also the table of contents.)

Contributions

Lukas Finnveden was the lead author. Some parts of this document started as an unfinished report prepared by Jess Riedel while he was an employee at Open Philanthropy. Carl Shulman contributed many of the ideas, and both Jess and Carl provided multiple rounds of comments. Lukas did most of the work while he was part of the Research Scholars Programme at the Future of Humanity Institute (although at the time of publishing, he works for Open Philanthropy). All views are our own.
 


Lizka @ 2022-11-02T13:50 (+21)

I'm curating this (although I wish it had a more skimmable summary). 

It's an important topic (and a weak point in the classic most important century discussion) and a lot of the considerations[1] seem important and new (at least to me!). I like that the post and document make a serious attempt at clarifying what isn't being said (like some claims about likelihood), flag different levels of uncertainty in the various claims, and clarify what is meant by "AGI"[2].

Here's a quick attempt at a restructured/slightly paraphrased summary — please correct me if I got something wrong: 

  1. ^
  2. ^ See here
  3. ^ For the different levels of confidence the authors have in these arguments, you can look at this section in the document.

Lukas_Finnveden @ 2022-11-03T06:23 (+5)

Thanks Lizka. I think of section 0.0 as a ~1-page summary (in between the 1-paragraph summary and the 6-page summary), but I could have flagged more clearly that it can be read that way. And your bullet-point summary is definitely even punchier.

Wei_Dai @ 2022-10-29T07:18 (+12)

Consider a civilization that has "locked in" the value of hedonistic utilitarianism. Subsequently some AI in this civilization discovers what appears to be a convincing argument for a new, more optimal design of hedonium, which purports to be 2x more efficient at generating hedons per unit of resources consumed. Except that this argument actually exploits a flaw in the reasoning processes of the AI (which is widespread in this civilization) such that the new design is actually optimized for something different from what was intended when the "lock in" happened. The closest this post comes to addressing this scenario seems to be "An extreme version of this would be to prevent all reasoning that could plausibly lead to value-drift, halting progress in philosophy." But even if a civilization was willing to take this extreme step, I'm not sure how you'd design a filter that could reliably detect and block all "reasoning" that might exploit some flaw in your reasoning process.

Maybe in order to prevent this, the civilization tried to lock in "maximize the quantity of this specific design of hedonium" as its goal, instead of hedonistic utilitarianism in the abstract. But 1) maybe the original design of hedonium is already flawed or highly suboptimal, and 2) what if (as an example) some AI discovers an argument that they should engage in acausal trade in order to maximize the quantity of hedonium in the multiverse, except that this argument is actually wrong.

This is related to the problem of metaphilosophy, and my hope that we can one day understand "correct reasoning" well enough to design AIs that we can be confident are free from flaws like these, but I don't know how to argue that this is actually feasible.

Lukas_Finnveden @ 2022-10-29T16:40 (+2)

I broadly agree with this. For the civilizations that want to keep thinking about their values or the philosophically tricky parts of their strategy, there will be an open question about how convergent/correct their thinking process is (although there's lots you can do to make it more convergent/correct — e.g. redo it under lots of different conditions, have arguments reviewed by many different people/AIs, etc.).

And it does seem like all reasonable civilizations should want to do some thinking like this. For those civilizations, this post is just saying that other sources of instability could be removed (if they so chose, and insofar as that was compatible with the intended thinking process).

Also, separately, my best guess is that competent civilizations (whatever that means) that were aiming for correctness would probably succeed (at least in areas where correctness is well defined). Maybe by solving metaphilosophy and doing that, maybe because they took lots of precautions like those mentioned above, maybe just because it's hard to get permanently stuck at incorrect beliefs if lots of people are dedicated to getting things right, have all the time and resources in the world, and are really open-minded. (If they're not open-minded but feel strongly attached to keeping their current views, then I become more pessimistic.)

But even if a civilization was willing to take this extreme step, I'm not sure how you'd design a filter that could reliably detect and block all "reasoning" that might exploit some flaw in your reasoning process.

By being unreasonably conservative. Most AIs could be tasked with narrowly doing their job, a few with pushing forward technology/engineering, none with doing anything that looks suspiciously like ethics/philosophy.  (This seems like a bad idea.)

Jess_Riedel @ 2022-10-29T16:14 (+1)

Just to be clear: we mostly don’t argue for the desirability or likelihood of lock-in, just its technological feasibility. Am I correctly interpreting your comment as cautionary, questioning the desirability of lock-in given the apparent difficulty of doing so while maintaining sufficient flexibility to handle unforeseen philosophical arguments?

Wei_Dai @ 2022-10-29T17:45 (+6)

To take a step back, I'm not sure it makes sense to talk about the "technological feasibility" of lock-in, as opposed to, say, its expected cost: if the only feasible method of lock-in causes you to lose 99% of the potential value of the universe, that seems like a more important piece of information than "it's technologically feasible".

(On second thought, maybe I'm being unfair in this criticism, because feasibility of lock-in is already pretty clear to me, at least if one is willing to assume extreme costs, so I'm more interested in the question of "but can it be done at more acceptable costs", but perhaps this isn't true of others.)

That aside, I guess I'm trying to understand what you're envisioning when you say "An extreme version of this would be to prevent all reasoning that could plausibly lead to value-drift, halting progress in philosophy." What kind of mechanism do you have in mind for doing this? Also, you distinguish between stopping philosophical progress and stopping technological progress, but since technological progress often requires solving philosophical questions (e.g., related to how to safely use the new technology), do you really see much of a distinction between the two?

trammell @ 2022-11-04T11:55 (+10)

Thanks, great post!

You say that "using digital error correction, it would be extremely unlikely that errors would be introduced even across millions or billions of years. (See section 4.2.)" But that's not entirely obvious to me from section 4.2. I understand that error correction is qualitatively very efficient, as you say, in that the probability of an error being introduced per unit time can be made as low as you like at the cost of only making the string of bits a certain small-seeming multiple longer (and my understanding is that the multiple shrinks the longer the original string is?). But for any multiple, there's some period of time long enough that the probability of faithfully maintaining some string of bits for that long is low. Is there any chance you could offer an estimate of, say, how much longer you'd have to make a petabyte in order to get the probability of an error over a billion years below 1%?

Lukas_Finnveden @ 2022-12-05T00:05 (+16)

This is a great question. I think the answer depends on the type of storage you're doing.

If you have a totally static lump of data that you want to encode in a harddrive and not touch for a billion years, I think the challenge is mostly in designing a type of storage unit that won't age. Digital error correction won't help if your whole magnetism-based harddrive loses its magnetism. I'm not sure how hard this is.

But I think more realistically, you want to use a type of hardware that you regularly use, regularly service, and where you can copy the information to a new harddrive when one is about to fail. So I'll answer the question in that context.

As an error rate, let's use the failure rate of 3.7e-9 per byte per month ~= 1.5e-11 per bit per day from this stack overflow reply.  (It's for RAM, which I think is more volatile than e.g. SSD storage, and certainly not optimised for stability, so you could probably get that down a lot.)

Let's use the following as an error correction method: Each bit is represented by N bits; for any computation the computer does, it will use the majority vote of the N bits; and once per day,[1] each bit is reset to the majority vote of its group of bits.

If so...

  • for N=1, the probability that a bit is stable for 1e9 years is ~exp(-1.5e-11*365*1e9)=0.4%. Yikes!
  • for N=3, the probability that 2 bit flips happen in a single day is ~3*(1.5e-11)^2, and so the probability that a group of bits is stable for 1e9 years is ~exp(-3*(1.5e-11)^2*365*1e9)=1-2e-10. Much better, but there will probably still be around a million errors in that petabyte of data.
  • for N=5, the probability that 3 bit flips happen in a single day is ~(5 choose 3)*(1.5e-11)^3, and so the probability that the whole petabyte of data is safe for 1e9 years is ~99.99%. So on this scheme, it seems that 5 petabytes of storage is enough to make 1 petabyte stable for a billion years. (A short script reproducing these numbers is included below.)
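
Here is that script, a rough sanity check of the bullet points above under the same assumptions (flip rate of 1.5e-11 per bit per day, one majority-vote reset per day, 1e9 years of operation, one petabyte = 8e15 logical bits):

```python
# Back-of-envelope check of the N=1/3/5 numbers above.
from math import comb, exp

p_flip = 1.5e-11     # assumed flip rate per bit per day (RAM figure above)
days = 365 * 10**9   # one billion years of daily reset cycles
n_bits = 8e15        # one petabyte of logical bits

for n in (1, 3, 5):
    k = n // 2 + 1                                     # flips needed in one day to corrupt the majority
    p_group_per_day = comb(n, k) * p_flip**k           # per-group corruption probability per day
    p_bit_ok = exp(-p_group_per_day * days)            # one logical bit survives 1e9 years
    p_petabyte_ok = exp(-p_group_per_day * days * n_bits)  # the whole petabyte survives
    print(f"N={n}: per-bit survival {p_bit_ok:.10f}, petabyte survival {p_petabyte_ok:.2%}")
```

This reproduces the ~0.4% figure for N=1, the ~1-2e-10 per-bit figure (but an error-riddled petabyte) for N=3, and the ~99.99% whole-petabyte figure for N=5.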

Based on the discussion here, I think the errors in doing the majority-voting calculations themselves are negligible compared to the cosmic-ray errors estimated above. At least if you do it cleverly, so that you don't get too many correlations and ruin your redundancy (which there are ways to do, according to results on error-correcting computations — though I'm not sure if they might require some fixed amount of extra storage space, in which case you might need N somewhat greater than 5).

Now this scheme requires that you have a functioning civilization that can provide electricity for the computer, that can replace the hardware when it starts failing, and stuff — but that's all things that we wanted to have anyway. And any essential component of that civilization can run on similarly error-corrected hardware.

And to account for larger-scale problems than cosmic rays (e.g. a local earthquake throws the harddrive to the ground and shatters it, or you accidentally erase a file when you were supposed to make a copy of it), you'd probably want backup copies of the petabyte in different places across the Earth, which you replaced each time something happened to one of them. If there's a 0.1% chance of that happening on any one day (corresponding to about once every 3 years, which seems like an overestimate if you're careful), and you immediately notice it and replace the copy within a day, and you have 5 copies in total, the probability that one of them keeps working at all times is ~exp(-(0.001)^5*365*1e9)~=99.96%. So combined with the previous 5, that'd be a multiple of 5*5=25.
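
And a similar sanity check for the geographic redundancy, again just a sketch using the assumed numbers above (five copies, a 0.1%/day loss rate per copy, failed copies replaced within a day):

```python
# Probability that at least one of 5 geographically separated copies
# survives at all times over 1e9 years, with failed copies replaced within a day.
from math import exp

p_loss = 1e-3    # assumed chance per copy per day of a copy-destroying event
copies = 5
days = 365 * 10**9

p_all_lost_same_day = p_loss**copies          # all copies destroyed before any is replaced
p_survive = exp(-p_all_lost_same_day * days)
print(f"survival probability over 1e9 years: {p_survive:.2%}")  # ~99.96%, matching the above
```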

This felt enlightening. I'll add a link to this comment from the doc.

  1. ^

    Using a day here rather than an hour or a month isn't super-motivated. If you reset things very frequently, you might interfere with normal use of the computer, and errors in the resetting-operation might start to dominate the errors from cosmic rays. But I think a day should be above the threshold where that's much of an issue.

trammell @ 2022-12-05T12:49 (+6)

Cool, thanks for thinking this through!

This is super speculative of course, but if the future involves competition between different civilizations / value systems, do you think having to devote say 96% (i.e. 24/25) of a civilization's storage capacity to redundancy would significantly weaken its fitness? I guess it would depend on what fraction of total resources are spent on information storage...?

Also, by the same token, even if there is a "singleton" at some relatively early time, mightn't it prefer to take on a non-negligible risk of value drift later in time if it means being able to, say, 10x its effective storage capacity in the meantime?

(I know your 24/25 was a conservative estimate in some ways; on the other hand it only addresses the first billion years, which is arguably only a small fraction of the possible future, so hopefully it's not too biased a number to anchor on!)

Lukas_Finnveden @ 2022-12-05T22:25 (+6)

Depends on how much of their data they'd have to back up like this. If every bit ever produced or operated on instead had to be 25 bits — that seems like a big fitness hit. But if they're only this paranoid about a few crucial files (e.g. the minds of a few decision-makers), then that's cheap.

And there's another question about how much stability contributes to fitness. In humans, cancer tends to not be great for fitness. Analogously, it's possible that most random errors in future civilizations would look less like slowly corrupting values and more like a coordinated whole splintering into squabbling factions that can easily be conquered by a unified enemy. If so, you might think that an institution that cared about stopping value-drift and an institution that didn't would both have a similarly large interest in preventing random errors.

Also, by the same token, even if there is a "singleton" at some relatively early time, mightn't it prefer to take on a non-negligible risk of value drift later in time if it means being able to, say, 10x its effective storage capacity in the meantime?

The counter-argument is that it will be super rich regardless, so it seems like satiable value systems would be happy to spend a lot on preventing really bad events from happening with small probability. Whereas insatiable value systems would notice that most resources are in the cosmos, and so also be obsessed with avoiding unwanted value drift. But yeah, if the values contain a pure time preference, and/or don't care that much about the most probable types of value drift, then it's possible that they wouldn't deem the investment worth it.

MichaelPlant @ 2022-11-03T11:04 (+4)

Just skimmed this, but I notice there seems to be something inconsistent between this and the usual AI doomerism stuff. For instance, above you claim that we should be worried about values lock-in because we will be able to align AI - cf. doomerism that says alignment won't work; equally, above you state that value drift could be prevented by 'turning the AGI off and on again' - which is, again, at odds with the doomerist claim that we can't do this. I'm unsure what to make of this tension.

Lukas_Finnveden @ 2022-11-03T18:16 (+4)

Quoting from the post:

Thus, we suspect that an adequate solution to AI alignment could be achieved given sufficient time and effort. (Though whether that will actually happen is a different question, not addressed since our focus is on feasibility rather than likelihood.)

AI doomers tend to agree with this claim.  See e.g. Eliezer in list of lethalities:

None of this is about anything being impossible in principle.  The metaphor I usually use is that if a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months.  (...) What's lethal is that we do not have the Textbook From The Future telling us all the simple solutions that actually in real life just work and are robust; we're going to be doing everything with metaphorical sigmoids on the first critical try.  No difficulty discussed here about AGI alignment is claimed by me to be impossible - to merely human science and engineering, let alone in principle - if we had 100 years to solve it using unlimited retries, the way that science usually has an unbounded time budget and unlimited retries.  This list of lethalities is about things we are not on course to solve in practice in time on the first critical try; none of it is meant to make a much stronger claim about things that are impossible in principle.

N N @ 2022-11-08T11:21 (+2)

Stipulate, for the sake of the argument, that Lukas et al. actually disagree with the doomers about various points. What would follow from that?

Jim Buhler @ 2023-01-28T17:13 (+1)

Insightful! Thanks for writing this.

> Perhaps it will be possible to design AGI systems with goals that are cleanly separated from the rest of their cognition (e.g. as an explicit utility function), such that learning new facts and heuristics doesn’t change the systems’ values.

In that case, value lock-in is the default (unless corrigibility/uncertainty is somehow part of what the AGI values), such that there's no need for the "stable institution" you keep mentioning, right?

> But the one example of general intelligence we have — humans — instead seem to store their values as a distributed combination of many heuristics, intuitions, and patterns of thought. If the same is true for AGI, it is hard to be confident that new experiences would not occasionally cause their values to shift. 

Therefore, it seems to me that most of your doc assumes we're in this scenario? Is that the case? Did I wildly misunderstand something?

Lukas_Finnveden @ 2023-02-13T21:59 (+2)

If AGI systems had goals that were cleanly separated from the rest of their cognition, such that they could learn and self-improve without risking any value drift (as long as the values-file wasn't modified), then there's a straightforward argument that you could stabilise and preserve that system's goals by just storing the values-file with enough redundancy and digital error correction.

So this would make section 6 mostly irrelevant. But I think most other sections remain relevant, insofar as people weren't already convinced that being able to build stable AGI systems would enable world-wide lock-in.

Therefore, it seems to me that most of your doc assumes we're in this scenario [without clean separation between values and other parts]?

I was mostly imagining this scenario as I was writing, so when relevant, examples/terminology/arguments will be tailored for that, yeah.