“The Race to the End of Humanity” – Structural Uncertainty Analysis in AI Risk Models

By Froolow @ 2023-05-19T12:03 (+48)

Summary

1. Introduction

In very broad terms, models can be thought of as a collection of parameters, plus instructions for how we organise those parameters to correspond to some real-world phenomenon of interest. These instructions are described as the model’s ‘structure’, and include decisions like what parameters will be used to analyse the behaviour of interest and how those parameters will interact with each other. Because I am a bit of a modelling nerd I like to think of the structure as being the model’s ontology – the sorts of things that need to exist within the parallel world of the model in order to answer the question we’ve set for ourselves. The impact of structure on outcomes is often under-appreciated; the structure obviously constrains the sorts of questions you can ask, but it can also embody subtle differences in how the question is framed, which can have an important effect on outcomes.

In any real-world model worth building, both parameters and structure will be far too complex to capture in perfect detail. The art of building a good model is knowing when and how to make simplifying assumptions so that the dynamics of the phenomenon of interest can be investigated and (usually) the right decision can be made as a result. In my experience of teaching people to build models in a health economics context, people almost always have a laserlike focus on simplifying assumptions relating to parameters, but very frequently a blindspot when it comes to assumptions relating to structure. This is a bad habit - if your model structure is bad, spending a lot of time and effort making the parameters very accurate is not going to improve the quality of the model output overall for the same reason a bad recipe isn’t improved by following it more precisely. 

My observation is that this blindspot also appears to exist in AI Risk analysis. I conducted a systematic review of the EA / LessWrong / AI Alignment forums to identify previous work done on formal models of AI risk and noticed the same imbalance between effort spent refining parameters and effort spent refining structure. For example, descriptions of what constitutes structural decision-making such as this wiki entry are incomplete and misleading, and attempts to engage the community in discussions about structural uncertainty analysis such as here have not met with significant engagement (in fairness this piece is extremely technical!). 

To an outsider, one of the most obvious ways this imbalance reveals itself is in the difference between the number of approaches to parameterisation and the number of approaches to structure. When looking at parameters there is a nearly endless variety of methods represented – biological anchors, expert surveys, prediction markets, historical discontinuity approaches and probably a load I’ve missed because I’m not an expert in the field. By contrast there is – from what I can tell from my review – exactly one published structural model of AI Risk, Carlsmith (2021). This structure is replicated below. In lay terms we might think of the structure of the model as describing a number of weighted coinflips which can come up ‘good’ or ‘bad’. If all of these coinflips come up ‘bad’ then we have a Catastrophe. In this sort of model, the overall risk of Catastrophe is simply the probability of each of these individual steps multiplied together. Although only Carlsmith has extensively specified his model, you can see that this sort of ‘weighted coinflip’ approach is motivating other qualitative discussions of AI Risk structures (e.g. here).
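To make the ‘weighted coinflip’ structure concrete, here is a minimal sketch of how the overall risk falls out of the individual steps. The step labels are my loose paraphrases of Carlsmith’s premises and the probabilities are placeholders I have made up for illustration, not his published estimates:

```python
# Minimal sketch of the 'weighted coinflip' structure: the overall risk of
# Catastrophe is simply the product of the probabilities that each step comes
# up 'bad'. Step labels are loose paraphrases of Carlsmith's premises and the
# probabilities are illustrative placeholders, not his published estimates.
step_probabilities = {
    "Powerful, agentic, strategically aware AI becomes feasible": 0.65,
    "There are strong incentives to build and deploy it":         0.80,
    "Building aligned systems is much harder than misaligned":    0.40,
    "Misaligned, power-seeking systems are deployed anyway":      0.50,
    "Deployed misaligned systems disempower humanity":            0.65,
    "Disempowerment amounts to an existential Catastrophe":       0.95,
}

p_catastrophe = 1.0
for step, p_bad in step_probabilities.items():
    p_catastrophe *= p_bad

print(f"P(Catastrophe) = {p_catastrophe:.1%}")
```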

I don’t want to give the impression that Carlsmith’s approach is uncritically accepted as some kind of ‘universal modelling structure’. For example, Soares (2022) makes the excellent point that the Carlsmith model assumes through its structure that AI Catastrophe is a narrow target – six coinflips need to go against us in succession before Catastrophe occurs. Soares makes the case that in fact we could turn the problem on its head; assume that Catastrophe is the default and that a bunch of coinflips need to go in our favour for Catastrophe to be prevented. While it would have been ideal if Soares had also attached probability estimates to each step in his argument, it is a great example of the central point of this essay – model structure doesn’t just constrain the sorts of questions you can ask of a model, but also embodies subtle assumptions about how the question should be framed. Surfacing those assumptions so they can be properly debated is an important role of a modelling peer-review community[1].

Nevertheless, in comparison to the number and sophistication of analyses on the topic of AI Risk prediction, model structure is notably under-explored. With a few important exceptions like Soares’ piece, there seems to be a strong (but implicit?) consensus that there is a standard way of building a model of AI Risk, and limited incentive to explore other model structures[2]. I think there is a risk in not promoting more structural analysis in the AI Risk modelling space, since structural issues with models are often particularly pernicious – it takes an expert to spot ‘the dog that didn’t bark’ and identify that an entity or relationship which should exist in the model has instead been overlooked. This essay attempts to serve as a proof by example of the level of predictive imprecision which can arise if AI Risk modellers spend too long focussing on parameter uncertainty and not enough time on structural uncertainty.

2. “The Race to the End of Humanity”

As discussed above, the purpose of this essay is to identify an assumption in existing AI Risk models (ie Carlsmith’s paper) and show that results change dramatically when that assumption is not made. The idea is that by demonstrating that structure can be just as important as parameters I might encourage people with a bit more expert knowledge to start producing their own discussions about structure.

With that in mind, Section 2.1 identifies and describes the assumption I will be investigating. Section 2.2 describes one method of building a model without that assumption, and Section 2.3 offers some preliminary results from the model. I save all the conclusions for Section 3 so that you can safely skip all the technical modelling stuff and still get the main insights of the essay.

2.1 The assumption of 'time independence'

I am not an AI Risk expert so to avoid making an embarrassing error I am going to identify an assumption where the expert-level thinking has already been done for me by Apart Research, and I can focus entirely on modelling out the implications. Specifically, I am going to trace out the implications of uncertainty existing in the timelines for AI Alignment, and therefore the implication that we don’t have perfect knowledge of whether Alignment will come before Catastrophe. The motivation for this approach is described in this post (and summarised very nicely by the diagram below).

The easiest way to understand what is happening in the above graph is to switch the metaphor we use to describe the model we just looked at from one of weighted coinflips to one where we are watching a runner in a one-man hurdles race. The runner starts today, and then runs down the track until they encounter a hurdle – “AI Invented”, for example. These hurdles are of variable height, and the runner clears the hurdle with some probability depending on its height (if they fail to clear the hurdle they are disqualified). When the runner finishes the race, they get to push a button which destroys all of humanity. Although the ‘runner doing hurdles to get to a doomsday button’ is a pretty convoluted way of describing a ‘weighted coinflip’, I hope you see how it is the same underlying structural problem.

However, this ‘runner in a race’ metaphor makes a very strong structural assumption about what question we are interested in – in a normal race we don’t typically care if some particular runner finishes, but rather which runner finishes the race first. Instead of the weighted coinflip approach where AI is the only runner in the hurdles race, the real world looks more like a ‘Race to End Humanity’ amongst multiple competitors. For example, a naturally occurring Catastrophic asteroid impact, global pandemic, nuclear war etc are mutually exclusive with AI Catastrophe and so if one of those happens there is no possibility for AI to destroy humanity later. Once humanity is destroyed, there’s no prize for second place – even if the AI competitor finishes the race somehow they’ve lost their chance to push the button and the AI cannot itself cause a Catastrophe. This is also true if an external force prevents an AI from completing the race under any circumstances – perhaps one day we will be able to Align AIs so effectively that Unaligned AI ceases to be a risk.

The Carlsmith model asks the question, “What is the probability that the runner finishes the race (rather than being disqualified)?”, but this is potentially the wrong question to be asking; some runner will always finish the race over a long enough timeframe, because there are simply no serious hurdles to stop, for example, a stellar nova from wiping out humanity. What we are interested in instead is rather the probability that AI finishes the race first. We cannot answer the question “What is the probability that humanity will suffer an existential Catastrophe due to loss of control over an AGI system?” in isolation from other potential causes of Catastrophe, since the first Catastrophe (or Alignment) automatically precludes any other Catastrophe (or Alignment) in the relevant sense any human being cares about. And if we are interested in the first of something, we need a way to track when events are occurring. This can’t really be done in a ‘weighted coinflips’ model, so we can be fairly confident that Carlsmith’s model oversimplifies[3].
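A toy simulation illustrates why the two questions come apart. Everything in the sketch below (the disqualification probability and the finish-time distributions) is an arbitrary placeholder; the point is only that P(AI finishes) and P(AI finishes first) can differ substantially once other runners are on the track:

```python
# Toy illustration of P(the AI runner finishes) vs P(the AI runner finishes FIRST).
# All probabilities and distributions are arbitrary placeholders for illustration.
import random

N = 100_000
ai_finishes = ai_first = 0

for _ in range(N):
    # The AI runner clears its hurdles with 20% probability; otherwise it is
    # 'disqualified' and never finishes (represented by an infinite finish time).
    ai_time = random.uniform(10, 80) if random.random() < 0.20 else float("inf")
    # Two rival runners: Alignment, and an unrelated (e.g. natural) Catastrophe.
    alignment_time = random.uniform(20, 100)
    unrelated_catastrophe_time = random.expovariate(1 / 200)  # mean ~200 years

    if ai_time < float("inf"):
        ai_finishes += 1
        if ai_time < min(alignment_time, unrelated_catastrophe_time):
            ai_first += 1

print(f"P(AI finishes)       ~ {ai_finishes / N:.3f}")
print(f"P(AI finishes first) ~ {ai_first / N:.3f}")
```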

It happens that the specific change I have identified here has a technical name – ‘time dependency’. Time dependency is a general name for any situation where the model exhibits dynamic behaviour with respect to time; most commonly that parameters change over time (as in this case, where the probability of achieving Alignment changes over time) but you can have very exotic time dependent models too, where the entire structure of the model changes repeatedly over its runtime. The structural change this model needs can be summed up in a single sentence: Carlsmith’s model of Alignment is time-independent (‘static’) whereas I think the real-world process of Alignment is likely time-dependent (‘dynamic’). As an added bonus, adding time-dependency for the purpose of modelling Alignment will allow me to investigate other time-dependent elements like X-Risk and S-Risk which I’m fairly sure are unimportant but there’s no harm in checking!
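For a concrete example of the simplest kind of time dependency, here is a sketch in which the probability of having achieved Alignment is no longer a single static number but grows with the time available; the 2% annual ‘hazard rate’ is purely an illustrative placeholder:

```python
# Minimal sketch of a time-dependent parameter: rather than a single static
# probability of Alignment, the probability of having achieved Alignment by a
# given horizon grows with time. The 2% annual hazard rate is a placeholder.
import math

ANNUAL_HAZARD = 0.02  # illustrative chance of solving Alignment in any given year

def p_aligned_within(years: float) -> float:
    """Probability that Alignment has been achieved within the given horizon."""
    return 1 - math.exp(-ANNUAL_HAZARD * years)

for horizon in (10, 30, 50):
    print(f"P(Alignment within {horizon} years) = {p_aligned_within(horizon):.0%}")
```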

2.2 Time dependency in a Carlsmith-like model

If you consult a modelling textbook it will tell you in no uncertain terms that you should absolutely not try and encode time dependent behaviour into a ‘weighted coinflip’ model like the Carlsmith model, and that there are instead specific model structures which handle time dependence very gracefully. However, I wanted to try and keep the Carlsmith structure as far as I could, partly because I didn’t know what other assumptions I might accidentally encode if I deviated a long way off the accepted approach in the field and partly because I wanted to show that it was only the structural decision about time-dependency that was driving results (not any of the other modelling dark arts I would employ to create a more typical modelling structure for this kind of problem). Disappointingly, it turns out that the textbooks are bang on the money here – the model I came up with is extremely inelegant and in hindsight I wish I’d taken my own advice and thought a bit harder about model structure before I made it! 

Nevertheless, I did succeed in making a model structure that was almost exactly like the Carlsmith model but which had the ability to look at time-dependent results. The model is available here, and it might be helpful to have it open to follow along with the next section. The structure is below:

The general principle of the model is that each box represents an event which will occur with some probability (as in the Carlsmith model), but in addition each arrow indicating a transition between events represents a certain amount of time. There are three ‘runners’ in this model: ‘Unrelated Catastrophe’, ‘Out of Control AGI Outcome’ and ‘Good AI Outcome’. Once every runner has finished, I work out how long each took and declare the runner with the shortest time the winner. Parameter uncertainty about the time between each step is extremely important (otherwise the race will always have the same winner, provided that runner finishes), so I simulate the ‘race’ thousands of times with different starting parameters drawn from a survey of the AI Risk community and look at the probability that each participant wins.
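The sketch below is not the spreadsheet model itself, just a stripped-down illustration of the same ‘simulate the finish times, then take the earliest’ principle. The `survey` table, its field names and the way the tracks interact are all simplifying assumptions made for illustration:

```python
# Stripped-down sketch of the race logic (not the actual spreadsheet model).
# `survey` is a hypothetical table of per-respondent timeline estimates in years
# from now; field names, values and the way the tracks interact are simplifying
# assumptions made for illustration only.
import random

survey = [
    {"ooc_ai_catastrophe": 32, "alignment": 37, "unrelated_catastrophe": 120},
    {"ooc_ai_catastrophe": 50, "alignment": 25, "unrelated_catastrophe": 90},
    # ... one row per survey respondent ...
]

P_OOC_GIVEN_AI = 0.097  # mean survey estimate of OOC Catastrophe risk (held fixed)

def run_race() -> str:
    """Simulate one race and return the name of the first runner to finish."""
    row = random.choice(survey)  # 'SDO sampling': draw one respondent's timelines
    finish_times = {
        "Unrelated Catastrophe": row["unrelated_catastrophe"],
        "Good AI Outcome": row["alignment"],
    }
    # The OOC AI runner is 'disqualified' unless the fixed coinflip goes badly.
    if random.random() < P_OOC_GIVEN_AI:
        finish_times["OOC AI Catastrophe"] = row["ooc_ai_catastrophe"]
    return min(finish_times, key=finish_times.get)

runs = 100_000
tally = {}
for _ in range(runs):
    winner = run_race()
    tally[winner] = tally.get(winner, 0) + 1

for outcome, count in sorted(tally.items(), key=lambda kv: -kv[1]):
    print(f"{outcome}: {count / runs:.1%}")
```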

The three possible outcomes from the model are an Unrelated Catastrophe, an Out-of-Control (OOC) AGI Outcome and a Good AI Outcome, corresponding to whichever of the three ‘runners’ finishes the race first.

Note that ‘AI ALIGNMENT’ in the OOC AI Risk track represents Alignment being so trivially easy that we don’t have to do any work to achieve it, whereas AI Alignment in the green boxes represents Alignment which is tough but possible. The idea here is to get at the distinction between the bit of the Alignment problem which is affected by longer timelines and the bit that isn’t.

The model is populated from a survey of the AI Risk community I conducted in 2022. I am not a triallist and have no particular expertise in survey design beyond hobbyist projects, so in hindsight there are some design errors with the survey. In particular, note that the instrument does not condition on the catastrophe occurring by 2070, and since receiving feedback on my first AI essay I realise that some of the questions are quite ambiguous. Nevertheless, the responses from the survey seem mostly in line with the Carlsmith model / community consensus and have good internal consistency: 

The community believes that there is a 17.8% chance of Catastrophe conditional on AI being invented, but because the distinction between In-Control and Out-of-Control Catastrophe matters it is perhaps more relevant to say that the community believes there is a 9.7% risk of Out-of-Control AI Catastrophe. Of interest is that the median Catastrophe date is 2054-2057 (depending on the elicitation technique used) compared to a median Alignment date of 2060. The conventional model structure which uses median timelines indicates that we will miss Alignment by an agonisingly small margin.

The base case of the model uses the data in the following way:

The only feature of significant interest here is that I have set all the transition probabilities to a fixed value, corresponding to their mean (by contrast I use ‘SDO sampling’ for the timelines, which is just a fancy way of saying that I pick a value at random from the set which was presented in the survey). The reason for this is that I want to prove beyond any doubt that it is the structure of the model which drives the change in results, not any fancy sampling technique like I demonstrate in the sister essay to this one. For technical reasons it actually makes no difference to do things this way; I'm just aware that the audience of this post won't all be modelling experts, so I want to remove even the perception of a modelling sleight-of-hand.
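As a sketch of the distinction, with made-up numbers standing in for the survey responses:

```python
# Sketch of the two treatments described above: transition probabilities are
# fixed at their survey mean, while timelines are 'SDO sampled' (a value drawn
# at random from the raw set of survey responses). All numbers are placeholders.
import random

prob_responses     = [0.05, 0.10, 0.20, 0.08, 0.15]  # hypothetical per-respondent probabilities
timeline_responses = [2035, 2050, 2060, 2075, 2100]  # hypothetical per-respondent dates

p_transition = sum(prob_responses) / len(prob_responses)  # fixed at the mean for every run
sampled_timeline = random.choice(timeline_responses)      # re-drawn on every model run

print(p_transition, sampled_timeline)
```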

There are also a few other small features of the model initialisation which don’t really deserve their own paragraph so I will put them in the following footnote[4].

2.3 Results

Main results

The point of making these changes is not to create increasingly arcane variants of common modelling structures just for aesthetic reasons (however much I wish someone would let me do this) but rather to check whether altering our assumptions about a model’s ontology meaningfully alters the way we should approach the conclusion. There are no fancy statistics or algorithms used here – we just literally compare the two results side-by-side and make qualitative remarks about whether we think the change in structure has had an important impact. This is sometimes an extremely expert process; in cases where it is not clear whether an impact is important, interpreting the results can require considerable nuance.

However fortunately for us, the change in model structure has had an absolutely massive impact which even a non-expert like me can tell should drive decision-making. The diagram below shows the results from running the community’s numerical assumptions through a model designed to look broadly like Carlsmith’s structural assumptions:

It gives a risk of AI Catastrophe of around 18% (ie entirely in line with the community survey as expected - all results are going to be approximate because the simulation contains a lot of random elements which change between model runs). Carlsmith doesn’t distinguish between ‘In Control’ and ‘Out of Control’ AI Risk so the value Open Philanthropy are specifically interested in is the 10% column on the far left.

The next diagram shows the output of the base case analysis of my time-dependent model:

You can see that nearly half of that probability mass has redistributed itself to non-AI competitors in the race, and the OOC AI Risk column has shrunk to just 6.7%. Consequently, I would conclude that accounting for time-dependency in models of AI Risk leads to an approximately 30% decrease in the risk of OOC AI Catastrophe. That might be a slight overstatement of the result, because quite a lot of this probability has just found its way to other bad outcomes like nuclear war and totalitarian subjugation of humanity. Nevertheless, however you slice the result, the estimated risk is a lot lower if we look at the full range of Alignment timelines rather than just the median timelines.

Sensitivity analysis of time

One of the interesting features of a time-dependent model is that it makes more sense of questions we might have about model dynamics. For example, Open Philanthropy are especially interested in scenarios where AI is invented prior to 2070, but are somewhat uncertain about exactly when AI will be invented conditional on that. We can see what impact the two extremes will have by first considering the distribution of results if AI is invented in 2024:

And if it is invented in 2070:

We see – as we should probably expect – that inventing AI later gives humanity a lot longer to try to find a solution for Alignment and reach a good AI future. The effect of delaying AI’s invention by ~50 years is to approximately halve the risk of OOC AI Catastrophe. As a strict output from the model we might say that a delay to AI invention of one year increases the chance that humanity survives to experience a good AI future by about 0.1% (although I’m not really confident my model is sophisticated enough to make this sort of judgement). What I can confidently say is that if you have short AI Catastrophe timelines the effect of time dependency is much less pronounced than if you have long AI Catastrophe timelines – effects which become more pronounced over time will have less of an impact if you compress the time in which they are able to occur. 

Sensitivity analysis of original inputs

I hope these results have a ring of truth to them, because I think they reflect the uncertainty in the community survey quite accurately. The diagram below shows the community’s estimates for AI Invention timelines versus their estimates for AI Alignment timelines. The 45 degree line shows the cutoff where people believe that AI will be Aligned before it is Invented. Any datapoint at the far edge of the graph indicates that either the respondent gave a value that was too big to display or didn’t give a value and therefore is assumed to believe that AI cannot be Aligned (that is, either way they do not believe that AI will be Aligned before Invented). Only one brave iconoclast believes AI will be Aligned before it is Invented, which is why it might be surprising that adding in time dependency based on this survey makes any difference to the results.

However, the next graph shows the same plot for people’s estimates of AI Catastrophe timelines alongside their AI Alignment timelines. About 20% of responses show Alignment occurring strictly before Catastrophe, and there are a few more which sit almost exactly on the 45 degree line. In fact, of the respondents who believe Alignment is possible within the next century, 65% conclude that Alignment is likely to occur before Catastrophe. This is why I believe that the inclusion of time-dependency is likely to improve the way the model aggregates community forecasts; simply taking the median timelines ignores 20% of people and gives a distorted result.
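For what it is worth, the check behind that 65% figure is very simple. The sketch below assumes a hypothetical list of per-respondent (Alignment date, Catastrophe date) pairs, with `None` standing for a respondent who gave no Alignment date and is therefore treated as believing Alignment is impossible:

```python
# Sketch of the consistency check described above. `survey` is a hypothetical
# list of (alignment_year, catastrophe_year) pairs; None means the respondent
# gave no Alignment date and is treated as believing Alignment is impossible.
survey = [
    (2060, 2055),
    (2045, 2070),
    (None, 2065),
    # ... one pair per respondent ...
]

CUTOFF = 2122  # 'possible within the next century', counting from the 2022 survey

believers = [(a, c) for a, c in survey if a is not None and a <= CUTOFF]
aligned_first = sum(1 for a, c in believers if a < c)

print(f"{aligned_first / len(believers):.0%} of those who think Alignment is "
      f"possible this century expect it before Catastrophe")
```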

3. Conclusions

This essay makes the claim that parameters of AI Risk models have been extensively cultivated, to the detriment of discussion about model structure. I present a proof by example of this claim in Section 2; overlooking time-dependency as a structural factor leads to a significant over-estimate of the risk of AI Catastrophe, because it ignores worlds in which AI Alignment is hard but possible (and the dynamic mechanisms which drive the Alignment process). As I am not an expert in AI Risk modelling, the specific numbers and results I arrive at should be taken with a degree of caution. However, I think they are sufficient for proving the key contention of this essay, which is that structural uncertainty is a significant blindspot in the way in which AI Risk is articulated and discussed. I would be extremely surprised if someone more expert than me couldn’t find an effect twice the size I managed with a trivial amount of work by combing over other commonly-accepted structural assumptions in AI Risk modelling with a fine-tooth comb.

There is a nuanced but important point involved in interpreting these results; I have not made a ‘better’ model than Carlsmith in any kind of unambiguous way. By making my model more granular I have introduced problems that Carlsmith can (correctly) ignore because his model is more abstract. This tradeoff between granularity and abstractness is a real problem in professional structural uncertainty analysis, and part of what makes modelling an art rather than a science. For the record, consider some ways in which my approach is clearly inferior to Carlsmith:

I think on balance my model makes sense for the specific problem Open Philanthropy are interested in - the extra complexity of adding a time-dependent element to the model's ontology is worth it when the problem has a time-dependent element in it (such as the requirement that AI be invented before 2070). However, that wouldn't necessarily be true for other problems, even problems which look superficially quite similar.

That said, there are some very clear ways in which my approach could be refined to unambiguously improve its suitability for Open Philanthropy’s question: 

More generally, I can offer a few conclusions of relevance to AI Risk analysts:

Although I am pleased I’ve been able to contribute to a debate on the probabilities of AI Catastrophe, I’m under no illusions that this is the most sophisticated structural critique of AI Risk modelling that is possible; it is just the critique which I am qualified to offer. The nature of modelling is that tiny assumptions that other people overlook can have a massive impact on the eventual outcomes, and the assumptions which are important are not necessarily those which modellers are natively interested in. I believe there would be considerable value in individuals or teams taking a detailed look at each step in the Carlsmith model and hunting for further errant assumptions. I'd be delighted if experts were able to show that my work on time-dependency is ultimately a trivial footnote to the real driver of OOC AI Catastrophe Risk, since identifying the true structural drivers of risk is the first step in a plan to mitigate and ultimately defeat that risk.

 

  1. ^

    There are also a handful of incomplete attempts to move beyond the ‘weighted coinflip’ approach and into more sophisticated modelling structures. The most salient of these would probably be the MTAIR project, although I believe that MTAIR is now sadly defunct following the FTX collapse.

  2. ^

    As a hypothesis, my guess is that a lot of AI Alignment researchers come into the movement via Effective Altruism and then find the AI Catastrophe arguments convincing once they have already bought into broader EA logic. By coincidence, high-profile conventional charitable interventions (meaning, for example, those funded by GiveWell) are extremely well suited to the deterministic decision tree, and so the ‘weighted coinflip’ model is an excellent default choice in almost all conventional charity analyses. If many AI Alignment researchers have been socialised into always using the deterministic decision tree structure, and that approach always works, it is probably unsurprising that this approach becomes like water to a fish – even if it is understood that other model structures exist, there are significantly more important things to be doing than messing around with the approach that is fast, logical and has always worked in the past.

  3. ^

    Of course, nobody – least of all Carlsmith – believes that his model really describes reality any more than I believe my new model really describes reality. Instead, Carlsmith has made a simplifying assumption that treating AI Risk timelines as implicitly fixed will still allow him to investigate the sorts of dynamics he is interested in. He is probably right about this - Carlsmith asks and answers the question, "Is Power-Seeking AI an Existential Risk?" for which he is basically looking for a binary answer ("Yes it is" or "No it isn't"). Consequently it doesn't really matter if his model oversimplifies as long as it produces robust order-of-magnitude risk estimates. However, Open Philanthropy ask the question, "Conditional on AGI being developed by 2070, what is the probability that humanity will suffer an existential catastrophe due to loss of control over an AGI system?" for which a more complex model is needed, one which can produce results on a more granular level than just order-of-magnitude.

    I will revisit this theme later in the essay, but an observation that modelling outsiders sometimes miss is that there is no such thing as a 'better' or 'worse' model provided you meet some minimum bar for competence in execution. Instead, there are models which are better suited for some questions and others which are better suited for other questions. So the paragraph spawning this footnote might look like a criticism of Carlsmith when instead it is really a compliment - Carlsmith has found a very simple and elegant way of modelling what he needs which (sadly) oversimplifies relative to what I need.

  4. ^

    A few small notes on the base case that aren't interesting enough to include in the main body of the text:

    - I have forced AI to be invented before 2070, in line with Open Philanthropy’s preferences, by capping the data I generated from my survey to dates prior to then and forcing the probability it happens to be 100%. 

    - I have included X-Risk and S-Risk drawn from literature sources - Ord (2020) and Caplan (2008) respectively. S-Risk estimates seem very low to me - it would be extremely interesting to investigate this further in the future because I think it might be another area where some careful structural thinking could reveal some interesting insights. But it isn't AI Risk so isn't relevant for now.

    - Some people did not give a date for when they thought Alignment would occur, which I interpreted as meaning they did not think it would ever be possible (backed up by reading some of their comments on their entries). I added a bit of an ad hoc term to make Alignment literally impossible in those worlds, so the only possible outcome is some sort of Catastrophe. If I ran the survey again, I would explicitly ask about this.

    - PS1, PS2 and PS3 are my attempt to include some early thoughts about ‘partial successes’ in Alignment which are not as strict as the definition Carlsmith uses where Alignment reduces the risk of OOC AI to zero. I would describe my success here as ‘mixed’, so I have turned them all off for the base case.

  5. ^

    I had enough time to code up some exploratory work on the idea that 'Alignment' could be a moving target depending on how far advanced we were down the pathway towards AI Catastrophe. In the model below, ‘Alignment’ occurs either if we find a way to make AI completely safe, as in the main model, or if we develop a TEST which can show us whether an AI is aligned before we expose any AI to high-risk inputs. The intuition here is that we might be able to convince governments to act to ban unsafe AIs if we haven't already handed power to AIs, but after handing power to AIs we need some kind of 'weapon' to defeat them. The graph below shows the result of this analysis:

    Unsurprisingly, this results in a significant probability mass redistribution away from AI Risk towards Alignment, as scenarios which were once the absolute worst-case for OOC AI Risk have the possibility of being defused even before AI is invented.

    I got a bit anxious at the thought of presenting these alongside the main results because I think I've done a fairly good job of only talking about things I understand properly in the main body of the text, and I don't understand the nuts-and-bolts elements of Alignment at all well. But I thought people might be interested to see this even if it isn't perfect, if only to prove my point that there’s probably another halving of risk available to the first team who can present a plausible account of how a ‘multi-step’ model of alignment could work.

  6. ^

    Part of the art of modelling is knowing when you have made the model complex enough for whatever purpose you intend to use it for. Any model which fixed the problems with my model would inevitably have problems that needed to be fixed by another model and so on. Although this conveyor belt keeps modellers like me (somewhat) gainfully employed, it can be a trap for organisations who just want to use the best possible model to make decisions, because ‘best possible’ is a moving target. As discussed above, the structure needs to match the needs of the decision problem and there isn't a generic solution for how to find the best structure for a particular job.


quinn @ 2023-05-19T15:25 (+11)

psyched about this post, want to jot down a quick nit-pick

  • S-Risk can occur any time from now until the end of the race, and represents – for example - a totalitarian government seizing control of the world to such an extent that human flourishing is permanently curtailed, but the development of AI is not (so S-Risk can occur before AI is Invented).

I don't think curtailing human flourishing constitutes s-risk, I don't think the suffering-focused community likes to draw equivalences between opportunity cost and more immediate or obvious disvalue. When the s-risk community talks about malevolent actors (see CLR), they're talking more about associations between totalitarianism and willingness/ability to literally-torture at scale, whereas other theorists (not in the suffering-focused community) may worry about a flavor of totalitarianism where everyone has reasonable quality of life they just can't steer or exit. 

One citation for the idea that opportunity costs (say all progress but spacefaring continues) and literally everyone literally dying are morally similar is The Precipice. We can (polarizingly!) talk about "existential risk" not equalling "extinction risk" but equalling it under some value function. This is one way of thinking about totalitarianism in the longtermist community. 

Political freedoms and the valence of day to day experience aren't necessarily the exact same thing. 

Froolow @ 2023-05-20T08:51 (+3)

Thank you, really interesting comment which clarifies a confusion I had when writing the essay!

Vasco Grilo @ 2023-05-20T08:16 (+3)

Great analysis, Froolow!

Analysts discussing AI Risk should describe the structure of their model much more explicitly. I observe there is a bit of a tendency on the forums to be cagey about one’s ‘actual’ model of AI Risk when presenting estimates of Catastrophe, and imply that the ‘actual’ model of AI Risk one has is significantly more complicated than could possibly be explained in the space of a single post (phrases like, “This is roughly my model” are a signifier of this).

Agreed!

Ben Stewart @ 2023-07-01T13:22 (+2)

I really liked this!