Counting arguments provide no evidence for AI doom

By Nora Belrose, Quintin Pope @ 2024-02-27T23:03 (+84)

This is a crosspost, probably from LessWrong. Try viewing it there.

null
Matthew_Barnett @ 2024-02-28T00:57 (+51)

(I might write a longer response later, but I thought it would be worth writing a quick response now.)

I have a few points of agreement and a few points of disagreement:

Agreements:

Some points of disagreement:

Nora Belrose @ 2024-02-28T01:09 (+1)

I think the title overstates the strength of the conclusion

This seems like an isolated demand for rigor to me. I think it's fine to say something is "no evidence" when, speaking pedantically, it's only a negligible amount of evidence.

Ultimately I think you've only rebutted one argument for scheming—the counting argument

I mean, we do in fact discuss the simplicity argument, although we don't go in as much depth.

the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don't scheme

Without a concrete proposal about what that might look like, I don't feel the need to address this possibility.

If future AIs are "as aligned as humans", then AIs will probably scheme frequently

I think future AIs will be much more aligned than humans, because we will have dramatically more control over them than over humans.

I don't think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have "goals" that they robustly attempt to pursue.

We did not intend to deny that some AIs will be well-described as having goals.

Linch @ 2024-02-28T09:51 (+25)

Minor, but: searching on the EA Forum, your post and Quentin Pope's post are the only posts with the exact phrase "no evidence" (EDIT: in the title, which weakens my point significantly but it still holds) The closest other match on the first page is There is little (good) evidence that aid systematically harms political institutions, which to my eyes seem substantially more caveated.

Over on LessWrong, the phrase is more common, but the top hits are multiple posts that specifically argue against the phrase in the abstract. So overall I would not consider it an isolated demand for rigor if someone were to argue against the phrase "no evidence" on either forum.

Matthew_Barnett @ 2024-02-28T01:24 (+19)

This seems like an isolated demand for rigor to me. I think it's fine to say something is "no evidence" when, speaking pedantically, it's only a negligible amount of evidence.

I think that's fair, but I'm still admittedly annoyed at this usage of language. I don't think it's an isolated demand for rigor because I have personally criticized many other similar uses of "no evidence" in the past.

I think future AIs will be much more aligned than humans, because we will have dramatically more control over them than over humans.

That's plausible to me, but I'm perhaps not as optimistic as you are. I think AIs might easily end up becoming roughly as misaligned with humans as humans are to each other, at least eventually.

We did not intend to deny that some AIs will be well-described as having goals.

If you agree that AIs will intuitively have goals that they robustly pursue, I guess I'm just not sure why you thought it was important to rebut goal realism? You wrote,

The goal realist perspective relies on a trick of language. By pointing to a thing inside an AI system and calling it an “objective”, it invites the reader to project a generalized notion of “wanting” onto the system’s imagined internal ponderings, thereby making notions such as scheming seem more plausible.

But I think even on a reductionist view, it can make sense to talk about AIs "wanting" things, just like it makes sense to talk about humans wanting things. I'm not sure why you think this distinction makes much of a difference.

Nora Belrose @ 2024-02-28T02:27 (+4)

The goal realism section was an argument in the alternative. If you just agree with us that the indifference principle is invalid, then the counting argument fails, and it doesn't matter what you think about goal realism.

If you think that some form of indifference reasoning still works— in a way that saves the counting argument for scheming— the most plausible view on which that's true is goal realism combined with Huemer's restricted indifference principle. We attack goal realism to try to close off that line of reasoning.

Joe_Carlsmith @ 2024-02-28T05:16 (+37)

(Copying over my response from LessWrong)

Thanks for writing this -- I’m very excited about people pushing back on/digging deeper re: counting argumentssimplicity arguments, and the other arguments re: scheming I discuss in the report. Indeed, despite the general emphasis I place on empirical work as the most promising source of evidence re: scheming, I also think that there’s a ton more to do to clarify and maybe debunk the more theoretical arguments people offer re: scheming – and I think playing out the dialectic further in this respect might well lead to comparatively fast progress (for all their centrality to the AI risk discourse, I think arguments re: scheming have received way too little direct attention). And if, indeed, the arguments for scheming are all bogus, this is super good news and would be an important update, at least for me, re: p(doom) overall. So overall I’m glad you’re doing this work and think this is a valuable post. 

On other note up front: I don’t think this post “surveys the main arguments that have been put forward for thinking that future AIs will scheme.” In particular: both counting arguments and simplicity arguments (the two types of argument discussed in the post) assume we can ignore the path that SGD takes through model space. But the report also discusses two arguments that don’t make this assumption – namely, the “training-game independent proxy goals story” (I think this one is possibly the most common story, see e.g. Ajeya here, and all the talk about the evolution analogy) and the “nearest max-reward goal argument.” I think that the idea that “a wide variety of goals can lead to scheming” plays some role in these arguments as well, but not such that they are just the counting argument restated, and I think they’re worth treating on their own terms. 

On counting arguments and simplicity arguments

Focusing just on counting arguments and simplicity arguments, though: Suppose that I’m looking down at a superintelligent model newly trained on diverse, long-horizon tasks. I know that it has extremely ample situational awareness – e.g., it has highly detailed models of the world, the training process it’s undergoing, the future consequences of various types of power-seeking, etc – and that it’s getting high reward because it’s pursuing some goal (the report conditions on this). Ok, what sort of goal? 

We can think of arguments about scheming in two categories here. 

Let’s focus first on (I), the more-agnostic-about-SGD’s-inductive-biases type. Here’s a way of pumping the sort of intuition at stake in the hazy counting argument:

  1. A very wide variety of goals can prompt scheming.
  2. By contrast, non-scheming goals need to be much more specific to lead to high reward.
  3. I’m not sure exactly what sorts of goals SGD’s inductive biases favor, but I don’t have strong reason to think they actively favor non-schemer goals.
  4. So, absent further information, and given how many goals-that-get-high-reward are schemer-like, I should be pretty worried that this model is a schemer. 

Now, as I mention in the report, I'm happy to grant that this isn't a super rigorous argument. But how, exactly, is your post supposed to comfort me with respect to it? We can consider two objections, both of which are present in/suggested by your post in various ways.

Let’s start with (A). I agree that this sort of reasoning would lead you to giving significant weight to SGD overfitting, absent any further evidence. But it’s not clear to me that giving this sort of weight to overfitting was unreasonable ex ante, or that having learned that SGD-doesn't-overfit, you should now end up with low p(scheming) even given your ongoing ignorance about SGD's inductive biases.

Thus, consider the sort of analogy I discuss in the counting arguments section. Suppose that all we know is that Bob lives in city X, that he went to a restaurant on Saturday, and that town X has a thousand chinese restaurants, a hundred mexican restaurants, and one indian restaurant. What should our probability be that he went to a chinese restaurant? 

In this case, my intuitive answer here is: “hefty.”[1] In particular, absent further knowledge about Bob’s food preferences, and given the large number of chinese restaurants in the city, “he went to a chinese restaurant” seems like a pretty salient hypothesis. And it seems quite strange to be confident that he went to a non-chinese restaurant instead. 

Ok but now suppose you learn that last week, Bob also engaged in some non-restaurant leisure activity. For such leisure activities, the city offers: a thousand movie theaters, a hundred golf courses, and one escape room. So it would’ve been possible to make a similar argument for putting hefty credence on Bob having gone to a movie. But lo, it turns out that actually, Bob went golfing instead, because he likes golf more than movies or escape rooms.

How should you update about the restaurant Bob went to? Well… it’s not clear to me you should update much. Applied to both leisure and to restaurants, the hazy counting argument is trying to be fairly agnostic about Bob’s preferences, while giving some weight to some type of “count.” Trying to be uncertain and agnostic does indeed often mean putting hefty probabilities on things that end up false. But: do you have a better proposed alternative, such that you shouldn’t put hefty probability on “Bob went to a chinese restaurant”, here, because e.g. you learned that hazy counting arguments don’t work when applied to Bob? If so, what is it? And doesn’t it seem like it’s giving the wrong answer?

Or put another way: suppose you didn’t yet know whether SGD overfits or not, but you knew e.g. about the various theoretical problems with unrestricted uses of the indifference principle. What should your probability have been, ex ante, on SGD overfitting? I’m pretty happy to say “hefty,” here. E.g., it’s not clear to me that the problem, re: hefty-probability-on-overfitting, was some a priori problem with hazy-counting-argument-style reasoning. For example: given your philosophical knowledge about the indifference principle, but without empirical knowledge about ML, should you have been super surprised if it turned out that SGD did overfit? I don’t think so. 

Now, you could be making a different, more B-ish sort of argument here: namely, that the fact that SGD doesn’t overfit actively gives us evidence that SGD’s inductive biases also disfavor schemers. This would be akin to having seen Bob, in a different city, actively seek out mexican restaurants despite there being many more chinese restaurants available, such that you now have active evidence that he prefers mexican and is willing to work for it. This wouldn’t be a case of having learned that bob’s preferences are such that hazy counting arguments “don’t work on bob” in general. But it would be evidence that Bob prefers non-chinese.

I’m pretty interested in arguments of this form. But I think that pretty quickly, they move into the territory of type (II) arguments above: that is, they start to say something like “we learn, from SGD not overfitting, that it prefers models of type X. Non-scheming models are of type X, schemers are not, so we now know that SGD won’t prefer schemers.”

But what is X? I’m not sure your answer (though: maybe it will come in a later post). You could say something like “SGD prefers models that are ‘natural’” – but then, are schemers natural in that sense? Or, you could say “SGD prefers models that behave similarly on the training and test distributions” – but in what sense is a schemer violating this standard? On both distributions, a schemer seeks after their schemer-like goal. I’m not saying you can’t make an argument for a good X, here – but I haven’t yet heard it. And I’d want to hear its predictions about non-scheming forms of goal-misgeneralization as well.  

Indeed, my understanding is that a quite salient candidate for “X” here is “simplicity” – e.g., that SGD’s not overfitting is explained by its bias towards simpler functions. And this puts us in the territory of the “simplicity argument” above. I.e., we’re now being less agnostic about SGD’s preferences, and instead positing some more particular bias. But there’s still the question of whether this bias favors schemers or not, and the worry is that it does. 

This brings me to your take on simplicity arguments. I agree with you that simplicity arguments are often quite ambiguous about the notion of simplicity at stake (see e.g. my discussion here). And I think they’re weak for other reasons too (in particular, the extra cognitive faff scheming involves seems to me more important than its enabling simpler goals). 

But beyond “what is simplicity anyway,” you also offer some other considerations, other than SGD-not-overfitting, meant to suggest that we have active evidence that SGD’s inductive biases disfavor schemers. I’m not going to dig deep on those considerations here, and I’m looking forward to your future post on the topic. For now, my main reaction is: “we have active evidence that SGD’s inductive biases disfavor schemers” seems like a much more interesting claim/avenue of inquiry than trying to nail down the a priori philosophical merits of counting arguments/indifference principles, and if you believe we have that sort of evidence, I think it’s probably most productive to just focus on fleshing it out and examining it directly. That is, whatever their a priori merits, counting arguments are attempting to proceed from a position of lots of uncertainty and agnosticism, which only makes sense if you’ve got no other good evidence to go on. But if we do have such evidence (e.g., if (3) above is false), then I think it can quickly overcome whatever “prior” counting arguments set (e.g., if you learn that Bob has a special passion for mexican food and hates chinese, you can update far towards him heading to a mexican restaurant). In general, I’m very excited for people to take our best current understanding of SGD’s inductive biases (it’s not my area of expertise), and apply it to p(scheming), and am interested to hear your own views in this respect. But if we have active evidence that SGD’s inductive biases point away from schemers, I think that whether counting arguments are good absent such evidence matters way less, and I, for one, am happy to pay them less attention.

(One other comment re: your take on simplicity arguments: it seems intuitively pretty non-simple to me to fit the training data on the training distribution, and then cut to some very different function on the test data, e.g. the identity function or the constant function. So not sure your parody argument that simplicity also predicts overfitting works. And insofar as simplicity is supposed to be the property had by non-overfitting functions, it seems somewhat strange if positing a simplicity bias predicts over-fitting after all.)

A few other comments

Re: goal realism, it seems like the main argument in the post is something like: 

  1. Michael Huemer says that it’s sometimes OK to use the principle of indifference if you’re applying it to explanatorily fundamental variables. 
  2. But goals won’t be explanatorily fundamental. So the principle of indifference is still bad here.

I haven’t yet heard much reason to buy Huemer’s view, so not sure how much I care about debating whether we should expect goals to satisfy his criteria of fundamentality. But I'll flag I do feel like there’s a pretty robust way in which explicitly-represented goals appropriately enter into our explanations of human behavior – e.g., I have buying a flight to New York because I want to go to New York, I have a representation of that goal and how my flight-buying achieves it, etc. And it feels to me like your goal reductionism is at risk of not capturing this. (To be clear: I do think that how we understand goal-directedness matters for scheming -- more here -- and that if models are only goal-directed in a pretty deflationary sense, this makes scheming a way weirder hypothesis. But I think that if models are as goal-directed as strategic and agentic humans reasoning about how to achieve explicitly represented goals, their goal-directedness has met a fairly non-deflationary standard.)

I’ll also flag some broader unclarity about the post’s underlying epistemic stance. You rightly note that the strict principle of indifference has many philosophical problems. But it doesn’t feel to me like you’ve given a compelling alternative account of how to reason “on priors” in the sorts of cases where we’re sufficiently uncertain that there’s a temptation to spread one’s credence over many possibilities in the broad manner that principles-of-indifference-ish reasoning attempts to do. 

Thus, for example, how does your epistemology think about a case like “There are 1000 people in this town, one of them is the murderer, what’s the probability that it’s Mortimer P. Snodgrass?” Or: “there are a thousand white rooms, you wake up in one of them, what’s the probability that it’s room number 734?” These aren’t cases like dice, where there’s a random process designed to function in principle-of-indifference-ish ways. But it’s pretty tempting to spread your credence out across the people/rooms (even if in not-fully-uniform ways), in a manner that feels closely akin to the sort of thing that principle-of-indifference-ish reasoning is trying to do. (We can say "just use all the evidence available to you" -- but why should this result in such principle-of-indifference-ish results?)

Your critique of counting argument would be more compelling to me if you had a fleshed out account of cases like these -- e.g., one which captures the full range of cases where we’re pulled towards something principle-of-indifference-ishsuch that you can then take that account and explain why it shouldn’t point us towards hefty probabilities on schemers, a la the hazy counting argument, even given very-little-evidence about SGD’s inductive biases.

More to say on all this, and I haven't covered various ways in which I'm sympathetic to/moved by points in the vicinity of the one's you're making here.  But for now: thanks again for writing, looking forward to future installments. 

evhub @ 2024-03-01T03:14 (+27)

I won't repeat my full LessWrong comment here in detail; instead I'd just recommend heading over there and reading it and the associated comment chain. The bottom-line summary is that, in trying to cover some heavy information theory regarding how to reason about simplicity priors and counting arguments without actually engaging with the proper underlying formalism, this post commits a subtle but basic mathematical mistake that makes the whole argument fall apart.

David Mathers @ 2024-02-28T13:48 (+8)

' Rosenberg and the Churchlands are anti-realists about intentionality— they deny that our mental states can truly be “about” anything in the world..' 

Taken literally this is insane. It means no one has ever thought about going out to the shops for some milk. If it's extended to language (and why wouldn't it?) it means that we can't say that science sometimes succeeds in representing the world's reasonably well, since nothing represents anything. It is also very different from the view that mental states are real, but they are behavioral dispositions, not inner representations in the brain, since the latter view is perfectly compatible with known facts like "people sometimes want a beer". 

I'm also suspicious of what the world "truly" is doing in this sentence if it's not redundant. What exactly is the difference between "our mental states can be about things in the world" and "truly our mental states can be about things in the world"? 

Larks @ 2024-02-28T04:56 (+8)

Thanks for writing this up and sharing it.

Do you have a good account for when counting arguments do and do not work? My impression is they often do work in everyday life, or at least can provide a good prior to be updated away from. Like if I'm wondering who I will meet first when I go into school, a counting argument correctly predicts that any single specific person is quite unlikely. Is the idea that humans are typically good at ontology dividing, such that each of the options are roughly equivalent, but this intuition doesn't work well for SGD?

titotal @ 2024-02-28T10:30 (+7)

Great post! I think the mixing up of “colloquial” type goals with “fanatical utility function maximization” type goals is a key flaw in a lot of x-risk arguments. I think the first thing could extend to some mild scheming, but is unlikely to extend to “kill everyone and tile the universe with paperclips”. 

I really don’t get the “simplicity” arguments for fanatical maximising behaviour. When you consider subgoals, it seems that secretly plotting to take over the world will obviously be much more complicated? Do you have any idea how much computing power and subgoals it takes to try and conquer the entire planet? 

I don’t buy the story that an AI starts with the simple “goal” of “maximise paperclips”, then gets yelled at for demolishing a homeless shelter to expand the factory, and then updates to a  goal of “maximise paperclips in the long term, by hiding your intentions and conducting a secret world domination plot”. Why not update to “make lots of paperclips, but don’t try any galaxy brained shit”? It seems simpler and less computationally expensive. 

Matthew_Barnett @ 2024-02-28T19:05 (+11)

I really don’t get the “simplicity” arguments for fanatical maximising behaviour. When you consider subgoals, it seems that secretly plotting to take over the world will obviously be much more complicated? Do you have any idea how much computing power and subgoals it takes to try and conquer the entire planet? 

I think this is underspecified because 

  1. The hard part of taking over the whole planet is being able to execute a strategy that actually works in a world with other agents (who are themselves vying for power), rather than the compute or complexity cost of having the subgoal of taking over the world
  2. The difficulty of taking over the world depends on the level of technology, among other factors. For example, taking over the world in the year 1000 AD was arguably impossible because you just couldn't manage an empire that large. Taking over the world in 2024 is perhaps more feasible, since we're already globalized, but it's still essentially an ~impossible task.

My best guess is that if some agent "takes over the world" in the future, it will look more like "being elected president of Earth" rather than "secretly plotted to release a nanoweapon at a precise time, killing everyone else simultaneously". That's because in the latter scenario, by the time some agent has access to super-destructive nanoweapons, the rest of the world likely has access to similarly-powerful technology, including potential defenses to these nanoweapons (or their own nanoweapons that they can threaten you with).

Ebenezer Dukakis @ 2024-10-11T06:02 (+6)

Deep learning is strongly biased toward networks that generalize the way humans want— otherwise, it wouldn’t be economically useful.

I noticed you switched here from talking about "SGD" to talking about "deep learning". That seems dodgy. I think you are neglecting the possible implicit regularization effect of SGD.

I don't work at OpenAI, but my prior is that insofar as ChatGPT generalizes, it's the result of many years of research into regularization being applied during its training. (The fact that the term 'regularization' doesn't even appear in this post seems like a big red flag.)

We've figured out now how to train neural networks so they generalize, and we could probably figure out how to train neural networks without schemers if we put in similar years of effort. But in the same way that the very earliest neural networks were (likely? I'm no historian) overfit by default, it seems reasonable to wonder if the very earliest neural networks large enough to have schemers will have schemers by default.

Will Howard🔹 @ 2024-10-08T15:47 (+4)

I'm curating this post. Reading this post was a turning point for me from taking counting arguments seriously to largely rejecting them without a strong reason to think the principle of indifference holds. I thought the reductio arguments at the start were really well chosen to make the conclusion seem obvious (at least against the strict form of the argument) without leaving room for ML-specific nitpicks.

David Mathers @ 2024-02-28T13:35 (+3)

Only glanced at one or two sections but the "goal realism is anti-Darwinian" section seems possibly irrelevant to the argument to me. When you first introduce "goal realism" it seems like it is a view that goals are actual internal things somehow "written down" in the brain/neural net/other physical mind, so that you could modify the bit of the system where the goal is written down and get different behaviour, rather than there really being nothing that is the representation of the AIs goals, because "goals" are just behavioral dispositions. But the view your criticizing in the "goal realism is anti-Darwinian" section is the view that there is always a precise fact of the matter about what exactly is being represented at a particular point in time, rather than several different equally good candidates for what is represented. But I can think of representations are physically real vehicles-say, that some combination of neuron firings is the representation of flys/black dots that causes frogs to snap at them-without thinking it is completely determinate what-flies or black dots-is represented by those neuron firings. Determinacy of what a representation represents is not guaranteed just by the fact that a representation exists. ~

EDIT: Also, is Olah-style interpretability working presuming "representation realism"? Does it provide evidence for it? Evidence for realism about goals specifically? If not, why not? 

Reply

 

SummaryBot @ 2024-02-28T13:34 (+1)

Executive summary: Counting arguments that future AIs will likely "scheme" against humans are flawed and provide no good reason to worry AIs will intentionally deceive or try dominating humans.

Key points:

  1. A counting argument for neural networks massively overfitting training data is structurally identical and more plausible than counting arguments for AI schemers, yet overfitting rarely happens. This shows counting arguments are generally unsound.
  2. Counting arguments rely on the principle of indifference, which is known to give absurd results in many cases. There is no good reason to apply indifference reasoning to an AI's goals or behaviors.
  3. The assumption that AIs have fundamental inner goals separate from their behaviors is doubtful. Behaviors likely come first, with goal-attribution happening later based on patterns in those behaviors.
  4. Even AIs with explicit inner optimization objectives would not necessarily behave in ways that coherently pursue those objectives. Their behaviors result from complex interactions between components and are not cleanly dictated by any simple objective.
  5. Other arguments for worrying about AI schemers similarly rely on unsound indifference reasoning or implausible assumptions. Once indifference reasoning is rejected, there is very little reason left to believe AIs will spontaneously become schemers.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.