Can you control the past?

By Joe_Carlsmith @ 2021-08-27T19:34 (+46)

I think that you can “control” events you have no causal interaction with, including events in the past, and that this is a wild and disorienting fact, with uncertain but possibly significant implications. This post attempts to impart such disorientation.

My main example is a prisoner’s dilemma between perfect deterministic software twins, exposed to the exact same inputs. This example shows, I think, that you can write on whiteboards light-years away, with no delays; you can move the arm of another person, in another room, just by moving your own. This, I claim, is extremely weird.

My topic, more broadly, is the implications of this weirdness for the theory of instrumental rationality (“decision theory”). Many philosophers, and many parts of common sense, favor causal decision theory (CDT), on which, roughly, you should pick the action that causes the best outcomes in expectation. I think that deterministic twins, along with other examples, show that CDT is wrong. And I don’t think that uncertainty about “who you are,” or “where your algorithm is,” can save it.

Granted that CDT is wrong, though, I’m not sure what’s right. The most famous alternative is evidential decision theory (EDT), on which, roughly, you should choose the action you would be happiest to learn you had chosen. I think that EDT is more attractive (and more confusing) than many philosophers give it credit for, and that some putative counterexamples don’t withstand scrutiny. But EDT has problems, too.

In particular, I suspect that attractive versions of EDT (and perhaps, attractive attempts to recapture the spirit of CDT) require something in the vicinity of “following the policy that you would’ve wanted yourself to commit to, from some epistemic position that ‘forgets’ information you now know.” I don’t think that the most immediate objection to this – namely, that it implies choosing lower pay-offs even when you know them with certainty – is decisive (though some debates in this vicinity seem to me verbal). But it also seems extremely unclear what epistemic position you should evaluate policies from, and what policy such a position actually implies.

Overall, rejecting the common-sense comforts of CDT, and accepting the possibility of some kind of “acausal control,” leaves us in strange and uncertain territory. I think we should do it anyway. But we should also tread carefully.

I. Grandpappy Omega

Decision theorists often assume that instrumental rationality is about maximizing expected utility in some sense. The question is: what sense?

The most famous debate is between CDT and EDT. CDT chooses the action that will have the best effects. EDT chooses the action whose performance would be the best news.

More specifically: CDT and EDT disagree about the type of “if” to use when evaluating the utility to expect, if you do X. CDT uses a counterfactual type of “if” — one that holds fixed the probability of everything outside of action X’s causal influence, then plays out the consequences of doing X. In this sense, it doesn’t allow your choice to serve as “evidence” about anything you can’t cause — even when your choice is such evidence.

EDT, by contrast, uses a conditional “if.” That is, to evaluate X, it updates your overall picture of the world to reflect the assumption that action X has been performed, and then sees how good the world looks in expectation. In this sense, it takes all the evidence into account, including the evidence that your having done X would provide.

To see what this difference looks like in action, consider:

Newcomb’s problem: You face two boxes: a transparent box, containing a thousand dollars, and an opaque box, which contains either a million dollars, or nothing. You can take (a) only the opaque box (one-boxing), or (b) both boxes (two-boxing). Yesterday, Omega — a superintelligent AI — put a million dollars in the opaque box if she predicted you’d one-box, and nothing if she predicted you’d two-box. Omega’s predictions are almost always right.

CDT two-boxes. Your choice, after all, is evidence about what’s in the opaque box, but it doesn’t actually affect what’s in the box — by the time you’re choosing, the opaque box is either already empty, or already full. So CDT assigns some probability p to the box being full, and then holds that probability fixed in evaluating different actions. Let’s say p is 1%. CDT’s expected payoffs are then:

One-boxing: 1% probability of $1M, 99% probability of nothing = $10K.
Two-boxing: 1% probability of $1M + $1K, 99% probability of $1K = $11K.

Note that there’s some ambiguity, here, about whether CDT then updates p based on its knowledge that it’s about to two-box, then recalculates the expected utilities, and only goes forward if it finds equilibrium. And in some problems, this sort of recalculation makes CDT’s decision-making unstable — see e.g. Gibbard and Harper’s (1978) “Death in Damascus.” But in Newcomb’s problem, no matter what p you use, CDT always says that two-boxing is $1K better, and so two-boxes regardless of what it thinks Omega did, or what evidence its own plans provide.
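The claim that two-boxing comes out $1K ahead for every p can be checked with a few lines of arithmetic. What follows is a minimal Python sketch rather than anything from the literature: the function name `cdt_expected_values` is made up for illustration, and dollars are treated as utilities.

```python
MILLION = 1_000_000
THOUSAND = 1_000

def cdt_expected_values(p_full):
    """Return (one_box, two_box) expected dollars.

    p_full is the probability that the opaque box is full. It is held
    fixed across both actions, which is the CDT-style move.
    """
    one_box = p_full * MILLION
    two_box = p_full * (MILLION + THOUSAND) + (1 - p_full) * THOUSAND
    return one_box, two_box

# Whatever p is, the gap between the two options stays at $1K.
for p in (0.0, 0.01, 0.5, 0.99, 1.0):
    one_box, two_box = cdt_expected_values(p)
    print(f"p={p:.2f}  one-box=${one_box:,.0f}  two-box=${two_box:,.0f}")
```

With p set to 0.01 this reproduces the $10K and $11K figures above; changing p moves both numbers together and leaves the difference alone.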

EDT, by contrast, one-boxes. Learning that you one-boxed, after all, is the better news: it means that Omega probably put a million in the opaque box. More specifically, in comparing one-boxing with two-boxing, EDT changes the probability that the box is full. Why? Because, well, the probability is different, conditional on one-boxing vs. two-boxing. Thus, EDT’s pay-offs are:

One-boxing: ~100% chance of $1M = ~$1M.
Two-boxing: ~100% chance of $1K = ~$1K.
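To make the conditional calculation concrete, here is a rough Python sketch. It assumes, purely for illustration, that Omega is equally accurate whichever way you choose, and it uses 99% as a stand-in for “almost always right”; the name `edt_expected_values` is hypothetical.

```python
MILLION = 1_000_000
THOUSAND = 1_000

def edt_expected_values(accuracy):
    """Return (one_box, two_box) expected dollars.

    The probability that the box is full is conditioned on the action:
    it is `accuracy` given one-boxing and `1 - accuracy` given two-boxing.
    """
    p_full_given_one_box = accuracy
    p_full_given_two_box = 1 - accuracy
    one_box = p_full_given_one_box * MILLION
    two_box = p_full_given_two_box * MILLION + THOUSAND
    return one_box, two_box

# 0.99 is an assumed stand-in for "almost always right".
one_box, two_box = edt_expected_values(0.99)
print(f"one-box=${one_box:,.0f}  two-box=${two_box:,.0f}")
```

The only structural difference from the CDT calculation is that the probability of a full box is allowed to depend on the action being evaluated.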

What’s the right choice? I think: one-boxing, and I’ll say much more about why below. But I feel the pull towards two-boxing, for CDT-ish reasons.

Imagine, for example, that you have a friend who can see what’s in the opaque box (see Drescher (2006) for this framing). You ask them: what choice will leave me richer? They start to answer. But wait: did you even need to ask? Whether the opaque box is empty or full, you know what they’re going to say. Every single time, the answer will be: two-boxing, dumbo. Omega, after all, is gone; the box’s contents are fixed; the past is past. The question now is simply whether you want an extra $1,000, or not.

I find that my two-boxing intuition strengthens if Omega is your great grandfather, long dead (h/t Amanda Askell for suggesting this framing to me years ago), and if we specify that he’s merely a “pretty good” predictor; one who is right, say, 80% of the time (EDT still says to one-box, in this case). Suppose that he left the boxes in the attic of your family estate, for you to open on your 18th birthday. At the appointed time, you climb the dusty staircase; you brush the cobwebs off the antique boxes; you see the thousand through the glass. Are you really supposed to just leave it there, sitting in the attic? What sort of rationality is that?
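The parenthetical claim about the 80% predictor is easy to verify, and it is worth seeing how low the accuracy can go before EDT's verdict flips. The sketch below assumes the predictor is equally reliable for one-boxers and two-boxers; `edt_break_even_accuracy` is an illustrative name, not a standard one.

```python
MILLION = 1_000_000
THOUSAND = 1_000

def edt_break_even_accuracy():
    """Accuracy a at which a * M equals (1 - a) * M + K.

    Solving for a gives a = (M + K) / (2 * M).
    """
    return (MILLION + THOUSAND) / (2 * MILLION)

accuracy = 0.8  # the "pretty good" grandfather
one_box = accuracy * MILLION
two_box = (1 - accuracy) * MILLION + THOUSAND

print(f"one-box=${one_box:,.0f}  two-box=${two_box:,.0f}")
print(f"one-boxing is ahead: {one_box > two_box}")
print(f"break-even accuracy: {edt_break_even_accuracy():.4f}")
```

On these assumptions, the conditional calculation favors one-boxing for any predictor even slightly better than a coin flip, which is part of why the attic case pulls so hard against it.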

Sometimes, one-boxers object: if two-boxers are so rational, why do the one-boxers end up so much richer? But two-boxers can answer: because Omega has chosen to give better options to agents who will choose irrationally. Two-boxers make the best of a worse situation: they almost always face a choice between nothing or $1K, and they, rationally, choose $1K. One-boxers, by contrast, make the worse of a better situation: they almost always face a choice between $1M or $1M+$1K, and they, irrationally, choose $1M.

But wouldn’t a two-boxer want to modify themselves, ahead of Omega’s prediction, to become a one-boxer? Depending on the modification and the circumstances: yes. But depending on the modification and the circumstances, it can be rational to self-modify into any old thing — especially if rich and powerful superintelligences are going around rewarding irrationality. If Omega will give you millions if you believe that Paris is in Ohio, self-modifying to make such a mistake might be worth it; but the Eiffel Tower stays put. At the very least, then, arguments from incentives towards self-modification require more specificity. (Though we might try to provide this specificity, by focusing on self-modifications whose advantages are sufficiently robust, and/or on a restricted class of cases that we deem “fair.”)

CDT’s arguments and replies to objections here are simple, flat-footed, and I think, quite strong. Indeed, many philosophers are convinced by something in the vicinity (see e.g. the 2009 PhilPapers survey, in which two-boxing, at 31%, beats one-boxing, at 21%, with the other 47% answering “other” – though we might wonder what “other” amounts to in a case with only two options). And more broadly, I think that, relative to EDT at least, CDT fits better with a certain kind of common sense. Action, we think, isn’t about manipulating our evidence about what’s already the case – what David Lewis calls “managing the news.” Rather, action is about causing stuff. In this sense, CDT feels to me like a basic and hard-headed default. In my head, it’s the “man on the street’s” decision theory. It’s not trying to get “too fancy.” It can feel like solid ground.

II. Writing on whiteboards light-years away

Nevertheless, I think that CDT is wrong. Here’s the case that convinces me most.

Perfect deterministic twin prisoner’s dilemma: You’re a deterministic AI system, who only wants money for yourself (you don’t care about copies of yourself). The authorities make a perfect copy of you, separate you and your copy by a large distance, and then expose you both, in simulation, to exactly identical inputs (let’s say, a room, a whiteboard, some markers, etc). You both face the following choice: either (a) send a million dollars to the other (“cooperate”), or (b) take a thousand dollars for yourself (“defect”).

(Prisoner’s dilemmas, with varying degrees of similarity between the participants, are common in the decision theory literature: see e.g. Lewis (1979), and Hofstadter (1985)).

CDT, in this case, defects. After all, your choice can’t causally influence your copy’s choice: you’re in your room, and he’s in his, far away. Indeed, we can specify that such influence is physically impossible – by the time information about your choice, traveling at the speed of light, can reach him, he’ll have already chosen (and vice versa). And regardless of what he chooses, you get more money by taking the thousand.

But defecting in this case, I claim, is totally crazy. Why? Because absent some kind of computer malfunction, both of you will make the same choice, as a matter of logical necessity. If you press the defect button, so will he; if you cooperate, so will he. The two of you, after all, are exact mirror images. You move in unison; you speak, and think, and reach for buttons, in perfect synchrony. Watching the two of you is like watching the same movie on two screens.

Indeed, for all intents and purposes, you control what he does. Imagine, for example, that you want to get something written on his whiteboard: let’s say, the words “I am the egg man; you are the walrus.” What to do? Just write it on your own whiteboard. Go ahead, try it. It will really work. When you two rendezvous after this is all over, his whiteboard will bear the words you chose. In this sense, your whiteboard is a strange kind of portal; a slate via which you can etch your choices into his far-away world; a chance to act, spookily, at a distance.

And it’s not just whiteboards: you can make him do whatever you want – dance a silly samba, bang his head against the wall, press the cooperate button — just by doing it yourself. He is your puppet. Invisible strings, more powerful and direct than any that operate via mere causality, tie every movement of your mind and body to his.

What’s more: such strings can’t be severed. Try, for example, to make the two whiteboards different. Imagine that you’ll get ten million dollars if you succeed. It doesn’t matter: you’ll fail. Your most whimsical impulse, your most intricate mental acrobatics, your special-est snowflake self, will never suffice: you can no more write “up” while he writes “down” than you can floss while the man in the bathroom mirror brushes his teeth. In this sense, if you find yourself reasoning about scenarios where he presses one button, and you press another – e.g., “even if he cooperates, it would be better for me to defect” – then you are misunderstanding your situation. Those scenarios just aren’t on the table. The available outcomes here are only defect-defect, and cooperate-cooperate. You can get a thousand, by defecting, or you can get a million, by cooperating; but you can’t get less, or more.
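The point that only two of the four cells are reachable can be illustrated in code. In the sketch below, the agent is modeled as a pure function of its inputs, which is my simplification of “deterministic AI system”; the three policies and the input tuple are invented for the example.

```python
MILLION = 1_000_000
THOUSAND = 1_000

def my_payoff(my_choice, copy_choice):
    """Dollars I end up with in the twin prisoner's dilemma."""
    received = MILLION if copy_choice == "cooperate" else 0
    kept = THOUSAND if my_choice == "defect" else 0
    return received + kept

# Hypothetical deterministic policies: pure functions of the shared inputs.
policies = {
    "always cooperate": lambda inputs: "cooperate",
    "always defect": lambda inputs: "defect",
    "cooperate if a whiteboard is present": lambda inputs: (
        "cooperate" if "whiteboard" in inputs else "defect"
    ),
}

shared_inputs = ("room", "whiteboard", "markers")

for name, policy in policies.items():
    mine = policy(shared_inputs)  # my run
    his = policy(shared_inputs)   # the copy's run, far away
    assert mine == his            # same code, same inputs, same output
    print(f"{name}: ({mine}, {his}) -> I get ${my_payoff(mine, his):,}")
```

The payoff function is perfectly happy to evaluate (defect, cooperate), but no choice of policy ever produces that pair: the off-diagonal cells exist in the table and nowhere else.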

To me, it’s an extremely easy choice. Just press the “give myself a million dollars” button! Indeed, at this point, if someone tells me “I defect on a perfect, deterministic copy of myself, exposed to identical inputs,” I feel like: really?

Note that this doesn’t seem like a case where any idiosyncratic predictors are going around rewarding irrationality. Nor, indeed, does it feel to me like “cooperating is an irrational choice, but it would be better for me to be the type of person who makes such a choice” or “You should pre-commit to cooperating ahead of time, however silly it will seem in the moment” (I’ll discuss cases that have more of this flavor later). Rather, it feels like what compels me is a direct, object-level argument, which could be made equally well before the copying or after. This argument recognizes a form of acausal “control” that our everyday notion of agency does not countenance, but which, pretty clearly, needs to be taken into account. Indeed, in effect, I feel like the case discovers a kind of magic; a mechanism for writing on whiteboards light-years away; a way of moving my copy’s hand to the cooperate button, or the defect button, just by moving mine. Ignoring this magic feels like ignoring a genuine and decision-relevant feature of the real world.

III. Who is the eggman, and who is the walrus?

I want to acknowledge and emphasize, though, that this kind of magic is extremely weird. Recognizing it, I think, involves a genuinely different way of understanding your situation, and your power. It makes your choices reverberate in new directions; it gives you a new type of control, over things you once thought beyond your sphere of influence – including, I’ll suggest, over events in the past (more on this below).

What’s more, I think, it changes – and clarifies — your sense of what your agency amounts to. Consider: who is the eggman, here, and who is the walrus? Suppose you want to send your copy a message: “hello, this is a message from your copy.” So you write it on your whiteboard, and thus on his. You step back, and see a message on your own whiteboard: “hello, this is a message from your copy.” Did he write that to you? Was that your way of writing to him? Are you actually alone, writing to yourself? All three at once. I said earlier that your copy is your puppet. But equally, you are his puppet. But more truly, neither of you are puppets. Rather, you are both free men, in a strange but actually possible situation. You stand in front of your whiteboard, and it is genuinely up to you what you write, or do. You can write “I am a little lollypop, booka booka boo.” You can draw a demon kitten eating a windmill. You can scream, and dance, and wave your arms around, however you damn well please. Feel the wind on your face, cowboy: this is liberty. And yet, he will do the same. And yet, you two will always move in unison.

We can think of the magic, here, as arising centrally because compatibilism about free will is true. Let’s say you got copied on Monday, and it’s Friday, now – the day both copies will choose. On Monday, there was already an answer as to what button you and your copy will press, given exposure to the Friday inputs. Maybe we haven’t computed the answer yet (or maybe we have); but regardless, it’s fixed: we just need to crunch the numbers, run the deterministic code. From this sort of pre-determination comes a classic argument against free will: if the past and the physical laws (or their computational analogs, e.g. your state on Monday, and the rest of the code that will be run on Friday) are only compatible with your performing one of (a) or (b), then you can’t be free to choose either, because this would imply that you are free to choose the past/the physical laws, which you can’t. Here, though, we pull a “one person’s reductio is another’s discovery”: because only one of (a) or (b) is compatible with the past/the physical laws, and because you are free to choose (a) or (b), it turns out that in some sense, you’re free to choose the past/the physical laws (or, their computational analogs).

What? That can’t be right. But isn’t it, in the practically relevant sense? Consider: the case is basically one where, if it’s the case that your state on Monday (call this Monday-Joe), copied and evolved according to deterministic process P, outputs “cooperate,” then you get a million dollars; and if it outputs “defect,” you get a thousand dollars (see e.g. Ahmed (2014)‘s “Betting on the Past” for an even simpler version of this). It’s Friday now. The state of Monday-Joe is fixed; Monday-Joe lives in the past. And process P, let’s say, was fixed on Monday, too. In this sense, the question of what Monday-Joe + process P outputs is already fixed. You, on Friday, are evolving-Joe: that is, Monday-Joe-in-the-midst-of-evolving-according-to-process-P. If you choose cooperate, it will always have been the case that Monday-Joe + process P outputs cooperate. If you choose defect, it will always have been the case that Monday-Joe + process P outputs defect. In this very real sense – the same sense at stake in every choice in a deterministic world – you get to choose what will have always been the case, even before your choice.

Try it. It will really work. Make your Friday choice, then leave the simulation, go get an old and isolated copy of Monday-Joe and Process P – one that’s been housed, since Monday, somewhere you could not have touched or tampered with — press play, and watch what comes out the other end. You won’t be surprised.

Is that changing the past? In one sense: no. It’s not that Joe’s state on Monday was X, but then because of what Evolving-Joe did on Friday, Joe’s state on Monday became Y instead. Nor does the output of Monday-Joe + Process P alter over the course of the week. Don’t be silly. You can’t change these things like you can change the contents of your fridge: milk on one day, juice on the next. It’s not milk at noon on Monday, and then on Friday, juice at noon on Monday instead. We must distinguish between the ability to “change things” in this sense, and the ability to “control” them in some broader sense.

But nevertheless: you get to decide, on Friday, the thing that will always have been true; the one thing that will always have been in your fridge, since the beginning of time. And perhaps this approaches, ultimately, the full sense of compatibilist decision-making, compatibilist “control,” even in cases of causal influence. Perhaps, that is, you can change the past, here, about as much as you can change the future in a deterministic world: that is, not at all, and enough to matter for practical purposes. After all, in such a world, the future is already fixed by the past. Your ability to decide that future was, therefore, always puzzling. Perhaps your ability to decide the past isn’t much more so (though certainly, it’s no less).

CDT can’t handle this kind of thing. CDT imagines that we have severed the ties between you and your copy, between you and the history that determines every aspect of you. It imagines that you can hold your copy’s arm fixed, and move yours freely; that you can break apart the future from the past, and let the future swing, at your pleasure, along some physically (indeed, logically!) impossible hinge. But you can’t. The echoes of your choice started before you chose. You are implicated in a structure that reverberates in all directions. You pull your arm, and the past and the universe trail behind; and yet, the past and universe push your arm; and yet, neither: you, the past, the future, the universe, are all born in the same timeless instant — free, fixed, consistent, a full and living painting of someone painting it as they go along.

And CDT’s mistake, here, is not just abstract misconception: rather, it misleads you in straightforward and practically-relevant ways. In particular, it prompts CDT to compare actions using expected utilities that you shouldn’t actually expect – which, when you step back, seems pretty silly. Suppose, for example, that as a CDT agent, you start out with a credence p that your copy will defect of 99%. Thus, as in Newcomb’s problem above, your payoffs are:

Expected utility from defecting: $1K guaranteed + $10K from a 1% probability of getting a million from my copy = $11K.
Expected utility from cooperating: $10K from a 1% probability of getting a million from my copy = $10K.

But you shouldn’t actually expect only $10K, if you cooperate, given the logical necessity of his doing what you do. That’s just … not the right number. So why are you considering it? This is no time to play around with fantasy distributions over outcomes; there’s real money on the line. And of course, this sort of objection will hold for any p. As long as your choice and your copy’s are correlated, CDT is going to ignore that correlation, hold p constant given different actions, and in that sense, prompt you to choose as though your probabilities are wrong.
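The gap between the number CDT uses and the number you should actually expect can be put side by side. This is a small illustrative sketch, assuming dollars as utilities and a copy who is certain to mirror you; both function names are invented for the example.

```python
MILLION = 1_000_000
THOUSAND = 1_000

def cdt_value(my_choice, p_copy_defects):
    """Expected dollars with the copy's defection probability held fixed."""
    from_copy = (1 - p_copy_defects) * MILLION
    kept = THOUSAND if my_choice == "defect" else 0
    return from_copy + kept

def value_given_mirroring(my_choice):
    """Expected dollars when the copy is certain to do what I do."""
    return MILLION if my_choice == "cooperate" else THOUSAND

for choice in ("cooperate", "defect"):
    held_fixed = cdt_value(choice, p_copy_defects=0.99)
    mirrored = value_given_mirroring(choice)
    print(f"{choice}: CDT uses ${held_fixed:,.0f}, mirroring implies ${mirrored:,.0f}")
```

Neither of the first function's outputs matches what a mirrored agent would actually walk away with, and trying other values of `p_copy_defects` does not help: the ranking it produces never changes.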

EDT does better, here, of course: choosing based on what utility you, as a Bayesian, should actually expect, given different actions, is EDT’s forté by definition, and a powerful argument in its favor (see e.g. Christiano’s “simple argument for EDT”). And the considerations about compatibilism and determinism I’ve been discussing seem friendly to EDT as well. After all, if you are living in an already-painted painting, it seems unsurprising if choice comes down to something like “managing the news.” The problem with managing the news, after all, was supposed to be that the news was already fixed. But in an already-painted painting, the future has already been fixed, too: you just don’t know what it is. And when you act, you start to find out. Insofar as you can choose how to act – and per compatibilism, you can – then you can choose what you’re going to find out, and in that sense, influence it. Do you hope that this already-fixed universe is one where you eat a sandwich? Well, go make a sandwich! If you do, you’ll discover that your dream for the universe has always been true, since the beginning of time. If you don’t make a sandwich, though, your dream will die. Why should the applicability of such reasoning be limited by the scope of “causation” (whatever that is)?

IV. What if the case is less clean?

I took pains, above, to specify that the copying process was perfect, and the inputs received exactly identical. It’s perfectly possible to satisfy this constraint, and we don’t need to use “atom-for-atom” copies and the like, or assume determinism at a physical level; we can just make you an AI system running in a deterministic simulation. What’s more, this constraint helps make the point more vivid; and it suffices, I think, to show that CDT is wrong.

However, I don’t think it’s necessary. Consider, for example, a version where there are small errors in the copying process; or in which you get a blue hat, and your copy, a red; or in which your environment involves some amount of randomness. These may or may not suffice to ruin your ability to write exactly what you want on his whiteboard. But very plausibly, the strong correlation between your choice of button, and his, will persist: and to the extent it does, this information is worthy of inclusion in your decision-making process.
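One way to get a feel for the less clean case is a quick simulation. The sketch below models “small errors” in a deliberately crude way that is my assumption, not part of the case as described: on each trial the copy does what you do, except that with some independent probability his choice flips. The function name and the flip rates are illustrative.

```python
import random

MILLION = 1_000_000
THOUSAND = 1_000

def average_payoff(my_choice, flip_probability, trials=100_000, seed=0):
    """Average dollars I receive when my copy mirrors my_choice, except
    for an independent flip with probability flip_probability."""
    rng = random.Random(seed)
    other = {"cooperate": "defect", "defect": "cooperate"}
    total = 0
    for _ in range(trials):
        flipped = rng.random() < flip_probability
        copy_choice = other[my_choice] if flipped else my_choice
        payoff = 0
        if copy_choice == "cooperate":
            payoff += MILLION   # the copy sent me a million
        if my_choice == "defect":
            payoff += THOUSAND  # I took the thousand for myself
        total += payoff
    return total / trials

for flip in (0.0, 0.05, 0.3):
    coop = average_payoff("cooperate", flip)
    defect = average_payoff("defect", flip)
    print(f"flip={flip:.2f}  cooperate=${coop:,.0f}  defect=${defect:,.0f}")
```

Under this toy model the correlation has to degrade a long way, to roughly a coin flip, before defecting looks better on average. Whether real copying errors behave anything like independent flips is a separate question.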

What if you know that your copy has already chosen, before you make your choice? To the extent that the correlations between your choice and his persist in such conditions, I think that the same argument applies. Note, though, that your knowing that he’s already chosen means that the two of you got different inputs in a sense that seems more likely to affect your decision-making than getting different colored hats. That is, you saw a light indicating “your copy has already chosen”; he didn’t; and some people, faced with a light of that kind, start acting all weird about how “his choice is already made, I can’t affect it, might as well defect” and so on, in a way that they don’t when the light is off. So the question of what sorts of correlations are still at stake is more up for grabs. Does learning that you cooperate, after seeing such a light, still make it more likely that he cooperated, without seeing one? If so, that seems worth considering.

(This sort of “different inputs” dynamic also blocks certain types of loops/contradictions that could come from learning what a deterministic copy of you already did. E.g., if you learn what he chose — say, that he cooperated — before you make your choice, it’s still compatible with the case’s set up that you defect, as long as he got different inputs: e.g., he didn’t also learn that you cooperated. If he did “learn” that you cooperated, then things are getting more complicated. In particular, either you will in fact cooperate, or some feature of the case’s set-up is false. This is similar to how, if you travel back in time and try to kill your grandfather, either you will in fact fail, or the case’s set-up is false. Or to how, if you hear an infallible prediction that you’ll do X, then either you will in fact do X, or the prediction wasn’t infallible after all.)

V. Monopoly money

I think that “perfect deterministic twin prisoner’s dilemma”-type cases suffice to show that CDT is wrong. But I also want to note another type of argument I find persuasive, in the context of Newcomb’s problem, and which also evokes the type of “magic” I have in mind.

Imagine doing “tryout runs” of Newcomb’s problem, using monopoly money, as many times as you’d like, before facing the real case (h/t Drescher (2006) again). You try different patterns of one-boxing and two-boxing, over and over. Every time you one-box, the opaque box is full. Each time you two-box, it’s empty.

You find yourself thinking: “wow, this Omega character is no joke.” But you try getting fancier. You fake left, then go right — reaching for the one box, then lunging for the second box too at the last moment. You try increasingly complex chains of reasoning. Before choosing, you try deceiving yourself, bonking yourself on the head, taking heavy doses of hallucinogens. But to no avail. You can’t pull a fast one on ol’ Omega. Omega is right every time.
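For a toy version of these tryout runs, suppose (as an assumption of this sketch only) that Omega predicts by running the same deterministic policy you will later run, on the same input. The policies below, including the fake-left-go-right one, are invented stand-ins.

```python
MILLION = 1_000_000
THOUSAND = 1_000

def play_round(policy, round_number):
    """One tryout run. Omega fills the box based on a prediction made by
    running the same policy on the same input (assumed mechanism)."""
    prediction = policy(round_number)  # made "yesterday"
    opaque_box = MILLION if prediction == "one-box" else 0
    action = policy(round_number)      # made "today"
    winnings = opaque_box + (THOUSAND if action == "two-box" else 0)
    return action, opaque_box, winnings

def fake_left_go_right(round_number):
    """Reach for one box, then lunge for both: still just two-boxing."""
    intention = "one-box"
    last_moment_switch = True
    return "two-box" if last_moment_switch else intention

def alternate(round_number):
    """One-box on even rounds, two-box on odd ones."""
    return "one-box" if round_number % 2 == 0 else "two-box"

for policy in (fake_left_go_right, alternate):
    for n in range(4):
        action, opaque_box, winnings = play_round(policy, n)
        print(f"{policy.__name__} round {n}: {action}, "
              f"box=${opaque_box:,}, winnings=${winnings:,}")
```

However elaborate the policy, the final action is all the predictor needs to match, so every round ends with either a million or a thousand. A policy that consulted a source of randomness Omega could not see would fall outside this sketch.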

Indeed, pretty quickly, it starts to feel like you can basically just decide what the opaque box will contain. “Shazam!” you say, waving your arms over the boxes: “I hereby make it the case that Omega put a million dollars into the box.” And thus, as you one box, it is so. “Shazam!” you say again, waving your arms over a new set of boxes: “I hereby make it the case that Omega left the box empty.” And thus, as you two-box, it is so. With Omega’s help, you feel like you have become a magician. With Omega’s help, you feel like you can choose the past.

Now, finally, you face the true test, the real boxes, the legal tender. What will you choose? Here, I expect some feeling like: “I know this one; I’ve played this game before.” That is, I expect to have learned, in my gut, what one-boxing, or two-boxing, will lead to — to feel viscerally that there are really only two available outcomes here: I get a million dollars, by one boxing, or I get a thousand, by two-boxing. The choice seems clear.

VI. Against undue focus on folk-theoretical names

Of course, the same two-boxing responses I noted above apply here, too. It’s true that every time you one-box, you would’ve gotten an extra $1,000 if you’d two-boxed, assuming CDT’s “counterfactual” construal of “would.” It’s true that you leave the $1,000 on the table; that this is predictably regrettable for some sense of “regret”; and we can say, for this reason, that “Omega is just play-rewarding your play-irrationality.” I don’t have especially deep responses to these objections. But I find myself persuaded, nevertheless, that one-boxing is the way to go.

Or at least, it’s my way. When I step back in Newcomb’s case, I don’t feel especially attached to the idea that it’s the way, the only “rational” choice (though I admit I feel this non-attachment less in perfect twin prisoner’s dilemmas, where defecting just seems to me pretty crazy). Rather, it feels like my conviction about one-boxing starts to bypass debates about what’s “rational” or “irrational.” Faced with the boxes, I don’t feel like I’m asking myself “what’s the rational choice?” I feel like I’m, well, deciding what to do. In one sense of “rational” – e.g., the counterfactual sense – two-boxing is rational. In another sense – the conditional sense — one-boxing is. What’s the “true sense,” the “real rationality”? Mu. Who cares? What’s that question even about? Perhaps, for the normative realists, there is some “true rationality,” etched into the platonic realm; a single privileged way that the normative Gods demand that you arrange your mind, on pain of being… what? “Faulty”? Silly? Subject to a certain sort of criticism? But for the anti-realists, there is just the world, different ways of doing things, different ways of using words, different amounts of money that actually end up in your pocket. Let’s not get too hung up on what gets called what.

There’s a great line from David Lewis, which I often think of on those rare and clear-cut occasions when philosophical debate starts to border on the terminological.

“Why care about objective value or ethical reality? The sanction is that if you do not, your inner states will fail to deserve folk-theoretical names. Not a threat that will strike terror into the hearts of the wicked! But whoever thought that philosophy could replace the hangman?”

I want to highlight, in particular, the idea of “failing to deserve folk-theoretical names.” Too often, philosophy – especially normative philosophy — devolves into a debate about what kind of name-calling is appropriate, when. But faced with the boxes, or the buttons, our eyes should not be on the folk-theoretical names at stake. Rather, our eyes should be on the choice itself.

Note that my point here is not that “rationality is about winning” (see e.g. Yudkowsky (2009)). “Winning,” here, is subject to the same ambiguity as “rational.” One-boxers tend to end up richer, yes. But faced with a choice between $1k, or nothing (the choice that the two-boxer is actually presented with), $1k is the winning choice. Still, I am with Yudkowsky in spirit, in that I think that too much interest in the word “rational” here is apt to move our eyes from the prize.

(All that said, I’m going to continue, in what follows, to use the standard language of “what’s rational,” “what you should do,” etc, in discussing these cases. I hope that this language will be interpreted in a sense that connects directly to the actual, visceral process of deciding what to do, name-calling be damned. I acknowledge, though, that there’s a possible motte-and-bailey dynamic here, where the one-boxer goes in hard for claims like “CDT is wrong” and “c’mon, defecting in perfect twin prisoner’s dilemmas is just ridiculous!” and then backs off to “hey man, you’ve got your way, I’ve got my way, what’s all this obsession with the word ‘rationality’?” when pressed about the counterintuitive consequences of their own position. And more broadly, it can be hard to combine object level normative debate, which often proceeds with a kind of “realist” flavor, with adequate consciousness and communication of some more fundamental meta-ethical arbitrariness. If necessary, we might go back through the whole post and try to rewrite it in more explicitly anti-realist terms — e.g., “I reject CDT.” But I’ll skip that, partly because I suspect that something beyond naive meta-ethical realism gets lost in this sort of move, even if we don’t have an explicit account of what it is.)

VII. Identity crises are no defense of CDT

I’ve now covered two data-points that I take to speak very strongly against CDT: namely, that one should cooperate in a twin prisoner’s dilemma, and that one should one-box in Newcomb’s problem. I want to briefly discuss an unusual way of trying to get CDT to one-box: namely, by appealing to uncertainty about whether you are faced with the real boxes, or whether you are in a simulation being used by Omega to predict your future choice (see e.g. Aaronson (2005) and Critch (2017) for suggestions in this vein, though not necessarily in these specific terms). Basically, I don’t think this move works, in general, as a way of saving CDT, though the type of uncertainty in question might be relevant in other ways.

How is the story supposed to go? Imagine that you know that the way Omega predicts whether you’ll one-box, or two-box, is by running an extremely high-fidelity simulation of you. And suppose that both real-you and sim-you only care about what happens to real-you. By hypothesis, sim-you shouldn’t be able to figure out whether he’s simulated or real, because then he’ll serve as worse evidence about real-you’s future behavior (for example, if sim-you appears in a room with writing on the wall saying “you’re the sim,” then he can just one-box, thereby causing Omega to add the money to the opaque box, thereby allowing real-you, appearing in a room saying “you’re the real one,” to two-box, get the full million-point-one, and make Omega’s “prediction” wrong). So it needs to be the case that you’re uncertain – let’s say, 50-50 — about whether you’re simulated or not. Thus, the thought goes, you should one-box, because there’s a 50% chance that doing so will cause Omega to put the million in the box, and your real-self (who will also, presumably, one-box, given the similarity between you) will get it.

(Calculation: feel free to skip. Suppose that you currently expect yourself to one-box, as both real-you and sim-you, with 99% probability. Then the CDT calculation runs as follows:

50% chance you’re the sim, in which case:
- EV of one-boxing = 99% chance real-you gets $1M, 1% chance real-you gets $1M + $1K = $1,000,010.
- EV of two-boxing = 99% chance real-you gets nothing, 1% chance real-you gets $1K = $10.
50% chance you’re real, in which case:
- EV of one-boxing: 99% chance real-you gets $1M, 1% chance real-you gets nothing = $990,000.
- EV of two-boxing: 99% chance real-you gets $1M + $1K, 1% chance real-you gets $1k = $991,000.
So overall:
- EV of one-boxing = 50% * $1,000,010 + 50% * $990,000 = $995,005.
- EV of two-boxing = 50% * $10 + 50% * $991,000 = $495,505.

Depending on the details, CDT may then need to adjust its probability that both sim-you and real-you one-box. But high confidence that both versions of you one-box is a stable equilibrium (e.g., CDT still one-boxes, given such a belief); whereas high confidence that both will two-box is not (e.g., CDT one-boxes, given such a belief). There are also some problems, here, with making such calculations consistent with assigning a specific probability to Omega being right in her prediction, but I’m setting those aside.)
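The arithmetic above can be checked with a short script. This is a minimal sketch rather than a general CDT solver: it assumes the payoffs from the text ($1M and $1K), a default 50% credence in being the sim, and that CDT holds the other copy's choice fixed at a prior credence. The names `cdt_ev`, `p_other_one_boxes`, and `p_sim` are illustrative, not from any source.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def cdt_ev(action, p_other_one_boxes, p_sim=0.5):
    """Causal expected value, to real-you, of `action` ("one-box" or "two-box").

    p_other_one_boxes: credence that the *other* copy of you one-boxes.
    p_sim: credence that you are the simulation (assumed 50% in the text).
    """
    p = p_other_one_boxes
    if action == "one-box":
        # As the sim, one-boxing causes Omega to fill the opaque box; real-you
        # then gets $1M (if he one-boxes) or $1M + $1K (if he two-boxes).
        ev_if_sim = p * MILLION + (1 - p) * (MILLION + THOUSAND)
        # As real-you, the box is full only if the sim one-boxed.
        ev_if_real = p * MILLION
    else:
        # As the sim, two-boxing causes the opaque box to be empty; real-you
        # gets nothing (if he one-boxes) or $1K (if he two-boxes).
        ev_if_sim = (1 - p) * THOUSAND
        # As real-you, you take whatever the sim's choice left, plus $1K.
        ev_if_real = p * (MILLION + THOUSAND) + (1 - p) * THOUSAND
    return p_sim * ev_if_sim + (1 - p_sim) * ev_if_real


for credence in (0.99, 0.01):
    for action in ("one-box", "two-box"):
        ev = cdt_ev(action, credence)
        print(f"P(other one-boxes)={credence}: EV of {action} = ${ev:,.0f}")
```

With a 99% credence this reproduces the $995,005 and $495,505 figures. With a 1% credence, one-boxing still comes out ahead, which is the sense in which high confidence in mutual two-boxing is not a stable belief for this agent.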

My objections here are:

1. This move doesn’t work if you’re indexically selfish (e.g., you don’t care about copies of yourself).
2. This move doesn’t work for twin prisoner’s dilemma cases more broadly.
3. It’s not clear that simulations are necessary for predicting your actions in the relevant cases.
4. In general, it really doesn’t feel like this type of thing is driving my convictions about these cases.

Let’s start with (1). Suppose that real-you and sim-you aren’t united in sole concern for real-you. Rather, suppose that you’re both out for yourselves. Sim-you, let’s suppose, faces bleak prospects: whatever happens, Omega is going to shut down the simulation right after sim-you’s choice gets made. So sim-you doesn’t give a shit about this whole ridiculous situation with the god-damn boxes; the world is dust and ashes. Real-you, by contrast, is a CDT agent. So real-you, left to his own devices, is a two-boxer. Hence, sim-you doesn’t care, and real-you wants to two-box; and thus, uncertain about who you are, you two-box.

(Calculation, feel free to skip. Suppose you start out 99% confident that both versions of you will two-box. Thus:

50% chance you’re the sim, in which case: you get nothing no matter what.
50% chance you’re real, in which case:
- EV of one-boxing: 1% chance of $1M, 99% chance of nothing = $10,000.
- EV of two-boxing: 1% chance of $1M + $1K, 99% chance of $1K = $11,000.
So overall:
- EV of one-boxing = 50% * $0 + 50% * $10,000 = $5,000.
- EV of two-boxing = 50% * $0 + 50% * $11,000 = $5,500.

This dynamic holds regardless of your initial probabilities on how different versions of you will act, and regardless of your probability on being the sim vs. being real.)
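The claim that two-boxing wins for the indexically selfish agent, whatever the probabilities, can be checked by sweeping over a grid of credences. This is a sketch under the text's assumptions (the sim gets nothing no matter what, and CDT treats the box's contents as fixed by the other copy's choice). The function name `selfish_cdt_ev` and the grid values are illustrative.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def selfish_cdt_ev(action, p_other_one_boxes, p_sim=0.5):
    """Causal EV for an agent who cares only about its own payoff."""
    # The sim is shut down right after choosing, so it gets nothing either way.
    ev_if_sim = 0.0
    # For real-you, the opaque box's expected contents are fixed by what the
    # sim already did; two-boxing just adds the visible $1K on top.
    bonus = THOUSAND if action == "two-box" else 0
    ev_if_real = p_other_one_boxes * MILLION + bonus
    return p_sim * ev_if_sim + (1 - p_sim) * ev_if_real


# Reproduce the worked numbers (99% confident both copies two-box).
print(selfish_cdt_ev("one-box", 0.01))
print(selfish_cdt_ev("two-box", 0.01))

# Sweep credences: two-boxing is never worse under this model.
grid = [i / 10 for i in range(11)]
two_boxing_never_worse = all(
    selfish_cdt_ev("two-box", p, s) >= selfish_cdt_ev("one-box", p, s)
    for p in grid
    for s in grid
)
print(two_boxing_never_worse)
```

Under this model the gap between the two options is always (1 - `p_sim`) * $1K, so two-boxing is strictly better whenever you give any credence to being real.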

Of course, real-you can try to “acausally induce” sim-you to one-box, by one-boxing himself. But “acausally inducing” other versions of yourself to do stuff isn’t the CDT way; rather, it’s the type of magical thinking silliness that CDT is supposed to eschew.

Perhaps one objects: sim-you should care about real-you! For one thing, though, this seems unobvious: indexical selfishness seems perfectly consistent and understandable (and indeed, for anti-realists, you can care about whatever you want). But more importantly, it’s an objection to a utility function, rather than to two-boxing per se; and decision theorists don’t generally go in for objecting to utility functions. If the claim is that “CDT is compatible with indexically altruistic agents one-boxing in Newcomb cases involving simulations,” then fair enough. But what about everyone else?

This leads us to objection (2): namely, that the twin prisoner’s dilemma, which I take to be one of the strongest reasons to reject CDT, is precisely a case of indexical selfishness. Perhaps I am uncertain about which copy I am; but regardless, I only care about myself; and on CDT, whatever that other guy does, I should defect. But defecting on your perfect deterministic twin, I claim, is totally crazy, even if you are indexically selfish. So CDT, I think, is still wrong.

What’s more, as I noted above, we can imagine versions of the case where I do know who I am; for example, I am the one with the blue hat, he’s the one with the red hat; I am the one who wants to create flourishing Utopias, and he (the authorities changed my values during the copying process) wants to create paperclips. Unlike “sim vs. real,” these distinctions are epistemically accessible. Still, though, if my choices are sufficiently correlated with those of my copy (and mutual cooperation is sufficiently beneficial), I should cooperate.

This is related to objection (3): namely, that not all cases where CDT gives the wrong verdicts involve simulations, or uncertainty about “who you are.” Twin prisoner’s dilemmas, where you are slightly but discernibly different from your twin, are one example: no simulations or predictions necessary. But we might also wonder about Newcomb cases more broadly. Does Omega really need to be predicting your behavior via a simulation or model that you might actually be, in order for one-boxing to be the right call? This seems, at least, a substantively additional claim. And we might wonder about e.g. predicting your behavior via your genes (see e.g. Oesterheld (2015)), by observing lots of people who are “a lot like you,” or via some other unknown method.

That said, I want to acknowledge that one of the arguments for one-boxing that I find most persuasive – e.g., running the case lots of times with “play money,” before deciding what to do for real – works a lot better in contexts with very fine-grained prediction capabilities. This is because when I’m “playing around” with no real stakes, it makes more sense to imagine me using intricate and arbitrary decision-making processes, which the incentives at stake in the real case will not constrain. Thus, for example, maybe I try forms of pseudo-randomization (“I’ll one-box if the number of letters in the sentence I’m about to make up is odd” – see Aaronson here); maybe I try spinning myself around with my eyes closed, then pressing whichever button I see first; and so on. In order for Omega’s predictions to stay well-correlated with my behavior, here, it seems plausible she needs a very (unrealistically?) high-fidelity model. And we can say something similar about the twin prisoner’s dilemma. That is, the argument for cooperating is most compelling when his arm literally moves in logically-necessary lock-step with your own, as you reach towards the buttons. Once that’s not true, if we try to imagine a “play money” version of the case, then even with fairly minor psychology differences, you and your copy’s modes of “playing around” might de-correlate fast.

This feature of the intuitive landscape seems instructive. The sense that you acausally “control” what Omega predicts, or what your copy does, seems strongest when you can, as it were, do any old thing, for any old reason, and the correlation will remain. Once the correlation requires further constraints, the intuitive case weakens. That said, if you’re in the real case, with the real incentives, then it’s ultimately the correlation given those incentives that seems relevant: e.g., maybe Omega is accurate only for real-money cases; maybe you and your copy are only highly correlated when the real money comes out. In such a case, I think, you should still one-box/cooperate.

My final objection to the “appeal to uncertainty about who are you” sort of view is just: it doesn’t feel like uncertainty about whether I’m a simulation is actually driving my one-boxing impulse. In the play-money Newcomb case, for example, I feel like what actually persuades me is a visceral sense that “one-boxing is going to result in me having a million dollars, two-boxing is going to result in me having a thousand dollars.” Questions about whether I’m a simulation, or whether Omega needs to simulate me in order to achieve this level of accuracy, just aren’t coming into it.

I conclude, then, that simulation uncertainty and related ideas can’t save CDT. Aaronson thinks that he can “pocket the $1,000,000, but still believe that the future doesn’t affect the past.” I think he’s wrong — at least in many cases where one wants the million, and can get it. He should face, I think, a weirder music.

VIII. Maybe EDT?

But what sort of music, exactly? And exactly how weird are we talking? I don’t know.

Consider, for example, EDT – CDT’s most famous rival. I think that a lot of philosophers write off EDT too quickly. As I mentioned earlier, EDT has the unique and compelling distinction of being the only view to use the utility you should actually expect, given the performance of action X, in order to calculate the expected utility of performing action X. In this sense, it’s the basic, simple-minded Bayesian’s decision theory; the type of decision theory you would use if you were, you know, trying to predict the outcomes of different actions.

What’s more, I think, a number of prominent objections to EDT seem to me, at least, much more complicated than they’re often made out to be. Consider, for example, the accusation that EDT endorses attempts to “manage the news.” There’s something true about this, but we should also be puzzled by it. Managing the news is obviously fine when you can influence the events the news is about. It’s fine, for example, to “manage the news” about whether you get a promotion, by working harder at the office. And it’s interestingly hard to “manage the news” successfully – e.g., change your rational credence in how good the future will be – with respect to things you can’t influence. Suppose, for example, that you’re worried (at, say, 70% credence) that your favored candidate lost yesterday’s election. Do you “manage the news” by refusing to read the morning’s newspaper, or by scribbling over the front page “Favored Candidate Wins Decisively!”? No: if you’re rational, your credence in the loss is still 70%.

Or take a somewhat more complicated case, discussed in Ahmed (2014). Suppose that you wake up not knowing what time it is, and all your clocks are broken. You hope that you’re not already late to work, and you consider running, to avoid either being late at all, or being later. Suppose, further, that people who run to work tend to be already late. Should you refrain from running, on the grounds that running would make it more likely that you’re already late? No. But plausibly, EDT doesn’t say you should, because running to work, in this case, wouldn’t be additional evidence that you’re already late, once we condition on the fact that you don’t know when you woke up, the reasons (including the subtle hunches about what time it might be) that you’d be running, and so on. After all, many of the already-late people running for work know that they’re already late, and are running for that reason. Your situation is different.

OK, so what does it take for the problematic type of news-management to be possible? This question matters, I think, because in some of the examples where EDT is supposed to go in for the problematic type of news-management, it’s not clear that the news-management in question would succeed. Consider:

Smoking lesion: Almost everyone who smokes has a fatal lesion, and almost everyone who doesn’t smoke doesn’t have this lesion. However, smoking doesn’t cause the lesion. Rather, the lesion causes people to smoke. Dying from the lesion is terrible, but smoking is pretty good. Should you smoke?

EDT, the objection goes, doesn’t smoke, here, because smoking increases your credence that you have the lesion. But this, the thought goes, is stupid. You’ve either already got the lesion, or you don’t have it and won’t get it. Either way, you should smoke. Not smoking is just “managing the news.”

I used to treat this case as a fairly decisive reason to reject EDT. Now I feel more confused about it. For starters, EDT clearly smokes in some versions of the case. Suppose, for example, that the way the lesion causes people to smoke is by making them want to smoke. Conditional on someone wanting to smoke, though, there’s no additional correlation between actually smoking and having the lesion. Thus, if you notice that you want to smoke (e.g., you feel a “tickle”), then that’s the bad news right there: you’ve already got all the smoking-related evidence you’re going to get about whether you’ve got the lesion. Actually smoking, or not, doesn’t change the news: so, no need for further management. This sort of argument will work for any mechanism of influence on your decision that you notice and update on. Thus the so-called “Tickle Defense” of EDT.
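The screening-off structure behind the tickle defense can be made concrete with a toy model. This is only a sketch: it assumes the causal chain lesion → desire → smoking, with smoking depending on the lesion only via the desire, and every probability below is a hypothetical number chosen for illustration, not something taken from the case as stated.

```python
from itertools import product

# Hypothetical parameters for a toy lesion -> desire -> smoking chain.
P_LESION = 0.5
P_DESIRE_GIVEN_LESION = {True: 0.95, False: 0.05}
P_SMOKE_GIVEN_DESIRE = {True: 0.9, False: 0.1}


def joint(lesion, desire, smoke):
    """Joint probability of one fully specified world."""
    p = P_LESION if lesion else 1 - P_LESION
    p_desire = P_DESIRE_GIVEN_LESION[lesion]
    p *= p_desire if desire else 1 - p_desire
    p_smoke = P_SMOKE_GIVEN_DESIRE[desire]
    p *= p_smoke if smoke else 1 - p_smoke
    return p


def p_lesion_given(**observed):
    """P(lesion | observed), by brute-force enumeration of all eight worlds."""
    numerator = denominator = 0.0
    for lesion, desire, smoke in product([True, False], repeat=3):
        world = {"lesion": lesion, "desire": desire, "smoke": smoke}
        if all(world[name] == value for name, value in observed.items()):
            prob = joint(lesion, desire, smoke)
            denominator += prob
            if lesion:
                numerator += prob
    return numerator / denominator


# Without noticing the tickle, smoking is bad news about the lesion.
print(p_lesion_given(smoke=True), p_lesion_given(smoke=False))
# Once the desire is observed, the act of smoking adds no further news.
print(p_lesion_given(desire=True))
print(p_lesion_given(desire=True, smoke=True))
print(p_lesion_given(desire=True, smoke=False))
```

In this model the last three printed values coincide: the desire carries all the evidence, so an EDT agent who has noticed it has no news left to manage. The versions of the case that trouble EDT are the ones where no such observable intermediate variable exists.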

Ok, but what if you don’t notice any tickle, or whatever other mechanism of influence is at stake? As Ahmed (2014, p. 91) characterizes it, the tickle defense assumes that all the inputs to your decision-making are “transparent” to you. But this seems like a strong condition, and granted greater ignorance, my sense is that in some versions of the case (for example, versions where the lesion makes you assign positive utility to smoking, but you don’t know what your utility function is, even as you use it in making decisions), EDT is indeed going to give the intuitively wrong result (see e.g., Demski’s “Smoking Lesion Steelman” for a worked example). Christiano argues that this is fine – “No matter how good your decision procedure is, if you don’t know a critical fact about the situation then you can make a decision that looks bad” – but I’m not so sure: prima facie, not smoking in smoking-lesion type cases seems like the type of mistake one ought to be able to avoid, even granted uncertainty about some aspects of your own psychology, and/or how the lesion works.

More generally, though, my sense is that really trying to dig into the details of tickle-defense type moves gets complicated fast, and that there’s some tension between (a) trying to craft a version of EDT where the “tickle defense” always works – e.g., one that somehow updates on everything influencing its decision-making (I’m not sure how this is supposed to work) – and (b) keeping EDT meaningfully distinct from CDT (see e.g. Demski’s sequence “EDT = CDT?”). Maybe some people are OK with collapsing the distinction, and OK, even, if EDT starts two-boxing in Newcomb’s problems (see e.g. Demski’s final comments here), and defecting on deterministic twins (I’ve been setting this possibility aside above, and following the standard understanding of how EDT acts in these cases). But for my part, a key reason I’m interested in EDT at all is because I’m interested in one-boxing and cooperating. Maybe I can get this in other ways (see e.g. the discussion of “follow the policy you would’ve committed to” below); but then, I think, EDT will lose much of its appeal (though not all; I also like the “basic Bayesian-ness” of it).

One other note on smoking lesion. You might think that the “do it over and over with monopoly money” type argument that I found persuasive earlier will give the intuitively wrong verdict on smoking lesion, suggesting that such an argument shouldn’t be trusted. After all, we might think, almost every time you smoke in a “play life,” you’ll end up with the play-lesion; and every time you don’t, you won’t. But note that when we dig in on this, the smoking lesion case can start to break in a maybe-instructive manner.

Suppose, for example, that I know that the base rate of lesions in the population is 50%, and I get “spawned” over and over into the world, where I can choose to smoke, or not. How can my “playing around” remain consistent with this 50% base rate? Imagine, for example, that I decide to refrain from smoking a million times in a row. If the case’s hypothesized correlations hold, then I will in fact spawn, consistently, without the lesion. In that case, though, it starts to look like my choice of whether to smoke or not actually is exerting a type of “control” over whether I get born as someone with the lesion – in defiance of the base rate. And if my choice can do that, then it’s not actually clear to me that non-smoking, here, is so crazy.

Maybe we could rule this out by fiat? “Well, if the base rate is 50%, then it turns out you will, in fact, decide to ‘play around’ in a way that involves smoking ~50% of the time” (thanks to Katja Grace for discussion). But this feels a bit forced, and inconsistent with the spirit of “play around however you want; it’ll basically always work” – the spirit that I find persuasive in Newcomb’s case and sufficiently-high-fidelity twin prisoner’s dilemmas. Alternatively, we could specify that I’m not allowed to know the base rate, and then we can shift it around to remain consistent with my making whatever play choices I want and spawning at the base rate. But now it looks like I can control the base rate of lesions! And if I can do that, once again, I start to wonder about whether non-smoking is so crazy after all.

That said, maybe the right thing to say here is just that the correlations posited in smoking lesion don’t persist under conditions of “play around however you want” – something that I expect holds true of various versions of Newcomb’s problem and Twin Prisoner’s dilemma as well.

What about other putative counter-examples to EDT? There are lots to consider, but at least one other one – namely, “Yankees vs. Red Sox” (see Arntzenius (2008)) — strikes me as dubious (though also, elegant). In this case, the Yankees win 90% of games, and you face a choice between the following bets:

| | Yankees win | Red Sox win |
|---|---|---|
| You bet on Yankees | 1 | -2 |
| You bet on Red Sox | -1 | 2 |

Or, if we think of the outcomes here as “you win your bet” and “you lose your bet” instead, we get:

| | You win your bet | You lose your bet |
|---|---|---|
| You bet on Yankees | 1 | -2 |
| You bet on Red Sox | 2 | -1 |

Before you choose your bet, an Oracle tells you whether you’re going to win your next bet. The issue is that once you condition on winning or losing (regardless of which), you should always bet on the Red Sox. So, the thought goes, EDT always bets on the Red Sox, and loses money 90% of the time. Betting on the Yankees every time does much better.
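The tension between the two ways of evaluating the bets can be laid out in a few lines. This sketch uses only the payoffs and the 90% figure from the case, and assumes the Oracle is perfectly accurate. The function names are illustrative.

```python
P_YANKEES_WIN = 0.9
TEAMS = ("Yankees", "Red Sox")

# PAYOFF[(team you bet on, team that wins)]
PAYOFF = {
    ("Yankees", "Yankees"): 1,
    ("Yankees", "Red Sox"): -2,
    ("Red Sox", "Yankees"): -1,
    ("Red Sox", "Red Sox"): 2,
}


def unconditional_ev(bet):
    """Expected payoff of always making `bet`, ignoring the Oracle."""
    return (P_YANKEES_WIN * PAYOFF[(bet, "Yankees")]
            + (1 - P_YANKEES_WIN) * PAYOFF[(bet, "Red Sox")])


def payoff_given_oracle(bet, oracle_says_win):
    """Payoff of `bet`, conditional on the Oracle's report about that bet."""
    other = TEAMS[1] if bet == TEAMS[0] else TEAMS[0]
    # If you will win your bet, the team you bet on is the winner;
    # if you will lose, the other team is.
    winner = bet if oracle_says_win else other
    return PAYOFF[(bet, winner)]


for bet in TEAMS:
    print(bet, "unconditional EV:", round(unconditional_ev(bet), 2))
for oracle_says_win in (True, False):
    for bet in TEAMS:
        print(bet, "| Oracle says win:", oracle_says_win,
              "| payoff:", payoff_given_oracle(bet, oracle_says_win))
```

Conditional on either report, the Red Sox bet looks better, yet its unconditional expected payoff is negative while the Yankees bet's is positive. Note that `payoff_given_oracle` ties the winner to your own bet, which is exactly the dependence that makes the reasoning unstable.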

But something is fishy here. Specifically, the Oracle’s prediction, together with your knowledge of your own decision, leaks information that should render your decision-making unstable. Suppose, for example, that the Oracle tells you that you will lose your next bet. You then reason: “Conditional on knowing that I will lose my bet, I should bet on the Red Sox. But given that I’ll lose, this means that the Yankees will win, which means I should bet on the Yankees, which means I will win my bet. But I can’t win my bet, so the Yankees will lose, so I should bet on the Red Sox,” and so on. That is, you oscillate between reasoning using the second matrix, and reasoning using the first; and you never settle down.

(Note that if we allow for playing around with monopoly money, then this case, too, suffers from the same base-rate related problems as smoking lesion: e.g., either you can change the base rates of Yankee victory at will, or you’re somehow forced to play around in a manner consistent with both the 90% base rate and the Oracle’s accuracy, or somehow the Oracle’s accuracy doesn’t hold in conditions where you can play around.)

Even if we set aside smoking lesion and Yankees vs. Red Sox, though, there is at least one counterexample to EDT that seems to me pretty solidly damning, namely:

XOR blackmail: Termites in your house is a million-dollar loss, and you don’t know if you have them. A credible and accurate predictor finds out if you have termites, then writes the following letter: “I am sending you this letter if and only if (a) I predict that you will pay me $1,000 upon receiving it, or (b) you have termites, but not both.” She then makes her prediction and follows the letter’s outlined procedure. If you receive the letter, should you pay?

(See Yudkowsky and Soares (2017), p. 24).

EDT pays, here. Why? Because conditional on paying, it’s much less likely that you’ve got termites, so paying is much better news than not paying. If you refuse to pay, you should call the exterminator (or do whatever you do with termites) pronto; if you pay, you can relax.

Or at least, you can relax for a bit. But if you’re EDT, you’re getting these letters all the time. Maybe the predictor decides to pull this stunt every day. You’re flooded with letters, all reflecting the prediction that you’ll pay. If you’d only stop paying, the letters would slow to a base-rate-of-termites-sized trickle. Try it with monopoly money: as you spawn over and over, you’ll find you can modulate the frequency of letter receipt at will, just by deciding to pay, or not, on the next round. But in real life, once you’ve got the letter, do you ever wise up, and decide, instead of paying, to already have termites? On EDT, it’s not clear (at least to me) why you would, absent some other change to the situation. Termites, after all, are terrible. And look at this letter, already sitting in your hand! It only comes given one of two conditions…
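To see both why paying is good news and why the payer ends up worse off, it helps to compare the two policies directly. This is a sketch assuming a perfectly accurate predictor. The 1% termite base rate is a hypothetical number (the case does not specify one), and the function names are illustrative.

```python
TERMITE_LOSS = 1_000_000
PAYMENT = 1_000


def letter_sent(pays_on_letter, has_termites):
    """The predictor's rule: send iff exactly one of (a) and (b) holds."""
    return pays_on_letter != has_termites


def policy_stats(pays_on_letter, p_termites):
    """Letter frequency, P(termites | letter), and expected loss for a policy."""
    p_letter = 0.0
    p_termites_and_letter = 0.0
    expected_loss = 0.0
    for has_termites, prob in ((True, p_termites), (False, 1 - p_termites)):
        loss = TERMITE_LOSS if has_termites else 0
        if letter_sent(pays_on_letter, has_termites):
            p_letter += prob
            if has_termites:
                p_termites_and_letter += prob
            if pays_on_letter:
                loss += PAYMENT
        expected_loss += prob * loss
    return {
        "p_letter": p_letter,
        "p_termites_given_letter": (
            p_termites_and_letter / p_letter if p_letter > 0 else None
        ),
        "expected_loss": expected_loss,
    }


BASE_RATE = 0.01  # hypothetical termite base rate, for illustration only
payer = policy_stats(True, BASE_RATE)
refuser = policy_stats(False, BASE_RATE)
print("payer:  ", payer)
print("refuser:", refuser)
```

Under these assumptions the payer receives a letter whenever she has no termites, and the refuser receives one only when she does. So the letter is great news for a payer and terrible news for a refuser, but the payer's expected loss is higher by the payments, and the termite rate is the same for both.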

Perhaps one thinks: the core issue here isn’t that you’re getting so many letters. Even if you know that the predictor is only going to pull this stunt once, paying seems pretty silly. Why? It’s that old thing about the past having already happened, about the opaque box already being empty or full. You’ve either already got termites, or you don’t, dude: stop trying to manage the news.

But is that the core issue? Consider:

More active termite blackmail: The predictor gets more aggressive. Once a year, she writes the following letter: “I predicted that you would pay me $1,000 upon receipt of this letter. If I predicted ‘yes,’ I left your house alone. If I predicted ‘no,’ I gave you termites.” Then she predicts, obeys the procedure, and sends. If you receive the letter, should you pay?

Here, the “it’s too late, dude” objection still applies. CDT ignores letters like this. But CDT also gets given termites once a year. EDT, by contrast, pays, and stays termite free. What’s more, by hypothesis, the stunt gets pulled on everyone the same number of times, regardless of their payment patterns. In this sense, it’s more directly analogous to Newcomb’s problem. And I find that paying, here, seems more intuitive than in the previous case (though the fact that you ultimately want to deter this sort of behavior from occurring at all may bring in additional complications; if it helps, we can specify that the predictor’s not actually in this for getting money or for giving people termites — rather, she just likes putting people in weird decision-theory situations, and will do this regardless of how her victims respond).
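The contrast with XOR blackmail shows up if we compute the yearly expected loss for each policy. This is a sketch with two assumptions not stated in the case: termites cost the same $1M as in XOR blackmail, and the predictor is right with probability `accuracy` (a hypothetical parameter), independently of your policy.

```python
TERMITE_LOSS = 1_000_000  # assumed to match the XOR blackmail case
PAYMENT = 1_000


def annual_expected_loss(pays_on_letter, accuracy=1.0):
    """Expected yearly loss for a fixed policy.

    The letter arrives every year regardless of policy; termites are given
    exactly when the predictor predicted "no" (that you won't pay).
    """
    if pays_on_letter:
        # A wrong prediction ("no") means termites on top of the payment.
        return PAYMENT + (1 - accuracy) * TERMITE_LOSS
    # A correct prediction ("no") means termites; a wrong one means no loss.
    return accuracy * TERMITE_LOSS


for accuracy in (1.0, 0.9, 0.5):
    print(f"accuracy={accuracy}: "
          f"payer loses {annual_expected_loss(True, accuracy):,.0f}, "
          f"refuser loses {annual_expected_loss(False, accuracy):,.0f}")
```

In this model, paying comes out ahead whenever the predictor's accuracy exceeds 50.05%, much as one-boxing does in Newcomb's problem. Unlike in XOR blackmail, the payer's policy here actually lowers her rate of termites rather than just her rate of letters.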

We can consider other problems with EDT as well, beyond XOR blackmail. For example, a naïve formulation of EDT has trouble with cases where it starts out certain about what it’s going to do, or even very confident (see e.g. the “cosmic ray problem” on p. 24 of Yudkowsky and Soares (2017)). And more generally, the “managing the news” flavor of EDT makes it feel, to me, like the type of thing one could come up with counter-examples to. But it’s XOR blackmail, I find, that currently gives me the most pause (and note, too, that in XOR blackmail, we can imagine that you have arbitrary introspective access, such that tickle-defense type questions about whether all the factors influencing your decision are “transparent” or not don’t really apply). And I think that the importance of the way paying influences how many letters you get, as opposed to its trying to “control the past” more broadly, may be instructive.

Summarizing this section, then: my current sense is that:

- EDT’s “basic Bayesianism” makes it attractive.
- Really digging into EDT, especially re: tickle defenses, can get kind of gnarly.
- Yankees vs. Red Sox isn’t a good counterargument to EDT.
- EDT messes up in XOR blackmail.
- There are probably a bunch of other problems with EDT that I’m not really considering/engaging with.

Does this make EDT better or worse than CDT? Currently, I’m weakly inclined to say “better” – at least in theory. But trying to actually implement EDT also seems more liable to lead to pretty silly stuff. I’ll discuss some of this silly stuff in the final section. First, though, and motivated by XOR blackmail, I want to discuss one more broad bucket of decision-theoretic options and examples – namely, those associated with following policies you would’ve wanted yourself to commit to, even when it hurts.

IX. What would you have wanted yourself to commit to?

Consider:

Parfit’s hitchhiker: You are stranded in the desert without cash, and you’ll die if you don’t get to the city soon. A selfish man comes along in a car. He is an extremely accurate predictor, and he’ll take you to the city if he predicts that once you arrive, you’ll go to an ATM, withdraw ten thousand dollars, and give it to him. However, once you get to the city, he’ll be powerless to stop you from not paying.

If you get to the city, should you pay him? Both CDT and EDT answer: no. By the time you get to the city, the risk of death in the desert is gone. Paying him, then, is pure loss (assuming you don’t value his welfare, and there are no other downstream consequences). Because they answer this way, though, both CDT and EDT agents rarely make it to the city: the man predicts, accurately, that they won’t pay.

Is this a problem? Some might answer: no, because paying in the city is clearly irrational. In particular, it violates what MacAskill (2019) calls:

Guaranteed Payoffs: When you’re certain about what the pay-offs of your different options would be, you should choose the option with the highest pay-off.

Guaranteed Payoffs, we should all agree, is an attractive principle, at least in the abstract. If you’re not taking the higher payoff, when you know exactly what payoffs your different actions will lead to, then what the heck are you doing, and why would we call it “rationality”?

On the other hand, is paying the driver really so silly? To me, it doesn’t feel that way. Indeed, I feel happy to pay, here (though I also think that the case brings in extra heuristics about promise-keeping and gratitude that may muddy the waters; better to run it with a mean and non-conscious AI system who demands that you just burn the money in the street, and kills itself before you even get to the ATM). What’s more, I want to be the type of person who pays. Indeed: if, in the desert, I could set-up some elaborate and costly self-binding scheme – say, a bomb that blows off my arm, in the city, if I don’t pay — such that paying in the city becomes straightforwardly incentivized, I would want to do it. But if that’s true, we might wonder, why not skip all this expensive faff with the bomb, and just, you know, pay in the city? After all, what if there are no bombs around to strap to my arm? What if I don’t know how to make bombs? Need my survival be subject to such contingencies? Why not learn, and practice, that oh-so valuable (and portable, and reliably available) skill instead: how to make, and actually keep, commitments? (h/t Carl Shulman, years ago, for suggesting this sort of framing.)

That said, various questions tend to blur together here – and once we pull them apart, it’s not clear to me how much substantive (as opposed to merely verbal) debate remains. Everyone agrees that it’s better to be the type of person who pays. Everyone agrees that if you can credibly commit to paying, you should do it; and that the ability to make and keep commitments is an extremely useful one. Indeed, everyone agrees that, if you’re a CDT or EDT agent about to face this case, it’s better, if you can, to self-modify into some other type of agent – one that will pay in the city (and are commitments and self-modifications really so different? Is cognition itself so different from self-modification?). As far as I can tell (and I’m not alone in thinking this), the only remaining dispute is whether, given these facts, we should baptize the action of paying in the city with the word “rational,” or if we should instead call it “an irrational action, but one that follows from a disposition it’s rational to cultivate, a self-modification it’s rational to make, a policy it’s rational to commit to,” and so on.

Is that an interesting question? What’s actually at stake, when we ask it? I’m not sure. As I mentioned above, I tend towards anti-realism about normativity; and for anti-realists, debates about the “true rationality” aren’t especially deep. Ultimately, there are just different ways of arranging your mind, different ways of making decisions, different shapes that can be given to this strange clay of self and world. Ultimately, that is, the question is just: what you in fact do in the city, and what in fact that decision means, implies, causes, and so on. We talk about “rationality” as a means of groping towards greater wisdom and clarity about these implications, effects, and so on; but if you understand all of this, and make your decisions in light of full information, additional disputes about what compliments and insults are appropriate don’t seem especially pressing.

All that said, terminology aside, I do think that Parfit’s hitchhiker-type cases can lead to genuinely practical and visceral forms of internal conflict. Consider:

Deterrence: You have a button that will destroy the world. The aliens want to invade, but they want the world intact, and they won’t invade if they predict that you’ll destroy the world upon observing their invasion. Being enslaved by the aliens is better than death; but freedom far better. The aliens predict that you won’t press the button, and so start to invade. Should you destroy the world?

This is far from a fanciful thought experiment. Rather, this is precisely the type of dynamic that decision-makers with real nuclear codes at their fingertips have to deal with. Same with tree-huggers chaining themselves to trees, teenagers playing chicken, and so on.

Or, more fancifully, consider:

Counterfactual mugging: Omega doesn’t know whether the X-th digit of pi is even or odd. Before finding out, she makes the following commitment. If the X-th digit of pi is odd, she will ask you for a thousand dollars. If the X-th digit is even, she will predict whether you would’ve given her the thousand had the X-th digit been odd, and she will give you a million if she predicts “yes.” The X-th digit is odd, and Omega asks you for the thousand. Should you pay?

(I use logical randomness, rather than e.g. coin-flipping, to make it more difficult to appeal to concern about versions of yourself that live in other quantum branches, possible worlds, and so on. Thanks to Katja Grace for suggesting this. That said, perhaps some such appeals are available regardless. For example, how did X get decided?)
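To make the tension in this case concrete, here is a minimal sketch of the two vantage points. It assumes a prior credence of one half that the X-th digit is odd, and an Omega who predicts your policy perfectly; neither assumption is stated in the case itself, and the function names are hypothetical.

```python
from fractions import Fraction

ASK = 1_000          # what Omega asks for if the digit is odd
REWARD = 1_000_000   # what Omega gives if the digit is even and she predicts "yes"


def ex_ante_value(pays_when_asked: bool, p_odd: Fraction = Fraction(1, 2)) -> Fraction:
    """Expected payoff of a policy, evaluated before the digit's parity is known.

    Assumes (for this sketch only) that Omega predicts the policy perfectly.
    """
    odd_branch = -ASK if pays_when_asked else 0
    even_branch = REWARD if pays_when_asked else 0
    return p_odd * odd_branch + (1 - p_odd) * even_branch


def ex_post_value(pays_when_asked: bool) -> int:
    """Payoff once you already know the digit is odd and Omega is asking."""
    return -ASK if pays_when_asked else 0


if __name__ == "__main__":
    for pays in (True, False):
        print(pays, ex_ante_value(pays), ex_post_value(pays))
```

Under these assumptions, the paying policy looks far better from the prior position and strictly worse once the parity is known, which is just the inconsistency at issue.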

Finally, consider a version of Newcomb’s problem in which both boxes are transparent – e.g., you can see how Omega has predicted you’ll behave. Suppose you find that Omega has predicted that you’ll one-box, and so left the million there. Should you one-box, or two-box? What if Omega has predicted that you’ll two-box?
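One way to see the pull in both directions is to evaluate whole policies rather than single actions. The sketch below is illustrative only: it assumes that Omega fills the big box if and only if she predicts you would one-box upon seeing it full, that her prediction is perfect, and that the boxes hold a million and a thousand dollars. The dictionary representation of a policy is a hypothetical convenience.

```python
from itertools import product

MILLION, THOUSAND = 1_000_000, 1_000
ONE_BOX, TWO_BOX = "one-box", "two-box"


def guaranteed_payoff(box_full: bool, action: str) -> int:
    """Payoff of an action once the contents of the big box are known."""
    return (MILLION if box_full else 0) + (THOUSAND if action == TWO_BOX else 0)


def omega_fills(policy: dict) -> bool:
    """Assumed rule: Omega fills the big box iff the policy one-boxes on seeing it full."""
    return policy["full"] == ONE_BOX


def policy_payoff(policy: dict) -> int:
    """Payoff of a policy, given that Omega predicted it perfectly (an assumption)."""
    full = omega_fills(policy)
    return guaranteed_payoff(full, policy["full" if full else "empty"])


def all_policies() -> list:
    """Every mapping from what you see (full or empty) to what you do."""
    return [{"full": f, "empty": e} for f, e in product((ONE_BOX, TWO_BOX), repeat=2)]


if __name__ == "__main__":
    for policy in all_policies():
        print(policy, policy_payoff(policy))
```

Holding the box contents fixed, two-boxing always adds a thousand; but under the assumed filling rule, only policies that one-box on seeing the million ever get to see it.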

We can think of all these cases as involving an inconsistency between the policy that an agent would want to adopt, at some prior point in time/from some epistemic position (e.g., before the aliens invade, before we know the value of the X-th digit, before Omega makes her predictions), and the action that Guaranteed Payoffs would mandate given full information. And there are lots of other cases in this vein as well (see e.g., The Absent-Minded Driver, and the literature on dynamical inconsistency in game theory).

There is a certain broad class of decision theories, a number of which are associated with the Machine Intelligence Research Institute (MIRI), that put resolving this type of inconsistency in favor of something like “the policy you would’ve wanted to adopt” at center stage. (In general, MIRI’s work on decision theory has heavily influenced my own thinking – influence on display throughout this post. See also Meacham (2010) for another view in this vein, as well as the work of Wei Dai and others on “updatelessness.”) There are lots of different ways to do this (see e.g. the discussion of the 2x2x3 matrix here), and I don’t feel like I have a strong grip on all of the relevant choice-points. Many of these views are united, though, in violating Guaranteed Payoffs, for reasons that feel, spiritually, pretty similar.

What’s more, and importantly, these theories tend to get cases like XOR blackmail right, where e.g. classic EDT gets them wrong. Consider, for example, whether before you receive any letter, you would want to commit to paying, or not paying, upon receipt. If we assume that the base rate of termites will stay constant regardless, then committing to not paying seems the clear choice. After all, doing so won’t make it more likely that you get termites; rather, it’ll make it less likely that you get letters.
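As a rough check on that reasoning, here is a small sketch that evaluates the two commitments before any letter arrives. It assumes the usual XOR blackmail rule (the letter is sent if and only if exactly one of "no termites and you will pay" or "termites and you won't pay" holds), a perfectly accurate blackmailer, a $1,000 demand, and $1,000,000 of termite damage. The 1% base rate is a placeholder chosen purely for illustration.

```python
from fractions import Fraction

DEMAND = 1_000                  # assumed size of the blackmailer's demand
DAMAGE = 1_000_000              # assumed cost of termite damage
P_TERMITES = Fraction(1, 100)   # illustrative base rate, not taken from the post


def letter_sent(termites: bool, pays_on_receipt: bool) -> bool:
    """Letter goes out iff exactly one holds: (no termites and you pay) or (termites and you don't)."""
    return (not termites and pays_on_receipt) != (termites and not pays_on_receipt)


def evaluate(pays_on_receipt: bool):
    """Return (P(termites), P(letter), expected loss) for a policy fixed in advance."""
    p_letter = Fraction(0)
    expected_loss = Fraction(0)
    for termites, p in ((True, P_TERMITES), (False, 1 - P_TERMITES)):
        got_letter = letter_sent(termites, pays_on_receipt)
        if got_letter:
            p_letter += p
        loss = (DAMAGE if termites else 0) + (DEMAND if got_letter and pays_on_receipt else 0)
        expected_loss += p * loss
    return P_TERMITES, p_letter, expected_loss


if __name__ == "__main__":
    print("commit to paying:    ", evaluate(True))
    print("commit to not paying:", evaluate(False))
```

In this toy version, the chance of termites is the same under both commitments; what changes is how often a letter shows up, and how much money goes out the door.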

If necessary, these theories can also get results like one-boxing, and cooperating with your twin, without appeal to any weird magic about controlling the past. After all, one-boxing and cooperating are both policies that you would want yourself to commit to, at least from some epistemic positions, even in a plain-old, common-sense, CDT-spirited world. Maybe executing these policies looks like trying to execute some kind of acausal control — and maybe, indeed, advocates of such policies talk in terms of such control. But maybe this is just talk. After all, executing policies that violate Guaranteed Payoffs looks pretty weird in general (for example, it looks like burning money for certain), and perhaps we need not take decisions about how to conceptualize such violations all that seriously: the main thing is what happens with the money.

A key price of this approach, though, is the whole “burning money for certain” thing; and here, perhaps, some people will want to get off the train. “Look, I was down for one-boxing, or for cooperating with my twin, when I didn’t actually know the payoffs in question. But violating Guaranteed Payoffs is just too much! You’re just destroying value for certain. That’s all. That’s the whole thing you do. You blow up the world, trying to prevent something that you know has already happened. Yes, it’s good to commit to doing that ex ante. But ex post, isn’t it also just obviously stupid?”

For people with this combination of views, though, I think it’s important to keep in mind the spiritual continuity between violating Guaranteed Payoffs, and one-boxing/cooperating more generally. After all, one of the strongest arguments for two-boxing is that, if you knew what was in the box (like, e.g., your friend does), you’d be in a Guaranteed Payoffs-type situation, and then a follower of Guaranteed Payoffs would two-box every time. Indeed, I think that part of why “great grandpappy Omega, now long dead, leaves the boxes in the attic” prompts a two-boxing intuition is that in the attic, you sense that you’re about to move from a non-transparent Newcomb’s problem to a transparent one. That is, after you bring the one-box down from the attic, and open it, the other box isn’t going to disappear. The attic door is still open. The stairs still beckon. You could just go back up there and get that thousand. Why not do it? If you got the million, it’s not going to evaporate. And if you didn’t get the million, what’s the use of letting a thousand go to waste? But that’s just the type of thinking that leads to empty boxes…

X. Building statues to the average of all logically-impossible Gods

Overall, I don’t see violations of Guaranteed Payoffs as a decisive reason to reject approaches in the vein of “act in line with the policy you would’ve wanted to commit to from some epistemic position P” – and some disputes in this vicinity strike me as verbal rather than substantive. That said, I do want to flag an additional source of uncertainty about such approaches: namely, that it seems extremely unclear what they actually imply.

In particular, all the “violate Guaranteed Payoffs” cases above rely on some implied “prior” epistemic position (e.g., before the aliens invade, before Omega has made her prediction, etc), relative to which the policy in question is evaluated. But why is that the position, instead of some other one? Even if we were just “rewinding” your own epistemology (e.g., to back before you knew that the aliens were invading, but after you learned about how they were going to make their decision), there would be a question of how far to rewind. Back to your childhood? Back to before you were born, and were an innocent platonic soul about to be spawned into the world? What features does this soul have? In what order were those features added? Does your platonic soul know basic facts about logic? What credence does it have that it’ll get born as a square circle, or into a world where 2+2=5? What in the goddamn hell are we talking about?

Also, it isn’t just a question of “rewinding” your own epistemology to some earlier epistemic position you (or even, a stripped-down version of you) held. There may be no actual time when you knew the information you’re trying to “remember” (e.g., that Omega is going to pull a counter-factual-mugging type stunt) but not the information you’re trying to “forget” (e.g., that the X-th digit of pi is odd). So it seems like the epistemic position in question may need to be one that no one – and certainly not you — has ever, in fact, occupied. How are we supposed to pick out such a position? What desiderata are even relevant? I haven’t engaged much with questions in this vein, but currently, I basically just don’t know how this is supposed to work. (I’m also not the only one with these questions. See e.g. Demski here, on the possibility that “updatelessness is doomed,” and Christiano here. And they’ve thought more about it.)

What’s more, some (basically all?) of these epistemic positions don’t seem particularly exciting from a “winning” perspective – and not just because they violate Guaranteed Payoffs. For example: weren’t you a member of some funky religion as a child – one that you now reject? And weren’t you more generally kind of dumb and ignorant? Are you sure you want to commit to a policy from that epistemic position (see e.g. Kokotajlo (2019) for more)? Or are we, maybe, imagining a superintelligent version of your childhood self, who knows everything? But wait: don’t forget to forget stuff, too, like what will end up in the boxes. But what should we “forget,” what should we “remember,” and what should we learn-for-the-first-time-because-apparently-we’re-talking-about-superintelligences now?

And even if we had such an attractive and privileged epistemic position identified, it seems additionally hard (totally impossible?) to know what policy this position would actually imply. Suppose, to take a normal everyday example that definitely doesn’t involve any theoretical problems, that you are about to be inserted as a random “soul” into a random “world.” What policy should you commit to? As Parfit’s Hitchhiker, should you pay in the city? Or should you, perhaps, commit to not getting into the man’s car at all, even if doing so is free, in order to disincentivize your younger self from taking ill-advised trips into the desert? Or should you, perhaps, commit to carving the desert sands into statues of square circles, and then burning yourself at the stake as an offering to the average of all logically impossible Gods? One feels, perhaps, a bit at sea; and a bit at risk of, as it were, doing something dumb. After all, you’ve already gone in for burning value for certain; you’ve already started trying to reason like someone you’re not, in a situation that you aren’t in. And without constraints like “don’t burn value for certain” as a basic filter on your action space, the floodgates open wide. One worries about swimming well in such water.

XI. Living with magic

Overall, the main thing I want to communicate in this post is: I think that the perfect deterministic twin’s prisoner’s dilemma case basically shows that there is such a thing as “acausal control,” and that this is super duper weird. For all intents and purposes, you can decide what gets written on whiteboards light-years away; you can move another man’s arm, in lock-step with your own, without any causal contact between him and you. It actually works, and that, I think, is pretty crazy. It’s not the type of power we think of ourselves as having. It’s not the type of power we’re used to trying to wield.
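The structure of the twin case can be sketched in a few lines. This is an illustration under the post's own stipulations (one deterministic program, run twice on identical inputs, with a thousand dollars for defecting and a million sent to the other for cooperating); the program names and the string used for the inputs are hypothetical.

```python
COOPERATE, DEFECT = "cooperate", "defect"


def payoff(my_move: str, twin_move: str) -> int:
    """My payoff: $1,000 if I defect, plus $1,000,000 if my twin cooperates."""
    mine = 1_000 if my_move == DEFECT else 0
    from_twin = 1_000_000 if twin_move == COOPERATE else 0
    return mine + from_twin


def run_twins(program, inputs):
    """Run the same deterministic program twice on identical inputs."""
    return program(inputs), program(inputs)


def defector(inputs) -> str:
    # Reasons as if the twin's move were fixed, so the extra $1,000 looks free.
    return DEFECT


def cooperator(inputs) -> str:
    return COOPERATE


if __name__ == "__main__":
    room = "a room, a whiteboard, some markers"
    for program in (defector, cooperator):
        me, twin = run_twins(program, room)
        print(program.__name__, (me, twin), payoff(me, twin))
```

The off-diagonal cells of the payoff matrix, where one twin defects and the other cooperates, are never reached by any single program run this way. The only live comparison is a thousand each versus a million each.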

What does trying to wield it actually look like, especially in our actual lives? I’m not sure. I don’t have a worked-out decision theory that makes sense of this type of thing, let alone a view about how to apply it. As a first pass, though, I’d probably start by trying to figure out what EDT actually implies, once you account for (a) tickle-defense type stuff, and (b) decorrelations between your decision and the decisions of others that arise because you’re doing some kind of funky EDT-type reasoning, and they probably aren’t.

For example: suppose that you want other people to vote in the upcoming election. Does this give you reason to vote, not out of some sort of abstract “be the change you want to see in the world” type of ethic, but because, more concretely, your voting, even in causal isolation from everyone else, will literally (if acausally) increase non-you voter turnout? Let’s first stop and really grok that voting for this reason is a weird thing to do. You’re not just trying to obey some Kantian maxim, or to do your civic duty. You’re not just saying “what if everyone acted like that?” in the abstract, like a schoolteacher to an errant child, with no expectation that “everyone,” as it were, actually will. And you’re certainly not knocking on doors or driving neighbors to the polls. Rather, you’re literally trying to influence the behavior of other people you’ll never interact with, by walking down to the voting booth on your causally isolated island. Indeed, maybe your island is in a different time zone, and you know that the polls everywhere else are closed. Still, you reason, your choice’s influence can slip the surly bonds of space and time; the evening news can still be managed (indeed, some non-EDT decision theories vote even after they’ve seen the evening news).

Is this sort of thinking remotely sensible? Well, note that the EDT version, at least, makes sense only if you should actually expect a higher non-you voter turnout, conditional on you voting for this sort of reason, than otherwise. If the voting population is “perfect deterministic copies of myself who will see the exact same inputs,” this condition holds; and it holds in various weaker conditions, too. How much does it hold in the real world, though? That’s much less clear; and as ever, if you’re considering trying to manage the news, the first thing to check is whether the news is actually manageable.
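A toy model makes the condition explicit. The sketch below assumes, very crudely, that some share of the other voters act exactly as you do (as perfect copies would) while the rest vote at a fixed base rate regardless of your choice. Every parameter here is hypothetical, and real correlations would presumably be partial rather than all-or-nothing.

```python
def expected_other_turnout(i_vote: bool, n_others: int,
                           share_correlated: float, base_rate: float) -> float:
    """Expected number of non-you voters, conditional on your own choice.

    share_correlated: fraction of others assumed to mirror your choice exactly.
    base_rate: probability that each remaining voter turns out, independent of you.
    """
    correlated = n_others * share_correlated
    independent = n_others - correlated
    return correlated * (1.0 if i_vote else 0.0) + independent * base_rate


def news_gain(n_others: int, share_correlated: float, base_rate: float) -> float:
    """How much better the news is, in expected turnout, if you vote."""
    return (expected_other_turnout(True, n_others, share_correlated, base_rate)
            - expected_other_turnout(False, n_others, share_correlated, base_rate))


if __name__ == "__main__":
    for share in (1.0, 0.1, 0.0):
        print(share, news_gain(1_000, share, 0.5))
```

With a population of perfect copies the gain is the whole electorate; with no correlation it is zero, and the news is not manageable at all.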

In particular, as Abram Demski emphasizes here, the greater the role of weird-decision-theory type calculations in your thinking, the less correlated your decisions will be with those of others who are thinking in less esoteric ways. Perhaps you should consider the influence of your behavior on the other people interested in non-causal decision-theories (evening news: “the weird decision theorists turn out in droves!”); but it’s a smaller demographic. That said, what sorts of correlations are at stake here is an empirical question, and there’s no guarantee that something common-sensical will emerge victorious. It seems possible, for example, that many people are implicitly implementing some proto-version of your decision theory, even if they’re not explicit about it.

Here’s another case that seems to me even weirder. Suppose that you’re reading about some prison camps from World War I. They sound horrible, but the description leaves many details unspecified, and you find yourself hoping that the guards in the prison camps were as nice as would be compatible with the historical evidence you’ve seen thus far. Does this give you, perhaps, some weak reason to be nicer to other people, in your own life, on the grounds that there is some weak correlation between your niceness, and the niceness of the guards? You’re all, after all, humans; you’ve got extremely similar genes; you’re subject to broadly similar influences; perhaps you and some of the guards are implementing vaguely similar decision procedures at some level; perhaps even (who knows?) there was some explicit decision theory happening in the trenches. Should you try to be the change you want to see in the past? Should you, now, try to improve the conditions in World War I prison camps? And if so: have you, perhaps, lost your marbles?

Perhaps some people will answer: look, the correlations are too weak, here, for such reasoning to get off the ground. To others, though, this will seem the wrong sort of reply. The issue isn’t that you’re wrong, empirically, about the correlations at stake – indeed, the extent of such correlations seems, in some sense, an open question. The issue is that you’re trying to improve the past at all.

There are other weird applications to consider as well. For example, once you can “control” things you have no causal interaction with, your sphere of possible control could in principle expand throughout a very large universe, allowing you to “influence” the behavior of aliens, other quantum branches, and so on (see e.g. Oesterheld (2017) for more). Indeed, there’s an argument for treating yourself as capable of such influence, even if you have comparatively low credence on the relevant funky decision theories, because being able to influence the behavior of tons of agents raises the stakes of your choice (see e.g. MacAskill et al (2019)). And taken seriously enough, the possibility of non-causal influence can lead to a very non-standard picture of the future – one in which “interactions” between causally-isolated civilizations throughout the universe/multi-verse move much closer to center stage.

Once you’ve started trying to acausally influence the behavior of aliens throughout the multiverse, though, one starts to wonder even more about the whole lost-your-marbles thing. And even if you’re OK with this sort of thing in principle, it’s a much further question whether you should expect any efforts in this broad funky-decision-theoretic vein to go well in practice. Indeed, my strong suspicion is that with respect to multiverse-wide whatever whatevers, for example, any such efforts, undertaken with our current level of understanding, will end up looking very misguided in hindsight, even if the decision theory that motivated them ends up vindicated. Here I think of Bostrom’s “ladder of deliberation,” in which one notices that whether an intervention seems like a good or bad idea switches back and forth as one reasons about it more, with no end in sight, thus inducing corresponding pessimism about the reliability of one’s current conclusions. Even if the weird-decision-theory ladder is sound, we are, I think, on a pretty early rung.

Overall, this whole “acausal control” thing is strange stuff. I think we should be careful with it, and generally avoid doing things that look stupid by normal lights, especially in the everyday situations our common-sense is used to dealing with. But the possibility of new, weird forms of control over the world also seems like the type of thing that could be important; and I think that perfect deterministic twins demonstrate that something in this vicinity is, at least sometimes, real. Its nature and implications, therefore, seem worth attention.

(My thanks to Paul Christiano, Bastian Stern, Nisan Stiennon, and especially to Katja Grace and Ketan Ramakrishnan, for discussion. And thanks, as well, to Abram Demski, Scott Garrabrant, Nick Beckstead, Rob Bensinger, and Ben Pace, for this exchange on related topics.)

Harrison D @ 2021-08-29T17:11 (+12)

I’ve seen similar discussions of EDT vs. CDT and most of the associated thought experiments elsewhere, but the emphasis here seems to be much more about whether you can actually have a causal impact on the past. You’ll have to forgive me if you address this set of points+objections somewhere and I just missed it (it’s a long post!), but my thought process is:

1. Most (if not all) of the situations you describe seem to assume away important beliefs about what is physically/epistemically possible in terms of predictive accuracy. This seems to contribute to a lot of the surprise/confusion effect.
2. CDT does seem flawed if it is stubbornly not willing to acknowledge how your decision could correlate with or provide evidence about the past, whereas EDT does seem to do that well. But I also don’t know how justified it is to blame CDT for not handling generally-unrealistic assumptions: it might be a more effective heuristic in many other (realistic) situations.
3. Perhaps most specific to your post, I really am skeptical of describing these situations as “controlling the past.” Rather, it seems that, as is typically the case, the past is controlling you and/or your actions are just correlated. When I think of it that way and combine the point of (1), I don’t find most of these situations that surprising or insightful.

I will say, one thought experiment I didn’t see mentioned after briefly ctrl+f’ing for it is the idea of “are we in a civilizational simulation” (not just “is Omega simulating me?”). That’s one area where I actually think this might be pretty interesting to think about: if we end up creating reality simulations, does that provide evidence about whether we are currently in a simulation?

Joe_Carlsmith @ 2021-09-02T01:39 (+2)

"the emphasis here seems to be much more about whether you can actually have a causal impact on the past" -- I definitely didn't mean to imply that you could have a causal impact on the past. The key point is that the type of control in question is acausal.

I agree that many of these cases involve unrealistic assumptions, and that CDT may well be an effective heuristic most of the time (indeed, I expect that it is).

I don't feel especially hung up on calling it "control" -- ultimately it's the decision theory (e.g., rejecting CDT) that I'm interested in. I like the word "control," though, because I think there is a very real sense in which you get to choose what your copy writes on his whiteboard, and that this is pretty weird; and because, more broadly, one of the main objections to non-CDT decision theories is that it feels like they are trying to "control" the past in some sense (and I'm saying: this is OK).

Simulation stuff does seem like it could be one in-principle application here, e.g.: "if we create civilization simulations, then this makes it more likely that others whose actions are correlated with ours create simulations, in which case we're more likely to be in a simulation, so because we don't want to be in a simulation, this is a reason to not create simulations." But it seems there are various empirical assumptions about the correlations at stake here, and I haven't thought about cases like this much (and simulation stuff gets gnarly fast, even without bringing weird decision-theory in).

Charles He @ 2021-08-28T21:15 (+10)

As a non-decision theorist, here’s some thoughts, well, objections really.

I think maybe my thoughts are useful to look at because they represent what a “layman” or non-specialist might think in response to your post.

But I am a scrub. If I am wrong, please feel free to just stomp all over what I write. That would be useful and illustrative. Stomp stomp stomp!

To start, I’ll quote your central example for context:

My main example is a prisoner’s dilemma between perfect deterministic software twins, exposed to the exact same inputs. This example that shows, I think, that you can write on whiteboards light-years away, with no delays; you can move the arm of another person, in another room, just by moving your own. This, I claim, is extremely weird...Nevertheless, I think that CDT is wrong. Here’s the case that convinces me most.
Perfect deterministic twin prisoner’s dilemma: You’re a deterministic AI system, who only wants money for yourself (you don’t care about copies of yourself). The authorities make a perfect copy of you, separate you and your copy by a large distance, and then expose you both, in simulation, to exactly identical inputs (let’s say, a room, a whiteboard, some markers, etc). You both face the following choice: either (a) send a million dollars to the other (“cooperate”), or (b) take a thousand dollars for yourself (“defect”).

But defecting in this case, I claim, is totally crazy. Why? Because absent some kind of computer malfunction, both of you will make the same choice, as a matter of logical necessity. If you press the defect button, so will he; if you cooperate, so will he. The two of you, after all, are exact mirror images. You move in unison; you speak, and think, and reach for buttons, in perfect synchrony. Watching the two of you is like watching the same movie on two screens.
To me, it’s an extremely easy choice. Just press the “give myself a million dollars” button! Indeed, at this point, if someone tells me “I defect on a perfect, deterministic copy of myself, exposed to identical inputs,” I feel like: really?

Objection 1:

So let’s imagine an agent who shares exactly your thoughts up to exactly the moment above.

So this agent, at this moment, has just thought about this “extremely easy choice” to cooperate and also exactly all of the logic leading up to this moment, just before they are about to press the "cooperate" button.

But then at this moment, they break from your story.

The agent (deterministic AI system, who only wants money) thinks, “Well I can get 1M + 1K by defecting, so I’m going to do that”.

Being a clone, having every atom, particle, quantum effect and random draw being identical, does not stop this logic. There’s no causal control of any kind, right?

So RIP cooperation. We get the same prisoner dilemma effect again.

(Notice that linked thinking between agents doesn’t solve this. They just can think “Well my clone must be having the same devious thoughts. Also, purple elephant spaghetti.”)

Objection 2:

Let’s say that the agents share exactly your strong views toward cooperation, and even in the strongest way, believe in the acausal control or mirroring that allows them to cooperate (or scratch their nose, perform weird mental gymnastics, etc.).

They cooperate. Great!

Ok, but here the agency/design/control was not exercised by the two agents, but instead by whatever device/process that linked/created them in the first place.

That process established this linkage with such certainty that it allows people to sync or cooperate over many light years.

It’s that process that created/pulled their strings, like a programmer writing a program.

This is a pretty boring form of control and not what you envision. There’s no control, causal or otherwise exercised at all by your agents.

Objection 3:

Again, let’s have the agents do exactly as you suggest and cooperate.

Notice that you assert some sort of abhorrence against the astronomical waste of defecting.

Frankly, you lay it on pretty thick:

To me, it’s an extremely easy choice. Just press the “give myself a million dollars” button! Indeed, at this point, if someone tells me “I defect on a perfect, deterministic copy of myself, exposed to identical inputs,” I feel like: really?

Note that this doesn’t seem like a case where any idiosyncratic predictors are going around rewarding irrationality. Nor, indeed, does feel to me like “cooperating is an irrational choice, but it would be better for me to be the type of person who makes such a choice” or “You should pre-commit to cooperating ahead of time, however silly it will seem in the moment” (I’ll discuss cases that have more of this flavor later). Rather, it feels like what compels me is a direct, object-level argument, which could be made equally well before the copying or after. This argument recognizes a form of acausal “control” that our everyday notion of agency does not countenance, but which, pretty clearly, needs to be taken into account.

Ok. Your agents follow exactly your thinking, exactly as you describe, and cooperate.

But now, your two agents aren’t really playing “prisoner’s dilemma” at all.

They do have the mechanical, physical payoffs of 1M and 1K in front of them, but as your own emphatic writing lays out, this doesn’t describe their preferences at all.

Instead, it’s better to interpret your agents’ preferences/“game”/“utility function” as having some term for the cost of defecting or aversion to astronomical waste.

Please don't hurt me.

Charles He @ 2021-08-31T01:24 (+4)

Ok, so I thought about this more and want to double down on my Objection 1:

Consider the following three scenarios for clarity:

Scenario 1: Two identical, self-interested agents play prisoner’s dilemma in their respective rooms, light years apart. These two agents are just straight out of our econ 101 lecture. Also, they know they are identical and self-interested. Ok dokie. So we get the “usual” defect/defect single-shot result. Note that we can have these agents identical, down to the last molecule and quantum effect, but it doesn’t matter. I think we all accept that we get the defect/defect result.

Scenario 2: We have your process or Omega create two identical agents, molecularly identical, quantum effect identical, etc. Again, they know they are identical and self-interested. Now, again they play the game in their respective rooms, light years apart. Again, once I point out that nothing has changed from Scenario 1, I think you would agree we get the defect/defect result.

Scenario 3: We have your process or Omega create one primary agent, and then create a puppet or slave of this primary agent that will do exactly what the primary agent does (and we put them in the two rooms with whiteboards, etc.). Now, it’s going to seem counterintuitive how this puppeting works, across the light-years, with no causation or information passing between agents. What’s going on is that, just as in Newcomb’s boxing thingy, Omega is exercising extraordinary agency or foresight, probably over both agent and copy, e.g. it’s foreseen what the primary agent will do and creates that over the puppet.

Ok. Now, in Scenario 3, indeed your story about getting the cooperate result works, because it’s truly mirroring and the primary agent can trust the puppet will copy as they do.

However, I think your story is merely creating Scenario 2, and the copying doesn’t go through.

There is no puppeting effect or Omega’s effect—this is what is biting for Scenario 3.

The puppeting doesn’t go through because Scenario 2 is the same as Scenario 1.

Another way of seeing this: imagine, in the story in your post, your agent doing something horrific, almost unthinkable, like committing genocide or stroking a cat backwards. The fact that both the agent and the copy are able to do the horrific act, and that they would mirror each other, is not adequate for this act to actually happen. Both agents need to do/choose this.

You get your result by rounding this off. You point out how tempting cooperating looks, which is indeed true, and indeed human subjects will probably cooperate in this situation. But that’s not causality or control.

As a side note, I think this "Omega effect", or control/agency, is the root of the Newcomb’s box paradox thing. Basically, CDT agents refuse the idea that they are in the inner loop of Omega or in Omega's mind’s eye as they eye the $1000 box, and think they can grab two boxes without consequence. But this rejects the premise of the whole story and doesn’t take Omega's agency seriously (which is indeed extraordinary and maybe very hard to imagine). This makes Newcomb’s paradox really uninteresting.

Also, I read all this Newcomb stuff over the last 24 hours, so I might be wrong.

jackmalde @ 2021-08-27T21:12 (+9)

I haven't read the whole post, but:

But defecting in this case, I claim, is totally crazy. Why? Because absent some kind of computer malfunction, both of you will make the same choice, as a matter of logical necessity.

Is this definitely true when you take into account quantum randomness? Maybe it is, but, if so, I think it might be worth explaining why.

Joe_Carlsmith @ 2021-08-28T08:25 (+7)

I'm imagining computers with sufficiently robust hardware to function deterministically at the software level, in the sense of very reliably performing the same computation, even if there's quantum randomness at a lower level. Imagine two good-quality calculators, manufactured by the same factory using the same process, which add together the same two numbers using the same algorithm, and hence very reliably move through the same high-level memory states and output the same answer. If quantum randomness makes them output different answers, I count that as a "malfunction."

jackmalde @ 2021-08-28T08:40 (+3)

OK thanks, and I have read through now and seen that you discuss randomness in section 4.

Overall a very interesting read! Out of interest, is this idea of "acausal control" entirely novel or has it/something similar been discussed by others?

Joe_Carlsmith @ 2021-09-02T02:00 (+2)

Not sure exactly what words people have used, but something like this idea is pretty common in the non-CDT literature, and I think e.g. MIRI explicitly talks about "controlling" things like your algorithm.

irving @ 2021-08-27T19:45 (+9)

I'm not sure anyone else is going to be brave enough to state this directly, so I'll do it:

After reading some of this post (and talking to Paul a bunch and Scott a little), I remain unconfused about whether we can control the past.

Joe_Carlsmith @ 2021-08-28T08:12 (+8)

I have sympathy for responses like "look, it's just so clear that you can't control the past in any practically relevant sense that we should basically just assume the type of arguments in this post are wrong somehow." But I'm curious where you think the arguments actually go wrong, if you have a view about that? For example, do you think defecting in perfect deterministic twin prisoner's dilemmas with identical inputs is the way to go?

irving @ 2021-08-28T13:39 (+12)

So certainly physics-based priors are a big component, and indeed in some sense are all of it. That is, I think physics-based priors should give you an immediate answer of "you can't influence the past with high probability", and moreover that once you think through the problems in detail the conclusion will be that you could influence the past if physics were different (including boundary conditions, even if laws remain the same), but that boundary condition priors should still tell us you can't influence the past. I'm happy to elaborate.

First, I think saying CDT is wrong, full stop, is much less useful than saying that CDT has a limited domain of applicability (using Sean Carroll's terminology from The Big Picture). Analogously, one shouldn't say that Newtonian physics is wrong, but that it has a limited domain of applicability, and one should be careful to apply it only in that domain. Of course, you can choose to stick to the "wrong" terminology; the claim is only that this is less useful.

So what's the domain of applicability of CDT? Roughly, I think the domain is cases where the agent can't be predicted by other agents in the world. I personally like to call this the "free will" case, but that's my personal definition, so if you don't like that definition we can call it the non-prediction case. The deterministic twin case violates this, as there is a dimension of decision making where non-prediction fails: each twin can perfectly predict the other's actions conditional on their own actions. So deterministic twins are outside the domain of applicability of CDT.

A consequence of this view is that whether we are in or out of the domain of applicability of CDT is an empirical question: you can't resolve it from pure theory. I further claim (without pinning down the definitions very well) that "generic, un-tuned" situations fall into the non-prediction case. This is again an empirical claim, and roughly says that "something needs to happen" to be outside the non-prediction case. In the deterministic twin case, this "something" is the intentional construction of the twins. Some detailed claims:

1. Humanity's past fits the non-prediction case. For example, it is not the case that "perhaps you and some of the guards are implementing vaguely similar decision procedures at some level" in World War 1, not least because most of decision theory was invented after World War 1. Again, this is a purely empirical claim: it could have been otherwise, and I'm claiming it wasn't.
2. The multiverse fits the non-prediction case. I also believe that once we have a sufficient understanding of cosmology, we will conclude that it is most likely that the multiverse fits the non-prediction case, roughly because the causal linkages behind the multiverse (through quantum branching, inflation, or logical possibilities) are high temperature in some sense. This is again an empirical prediction about cosmology, though of course it's much harder to check and I'm much less confident in it than for (1).
3. The world does not entirely fall into the non-prediction case. As an example, it is perilous when advertisers have too much information and computation asymmetry with users, since that asymmetry can break non-prediction (more here). A consequence of this is that it's good that people are studying decision theories with larger domains of applicability.
4. AGI safety v1 can likely be made to fall into the non-prediction case. This is another highly contingent claim, and requires some action to ensure, namely somehow telling AGIs to avoid the non-prediction case in appropriate senses (and designing them so that this is possible to do). (I expect to get jumped on for this one, but before you believe I'm just ignorant it might be worth asking Paul whether I'm just ignorant.) And I do mean v1; it's quite possible that v2 goes better if we have the option of not telling them this.

I do want to emphasize that as a consequence of (3), (4), uncertainty about (2), and a way tinier amount of uncertainty about (1), I'm happy people are exploring this space. But of course I'm also going to place a lower estimate on its importance as a consequence of the above.

Joe_Carlsmith @ 2021-09-02T01:22 (+5)

Thanks for these comments.

Re: “physics-based priors,” I don't think I have a full sense of what you have in mind, but at a high level, I don’t yet see how physics comes into the debate. That is, AFAICT everyone agrees about the relevant physics — and in particular, that you can’t causally influence the past, “change” the past, and so on. The question as I see it (and perhaps I should’ve emphasized this more in the post, and/or put things less provocatively) is more conceptual/normative: whether when making decisions we should think of the past the way CDT does — e.g., as a set of variables whose probabilities our decision-making can’t alter — or in the way that e.g. EDT does — e.g., as a set of variables whose probabilities our decision-making can alter (and thus, a set of variables that EDT-ish decision-making implicitly tries to “control” in a non-causal sense). Non-causal decision theories are weird; but they aren’t actually “I don’t believe in normal physics” weird. They’re more “I believe in managing the news about the already-fixed past” weird.

Re: CDT’s domain of applicability, it sounds like your view is something like: “CDT generally works, but it fails in the type of cases that Joe treats as counter-examples to CDT.” I agree with this, and I think most people who reject CDT would agree, too (after all, most decision theories agree on what to do in most everyday cases; the traditional questions have been about what direction to go when their verdicts come apart). I’m inclined to think of this as CDT being wrong, because I’m inclined to think of decision theory as searching for the theory that will get the full range of cases right, but I’m not sure that much hinges on this. That said, I do think that even acknowledging that CDT fails sometimes involves rejecting some principles/arguments one might’ve thought would hold good in general (e.g. “c’mon, man, it’s no use trying to control the past,” the "what would your friend who can see what's in the boxes say is better" argument, and so on) and thereby saying some striking and weird stuff (e.g. “Ok, it makes sense to try to control the past sometimes, just not that often").

Re: 1-4, I agree that whether or not CDT leads you astray in a given case is an empirical question. I don’t have strong views about what range of actual cases are like this — though I’m sympathetic to your view re: 1, and as I mention in the post, I generally think we should just err on the side of not doing stuff that looks silly by normal lights. I also don’t have strong views about the relevance of non-causal decision-theory research for AGI safety (this project mostly emerged from personal interest).

irving @ 2021-09-02T13:22 (+3)

By “physics-based” I’m lumping together physics and history a bit, but it’s hard to disentangle them especially when people start talking about multiverses. I generally mean “the combined information of the laws of physics and our knowledge of the past”. The reason I do want to cite physics too, even for the past case of (1), is that if you somehow disagreed about decision theorists in WW1 I’d go to the next part of the argument, which is that under the technology of WW1 we can’t do the necessary predictive control (they couldn’t build deterministic twins back then).

However, it seems like we’re mostly in agreement, and you could consider editing the post to make that more clear. The opening line of your post is “I think that you can “control” events you have no causal interaction with, including events in the past.” Now the claim is “everyone agrees about the relevant physics — and in particular, that you can’t causally influence the past”. These two sentences seem inconsistent, and especially since your piece is long and quite technical opening with a wrong summary may confuse people.

I realize you can get out of the inconsistency by leaning on the quotes, but it still seems misleading.

irving @ 2021-09-02T13:28 (+3)

Ah, I see: you’re going to lean on the difference between “cause” and “control”. So to be clear: I am claiming that, as an empirical matter, we also can’t control the past, or even “control” the past.

To expand, I’m not using physics priors to argue that physics is causal, so we can’t control the past. I’m using physics and history priors to argue that we exist in the non-prediction case relative to the past, so CDT applies.

Joe_Carlsmith @ 2021-09-02T18:47 (+3)

Cool, this gives me a clearer picture of where you're coming from. I had meant the central question of the post to be whether it ever makes sense to do the EDT-ish try-to-control-the-past thing, even in pretty unrealistic cases -- partly because I think answering "yes" to this is weird and disorienting in itself, even if it doesn't end up making much of a practical difference day-to-day; and partly because a central objection to EDT is that the past, being already fixed, is never controllable in any practically-relevant sense, even in e.g. Newcomb's cases. It sounds like your main claim is that in our actual everyday circumstances, with respect to things like the WWI case, EDTish and CDT recommendations don't come apart -- a topic I don't spend much time on or have especially strong views about.

"you’re going to lean on the difference between 'cause' and 'control'" -- indeed, and I had meant the "no causal interaction with" part of opening sentence to indicate this. It does seem like various readers object to/were confused by the use of the term "control" here, and I think there's room for more emphasis early on as to what specifically I have in mind; but at a high-level, I'm inclined to keep the term "control," rather than trying to rephrase things solely in terms of e.g. correlations, because I think it makes sense to think of yourself as, for practical purposes, "controlling" what your copy writes on his whiteboard, what Omega puts in the boxes, etc; that more broadly, EDT-ish decision-making is in fact weird in the way that trying to control the past is weird, and that this makes it all the more striking and worth highlighting that EDT-ish decision-making seems, sometimes, like the right way to go.

MichaelStJules @ 2021-08-28T20:05 (+8)

I think the thought experiments you give are pretty decisive in favour of the EDT answers over the CDT answers, and I guess I would agree that we have some kind of subtle control over the past, but I would also add:

Acting and conditioning on our actions doesn't change what happened in the past; it only tells us more about it. Finding out that Ancient Egyptians were happier than you thought before doesn't make it so that they were happier than you thought before; they already observed their own welfare, and you were just ignorant of it. While EDT would not recommend for the sake of the Ancient Egyptians to find out more about their welfare (the EV would be 0, since the ex ante distributions are the same) or even filter only for positive information about their welfare (you would need to adjust your beliefs for this bias), doesn't it suggest that if you happen to find out that the Egyptians were better off than you thought, you did something good, and if you happen to find out that the Egyptians were worse off than you thought, you did something bad?

If we control the past in the way you suggest in your thought experiments, do we also control it just by reading the Wikipedia page on Ancient Egyptians? Or do we only use EDT to evaluate the expected value of actions beforehand and not their actual value after the fact, or at least not in this way?

And then, why does this seem absurd, but not the EDT answers to your thought experiments?

Joe_Carlsmith @ 2021-09-02T01:52 (+2)

I think this is an interesting objection. E.g., "if you're into EDT ex ante, shouldn't you be into EDT ex post, and say that it was a 'good action' to learn about the Egyptians, because you learned that they were better off than you thought in expectation?" I think it depends, though, on how you are doing the ex post evaluation: and the objection doesn't work if the ex post evaluation conditions on the information you learn.

That is, suppose that before you read Wikipedia, you were 50% on the Egyptians were at 0 welfare, and 50% they were at 10 welfare, so 5 in expectation, but reading is 0 EV. After reading, you find out that their welfare was 10. OK, should we count this action, in retrospect, as worth 5 welfare for the Egyptians? I'd say no, because the ex post evaluation should go: "Granted that the Egyptians were at 10 welfare, was it good to learn that they were at 10 welfare?". And the answer is no: the learning was a 0-welfare change.
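The arithmetic in this example can be spelled out in a short sketch. The 50/50 credences and the welfare levels 0 and 10 are from the example; the variable names and the way the two ex post evaluations are labeled are illustrative assumptions.

```python
# Credences before reading: welfare level -> probability (numbers from the example).
prior = {0: 0.5, 10: 0.5}


def expected(dist):
    """Expected welfare under a credence distribution."""
    return sum(welfare * prob for welfare, prob in dist.items())


ev_before = expected(prior)

# Reading reveals the true level: with probability prior[w] you end up certain of w.
ev_after_reading = sum(prob * expected({w: 1.0}) for w, prob in prior.items())

# Ex ante, reading is worth nothing in expectation.
ex_ante_value_of_reading = ev_after_reading - ev_before

# Suppose you learn that welfare was 10.
learned = 10
# Ex post evaluation that compares against the old expectation (the reading objected to).
ex_post_vs_old_expectation = learned - ev_before
# Ex post evaluation that conditions on what was learned (the reading defended here).
ex_post_conditioning_on_learned = learned - learned

print(ev_before, ex_ante_value_of_reading)
print(ex_post_vs_old_expectation, ex_post_conditioning_on_learned)
```

The apparent gain of 5 only appears when the ex post evaluation keeps using the pre-reading expectation as its baseline.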

MichaelStJules @ 2021-09-02T04:40 (+2)

> That is, suppose that before you read Wikipedia, you were 50% on the Egyptians were at 0 welfare, and 50% they were at 10 welfare, so 5 in expectation, but reading is 0 EV. After reading, you find out that their welfare was 10. OK, should we count this action, in retrospect, as worth 5 welfare for the Egyptians? I'd say no, because the ex post evaluation should go: "Granted that the Egyptians were at 10 welfare, was it good to learn that they were at 10 welfare?". And the answer is no: the learning was a 0-welfare change.

This sounds like CDT, though, by conditioning on the past. If, for Newcomb's problem, we condition on the past and so the contents of the boxes, we get that one-boxing was worse:

"Granted that the box that could have been empty was not empty, was it better to pick only that box?". And the answer is no: you could have gotten more by two-boxing.

Of course, there's something hidden here, which is that if the box that could have been empty was not empty, you could not have two-boxed (or with a weaker predictor, it's unlikely that the box wasn't empty and you would have two-boxed).
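A rough numerical sketch of the two ways of evaluating Newcomb's problem discussed here. The $1,000,000 payoff for the opaque box and the 99% predictor accuracy are assumptions (standard in tellings of the problem, but not stated in this thread), and the function names are made up for illustration.

```python
MILLION = 1_000_000   # assumed contents of the opaque box when full
THOUSAND = 1_000      # contents of the transparent box


def payoff(action: str, box_full: bool) -> int:
    """Payoff once both the action and the box contents are fixed."""
    return (MILLION if box_full else 0) + (THOUSAND if action == "two-box" else 0)


def value_conditioning_on_action(action: str, accuracy: float) -> float:
    """EDT-style value: the action is evidence about what was predicted."""
    p_full = accuracy if action == "one-box" else 1 - accuracy
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)


def p_full_and_two_box(accuracy: float) -> float:
    """Chance the opaque box is full, given that you in fact two-box."""
    return 1 - accuracy


# Holding the contents fixed, two-boxing is always better by exactly $1000.
for box_full in (True, False):
    print(box_full, payoff("two-box", box_full) - payoff("one-box", box_full))

# Conditioning on the action instead, one-boxing comes out far ahead.
for action in ("one-box", "two-box"):
    print(action, value_conditioning_on_action(action, accuracy=0.99))
```

With `accuracy=1.0`, `p_full_and_two_box` is zero, which is the "hidden" point above: the state in which the box is full and you two-box is never realized.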

Will Aldred @ 2023-06-02T15:22 (+4)

This is a good post; I’m happy it exists. One thing I notice, which I find a little surprising, is that the post doesn’t seem to include what I'd consider the classic example of controlling the past: evidentially cooperating with beings/civilizations that existed in past cycles of the universe.^[1]

^[1]
This example does rely on a cyclic (e.g., Big Bounce) model of cosmology,^ which has a couple of issues. Firstly, that such a cosmological model is much less likely to be true, in my all-things-considered view, than eternal inflation. Secondly, that within a cyclic model, there isn't a clearly meaningful notion of time across cycles. However, I don't think these issues undercut the example. Controlling faraway events through evidential cooperation is no less possible in an eternally inflating multiverse, it's just that space is doing more of the work now than time (which makes it a less classic example for controlling the past). Also, while to an observer within a cycle, the notion of time outside their cycle may not hold much meaning, I think that from a God's eye view, there is a material sense in which the cycles occur sequentially, with some in the past of others.
In addition, the example can be adapted, I believe, to fit the simulation hypothesis. Sequential universe cycles become sequential simulation runs,* and the God’s eye view is now the point of view of the beings in the level of reality one above ours, whether that be base reality or another simulation. *(It seems likely to me that simulation runs would be massively, but not entirely, parallelized. Moreover, even if runs are entirely parallelized, it would be physically impossible—so long as the level-above reality has physical laws that remotely resemble ours—for two or more simulations to happen in the exact same spatial location. Therefore, there would be frames of reference in the base reality from which some simulation runs take place in the past of others.)
^ (One type of cyclic model, conformal cyclic cosmology, allows causal as well as evidential influence between universes, though in this model one universe can only causally influence the next one(s) in the sequence (i.e., causally controlling the past is not possible). For more on this, see "What happens after the universe ends?".)

MichaelStJules @ 2021-09-22T06:50 (+2)

I think we should be able to find lots of examples in the real world like Smoking Lesion, and I think CDT looks better than EDT in more typical choice scenarios because of it. The ones where CDT goes wrong and EDT is right (as discussed in your post) seem pretty atypical to me, although they could still matter a lot. I think both theories are probably wrong.

What matters in Smoking Lesion are:

1. Variables $A$, $B$ and $C$, with $B \neq A$ and $B \neq C$.
   - In Smoking Lesion, $A = \text{Lesion}$, $B = \text{Smoking}$ and $C = \text{Cancer}$.
2. $B$ is what you're deciding on.
3. $A$ causes both $B$ and $C$ (positively), and this is the only way these three variables are related (or $B$ also prevents $C$ (causes $\text{not } C$ or reduces $C$), but overall $C$ is more likely conditional on $B$, without accounting for $A$).
4. $C$ and $B$ have causal effects with opposite utility, conditioning on each of $A$ and not $A$, but $C$'s far outweigh $B$'s (e.g. even just some cost to choose $B$ over not $B$ if binary, or to choose to increase the value of $B$). (And maybe some of this is stronger than necessary.)
5. You have enough uncertainty about $A$, and to block the Tickle Defense, you also have enough uncertainty about any intermediate outcomes between $A$ and $C$.
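The structure listed above can be illustrated with a toy model. All probabilities and utilities below are invented for illustration, and the model assumes $B$ has no causal effect on $C$ and that you have no "tickle" information about $A$. It is a sketch of how conditioning on $B$ and intervening on $B$ come apart, not a claim about real data.

```python
# Toy Smoking Lesion model; every number here is a made-up assumption.
P_A = 0.5                                  # P(A): prior chance of the lesion
P_B_GIVEN_A = {True: 0.8, False: 0.2}      # P(B | A): A makes B more likely
P_C_GIVEN_A = {True: 0.9, False: 0.1}      # P(C | A): A causes C; B has no effect on C
U_B, U_C = 1.0, -100.0                     # B is mildly good, C is very bad


def p_a_given_b(b: bool) -> float:
    """Posterior on A after conditioning on the choice B (Bayes' rule)."""
    def likelihood(a: bool) -> float:
        return P_B_GIVEN_A[a] if b else 1 - P_B_GIVEN_A[a]
    numerator = likelihood(True) * P_A
    return numerator / (numerator + likelihood(False) * (1 - P_A))


def p_c_given_b(b: bool) -> float:
    """P(C | B): evidential, B is treated as news about A."""
    pa = p_a_given_b(b)
    return pa * P_C_GIVEN_A[True] + (1 - pa) * P_C_GIVEN_A[False]


def p_c_do_b(b: bool) -> float:
    """P(C | do(B)): causal, intervening on B leaves the prior on A untouched."""
    return P_A * P_C_GIVEN_A[True] + (1 - P_A) * P_C_GIVEN_A[False]


def edt_value(b: bool) -> float:
    return U_B * b + U_C * p_c_given_b(b)


def cdt_value(b: bool) -> float:
    return U_B * b + U_C * p_c_do_b(b)


for b in (True, False):
    print(b, edt_value(b), cdt_value(b))
```

Under these assumed numbers, the evidential calculation favors avoiding $B$ while the causal one favors choosing $B$, which is the divergence the conditions above are meant to produce.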

For example, IQ and vegetarianism tend to be positively correlated, but I think this is likely almost entirely explained by higher IQs (or something close to it) causing vegetarianism (or lower IQs preventing it), or possibly something that causes both, not a large positive effect of vegetarianism on IQ. In this case, $A = C = -\text{IQ}$ (negative of IQ, which we want to minimize, to maximize IQ) and $B = \text{not veg}$. Of course, the last condition seems likely to not be satisfied for many people, but others may have enough uncertainty about their own IQs, because they've never taken an IQ test or observed anything that's sufficiently correlated with IQ.

And to take things further, childhood and adult IQ are correlated, so an adult can increase their expected childhood IQ by becoming vegetarian.

There are probably others with IQ we could find.

Maybe we can take $A$ to be health consciousness, but there are probably many things ($= B$) health conscious people are more likely to do that don't actually help. $C$ would be future health outcomes.

It seems like consistently following EDT risks justifying useless or even actively harmful self-signaling.

PhantoMinecrafter @ 2021-10-31T22:00 (+1)

Alright... So, let's imagine that I have a vaping habit which I may believe is not harmful at all. I have an option A, which states: "If I believe vaping is not harmful, I won't suffer in my future as I would if I smoked normal cigarettes".

Then I die, respawn and have an option not to smoke/vape. However, (if I am dying with the memory) I can remember that vaping was cool. So here I have option B, which states: "I miss the melon flavor so much...".

Here comes the Universe and gives me a link in a YouTube advertisement regarding vaping and other stuff, which, for instance, may be this one https://vawoo.co.uk/ (random from Google, not spam, sorry, just as an example).

And here I have two more options:
C) To suffer from not smoking and try to force myself to forget about that, even though the Google system is smarter than we think and will continue to attack me with advertisements, forcing me to buy their subscription.
D) Forget about everything and then start to smoke again, but without the belief that this is not harmful.

I mean, this is life. People created so much stuff in order to kill themselves very slowly.