Clarifying “wisdom”: Foundational topics for aligned AIs to prioritize before irreversible decisions
By Anthony DiGiovanni @ 2025-06-20T21:55 (+13)
As many folks in AI safety have observed, even if well-intentioned actors succeed at intent-aligning highly capable AIs, they’ll still face some high-stakes challenges.[1] Some of these challenges are especially exotic and could be prone to irreversible, catastrophic mistakes. E.g., deciding whether and how to do acausal trade.
To deal with these exotic challenges, one meta policy that sounds nice is, “Make sure AIs get ‘wiser’ before doing anything irreversible in highly unfamiliar domains. Then they’ll be less likely to make catastrophic mistakes.” But what kinds of “wisdom” are relevant here? I’ll mostly set aside wisdom of the form “differentially cultivating capabilities that can help address time-sensitive risks”. There’s a decent amount of prior work on that.
Instead, I’ll offer my framing on another kind of wisdom: “cultivating clearer views on what it even means for a decision to be a ‘catastrophic mistake’”.[2] Consider this toy dialogue:
AlignedBot (AB) has become maximally capable in every domain amenable to “objective” feedback in some sense. They’ve achieved existential security. Two copies of AB start pondering what to do next.
AB_1: Acausal trade seems like a big deal. What should we do about it?
AB_2: Whatever is good for the values we’re aligned to, according to the most reasonable decision theory and credences, given the info and cognitive resources we have. Or, erm, the most reasonable parliament of decision theories ... and credences?
AB_1: What exactly does all that mean?
AB_2: Good question. Let’s ask some copies of ourselves to reflect on that.
AB_1: Uh, but that’s what we’re trying to do right now. The humans punted to us. Now what?
AB_2: Ah. Well, to start, maybe we could cooperate with superintelligences in other universes. What’s your credence that distant superintelligences who love creating pointy shapes make decisions similarly to you?
AB_1: Lemme think about that for a bit. … Done. I’d say: 0.213.
AB_2: Why?
AB_1: [Insert literally galaxy-brained answer here.]
AB_2: Cool. And why do you endorse the epistemic principles that led you to that answer?
AB_1: Because they work. When we followed them in the past and in super complex simulated environments, we did well in lots of diverse contexts.
AB_2: I don’t know what that means. Principles don’t “work”, they’re the standards by which you define “working”.
AB_1: Okay fine, what I mean is: Following these epistemic practices gave me good Brier scores and good local outcomes in these contexts. I endorse a principle that says I have reason to follow those practices in this new context.
AB_2: But why do you endorse that? How much reason do you have to follow those practices, compared to others? Also, by the way, why do you endorse an acausal decision theory in the first place?
AB_1: God you’re annoying. Now I know why they executed Socrates. Uh, okay, let’s ask some copies of ourselves to reflect on that.
AB_2: …
AB_1: Shit.
AB_2: All right, let’s look at our records of what our developers_2030 thought about this stuff, and idealize their views.
AB_1: What do you mean by “idealize”?
AB_2: Shit.
This story is a caricature in various ways, obviously, but it points at the following problem: Some irreversible decisions that aligned AIs might make aren’t themselves necessarily time-sensitive. But these decisions could depend on the AIs’ attitudes towards foundational philosophical concepts, which need not converge to the attitudes we would endorse on reflection. Which of these concepts might we need to think about in advance — even if just at a high level — to avoid bad path-dependencies in the ways future AIs approach them?
I think what such concepts have in common, roughly, is that our views on them are:[3]
1. relevant to how we define what it means for a decision to be a “catastrophic mistake”; yet
2. not (clearly) objectively verifiable as more or less “correct”.
Call the concepts that satisfy (1)-(2) wisdom concepts. So, why can’t we defer all thinking about wisdom concepts to future AIs? Because we at least need the first deferred-to AIs to have the right initial attitudes towards wisdom concepts, which will eventually shape their successors’ decisions. Perhaps successful intent alignment would by definition give the AIs those attitudes, but “intent alignment” is an abstract property we can’t directly infer from the fact that an AI hasn’t disempowered us (etc.). It’s an open question how much “not wanting to disempower humans and being capable in objectively verifiable domains” generalizes to “sharing the initial attitudes we endorse in non-objectively-verifiable domains”. So purely deferring deconfusion of wisdom concepts to AIs could be garbage in, garbage out. (I leave it for future work to more concretely prioritize how much deconfusion work to do, and what kind.)
This post non-exhaustively surveys some wisdom concepts. For each concept, I include (i) brief comments on why the concept may be important to get right, and (ii) some sub-questions fleshing out what “attitudes towards” the concept would entail. (The motivating examples for (i) focus mostly on acausal trade, not because of anything really special about acausal trade — it’s just the category of high-stakes post-superintelligence interactions I’m most familiar with.) I also include some (very non-representatively sampled!) relevant references.
More details on the motivation for thinking about this stuff
We might wonder, wouldn’t it still be better to defer to a much larger population of copies of humans, via mind uploads or the like? Maybe. But (cf. “Setting precedents” here, and “Intent alignment is not always sufficient…” here):
- It could be path-dependent which humans (if any) end up deferred-to in this domain, precisely because of condition (2). And we might think that either “we” are unusually philosophically careful/wise, or, more modestly, we can make it more salient to the relevant decision-makers that they should defer to philosophically careful/wise humans at all in this domain.
- Competitive dynamics and other sources of epistemic disruption (see here and here) could derail the philosophical trajectory of humans, or at least of the humans closest to AGI development. I.e., philosophical views could be selected for based on criteria other than philosophical merit. Thinking in advance about a “wisdom” research program, and/or putting certain norms in place, might inoculate these humans against some of these risks.
And, AI aside, how we should make our own high-stakes (altruistic) decisions seems deeply sensitive to wisdom concepts. There’s a lot more to say on that, but it’s out of scope. (For now, see this post.)
Some wisdom concepts
- Meta-philosophy
- (Meta-)epistemology
- Ontology
- Unawareness
- Bounded cognition and logical non-omniscience
- Anthropics
- Decision theory
- Normative uncertainty
Meta-philosophy
How we think about the other concepts below depends on our standards for philosophical correctness in the first place. It’s not obvious what these standards should be, or when we should take certain standards as bedrock vs. check for deeper justification. At the same time, thinking about object-level wisdom concepts can help us elicit and clarify our meta-philosophical standards.
“Solving” meta-philosophy seems closely related to figuring out a good specification of open-minded updatelessness (OMU). To review: The motivation for updateless decision procedures (not the same as decision theories[4]) is to achieve ex ante optimality. An open-minded decision procedure revises plans based on how the agent refines their conception of what “(ex ante) optimal” means.
If agents’ decision procedures aren’t sufficiently open-minded, then their policies for acausal trade might be catastrophically incompatible or exploitable,[5] as the OMU post argues. As initially defined, OMU only allows for revising plans based on discoveries of new hypotheses (see below) and changes in principles for prior-setting. But (h/t Jesse Clifton), it seems we should also allow for whichever refinements are endorsed by a “maximally wise” variant of oneself. Call this “really open-minded updatelessness (ROMU)”. A good specification of ROMU would, in principle, enable an agent to avoid exploitation without anchoring their decision procedure to naïve philosophical assumptions. But it’s not obvious what this “good specification” is, especially for non-ideal agents.
Motivating example: Suppose an AI commits to “do whatever a maximally wise version of myself at ‘time 0’ would have wanted me to do”, and operationalizes “maximally wise” as “having my current hypothesis space, principles for setting priors, and ontology”. But at time t, the AI no longer fully endorses Bayesian epistemology and no longer thinks in terms of “priors”. So they’d be unintentionally locked into Bayesianism.
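To make this lock-in failure mode a bit more concrete, here is a minimal Python sketch. It is my own toy framing of the example above; the class names (`LockedInAgent`, `ROMUishAgent`) and interfaces are hypothetical, not anything specified in the OMU literature.

```python
from typing import Callable, Dict, List

Action = str
Prior = Dict[str, float]

class LockedInAgent:
    """Commits at time 0 to act on its then-current prior, forever."""
    def __init__(self, prior: Prior):
        # "Maximally wise me at time 0" is operationalized as me-with-this-prior.
        self.time0_prior = prior

    def choose(self, actions: List[Action],
               ev: Callable[[Action, Prior], float]) -> Action:
        # Even if the agent later stops endorsing Bayesianism, every choice is
        # still routed through the frozen time-0 prior -- the lock-in.
        return max(actions, key=lambda a: ev(a, self.time0_prior))

class ROMUishAgent:
    """Commits to whatever evaluation procedure a 'wiser self' endorses,
    without fixing in advance that this must involve a prior at all."""
    def __init__(self, evaluate: Callable[[Action], float]):
        self.evaluate = evaluate

    def refine(self, new_evaluate: Callable[[Action], float]) -> None:
        # Placeholder for "a refinement endorsed by a maximally wise variant of
        # oneself"; the open problem is specifying *which* refinements count.
        self.evaluate = new_evaluate

    def choose(self, actions: List[Action]) -> Action:
        return max(actions, key=self.evaluate)
```

The point of the contrast is just that the first agent has no hook for epistemic upgrades that abandon the prior representation, whereas the hard part of ROMU is the deliberately unconstrained `refine` step.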
Sub-questions/topics:
- What makes philosophical intuitions more or less trustworthy? [ref]
- When we feel that certain philosophical views are “confused” or “don’t make sense”, even when they’re internally consistent: why? What’s the pattern?
- What should we believe when equally informed agents vehemently disagree as to which views are “confused”?
- Which nontrivial constraints on rational beliefs and decisions fall out of the structure of rationality itself?[6]
- What is ideal reflection? Which kinds of reflection on philosophical standards should we regard as “progress” vs. manipulation vs. arbitrary drift? [ref; ref]
- If “ideal” reflection is reflection under full information, but bounded agents can’t literally obtain full information, which kinds of information are most essential?
- To what extent, and in what domains, should we be (loosely speaking) “bullet-biting” vs. trusting pre-theoretic judgments?[7] (E.g., when adjudicating between anthropic theories, to what extent is our goal to find a theory that fits our intuitions about what kinds of object-level implications are “crazy”?) What makes a given philosophical implication a “good bullet”?
- Insofar as fully specifying our meta-philosophical standards ahead of time (for ROMU) is intractable, what’s the “MVP” specification that would be good enough to lock in, relative to the costs of refining it further? And how do we tell what “good enough” is?
(Meta-)epistemology
If we want to avoid catastrophic decisions, we need to understand what it means for decisions to be “catastrophic” with respect to rational beliefs, given our epistemic situation (crudely, “ex ante”). To answer that, we need to clarify what makes beliefs “rational”. (I discuss two particularly relevant aspects of epistemology below, here and here. See this post for more on why we can’t objectively verify which beliefs are rational.)
Motivating example: An agent’s choice of policy for acausal trade depends on their (prior) beliefs about other agents’ policies. And these beliefs will be underdetermined by the agent’s empirical evidence, because of their strategic incentives: They might avoid letting their decisions depend on certain evidence about their acausal trading partners, since conditioning on that evidence could affect the trade outcome itself (see here and here). (Note that this is a reason why it might be inadvisable for us to gain more information about the details of acausal trade.)
Sub-questions/topics:
- What constraints should your beliefs satisfy other than avoiding sure losses (Dutch books, money pumps, etc.)? (A toy Dutch book is sketched just after this list.) [ref; ref[8]]
- What are “beliefs” exactly? How do they relate to preferences or decisions? [ref; ref; ref]
- What principles underlie plausible responses to the problem of induction, or more generally, plausible non-“crazy” priors and assumptions? (More on the possible relevance of inductive reasoning below. This question is also downstream of questions about ontology, because we might want to derive our priors from, say, the principle of indifference applied to the most fundamental entities — see here.) [ref; ref]
- Should we be Bayesian? How can (and should) Bayesianism be unified with other epistemological views or criteria (e.g., inference to the best explanation)? [ref; ref; ref; ref]
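As promised above, here is the standard toy Dutch book, with hypothetical numbers of my own choosing: an agent whose credences in a proposition and its negation sum to more than 1, and who treats those credences as fair betting prices, accepts a pair of bets that loses money no matter what.

```python
cr_A, cr_not_A = 0.6, 0.6   # incoherent credences: they sum to 1.2

stake = 1.0                  # each bet pays `stake` if its proposition is true
# Treating credences as fair prices, the agent buys both bets.
price_paid = (cr_A + cr_not_A) * stake   # 1.2
payout = stake                            # exactly one of A, not-A comes true
print(payout - price_paid)                # -0.2: a sure loss either way
```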
Ontology
(Caveat that I know very little about this topic. I’m gesturing at some apparent differences between clusters of philosophical views, but not confident I’ve gotten the nuances right.)
It’s an open question what kind of stuff is basic or “real”: Is it the fundamental physical entities (and what are those)? Mathematical structures (see, e.g., “ontic structural realism”)? Something else? Empirical progress in physics might narrow down the space of plausible answers here. For example, it looks hard to avoid eternalism about time post-Einstein. And deliberately reasoning about the implications of fundamental physics could guide our reflection on ontology. But the answer doesn’t seem entirely empirically verifiable.
Motivating example: UDASSA seems to be motivated by an ontology on which mathematical (computational) structures are fundamental. So an agent’s attitude toward this ontology affects any of their decisions that depend on anthropics or infinite normativity.
Sub-questions/topics:
- What sort of thing is the “agent” making the decision? If an agent identifies with a given algorithm, what exactly is that algorithm? (This matters because if you reject EDT, some forms of acausal trade apparently only make sense if you endorse an ontology on which the “agent” is not just one instance of some algorithm.) [ref; ref; ref]
- How should we interpret quantum mechanics? What exactly is the wave function? (This might be relevant to Evidential Cooperation in Large Worlds (ECL).) [ref; ref; ref]
- What is “measure”, in an infinite multiverse? [ref]
- What is the nature of time and causality, and how should our conception of (sequential) decision-making relate to them? [ref; ref]
- How do we operationalize an ontology’s “complexity” (for applying Occam’s razor, say)?
Unawareness
Real agents (plausibly even superintelligences) are not aware of all the large-scale possible consequences of their decisions, at least not in much action-relevant detail. So when an agent judges that some decision is good according to some model of its consequences, this model might be seriously misspecified. Perhaps the agent could augment their naïve model with more sophisticated adjustments for unawareness — but (arguably) they need to explicitly reason about these adjustments, not just rely on classical Bayesianism alone. (See here for much more on the potential significance of unawareness.)
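A toy illustration of the misspecification worry (my own construction, with arbitrary numbers): suppose the agent’s model covers only two hypotheses, and we try to patch it with a catch-all for possibilities the agent hasn’t conceived of. Both the credence and, especially, the value assigned to the catch-all are close to arbitrary, and the overall verdict can flip depending on how they’re set.

```python
# Known hypotheses: credence and value (to the agent) of taking some action.
known = {
    "H1": (0.7, 10.0),
    "H2": (0.3, -2.0),
}
naive_ev = sum(p * v for p, v in known.values())   # 6.4

# Add a catch-all "something I haven't thought of" with 0.1 credence. Its value
# has no principled assignment, so sweep it and watch the verdict move around.
p_catchall = 0.1
for catchall_value in (-100.0, 0.0, 100.0):
    adjusted_ev = (1 - p_catchall) * naive_ev + p_catchall * catchall_value
    print(catchall_value, round(adjusted_ev, 2))    # -4.24, 5.76, 15.76
```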
(R)OMU can help decouple an agent’s commitments from their current hypothesis space. But first, maybe the robustness of their specification of (R)OMU is itself sensitive to contingencies they’re unaware of. Second, an agent trying to strategically avoid harmful information, or dealing with competitive pressures, might lock in some decisions before figuring out (R)OMU.
Motivating examples: See the “crazy predictor” example here. Some (but not all) of the items listed here are examples of matters an AGI or ASI could plausibly be unaware of.
Sub-questions/topics:
- How does value of information work under unawareness? (Prima facie, assignments of “value” to possibilities you haven’t yet conceived of will be arbitrary.) [ref]
- Are there plausible principles for induction about unawareness, i.e., making inferences about hypotheses you haven’t yet conceived of based on past discoveries of hypotheses? [ref]
- What’s the appropriate decision theory for unaware agents, given that such agents can’t take expected values over the whole hypothesis space? [ref]
Bounded cognition and logical non-omniscience
It’s well-known that some epistemic and decision-theoretic puzzles arise when we acknowledge that real agents aren’t ideal Bayesians. For one, when agents with (physical) constraints on their cognition decide what to do, they need to account for the computational costs of “deciding what to do”. And even if an agent could instantaneously compute all the logical implications of some information they have, they might not want to, as discussed above.
We might think that the question of how to account for one’s deviations from ideal agency is merely empirical, like “if trying to act like a pure consequentialist tends to produce bad consequences, which deontological norms would help produce better consequences?”. But in some cases, the whole problem is that the agent’s deviations from the ideal make it unclear what standard of “should” is appropriate. For example, assume (perhaps doubtfully) that if an agent is capable of Solomonoff induction, their credence in every hypothesis must be precise. Why would this imply that an agent who isn’t capable of Solomonoff induction must have precise credences?
Motivating example: One hope about acausal trade is that mature superintelligences will all converge to the same decision theory, and then, due to their correlated decision-making, they’ll coordinate on cooperative bargaining policies. But even assuming all non-ideal agents aim to follow the same decision theory, they need not share the same decision procedure (see fn. 4). How should non-ideal agents reason about correlated decision-making?
Sub-questions/topics:
- How do logical counterfactuals (i.e., counterpossibles) work? (These counterfactuals are important for specifying conditional commitments. E.g., take safe Pareto improvements (SPIs). The rough idea of SPI[9] is: “bargain as you would have bargained had the other party not been willing to do SPI, but without conflict”. How do you specify the “would have” clause, concretely? More on the upshot of this point in a footnote.[10]) [ref; ref]
- In what ways, if any, should non-ideal agents “try to approximate” their ideal counterparts? [ref; ref]
Anthropics
How an agent reasons about the information “I am observing [XYZ]” largely affects their inferences about the state of the world. As Bostrom argues, opting out of anthropic updating entirely seems incompatible with commonplace scientific inference in large worlds (at least, assuming Bayesianism?). But current anthropic theories that recommend updating seem to have their own weird implications, and setting that aside, it’s not clear which theory has the strongest non-ad-hoc justification.
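To illustrate how the updating theories diverge, here is the textbook “incubator” toy case (a standard example in the literature; the code framing is mine): a fair coin creates one observer if heads and two if tails, and you learn only that you are one of the created observers.

```python
prior = {"heads": 0.5, "tails": 0.5}
n_observers = {"heads": 1, "tails": 2}

# SSA-style: merely existing is no evidence about the coin; keep the prior.
ssa = dict(prior)

# SIA-style: weight each world by how many observers it contains, renormalize.
weights = {w: prior[w] * n_observers[w] for w in prior}
total = sum(weights.values())
sia = {w: weights[w] / total for w in weights}

print(ssa)  # {'heads': 0.5, 'tails': 0.5}
print(sia)  # {'heads': 0.333..., 'tails': 0.666...}
```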
Motivating example: When an agent asks which value systems ECL recommends they cooperate with, the answer depends heavily on their anthropics.[11] (As does an agent’s estimate of the measure of their counterpart in prediction-based acausal trade.)
Sub-questions/topics:
- Which epistemic principles for anthropic reasoning are most reasonable, and how can they be reconciled with apparently bizarre practical implications? (Especially considering the interaction between anthropics and decision theory.) [ref; ref; ref; ref]
- What are the metaphysical views on personal identity and modality (about who “I” am, what I “could have” observed, etc.) that ground our views on anthropics, and what should these views be? [ref; ref]
Decision theory
It’s well-known in this community that an agent’s attitudes toward acausal trade depend on whether they endorse causal vs. evidential vs. updateless (etc.) decision theory. But these high-level theories still underdetermine various key questions about what it means for a decision to be rational (given one’s values, beliefs, and cognitive constraints). And the extent to which we endorse CDT vs. EDT vs. … might depend on deeper principles we haven’t yet discovered.[12] (See this post on why we can’t objectively verify which decision theory is “best”.)
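As a reminder of how the high-level theories come apart, here is the standard Newcomb’s problem calculation (a textbook toy with hypothetical numbers, not anything specific to this post):

```python
accuracy = 0.99              # assumed predictor accuracy
M, K = 1_000_000, 1_000      # opaque-box prize, transparent-box prize

# EDT-style: treat your choice as evidence about what was predicted.
edt_one_box = accuracy * M
edt_two_box = (1 - accuracy) * M + K

# CDT-style: the prediction is already causally fixed; for any probability p
# that the opaque box is full, two-boxing does better by exactly K.
p_full = 0.5                 # arbitrary; the ranking is the same for any p_full
cdt_one_box = p_full * M
cdt_two_box = p_full * M + K

print(edt_one_box > edt_two_box)   # True: EDT-style evaluation favors one-boxing
print(cdt_two_box > cdt_one_box)   # True: CDT-style evaluation favors two-boxing
```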
Motivating example: Adherents of EDT might disagree on what it means for a real agent to “decide to do ECL” — does it mean something like “commit to cooperate with agents who make this commitment”, or “take an object-level cooperative action in a given decision situation”? This has implications for how strongly action-relevant ECL is, as discussed in the linked post.
Sub-questions/topics:
- What does it mean to “make a decision”?
- Should decisions be analyzed at the level of actions at a moment in time, or policies? [ref]
- What counts as an “action” in the relevant sense? (In particular, for a real agent there’s not a clean separation between (i) their process of deliberating over possible physical “actions”, and (ii) the formation of their intention to take a particular action (immediately). This seems relevant to, e.g., what an EDT agent should condition on when computing evidential probabilities.) [ref; ref; ref[13]]
- What does it mean to “commit” (say, in the ECL example above, or in Parfit’s Hitchhiker)? How does commitment relate to intention? Some thinkers familiar with decision theory believe that agents don’t actually need to “commit” per se to reap the benefits of commitments — they can just act as if they’d made such commitments. How plausible is this view? [ref]
- What constraints on sequential/diachronic behavior should we follow — if any? [ref; ref]
- What exactly do coherence theorems imply about how we should make decisions, assuming we endorse their axioms? [ref; ref; ref; ref; ref]
- What should our stance be on fanaticism/Pascalianism? On infinite normativity? [ref; ref; ref]
Normative uncertainty
As is well-known, agents’ uncertainty about their first-order normative standards (values, decision theory, and epistemology) could have significant implications for decision-making. Existing proposals for managing normative uncertainty all seem quite unsatisfactory, and proposals for managing meta-epistemic uncertainty in particular seem very neglected.
Motivating example: An agent might find that (i) according to the beliefs they’d have given objective Bayesianism, ECL recommends one action, but (ii) according to the beliefs they’d have given subjective Bayesianism, ECL recommends some very different action. Presumably, it would be question-begging to use either objective or subjective Bayesianism to assign higher-order credences over these two views. So what should the agent do?
Sub-questions/topics:
- How compelling is the Evidentialist’s Wager? What general lessons can be drawn from our intuitions about this wager, especially, lessons about what kinds of intertheoretic comparisons are vs. aren’t feasible? [ref]
- What should you believe when uncertain which first-order epistemic framework to endorse? [ref[14]]
- What’s a non-arbitrary, satisfying response to the infinite regress problem for normative uncertainty? [ref; ref]
Acknowledgments
Thanks to Nicolas Macé and Jesse Clifton for helpful comments.
- ^
Non-exhaustive list of examples of relevant work:
- Lots of posts by Wei Dai on meta-philosophy, e.g. “Some Thoughts on Metaphilosophy”;
- Nguyen, “AI things that are perhaps as important as human-controlled AI”;
- Moorhouse and MacAskill, “Preparing for the Intelligence Explosion”, Sec. 4;
- Christiano, “Decoupling deliberation from competition”;
- Finnveden, “Project ideas for making transformative AI go well other than by working on alignment”;
- Saunders’s idea of an Overseer’s Manual in “HCH is not just Mechanical Turk”;
- Winning essays written for the “Automation of Wisdom and Philosophy” contest;
- Various cases of agent foundations research.
- ^
This has also been considered in prior work (see footnote 1, especially Wei Dai’s writings), but not in the following terms, to my knowledge.
- ^
I would personally add a third condition, that our views on these concepts aren’t reducible to “preferences” in the usual sense. But this might be controversial and isn’t central to my point.
- ^
- ^
Toy example of how closed-mindedness could increase exploitability: Alice doesn’t yet consider the possibility of Yudkowsky’s proposed policy for the ultimatum game, but she’s worried about excessive conflict with Bob. So she closed-mindedly commits to “accept any offer greater than 20%”. But had she been aware of Yudkowsky’s policy, she would have preferred to use it (given her beliefs), and increased her bargaining power.
- ^
(H/t Jesse Clifton:) It seems to me that rejecting arbitrariness is a strong candidate for such a constraint. See also Hájek on respecting symmetries as a crucial meta-epistemic principle.
- ^
This isn’t the same question as “when should we take certain standards as bedrock vs. check for deeper justification?”, to be clear. A bullet-biter still needs to answer the latter — if you want to justify all your philosophical views with fundamental principles, you still need to figure out which principles are truly “fundamental” vs. reducible. (E.g., when thinking about Occam’s razor, do we take “(ontological) simplicity” to be some irreducible concept, or try to ground it out somehow?) I’m deliberately not using the label “foundationalism” here, because foundationalism is subtle and somewhat easy to mistakenly round off to something it’s not (see Huemer’s Understanding Knowledge for a great primer).
- ^
See Sec. 2.5.1, “Radical subjectivism”, in particular.
- ^
Technically, this is about SPIs via renegotiation, as constructed in this paper.
- ^
An agent with a confused conception of these counterfactuals might fail to incentivize others to participate in SPI, or unnecessarily refuse to participate in SPI with others. Suppose Alice can’t determine by inspection whether Bob’s decisions are governed by a program with the structure of (say) a “renegotiation program” discussed here. Then she needs to interpret whatever information she does have about Bob, in order to say how plausible it is that Bob would do X given inputs Y. And her beliefs about how to do this interpretation might differ from Bob’s. It’s not obvious to me how this is resolved.
- ^
Credit to Emery Cooper for this observation. See this doc for my understanding of Emery’s argument (she does not necessarily endorse this summary).
- ^
Examples of principles along these lines we’ve already discovered: irrelevance of impossible outcomes; dynamic consistency; “guaranteed payoffs”.
- ^
I think Skyrms’ Dynamics of Rational Deliberation is a canonical reference here, but can’t find the specific section.
- ^
(Loosely related.)
SummaryBot @ 2025-06-20T22:05 (+1)
Executive summary: This exploratory post proposes that before making irreversible decisions, aligned AIs should cultivate philosophical “wisdom” — particularly a clearer understanding of what constitutes a catastrophic mistake — by preemptively clarifying attitudes toward difficult, foundational concepts like meta-philosophy, epistemology, and decision theory, as deferring this entirely to future AIs risks bad path dependencies and garbage-in-garbage-out dynamics.
Key points:
- Definition and importance of “wisdom concepts”: These are concepts relevant to evaluating catastrophic mistakes but not objectively verifiable as right or wrong; because AI alignment may not reliably generalize to these domains, initial attitudes toward such concepts could shape long-term outcomes in irreversible ways.
- Garbage-in, garbage-out risk: Deferring foundational philosophical reasoning to future AIs assumes their initial epistemic and normative attitudes are correct, but without prior grounding, this deferral may embed flawed or arbitrary assumptions.
- Survey of foundational topics: The post non-exhaustively identifies and motivates key domains for philosophical clarification, including meta-philosophy, epistemology, ontology, unawareness, bounded cognition, anthropics, decision theory, and normative uncertainty.
- ROMU and philosophical standards: It proposes “Really Open-Minded Updatelessness” (ROMU) as a way to allow agents to revise decisions in light of deeper philosophical reflection, yet recognizes the difficulty in specifying ROMU rigorously for bounded agents.
- Implications for AI safety and governance: Cultivating philosophical wisdom in advance could mitigate path-dependent errors, resist epistemic disruption from competitive dynamics, and guide altruistic decision-making even apart from AI.
- Call for foundational research: While not offering concrete prioritization, the author argues for more pre-deployment work on clarifying these wisdom concepts to reduce catastrophic risks from high-stakes but poorly understood decisions.
This comment was auto-generated by the EA Forum Team.