Disentangling Some Important Forecasting Concepts and Terms

By Marcel D @ 2023-06-25T17:31 (+16)

TL;DR / Intro

Some people say things like "You can't put numbers on the probability of X; nobody knows the probability / it's unknowable!" This sentiment is facially understandable, but it is ultimately misguided and based on conflating forecasting concepts that I have started calling “best estimates,” “forecast resilience,” and “forecast legibility.” These distinctions are especially important when discussing some existential risks (e.g., AI x-risk). I think the conflation of terms leads to bad analysis and communication, such as using unjustifiably low probability estimates for risks—or implicitly assuming no risk at all! Thus, I recommend disentangling these concepts. 

The terms I suggest using are as follows:

- “Best estimate” (or “error-minimizing estimate”): given a person’s actual set of information and analysis, the estimate that minimizes that person’s expected error (e.g., Brier score).
- “Resilience” (or “credal resilience”): how much I currently expect my best estimate to change in response to new information or further thinking, prior to a given reference point in time or decision juncture.
- Forecast “legibility”: how much time/effort I expect a given audience would require to understand the justification behind my estimate, and/or to understand that it was made in good faith rather than out of laziness, incompetence, or deliberate dishonesty.

I especially welcome feedback on whether this topic is worth further elaboration.

 

Problems/Motivation

Recently, I was at a party with some EAs. One person (“Bob”) was complaining that most of the “cost-benefit analysis” supporting AI policy/safety has little empirical evidence behind it and is otherwise too “speculative,” whereas there is clear evidence supporting the efficacy of causes like global health, poverty alleviation, and animal welfare.

I responded: “Sure, it’s true that those kinds of causes have way more empirical evidence supporting their efficacy, but you can still have ‘accurate’ or ‘justified’ forecasts that AI existential risk is >1%; you just can’t be as confident in your estimates, and they may not be very ‘credible’ or ‘legible.’”

Bob: “What? Those are the same things; you can’t have an accurate forecast if you aren’t confident in it, and you can’t rely on uncredible forecasts. We just don’t have enough empirical evidence to make reliable, scientific forecasts.”[1]

This is just one of many conversations I have been a part of or witnessed that spiral into confusion or other epistemic failure modes. To be more specific, some of these problems include:

Sources of the (screenshotted) examples:

- https://twitter.com/james_acton32/status/1506980433799618566?s=20&t=oyY3PrrB938-uJjql2p9LA
- https://twitter.com/JgaltTweets/status/1671907149939634177
- https://twitter.com/ESYudkowsky/status/852981816180973568 (see also https://astralcodexten.substack.com/p/mr-tries-the-safe-uncertainty-fallacy)

It seems there are many reasons why people do these things, and this article will not try to address all of the reasons. However, I think one important and mitigatable factor is that people sometimes conflate forecasting concepts such as “best estimates,” resilience, and legibility. Thus, in this article I try to highlight and disentangle some of these concepts.
 

Distinguishing probability estimates, resilience, and legibility

I am still very uncertain about how best to define or label the various concepts; I mainly just present the following as suggestions to spark further discussion:

Best estimate (or more precisely, “Error-minimizing estimate”)

Basically, given an actual person’s set of information and analysis, what is the estimate that minimizes that person’s expected error (e.g., Brier score)?
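To make the “error-minimizing” framing concrete, here is a minimal sketch in Python (the 0.7 credence and the helper name expected_brier are my own illustrative assumptions, not from any forecasting library): if you genuinely believe an event has probability 0.7, then reporting 0.7 is what minimizes your expected Brier score.

```python
# Minimal sketch: if your true credence in an event is p_true, reporting p_true
# minimizes your expected Brier score (squared error against the 0/1 outcome).

def expected_brier(reported: float, p_true: float) -> float:
    """Expected Brier score of `reported` if the event truly occurs with probability p_true."""
    return p_true * (reported - 1) ** 2 + (1 - p_true) * (reported - 0) ** 2

p_true = 0.7  # assumed credence, purely for illustration
candidates = [i / 100 for i in range(101)]
best_report = min(candidates, key=lambda r: expected_brier(r, p_true))
print(best_report)  # 0.7 — the error-minimizing report equals your actual credence
```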

“Resilience” (or “credal resilience”)

How much do I currently expect my 'best estimate' of this event to change in response to receiving new information or thinking more about the problem prior to a given reference point in time or decision juncture (e.g., the event happening, a grant needing to be written, ‘one year from now’)?[11][12]

 

|  | Low resilience | High resilience |
|---|---|---|
| Estimate w/ low discrimination | 100th flip of a biased coin (before seeing first flip) | Fair coin flip |
| Estimate w/ high discrimination[14] | [Not logically possible?[15]] | 100th flip of a biased coin (after seeing 99 flips) |
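To make the resilience column a bit more concrete, here is a minimal sketch (my own formalization, not a standard definition): treat an estimate as low resilience when the expected absolute change in the best estimate, between now and the reference point, is large. The 80/20 bias magnitude is an assumption (matching the 80% figure in footnote 9), as is the uniform prior over the two bias directions.

```python
# Sketch: resilience as the expected absolute change in the best estimate once the
# upcoming evidence (99 observed flips) arrives. Assumes an 80/20 bias in an
# unknown direction, with a 50/50 prior over the two directions.

from math import comb

P_BIAS = 0.8  # assumed magnitude of the bias

def best_estimate(n_heads: int, n_flips: int) -> float:
    """Best estimate of P(heads on the next flip) after observing n_heads in n_flips."""
    like_h = P_BIAS ** n_heads * (1 - P_BIAS) ** (n_flips - n_heads)  # if heads-biased
    like_t = (1 - P_BIAS) ** n_heads * P_BIAS ** (n_flips - n_heads)  # if tails-biased
    w = like_h / (like_h + like_t)  # posterior probability that the coin is heads-biased
    return w * P_BIAS + (1 - w) * (1 - P_BIAS)

def expected_update(n_flips: int) -> float:
    """Expected |new best estimate - current best estimate| after seeing n_flips flips."""
    now = best_estimate(0, 0)  # 0.5 before seeing any flips
    total = 0.0
    for k in range(n_flips + 1):
        # P(observing k heads), averaging over the two equally likely bias directions
        p_obs = 0.5 * comb(n_flips, k) * (
            P_BIAS ** k * (1 - P_BIAS) ** (n_flips - k)
            + (1 - P_BIAS) ** k * P_BIAS ** (n_flips - k)
        )
        total += p_obs * abs(best_estimate(k, n_flips) - now)
    return total

print(best_estimate(0, 0))  # 0.5 — numerically identical to a fair coin (low discrimination)
print(expected_update(99))  # ~0.3 — the estimate is expected to move a lot (low resilience)
# A genuinely fair coin also gives 0.5 today, but no observation would move that
# estimate, so its expected update is 0: high resilience despite low discrimination.
```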

Forecast “legibility”[17]

The current definition I prefer is vaguely something like: “How much time/effort do I expect a given audience would require to understand the justification behind my estimate (even if they do not necessarily update their own estimates to match), and/or to understand that my estimate is made in good faith as opposed to resulting from laziness, incompetence, or deliberate dishonesty?” (I do not have a more precise operationalization at the moment.[18])

 

Other Concepts

I do not want to get too deep into some other concepts, but I’ll briefly note two others that I have not fully fleshed out:

 

Closing Remarks

I think there are many points of confusion that undermine people’s ability to effectively make or correctly interpret forecasts. However, in my experience one important contributor to these problems is that people sometimes conflate concepts such as best estimates, resilience, and legibility. This post is an exploratory attempt to disentangle these concepts.

Ultimately, I am curious to hear people’s thoughts on:

  1. Whether they frequently see people (especially inside of EA) conflating these terms;
  2. Whether these instances of conflation are significant/impactful for discussions and decisions; and
  3. Whether the distinctions and explanations I provide make sense and help to reduce the frequency/significance of such confusion.

 

  1. ^

    I can’t remember exactly how everything was said, but this is basically how I interpreted it at the time.

  2. ^

    I recognize this may seem like a bold claim, but it may simply be that my professors incorrectly interpreted/taught Knightian Uncertainty, as some of the claims they made about the inability to assign probabilities to things seemed clearly wrong when taken seriously/literally. FWIW, there were multiple situations where I felt my public policy professors did not fully understand what they were teaching. (In one case, I challenged a formula/algorithm for cost-benefit analysis that the professor had been teaching from a textbook "for years," and the professor actually reversed course and admitted that the textbook's algorithm was wrong.)

  3. ^

    For example, saying “your claims that the technology could be dangerous are unscientific and unfalsifiable” while at the same time (implicitly) forecasting that the technology will almost certainly be safe. 

  4. ^

    Relatedly, see https://www.lesswrong.com/posts/qgm3u5XZLGkKp8W8b/retrospective-forecasting: “More serious is that we don't know what we're talking about when we ask about the probability of an event taking place when said event happened.”

  5. ^

    Basically, someone familiar with the process remarked that in some cases a loosely-predictable cost or benefit might simply not be incorporated into the review because the methodology or assumptions for calculating the effect are not sufficiently scientific or precedented (even if the method is justified).

  6. ^

    For the sake of simplicity, I am setting aside some arguments about quantum mechanics, which is often not what people have in mind when they say that things like coin flips are “random.” Moreover, I have not looked too deeply into this but my guess is that once a coin has begun flipping, randomness at the level of quantum mechanics is not strong enough to change the outcome.

  7. ^

    Or 1/n for n possible outcomes more generally.

  8. ^

    Of course, often when people are doing this they implicitly mean “we have no empirical data [and this seems like a radical idea / the fact that we ‘have no empirical data’ actually is the data itself: this thing has never occurred],” but they fail to make their own reasoning explicit by default.

  9. ^

    Note, what I would call the “ground truth” (a concept I originally discussed in more detail but ended up mostly removing and just briefly discussing in the "other concepts" section, to avoid causing confusion and getting entangled in debates over quantum mechanics) of every flip is still either 100% heads or 100% tails, but with sufficient knowledge about the direction of the bias your best estimate would become either 80% heads or 80% tails.

  10. ^

    I think it’s plausible that people would be more likely to avoid such a mistake when the probabilities are cut-and-dry and the context (a coin flip) is familiar compared to other situations where I see people make this mistake.

  11. ^

    Note that there are some potential issues with beliefs about changes in one’s own forecasting quality which this question does not intend to capture. For example, a forecaster should not factor in considerations like “there is a 1% chance that I will suffer a brain injury within the next year that impairs my ability to forecast accurately and will cause me to believe there is a 90% chance that X will happen, whereas I will otherwise think the probability is just 10%.”

  12. ^

    Some people might prefer to use terms that focus on expected level of “surprise,” maybe including terms like “forecast entropy”? However, my gut instinct (“low resilience”) tells me that people should be averse to using pedagogical terms where more-natural terms are available.

  13. ^

    To further illustrate, suppose someone lays out 5 red dice and 5 green dice. You and nine other people each take one die at random, and you are told that one set of dice rolls a 6 with probability 50% while the other set is normal, but you are not told which is which. Suppose you are the last to roll and everyone rolls their dice on a publicly-viewable table, but you have to make one initial forecast and then one final forecast right before you roll. Your initial "best guess" for the probability that you roll a 6 will be [(50% * 1/6) + (50% * 3/6)] = 2/6. However, you can say that this initial estimate is low resilience, because you expect the estimate to change in response to seeing 9 other people roll their dice, which provides you with information about whether your die color is fair or weighted. In contrast, your final estimate will be high resilience. If you were the last of 100 people to go, you could probably say at the outset “I think there is a 50% chance that by the time I must roll my 'best estimate' will be ~1/2, but I also think there is a 50% chance that by that time my 'best estimate' will be ~1/6. So my current best guess of 2/6 is very likely to change.”
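    For what it’s worth, a quick Monte Carlo sketch of this dice setup (the simulation code and the 10,000-trial sample size are my own illustration) reproduces the 2/6 initial estimate and shows why it is low resilience: the final estimates scatter away from 2/6 even though they average back to it.

```python
# Monte Carlo sketch of the dice example: 5 red + 5 green dice, one colour rolls a 6
# with probability 1/2, the other is a normal die (1/6); you roll last and watch the
# other 9 rolls (and their colours) first.

import random

def posterior_my_colour_weighted(my_colour, observations):
    """P(my colour is the weighted one | the others' colours and whether each rolled a 6)."""
    like_mine, like_other = 1.0, 1.0
    for colour, rolled_six in observations:
        p_six_if_mine = 0.5 if colour == my_colour else 1 / 6   # hypothesis: my colour weighted
        p_six_if_other = 1 / 6 if colour == my_colour else 0.5  # hypothesis: other colour weighted
        like_mine *= p_six_if_mine if rolled_six else 1 - p_six_if_mine
        like_other *= p_six_if_other if rolled_six else 1 - p_six_if_other
    return like_mine / (like_mine + like_other)  # 50/50 prior over the two hypotheses

def final_estimate():
    weighted_colour = random.choice(["red", "green"])
    dice = ["red"] * 5 + ["green"] * 5
    random.shuffle(dice)
    my_colour, others = dice[0], dice[1:]
    observations = []
    for colour in others:
        p_six = 0.5 if colour == weighted_colour else 1 / 6
        observations.append((colour, random.random() < p_six))
    w = posterior_my_colour_weighted(my_colour, observations)
    return w * 0.5 + (1 - w) * (1 / 6)  # best estimate of rolling a 6, right before rolling

initial = 0.5 * 0.5 + 0.5 * (1 / 6)  # = 2/6, as computed in the footnote
finals = [final_estimate() for _ in range(10_000)]
print(round(initial, 3))                    # 0.333
print(round(sum(finals) / len(finals), 3))  # ~0.333: the anticipated updates average out...
print(round(sum(abs(f - initial) for f in finals) / len(finals), 3))
# ...but the expected absolute update is clearly positive, i.e. the initial estimate is low resilience.
```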

  14. ^

    I.e., a probability close to 0 or 1.

  15. ^

    I suppose it might depend on whether your formula for determining "high" vs. "low" resilience focuses on relative changes (1% → 2% is a doubling) as opposed to nominal changes (1% → 2% is only a one-percentage-point increase) in probability estimates? If someone assumes "low resilience" includes "I currently think the probability is 0.1%, but there's a 50% chance I will think the probability is actually 0.0000001%", then this cell in the table is not illogical. However, I do not recommend such a conception of resilience; in this cell of the table, I had in mind a situation where someone thinks "I think the probability is ~0.1%, but I think there's a ~10% chance that by the end of this year I will think the probability is 10% rather than 0.1%" (which is illogical, since it fails to incorporate the expected updating into the current estimate).

    As I note in the next bullet, I am open to suggestions on formulas for measuring resilience! It might even help me with a task in my current job, at the Forecasting Research Institute.
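    To spell out why that combination is illogical using the footnote’s own numbers (a minimal sketch; the law-of-total-expectation framing is mine): the current best estimate has to equal the probability-weighted average of the estimates you expect to hold later, and no such average is consistent with the numbers above.

```python
# Check: "0.1% now, but a 10% chance I'll think 10% by year's end" cannot be coherent,
# because the current credence must equal the expectation of the anticipated credences.
p_now, p_high, chance_high = 0.001, 0.10, 0.10
# The remaining 90% of scenarios would need an average future credence of:
required_elsewhere = (p_now - chance_high * p_high) / (1 - chance_high)
print(round(required_elsewhere, 4))  # -0.01 — negative, hence impossible
```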

  16. ^

    In part because I don’t think it’s necessary to have a proof of concept, in part because I do not have a comparative advantage in the community for creating mathematical formulas, and in part because I’m not particularly optimistic about people being interested in this post.

  17. ^

    An alternative I’ve considered is “credibility,” but I don’t know if this term is too loaded or confusing?

  18. ^

    I do think that the evaluation of legibility probably has to be made in reference to a specific audience, and perhaps should also adopt the (unrealistic) assumption that the audience’s sole motivation is truth-seeking (or forecast error-minimization), rather than other motivations such as “not admitting you were wrong.”

  19. ^

    I relegated this to a footnote upon review because it seemed a bit superfluous compared to the heavily condensed version included in text. The longer explanation is as follows: 

    Suppose that you have a strange-looking virtual coin-flip app on your phone which you know through personal experience is fair, but you present it to a random person who has never seen it before and you claim that flipping heads is 50% likely in the app. This claim is at least somewhat less legible than the same forecast for one of the real coins in the other person’s pocket: their best estimate for the app may be 50% heads with low-medium resilience, but if the first two flips are tails they might change their forecast (because they put little weight on your promise and have less rigid priors about the fairness of virtual coin-flipping apps). However, a particularly paranoid person might think that the first few flips are designed to look normal in order to mislead people, and that the third flip always comes up heads. Thus, someone might be skeptical unless they can access and review the app’s source code.

  20. ^

    Note, however, that objectivity in a measurement (“I can objectively measure X”) does not mean that the metric is necessarily the right thing to focus on (“X is the most important thing to measure”).

  21. ^

    (For example, in The Big Short, Dr. Burry is portrayed as a very ineffective communicator with some of his investors.)

  22. ^

    There’s also a second definition I’ve considered which is basically “how much weight should my accuracy relative to others on this question be given when calculating my weighted average performance across many questions” (perhaps similar to wagering more money on a specific bet?)—but I have not worked this out that well. It also seems like this may just mathematically not make sense unless it is indirectly/inefficiently referring to my current preferred definition of resilience or other terms: if you think you are more likely to be “wrong” about one 80% estimate relative to another, should this not cause you to adjust your forecast towards 50%, unless you’re (inefficiently) trying to say something like “I will likely change my mind”?


niplav @ 2023-06-27T09:39 (+3)

This is a post that rings in my heart, thank you so much. I think people very often conflate these concepts, and I also think we're in a very complex space here (especially if you look at Bayesianism from below and see that it grinds on reality/boundedness pretty hard and produces some awful screeching sounds while applied in the world).

I agree that these concepts you've presented are at least antidotes to common confusions about forecasts, and I have some more things to say.

I feel confused about credal resilience:

The example you state appears correct, but I don't know what that would look like as a mathematical object. Some people have talked about probability distributions on probability distributions; in the case of a binary forecast that would be a function from [0, 1] to the nonnegative reals (a density over possible probabilities), which is…weird. Do I need to tack the resilience onto the distribution? Do I compute it out of the probability distribution on probability distributions? Perhaps the people talking about imprecise probabilities/infrabayesianism are onto something when they talk about convex sets of probability distributions as the correct objects instead of probability distributions per se.

One can note that AIXR is definitely falsifiable, the hard part is falsifying it and staying alive.

There will be a state of the world confirming or denying the outcome; there's just a correlation between our ability to observe those outcomes and the outcomes themselves.

Knightian uncertainty makes more sense in some restricted scenarios, especially those related to self-confirming/self-denying predictions. If one can read the brain state of a human and construct their predictions of the environment out of that, then one can construct an environment in which the human has Knightian uncertainty, by producing the outcomes to which the human assigned the smallest probability. (Even a uniform belief gets fooled: we'll pick one option and make it happen many times in a row, but as soon as our poor subject starts predicting that outcome, we shift to the ones less likely in their belief.)

It need not be such a fanciful scenario: it could be that my buddy James made a very strong prediction that he will finish cleaning his car by noon, so he is too confident and procrastinates until the bell tolls for him. (Or the other way around, where his high confidence makes him more likely to finish the cleaning early; in that case we'd call it Knightian certainty.)

This is a very different case than the one people normally state when talking about Knightian uncertainty, but an (imho) much more defensible one. I agree that the common reasons named for Knightian uncertainty are bad.

Another common complaint I've heard is about forecasts with very wide distributions, a case which evoked especially strong reactions was the Cotra bio-anchors report with (iirc) non-negligible probabilities on 12 orders of magnitude. Some people apparently consider such models worse than useless, harkening back to forecast legibility. Apparently both very wide and very narrow distributions are socially punished, even though having a bad model allows for updating & refinement.

Another point touched on very shortly in the post is on forecast precision. We usually don't report forecasts with six or seven digits of precision, because at that level our forecasts are basically noise. But I believe that some of the common objections are about (perceived) undue precision; someone who reports 7 digits of precision is scammy, so reporting about 2 digits of precision is… fishy. Perhaps. I know there's a Tetlock paper on the value of precision in geopolitical forecasting, but it uses a method of rounding to probabilities instead of odds or log-odds. (Approaches based on noising probabilities and then tracking score development do not work—I'm not sure why, though.). It would be cool to know more about precision in forecasting and how it relates to other dimensions.
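As a small illustration of why the rounding scheme matters (just a sketch with arbitrary step sizes, not the method from the Tetlock paper): rounding in probability space erases exactly the kind of distinctions near 0 and 1 that rounding in log-odds space preserves.

```python
# Sketch: rounding forecasts in probability space vs. log-odds space.
# The step sizes (0.05 and 1.0) are arbitrary choices for illustration.

import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def round_prob(p, step=0.05):
    """Round to the nearest multiple of `step` in probability space."""
    return round(p / step) * step

def round_log_odds(p, step=1.0):
    """Round to the nearest multiple of `step` in log-odds space, then map back."""
    return inv_logit(round(logit(p) / step) * step)

for p in [0.2, 0.02, 0.002, 0.0002]:
    print(p, round(round_prob(p), 4), round(round_log_odds(p), 4))
# Probability-space rounding sends 0.02, 0.002 and 0.0002 all to 0.0, while
# log-odds rounding keeps them distinct — which matters for how much "precision"
# a low-probability forecast can meaningfully carry.
```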

I also think that probabilities reported by humans are weird because we do not have the entire space of hypotheses in our minds at once, and instead can shift our probabilities during reflection (without receiving evidence). This can apply as well to different people: if I believe that X has a very good reasoning process, based on observations of X's past reasoning, I might not want to/have to follow X's entire train of thought before raising my probability of their conclusion.

Sorry about the long comment without any links, I'm currently writing this offline and don't have my text notes file with me. I can supply more links if that sounds interesting/relevant.

Marcel D @ 2023-06-30T17:42 (+3)

Sorry about the delayed reply; I saw this but accidentally removed the notification (and I guess didn't receive an email notification, contrary to my expectations) and then forgot to reply. Responding to some of your points/questions:

One can note that AIXR is definitely falsifiable, the hard part is falsifying it and staying alive.

I mostly agree with the sentiment that "if someone predicts AIXR and is right then they may not be alive", although I do now think it's entirely plausible that we could survive long enough during a hypothetical AI takeover to say "ah yeah, we're almost certainly headed for extinction"—it's just too late to do anything about it. The problem is how to define "falsify": if you can't 100% prove anything, you can't 100% falsify anything; can the last person alive say with 100% confidence "yep, we're about to go extinct"? No, but I think most people would say that this outcome basically "falsifies" the claim "there is no AIXR," even prior to the final person being killed.

 

Knightian uncertainty makes more sense in some restricted scenarios especially related to self-confirming/self-denying predictions.

This is interesting; I had not previously considered the interaction between self-affecting predictions and (Knightian) "uncertainty." I'll have to think more about this, but as you say I do still think Knightian uncertainty (as I was taught it) does not make much sense.  

 

This can apply as well to different people: if I believe that X has a very good reasoning process, based on observations of X's past reasoning, I might not want to/have to follow X's entire train of thought before raising my probability of their conclusion.

Yes, this is the point I'm trying to get at with forecast legibility, although I'm a bit confused about how it builds on the previous sentence.

 

Some people have talked about probability distributions on probability distributions; in the case of a binary forecast that would be a function from [0, 1] to the nonnegative reals (a density over possible probabilities), which is…weird. Do I need to tack the resilience onto the distribution? Do I compute it out of the probability distribution on probability distributions? Perhaps the people talking about imprecise probabilities/infrabayesianism are onto something when they talk about convex sets of probability distributions as the correct objects instead of probability distributions per se.

Unfortunately I'm not sure I understand this paragraph (including the mathematical portion). Thus, I'm not sure how to explain my view of resilience better than what I've already written and the summary illustration: someone who says "my best estimate is currently 50%, but within 30 minutes I think there is a 50% chance that my best estimate will become 75% and a 50% chance that my best estimate becomes 25%" has a less-resilient belief compared to someone who says "my best estimate is currently 50%, and I do not think that will change within 30 minutes." I don't know how to calculate/quantify the level of resilience between the two, but we can obviously see there is a difference.
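One crude way to put a number on that difference, offered only as a sketch of a possible operationalization (the function name and the 30-minute window are just taken from the example above), is the expected absolute change in the best estimate over the stated window:

```python
# Sketch: quantify resilience via the expected absolute change in one's best
# estimate over the stated window (here, the next 30 minutes). The numbers are
# from the two hypothetical forecasters described above.

def expected_update(current, scenarios):
    """scenarios: list of (probability of ending up at that estimate, estimate) pairs."""
    return sum(prob * abs(estimate - current) for prob, estimate in scenarios)

# Forecaster A: 50% now; 50/50 between 25% and 75% within 30 minutes.
print(expected_update(0.50, [(0.5, 0.25), (0.5, 0.75)]))  # 0.25 -> low resilience

# Forecaster B: 50% now, and expects to still say 50% in 30 minutes.
print(expected_update(0.50, [(1.0, 0.50)]))               # 0.0  -> high resilience
```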