Some learnings I had from forecasting in 2020

By Linch @ 2020-10-03T19:21 (+90)

crossposted from my own short-form

Here are some things I've learned from spending a decent fraction of the last 6 months either forecasting or thinking about forecasting, with an eye towards beliefs that I expect to be fairly generalizable to other endeavors.

Before reading this post, I recommend brushing up on Tetlock's work on (super)forecasting, particularly Tetlock's 10 commandments for aspiring superforecasters.

1. Forming (good) outside views is often hard but not impossible.

I think there is a common belief/framing in EA and rationalist circles that coming up with outside views is easy, and that the real difficulties are a) originality in inside views and b) the debate over how much to trust outside views vs. inside views.

I think this is directionally true (original thought is harder than synthesizing existing views) but it hides a lot of the details. It's often quite difficult to come up with and balance good outside views that are applicable to a situation. See Manheim and Muehlhauser for some discussions of this.

2. For novel out-of-distribution situations, "normal" people often trust centralized data/ontologies more than is warranted.

See here for a discussion. I believe something similar is true for trust of domain experts, though this is more debatable.

3. The EA community overrates the predictive validity and epistemic superiority of forecasters/forecasting.

(Note that I think this is an improvement over the status quo in the broader society, where by default approximately nobody trusts generalist forecasters at all.)

I've had several conversations where EAs will ask me to make a prediction, I'll think about it a bit and say something like "I dunno, 10%?" and people will treat it like a fully informed prediction to base decisions on, rather than just another source of information among many.

I think this is clearly wrong. I think in almost any situation where you are a reasonable person and you have spent 10x (sometimes 100x or more!) more time thinking about a question than I have, you should just trust your own judgments much more than mine on the question.

To a first approximation, good forecasters have three things: 1) They're fairly smart. 2) They're willing to actually do the homework. 3) They have an intuitive sense of probability.

This is not nothing, but it's also pretty far from everything you want in an epistemic source.

4. The EA community overrates Superforecasters and Superforecasting techniques.

I think the types of questions and responses Good Judgment .* is interested in reflect one particular way of looking at the world. I don't think it is always applicable (easy EA-relevant example: your Brier score is basically the same if you give 0% for 1% probabilities, and vice versa), and it's bad epistemics to collapse all of "figure out the future in a quantifiable manner" into a single paradigm.
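To make the Brier score example concrete, here is a minimal sketch (my own illustration, not anything from Good Judgment) comparing the expected score of forecasting 0% versus 1% on an event whose true probability is assumed to be 1%. The function names and the log-loss comparison are assumptions added for illustration; log loss is included only as one example of a scoring rule that does treat the two forecasts very differently.

```python
import math


def expected_brier(forecast: float, true_prob: float) -> float:
    """Expected Brier score (squared error, lower is better) for a binary event."""
    return true_prob * (1 - forecast) ** 2 + (1 - true_prob) * forecast ** 2


def expected_log_loss(forecast: float, true_prob: float) -> float:
    """Expected log loss (lower is better).

    Returns infinity if an outcome that can actually happen was given probability 0.
    """
    loss = 0.0
    for outcome_prob, assigned in ((true_prob, forecast), (1 - true_prob, 1 - forecast)):
        if outcome_prob == 0:
            continue  # this outcome never happens, so it contributes nothing
        if assigned == 0:
            return math.inf
        loss -= outcome_prob * math.log(assigned)
    return loss


TRUE_PROB = 0.01  # assumed "real" probability of the event

brier_at_zero = expected_brier(0.0, TRUE_PROB)
brier_at_one_pct = expected_brier(0.01, TRUE_PROB)

print(f"Expected Brier, forecast 0%: {brier_at_zero:.4f}")
print(f"Expected Brier, forecast 1%: {brier_at_one_pct:.4f}")
print(f"Expected log loss, forecast 0%: {expected_log_loss(0.0, TRUE_PROB)}")
print(f"Expected log loss, forecast 1%: {expected_log_loss(0.01, TRUE_PROB):.4f}")
```

Under these assumptions the two expected Brier scores are about 0.0100 and 0.0099, a gap of roughly 0.0001, even though "impossible" and "1 in 100" can imply very different decisions when the stakes are large.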

Likewise, I don't think there's a clear dividing line between good forecasters and GJP-certified Superforecasters, so many of the issues I mentioned in #3 are just as applicable here.

I'm not sure how to collapse all the things I've learned on this topic into a few short paragraphs, but the tl;dr is that I trusted superforecasters much more than I trusted other EAs before I started forecasting stuff, and now I consider their opinions and forecasts "just" an important overall component to my thinking, rather than a clear epistemic superior to defer to.

5. Good intuitions are really important.

I think there's a Straw Vulcan approach to rationality where people think "good" rationality is about suppressing your System 1 in favor of clear thinking and logical propositions from your System 2. I think there's plenty of evidence for this being wrong*. For example, the cognitive reflection test was originally supposed to be a test of how well people suppress their "intuitive" answers to instead think through the question and provide the right "unintuitive" answers. However, we've later learned (from one fairly good psych study; it may not replicate, but it seems to accord with my intuitions and recent experiences) that more "cognitively reflective" people also had more accurate initial answers when they didn't have the time to think through the question.

On a more practical level, I think a fair amount of good thinking is using your System 2 to train your intuitions, so you have better and better first impressions and taste for how to improve your understanding of the world in the future.

*I think my claim so far is fairly uncontroversial; for example, I expect CFAR to agree with a lot of what I say.

6. Relatedly, most of my forecasting mistakes are due to emotional rather than technical reasons.

Here's a Twitter thread from May exploring why; I think I still mostly stand by it.

Matt_Lerner @ 2020-10-04T15:34 (+6)

The EA community overrates the predictive validity and epistemic superiority of forecasters/forecasting.

This seems to be true and also to be an emerging consensus (at least here on the forum).

I've only been forecasting for a few months, but it's starting to seem to me like forecasting does have quite a lot of value—as valuable training in reasoning, and as a way of enforcing a common language around discussion of possible futures. The accuracy of the predictions themselves seems secondary to the way that forecasting serves as a calibration exercise. I'd really like to see empirical work on this, but anecdotally it does feel like it has improved my own reasoning somewhat. Curious to hear your thoughts.

Linch @ 2020-10-05T04:55 (+4)

Thanks for the comment!

This seems to be true and also to be an emerging consensus (at least here on the forum).

Can you point to some examples?

I've only been forecasting for a few months, but it's starting to seem to me like forecasting does have quite a lot of value...

This seems right to me. I think society as a whole underprices forecasting, and EA underprices a bunch of subniches within forecasting (even if they overrate predictive validity specifically).

...as valuable training in reasoning.

I think this is right. I think to some degree, the value of forecasting is similar to what Parfit ascribes to thought experiments:

Most of these cases are[...]purely imaginary[...]. We can use them to discover, not what the truth is, but what we believe

Similarly, I think a lot of the value of inputting probabilities and distributions is as a way to have internal coherence/validity, to help represent/bring to the forefront what I believe.

...and as a way of enforcing a common language around discussion of possible futures

This sounds right to me. Stefan Schubert has a fun comparison of forecasting and analytic philosophy.

jacobpfau @ 2020-10-08T02:35 (+1)

Do your opinion updates extend from individual forecasts to aggregated ones? In particular how reliable do you think is the Metaculus median AGI timeline?

On the one hand, my opinion of Metaculus predictions worsened as I saw how the 'recent predictions' showed people piling in on the median on some questions I watch. On the other hand, my opinion of Metaculus predictions improved as I found out that performance doesn't seem to fall as a function of 'resolve minus closing' time (see https://twitter.com/tenthkrige/status/1296401128469471235). Are there some observations which have swayed your opinion in similar ways?

matthew.vandermerwe @ 2020-10-08T17:34 (+5)

With regard to the AGI timeline, it's important to note that Metaculus' resolution criteria are quite different from a 'standard' interpretation of what would constitute AGI^[1] (or human-level AI^[2], superintelligence^[3], transformative AI, etc.). It's also unclear what proportion of forecasters have read this fine print (interested to hear others' views on this), which further complicates interpretation.

For these purposes we will thus define "an artificial general intelligence" as a single unified software system that can satisfy the following criteria, all easily completable by a typical college-educated human.

Able to reliably pass a Turing test of the type that would win the Loebner Silver Prize.

Able to score 90% or more on a robust version of the Winograd Schema Challenge, e.g. the "Winogrande" challenge or comparable data set for which human performance is at 90+%

Be able to score 75th percentile (as compared to the corresponding year's human students; this was a score of 600 in 2016) on all the full mathematics section of a circa-2015-2020 standard SAT exam, using just images of the exam pages and having less than ten SAT exams as part of the training data. (Training on other corpuses of math problems is fair game as long as they are arguably distinct from SAT exams.)

Be able to learn the classic Atari game "Montezuma's revenge" (based on just visual inputs and standard controls) and explore all 24 rooms based on the equivalent of less than 100 hours of real-time play (see closely-related question.)

By "unified" we mean that the system is integrated enough that it can, for example, explain its reasoning on an SAT problem or Winograd schema question, or verbally report its progress and identify objects during videogame play. (This is not really meant to be an additional capability of "introspection" so much as a provision that the system not simply be cobbled together as a set of sub-systems specialized to tasks like the above, but rather a single system applicable to many problems.)

jacobpfau @ 2020-10-08T19:53 (+4)

Agreed, I've been trying to help out a bit with Matt Barnett's new question here. Feedback period is still open, so chime in if you have ideas!

I suspect most Metaculites are accustomed to paying attention to how a question's operationalization deviates from its intent, FWIW. Personally, I find the Montezuma's revenge criterion quite important; without it, the question would be far from AGI.

My intent in bringing up this question was more to ask how Linch thinks about the reliability of long-term predictions with no obvious frequentist-friendly track record to look at.

Pablo_Stafforini @ 2020-10-08T12:40 (+4)

On the one hand, my opinion of Metaculus predictions worsened as I saw how the 'recent predictions' showed people piling in on the median on some questions I watch.

Can you say more about this? I ask because this behavior seems consistent with an attitude of epistemic deference towards the community prediction when individual predictors perceive it to be superior to what they can themselves predict given their time and ability constraints.

jacobpfau @ 2020-10-08T19:45 (+1)

Sure, at an individual level deference usually makes for better predictions, but at a community level deference-as-the-norm can dilute the weight of those who are informed and predict differently from the median. Excessive numbers of deferential predictions also obfuscate how reliable the median prediction is, and thus make it harder for others to do an informed update on the median.
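The dilution effect described above can be sketched with a toy pool of forecasts. All numbers here are hypothetical and chosen only for illustration; this says nothing about how Metaculus actually aggregates or weights predictions.

```python
from statistics import median, pstdev

# Hypothetical independent forecasts on a binary question.
informed = [0.10, 0.20, 0.30, 0.40, 0.50]
base_median = median(informed)

# Hypothetical deferential forecasters who simply copy the current median.
deferrers = [base_median] * 10

# Three newly informed forecasters who all predict 0.80.
newcomers = [0.80] * 3

# How far the newcomers move the median, with and without the deferrers.
median_without_deference = median(informed + newcomers)
median_with_deference = median(informed + deferrers + newcomers)

# Deferential predictions also make the pool look more tightly agreed
# than the independent forecasts really are.
spread_independent = pstdev(informed)
spread_with_deference = pstdev(informed + deferrers)

print(f"Median without deferrers: {median_without_deference:.2f}")
print(f"Median with deferrers:    {median_with_deference:.2f}")
print(f"Spread of independent forecasts: {spread_independent:.3f}")
print(f"Spread including deferrers:      {spread_with_deference:.3f}")
```

In this toy pool the three newcomers shift the median from 0.30 to 0.45 when nobody defers, but leave it at 0.30 once ten deferential predictions are stacked on the old median. The apparent spread also shrinks even though no new information was added.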

As you say, it's better if people contribute information where their relative value-add is greatest, so I'd say it's reasonable for people to have a 2:1 ratio of questions on which they deviate from the median to questions on which they follow the median. My vague impression is that the ratio may be lower, especially for people predicting on events with a <1 year time horizon. I think you, Linch and other heavier Metaculus users may have a more informed impression here though, so I would be happy to see disagreement.

I think it would be interesting to have a Metaculus on which for every prediction you have to select a general category for your update e.g. "New Probability Calculation", "Updated to Median", "Information source released", etc. Seeing the various distributions for each would likely be quite informative.

Linch @ 2020-10-08T21:57 (+2)

Do your opinion updates extend from individual forecasts to aggregated ones?

I think the best individual forecasters are on average better than the aggregate Metaculus forecasts at the moment they make the prediction, especially if they spent a while on the prediction. I'm less sure if you account for prediction lag (the Metaculus and community predictions are usually better at incorporating new information), and my assessment for that will depend on a bunch of details.

In particular how reliable do you think is the Metaculus median AGI timeline?

As noted by matthew.vandermerwe, I think the Metaculus question operationalization for "AGI" is very different from what our community typically uses. I don't have a strong opinion on whether a random AI Safety person will do better on that operationalization.

For something closer to what EAs care about, I'm pretty suspicious of the current forecasts given for existential risk/GCR estimates (for example in the Ragnarok series), and generally do not think existential risk researchers should strongly defer to them (though I suspect the forecasts/comments are good enough that they are generally worth reading for most xrisk researchers studying the relevant questions).