Metaculus’ predictions are much better than low-information priors

By Vasco Grilo🔸 @ 2023-04-11T08:36 (+53)

Disclaimer: this is not a project from Arb Research.

Summary

Acknowledgements

Thanks to Misha Yagudin from Arb Research, who had the initial idea for this analysis.

Introduction

I really like Metaculus!

Methods

Predictions

To estimate the Brier score one would achieve by applying RS to Metaculus’ binary questions outside of question groups, I took as the reference class the questions which shared at least one category with the question being forecasted at the time it was published. I studied 2 variants:

These were the simplest rules I came up with, and I did not test others. Misha suggested exponential updating, and integrating information about the resolution of other questions during the lifetime of the question being forecasted, but for simplicity I ended up choosing linear updating, and only using information available at the publish time. Initially, I was thinking about a prediction which evolved linearly to the outcome between the publish and resolve time, but abandoned it after realising the outcome and resolve time are not known a priori. I also wondered about the linear prediction going to 0 (or 1) for questions of the type “will something happen (not happen) by a certain date?”, and being equal to the constant prediction for other questions, but:

The Brier score evaluated at all times, i.e. between the publish and close time, is:

I looked into Metaculus’ binary questions with an ID from 1 to 15000 on 10 April 2023[3]. The calculations are in this Colab.
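As a minimal sketch of what evaluating the Brier score at all times means for the linear rule, the snippet below averages the squared error over the question lifetime. It assumes the error expressions of footnote 2, i.e. a prediction which starts at p0 at the publish time and moves linearly to 0 (if p0 < 1/2) or 1 (if p0 > 1/2) at the close time. The function names and the normalised time s = (t - t_p)/(t_c - t_p) are illustrative choices, and the closed form is a re-derivation from those error expressions rather than a copy of the formulas used in the Colab.

```python
def linear_prediction(p0: float, s: float) -> float:
    """Prediction at the fraction s of the question lifetime (0 = publish, 1 = close)."""
    if p0 < 0.5:
        return p0 * (1 - s)  # moves linearly from p0 to 0
    if p0 > 0.5:
        return 1 - (1 - p0) * (1 - s)  # moves linearly from p0 to 1
    return 0.5  # stays constant at 1/2


def brier_all_times_numeric(p0: float, outcome: int, n: int = 10_000) -> float:
    """Mean squared error over the lifetime, via the midpoint rule with n steps."""
    total = 0.0
    for i in range(n):
        s = (i + 0.5) / n
        total += (linear_prediction(p0, s) - outcome) ** 2
    return total / n


def brier_all_times_closed_form(p0: float, outcome: int) -> float:
    """Same quantity, from integrating the squared error analytically."""
    if p0 == 0.5:
        return 0.25
    # d is the distance between p0 and the extreme the prediction moves towards.
    d = p0 if p0 < 0.5 else 1 - p0
    target = 0 if p0 < 0.5 else 1
    if outcome == target:
        return d ** 2 / 3
    return 1 - d + d ** 2 / 3


if __name__ == "__main__":
    for p0 in (0.3, 0.5, 0.8):
        for outcome in (0, 1):
            print(p0, outcome,
                  brier_all_times_numeric(p0, outcome),
                  brier_all_times_closed_form(p0, outcome))
```

Under these assumptions, the linear rule scores better than a constant p0 when it moves towards the right extreme (d²/3 instead of d²), and much worse when it moves towards the wrong one (1 - d + d²/3 instead of (1 - d)²).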

Interpreting Brier scores

The table below helps interpret Brier scores. The error is the difference between the prediction and the outcome, and the Brier score is the mean squared error (of a probabilistic prediction).

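| Absolute value of the error | Brier score | Prediction for negative outcome | Prediction for positive outcome |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 1 |
| 0.1 | 0.01 | 0.1 | 0.9 |
| 0.2 | 0.04 | 0.2 | 0.8 |
| 0.3 | 0.09 | 0.3 | 0.7 |
| 0.4 | 0.16 | 0.4 | 0.6 |
| 0.5 | 0.25 | 0.5 | 0.5 |
| 0.6 | 0.36 | 0.6 | 0.4 |
| 0.7 | 0.49 | 0.7 | 0.3 |
| 0.8 | 0.64 | 0.8 | 0.2 |
| 0.9 | 0.81 | 0.9 | 0.1 |
| 1 | 1 | 1 | 0 |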
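The rows of the table can be reproduced with a few lines of Python. This is only a sketch, assuming each row corresponds to a single prediction with a constant absolute error, so that the Brier score is simply the square of that error; the function name is illustrative.

```python
def brier_interpretation_rows():
    """Return (absolute error, Brier score, prediction if negative, prediction if positive)."""
    rows = []
    for k in range(11):
        error = k / 10
        brier = round(error ** 2, 2)  # squared error of a single prediction
        prediction_if_negative = error  # outcome is 0, so the prediction equals the error
        prediction_if_positive = round(1 - error, 1)  # outcome is 1
        rows.append((error, brier, prediction_if_negative, prediction_if_positive))
    return rows


if __name__ == "__main__":
    for row in brier_interpretation_rows():
        print(row)
```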

Note:

Results

The table below contains the Brier score evaluated at all times for RS’ constant and linear predictions, Metaculus’ community (MC’s) predictions[4], and Metaculus’ (M’s) predictions[5] for all binary questions, and those of categories whose designation contains “artificial intelligence” or “AI” (AI categories). For context, I also show the Brier score evaluated at 10 % of the question lifetime (after the publish time) for Metaculus’ community predictions, and Metaculus’ predictions. I took Metaculus’ Brier scores from here. The full results for the RS’ predictions are in this Sheet (see tab “TOC”).

There are fewer resolved questions for RS’ predictions than for Metaculus’ predictions because I have only analysed binary questions outside of question groups. However, questions inside question groups are accounted for in the Brier scores provided by Metaculus[6].

Brier score evaluated at… (number of resolved questions)

| Category | All times: RS’ constant predictions | All times: RS’ linear predictions | All times: MC’s predictions | All times: M’s predictions | 10 % of the question lifetime: MC’s predictions | 10 % of the question lifetime: M’s predictions |
| --- | --- | --- | --- | --- | --- | --- |
| AI and Machine Learning | 0.248 (40) | 0.277 (40) | 0.210 (45) | 0.166 (45) | 0.229 (44) | 0.185 (44) |
| Artificial intelligence | 0.271 (35) | 0.315 (35) | 0.234 (46) | 0.175 (46) | 0.248 (45) | 0.188 (45) |
| AI ambiguity series | 0.144 (7) | 0.0482 (7) | 0.203 (7) | 0.051 (7) | 0.261 (7) | 0.106 (7) |
| AI Milestones | 0.200 (7) | 0.220 (7) | 0.219 (7) | 0.091 (7) | 0.200 (7) | 0.099 (7) |
| Forecasting AI Progress | (0) | (0) | (0) | (0) | (0) | (0) |
| AI categories (any of the above) | 0.248 (56) | 0.285 (56) | 0.210 (64) | 0.160 (64) | 0.226 (64) | 0.182 (64) |
| Any (all questions) | 0.247 (1,372) | 0.291 (1,372) | 0.127 (1,735) | 0.122 (1,735) | 0.160 (1,667) | 0.152 (1,672) |
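One way to summarise the table is the relative reduction in Brier score with respect to RS’ constant predictions, often called the Brier skill score. The sketch below is not part of the original analysis. It uses the all-times values from the table, and the function and dictionary names are illustrative. Since the RS and Metaculus scores are computed over somewhat different sets of questions, the output is only indicative.

```python
def brier_skill_score(brier: float, brier_reference: float) -> float:
    """Fraction by which the Brier score is reduced relative to the reference (1 is perfect)."""
    return 1 - brier / brier_reference


# Brier scores evaluated at all times, copied from the table above.
all_times = {
    "AI categories": {"RS constant": 0.248, "MC": 0.210, "M": 0.160},
    "All questions": {"RS constant": 0.247, "MC": 0.127, "M": 0.122},
}

if __name__ == "__main__":
    for category, scores in all_times.items():
        reference = scores["RS constant"]
        for name in ("MC", "M"):
            skill = brier_skill_score(scores[name], reference)
            print(f"{category}, {name}: {skill:.1%} lower Brier score than RS' constant predictions")
```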

Discussion

Predictions

Both within AI categories and all questions, the Brier score evaluated at all times of:

Even when evaluated at 10 % of the question lifetime after the publish time, the Brier score within AI categories and all questions of:

RS’ linear predictions arguably performed badly because they correspond to unprincipled extremising. For what principled extremising looks like, check this post from Jaime Sevilla.

I believe RS’ exponential predictions would perform even worse than RS’ linear predictions, as:

Interpreting Brier scores

The Brier score:

I think replacing a probability of 50 % by 45.8 % or 54.2 % (values calculated above for Metaculus’ community predictions about AI) in an expected value calculation (implicitly or explicitly) does not usually lead to decision-relevant differences. However, this is not so relevant, because there will be greater benefits in accuracy for each question.
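The 45.8 % and 54.2 % above are consistent with inverting the interpretation table for a Brier score of 0.210, i.e. taking the square root to get the equivalent constant absolute error. The sketch below assumes that reading; the function name is illustrative.

```python
import math


def equivalent_predictions(brier: float) -> tuple[float, float]:
    """Constant predictions which would produce the given Brier score.

    Returns (prediction for a negative outcome, prediction for a positive outcome).
    """
    error = math.sqrt(brier)  # constant absolute error matching the Brier score
    return error, 1 - error


if __name__ == "__main__":
    negative, positive = equivalent_predictions(0.210)
    print(f"{negative:.1%} for a negative outcome, {positive:.1%} for a positive outcome")
```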

  1. Mean of column H of tab “Questions” of this Sheet.

  2. I confirmed the formulas are correct with Wolfram Alpha. I derived them by integrating the square of the error, and then dividing it by the difference between the publish and close time. Here p0 is the prediction at the publish time t_p, t_c is the close time, and t is the time at which the prediction is evaluated. The error is, if the question resolves:

    - Negatively, and:

       -- p0 < 1/2, p0/(t_c - t_p)*(t_c - t).

       -- p0 = 1/2, 1/2.

       -- p0 > 1/2, (p0*t_c - t_p + (1 - p0)*t)/(t_c - t_p).

    - Positively, and:

       -- p0 < 1/2, (t_p - (1 - p0)*t_c - p0*t)/(t_c - t_p).

       -- p0 = 1/2, -1/2.

       -- p0 > 1/2, - (1 - p0)/(t_c - t_p)*(t_c - t).

  3. The pages of Metaculus’ questions have the format “https://www.metaculus.com/questions/ID/”.

  4. From here, “the Community prediction is the median prediction of all predictors, weighted by recency”.

  5. From here, “the Metaculus prediction is the Metaculus best estimate of how a question will resolve, using a sophisticated model to calibrate and weight each user. It has been continually refined since June 2017”.

  6. I have asked Metaculus to enable filtering by type of question in their track record page.