Metaculus’ predictions are much better than low-information priors

By Vasco Grilo🔸 @ 2023-04-11T08:36 (+53)

Disclaimer: this is not a project from Arb Research.

Summary

I really like Metaculus!
I estimated the Brier score one would achieve by forecasting Metaculus’ binary questions outside of question groups applying Laplace’s rule of succession (RS) by category. I studied 2 variants:
- A constant prediction which equals RS’ prediction (similar to the base rate, except for the 1st few questions).
- A linear prediction which, if RS’ prediction is:
  - Lower than 0.5, goes to 0.
  - 0.5, is always 0.5.
  - Higher than 0.5, goes to 1.
Both within AI categories and all questions, the Brier score evaluated at all times of:
- RS’ constant predictions is as bad (0.248 and 0.247) as the one achieved by predicting 50 % for every question (0.25).
- RS’ linear predictions is worse (0.285 and 0.291).
- Metaculus’ community predictions is better and much better (0.210 and 0.127).
- Metaculus’ predictions is much better (0.160 and 0.122).
The Brier score within AI categories of:
- Metaculus’ community predictions matches that of predicting 45.8 % for the questions which resolve negatively, and 54.2 % for the ones which resolve positively.
- Metaculus’ predictions matches that of predicting 40.0 % for the questions which resolve negatively, and 60.0 % for the ones which resolve positively.

Acknowledgements

Thanks to Misha Yagudin from Arb Research, who had the initial idea for this analysis.

Introduction

I really like Metaculus!

Methods

Predictions

To estimate the Brier score one would achieve by forecasting Metaculus’ binary questions outside of question groups applying RS, I considered as the reference class the questions which share at least one category with the question being forecasted when this was published. I studied 2 variants:

A constant prediction equal to p0 = (1 + “number of positively resolved questions”)/(2 + “number of resolved questions”). This is similar to the base rate when there are many positively resolved questions sharing at least one category with the question being forecasted.
A linear prediction which:
- Between the publish and close time (t_p and t_c), evolves linearly from the constant prediction to:
  - 0 if p0 is closer to 0 than to 1 (p0 < 1/2).
  - 1/2 if p0 is as close to 0 as to 1 (p0 = 1/2).
  - 1 if p0 is closer to 1 than to 0 (p0 > 1/2).
- Is given by (according to the above):
  - If p0 < 1/2, p0/(t_c - t_p)*(t_c - t), which is 0 for t = t_c.
  - If p0 = 1/2, 1/2.
  - If p0 > 1/2, p0 + (1 - p0)/(t_c - t_p)*(t - t_p), which is 1 for t = t_c.

These were the simplest rules I came up with, and I did not test others. Misha suggested exponential updating, and integrating information about the resolution of other questions during the course of the lifetime of the question being forecasted, but I ended up choosing linear updating, and only using information available at the publish time for simplicity. Initially, I was thinking about a prediction which evolved linearly to the outcome between the publish and resolve time, but abandoned it after realising the outcome and resolve time are not known a priori. I also wondered about the linear prediction going to 0 (or 1) for questions of the type “will something happen (not happen) by a certain date?”, and being equal to the constant prediction for other questions, but:

The “certain date” date differs from the close time, which makes it harder to automate the rule such that 0 (or 1) is reached at the “certain date”.
Since almost all Metaculus’ binary questions are of the type “will something happen by a certain date?”, making the prediction reach 0 (or 1) at the close date would imply all predictions going to 0, but this goes against 34.0 %^[1] of the questions having been resolved positively.

The Brier score evaluated at all times, i.e. between the publish and close time, is:

For the constant predictions, if the question resolves:
- Negatively, p0^2.
- Positively, (1 - p0)^2.
For the linear predictions, if the question resolves^[2]:
- Negatively, and:
  - p0 < 1/2, 1/3*p0^2.
  - p0 = 1/2, 1/4.
  - p0 > 1/2, 1/3 + 1/3*p0 + 1/3*p0^2.
- Positively, and:
  - p0 < 1/2, 1 - p0 + 1/3*p0^2.
  - p0 = 1/2, 1/4.
  - p0 > 1/2, 1/3*(1 - p0)^2.

I looked into Metaculus’ binary questions with an ID from 1 to 15000 on 10 April 2023^[3]. The calculations are in this Colab.

Interpreting Brier scores

The table below helps interpret Brier scores. The error is the difference between the prediction and outcome, and the Brier score is the mean square error (of a probabilistic prediction).

Absolute value of the error	Brier score	Prediction for negative outcome	Prediction for positive outcome
0	0	0	1
0.1	0.01	0.1	0.9
0.2	0.04	0.2	0.8
0.3	0.09	0.3	0.7
0.4	0.16	0.4	0.6
0.5	0.25	0.5	0.5
0.6	0.36	0.6	0.4
0.7	0.49	0.7	0.3
0.8	0.64	0.8	0.2
0.9	0.81	0.9	0.1
1	1	1	0

Note:

The maximally uncertain prediction of 0.5 leads to a Brier score of 0.25 for negative and positive outcomes, which means such a Brier score does not depend on the base rate.
For example, a Brier score of 0.09 corresponds to an error whose absolute value is 0.3, which can be achieved by predicting 0.3 for questions which resolve negatively, and 0.7 for ones which resolve positively.
In practice, the error will vary during the course of the question lifetime, and across questions.

Results

The table below contains the Brier score evaluated all times for RS’ constant and linear predictions, Metaculus’ community (MC’s) predictions^[4], and Metaculus’ (M’s) predictions^[5] for all binary questions, and those of categories whose designation contains “artificial intelligence” or “AI” (AI categories). For context, I also show the Brier score evaluated 10 % of the question lifetime (after the publish time) for Metaculus’ community predictions, and Metaculus’ predictions. I took Metaculus’ Brier scores from here. The full results for the RS’ predictions are in this Sheet (see tab “TOC”).

There are less resolved questions for RS’ predictions than for Metaculus’ predictions because I have only analysed binary questions outside of question groups. However, these are accounted for in the Brier scores provided by Metaculus^[6].

Category	Brier score evaluated at… (number of resolved questions)
	All times				10 % of the question lifetime
	RS’ constant predictions	RS’ linear predictions	MC’s predictions	M’s predictions	MC’s predictions	M’s predictions
AI and Machine Learning	0.248 (40)	0.277 (40)	0.210 (45)	0.166 (45)	0.229 (44)	0.185 (44)
Artificial intelligence	0.271 (35)	0.315 (35)	0.234 (46)	0.175 (46)	0.248 (45)	0.188 (45)
AI ambiguity series	0.144 (7)	0.0482 (7)	0.203 (7)	0.051 (7)	0.261 (7)	0.106 (7)
AI Milestones	0.200 (7)	0.220 (7)	0.219 (7)	0.091 (7)	0.200 (7)	0.099 (7)
Forecasting AI Progress	(0)	(0)	(0)	(0)	(0)	(0)
AI categories (any of the above)	0.248 (56)	0.285 (56)	0.210 (64)	0.160 (64)	0.226 (64)	0.182 (64)
Any (all questions)	0.247 (1,372)	0.291 (1,372)	0.127 (1,735)	0.122 (1,735)	0.160 (1,667)	0.152 (1,672)

Discussion

Predictions

Both within AI categories and all questions, the Brier score evaluated at all times of:

RS’ constant predictions is as bad (0.248 and 0.247) as the one achieved by predicting 50 % for every question (0.25).
RS’ linear predictions is worse (0.285 and 0.291).
Metaculus’ community predictions is better and much better (0.210 and 0.127).
Metaculus’ predictions is much better (0.160 and 0.122).

Even when evaluated at 10 % of the question lifetime after the publish time, the Brier score within AI categories and all questions of:

Metaculus’ community predictions is better and much better (0.226 and 0.160) than that of predicting 50 % for every question.
Metaculus’ predictions is much better (0.182 and 0.152).

RS’ linear predictions performed badly arguably because they correspond to unprincipled extremising. For what principled extremising looks like, check this post from Jaime Sevilla.

I believe RS’ exponential predictions would perform even worse than RS’ linear predictions, as:

RS’ linear predictions performed worse than RS’ constant predictions.
RS’ exponential predictions would approach 0 or 1 faster than RS’ linear predictions.

Interpreting Brier scores

The Brier score:

Within AI categories, of:
- Metaculus’ community predictions matches that of predicting 45.8 % (= 0.210^0.5) for the questions which resolve negatively, and 54.2 % (= 1 - 45.8 %) for the ones which resolve positively.
- Metaculus’ predictions matches that of predicting 40.0 % (= 0.160^0.5) for the questions which resolve negatively, and 60.0 % (= 1 - 40.0 %) for the ones which resolve positively.
Within all questions, of:
- Metaculus’ community predictions matches that of predicting 35.6 % (= 0.127^0.5) for the questions which resolve negatively, and 64.4 % (= 1 - 35.6 %) for the ones which resolve positively.
- Metaculus’ predictions matches that of predicting 34.9 % (= 0.122^0.5) for the questions which resolve negatively, and 65.1 % (= 1 - 34.9 %) for the ones which resolve positively.

I think replacing a probability of 50 % by 45.8 % or 54.2 % (values calculated above for Metaculus’ community predictions about AI) in an expected value calculation (implicitly or explicitly) does not usually lead to decision-relevant differences. However, this is not so relevant, because there will be greater benefits in accuracy for each question.

^{^}
Mean of column H of tab “Questions” of this Sheet.
^{^}
I confirmed the formulas are correct with Wolfram Alpha. I derived them by integrating the square of the error, and then dividing it by the difference between the publish and close time. The error is, if the question resolves:
- Negatively, and:
-- p0 < 1/2, p0/(t_c - t_p)*(t_c - t).
-- p0 = 1/2, 1/2.
-- p0 > 1/2, (p0*t_c - t_p + (1 - p0)*t)/(t_c - t_p).
- Positively, and:
-- p0 < 1/2, (t_p - (1 - p0)*t_c - p0*t)/(t_c - t_p).
-- p0 = 1/2, -1/2.
-- p0 > 1/2, - (1 - p0)/(t_c - t_p)*(t_c - t).
^{^}
The pages of Metaculus’ questions have the format “https://www.metaculus.com/questions/ID/”.
^{^}
From here, “the Community prediction is the median prediction of all predictors, weighted by recency”.
^{^}
From here, “the Metaculus prediction is the Metaculus best estimate of how a question will resolve, using a sophisticated model to calibrate and weight each user. It has been continually refined since June 2017”.
^{^}
I have asked Metaculus to enable filtering by type of question in their track record page.