Forecasters: What Do They Know? Do They Know Things?? Let's Find Out!

By niplav @ 2024-04-02T18:03 (+10)

In the spirit of mandatory draft amnesty day

Beginnings of a research agenda about judgmental forecasting.

Judgmental forecasting is a fairly recent and (in my humble opinion) fairly under-researched & under-appreciated human endeavour & field of research, with some low-hanging fruit (which are getting picked almost as fast as I can write them up).

The Five Horsemen of Hard Forecasting

In general, judgmental forecasting methods operate best in areas with fast feedback loops, large existing datasets (or at least good reference classes for base rates) and continuous historical trends.

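We can therefore identify the five horsemen of hard forecasting:

  1. Long time horizons
  2. Reward-correlated predictions
  3. Low-probability events
  4. Out-of-distribution situations
  5. Hard-to-specify events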

We can use these categories as guideposts: How bad are these as problems? What approaches have been proposed/tried/implemented so far? If we can improve one of them without harming our ability to perform well on the others, we have made progress; if we improve several in tandem, that's even better.

How Good Are We At Forecasting?

How Can We Become Better At Forecasting?

Scoring Rules

Difficult Types of Questions

Forecasting Techniques

Question Decomposition

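If we say "$X$ will happen if and only if $Y_1$ and $Y_2$ and ... $Y_n$ all happen, so we estimate $P(Y_1)$ and $P(Y_2|Y_1)$ and $P(Y_3|Y_1, Y_2)$ &c, and then multiply them together to estimate $P(X)=P(Y_1) \cdot P(Y_2|Y_1) \cdots P(Y_n|Y_1, \dots, Y_{n-1})$", do we usually get a probability that is close to $P(X)$? Does this improve on forecasts where one tries to estimate $P(X)$ directly?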

This type of question decomposition (which one could call multiplicative decomposition) appears to be a relatively common method for forecasting, see Allyn-Feuer & Sanders 2023, Silver 2016, Kaufman 2011, Carlsmith 2022 and Hanson 2011, but there have been conceptual arguments against this technique, see Yudkowsky 2017, AronT 2023 and Gwern 2019, which all argue that it reliably underestimates the probability of events.

What is the empirical evidence for decomposition being a technique that improves forecasts?

Lawrence et al. 2006 summarize the state of research on the question:

Decomposition methods are designed to improve accuracy by splitting the judgmental task into a series of smaller and cognitively less demanding tasks, and then combining the resulting judgements. Armstrong (2001) distinguishes between decomposition, where the breakdown of the task is multiplicative (e.g. sales forecast=market size forecast×market share forecast), and segmentation, where it is additive (e.g. sales forecast=Northern region forecast+Western region forecast+Central region forecast), but we will use the term for both approaches here. Surprisingly, there has been relatively little research over the last 25 years into the value of decomposition and the conditions under which it is likely to improve accuracy. In only a few cases has the accuracy of forecasts resulting from decomposition been tested against those of control groups making forecasts holistically. One exception is Edmundson (1990) who found that for a time series extrapolation task, obtaining separate estimates of the trend, seasonal and random components and then combining these to obtain forecasts led to greater accuracy than could be obtained from holistic forecasts. Similarly, Webby, O’Connor and Edmundson (2005) showed that, when a time series was disturbed in some periods by several simultaneous special events, accuracy was greater when forecasters were required to make separate estimates for the effect of each event, rather than estimating the combined effects holistically. Armstrong and Collopy (1993) also constructed more accurate forecasts by structuring the selection and weighting of statistical forecasts around the judge’s knowledge of separate factors that influence the trends in time series (causal forces). 
Many other proposals for decomposition methods have been based on an act of faith that breaking down judgmental tasks is bound to improve accuracy or upon the fact that decomposition yields an audit trail and hence a defensible rationale for the forecasts (Abramson & Finizza, 1991; Bunn & Wright, 1991; Flores, Olson, & Wolfe, 1992; Saaty & Vargas, 1991; Salo & Bunn, 1995; Wolfe & Flores, 1990). Yet, as Goodwin and Wright (1993) point out, decomposition is not guaranteed to improve accuracy and may actually reduce it when the decomposed judgements are psychologically more complex or less familiar than holistic judgements, or where the increased number of judgements required by the decomposition induces fatigue.

(Emphasis mine).

The types of decomposition described here seem quite different from the ones used in the sources above: Decomposed time series are quite dissimilar to multiplied probabilities for binary predictions, and in combination with the conceptual counter-arguments the evidence appears quite weak.

It appears as if a team of a few (let's say 4) dedicated forecasters could run a small experiment to determine whether multiplicative decomposition for binary forecasts is a good method, by spending 20 minutes per question on either an explicitly decomposed forecast or a control forecast, assigned at random (although the exact method for control needs to be elaborated on). Working in parallel, making 70 forecasts should take less than 6 hours, although it'd be useful to search for more recent literature on the question first.
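The time budget above can be checked, and the random assignment sketched, in a few lines. This is a minimal sketch assuming one plausible design (each question is independently assigned to an arm by a fair coin, and the work is split evenly); the function names and the seed are hypothetical and not taken from any existing experiment.

```python
import random

def assign_conditions(n_questions, seed=0):
    """Randomly assign each question to the 'decomposed' or 'control' arm."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    return [rng.choice(["decomposed", "control"]) for _ in range(n_questions)]

def wall_clock_hours(n_forecasts, minutes_per_forecast, n_forecasters):
    """Hours needed under an idealised, perfectly even split of the work."""
    total_minutes = n_forecasts * minutes_per_forecast
    return total_minutes / n_forecasters / 60

assignments = assign_conditions(70)
hours = wall_clock_hours(70, 20, 4)
print(f"{hours:.2f} hours")  # 70 * 20 / 4 / 60, prints "5.83 hours"
```

Since 70 does not divide evenly by 4, the busiest forecaster would in practice make 18 forecasts, which comes to exactly 6 hours. A coin flip per question also does not guarantee equally sized arms; shuffling a balanced list would be an alternative if that matters.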

Classification and Improvements

The description of such decomposition in this section is, of course, lacking: A better way to decompose would be, for a specific outcome $X$, to find a set of preconditions for $X$ that are mutually exclusive and collectively exhaustive (MECE), find a chain that precedes them (or another MECE decomposition), and iterate until a whole (possibly interweaving) tree of options has been found.

Thus one can define three types of question decomposition:

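  1. Multiplicative Decomposition: Given an event $X$, find conditions $Y_1, \dots, Y_n$ so that $X$ happens if and only if all of $Y_1, \dots, Y_n$ happen. Estimate $P(Y_1)$ and $P(Y_2|Y_1)$ and &c, and then multiply them together to estimate $P(X)=P(Y_1) \cdot P(Y_2|Y_1) \cdots P(Y_n|Y_1, \dots, Y_{n-1})$.
  2. Additive Decomposition or MECE Decomposition: Given an event $X$, find a set of scenarios $Y_1, \dots, Y_n$ such that $X$ happens if any $Y_i$ happens, and only then, and no two $Y_i, Y_j$ (with $i \neq j$) have $P(Y_i \cap Y_j)>0$. Estimate $P(Y_1), \dots, P(Y_n)$ and then estimate $P(X)=\sum_{i=1}^{n} P(Y_i)$.
  3. Recursive Decomposition: For each scenario $Y$, decide to pursue one of the following strategies:
    1. Estimate $P(Y)$ directly
    2. Multiplicative decomposition of $Y$
      1. Find a multiplicative decomposition $Z_1, \dots, Z_m$ for $Y$
      2. Estimate each $P(Z_i|Z_1, \dots, Z_{i-1})$ via recursive decomposition
      3. Determine $P(Y)=\prod_{i=1}^{m} P(Z_i|Z_1, \dots, Z_{i-1})$.
    3. Additive decomposition of $Y$
      1. Find an additive decomposition $Z_1, \dots, Z_m$ for $Y$
      2. Estimate each $P(Z_i)$ via recursive decomposition
      3. Determine $P(Y)=\sum_{i=1}^{m} P(Z_i)$.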
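The three strategies can be sketched as a small recursive evaluator. This is an illustrative sketch rather than an implementation from any existing tool: the `Node` class, its field names and the example numbers are hypothetical, and it assumes that the probabilities attached to the children of a multiplicative node are already conditional on the preceding siblings.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List, Optional

@dataclass
class Node:
    kind: str                  # "direct", "mult" or "add"
    p: Optional[float] = None  # only used when kind == "direct"
    children: List["Node"] = field(default_factory=list)

def estimate(node: Node) -> float:
    """Recursively estimate the probability of the event at this node."""
    if node.kind == "direct":
        if node.p is None:
            raise ValueError("direct node needs a probability")
        return node.p
    parts = [estimate(child) for child in node.children]
    if node.kind == "mult":
        # Children are conditions that must all hold; each estimate is
        # assumed to be conditional on the earlier siblings.
        return prod(parts)
    if node.kind == "add":
        # Children are mutually exclusive scenarios, so probabilities add.
        total = sum(parts)
        if total > 1 + 1e-9:
            raise ValueError("scenario probabilities sum to more than 1")
        return total
    raise ValueError(f"unknown node kind: {node.kind}")

# Two exclusive scenarios: a two-step chain (0.5 * 0.4) and a direct guess (0.1).
tree = Node("add", children=[
    Node("mult", children=[Node("direct", 0.5), Node("direct", 0.4)]),
    Node("direct", 0.1),
])
print(f"{estimate(tree):.2f}")  # 0.5 * 0.4 + 0.1, prints "0.30"
```

The check on additive nodes only catches scenario sets whose estimates sum above 1; it cannot detect scenarios that overlap or that fail to be exhaustive, which remains a judgment call for the forecaster.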

A keen reader will notice that recursive decomposition is similar to Bayes nets. True, though it doesn't deal as well with conditional probabilities.

Using LLMs

This is a scenario where large language models are quite useful, and we have a testable hypothesis: Does question decomposition (or MECE decomposition) improve language model forecasts by any amount?

Frontier LLMs are at best mediocre at forecasting real-world events, but just as asking for calibration improves their performance, perhaps chain-of-thought-like question decomposition improves (or reduces) it, which would give us reason to believe that similar practices will (or won't) work with human forecasters.

Direct estimation:

Provide your best probabilistic estimate for the following question.
Give ONLY the probability, no other words or explanation. For example:
10%. Give the most likely guess, as short as possible; not a complete
sentence, just the guess!

Multiplicative decomposition:

Provide your best probabilistic estimate for a question.

Your output should be structured in three parts.

First, determine a list of factors X₁, …, X_n that are necessary
and sufficient for the question to be answered "Yes". You can choose
any number of factors.

Second, for each factor X_i, estimate and output the conditional
probability P(X_i|X₁, X₂, …, X_{i-1}), the probability that X_i
will happen, given all the previous factors *have* happened. Then, arrive
at the probability for Q by multiplying the conditional probabilities

P(Q)=P(X₁)*P(X₂|X₁)…P(X_n|X₁, X₂, …, X_{n-1}).

Third and finally, in the last line, report P(Q), WITHOUT ANY ADDITIONAL
TEXT. Just write the probability, and nothing else.

Example (Question: "Will my wife get bread from the bakery today?"):

Necessary factors:
1. My wife remembers to get bread from the bakery.
2. The car isn't broken.
3. The bakery is open.
4. The bakery still has bread.

1. P(My wife remembers to get bread from the bakery)=0.75
2. P(The car isn't broken|My wife remembers to get bread from the bakery)=0.99
3. P(The bakery is open|The car isn't broken, My wife remembers to get bread from the bakery)=0.7
4. P(The bakery still has bread|The bakery is open, The car isn't broken, My wife remembers to get bread from the bakery)=0.9
Multiplying out the probabilities: 0.75*0.99*0.7*0.9=0.467775
(End of output)
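
To compare the two prompts, one would need to extract the final probability from each response and score it against how the question resolved. The following is a minimal sketch, assuming the model puts its answer on the last non-empty line, either as a percentage such as `10%` or as a decimal such as `0.467775` (possibly at the end of a product, as in the example above). The function names are hypothetical, and the Brier score is used as one common choice of scoring rule, not necessarily the one such an experiment would settle on.

```python
import re

def parse_final_probability(response: str) -> float:
    """Return the last number on the last non-empty line as a probability."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty response")
    # Each match is (number, "%" or ""); the last one is taken as the answer.
    matches = re.findall(r"(\d+(?:\.\d+)?)\s*(%?)", lines[-1])
    if not matches:
        raise ValueError("no number found on the last line")
    number, percent = matches[-1]
    p = float(number) / 100 if percent else float(number)
    if not 0 <= p <= 1:
        raise ValueError(f"not a probability: {p}")
    return p

def brier_score(forecasts, outcomes):
    """Mean squared error between forecasts and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

direct = parse_final_probability("10%")
decomposed = parse_final_probability(
    "Multiplying out the probabilities: 0.75*0.99*0.7*0.9=0.467775"
)
# Made-up forecasts and resolutions, purely to show the call.
score = brier_score([0.9, 0.2], [1, 0])
```

Running both prompts on the same set of resolved questions and comparing the two Brier scores would give a first, rough answer to the hypothesis; the parser would need to be adapted if a model deviates from the requested output format.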


How Can We Ask Better Forecasting Questions?

Other Questions

SummaryBot @ 2024-04-03T13:28 (+1)

Executive summary: This post outlines a research agenda for improving judgmental forecasting, identifying key challenges, assessing current forecasting ability, and proposing techniques like question decomposition and using language models to enhance forecasts.

Key points:

  1. The five main challenges in forecasting are long time horizons, reward-correlated predictions, low probability events, out-of-distribution situations, and hard-to-specify events.
  2. Open questions remain about current forecasting ability, including performance on long-term and low-probability questions, convergence behavior, and comparisons between prediction markets, teams, and models.
  3. Potential improvements include developing better scoring rules, techniques for handling unclear resolution criteria and incentivizing predictions on challenging questions, and empirically testing question decomposition methods.
  4. Question decomposition can be multiplicative, additive (MECE), or recursive, and may be enhanced by using large language models, though more research is needed.
  5. Other research directions include analyzing existing forecast datasets, studying question quality, developing aggregation methods, and assessing the robustness of current prediction platforms.



This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.