Forecasting With LLMs - An Open and Promising Research Direction

By Marcel D @ 2024-03-12T04:23 (+13)

I've been working on this for the past week, but both Scott Alexander and ImportAI have begun covering it, so I figured I might as well just jump in while there's attention... I feel a bit like Schoenegger probably did getting scooped by Halawi!

Summary

Why would merely “decent” LLM forecasting be a big deal?

Even if LLMs’ performance were only equal to or slightly worse than that of crowds of regular forecasters, this could still be very impactful, for the following reasons:

The past and present of forecasting with language models

We are likely not at the ceiling even with current models

Setting aside the recent release of Claude-3 and the impact of future models, there is likely room for meaningful improvement just through better system design with current models.

Potential next steps for research

Aside from scrutinizing the papers’ datasets once they are released, someone could use Halawi et al.’s hindcasting approach to immediately evaluate:

  1. ^

My forecast is admittedly underspecified (e.g., “on what types of questions?”) and it is meant to exclude crowd use of LLMs. I also assess it is >15% likely that LLMs/etc. could do as well as superforecaster crowds by 2026. For both assessments, I expect that new information over the next few months could dramatically change this forecast.

  2. ^

     For some relevant recent research on this, see https://arxiv.org/pdf/2402.07862.pdf.

  3. ^

     This is explicitly bemoaned here: How Feasible Is Long-range Forecasting?

  4. ^

For example, would an LLM with a December 2021 cutoff date have considered the Russian invasion of Ukraine (and Ukraine’s strong defense) “obvious”?

  5. ^

     Alternatively, it may not truly provide “more” information, but the information might be more legible/credible to skeptics who might otherwise just argue other benchmarks (e.g., MMLU, GPQA) are broken or unrealistic.

  6. ^

     Cheating with hindcasting might be possible in some cases, but it may require misreporting training cutoff dates.

  7. ^

     See for example https://openreview.net/pdf?id=LbOdQrnOb2q
This 2023 paper found that GPT-4 underperformed the crowd, but it had a variety of limitations and design choices that may have undermined comparability (especially considering that GPT-4’s Brier score was higher in Schoenegger et al. 2024), such as eliciting only a single forecast from GPT-4 rather than aggregating multiple sampled responses (a rough sketch of such aggregation appears after these footnotes): https://arxiv.org/pdf/2310.13014.pdf.

  8. ^
  9. ^

However, Schoenegger et al.’s approach raises the possibility that the crowd performed poorly because the human forecasters had only a short window of time (possibly just 48 hours?) in which to forecast.

  10. ^
  11. ^

To be clear, this does not include prompts for instructions. See, for example, Figure 15 of https://arxiv.org/pdf/2402.18563.pdf.

  12. ^

Halawi et al. do claim to separate training, validation, and test datasets for the fine-tuning process. However, I could not quickly determine whether they address the possibility of leakage via correlation between questions across the training/test datasets (e.g., the outcome of a specific Congressional election in 2020 might correlate with the outcome of the Presidential election); a sketch of one simple mitigation appears after these footnotes.

  13. ^

     For some coverage/discussion, see here: https://www.astralcodexten.com/p/mantic-monday-21924

  14. ^

Many mathematical problems are of this form, including those underlying many cryptographic algorithms.

  15. ^

     For some discussion of structured analytical techniques, see here: https://greydynamics.com/structured-analytic-techniques-basics-and-applications-for-intelligence-analysts/. There is debate about the validity of SATs given the supposedly limited experimental evidence of their effectiveness—which experimenting with LLMs might help address! For more on this debate, see here: https://www.rand.org/pubs/research_reports/RR1408.html

  16. ^

This is mostly inspired by a supposed finding from the 2023 “Self-Taught Optimizer (STOP)” paper: the “self-improvement” instructions made GPT-3.5 perform worse over iterations but made GPT-4 slightly better.
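
(Referenced in footnote 7.) As a rough illustration of what “aggregating multiple responses” can look like, here is a minimal Python sketch: sample several completions at nonzero temperature, parse each one’s stated probability, and take the median. This is not the papers’ actual code; the `ask_model` callable, the prompt wording, the “Probability: XX%” output format, and the choice of median are all my own illustrative assumptions.

```python
import re
import statistics
from typing import Callable, Optional


def parse_probability(text: str) -> Optional[float]:
    """Extract a 'Probability: XX%' style answer from a model response."""
    match = re.search(r"probability\s*:\s*([0-9]+(?:\.[0-9]+)?)\s*%", text, re.IGNORECASE)
    if match is None:
        return None
    # Clamp to [0, 1] in case the model reports something out of range.
    return min(max(float(match.group(1)) / 100.0, 0.0), 1.0)


def aggregate_forecast(ask_model: Callable[[str], str], question: str, n_samples: int = 5) -> float:
    """Sample several forecasts and aggregate them with a median.

    `ask_model` is assumed to be a user-supplied function that sends a prompt
    to whatever LLM API you use and returns the raw text of one completion
    (sampled with temperature > 0 so repeated calls can differ).
    """
    prompt = (
        f"Question: {question}\n"
        "Think step by step, then give your final answer on the last line as 'Probability: XX%'."
    )
    forecasts = []
    for _ in range(n_samples):
        probability = parse_probability(ask_model(prompt))
        if probability is not None:
            forecasts.append(probability)
    if not forecasts:
        raise ValueError("No parsable forecasts were returned")
    return statistics.median(forecasts)
```

Whether the median, a trimmed mean, or some other pooling rule works best is itself an empirical question of the kind these papers could test.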
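(Referenced in footnote 12.) One common way to reduce (though not eliminate) leakage from correlated questions is to split by date rather than at random, so that every test question resolves after every training question and contemporaneous, correlated questions stay on the same side of the boundary. This is a minimal sketch of that idea, not a claim about what Halawi et al. actually did; the `Question` fields and the cutoff are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Question:
    text: str
    resolution_date: date
    outcome: bool


def temporal_split(questions: list[Question], cutoff: date) -> tuple[list[Question], list[Question]]:
    """Split so every test question resolves strictly after every training question.

    A purely random split can leak information when contemporaneous questions
    are correlated (e.g., a 2020 House race and the 2020 Presidential race);
    a date-based split keeps such clusters together.
    """
    train = [q for q in questions if q.resolution_date <= cutoff]
    test = [q for q in questions if q.resolution_date > cutoff]
    return train, test
```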