Forecasting With LLMs - An Open and Promising Research Direction

By Marcel D @ 2024-03-12T04:23 (+13)

I've been working on this for the past week, but both Scott Alexander and ImportAI have begun covering it, so I figured I might as well just jump in while there's attention... I feel a bit like Schoenegger probably did getting scooped by Halawi!

Summary

Why would merely “decent” LLM forecasting be a big deal?

Even if LLMs’ performance were only equal to or slightly worse than that of crowds of regular forecasters, this could still be very impactful, for the following reasons:

The past and present of forecasting with language models

We are likely not at the ceiling even with current models

Setting aside the recent release of Claude-3 and the impact of future models, there is likely room for meaningful improvement just through better system design with current models.

Potential next steps for research

Aside from scrutinizing the papers’ datasets once they are released, someone could use Halawi et al.’s hindcasting approach to immediately evaluate:

  1. ^

My forecast is admittedly underspecified (e.g., “on what types of questions?”) and it is meant to exclude crowd use of LLMs. I also assess it is >15% likely that LLMs/etc. could do as well as superforecaster crowds by 2026. For both assessments, I expect that new information over the next few months could dramatically change this forecast.

  2. ^

     For some relevant recent research on this, see https://arxiv.org/pdf/2402.07862.pdf.

  3. ^

     This is explicitly bemoaned here: How Feasible Is Long-range Forecasting?

  4. ^

For example, would an LLM with a December 2021 cutoff date have considered the Russian invasion of Ukraine (and Ukraine’s strong defense) “obvious”?

  5. ^

     Alternatively, it may not truly provide “more” information, but the information might be more legible/credible to skeptics who might otherwise just argue other benchmarks (e.g., MMLU, GPQA) are broken or unrealistic.

  6. ^

     Cheating with hindcasting might be possible in some cases, but it may require misreporting training cutoff dates.

  7. ^

     See for example https://openreview.net/pdf?id=LbOdQrnOb2q
This 2023 paper found that GPT-4 underperformed the crowd, but it had a variety of limitations and design choices that may have undermined comparability (especially considering that GPT-4’s Brier score was higher in Schoenegger et al. 2024), such as eliciting only a single forecast from GPT-4 rather than aggregating multiple sampled responses (a rough sketch of such aggregation appears after these footnotes): https://arxiv.org/pdf/2310.13014.pdf.

  8. ^
  9. ^

However, Schoenegger et al.’s approach raises the possibility that the crowd performed poorly because the human forecasters had only a short window of time (possibly just 48 hours?) in which to forecast.

  10. ^
  11. ^

To be clear, this does not include prompts for instructions. See, for example, Figure 15 of https://arxiv.org/pdf/2402.18563.pdf.

  12. ^

Halawi et al. do claim to separate training, validation, and test datasets for the fine-tuning process. However, I could not quickly determine whether they address the possibility of leakage via correlation between questions across the training/test datasets (e.g., the outcome of a specific Congressional election in 2020 might correlate with the outcome of the Presidential election); a sketch of one simple mitigation appears after these footnotes.

  13. ^

     For some coverage/discussion, see here: https://www.astralcodexten.com/p/mantic-monday-21924

  14. ^

Many mathematical problems are of this form, including those underlying many cryptographic algorithms.

  15. ^

     For some discussion of structured analytical techniques, see here: https://greydynamics.com/structured-analytic-techniques-basics-and-applications-for-intelligence-analysts/. There is debate about the validity of SATs given the supposedly limited experimental evidence of their effectiveness—which experimenting with LLMs might help address! For more on this debate, see here: https://www.rand.org/pubs/research_reports/RR1408.html

  16. ^

This is mostly inspired by a supposed finding from the 2023 “Self-Taught Optimizer (STOP)” paper: the “self-improvement” instructions made GPT-3.5 perform worse over iterations but made GPT-4 slightly better.
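
(Referenced in footnote 7.) As a rough illustration of what “aggregating multiple responses” can look like, here is a minimal Python sketch: sample several completions at nonzero temperature, parse each one’s stated probability, and take the median. This is not the papers’ actual code; the `ask_model` callable, the prompt wording, the “Probability: XX%” output format, and the choice of median are all my own illustrative assumptions.

```python
import re
import statistics
from typing import Callable, Optional


def parse_probability(text: str) -> Optional[float]:
    """Extract a 'Probability: XX%' style answer from a model response."""
    match = re.search(r"probability\s*:\s*([0-9]+(?:\.[0-9]+)?)\s*%", text, re.IGNORECASE)
    if match is None:
        return None
    # Clamp to [0, 1] in case the model reports something out of range.
    return min(max(float(match.group(1)) / 100.0, 0.0), 1.0)


def aggregate_forecast(ask_model: Callable[[str], str], question: str, n_samples: int = 5) -> float:
    """Sample several forecasts and aggregate them with a median.

    `ask_model` is assumed to be a user-supplied function that sends a prompt
    to whatever LLM API you use and returns the raw text of one completion
    (sampled with temperature > 0 so repeated calls can differ).
    """
    prompt = (
        f"Question: {question}\n"
        "Think step by step, then give your final answer on the last line as 'Probability: XX%'."
    )
    forecasts = []
    for _ in range(n_samples):
        probability = parse_probability(ask_model(prompt))
        if probability is not None:
            forecasts.append(probability)
    if not forecasts:
        raise ValueError("No parsable forecasts were returned")
    return statistics.median(forecasts)
```

Whether the median, a trimmed mean, or some other pooling rule works best is itself an empirical question of the kind these papers could test.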
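(Referenced in footnote 12.) One common way to reduce (though not eliminate) leakage from correlated questions is to split by date rather than at random, so that every test question resolves after every training question and contemporaneous, correlated questions stay on the same side of the boundary. This is a minimal sketch of that idea, not a claim about what Halawi et al. actually did; the `Question` fields and the cutoff are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Question:
    text: str
    resolution_date: date
    outcome: bool


def temporal_split(questions: list[Question], cutoff: date) -> tuple[list[Question], list[Question]]:
    """Split so every test question resolves strictly after every training question.

    A purely random split can leak information when contemporaneous questions
    are correlated (e.g., a 2020 House race and the 2020 Presidential race);
    a date-based split keeps such clusters together.
    """
    train = [q for q in questions if q.resolution_date <= cutoff]
    test = [q for q in questions if q.resolution_date > cutoff]
    return train, test
```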