Analysis of Automated Prompt Engineering for Forecasting

By christian, Benjamin Wilson @ 2025-06-12T15:49 (+11)

This is a linkpost to https://www.metaculus.com/notebooks/38421/automated-prompt-engineering-for-forecasting/

By Benjamin Wilson, Research Automation Engineer at Metaculus

Main Findings:

- Prompt optimization produced noticeable gains over the control prompt for GPT-4.1-nano and GPT-4.1 on the 230-question test set, but not for DeepSeek-R1.
- The top prompts optimized for GPT-4.1 also helped GPT-4.1-nano and o4-mini, but underperformed the control prompt on DeepSeek-R1 and Claude-Sonnet-4.
- A sanity check on Q1 AI Benchmarking Tournament bots reproduced the Q1 ordering (o1 beat DeepSeek-R1, and both beat GPT-4o) using expected baseline scores.

Introduction:

As part of a grant from the Foresight Institute, Metaculus has been running experiments to build open-source research tools for forecasting bots and to generally improve forecasting bot performance. This last sprint, I created an automated prompt optimizer to test whether some prompts reliably do better than others. These are preliminary findings that we think can be useful to other researchers and bot makers. We plan to further test and improve this approach and ideally make the optimizer a publicly usable tool. Below is a snapshot of early findings.

Methodology:

Optimization Results:

In the graphs below, the first bar represents the "perfect predictor", i.e. the score of the community prediction (this is "perfect" only in the sense that it maximizes expected baseline score). The next bar is the Control Group (using the control prompt), and the final two bars are the top two scoring prompts on the training question set (collected when the prompt optimizer was run). Error bars denote 90% confidence intervals calculated using a t-test. White stars indicate the best score in each category (though this only matters for graphs that do not include the perfect predictor). Uncertain questions (red) are those with community predictions between 10% and 90%, while certain questions (dark blue) are those outside this range. The bar that really matters is the "All questions" bar (light blue).

On the 230-question test set, the "perfect predictor" earned an expected baseline score of 38.8, which is the maximum expected baseline score a bot can get on this test set. A bot that predicts 50% on every question would get a score of 0.
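
To make the scoring concrete, here is a minimal sketch of how these numbers can be computed. It assumes the standard binary baseline score formula (100 × log2 of the probability assigned to the realized outcome, relative to a 50% chance baseline) and a t-distribution for the 90% confidence intervals; the exact code used for these results lives in the forecasting-tools package and may differ in detail.

```python
import numpy as np
from scipy import stats


def baseline_score(p_yes: float, resolved_yes: bool) -> float:
    """Binary baseline score: 100 * log2(probability assigned to the outcome / 0.5).
    Predicting 50% always scores 0; a confident correct prediction approaches 100."""
    p_outcome = p_yes if resolved_yes else 1.0 - p_yes
    return 100.0 * np.log2(p_outcome / 0.5)


def expected_baseline_score(p_yes: float, community_p_yes: float) -> float:
    """Expected baseline score, treating the community prediction as the true
    probability of resolution. It is maximized by matching the community value,
    which is why the community prediction acts as the 'perfect predictor'."""
    return (community_p_yes * baseline_score(p_yes, True)
            + (1.0 - community_p_yes) * baseline_score(p_yes, False))


def mean_with_90ci(scores: list[float]) -> tuple[float, float]:
    """Mean score and the half-width of a 90% confidence interval from a t-test."""
    arr = np.asarray(scores, dtype=float)
    half_width = stats.sem(arr) * stats.t.ppf(0.95, df=len(arr) - 1)
    return float(arr.mean()), float(half_width)


# A question the community puts at 70%:
print(expected_baseline_score(0.70, 0.70))  # ~11.9, the "perfect predictor" score here
print(expected_baseline_score(0.80, 0.70))  # ~7.8, lower than matching the community
print(expected_baseline_score(0.50, 0.70))  # 0.0, the always-50% bot
```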

GPT-4.1-nano Optimization:

The control prompt did worse than predicting 50% on every question (score: -6.35 +/- 7.96). The second optimized prompt (score: 11.72 +/- 4.89) significantly outperformed the control prompt, and the first optimized prompt (score: 0.40 +/- 7.09) also showed a noticeable improvement (data).

 

GPT-4.1 Optimization:

Both the first optimized prompt (score: 21.53 +/- 4.02) and the second optimized prompt (score: 19.72 +/- 4.27) beat the control prompt (score: 15.02 +/- 6.28) by a noticeable amount, but not as much as for GPT-4.1-nano (data).

 

DeepSeek-R1 Optimization:

Neither the first optimized prompt (score: 16.28 +/- 6.63) nor the second optimized prompt (score: 18.11 +/- 3.52) outperformed the control prompt (score: 20.30 +/- 3.90) for DeepSeek-R1 (data).

 

Other Results:

GPT-4.1's prompts on other models:

I tested the top 2 prompts from the GPT-4.1 optimization run on other models using the 230-question test set. For each model, the bars below alternate between the control prompt, best prompt, and second-best prompt. The models tested in order are: gpt-4.1, gpt-4.1-nano, claude-sonnet-4 (default w/o thinking), o4-mini, deepseek-r1. The best benchmark for each category has a white star.

You'll notice that both prompts beat the control prompt for GPT-4.1, GPT-4.1-nano, and o4-mini. However, they both underperformed the control prompt for DeepSeek-R1 and Claude-Sonnet-4. These changes are noticeable, though from visual inspection only some appear to be statistically significant (I hope to run some stats later to test these differences more precisely). Also notably, the optimized prompt for GPT-4.1 outperformed all the other prompts across all models in the "All questions" category (though with insignificant margins at times).

  (data1, data2, data3, data4, data5)

Expected baseline score of bots from Q1:

As a quick sanity check, I benchmarked some of the bots we ran in the Q1 AI Benchmarking Tournament to double-check that I could get the same rankings with expected baseline score as we got in Q1. In Q1, we found that o1 beat DeepSeek-R1 and both beat GPT-4o. A quick run on the 112-question training set (the smaller set was chosen to reduce cost) found the same ordering (though with overlapping error bars). Also included in this test was Gemini 2.5 Pro, which got similar scores to o1. I'm now curious to see whether o1 and Gemini 2.5 Pro score similarly at the end of our Q2 tournament. Note, though, that these evaluations only use binary questions; the AI Benchmarking Tournament uses numeric and multiple-choice questions as well (data).

 

Misc Findings:

Best prompts for GPT-4.1 and Control Prompt:

The best prompts for GPT-4.1 are somewhat lengthy, though quite interesting, so please check out the prompts here. They focus a lot on red/blue teaming.

The Control Prompt is:


You are a professional forecaster interviewing for a job.

Your interview question is:

{question_text}

Question background:

{background_info}

This question's outcome will be determined by the specific criteria below. These criteria have not yet been satisfied:

{resolution_criteria}

{fine_print}

Your research assistant says:

{research}

Today is {today}.

Before answering you write:

(a) The time left until the outcome to the question is known.

(b) The status quo outcome if nothing changed.

(c) A brief description of a scenario that results in a No outcome.

(d) A brief description of a scenario that results in a Yes outcome.

You write your rationale remembering that good forecasters put extra weight on the status quo outcome since the world changes slowly most of the time.

The last thing you write is your final answer as: "Probability: ZZ%", 0-100
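
For anyone reusing this template, here is a minimal sketch of how the curly-brace placeholders can be filled with Python's str.format before the prompt is sent to a model; the template is shortened and the question values are invented for illustration.

```python
# A shortened copy of the control prompt; the full template is shown above.
CONTROL_PROMPT = """You are a professional forecaster interviewing for a job.

Your interview question is:
{question_text}

Question background:
{background_info}

Today is {today}.

The last thing you write is your final answer as: "Probability: ZZ%", 0-100
"""

# Placeholder values are made up for illustration.
filled_prompt = CONTROL_PROMPT.format(
    question_text="Will example event X happen before 2026?",
    background_info="Example background summary goes here.",
    today="2025-06-12",
)
print(filled_prompt)
```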


Code and Data:

All results can be found here. Specific files are linked above. The most up-to-date code can be found in the forecasting-tools Python package, though this is a snapshot of the optimization script at the last commit used to generate the current results, and here is the script for evaluating the test set (the relevant package version is 0.2.41). All prompts can be found in the benchmark files, which are lists of BenchmarkForBot objects. After loading, you can access the prompt via `benchmark.forecast_bot_config["prompt"]`. Benchmarks can be easily viewed and navigated with the benchmark_displayer.py streamlit app in the forecasting-tools package (see the package README).
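
If you want to inspect the prompts without the streamlit app, something along these lines should work. This is a hedged sketch: it assumes a downloaded benchmark file is plain JSON mirroring the BenchmarkForBot structure described above, and the file path is a placeholder.

```python
import json

# Placeholder path to a downloaded benchmark file
# (assumed to be a JSON list of BenchmarkForBot-like objects).
BENCHMARK_FILE = "benchmarks/example_benchmark.json"

with open(BENCHMARK_FILE) as f:
    benchmarks = json.load(f)

# Each entry is assumed to carry a "forecast_bot_config" dict, matching the
# benchmark.forecast_bot_config["prompt"] access pattern described above.
for entry in benchmarks:
    config = entry.get("forecast_bot_config", {})
    prompt = config.get("prompt", "<no prompt found>")
    print(prompt[:200])  # print the first 200 characters of each prompt
```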