Types of specification problems in forecasting

By Juan Gil @ 2021-07-20T04:17 (+35)

In the context of forecasting, how you ask a question can be a large factor in how much information you get from the forecasts on that question. We will thus explore the problems associated with the way you operationalize a question on a forecasting platform, which we will call the question specification. (Terms also used include resolution criteria or operationalization.) Specifically:

There are various other alignment problems related to forecasting platforms that this post will not address (see this paper for more on alignment of forecasting platforms):

Summary

In this post, I’ll describe three major categories of specification problems on forecasting platforms, discuss how they’re related to one another, and outline how you might mitigate or trade off some of the costs. Note that these categories are not “crisp”; they are related to one another and a question can simultaneously have problems in different categories.

In particular, I’ll explore how these problems impose costs on both forecasters and question-askers. (Note: I’ll use “question-asker” to refer to any party that benefits from accurate predictions on a particular question.)

Ambiguous questions

Ambiguous questions do not have clear, well-defined specifications. It might not be clear what states of the world will cause the question to resolve one way or another, or there might be room for competing interpretations.

For an example from the EA Forum, take a look at this bet between Michael Dickens and Buck Shlegeris on the proposition: “By the end of 2021, a restaurant regularly sells an item primarily made of a cultured animal product with a menu price less than $100.”

This bet was made in 2016, but it was quite unclear how the bet should resolve near the end of 2021. What does it mean to “regularly[1] sell[2]” something? What is a “restaurant”? Ultimately, an arbiter was selected to resolve the dispute and further operationalize the question to remove ambiguity.[3]

Metaculus has a guide on how to write unambiguous and useful questions here. I find that it illustrates well the different ways that a specification can be unintentionally ambiguous. Especially check out the section “Guidelines for resolution conditions”:

There are different levels of ambiguity, and while I don’t think it’s often feasible to remove all ambiguity, reducing unintended ambiguity is typically better. In the worst cases, too much ambiguity can make the question incoherent or self-contradictory. More commonly, the question only becomes ambiguous with low probability (e.g. through edge cases or unanticipated events in the world). The lower the probability of ambiguity, the less costly it will be to forecasters.

This problem becomes harder on longer time scales, since there’s more time for the world to change in a way that makes your question stop making sense or fail to resolve clearly one way or another. For example, if the resolution criteria are based on the reporting of some media outlet, but that media outlet ceases to exist, then the question will resolve ambiguously unless you have fall-back criteria.
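To make the fall-back idea concrete, here is a minimal sketch in Python. It is not how any particular platform resolves questions; the `resolve` function and the two example sources are hypothetical names invented for illustration. The assumption is that resolution sources are checked in priority order, and the question only resolves ambiguously if none of them can report.

```python
from typing import Callable, Optional, Sequence

# A source returns True/False if it has reported on the question,
# or None if it has no report (for example, the outlet no longer exists).
Source = Callable[[], Optional[bool]]


def resolve(sources: Sequence[Source]) -> str:
    """Resolve from the first source, in priority order, that has a report."""
    for source in sources:
        report = source()
        if report is not None:
            return "yes" if report else "no"
    # No listed source could report, so nothing in the criteria applies.
    return "ambiguous"


def defunct_outlet() -> Optional[bool]:
    return None  # hypothetical primary source that has shut down


def backup_outlet() -> Optional[bool]:
    return True  # hypothetical fall-back source that reports the event


print(resolve([defunct_outlet]))                 # ambiguous
print(resolve([defunct_outlet, backup_outlet]))  # yes
```

With only the defunct outlet listed, the question has nothing to resolve on; adding a single fall-back source is enough to avoid the ambiguous resolution in this toy case.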

Forecasting platforms primarily deal with ambiguous questions in two ways:

Costs of ambiguity

Costs to forecasters:

Costs to question-asker:

Reducing the costs to forecasters of a market resolving ambiguously:

Misaligned questions

A misaligned question has a specification that, while potentially unambiguous, fails to cover what the question-asker actually cares about. In other words, a misaligned question is one in which the letter of the question diverges from the spirit of the question.

There are different degrees of misalignment. Creating a question that predicts a proxy for what you care about can be fine if the proxy gives you substantial information about the thing you care about. For example, if I’m creating a market to predict the presidential election, it’s probably fine to resolve based on some combination of credible media reports, since these are tightly correlated with the actual election results (though they do sometimes diverge!). So, a question is only misaligned to the extent that the chosen proxy diverges from what the question-asker cares about.

An example of well-specified but misaligned resolution criteria (here): a question asked whether North Korea would launch a missile by a certain date. The launch did occur, but the resolution criteria required confirmation from the US Department of Defense, which never came, so the question resolved negatively.

One of the contract’s rules is that “the source used to confirm a test missile being launched and leaving North Korean airspace will be the U.S. Department of Defense.” The problem is that, according to Tradesports spokesman Matt Bonner, they made “numerous efforts to receive direct confirmation from the DoD” but were told “no statement involving the missile test and North Korean airspace would be forthcoming, as those specifics are considered a matter of national intelligence/security.” Bonner emphasized that “a confirmation source is, by definition and necessity, an integral part of the proposition on which contracts trade” – and said that traders are “obligated to be familiar with the rules of a contract before they place an order.”

I don’t think this example is particularly egregious (as in, I don’t think the question-askers were obviously wrong in their specification) since DoD reports do seem correlated with North Korea missile launches, but this happened to be a high-profile case where the spirit of the question and the letter diverged.
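As a rough illustration of how the letter and the spirit can come apart for a forecaster, here is a small Python sketch. The probabilities are made up for illustration and are not taken from the actual contract; `p_resolves_yes` is a hypothetical helper. It assumes the contract can only resolve positively if the event happens and the named source confirms it.

```python
def p_resolves_yes(p_event: float, p_confirmed_given_event: float) -> float:
    """Probability the contract resolves positively under its written rules.

    Assumes a positive resolution requires both the event itself
    and a confirmation of it from the named source.
    """
    return p_event * p_confirmed_given_event


# Hypothetical numbers, chosen only to show the gap.
p_launch = 0.8        # spirit: does the launch happen?
p_dod_confirms = 0.5  # does the named source confirm it, given a launch?

letter = p_resolves_yes(p_launch, p_dod_confirms)
print(f"spirit: {p_launch:.2f}, letter: {letter:.2f}")
```

A forecaster who anticipates the confirmation problem should forecast the lower "letter" number, which is then less informative about the launch itself; a forecaster who does not anticipate it forecasts the "spirit" number and loses out at resolution.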

It can be hard to tell how tightly your proxy is tied to what you care about, especially over longer time horizons. For example, maybe I want my resolution to depend on how some reputable organization reports on it. But, what if that organization changes methodology, leadership, focus, etc. over time such that its reporting diverges from my actual question at resolution time?

Goodhart’s Law and forecasting

Another form of alignment failure can happen if the optimization power of the forecasting platform is strong enough (e.g. if there’s lots of money at stake) that forecasters try to change the world to meet some resolution criteria in a way that makes the resolution criteria less correlated with what you care about. This is an example of Goodhart’s law, often summarized as “When a measure becomes a target, it ceases to be a good measure.”

For example, maybe the number of daily visits to the EA Forum is a reasonable proxy for size of the EA community. But, if the stakes are high enough in a forecasting setting, it’s potentially easy to artificially inflate those numbers using bots.

An additional problem (besides the forecast potentially becoming less useful for the original purposes) involves how the optimizers change the world to meet their goal. They might make the world a worse place in doing so, i.e. create negative externalities.

I don’t think this is a common problem, but it becomes more likely as the stakes get higher and for questions that individuals in the market are well-positioned to influence.[6]
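The tweet-count example in note 6 can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the function names and the numbers are hypothetical, and the resolution rule is the one described in the note (total tweets on the account at the end minus the total at the beginning).

```python
from typing import List


def net_count(total_at_start: int, total_at_end: int) -> int:
    """Resolution value as described in note 6: end total minus start total."""
    return total_at_end - total_at_start


def tweets_posted(events: List[str]) -> int:
    """Tweets actually posted during the week, ignoring deletions."""
    return sum(1 for event in events if event == "post")


# Hypothetical week: 40 posts, followed by 30 deletions prompted by a trader.
start_total = 1000
events = ["post"] * 40 + ["delete"] * 30
end_total = start_total + events.count("post") - events.count("delete")

print(net_count(start_total, end_total))  # 10
print(tweets_posted(events))              # 40
```

Because deletions lower the net count, anyone who can induce deletions can move the resolution value without changing how much the account actually tweets. If the goal was to predict tweeting activity, a specification that counts posts directly would be harder to influence in this particular way.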

Costs of misalignment

In the case of misalignment, there seems to be a tradeoff between costs to the forecasters and costs to the question-asker, depending on whether the misalignment is anticipated by forecasters or not:

There are also the costs of the negative externalities in the cases where the optimizing force of the prediction market causes the world to change, discussed above.

Tradeoff between ambiguity and alignment

I want to briefly point out the apparent tradeoff between the ambiguity and alignment of a question, in the context where some arbiter decides how the ambiguity is resolved.

When a forecasting question is unambiguous, the arbiter has little or no wiggle room to decide how the question will resolve. In many situations, this is beneficial to forecasters and thus the question-asker.

However, in a situation where misalignment seems likely (or it seems hard to create an unambiguous specification that is aligned), ambiguity might be beneficial. In this case, a trusted arbiter could resolve the question according to the spirit of the question.

Uncompelling questions

An uncompelling question is one that, while maybe unambiguous and well-specified, is just not interesting or important enough for forecasters to care about participating.

This is specifically a problem for platforms like Metaculus. While a forecaster might be modelled theoretically as someone trying to maximize their points, in reality, points are often secondary to the question actually being interesting/important for them to think about. Relatedly, these platforms and individuals on them have more limited attention, which might be less the case in markets with larger financial incentives.

So, this problem is less about the strong optimization force of the prediction platform being misaligned, and more about the limited optimization power of the platform being reduced.

Identifying and implementing better question specifications

This section will cover two questions:

How to write better specifications

How to change the specification

If you identify a better specification before you launch the question, then it’s simple to change it before it’s open for forecasters. However, if the problem is only identified after the launch of the question, it’ll be trickier.

Here are a few options:

Credits

This research is a project of Rethink Priorities.

It was written by Juan Gil, an intern at Rethink Priorities. Thanks to alexrjl, Michael Aird, Nuño Sempere, and Linch Zhang for helpful feedback and conversations that led to this post. If you like our work, please consider subscribing to our newsletter. You can see more of our work here.

Notes


  1. One issue that came up was whether “regularly” and “by” mean “the period leading up to the end of the specified time period”, or just “any period of time before the EOY 2021, inclusive”. ↩︎

  2. One possible resolution source was Supermeat’s test kitchen, which offered cultured meat that’s initially free. ↩︎

  3. Even after resolution, it was still sufficiently ambiguous that the judge was not certain that the resolution was correct (private info). ↩︎

  4. Note that these costs are only significant if the arbiter’s decision is not predictable. In many cases, the spirit of the question is clear and the arbiter is trustworthy, so forecasters can just predict based on the spirit of the question. ↩︎

  5. An old version of Augur suffered from a similar problem, and scammers profited from creating invalid questions. See here. ↩︎

  6. An example in which the stakes were high and the resolution was easy to influence, from Avraham Eisenberg on Polymarket questions: “There was a market on how many times Souljaboy would tweet during a given week. The way these markets are set up, they subtract the total number of tweets on the account at the beginning and end, so deletions can remove tweets. Someone went on his twitch stream, tipped a couple hundred dollars, and said he'd tip more if Soulja would delete a bunch of tweets. Soulja went on a deleting spree and the market went crazy. Multiple people made over 10k on this market; at least one person made 30k and at least one person lost 15k.” I’m not sure what the point of this question was in the first place, so I’m not sure how costly this was to the question-asker, but if the goal was to predict how much Soulja tweets on an average week, this market certainly failed to do that. ↩︎

  7. Tachyons could be used for this purpose. ↩︎

  8. In this case, you may need to truncate the points awarded to keep the scoring rule proper, though I haven’t thought about it in depth. See here for motivation about why that might be the case. ↩︎


Aaron Gertler @ 2021-07-21T21:54 (+6)

This is a nice reference!

When you publish a post like this (explaining a major subtopic, as "specification problems" are for the topic of "forecasting"), I recommend looking at the EA Wiki article for the topic in case you see a chance to update it. 

This could mean adding your post to the bibliography, updating the article text to reference the existence of specification problems, etc.