Prize: Interesting Examples of Evaluations
By Elizabeth @ 2020-11-28T21:11 (+26)
TLDR
Submit ideas of “interesting evaluations” in the comments. The best one by December 5th will get $50. All of them will be highly appreciated.
Motivation
A few of us (myself, Nuño Sempere, and Ozzie Gooen), have been working recently to better understand how to implement meaningful evaluation systems for EA/rationalist research and projects. This is important both for short-term use (so we can better understand how valuable EA/rationalist research is) and for long-term use (as in, setting up scalable forecasting systems on qualitative parameters). In order to understand this problem, we've been investigating evaluations specific to research and evaluations in a much broader sense.
We expect work in this area to be useful for a wide variety of purposes. For instance, even if Certificates of Impact eventually get used as the primary mode of project evaluation, purchasers of certificates will need strategies to actually do the estimation.
Existing writing on “evaluations” seems to be fairly domain-specific (only focused on Education or Nonprofits), one-sided (yay evaluations or boo evaluations), or both. This often isn’t particularly useful when trying to understand the potential gains and dangers of setting up new evaluation systems.
I’m now investigating a neutral history of evaluations, with the goal of identifying trends in what helps or hinders an evaluation system in achieving its goals. The ideal output of this stage would be an absolutely comprehensive list, which will be posted to LessWrong. While that is probably impractical, we can hopefully make one comprehensive enough, especially with your help.
Task
Suggest an interesting example (or examples) of an evaluation system. For these purposes, evaluation means "a systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards", but if you think of something that doesn't seem to fit, err on the side of inclusion.
Prize
The prize is $50 for the top submission.
Rules
To enter, submit a comment suggesting an interesting example below, before the 5th of December. This post is both on LessWrong and the EA Forum, so comments on either count.
Rubric
To hold true to the spirit of the project, we have a rubric evaluation system to score this competition. Entries will be evaluated using the following criteria:
- Usefulness/uniqueness of lesson from the example
- Novelty or surprise of the entry itself, for Elizabeth
- Novelty of the lessons learned from the entry, for Elizabeth.
Accepted Submission Types
I care about finding interesting things more than proper structure. Here are some types of entries that would be appreciated:
- A single example in one of the categories already mentioned
- Four paragraphs on an unusual exam and its interesting impacts
- A babbled list of 104 things that vaguely sound like evaluations
Examples of Interesting Evaluations
We have a full list here, but below is a subset to not anchor you too much. Don't worry about submitting duplicates: I’d rather risk a duplicate than miss an example.
- Chinese Imperial Examination
- Westminster Dog Show
- Turing Test
- Consumer Reports Product Evaluations
- Restaurant Health Grades
- Art or Jewelry Appraisal
- ESGs/Socially Responsible Investing Company Scores
- “Is this porn?”
- Legally?
- For purposes of posting on Facebook?
- Charity Cost-Effectiveness Evaluations
- Judged Sports (e.g. Gymnastics)
Motivating Research
These are some of our previous related posts:
- Shallow Review of Consistency in Statement Evaluation
- Can we hold intellectuals to similar public standards as athletes?
- Prediction-Augmented Evaluation Systems
- Can We Place Trust in Post-AGI Forecasting Evaluations?
- ESC Process Notes: Claim Evaluation vs. Syntheses
- Predicting the Value of Small Altruistic Projects: A Proof of Concept Experiment
EdoArad @ 2020-11-28T22:40 (+16)
Babble!
- Psychological evaluation
- Job interview
- Debate competition judge
- Using emotion recognition (say by image recognition) to find out consumers' preferences
- Measuring Pavlov's dog saliva
- Debate in ancient Greece
- Factored cognition
- forecasting
- karma on the forum / reddit
- democratic voting
- Stock prices as an evaluation of a company's value
- Bibliometrics. Impact factor.
- Using written recommendations to evaluate candidates.
- Measuring truth-telling using a polygraph
- Justice system evaluation of how bad crimes are based on previous cases
- Justice system use of a Jury
- Lottery - random evaluation
- Measuring dopamine signals as a proxy to a fly's brain valence (which is an evaluation of its situation)
- throw stuff into a neural net
- python
- Discrete Valuation Rings
- Signaling value using jewels.
- Evaluation based on social class
- fight to the death
- torturing people until they confess
- market price
- A mathematical exam
- A high-school history exam
- An ADHD test
- Stress testing a phone by putting it in extreme situations
- checking if a car is safe by using a crash dummy and checking impact force
- Software testing
- Open source as a signal of "someone had looked into me and I'm still fine"
- Colonoscopy
- number of downloads for an app
- running polls
- running stuff by experts
- asking god what she thinks of it
- The choice of a pope
- Public consensus, 50 years down the line
- RCT
- broad population study
- Nobel prize committee
- Testing purity of chemical ingredients
- Testing problems in chip manufacturing
- Reproduce a study/project and see if the results replicate
- Set quantitative criteria in advance, and check the results after the fact
- ask people in advance what they think the results will be, then ask people to evaluate the results afterward, focusing the evaluation on the parameters that people did not consider beforehand
- Adequacy analysis (like in Inadequate Equilibria)
- a flagging system for moderators
- New York Times Best Seller list
- subjective evaluation
- subjective evaluation when on drugs
- subjective evaluation by psychopaths (which are also perfect utilitarians!)
- subjective evaluation by a color choosing octopus
- Managerial decisions (a 15-minute powerpoint presentation and then an arbitrary decision)
- Share-holder reports
- bottleneck/limiting-factor analysis
- Crucial considerations
- Theory of Change model
- Taking a set amount of time to critically analyze the subject, focusing on finding as many downsides as possible.
- Using weights and a two-sided scale to measure goods.
- Setting a benchmark that one only evaluates against.
- A referee evaluating a Boxing match
- Using score for football
- Buying a car - getting the information from the seller and assessing their truthfulness
- Looking at a fancy report and judging based on length, images, and businessy words
- peer review in science
- citations count
- journal status
- grant making - assessing requests, say by scoring according to a fixed scoring template
- evaluating a scoring template by comparing similarity of different people's scoring of the same text
- Code review
- Fact-Checking
- Editor going through a text
- 360 peer feedback - sociometry
- gut intuition after long relationship/experience
- Amazon Reviews
- ELO
- Chess engine position assessment
- Theoretical assessment of a chess position - experts explaining what is good or bad about the position
- Running a tournament starting with this position, evaluating based on success percentage
- multiple choice exam
- political lobbying for or against something
- the grandma test
It was fun! Hope that something here might be helpful :)
alexrjl @ 2020-11-29T10:27 (+9)
Difficulty ratings in outdoor rock-climbing
Common across all types of climbing are the following features of grades:
- The first person to climb a route "proposes" a grade based on their subjective assessment of its difficulty.
- Other people who manage the same climb may suggest a different grade. Often the grade of a climb will not be agreed upon in the community until several ascents have been made.
- Climbing guidebooks publish grades, typically based on the authors' opinion of current consensus, though some online platforms where people can vote on grades exist.
- Grades can change even after a consensus has appeared stable. This might be due to a hold breaking, however it may also be due to a new sequence being discovered.
- Grades tend to approach a single stable point, even though body shape and size (particularly height and arm span) can make a large difference to difficulty.
There are many different grading systems for different types of climb; a good overview is here. Some differences of interest:
- While most systems grade the overall difficulty of the entire climb, British trad climbs have two grades, neither of which purely maps to overall difficulty. The first describes a combination of overall difficulty and safety (so an unsafe but easy climb may have a higher rating than a safe one), the second describes the difficulty only of the hardest move or short sequence (which can be very different from the overall difficulty, as endurance is a factor).
- Aid climbs, which allow climbers to use ropes to aid their movement rather than only for protection, are graded separately. However, other technology is not considered "aid". In particular, climbing grades have steadily increased over time, at least in part due to the development of better shoe technology. More recently, the development of rubberised kneepads has led to several notable downgrades of hard boulders and routes, as the kneepads make much longer rests possible.
I think climbing grading is interesting because the grades emerge from a complex set of social interactions, and despite most climbers frequently saying things like "grades are subjective" and "grades don't really matter", they generally remain remarkably stable, and important to many climbers.
Misha_Yagudin @ 2020-12-12T17:41 (+7)
Correlating subjective metrics with objective outcomes to provide better intuitions about what an additional point on a scale might mean. The resulting intuitions still suffer from "correlation ≠ causation" and all the curses of self-reported data (which, in my opinion, makes such measurements close to useless), but it is a step forward.
See this tweet and the whole thread https://twitter.com/JessieSunPsych/status/1333086463232258049 h/t Guzey
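The idea above can be sketched in code: correlate a subjective scale with an objective outcome, and report the mean outcome at each scale point so "one more point" gets a concrete interpretation. This is a minimal illustration with fabricated data; the variable names and numbers are invented, not taken from the linked thread.

```python
# Sketch: relate a subjective 1-7 self-report scale to an objective
# outcome by (a) correlating the two series and (b) computing the mean
# outcome at each scale point. All data below is fabricated.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def mean_outcome_per_point(scores, outcomes):
    """Average objective outcome for each subjective scale point."""
    buckets = {}
    for s, o in zip(scores, outcomes):
        buckets.setdefault(s, []).append(o)
    return {s: sum(v) / len(v) for s, v in sorted(buckets.items())}

# Fabricated example: self-rated well-being (1-7) vs. an objective
# behavioural measure (here, days socializing per month).
scores = [2, 3, 3, 4, 5, 5, 6, 7]
outcomes = [3, 4, 6, 8, 9, 12, 14, 18]

r = pearson(scores, outcomes)
per_point = mean_outcome_per_point(scores, outcomes)
```

Here `per_point` maps each scale value to an average outcome, which is the kind of table that turns an abstract "one point higher" into something interpretable.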
Misha_Yagudin @ 2020-12-12T17:45 (+1)
Huh! The thread I linked to and David Manheim's winning comment cite the same paper :)
Tetraspace Grouping @ 2020-12-06T19:55 (+7)
Simple linear models, including improper ones(!!). In Chapter 21 of Thinking Fast and Slow, Kahneman writes about Meehl's book Clinical vs. Statistical Prediction: A Theoretical Analysis and a Review, which finds that simple algorithms, made by getting some factors related to the final judgement and weighting them, give you surprisingly good results.
The number of studies reporting comparisons of clinical and statistical predictions has increased to roughly two hundred, but the score in the contest between humans and algorithms has not changed. About 60% of the studies have shown significantly better accuracy for the algorithms. The other comparisons scored a draw in accuracy [...]
If they are weighted optimally to predict the training set, they're called proper linear models, and otherwise they're called improper linear models. Kahneman says about Dawes' The Robust Beauty of Improper Linear Models in Decision Making that
A formula that combines these predictors with equal weights is likely to be just as accurate in predicting new cases as the multiple-regression formula that was optimal in the original sample. More recent research went further: formulas that assign equal weights to all the predictors are often superior, because they are not affected by accidents of sampling.
That is to say: to evaluate something, you can get very far just by coming up with a set of criteria that positively correlate with the overall result and with each other and then literally just adding them together.
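The recipe in that last sentence is short enough to sketch directly: standardize each criterion so they are on a common scale, give every criterion the same weight, and add. This is a minimal Dawes-style equal-weighting sketch with fabricated candidate data; the column names and numbers are invented for illustration.

```python
# Sketch of an "improper" linear model: z-score each predictor column,
# weight every predictor equally (+1), and sum. Data is fabricated.

def zscores(xs):
    """Standardize a list to mean 0 and (population) standard deviation 1."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return [(x - m) / sd for x in xs]

def improper_score(rows):
    """Equal-weight sum of z-scored predictor columns, one score per row."""
    cols = [zscores(col) for col in zip(*rows)]
    return [sum(vals) for vals in zip(*cols)]

# Toy candidate-evaluation data: each row is (interview, test, references),
# all oriented so that higher = better.
candidates = [
    (3, 60, 2),
    (5, 75, 3),
    (4, 70, 4),
    (8, 90, 5),
    (7, 85, 4),
]

scores = improper_score(candidates)
best = max(range(len(candidates)), key=lambda i: scores[i])
```

No regression fit is involved: the only "modelling" is choosing criteria that correlate with the outcome, which is exactly the robustness point Dawes makes.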
Elizabeth @ 2020-12-08T18:09 (+6)
Winner
Last week we announced a prize for the best example of an evaluation. The winner of the evaluations prize is David Manheim, for his detailed suggestions on quantitative measures in psychology. I selected this answer because, although IAT was already on my list, David provided novel information about multiple tests that saved me a lot of work in evaluating them. David has had involvement with QURI (which funded this work) in the past and may again in the future, so this feels a little awkward, but ultimately it was the best suggestion so it didn’t feel right to take the prize away from him.
Honorable mentions to Orborde on financial stress tests, which was a very relevant suggestion that I was unfortunately already familiar with, and alexrjl on rock climbing route grades, which I would never have thought of in a million years but has less transferability to the kinds of things we want to evaluate.
Post-Mortem
How useful was this prize? I think running the contest was more useful than $50 of my time; however, it was not as useful as it could have been, because the target moved after we announced the contest. I went from writing about evaluations as a whole to specifically evaluations that worked, and I’m sure if I’d asked for examples of that they would have been provided. So possibly I should have waited to refine my question before asking for examples. On the other hand, the project was refined in part by looking at a wide array of examples (generated here and elsewhere), and it might have taken longer to home in on a specific facet without the contest.
Davidmanheim @ 2020-12-10T06:55 (+7)
Thanks - I'm happy to see that this was useful, and strongly encourage prize-based crowdsourcing like this in the future, as it seems to work well.
That said, given my association with QURI, I elected to have the prize money donated to Givewell.