Prize: Interesting Examples of Evaluations
By Elizabeth @ 2020-11-28T21:11 (+26)
TLDR
Submit ideas of “interesting evaluations” in the comments. The best one by December 5th will get $50. All of them will be highly appreciated.
Motivation
A few of us (myself, Nuño Sempere, and Ozzie Gooen), have been working recently to better understand how to implement meaningful evaluation systems for EA/rationalist research and projects. This is important both for short-term use (so we can better understand how valuable EA/rationalist research is) and for long-term use (as in, setting up scalable forecasting systems on qualitative parameters). In order to understand this problem, we've been investigating evaluations specific to research and evaluations in a much broader sense.
We expect work in this area to be useful for a wide variety of purposes. For instance, even if Certificates of Impact eventually get used as the primary mode of project evaluation, purchasers of certificates will need strategies to actually do the estimation.
Existing writing on “evaluations” seems to be fairly domain-specific (only focused on Education or Nonprofits), one-sided (yay evaluations or boo evaluations), or both. This often isn’t particularly useful when trying to understand the potential gains and dangers of setting up new evaluation systems.
I’m now investigating a neutral history of evaluations, with the goal of identifying trends in what helps or hinders an evaluation system in achieving its goals. The ideal output of this stage would be an absolutely comprehensive list, which will be posted to LessWrong. While that is probably impractical, we can hopefully make one comprehensive enough, especially with your help.
Task
Suggest an interesting example (or examples) of an evaluation system. For these purposes, evaluation means "a systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards", but if you think of something that doesn't seem to fit, err on the side of inclusion.
Prize
The prize is $50 for the top submission.
Rules
To enter, submit a comment suggesting an interesting example below, before the 5th of December. This post is both on LessWrong and the EA Forum, so comments on either count.
Rubric
To hold true to the spirit of the project, we have a rubric evaluation system to score this competition. Entries will be evaluated using the following criteria:
- Usefulness/uniqueness of lesson from the example
- Novelty or surprise of the entry itself, for Elizabeth
- Novelty of the lessons learned from the entry, for Elizabeth.
Accepted Submission Types
I care about finding interesting things more than proper structure. Here are some types of entries that would be appreciated:
- A single example in one of the categories already mentioned
- Four paragraphs on an unusual exam and its interesting impacts
- A babbled list of 104 things that vaguely sound like evaluations
Examples of Interesting Evaluations
We have a full list here, but below is a subset to not anchor you too much. Don't worry about submitting duplicates: I’d rather risk a duplicate than miss an example.
- Chinese Imperial Examination
- Westminster Dog Show
- Turing Test
- Consumer Reports Product Evaluations
- Restaurant Health Grades
- Art or Jewelry Appraisal
- ESGs/Socially Responsible Investing Company Scores
- “Is this porn?”
- Legally?
- For purposes of posting on Facebook?
- Charity Cost-Effectiveness Evaluations
- Judged Sports (e.g. Gymnastics)
Motivating Research
These are some of our previous related posts:
- Shallow Review of Consistency in Statement Evaluation
- Can we hold intellectuals to similar public standards as athletes?
- Prediction-Augmented Evaluation Systems
- Can We Place Trust in Post-AGI Forecasting Evaluations?
- ESC Process Notes: Claim Evaluation vs. Syntheses
- Predicting the Value of Small Altruistic Projects: A Proof of Concept Experiment
EdoArad @ 2020-11-28T22:40 (+16)
Babble!
- Psychological evaluation
- Job interview
- Debate competition judge
- Using emotion recognition (say by image recognition) to find out consumers' preferences
- Measuring Pavlov's dog saliva
- Debate in ancient Greece
- Factored cognition
- forecasting
- karma on the forum / reddit
- democratic voting
- Stock prices as an evaluation of a company's value
- Bibliometrics. Impact factor.
- Using written recommendations to evaluate candidates.
- Measuring truth-telling using a polygraph
- Justice system evaluation of how bad crimes are based on previous cases
- Justice system use of a Jury
- Lottery - random evaluation
- Measuring dopamine signals as a proxy to a fly's brain valence (which is an evaluation of its situation)
- throw stuff into a neural net
- python
- Discrete Valuation Rings
- Signaling value using jewels.
- Evaluation based on social class
- fight to the death
- torturing people until they confess
- market price
- A mathematical exam
- A high-school history exam
- An ADHD test
- Stress testing a phone by putting it in extreme situations
- checking if a car is safe by using a crash dummy and checking impact force
- Software testing
- Open source as a signal of "someone had looked into me and I'm still fine"
- Colonoscopy
- number of downloads for an app
- running polls
- running stuff by experts
- asking god what she thinks of it
- The choice of a pope
- Public consensus, 50 years down the line
- RCT
- broad population study
- Nobel prize committee
- Testing purity of chemical ingredients
- Testing problems in chip manufacturing
- Reproduce a study/project and see if the results replicate
- Set quantitative criteria in advance, and check the results after the fact
- ask people in advance what they think the results will be, then ask people to evaluate the results afterward, focusing the evaluation on the parameters that people did not consider beforehand
- Adequacy analysis (like in Inadequate Equilibria)
- a flagging system for moderators
- New York Times Best Seller list
- subjective evaluation
- subjective evaluation when on drugs
- subjective evaluation by psychopaths (which are also perfect utilitarians!)
- subjective evaluation by a color choosing octopus
- Managerial decisions (a 15-minute powerpoint presentation and then an arbitrary decision)
- Share-holder reports
- bottleneck/limiting-factor analysis
- Crucial considerations
- Theory of Change model
- Taking a set amount of time to critically analyze the subject, focusing on finding as many downsides as possible.
- Using weights and a two-sided scale to measure goods.
- Setting a benchmark that one only evaluates against.
- A referee evaluating a Boxing match
- Using score for football
- Buying a car - getting the information from the seller and assessing their truthfulness
- Looking at a fancy report and judging based on length, images, and businessy words
- peer review in science
- citations count
- journal status
- grant making - assessing requests, say by scoring according to a fixed scoring template
- evaluating a scoring template by comparing similarity of different people's scoring of the same text
- Code review
- Fact-Checking
- Editor going through a text
- 360 peer feedback - sociometry
- gut intuition after long relationship/experience
- Amazon Reviews
- ELO
- Chess engine position assessment
- Theoretical assessment of a chess position - experts explaining what is good or bad about the position
- Running a tournament starting with this position, evaluating based on success percentage
- multiple choice exam
- political lobbying for or against something
- the grandma test
It was fun! Hope that something here might be helpful :)
alexrjl @ 2020-11-29T10:27 (+9)
Difficulty ratings in outdoor rock-climbing
Common across all types of climbing are the following features of grades:
- The first person to climb a route "proposes" a grade based on their subjective assessment of its difficulty.
- Other people who manage the same climb may suggest a different grade. Often the grade of a climb will not be agreed upon in the community until several ascents have been made.
- Climbing guidebooks publish grades, typically based on the authors' opinion of current consensus, though some online platforms where people can vote on grades exist.
- Grades can change even after a consensus has appeared stable. This might be due to a hold breaking, however it may also be due to a new sequence being discovered.
- Grades tend to approach a single stable point, even though body shape and size (particularly height and arm span) can make a large difference to difficulty.
There are many different grading systems for different types of climb; a good overview is here. Some differences of interest:
- While most systems grade the overall difficulty of the entire climb, British trad climbs have two grades, neither of which purely maps to overall difficulty. The first describes a combination of overall difficulty and safety (so an unsafe but easy climb may have a higher rating than a safe one), the second describes the difficulty only of the hardest move or short sequence (which can be very different from the overall difficulty, as endurance is a factor).
- Aid climbs, which allow climbers to use ropes to aid their movement rather than only for protection, are graded separately. However, other technology is not considered "aid". In particular, climbing grades have steadily increased over time, at least in part due to the development of better shoe technology. More recently, the development of rubberised kneepads has led to several notable downgrades of hard boulders and routes, as the kneepads make much longer rests possible.
I think climbing grading is interesting because the grades emerge from a complex set of social interactions, and despite most climbers frequently saying things like "grades are subjective" and "grades don't really matter", they generally remain remarkably stable, and important to many climbers.
Misha_Yagudin @ 2020-12-12T17:41 (+7)
Correlating subjective metrics with objective outcomes to provide better intuitions about what an additional point on a scale might mean. The resulting intuitions still suffer from "correlation ≠ causation" and all the curses of self-reported data (which, in my opinion, makes such measurements close to useless), but it is a step forward.
See this tweet and the whole thread https://twitter.com/JessieSunPsych/status/1333086463232258049 h/t Guzey
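The idea above can be sketched in code: correlate a subjective scale with an objective outcome, and report the mean outcome at each scale point so "one more point" gets a concrete interpretation. This is a minimal illustration with fabricated data; the variable names and numbers are invented, not taken from the linked thread.

```python
# Sketch: relate a subjective 1-7 self-report scale to an objective
# outcome by (a) correlating the two series and (b) computing the mean
# outcome at each scale point. All data below is fabricated.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def mean_outcome_per_point(scores, outcomes):
    """Average objective outcome for each subjective scale point."""
    buckets = {}
    for s, o in zip(scores, outcomes):
        buckets.setdefault(s, []).append(o)
    return {s: sum(v) / len(v) for s, v in sorted(buckets.items())}

# Fabricated example: self-rated well-being (1-7) vs. an objective
# behavioural measure (here, days socializing per month).
scores = [2, 3, 3, 4, 5, 5, 6, 7]
outcomes = [3, 4, 6, 8, 9, 12, 14, 18]

r = pearson(scores, outcomes)
per_point = mean_outcome_per_point(scores, outcomes)
```

Here `per_point` maps each scale value to an average outcome, which is the kind of table that turns an abstract "one point higher" into something interpretable.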
Misha_Yagudin @ 2020-12-12T17:45 (+1)
Huh! The thread I linked to and David Manheim's winning comment cite the same paper :)
Tetraspace Grouping @ 2020-12-06T19:55 (+7)
Simple linear models, including improper ones(!!). In Chapter 21 of Thinking Fast and Slow, Kahneman writes about Meehl's book Clinical vs. Statistical Prediction: A Theoretical Analysis and a Review, which finds that simple algorithms, made by getting some factors related to the final judgement and weighting them, give you surprisingly good results.
The number of studies reporting comparisons of clinical and statistical predictions has increased to roughly two hundred, but the score in the contest between humans and algorithms has not changed. About 60% of the studies have shown significantly better accuracy for the algorithms. The other comparisons scored a draw in accuracy [...]
If they are weighted optimally to predict the training set, they're called proper linear models, and otherwise they're called improper linear models. Kahneman says about Dawes' The Robust Beauty of Improper Linear Models in Decision Making that
A formula that combines these predictors with equal weights is likely to be just as accurate in predicting new cases as the multiple-regression formula that was optimal in the original sample. More recent research went further: formulas that assign equal weights to all the predictors are often superior, because they are not affected by accidents of sampling.
That is to say: to evaluate something, you can get very far just by coming up with a set of criteria that positively correlate with the overall result and with each other and then literally just adding them together.
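The recipe in that last sentence is short enough to sketch directly: standardize each criterion so they are on a common scale, give every criterion the same weight, and add. This is a minimal Dawes-style equal-weighting sketch with fabricated candidate data; the column names and numbers are invented for illustration.

```python
# Sketch of an "improper" linear model: z-score each predictor column,
# weight every predictor equally (+1), and sum. Data is fabricated.

def zscores(xs):
    """Standardize a list to mean 0 and (population) standard deviation 1."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return [(x - m) / sd for x in xs]

def improper_score(rows):
    """Equal-weight sum of z-scored predictor columns, one score per row."""
    cols = [zscores(col) for col in zip(*rows)]
    return [sum(vals) for vals in zip(*cols)]

# Toy candidate-evaluation data: each row is (interview, test, references),
# all oriented so that higher = better.
candidates = [
    (3, 60, 2),
    (5, 75, 3),
    (4, 70, 4),
    (8, 90, 5),
    (7, 85, 4),
]

scores = improper_score(candidates)
best = max(range(len(candidates)), key=lambda i: scores[i])
```

No regression fit is involved: the only "modelling" is choosing criteria that correlate with the outcome, which is exactly the robustness point Dawes makes.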
Elizabeth @ 2020-12-08T18:09 (+6)
Winner
Last week we announced a prize for the best example of an evaluation. The winner of the evaluations prize is David Manheim, for his detailed suggestions on quantitative measures in psychology. I selected this answer because, although IAT was already on my list, David provided novel information about multiple tests that saved me a lot of work in evaluating them. David has had involvement with QURI (which funded this work) in the past and may again in the future, so this feels a little awkward, but ultimately it was the best suggestion so it didn’t feel right to take the prize away from him.
Honorable mentions to Orborde on financial stress tests, which was a very relevant suggestion that I was unfortunately already familiar with, and alexrjl on rock climbing route grades, which I would never have thought of in a million years but has less transferability to the kinds of things we want to evaluate.
Post-Mortem
How useful was this prize? I think running the contest was more useful than $50 of my time; however, it was not as useful as it could have been, because the target moved after we announced the contest. I went from writing about evaluations as a whole to specifically evaluations that worked, and I’m sure if I’d asked for examples of that they would have been provided. So possibly I should have waited to refine my question before asking for examples. On the other hand, the project was refined in part by looking at a wide array of examples (generated here and elsewhere), and it might have taken longer to home in on a specific facet without the contest.
Davidmanheim @ 2020-12-10T06:55 (+7)
Thanks - I'm happy to see that this was useful, and strongly encourage prize-based crowdsourcing like this in the future, as it seems to work well.
That said, given my association with QURI, I elected to have the prize money donated to Givewell.