Reasons for optimism about measuring malevolence to tackle x- and s-risks

By Jamie_Harris @ 2024-04-02T10:26 (+85)

Reducing the influence of malevolent actors seems useful for reducing existential risks (x-risks) and risks of astronomical suffering (s-risks). One promising strategy for doing this is to develop manipulation-proof measures of malevolence.

I think better measures would be useful because:

We could use them with various high-leverage groups, like politicians or AGI lab staff.
We could use them flexibly (for information-only purposes) or with hard cutoffs.
We could use them in initial selection stages, before promotions, or during reviews.
We could spread them more widely via HR companies or personal genomics companies.
We could use small improvements in measurements to secure early adopters.

I think we can make progress on developing and using them because:

It’s neglected, so there will be low-hanging fruit
There’s historical precedent for tests and screening
We can test on EA orgs
Progress might be profitable
The cause area has mainstream potential

So let’s get started on some concrete research!

Context

~4 years ago, David Althaus and Tobias Baumann posted about the impact potential of “Reducing long-term risks from malevolent actors”. They argued that:

Dictators who exhibited highly narcissistic, psychopathic, or sadistic traits were involved in some of the greatest catastrophes in human history. Malevolent individuals in positions of power could negatively affect humanity’s long-term trajectory by, for example, exacerbating international conflict or other broad risk factors. Malevolent humans with access to advanced technology—such as whole brain emulation or other forms of transformative AI—could cause serious existential risks and suffering risks… Further work on reducing malevolence would be valuable from many moral perspectives and constitutes a promising focus area for longtermist EAs.

I and many others were impressed with the post. It got lots of upvotes on the EA Forum and 80,000 Hours listed it as an area that they’d be “equally excited to see some of our readers… pursue” as their list of the most pressing world problems. But I haven’t seen much progress on the topic since.

One of the main categories of interventions that Althaus and Baumann proposed was “The development of manipulation-proof measures of malevolence… [which] could be used to screen for malevolent humans in high-impact settings, such as heads of government or CEOs.” Anecdotally, I’ve encountered scepticism that this would be either tractable or particularly useful, which surprised me. I seem to be more optimistic than anyone I’ve spoken to about it, so I’m writing up some thoughts explaining my intuitions.

My research has historically been of the form: “assuming we think X is good, how do we make X happen?” This post is in a similar vein, except it’s more ‘initial braindump’ than ‘research’. It’s more focused on steelmanning the case for these measures than on coming to a balanced, overall assessment.

I think better measures would be useful

We could use difficult-to-game measures of malevolence with various high-leverage groups:

Political candidates
Civil servants and others involved in the policy process
Staff at A(G)I labs
Staff at organisations inspired by effective altruism.

Some of these groups might be more tractable to focus on first, e.g. EA orgs. And we could test in less risky environments first, e.g. smaller AI companies before frontier labs, or bureaucratic policy positions before public-facing political roles.

The measures could be binding or used flexibly, for information-only purposes. For example, in a hiring process, there could either be some malevolence threshold above which a candidate is rejected without question, or test(s) for malevolent traits could just be used as one piece of information in a wider process, which the hiring manager could use as they pleased. Of course, flexible uses would be easier first stepping stones, even in cases where we’d hope for binding thresholds to be used eventually.

These measures could potentially be used at various stages of selection and review:

Initial selection, screening, or hiring of candidates
Before a promotion or additional responsibility
As part of reviews or performance evaluations.

For example, in the case of a politician, they could be:

Screened for malevolent traits by the party machine, when deciding whether to let them run for election in the first place
Screened before being offered a government ministerial position
Screened before re-election, or in response to some specific concerning behaviour.^[1]

I’m most optimistic about the earliest stages: it’s generally easier to just reject a candidate than to fire them, and there's more of an existing culture of rigorous checks as part of application processes than as part of review processes. But we tend to have higher expectations of individuals with more responsibility (e.g. we expect more from the Prime Minister than from a Member of Parliament, and far more than from a local councillor), which might make some types of screening before additional responsibilities more promising. And various random factors might make an organisation more amenable to using screening measures in evaluations than in initial hiring at a given time.^[2]

There also seem to be some opportunities for wider dissemination and use, via business-to-business HR companies that offer candidate screening services. (Or possibly via personal genomics companies like 23andMe, sperm banks, IVF clinics, etc, though the benefits seem lower and the risks higher in these genetic contexts.)

Here, the theory of change is more diffuse, but benefits of more widespread use could include:

Increased culture of screening against malevolence, which in turn influences the higher-leverage organisations we’re most interested in.
Increased awareness and interest in advancing the science of malevolence more generally, which helps us develop better measures and implement the safeguards where they’re most needed.
Better data about malevolence, making relevant scientific advancements more feasible.
Increased awareness of the issues and therefore political will to implement anti-malevolence measures.
(Reduced frequency of highly malevolent individuals through parents selecting against embryos with high risk of malevolence.)

Change doesn’t happen overnight. I expect positive feedback loops between improved science/tech and practical implementation. We can use better measures of malevolence to advocate that potential early adopters start introducing these measures. This initial interest might, in turn, spur further interest in developing the science, which we can then use to advocate for further implementation of screening, and so on.

I think we can make progress^[3]

1. It’s neglected, so there will be low-hanging fruit

Neglected problems (and intervention types) tend to have low-hanging fruit that can be plucked. We don’t need to ‘fix’ malevolence overnight, or suddenly develop perfect, manipulation-proof tests of malevolence, in order to start making progress on tackling the problem and reducing the likelihood of catastrophic outcomes. Noisy and imperfect screening against malevolence is probably better than no screening.^[4]

2. There’s historical precedent for tests and screening

The use of security clearances in government and state positions provides precedent for binding measures to be introduced. Of course, security clearances aren’t primarily^[5] testing a personality trait, so this isn’t a perfect analogy for tests of dark tetrad traits.

But there are a plethora of non-binding tests and screening measures that provide broader precedent. Some relevant analogies include:

The extensive use of polygraph tests (lie detectors) in US government and law enforcement, despite their unreliability (Nelson, 2015 suggests 77-95% accuracy) and several legal rulings against them.
Public pressure on politicians to reveal their tax returns (and other documents?).^[6]
Businesses and other institutions using cognitive ability tests or personality tests as part of their hiring processes, e.g. Google. (I’m not sure if these are ever binding with pre-set thresholds?)
Reference checks and other ‘due diligence’ on candidates both for hiring and promotion.

3. We can test on EA orgs

Organisations inspired by or associated with effective altruism, longtermism, or rationalism (hereafter just ‘EA orgs’) could be the guinea pigs for tests that are developed.

It may be especially useful for EA orgs — by their own goals — to screen against malevolence:

Talent search organisations often offer irrelevant incentives such as money, prestige, and credentials to attract promising applicants, but in doing so also attract ‘grifters’. Application processes designed to identify altruistic intent, scout mindset, and other desirable traits can sometimes be sidestepped and exploited by people who are willing to exaggerate or lie… which seems especially likely among malevolent individuals. Speaking as someone who runs such an organisation, I’d be pretty relieved to be able to test (reliably) for malevolence.
My guess is that organisations explicitly focused on having positive social impact are more vulnerable to bad PR arising from malevolent actions of (ex-)employees or participants. People hate hypocrisy.
It seems plausible (to me) that effective altruism and longtermism are disproportionately attractive to people high on at least some malevolent traits.^[7] If EA orgs share this intuition, they might be more concerned about the risks of malevolence than other orgs.

Even beyond these reasons, I expect EA orgs to be more willing to incur some costs to help advance a project that may help reduce s-risks and x-risks. Tractability aside, it also seems genuinely pretty useful to screen against malevolence among these organisations, given that they may have non-negligible influence over the trajectory of the future.

4. Progress might be profitable

There might be profitable products that could be developed that help to advance the battle against malevolence. Again, these could be sold by business-to-business HR companies that offer candidate screening services (or personal genomics companies, sperm banks, IVF clinics, etc).

Advances might come from advocacy to established companies, or from mission-driven entrepreneurship. This might open up substantial additional resources.

Of course, profitable opportunities rely on demand for screening services, plus sufficiently strong science and technology to offer tests. I’m not sure if the conditions are yet ripe for any profitable businesses in this space. But if not, then initial progress on the relevant science, tech, or advocacy might “unlock” new resources by creating new, profitable opportunities.

5. The cause area has mainstream potential

Okay, there are super controversial (and plausibly genuinely bad!) aspects to at least some intervention possibilities for reducing the influence of malevolent actors. But not all of them... and the general case for reducing the influence of malevolent actors seems very robust across worldviews and priorities. It seems good for tackling both x-risks and s-risks, as well as for generally increasing justice, integrity, and other things that most people agree are good.^[8]

As well as the detailed, rigorous, academic research that is needed, I can also imagine a very wide range of different contributions to advancing this cause, making it much more accessible than, say, research relevant to cooperation and conflict between AIs:

There are potential advocacy projects,^[9] which opens up a whole host of classic nonprofit and social movement roles. Think large volunteer bases, campaign and marketing roles, nonprofit management roles, etc.
As noted above, it might be profitable, opening up classic for-profit business roles.
Psychology seems like the most relevant academic background here, which isn’t the case for any of the current most popular longtermist cause areas. Psychology is pretty popular as a PhD choice.^[10] So developing this area might just open up opportunities for different sorts of people to contribute to reducing s-risks and x-risks.

Concrete research ideas

We can summarise existing science and collect relevant insights through (systematic) literature reviews. E.g. on^[11]:

The neurobiology of dark tetrad traits
Difficult-to-game measures in general (e.g. fMRI, EEG, startle reactivity, pupillometry, MEG, fNIRS, HRV, implicit bias tests, 360 degree feedback)
Existing research from difficult-to-game methods that already focuses on malevolent traits (here are two examples from startle reactivity and pupillometry)
Analogous tests already used for screening (e.g. polygraph tests, cognitive ability tests, personality tests)
The correlates and causes of dark tetrad traits.

I think literature reviews are usually pretty tractable even for (junior) researchers without relevant prior expertise or training. Cost-effectiveness analyses and feasibility assessments of manipulation-proof measures (at scale) could be useful. Alternatively, social psychologists could focus on developing better constructs and measures of malevolence, even if the tests are easily gameable.^[12]

There are also a number of research projects that aren’t focused on developing manipulation-proof measures themselves, but could help to (1) better understand the promise of the cause area, (2) learn practical lessons about pathways to implementation, (3) gain examples to point towards to help make advocacy more credible:

A broad investigation into the most relevant existing parallels for malevolence tests in various institutional contexts. This could be very ‘practical’, rather than ‘academic’.
More detailed case studies of the historical use of some of these technologies or tests.^[13]
More detailed research (or summaries of the relevant implications of existing research) into seemingly malevolent historical leaders:
- To what extent were they genuinely malevolent?
- What were the effects of that?
- How much influence did they really have on events?
- Which factors have helped or hindered them gaining power?
- Which actions might have reduced the harms they caused?
Lots of possible surveys about people’s current attitudes towards malevolence.^[14]
Lots of possible other research or interventions relating to malevolence reduction but not difficult-to-game measures specifically, e.g. workshops on how to spot malevolent traits, whistleblowing on malevolent behaviour, political interventions, relevant AI evals.

You might be able to get funding and support to do this research, e.g. from:

Thank you to Clare Diane Harris, David Althaus, Tobias Baumann, and Lucius Caviola for comments/suggestions on a draft of this post. All opinions and mistakes are mine.

^[1] For a startup position, equivalents might be (1) hiring, (2) before being promoted to a C-Suite position, (3) during a quarterly or annual performance review.
^[2] For example, they might be looking to make cuts for some reason anyway and interested in reasons to get rid of people; they might have already invested heavily into hiring processes but neglected staff review processes; they might have experienced some sort of scandal or external pressure suggesting cultural issues in the organisation that they want to crack down on.
^[3] Stefan Torges has previously highlighted that malevolence reduction is more tractable than other s-risk work: it is a known problem, with known quantities, potential for buy-in from stakeholders, a decent window of opportunity, feedback signals, potential for talent absorption and lower infohazard risk. I think those considerations seem similarly or more important than the 5 I list here, although Stefan’s points aren’t focused on better measures of malevolence.
^[4] Of course the status quo is not ‘no screening’, it’s actually ‘indirect and disorganised screening.’ The point probably holds though.
^[5] Some security clearance processes do seem to assess personality traits. For example, the Australian Criminal Intelligence Commission says that they look for honesty, trustworthiness, being impartial, being respectful, and being ethical. (Thanks to Clare Diane Harris for this point.)
^[6] Of course, the pressure is sometimes refused. There’s interesting discussion on testing for presidents here.
^[7] Indirectly relevant: utilitarian endorsement correlates with some psychopathic traits; “Longtermists are perceived as power-seeking”; “EA is about maximization, and maximization is perilous”; Sam Bankman-Fried.
^[8] I can imagine it having a very compelling emotional/narrative appeal to it, along the lines of ‘literally battling evil.’
^[9] Encouraging companies like 23andMe, sperm banks etc to include relevant tests; more general awareness raising to increase demand for tests of Dark Tetrad traits; potential pressure campaigns to demand such tests be carried out and used (likely at a later date).
^[10] It’s somewhat relevant to improving institutional decision-making, global priorities research, and longtermist talent search. Overall, this point seems notably weaker than the previous two.
^[11] Note, I haven’t actually looked up if any of these exist already. But my suggestion here is partly about distilling insights specifically for reducing malevolence anyway.
^[12] Some initial progress here seems like it could be pretty easy for someone with relevant training already, and doable for others too if they’re willing to take the time to learn on the job.
^[13] I’m pretty tempted to read this out of curiosity.
^[14] I liked this comment from Saulius Simcikas: “I imagine that most people would support a law that candidates must take the test before elections and that this information should be made public. We can figure out if that's true via a survey. And if it turned out that some candidate has those traits, I think that it would make people less likely to vote for that person. That can also be researched by doing surveys.”

William McAuliffe @ 2024-04-03T14:15 (+69)

Reducing the influence of impression management on the measurement of prosocial and antisocial traits was the topic of my doctoral research. When I started, I thought that better behavioral paradigms and greater use of open-ended text analysis could meaningfully move the needle. By the time I moved onto other things I was much more pessimistic that there is low-hanging fruit that can both (a) meaningfully move the needle (here's one example of a failed attempt of mine to improve the measurement of prosocial traits; McAuliffe et al., 2020), and (b) be implemented at scale in a practical context. The general issue is that harder-to-game measures are much noisier than easier-to-game measures (e.g., see Schimmack, 2021 on implicit measures), so the gameable measures tend to be more useful for making individual predictions in spite of their systematic biases. The level of invasiveness required to increase the signal on a non-gameable measure (e.g., scraping all of a person's online text without their permission) would probably be at odds with other goals of the movement. The same probably goes for measures that do not rely on actual evidence of concerning behavior (e.g., polygenic scores).

More fundamentally, I disagree that this is a neglected topic: measuring malevolence and reducing response biases are both mainstream topics within personality psychology, personnel psychology, developmental psychology, behavioral genetics, etc. For example, considerable effort has gone into testing whether multidimensional forced-choice personality questionnaires do a good job reducing faking (e.g., Wetzel et al., 2020). An academic psychologist who is EA-sympathetic and getting funding from standard academic sources might have more impact from pursuing this topic rather than whatever else they would have studied instead, but I see limited value in people changing careers or funding grants that would have otherwise gone to other EA causes. I also do not see a strong case for carrying on the discussion outside of the normal academic outlets where there is a lot more measurement expertise.

Clare_Diane @ 2024-04-11T16:25 (+20)

Thank you for sharing these insights. I am also pessimistic about using self-report methods to measure malevolent traits in the context of screening, and it’s very helpful to hear your insights on that front. However, I think that the vast majority of the value that would come from working on this problem would come from other approaches to it.

Instead of trying to use gameable measures as screening methods, I think that:

(1) Manipulation-proof/non-gameable measures of specific malevolent traits are worth investigating further.^[1] There are reasons to investigate both the technical feasibility and the perceived acceptability and political feasibility of these measures.^[2]

(2) Gameable measures could be useful in low-stakes, anonymous research settings, despite being (worse than) useless as screening tools.

I explain those points later in this comment, under headings (1) and (2).

On the neglectedness point

I think the argument that research in this area is not neglected is very important to consider, but I think that it applies much more strongly in relation to the use of gameable self-report measures for screening than it does to other approaches such as the use of manipulation-proof measures. As I see it, when it comes to using non-gameable measures of malevolent traits, this topic is still neglected relative to its potential importance.

The specific ways in which different levels of malevolence potentially interact with x-risks and s-risks (and risk factors) also seem to be relatively neglected by mainstream academia.

Manipulation-proof measures of malevolence also appear to be neglected in practice:

Despite a number of high-stakes jobs requiring security clearances, ~none that I’m aware of use manipulation-proof measures of malevolence,^[3] with the possible exception of well-designed manipulation-proof behavioral tests^[4] and careful background checks.^[5]
There are also many jobs that (arguably) should, but currently do not, require at least the level of rigor and screening that goes into security clearances. For those roles, the overall goal of reducing the influence of malevolent actors seems to be relatively absent (or at least not actively pursued). Examples of such roles include leadership positions at key organizations working to reduce x-risks and s-risks, as well as leadership positions at organizations working to develop transformative artificial intelligence.
Many individuals and institutions don’t seem to have a good understanding of how to identify elevated malevolent traits in others and seem to fail to recognize the impacts of malevolent traits when they’re present.^[6] There may be a bidirectional relationship here: low recognition of situations where malevolence is contributing to organizational problems would plausibly reduce the degree to which people see the value of measuring malevolence in the first place (particularly in manipulation-proof ways), and vice versa.^[7]

(1) Some reasons to investigate manipulation-proof/non-gameable measures of malevolent traits

I’ll assume anyone reading this is already familiar with the arguments made in this post. One approach to reducing risks from malevolent actors could be to develop objective, non-gameable/manipulation-proof measures of specific malevolent traits.

Below are some reasons why it might be valuable to investigate the possibility of developing such measures.

(A) Information value.

Unlike self-report measures, objective measures of malevolent traits are still relatively neglected (further info on this below). It seems valuable to at least invest in reducing the (currently high) levels of uncertainty here as to the probability that such measures would be (1) technically and (2) socially and politically feasible to use for the purposes discussed in the OP.

(B) Work in this area might actually contribute to the development of non-gameable measures of malevolent traits that could then be used for the purposes most relevant to this discussion.

I think it would be helpful to develop a set of several manipulation-proof measures to use in combination with each other. To increase the informativeness^[8] of the set of measures in combination, it would be helpful if we could find multiple measures whose errors were uncorrelated with each other (though I do not know whether this would be possible in practice, I think it’s worth aiming for). To give an example outside of measuring malevolence, the combination of electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) appears much more useful for diagnosing epilepsy than either modality alone.
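To illustrate why uncorrelated errors matter, here is a minimal sketch (not taken from any of the studies mentioned in this thread). It assumes the measures are simply averaged with equal weights, that each measure's error has the same variance, and that every pair of errors shares the same correlation. The function name and the numbers are purely illustrative.

```python
def combined_error_variance(sigma2: float, k: int, rho: float) -> float:
    """Error variance of the equal-weight average of k measures.

    Assumes (hypothetically) that each measure's error has variance sigma2
    and that every pair of errors has the same correlation rho.
    Var(mean) = sigma2 * (rho + (1 - rho) / k).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if rho > 1.0 or (k > 1 and rho < -1.0 / (k - 1)):
        raise ValueError("rho is outside the feasible range for k measures")
    return sigma2 * (rho + (1.0 - rho) / k)


if __name__ == "__main__":
    # Five measures, each with error variance 1.0 (illustrative numbers only).
    for rho in (0.0, 0.5, 0.9):
        variance = combined_error_variance(1.0, 5, rho)
        print(f"rho={rho:.1f} -> error variance of the average: {variance:.2f}")
```

Under these assumptions, five measures with uncorrelated errors cut the error variance to a fifth of a single measure's, whereas with an error correlation of 0.9 the average is barely better than one measure alone. No matter how many measures are added, the variance never falls below rho times the single-measure variance.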

In some non-research/real-life contexts, such as the police force, there are already indirect measures in place that are specifically designed to be manipulation-proof or non-gameable. These can include behavioral tests designed in such a way that those undertaking the test do not know (or at least cannot verify) whether they are undertaking a test and/or cannot easily deduce what the “correct” course of action is. They are designed to identify specific behaviors that are associated with underlying undesirable traits (such as behaviors that demonstrate a lack of integrity or that demonstrate dishonesty or selfishness). Conducting (covert) integrity testing on police officers is one example of this kind of test (see also: Klaas, 2021, ch. 12).

Behavioral tests such as these would likely require significant human resources and other costs in order to be truly unpredictable and manipulation-proof. Despite the costs, in the context of high-stakes decisions (e.g., regarding who should have or keep a key position in an organization or group that influences x-risks or s-risks), it seems worth considering the idea of using behavioral measures in combination with background checks (similar to those done as part of security clearances, as discussed earlier) plus a set of more objective measures (ideas for which are listed below).

Below is a tentative, non-exhaustive list of objective methods which seem worth investigating as future ways of measuring malevolent traits. Please note the following things about it, though:

Listing a study or meta-analysis below doesn’t imply that I think it’s well done; this table was put together quickly.
Many of the effect sizes in these studies are quite small, so using any one of these methods in isolation does not seem like it would be useful (based on the information I’ve seen so far).

Objective approaches to measuring malevolence that seem worth investigating
Approach	Brief comments	Examples of research applications of this approach
Electroencephalography (EEG) event-related potentials (ERPs)	This involves recording electrical activity (measured at the scalp) of the brain via an electrogram. Thanks to its high temporal resolution, EEG allows one to assess unconscious neural responses within milliseconds of a given stimulus being presented. Relatively portable and cheap.	Assessing deficits in neural responses to seeing fearful faces among people with high levels of the “meanness” factor of psychopathy (across multiple studies) (link to full text thesis version of the same study)
Functional magnetic resonance imaging (fMRI)	This involves inferring neural activity based on changes in blood flow to different parts of the brain (which is called blood-oxygen-level dependent [BOLD] imaging). In comparison to EEG, fMRI offers high spatial resolution but relatively poor temporal resolution. Not portable or convenient. (More expensive than all the other methods in this table; fMRI is at least 10 times the cost of fNIRS!) However, a lack of (current) scalability does not (in my opinion) completely rule out the use of such a test in high-stakes hiring decisions.^[9]	Assessing mirror neuron activity during emotional face processing across multiple fMRI studies
Functional near-infrared spectroscopy (fNIRS)	Measures changes in the concentrations of both oxygenated and deoxygenated hemoglobin near the surface of the cortex. Relatively portable, but not as cheap as some of the other methods.	Assessing medial prefrontal cortex oxygenated hemoglobin activity in response to positive vs negative emotional films in individuals with high versus low levels of callous-unemotional (CU) traits; Measuring activity in the dorsolateral prefrontal cortex among subjects with different levels of impulsivity and forgiveness in an experimental game setting
Pupillometry, especially in the context of measuring pupil reactivity to positive, negative, and neutral images	This involves measuring variations in pupil size; such measurements are reflective of changes in sympathetic nervous system tone, but they have an added advantage of higher temporal resolution compared to other measures of sympathetic nervous system activation. Relatively portable and cheap.	Screening for children at risk of future psychopathology; Assessing the reduced pupil dilation in response to negative images among 4-7-year-old children and using that to predict later conduct problems and reduced prosocial behavior; Assessing pupil reactivity to emotional faces among male prisoners with histories of serious sexual or violent offenses
Startle reactivity, especially the assessment of aversive startle potentiation (ASP)	The startle response in humans can be measured by monitoring the movement of the muscle orbicularis oculi surrounding the eyes. Aversive startle potentiation (ASP) involves presenting a stimulus designed to exaggerate the startle response, such as negative images or negative film clips. Relatively portable and cheap.	Assessing startle reactivity in response to negative, positive, and neutral images among people with high levels of psychopathy and everyday sadism

Other approaches (not listed above)

There are also ~manipulation-proof measures that, despite their relatively low susceptibility to manipulation, seem to me to be non-starters because I predict they would have a low positive predictive value. These include polygenic risk scores in adults and the use of structural neuroimaging in isolation (i.e., without any functional components). (That’s one of the reasons I did not list magnetoencephalography (MEG) above - relatively few functional [fMEG] studies appear to have been done on malevolent traits.)
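For readers less familiar with positive predictive value (PPV), here is a minimal sketch of the standard Bayes'-rule calculation. The sensitivity, specificity, and prevalence figures below are hypothetical, chosen only to show how a low base rate pulls PPV down; they are not estimates for any real test of malevolent traits.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Probability that someone who tests positive truly has the trait.

    PPV = TP / (TP + FP), where
      TP = sensitivity * prevalence
      FP = (1 - specificity) * (1 - prevalence)
    """
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0.0:
        raise ValueError("no positive results are possible with these inputs")
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical test: 90% sensitivity, 90% specificity,
    # applied to a population where 1% have the trait.
    ppv = positive_predictive_value(0.9, 0.9, 0.01)
    print(f"PPV: {ppv:.3f}")  # roughly 0.083
```

With these illustrative numbers, only about 1 in 12 people flagged by the test would actually have the trait, even though the test looks accurate on paper. Measures with weaker sensitivity or specificity would do worse still.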

Finally, there are multiple measures that I do *not* currently^[10] think are promising due to being too susceptible to direct or indirect manipulation, including self-report surveys, implicit association tests (unless they are used in combination with pupillometry, in which case it’s possible that they would become less gameable), eye tracking, assessments of interpersonal distance and body language analysis, behavior in experimental game situations or in other contexts where it’s clear what the socially desirable course of action would be, measures of sympathetic nervous system activation that lack temporal resolution,^[11] anonymous informant interviews (as mentioned earlier^[5]), reference reports, and 360 reviews. In the case of the latter two approaches, I predict that individuals with high enough levels of social skills, intelligence, and charisma would be able to garner support through directly or indirectly manipulating (and/or threatening) the people responsible for rating them.

Thinking about implementation in the real-world: tentative examples of the social and political feasibility of using objective screening measures in high-stakes hiring and promotion decisions

Conditioned on accurate objective measures of malevolent traits being available and cost-effective (which would be a big achievement in itself), would such screening methods ever actually be taken up in the real world, and if they were taken up, could this be done ethically?

It seems like surveys could address the question of perceived acceptability of such measures. But until such surveys are available, it seems reasonable to be pessimistic about this point.

Having said this, there are some real-world examples of contexts in which it is already accepted that people should be selected based on specific traits. In positions requiring security clearance or other positions requiring very high degrees of trust in someone prior to hiring them, it is common to select candidates on the basis of their assessed character (despite the fact that there are not yet any objective measures for the character traits they are testing for). For example:

The National Security Eligibility Determination considers the following “facts” about someone when vetting them: stability, trustworthiness, reliability, discretion, character & judgement, honesty, and “unquestionable loyalty to the U.S.”. Some of these (trustworthiness and honesty) would tend to be anticorrelated with malevolent traits.
In New Zealand, Protective Security Requirements stipulate that a candidate “must possess and demonstrate” integrity, which they define as a collection of three character traits: honesty, trustworthiness and loyalty.
The Australian Criminal Intelligence Commission (ACIC) says that they look for the following traits in employees: “honesty, trustworthiness, [being] impartial, [being] respectful, [and being] ethical.” All of these would tend to be anticorrelated with malevolent traits.

In addition, there are some professional contexts in which EEG is already being used in decisions about whether people can start or keep a job (in much lower-stakes settings than the ones of interest to longtermists). In these contexts, EEG is used to try to rule out epilepsy, which people tend to think about differently from personality traits. However, the examples still seem worth noting.

Before becoming a pilot, one typically has to undergo EEG screening for epileptiform activity, regardless of one’s medical history - and this is reportedly despite there not even being sufficient evidence for the benefits of this.
Driving regulations for epilepsy in Europe often stipulate that someone with epilepsy should show no evidence of epileptiform activity on EEG (and often the EEG has to be done while sleep deprived as well).

Of course, taking on a leadership position in an organization that has the capacity to influence x-risks or s-risks would (arguably) be a much higher-stakes decision than taking on a role as a pilot or as a professional driver, for example. The fact that there is a precedent for using a form of neuroimaging as part of an assessment of whether someone should take on a professional role, even in the case of these much lower-stakes roles, is noteworthy. It suggests that it’s not unrealistic to expect people applying to much higher-stakes roles to undergo similarly manipulation-proof tests to assess their safety in that role.

(2) Why gameable/manipulable measures might be useful in research contexts (despite not being useful as screening methods)

Self-report survey data are easy to collect, and it seems that people are less likely to lie in low-stakes, anonymous settings.

Administering self-report surveys could improve our understanding of how various traits of interest correlate with other traits and with specific beliefs (e.g., with fanatical beliefs) and behaviors of interest (including beliefs and behaviors that would be concerning from an x-risk or s-risk perspective, which do not appear to have been studied much to date in the context of malevolence research).^[12]

Thank you to David Althaus for very helpful discussions and for quickly reading over this! Any mistakes or misunderstandings are mine.

^[1]
Just to be clear, I don’t think that these investigations need to be done by people within the effective altruism community. I also agree with William McAuliffe’s comment above that there would be value in mainstream academics working on this topic, and I hope that more of them do so in the future. However, this topic seems neglected enough to me that there could be value in trying to accelerate progress in this area. And as the OP mentioned, there could be a role for EA orgs in testing the feasibility of using measures of malevolent traits, if or once they are developed.
^[2]
Of course, anyone pursuing this line of investigation would need to carefully consider the downsides and ethical implications at each stage. Hopefully this goes without saying.
^[3]
I’d be very happy to be proven wrong about this.
^[4]
You could say: of course manipulation-proof measures of malevolence are neglected in practice - they don’t exist yet! However, well-designed manipulation-proof behavioral tests do exist in some places, and they do tend to test things that are correlated with malevolence, such as dishonesty. So it seems at least possible to track malevolence in some manipulation-proof ways, but this appears to be done quite rarely.
^[5]
One could argue that security clearance processes also tend to include other, indirect measures of malevolence (or of traits that [anti]correlate with malevolent traits), and some of these indirect measures are less susceptible to manipulation than others. When it comes to background checks and similar checks as part of security clearances, these seem difficult to game, but not impossible (for example, if someone was a very “skilled” fraud, or if someone hypothetically stole the identity of someone who already had security clearance). Other things to note about background checks are that they can be costly and can still miss “red flags” (for example, this paper claims that this happened in the case of Edward J. Snowden’s security clearance). However, if done well, such investigations could identify instances where the person of interest has been dishonest, has displayed indirect evidence of past conflicts, or has a record of more overtly malevolent behaviors. In addition to background checks, interviews with people who know the person of interest could also be useful. However, I would argue that these are somewhat gameable, because sufficiently malevolent individuals might cause some interviewees/informants to be fearful about giving honest feedback (even if they were giving feedback anonymously, and even if they were actively reassured of their anonymity). Having said this, there could be value in combining information from anonymous informant reports with a set of truly manipulation-proof measures of malevolence (if or when some measures are found to be informative enough to be used in this context).
^[6]
It appears there is also a relative paucity of psychoeducation about malevolent traits. A lack of understanding of malevolent traits has arguably also affected the EA community. For example, many people didn’t recognise how concerning some of SBF’s traits were for someone in as powerful a position as his. This lack of awareness, and the resulting lack of tracking of the probability that someone has high levels of malevolence, could plausibly contribute to institutional failures (such as a failure to take whistleblowers seriously and a general failure to identify and respond appropriately when someone shows a pattern of conflict, manipulation, deception, or other behaviors associated with malevolence). In the case of SBF, Spencer Greenberg found that some people close to SBF had not considered his likely deficient affective experience (DAE) as an explanation for his behavior within the set of hypotheses they’d been considering (disclosure: I work for Spencer, but I’m writing this comment in my personal capacity). Spencer spoke with four people who knew SBF well and presented them with the possibility that SBF had DAE while also believing in EA principles. He said that, after being presented with this hypothesis, all of their reactions “fell somewhere on the spectrum from “that seems plausible” to “that seems likely,” though it was hard to tell exactly where in that range they each landed.” Surprisingly, however, before their conversations with Spencer, the four people “seemed not to have considered [that possibility] before.”
^[7]
In the absence of accurate methods of identifying people with high levels of malevolent traits, organizations may fail to identify situations where someone’s malevolent traits are contributing to poor outcomes. In turn, this lack of recognition of the presence and impact of malevolent traits would plausibly reduce the willingness of such organizations to develop or use measures of malevolence in the first place. On the other hand, if organizations either began to improve their ability to detect malevolent traits, or became more aware of the impact of malevolent traits on key outcomes, this could plausibly contribute to a positive feedback loop where improved ability to detect malevolence and an appreciation of the importance of detecting malevolence positively reinforce each other. Such improvements might be useful not only for hiring decisions but also for decisions as to whether to promote someone or keep them in a position of power. For example, it seems reasonable to expect that whistleblowers within organizations (or others who try to speak out against someone with high levels of malevolent traits) would be more likely to be listened to (and responded to appropriately) if there was greater awareness of the behavioral patterns of malevolent actors.
^[8]
In this context, I’m talking about the degree to which measures could more sensitively detect (but not overestimate) when an individual’s levels of specific malevolent traits (such as callousness or sadism) would be high enough that we’d expect them to substantially increase x-risks or s-risks if given a specific influential position (that they’ve applied to, for example). Due to the dimensional, continuous nature of malevolent traits, the use of labels such as “malevolent” or “not malevolent” (which could be used if we wanted to calculate the sensitivity and specificity of measures of interest) would be artificial and would need to be decided carefully. Deciding whether and to what extent someone’s levels of malevolent traits would increase x-risks or s-risks would be a probabilistic assessment, but I think it would be an important part of assessing the risks involved in making high-stakes decisions about who to involve (or keep involved) in the key roles of high-impact organizations, groups, or projects.
^[9]
The comment above (by William McAuliffe) mentioned that it would be difficult to implement non-gameable measures at scale in a practical context. I think this concern should definitely be investigated in more depth, but I also think that the measures would not necessarily need to be implementable on a large scale to provide value. The measures would, of course, need to be studied well enough to have a good understanding of their specificity and sensitivity with respect to identifying actors most likely to increase x-risks or s-risks if given specific positions of power, but the relatively low numbers of these positions may mean that it could be justifiable to implement non-gameable tests of malevolent traits even if they are expensive and difficult to scale. The higher-stakes the position of interest, the more it would seem to be justified to invest significant resources into preventing malevolent actors from taking up that position.
^[10]
Some of the currently-gameable measures might become less gameable in the future. For example, using statistical or machine learning methods to model and correct for impression management and socially desirable responding in self-report surveys would at least increase the utility of self-report methods (i.e., it seems possible that such developments could elevate them from being worse than useless to being one component in a set of tests as part of a screening method). However, in view of William McAuliffe’s well-evidenced pessimism on this topic, I’m also less optimistic about this than I would otherwise have been.
^[11]
There are several measures of sympathetic nervous system activation that I did not list above. This is mainly because sympathetic activation is arguably under some degree of voluntary control (e.g., one could hyperventilate to increase one’s level of sympathetic activation, and one can train oneself to vary one’s heart rate variability), and given that there are also multiple other factors (other than malevolent traits) that can contribute to variations in sympathetic activation, I did not list them as potential measures at this stage.

However, for completeness, I’ll list a couple of specific examples in this category here. Heart rate (HR) orienting responses to images designed to induce threat and distress have been investigated among people with high versus low levels of callous-unemotional (CU) traits. Similarly, HR variability (HRV) appears to be altered among people with high levels of the boldness factor of psychopathy. In addition to heart rate variability, another indirect measure of sympathetic nervous system activation is skin conductance (SC) or electrodermal activity (EDA), which capitalizes on the fact that sweat changes the electrical conductance of the skin. Notwithstanding the noisiness and downsides of such measures, changes in electrodermal activity have been observed in psychopathy, antisocial personality disorder, and conduct disorder (across multiple studies). Both HRV and EDA measurements lack the temporal resolution of pupillometry, though, so if a measure of sympathetic activation were going to be investigated in the context of screening for malevolent traits, I predict that it would be more useful to use pupillometry in combination with specific emotional stimuli.
^[12]
This is extremely speculative and uncertain (sorry), but if it turns out that a better understanding of the correlations between different traits and outcomes of interest within humans could translate into a better ability to predict personality-like traits and behaviors of large language models (for example, if positive correlations between one trait and another among humans translated into positive correlations between those two traits in corpuses of training data and in LLM outputs), then research in this area could be relevant to evaluations of those models. However, this is just a very speculative additional benefit beyond the main sources of value discussed above.

taoburga @ 2024-04-11T21:04 (+4)

Thanks Clare! Your comment was super informative and thorough.

One thing that I would lightly dispute is that 360 feedback is easily gameable. I (anecdotally) feel like people with malevolent traits (“psychopaths” here) often have trouble remaining “undiscovered” and so have to constantly move or change social circles.

Of course, almost by definition I wouldn’t know any psychopaths that are still undiscovered. But 360 feedback could help discover the “discoverable” subgroup, since the test is not easily gameable by them.
Any thoughts?

Clare_Diane @ 2024-04-12T16:03 (+8)

Thank you, Tao - I’m glad you found it informative!

And thank you for that - I think that’s a great point. I was probably a bit too harsh in dismissing 360 degree reviews: at least in some circumstances, I agree with you - it seems like they’d be hard to game.

Having said that, I think it would mostly depend on the level of power held by the person being subjected to a 360 review. The more power the person had, the more I’d be concerned about the process failing to detect malevolent traits. If respondents thought the person of interest was capable of (1) inferring who negative feedback came from and (2) exacting retribution (for example), then I imagine that this perception could have a chilling effect on the completeness and frankness of feedback.

For people who aren’t already in a position of power, I agree that 360 degree reviews would probably be less gameable. But in those cases, I’d still be somewhat concerned if they had high levels of narcissistic charm (since I’d expect those people to have especially positive feedback from their “fervent followers,” such that - even in the presence of negative feedback from some people - their high levels of malevolent traits may be more likely to be missed, especially if the people reviewing the feedback were not educated about the potential polarizing effects of people with high levels of narcissism).

If 360 reviews were done in ways that guaranteed (to the fullest extent possible) that the person of interest could not pinpoint who negative feedback came from, and if the results were evaluated by people who had been educated about the different ways in which malevolent traits can present, I would be more optimistic about their utility. And I could imagine that information from such carefully conducted and interpreted reviews could be usefully combined with other (ideally more objective) sources of information. In hindsight, my comment didn’t really address these ways in which 360 reviews might be useful in conjunction with other assessments, so thank you so much for catching this oversight!

I’d always be interested in discussing any of these points further.

Thank you again for your feedback and thoughts!

Jamie_Harris @ 2024-04-03T19:59 (+7)

Thanks a lot for this! I may reply in more detail later but I wanted to send a quick interim note; this is exactly the sort of useful feedback and info I was hoping to elicit with this post!

Clare_Diane @ 2024-04-03T13:00 (+1)

Thank you so much for writing this post!

SummaryBot @ 2024-04-02T17:09 (+1)

Executive summary: Developing manipulation-proof measures of malevolence could help reduce existential and suffering risks by screening out malevolent actors from high-impact positions, and this cause area seems neglected yet tractable and important.

Key points:

Better measures of malevolence could be used to screen politicians, AI researchers, and other high-leverage individuals at various career stages.
Wider dissemination of anti-malevolence screening, e.g. via HR companies, could help build awareness and political will.
The neglectedness of this cause area suggests low-hanging fruit opportunities for research and implementation.
Precedents like polygraph tests and historical case studies of malevolent leaders provide foundations to build on.
EA and longtermist organizations could serve as early adopters to test and advance the science of measuring malevolence.
Concrete research ideas include literature reviews, case studies, surveys on attitudes, and non-measure interventions like spotting malevolent traits.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.