TIO: A mental health chatbot

By Sanjay @ 2020-10-12T20:52 (+25)

This post describes a mental health chatbot, which we call the Talk It Over Chatbot (sometimes TIO for short). The first work on the bot started in 2018, and we have been working more earnestly since 2019.

My main aim is to explore whether I should seek funding for it (it is currently volunteer-led).

Here’s a summary of my findings:

This post invites:

Our motivations when we started this project

The bot was created by a team of volunteers, the initial three of which were current/former volunteers with Samaritans, a well-known UK charity which provides emotional support to the distressed and despairing, including those who are suicidal.

A big part of our motivation was the fact that we knew there was a big demand for the Samaritans service, but nowhere near enough supply. This meant that our initial prototype need not outperform a human; it need only outperform (e.g.) calling the Samaritans and getting the engaged tone.

We also suspected that some users might prefer speaking to software, and this suspicion has been confirmed.

Another motivation was the fact that lots of important decisions we made were under-evidenced or unevidenced. Specifically, when deciding what we, as individual therapeutic listeners, should say to a service user, we were often operating in an evidence vacuum. Our conversations with professionals suggest that this problem is common to other parts of the mental health space, so creating a new source of data and evidence could be highly valuable.

Overview of the cost-effectiveness model

I referred earlier to “direct-results thinking” (as opposed to hits-based thinking). This refers to direct interventions having relatively immediate effects, which can be best assessed with a cost-effectiveness model.

The model that I have built incorporates the expected costs of running the bot and the expected improvements in the wellbeing of the users.

The model tries to envisage a medium-term future state for the bot (rather than, e.g., what the project will look like immediately after funding it).

The decision criterion for the modelled considerations is that the bot's cost-effectiveness should be competitive with StrongMinds. (Note that the unmodelled considerations may also be material.)

Modelled results of the cost-effectiveness model

The model also explores alternative scenarios:

My interpretation of these findings is that the model’s estimates of the bot’s effectiveness are in the same ballpark as StrongMinds, assuming we accept the model’s assumptions.

The model can be found here.

Key assumptions underlying the model

Interested readers are invited to review the appendices which set out the assumptions in some detail.

A key feature is that the model sets out a medium-term future state for the bot, after some improvements have been made. These assumptions introduce risks; however, for a donor with an appetite for risk and a general willingness to support promising early-stage/startup projects, this need not be an issue.

If I were to pick one assumption which I think is most likely to materially change the conclusions, it would be the assumption about the counterfactual. This is discussed in its own appendix. This assumption is likely to be favourable to the bot's cost-effectiveness, though it's unclear by how much. If you were to fund this project, you would have to do so on one of the following bases:

Unmodelled arguments for and against funding the bot

This section refers to factors which aren’t reflected in the model.

For

Against

How would extra funds be used?

At first, the most pressing needs are around:

  1. A specialist with NLP (Natural Language Processing) skills
  2. More frontend skills, especially design and UX
  3. A strong frontend developer, preferably with full stack skills
  4. A rigorous evaluation of the bot
  5. More Google ads

The purpose of this post is to gather feedback on whether the project has sufficient potential to warrant funding it (which is a high-level question), so for brevity this section does not set out a detailed budget.

Volunteering opportunities

We currently would benefit from the following types of volunteer:

Our existing team and/or the people we are talking to already cover all of these areas to a certain degree; however, further help is needed on all of these fronts.

Appendices

The appendices are in the following sections:

  1. More info about how the bot works
  2. How this bot might revolutionise the evidence base for some therapies
  3. History of the bot
  4. More details of the cost-effectiveness modelling
  5. Overview of other mental health apps

Appendix 1a: How the bot works

The bot sources its users via Google ads shown to people who are searching for things like "I'm feeling depressed".

The philosophy behind how the bot chooses its responses can be summarised as follows:

If you want to get a feel for what the bot is like, here are some recordings of the bot:

TIO example conversation using blue/grey frontend

TIO example conversation using bootstrap frontend

TIO example conversation using Guided Track frontend

You are also welcome to interact with the chatbot; however, note that the bot's responses are geared towards *actual* users: people who aren't in emotional need at the time of using the bot are likely to have a more disappointing experience. Here's a link. Note that this is a test version of the bot, because I want to avoid contaminating our data with testing users. However, if you are genuinely feeling low (i.e. you are a "genuine" user of the service) then you are welcome to use the bot here.

Appendix 1b: Does the bot use AI?

It depends on what you mean by AI.

The bot doesn’t use machine learning, and we have a certain nervousness about incorporating a “black box” with poor explainability.

Instead, we are currently using a form of pattern matching, where the bot looks for certain strings of text in the user's message (this description somewhat simplifies the reality).
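To give a flavour of what this looks like in practice, here is a minimal, purely illustrative sketch -- not our actual code, and with made-up patterns and responses:

```python
import re

# Illustrative (pattern, response) pairs, checked in order.
# The real bot's rules and wording are more involved than this sketch.
RULES = [
    (re.compile(r"\b(alone|lonely|no one)\b", re.IGNORECASE),
     "It sounds like you're feeling quite alone. Would you like to say more about that?"),
    (re.compile(r"\b(work|job|boss)\b", re.IGNORECASE),
     "It sounds like work is weighing on you. What's been happening there?"),
]

# Fallback when nothing matches: a neutral, open prompt.
FALLBACK = "That sounds difficult. Can you tell me more about how that feels?"

def choose_response(user_message: str) -> str:
    """Return the response for the first pattern found in the user's message."""
    for pattern, response in RULES:
        if pattern.search(user_message):
            return response
    return FALLBACK

print(choose_response("I've been feeling so lonely since I moved"))
```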

At the time of writing we are investigating "Natural Language Processing" (NLP) libraries, such as NLTK and spaCy. We currently expect to ramp up the amount of AI we use.

Under some definitions of AI, even relatively simple chatbots count as AI, so this chatbot would certainly count as AI under those definitions.

Appendix 1c: Summary of impact scores

The impact scores shown here are calculated as: (how the user was feeling, on a scale from 1 to 10, at the end of the conversation) minus (how the user was feeling, on a scale from 1 to 10, at the start of the conversation).
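As a purely illustrative sketch (the records below are made up, not our real data), the per-conversation scores and the per-variant averages behind the summary table can be computed like this:

```python
from statistics import mean

# Made-up conversation records: (bot variant, start rating, end rating),
# where ratings use the 1-10 scale described above.
conversations = [
    ("earliest_prototype", 3, 4),
    ("earliest_prototype", 5, 5),
    ("later_variant", 2, 4),
    ("later_variant", 4, 6),
]

def impact_score(start: int, end: int) -> int:
    """End-of-conversation rating minus start-of-conversation rating."""
    return end - start

# Mean impact score per bot variant, as in the summary table.
scores_by_variant = {}
for variant, start, end in conversations:
    scores_by_variant.setdefault(variant, []).append(impact_score(start, end))

for variant, scores in scores_by_variant.items():
    print(variant, round(mean(scores), 2))
```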


The total number of users is >10,000, but not all users state how they feel at the end, so the number of datapoints (n) feeding into the above table is only around 3,300.

The story apparently being told in the above table is that each of the changes improved the bot, but that the last change (improving the design) was a step backwards. However, we performed some statistical analysis and cannot be confident that this story is correct.
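The kind of check involved is roughly the following (an illustrative sketch with simulated data, not the analysis we actually ran):

```python
import numpy as np
from scipy import stats

# Simulated impact scores (end rating minus start rating) for two bot
# variants; the real samples are smaller per variant and noisier.
rng = np.random.default_rng(0)
variant_a = rng.normal(loc=0.6, scale=2.0, size=300)  # stand-in for an earlier variant
variant_b = rng.normal(loc=0.9, scale=2.0, size=300)  # stand-in for a later variant

# Welch's t-test: is the difference in mean impact score distinguishable
# from noise at these sample sizes?
t_stat, p_value = stats.ttest_ind(variant_a, variant_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```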

The main thing holding us back is money: getting larger sample sizes involves paying for more Google ads. If we could do this, we would be able to draw inferences more effectively.

Appendix 1d: How good are these impact scores?

The changes in happiness scores in the above table are mostly in the range 0.3 out of 10 (for the earliest prototype) to 0.99 out of 10.

Is this big or small?

If we look at this EA Forum post by Michael Plant, it seems, at first glance, that the bot’s impacts are moderately big. The table suggests that achieving changes as large as 1 point or more on a scale from 1 to 10 is difficult. The table shows that a physical illness has an impact of 0.22 out of 10, and having depression has an impact of 0.72 out of 10. However note that the question wording was different, so it doesn’t necessarily mean that those are on a like-for-like basis.

(Thank you to Michael for directing me to this in the call I had with him on the topic)

Using a more statistical approach to answering the question, we find that Cohen's d typically falls in the range 0.2 to 0.5. Cohen's d is a measure which compares the size of the change with the standard deviation of the outcome. A Cohen's d of 0.2 typically indicates a modest effect size; 0.5 indicates a medium effect size; and 0.8 indicates a large effect size.
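For paired before/after ratings, one common formulation of Cohen's d is the mean change divided by the standard deviation of the change. A sketch with made-up numbers follows (the exact formulation used in our analysis may differ):

```python
import numpy as np

# Made-up before/after happiness ratings (1-10) for a handful of users.
before = np.array([3, 4, 2, 5, 4, 3, 6, 2])
after = np.array([4, 5, 3, 5, 6, 3, 7, 4])

change = after - before
# Cohen's d (paired form): mean change divided by the SD of the change.
cohens_d = change.mean() / change.std(ddof=1)
print(round(cohens_d, 2))
```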

Appendix 1e: Distribution of responses to the bot

Users’ responses to the bot experience are varied.

The below is an example histogram showing the users’ impact scores (i.e. final score minus initial score) for one variant of the bot.

If we ignore the big dollop of users whose impact score is zero, the shape seems to indicate a roughly symmetrical distribution centred around 1 out of 10.

Note that the big dollop of zero scores may be linked to the fact that the question wording “leads” or “anchors” the user towards a zero score (see separate appendix about the impact measures for more on this).

There are also a few extreme values, with more positive extreme values than negatives.

I have read all of the conversations which have happened between users and the bot (that the users have been willing to share). I synthesise here my impressions based on those conversations and the feedback provided.

Extremely positive reactions:

Neutral to mildly positive reactions:

Negative reactions:

Appendix 1f: Impact measures: the “happiness” question

To assess our impact, we ask the user how they feel at the start and at the end of the conversation.

At the start of the conversation, the bot asks the user:

“Please rate how you feel on a scale from 1 to 10, where 1 is terrible and 10 is great”

When the user initiates the final feedback question, they see this text:

“Thank you for using this bot. Please rate how you feel on a scale from 1 to 10, where 1 is terrible and 10 is great. As a reminder, the score you gave at the start was <INITIAL_SCORE>”
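In outline, the measurement flow amounts to something like the following sketch (the real bot is built in Guided Track rather than Python; the question wording is as shown above):

```python
def ask_rating(prompt: str) -> int:
    """Keep asking until the user gives a whole number from 1 to 10."""
    while True:
        answer = input(prompt + " ")
        if answer.isdigit() and 1 <= int(answer) <= 10:
            return int(answer)
        print("Please enter a whole number from 1 to 10.")

initial_score = ask_rating(
    "Please rate how you feel on a scale from 1 to 10, "
    "where 1 is terrible and 10 is great."
)

# ... the conversation itself happens here ...

final_score = ask_rating(
    "Thank you for using this bot. Please rate how you feel on a scale "
    "from 1 to 10, where 1 is terrible and 10 is great. "
    f"As a reminder, the score you gave at the start was {initial_score}."
)

impact_score = final_score - initial_score
```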

Considerations:

Appendix 2: How this bot might revolutionise the evidence base

The source for much of this section is conversations with existing professional psychiatrists/psychologists.

Currently some psychological interventions are substantially better evidenced than others.

Most psychologists would not argue that we should therefore dismiss these poorly evidenced approaches.

Instead the shortage of evidence reflects the fact that this sort of therapy typically doesn’t have a clearly defined “playbook” of what the therapist says.

Part of the aim of this project is to address this in two ways:

  1. Providing a uniform intervention that can be assessed at scale
    1. Example: imagine a totally uniform intervention (such as giving patients a Prozac pill, which is totally consistent from one pill to the next)
    2. A totally uniform intervention (like Prozac) is easier to assess
    3. Some therapeutic approaches (like CBT) are closer to being uniform (although, depending on how you implement them, sometimes CBT can be more or less uniform)
    4. Others, like Rogerian or existential therapies, are highly non-uniform -- they don’t have a clear “playbook”
    5. This means that we don’t have good evidence around the effectiveness of these sorts of therapy
    6. This chatbot project aims to implement a form of Rogerian-esque therapy that *is* uniform, allowing assessment at scale
  2. Allowing an experimental/scientific approach which could provide an evidence base for therapists
    1. At the moment, there is a shortage of evidence about what *specifically* to say in a given context
    2. To take an example which would be controversial within Samaritans:
    3. Imagine that a service user is talking about some health issues. Consider the following possible responses that could be said by a therapist/listening volunteer:
      1. “You must recognise that these health issues might be serious. Please make sure you go to see a doctor. Will you do that for me please?”
      2. “How would you feel about seeing a doctor?”
      3. “How are you feeling about the health issues you’ve described?”
    4. In a Samaritans context, the first of these would most likely be considered overly directive, the third would likely be considered “safe” (at least with regard to the risk of being overly directive), and the middle option would be controversial.
    5. Currently, these debates operate in an evidence vacuum
    6. Adding evidence to the debate need not necessarily fully resolve the debate, but would almost certainly take the debate forward (a rough sketch of how such an experiment might work follows this list).
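To make point (2) concrete, here is a minimal sketch -- illustrative only, not a live feature of the bot -- of how two candidate responses could be randomised and the assignments logged for later comparison against impact scores:

```python
import random

# The second and third example responses from the list above.
CANDIDATES = {
    "option_2": "How would you feel about seeing a doctor?",
    "option_3": "How are you feeling about the health issues you've described?",
}

def serve_response(conversation_id: str, assignment_log: list) -> str:
    """Randomly pick one candidate response and record which arm was used."""
    arm = random.choice(list(CANDIDATES))
    assignment_log.append({"conversation": conversation_id, "arm": arm})
    return CANDIDATES[arm]

# Later, each logged assignment would be joined to that conversation's
# impact score, and the two arms compared statistically.
log = []
print(serve_response("conversation-001", log))
print(log)
```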

I believe that this benefit on its own may be sufficient to justify funding the project (i.e. even if there were no short term benefit). This reflects the fact that other, better-evidenced interventions don’t work for everyone (e.g. CBT might be great for person A, but ineffective for person B), which means that CBT can’t solve the mental health problem on its own, so we need more evidence on more types of therapy.

Crucially, TIO is fundamentally different from other mental health apps -- it has a free-form conversational interface, similar to an actual conversation (unlike other apps which either don’t have any conversational interface at all, or have a fairly restricted/”guided” conversational capability). This means that TIO is uniquely well-positioned to achieve this goal.

Note that there is still plenty of work needed before the bot can make this sort of contribution to the evidence base. Specifically, a number of other mental health apps have been subjected to rigorous evaluations published in journals, whereas TIO -- still an early-stage project -- has not.

Appendix 3: History of the bot

2018: I created the first prototype, which was called the Feelings Feline. The bot took on the persona of a cat. This was partly to explain to the user why the bot’s responses were simplistic, and also because of anecdotal evidence of the therapeutic value of conversations with pets. User feedback suggested that with the incredibly basic design used, this didn’t work. I suspect that with more care, this could be a success, however it would need better design expertise than I had at the time (and I had almost zero design expertise at the time).

Feedback was slow at this stage. Because of a lack of technical expertise, there was no mechanism for gathering feedback (although this is straightforward for any full stack web developer to build). However, a glitch in the process did (accidentally) lead to feedback. The glitch arose from an unforeseen feature of Wix websites. Wix is an online service that allows people to create websites; I had created a Wix website to explain the concept of the bot, which then pointed to another website that I had programmed in JavaScript. However, Wix automatically includes a chat interface in its websites (which doesn't show up in the preview, so I didn't know it existed when I launched the site). This led to confusion -- many users started talking to a chat interface which I hadn't programmed and didn't even know existed! This also led to a number of messages coming through to my inbox. I responded to them; I felt I had to, because many of them expressed great distress. In the course of those conversations, I explained about the chatbot (again, I had to, to clear up the confusion). Although this was not the intention, it meant that some users gave me their thoughts about the website, and some of them did so in a fairly high-quality "qualitative research" way. Once I worked out what was going on, I cleared up the website confusion, and the high-quality feedback stopped.

2019: Another EA (Peter Brietbart) was working on another mental health app; he introduced me to Guided Track. Huge thanks to Peter Brietbart for doing this; it was incredibly useful. I can't overstate how helpful Peter was. Guided Track provided a solution to the problems around gathering feedback, all through a clean, professional-looking front-end, together with an incredibly easy-to-program back-end.

By this stage, the team consisted of me and two other people that I had recruited through my network of Samaritans volunteers.

2020: The team has now grown further, including some non-Samaritan volunteers. My attempts to recruit volunteers who are experts in UX (user experience) and design have not been successful. We also have a stronger network of people with a psychology background and have surpassed 10,000 users.

Appendix 4a: Detailed description of assumptions in cost effectiveness model

This section runs through each of the assumptions in the cost effectiveness model.

(Note that the comparison here is against the standard StrongMinds intervention, not the StrongMinds chatbot.)

Optimistic and pessimistic assumptions:

I did not try to carefully calibrate the probabilities around the optimistic and pessimistic scenarios; however, they probably represent something like a 75%-90% confidence interval for each assumption.

Note further that most of the assumptions are essentially independent of each other, meaning that a scenario in which all of them are simultaneously at their optimistic (or pessimistic) bounds is much more extreme than a 90% confidence interval on the overall result.
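To illustrate: suppose (purely for the sake of the arithmetic) that there are five independent assumptions, each with roughly a 5% chance of landing at or beyond its pessimistic bound. The chance of all five being that extreme at once is tiny:

```python
# Illustrative numbers only -- not the model's exact assumption count.
p_single = 0.05       # chance of one assumption being at/beyond its pessimistic bound
n_assumptions = 5     # number of (roughly) independent assumptions
p_all_extreme = p_single ** n_assumptions
print(p_all_extreme)  # 3.125e-07
```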

Having said that, the main aim of including those scenarios was to highlight the fact that the model is based on a number of uncertain assumptions about the future, and specifically that the uncertainty is sufficient to make the difference between being close to the benchmark and being really quite far from the benchmark (in either direction).

Appendix 4b: PHQ-9 to 1-10 conversion

This bot measures its effectiveness based on a scale from 1 to 10. The comparator (StrongMinds) uses a well-established standard metric, which is PHQ-9. PHQ-9 is described in more detail in a separate appendix.

The calculation used in the base scenario of the model is to take the 13-point movement in the 27-point PHQ-9 scale and simply rescale it.
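On a straightforward reading of "simply rescale it", the arithmetic is roughly as follows (the model's exact formula may differ slightly, e.g. in whether a 1-10 scale is treated as spanning 9 or 10 points):

```python
# A 13-point improvement on the 0-27 PHQ-9 scale, rescaled linearly onto
# the 9-point width of a 1-to-10 scale.
phq9_movement = 13
phq9_range = 27      # PHQ-9 totals run from 0 to 27
target_range = 9     # a 1-to-10 scale spans 9 points
rescaled_movement = phq9_movement * target_range / phq9_range
print(round(rescaled_movement, 2))  # ~4.33
```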

This is overly simplistic. As can be seen from the separate appendix about PHQ-9, PHQ-9 assesses several very different things, and is constructed in a different way, so a straight rescaling isn’t likely to constitute an effective translation between the two measures.

To take another data point, this post by Michael Plant of the Happier Lives Institute suggests that the effect of curing someone's depression appears to be 0.72 on a 1-10 scale, using data that come from Clark et al. The chatbot cost-effectiveness model includes an alternative scenario which uses this conversion between the different impact scales, and it suggests that the chatbot (TIO) outperforms the benchmark by quite a substantial margin under this assumption.

To a certain extent this may seem worrying -- the output figures move quite considerably in response to an uncertain assumption. However I believe I have used a fairly conservative assumption in the base scenario, which may give us some comfort.

Appendix 4c: Time conversion

The cost-effectiveness model compares interventions with different durations.

I raised a question about this on the Effective Altruism, Mental Health, and Happiness Facebook group. (I suspect you may need to be a member of the group to read it.) It prompted some discussion, but no conclusive answers.

In the cost-effectiveness model I’ve used what I call the time-smoothing assumption.

It's possible that the time-smoothing assumption is too harsh on the bot (or equivalently too generous to the longer-term intervention). After all, if someone's depression is improved after the StrongMinds intervention, it seems unrealistic to believe that the patient's wellbeing will be consistently, uninterruptedly improved for a year. That said, it also seems unrealistic to believe that the chatbot will *always* cause the user to feel *consistently* better for a full day (or whichever assumption we are using); however, in a short time period there is less scope for variance.
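To give a feel for what time-smoothing does to the numbers, here is an illustrative calculation in which a short-lived improvement is pro-rated over a year (the figures are made up, and the model's exact treatment may differ):

```python
# Illustrative only: pro-rate a short-lived improvement over a year so it
# can be compared with an intervention whose effect lasts the whole year.
improvement = 1.0       # a 1-point improvement on the 1-10 scale
duration_days = 1       # assumed to last roughly a day
point_years = improvement * duration_days / 365
print(round(point_years, 4))  # ~0.0027 point-years
```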

The cost-effectiveness model includes an alternative scenario which explores this by applying a somewhat arbitrarily chosen factor of 2 to reflect the possibility that the time-smoothing assumption used in the base case is too harsh on TIO.

However my overall conclusion on this section is that the topic is difficult, and remains, to my mind, an area of uncertainty.

Appendix 4d: "Support from Google" scenario

The cost-effectiveness model includes a "Support from Google" scenario.

At the last general meeting of the chatbot team, it was suggested that Google may be willing to support a project such as this, and specifically that this sort of project would appeal to Google sufficiently that they might be willing to support the project above and beyond their existing Google for Nonprofits programme. I have not done any work to verify whether this is realistic.

There is a scenario in the cost-effectiveness model for the project receiving this support from Google, which involves setting the google ads costs to zero.

I would only consider this legitimate if there were sufficiently favourable counterfactuals.

Appendix 4e: Counterfactuals

For impact scores we have gathered thus far, we have assumed that the counterfactual is zero impact.

This section explores this assumption, and concludes that with regard to the current mechanism for reaching users (via Google ads) the assumption might be generous. However, the bot’s ultimate impact is not limited to the current mechanism of reaching users, and given that mental health is, in general, undersupplied, it’s reasonable to believe that other contexts with better counterfactuals may arise.

Here’s how it works at the moment:

To explore this, I have tried googling "I'm feeling depressed", and the search results include advice from the NHS, websites providing advice on depression, and similar. (Note that one complication is that search results are not totally uniform from one user to the next.)

The content of those websites seems fine; however, I've heard comments from people saying things like:

Which suggests that those websites aren’t succeeding in actually making people happier. Hence the zero impact assumption.

However, these impressions are *not* based on careful studies. Specifically, I’m gathering these impressions from people who have ended up talking to me in some role in which I’m providing volunteer support to people who are feeling low. In such a context, there’s a risk of selection effects: maybe the people who found such websites useful get help and therefore don’t end up needing my support?

Some important observations about counterfactuals in the specific context of sourcing users from Google ads:

While the counterfactuals appear to raise some material doubts within the Google ads context, the ultimate impact of the project need not be tied solely to that context.

Appendix 4f: About PHQ-9

This is what a standard PHQ-9 questionnaire looks like:

There are 9 questions, and a score out of 27 is generated by assigning 0 points for each tick in the first column, 1 point for each tick in the second column, 2 points for each tick in the third column and 3 points for each tick in the fourth column, and then adding the scores up.
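In code, the scoring amounts to the following (the item responses shown are hypothetical):

```python
# Each of the 9 PHQ-9 items is scored 0-3 (first column = 0 ... fourth
# column = 3), and the total out of 27 is simply the sum.
responses = [2, 1, 3, 0, 2, 1, 2, 1, 0]  # hypothetical answers to the 9 items
assert len(responses) == 9 and all(0 <= r <= 3 for r in responses)
phq9_total = sum(responses)
print(phq9_total)  # 12 out of a possible 27
```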

It is a standard tool for monitoring the severity of depression and response to treatment.

Appendix 5: Other mental health apps

We are very much not the first people to think of providing mental health help via an app.

A list of 15 or so other apps can be found here.

Types of therapeutic approaches used a lot:

Approaches which don’t seem to come up often:

We appear to be taking an approach which is different from the existing mental health apps.

And yet, if we look at attempts at the Turing test, an early example was a chatbot called Eliza, which was inspired by the Rogerian approach to therapy (which is also the approach closest to the Samaritans listening approach).

So it seemed surprising that people trying to solve the Turing test had tried to employ a Rogerian approach, but people trying to tackle mental health had not.

To our knowledge, we are the first project taking a Rogerian-inspired conversational app and applying it for mental health purposes.

On a related note, this project seems unusual in including a relatively free-flowing conversational interface. While several other apps have a conversational or chatbot-like interface, these bots are normally constructed in a very structured way, meaning that most of the conversation has been predetermined -- i.e. the bot sets the agenda, not the user. And in several apps, there are actually no free-text fields at all.

We speculate that the reason for this is that the more open conversational paradigm was too intimidating for other app developers, who perhaps felt that solving mental health *and* the Turing test at the same time was too ambitious. Our approach is distinctive perhaps because we were inspired by the Samaritans approach, which is relatively close to an MVP of a therapeutic approach.

The fact that our interface is so free-flowing is important: it means that, among these apps, our bot's approach is the closest to actual real-life therapy.


MichaelPlant @ 2020-10-14T19:18 (+9)

Hello Sanjay, thanks both for writing this up and actually having a go at building something! We did discuss this a few months ago but I can't remember all the details of what we discussed.

First, is there a link to the bot so people can see it or use it? I can't see one.

Second, my main question for you -sorry if I asked this before - is: what is the retention for the app? When people ask me about mental health tech, my main worry is not whether it might work if people used it, but whether people do want to use it, given the general rule that people try apps once or twice and then give up on them. If you build something people want to keep using and can provide that service cheaply, this would very likely be highly cost-effective.

I'm not sure it's that useful to create a cost-effectiveness model based on the hypothetical scenario where people use the chatbot: the real challenge is to get people to use it. It's a bit like me pitching a business to venture capitalists saying "if this works, it'll be the next facebook", to which they would say "sure, now tell us why you think it will be the next facebook".

Third, I notice your worst-case scenario is the effect lasts 0.5 years, but I'd expect using a chatbot to only make me feel better for a few minutes or hours, so unless people are using it many times, I'd expect the impact to be slight. Quick maths: a 1 point increase on a 0-10 happiness scale for 1 day is 0.003 happiness life-years.

Sanjay @ 2020-10-14T20:32 (+2)

Thank you very much for taking the time to have a look at this.

(1) For links to the bot, I recommend having a look at the end of Appendix 1a, where I provide links to the bot, but also explain that people who aren't feeling low tend not to behave like real users, so it might be easier to look at one of the videos/recordings that we've made, which show some fictional conversations which are more realistic.

(2) Re retention, we have deliberately avoided measuring this, because we haven't thought through whether that would count as being creepy with users' data. We've also inherited some caution from my Samaritans experience, where we worry about "dependency" (i.e. people reusing the service so often that it almost becomes an addiction). So we have deliberately not tried to encourage reuse, nor measured how often it happens. We do however know that at least some users mention that they will bookmark the site and come back and reuse it. Given the lack of data, the model is pretty cautious in its assumptions -- only 1.5% of users are assumed to reuse the site; everyone else is assumed to use it only once. Also, those users are not assumed to have a better experience, which is also conservative.

I believe your comments about hypotheticals and "this will be the next facebook" are based on a misunderstanding. This model is not based on the "hypothetical" scenario of people using the bot, it's based on the scenario of people using the bot *in the same way the previous 10,000+ users have used the bot*. Thus far we have sourced users through a combination of free and paid-for Google ads, and, as described in Appendix 4a, the assumptions in the model are based on this past experience, adjusted for our expectations of how this will change in the future. The model gives no credit to the other ways that we might source users in the future (e.g. maybe we will aim for better retention, maybe we will source users from other referrals) -- those would be hypothetical scenarios, and since I had no data to base those off, I didn't model them.

(3) I see that there is some confusion about the model, so I've added some links in the model to appendix 4a, so that it's easier for people viewing the model to know where to look to find the explanations.

To respond to the specific points, the worst case scenario does *not* assume that the effect lasts 0.5 years. The worst case scenario assumes that the effect lasts a fraction of a day (i.e. a matter of hours) for exactly 99.9% of users. The remaining 0.1% of users are assumed to like it enough to reuse it for about a couple of weeks and then lose interest.

I very much appreciate you taking the time to have a look and provide comments. So sorry for the misunderstandings, let's hope I've now made the model clear enough that future readers are able to follow it better.

KrisMartens @ 2020-10-15T15:04 (+1)

Interesting idea, great to see such initiatives! My main attempt to contribute something is that I think I disagree about the way you seem to assume that this potentially would 'revolutionise the psychology evidence base'.

Questionable evidence base for underlying therapeutic approach
This bot has departed from many other mental health apps by not using CBT (CBT is commonly used in the mental health app space). Instead it’s based on the approach used by Samaritans. While Samaritans is well-established, the evidence base for the Samaritans approach is not strong, and substantially less strong than CBT. Part of my motivation was to improve the evidence base, and having seen the results thus far, I have more faith in the bot’s approach, although more work to strengthen the evidence base would be valuable

I'm not sure if it's helpful to think in terms of the evidence base of an entire approach, instead of thinking diagnosis- or process-based. I mean, we do know a bit about what works for whom, and what doesn't. One potential risk is assuming that an approach can never be harmful, which it can be.

The bot aims to achieve change in the user’s emotional state by letting the user express what’s on their mind

This is such a potential mechanism, it might be harmful for processes such as worrying or ruminating. If I understand the app correctly, I don't think I would advise it for my patients with generalized anxiety disorder, or with dependent personality traits.

Some therapeutic approaches (like CBT) are closer to being uniform (although, depending on how you implement them, sometimes CBT can be more or less uniform)
Others, like Rogerian or existential therapies, are highly non-uniform -- they don’t have a clear “playbook”

But a lot of Rogerian therapies would exclude quite some cases? Or there is at least a selection bias?

Sanjay @ 2020-10-15T21:33 (+2)

Thank you for your comment Kris.

I'm unclear why you are hesitant about the claim of the potential to revolutionise the psychology evidence base. I wonder if you perhaps inadvertently used a strawman of my argument by only reading the section which you quoted? This was not intended to support the claim about the bot's potential to revolutionise the psychology evidence base.

Instead, it might be more helpful to refer to Appendix 2; I include a heavily abbreviated version here:

The source for much of this section is conversations with existing professional psychiatrists/psychologists.
Currently some psychological interventions are substantially better evidenced than others.
<SNIP>
Part of the aim of this project is to address this in two ways:
(1) Providing a uniform intervention that can be assessed at scale
<SNIP>
(2) Allowing an experimental/scientific approach which could provide an evidence base for therapists
<SNIP>
Crucially, TIO is fundamentally different from other mental health apps -- it has a free-form conversational interface, similar to an actual conversation (unlike other apps which either don’t have any conversational interface at all, or have a fairly restricted/”guided” conversational capability). This means that TIO is uniquely well-positioned to achieve this goal.

To expand on item (2), the idea is that when I, as someone who speaks to people in a therapeutic capacity, choose to say one thing (as opposed to another thing) there is no granular evidence about that specific thing I said. This feels all the more salient when being trained or training others, and dissecting the specific things said in a training role play. These discussions largely operate in an evidence vacuum.

The professionals that I've spoken to thus far have not yet been able to point me to evidence as granular as this.

If you know of any such evidence, please do let me know -- it might help me to spend less time on this project, and I would also find that evidence very useful.

KrisMartens @ 2020-10-16T12:10 (+1)

Thanks for your reply, I hope I'm not wasting your time.

But appendix 2 also seems to imply that the evidence base for CBT is for it as an approach in its entirety. What we think works in a CBT protocol for depression is different from what we think works in a CBT protocol for panic disorder (or OCD, or ...). And there is data on which groups none of those protocols work for.

In CBT that is mainly based on a functional analysis (or assumed processes), and that functional analysis would create the context in which specific things one would or wouldn't say. This also provides context to how you would define 'empathetic responses'.

(There is a paper from 1966 claiming that Rogers probably also used implicit functional analyses to 'decide' to what extent he would or wouldn't reinforce certain (mal)adaptive behaviors, just to show how old this discussion is. The bot might generate very interesting results to contribute to that discussion!)

Would you consider evidence that a specific diagnosis-aimed CBT protocol works better than a general CBT protocol for a specific group as relevant to the claim that there is evidence about which reactions (sentences) would or wouldn't work (for whom)?

So I just can't imagine revolutionizing the evidence base for psychological treatments using a 'uniform' approach (and thus without taking characteristics of the person into account), but maybe I don't get how diverse this bot is. I just interacted a bit with the test version, and it supported my hypothesis about it potentially being (a bit) harmful to certain groups of people. (*edit* you seem to anticipate this by not encouraging re-use). But still great for most people!

Sanjay @ 2020-10-17T16:33 (+3)

Thanks very much Kris, I'm very pleased that you're interested in this enough to write these comments.

And as you're pointing out, I didn't respond to your earlier point about talking about the evidence base for an entire approach, as opposed to (e.g.) an approach applied to a specific diagnosis.

The claim that the "evidence base for CBT" is stronger than the "evidence base for Rogerian therapy" came from psychologists/psychiatrists who were using a bit of a shorthand -- i.e. I think they really mean something like "if we look at the evidence base for CBT as applied to X for lots of values of X, compared to the evidence base for Rogerian therapy as applied to X for lots of values of X, the evidence base for the latter is more likely to have gaps for lots of values of X, and more likely to have poorer quality evidence if it's not totally missing".

It's worth noting that while the current assessment mechanism is the question described in Appendix 1f, this is, as alluded to, not the only question that could be asked, and it's also possible for the bot to incorporate other standard assessment approaches (PHQ9, GAD7, or whatever) and adapt accordingly.

Having said that, I'd say that this on its own doesn't feel revolutionary to me. What really does seem revolutionary is that, with the right scale, I might be able to ask: this client said XYZ to me -- if I had responded with ABC or with DEF, which of those would have got a better response? -- and then test something as granular as that with a non-tiny sample size.