Academics at Imperial College London’s Self-Care Academic Research Unit (SCARU), working in collaboration with the Royal College of GPs (RCGP), have discovered it’s almost impossible to benchmark online symptom checkers because doctors themselves can’t agree on what the right triage and diagnosis are for test cases. As a result of these findings, the researchers and the team at Healthily suggest that a new industry standard for measuring the accuracy of symptom checkers should be explored.
The research – Self-care, see a GP or attend A&E?
The team worked to create more than 130 short stories (or “vignettes”) covering 18 medical areas.
Vignettes are medical stories outlining a patient’s symptoms and other relevant information and are traditionally used to test medical students. They have also been the principal way AI symptom checkers have tested their accuracy and assured their safety since their creation 7 years ago.
In the study, the research team initially gave the RCGP vignettes to experienced Imperial College clinicians who read them and gave their opinion on the appropriate “triage” (prioritisation) of the patient. The options were (1) the patient should care for themselves, (2) see a GP or (3) go to A&E. The clinicians also had to decide what the 3 most likely conditions for the patient would be given their story.
The uncertainty of medicine
The study found that while the Imperial doctors agreed most of the time on the self-care conditions, there was only “fair” or “moderate” agreement on whether the person in the other stories should go and see a GP or go to hospital. Overall, they agreed with the RCGP clinicians in roughly three-quarters of cases, disagreeing in about 1 in 4 (26%).
When it came to naming the most probable condition, Imperial’s “independent” panel of doctors agreed with the RCGP’s doctors 72% of the time.
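Labels such as “fair” and “moderate” are commonly attached to bands of a chance-corrected agreement statistic such as Cohen's kappa, using the widely cited Landis and Koch scale. The article does not say which statistic the study used, so the following is only a minimal sketch of how such a figure could be computed for two raters. The function names and the ratings are hypothetical illustrations, not data from the study.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of cases where two raters chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_observed = percent_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label at random,
    # given how often each rater used each label.
    p_chance = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


def landis_koch_band(kappa):
    """Map a kappa value to the Landis and Koch descriptive band."""
    if kappa < 0:
        return "poor"
    for upper, name in ((0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return name
    return "almost perfect"


# Hypothetical triage decisions from two clinicians on ten vignettes.
rater_1 = ["self-care", "self-care", "GP", "GP", "GP",
           "A&E", "A&E", "GP", "self-care", "A&E"]
rater_2 = ["self-care", "self-care", "GP", "A&E", "GP",
           "A&E", "GP", "GP", "self-care", "GP"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
band = landis_koch_band(kappa)
print(f"raw agreement {agreement:.0%}, kappa {kappa:.2f} ({band})")
```

In this made-up example the two raters agree on 7 of 10 cases, yet kappa is about 0.54, which falls in the “moderate” band. This shows how a seemingly high raw agreement can still count as only moderate once chance is taken into account.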
Study lead Dr Austen El-Osta, Director of the Self-Care Academic Research Unit (SCARU) at Imperial College London, says: “This obviously speaks to the elephant in the room. There is no certainty in medicine until you have tested. Both doctors and AI are basing their advice on probability and risk, not diagnostic testing. This contrasts with the real world where diagnostic testing is often needed to drive evidence-based decision making such as by using medical imaging or blood tests for example.”
AI checker safe in more than 9 out of 10 cases
When the medical stories were fed into Healthily, the world’s leading AI symptom checker, it got the correct “triage” 62% of the time and the correct condition 61% of the time.
Imperial SCARU used a combination of medical academics and laypeople with no experience to pretend they were the patients with the symptoms described in the medical vignette.
This is the traditional way AI symptom checkers are evaluated but the researchers found this led to “significant variability”.
Dr El-Osta explained: “Artificial intelligence can ask far more questions than the vignette can anticipate. This means that inputters often put in responses that legitimately change the range of possible triage and condition options. So the AI may not come out with an answer that matches the vignette but it may be appropriate for what was inputted.”
Overall, the research team at Imperial concluded that Healthily was “generally working at a safe level of probable risk” 96.3% of the time. The study showed that Healthily gave a “very unsafe” triage recommendation only 3.7% of the time (e.g. told someone they were able to self-care when they should go to hospital immediately).
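As a rough illustration of how a “very unsafe” rate like this could be tallied, the sketch below treats the three triage options as an ordered scale and flags a recommendation that sits two levels below the reference answer. That rule is an assumption based on the single example given above (self-care advised when hospital was needed); the article does not publish the study's actual scoring criteria, and the names and data here are hypothetical.

```python
# Triage options from the study, ordered by urgency.
LEVEL = {"self-care": 0, "GP": 1, "A&E": 2}


def triage_gap(recommended, reference):
    """Signed distance between the checker's advice and the reference triage.

    Negative values mean the checker advised less urgent care than the reference.
    """
    return LEVEL[recommended] - LEVEL[reference]


def is_very_unsafe(recommended, reference):
    """Assumed rule: under-triage by two levels (self-care instead of A&E)."""
    return triage_gap(recommended, reference) <= -2


def very_unsafe_rate(pairs):
    """Share of (recommended, reference) pairs flagged as very unsafe."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (recommended, reference) pair")
    return sum(is_very_unsafe(rec, ref) for rec, ref in pairs) / len(pairs)


# Hypothetical results for four vignettes: (checker advice, reference triage).
results = [
    ("GP", "GP"),          # match
    ("A&E", "GP"),         # over-triage: cautious, not unsafe
    ("GP", "A&E"),         # under-triage by one level
    ("self-care", "A&E"),  # under-triage by two levels: flagged
]

rate = very_unsafe_rate(results)
print(f"very unsafe in {rate:.0%} of cases")
```

Separating the direction of the error from the mere fact of a mismatch matters here: a checker can disagree with the vignette answer fairly often while still rarely erring in the dangerous direction.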
A study with real-life patients is needed for more accuracy
Imperial SCARU concluded that online symptom checkers could only be truly verified if their performance was cross-checked against scenarios using real patients and interactions with GPs “as opposed to using artificial vignettes.”
Professor Maureen Baker CBE, the Chief Medical Officer for Healthily, said: “We welcome this study.
“By the nature of academic publishing this study has been a long time in gestation. The first findings were shown to Healthily in the summer of 2020 and we have already used the research to focus our quality and improvement efforts.
“We agree with the findings of the report that vignette testing has too many biases to form the basis of an accurate assessment of the appropriateness of the recommendation delivered by symptom checkers.
“The testing could be improved by using real people in real-life situations, for example, by asking patients to use a symptom checker before going to see a doctor (for example, in the waiting room), then comparing the top three suggestions from both the checker and the doctor.”
Dr Austen El-Osta said: “This piece of research started as an accuracy report and became something more far-reaching. We need to rethink the standard of testing for AI symptom checkers in light of this study. Research in this space is really important because the routine use of safe and accurate online symptom checkers has the potential to ‘democratize self-care’ for all, and empower individuals to seek the right level of support when needed.”
Raising the standard
“The current use of vignettes isn’t serving the industry or the consumer. We are keen to continue this work to find an appropriate gold standard of testing that can take account of all the variabilities we uncovered in this study.”
Matteo Berlucchi, Chief Executive of Healthily, said: “This is an important discussion because the future of digital healthcare needs to be based on trust. The public must understand how to use the information they are being provided and trust that it is appropriate.
“This report shows the industry and Healthily how we need to think harder and improve testing.
“We are working with Imperial SCARU and the World Health Organisation (WHO) to tackle this issue and we hope to continue our collaboration with Imperial to bring forward new approaches.”