Canadian researchers put Bing Chat up against rheumatologists and made patients the judge.
Canadian research suggests we're still some way from having AI conversational agents, or chatbots, that can accurately handle questions from rheumatology patients.
The problem is, patients in the study couldn’t tell the difference between AI- and physician-generated responses.
The team from the University of Alberta had patients and rheumatologists evaluate responses to rheumatology-related questions without telling them that some of the responses were AI-generated and others came from real physicians.
Patients rated the responses similarly, and when told afterwards that some were AI-generated, failed to guess which was which at a rate better than chance.
Rheumatologists, however, rated the AI responses as significantly poorer than the physician responses for comprehensiveness, readability and – importantly – accuracy.
“Previous studies have found ChatGPT-4 responses to patient questions to be of generally good quality, as determined by physician evaluators, although most did not compare these responses to a control set,” wrote the authors, led by rheumatologist Dr Carrie Ye, in Arthritis & Rheumatology.
“Our findings suggest that rheumatology patients may trust LLM-chatbot responses (rated highly for comprehensiveness), despite rheumatologists rating them lower than physician-responses for comprehensiveness and accuracy.
“Most alarmingly, responses that rheumatologists rated low for accuracy were rated high for comprehensiveness by patients.”
The patient-generated questions and physician-generated answers used in the study came from the “Ask The Rheumatologist” page of the Alberta Rheumatology website.
The study team chose 30 of the most recent questions and entered them into Microsoft Bing Chat, which runs on GPT-4 and includes a “more precise” setting that cuts down on the overly lengthy output typical of ChatGPT.
Patients and rheumatologists in the study were blinded to the nature of the trial and weren’t told about the use of AI.
They asked 17 rheumatology patients to rate 10 pairs of answers – physician-generated vs AI-generated – on comprehensiveness and readability, and to pick their preferred answer from each pair.
The average rating for comprehensiveness was 7.12 for AI responses and 7.52 for physician responses (on a 1-10 scale, where 10 is excellent), a non-significant difference. The average readability scores were 7.90 for AI and 7.80 for physician responses, also a non-significant difference. Preference was roughly equal between the AI-generated (45%) and physician-generated responses (55%).
They also had four rheumatologists rate all 30 pairs on readability, comprehensiveness and accuracy, and the picture was quite different.
AI-generated responses were rated significantly lower than physician responses for comprehensiveness (5.52 vs 8.76), readability (7.85 vs 8.75) and accuracy (6.48 vs 9.08). Rheumatologists preferred the AI-generated response in only 15% of pairs.
There was no relationship between accuracy, as judged by physicians, and patients’ preferences, ratings for comprehensiveness or ratings for readability.
The “more precise” Bing Chat setting used in this study did its job – AI responses averaged 69 words, compared with 99 words for physician-generated responses. The authors suggested this brevity may have contributed to rheumatologists rating the AI responses much lower than the physician responses for comprehensiveness.
After rating the responses, the patients and rheumatologists were told that one answer in each pair was AI-generated and the other physician-generated, and were asked to guess which was which.
Patients guessed correctly 49% of the time, compared with 97% for physicians.
While acknowledging the need to reduce the burden that services such as “Ask The Rheumatologist” place on clinicians, the authors said it remains uncertain whether large language model (LLM) chatbots can reliably answer patient questions on their own. However, they suggested that an LLM specifically trained on rheumatology questions may perform better than the off-the-shelf model they tested.
“Until LLM-chatbots can perform on par with the current gold standard, rheumatology patients should exercise caution in using these AI tools for addressing their medical questions,” they concluded.