ChatGPT Outperforms Trainee Doctors in Assessing Pediatric Respiratory Illness

September 9, 2024

Hayden E. Klein

Conference|ERS: European Respiratory Society

New findings highlight the potential role of artificial intelligence in supporting health care professionals, but thorough testing is needed before its integration into everyday clinical practice.

New research presented at the European Respiratory Society (ERS) Congress in Vienna, Austria, reveals that ChatGPT could assess complex cases of respiratory disease in children better than trainee doctors.¹

To reach this finding, 6 experts in pediatric respiratory medicine provided 6 clinical scenarios covering conditions that frequently occur in children, such as cystic fibrosis, asthma, sleep-disordered breathing, breathlessness, and chest infections. These scenarios were posed to 10 trainee doctors with less than 4 months of pediatric clinical experience, who were given 1 hour to solve each case using internet resources but not chatbots. The cases did not have an immediately clear diagnosis, and existing guidelines or evidence did not provide a definitive answer.

ChatGPT scored the highest overall, with 7 of 9 points, performing better than trainee doctors | Image credit: Timon – stock.adobe.com

The 6 scenarios were also presented to 3 large language models (LLMs): ChatGPT version 3.5, Google’s Bard, and Microsoft Bing’s chatbot. The 6 experts then scored every response out of 9 based on its correctness, comprehensiveness, usefulness, plausibility, and coherence, and stated whether they thought each response was generated by a human or a chatbot.

Manjith Narayanan, MD, PhD, a consultant in pediatric pulmonology at the Royal Hospital for Children and Young People, presented the study’s findings at the ERS Congress 2024. He noted his motivation for the study was to determine how well LLMs can help clinicians in the real world.²

Trainee doctors scored a median (IQR) of 4 (3-6) points, the same median as Bing at 4 (3-5), while Bard scored higher at 6 (5-7) and outperformed trainee doctors in coherence specifically (P < .05).

ChatGPT scored the highest overall, with a median (IQR) of 7 (6-8.25) of 9 points, and outperformed trainee doctors in all criteria (P < .001). Experts also judged ChatGPT's responses to be more human-like than those of the other chatbots, whose responses they correctly identified as nonhuman.
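For readers less familiar with the median (IQR) notation used in these results, the following is a minimal sketch of how such a summary can be computed. The scores are hypothetical and are not data from the study, and the `summarize` helper is an illustrative name. The study's quartile method is not stated in the source, so the "inclusive" convention used here is an assumption.

```python
from statistics import median, quantiles


def summarize(scores):
    """Return (median, q1, q3) for a list of numeric scores."""
    # quantiles(..., n=4) returns the three quartile cut points: Q1, Q2, Q3.
    # method="inclusive" is an assumption; other quartile conventions
    # can give slightly different values on small samples.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q1, q3


# Hypothetical expert scores out of 9 (illustrative only, not study data).
example_scores = [5, 6, 7, 7, 8, 8, 9]

med, q1, q3 = summarize(example_scores)
print(f"median (IQR): {med} ({q1}-{q3})")
```

The median is the middle score, and the IQR spans the middle half of the scores, so a summary such as 7 (6-8.25) means half of the scores fell between 6 and 8.25.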

Notably, none of the chatbots showed signs of hallucination, a phenomenon where LLMs generate seemingly accurate but false information. However, both the chatbots and the trainee doctors occasionally gave irrelevant responses, and experts should remain aware of the potential for hallucinations.

According to Narayanan, this is the first study to test LLMs against trainee doctors in scenarios reflecting real-life clinical practice. The results suggest artificial intelligence (AI) could play a crucial role in alleviating pressure on health care systems, although more research is needed.

“We have not directly tested how LLMs would work in patient-facing roles,” Narayanan noted. “However, it could be used by triage nurses, trainee doctors, and primary care physicians, who are often the first to review a patient.”

Future studies will focus on comparing chatbot performance with that of more experienced doctors and exploring the capabilities of newer LLMs. The research team is also considering investigating how chatbots can assist with more complex cases and further testing for accuracy and safety in real-world clinical environments.

Hilary Pinnock, MD, chair of the ERS Education Council and professor of primary care respiratory medicine at The University of Edinburgh, called the study “fascinating” while also expressing caution.

“It is encouraging, but maybe also a bit scary, to see how a widely available AI tool like ChatGPT can provide solutions to complex cases of respiratory illness in children,” she said. “It certainly points the way to a brave new world of AI-supported care.”

However, as the researchers highlighted, before these chatbots and other generative AI tools can be implemented in everyday clinical practice, it is crucial to ensure they do not cause errors. Such errors can include fabricated or hallucinated information and can arise when the AI is trained on data that inadequately represent the diverse populations it is meant to serve.

“As the researchers have demonstrated, AI holds out the promise of a new way of working, but we need extensive testing of clinical accuracy and safety, pragmatic assessment of organizational efficiency, and exploration of the societal implications before we embed this technology in routine care,” she added.

As AI continues to advance, this study signals a potential shift in the future of health care, where LLMs could become integrated into the clinical workflow, aiding professionals in delivering faster and more accurate diagnoses. However, the journey toward full adoption will require careful evaluation of clinical accuracy, organizational efficiency, and ethical considerations.

References

1. Juan J, Duverger K, Armstrong D, et al. Clinical scenarios in paediatric pulmonology: can large language models fare better than trainee doctors? Presented at: ERS Congress; September 7-11, 2024; Vienna, Austria. https://k4.ersnet.org/prod/v2/Front/Program/Session?e=549&session=17916

2. ChatGPT outperformed trainee doctors in assessing complex respiratory illness in children. News release. ERS. September 9, 2024. Accessed September 9, 2024. https://www.ersnet.org/news-releases/chatgpt-outperformed-trainee-doctors-in-assessing-complex-respiratory-illness-in-children/