New findings highlight the potential role of artificial intelligence in supporting health care professionals, but thorough testing is needed before its integration into everyday clinical practice.
New research presented at the European Respiratory Society (ERS) Congress in Vienna, Austria, reveals that ChatGPT could assess complex cases of respiratory disease in children better than trainee doctors.1
To reach this finding, 6 experts in pediatric respiratory medicine provided 6 clinical scenarios involving conditions that frequently occur in children, such as cystic fibrosis, asthma, sleep-disordered breathing, breathlessness, and chest infections. None of the cases had an immediately clear diagnosis, and existing guidelines or evidence did not provide a definitive answer. The scenarios were posed to 10 trainee doctors with less than 4 months of pediatric clinical experience, who were given 1 hour to solve each case using internet resources but not chatbots.
The 6 scenarios were also presented to 3 large language models (LLMs): ChatGPT version 3.5, Google’s Bard, and Microsoft Bing’s chatbot. The 6 experts then scored every response out of 9 based on correctness, comprehensiveness, usefulness, plausibility, and coherence, and indicated whether they thought each response had been written by a human or a chatbot.
Manjith Narayanan, MD, PhD, a consultant in pediatric pulmonology at the Royal Hospital for Children and Young People, presented the study’s findings at the ERS Congress 2024. He noted his motivation for the study was to determine how well LLMs can help clinicians in the real world.2
The results were intriguing. Trainee doctors scored a median (IQR) of 4 (3-6) points out of 9, the same median as Bing (IQR, 3-5). Bard scored higher, with a median of 6 (5-7), and outperformed trainee doctors on coherence specifically (P < .05).
ChatGPT scored highest overall, with a median of 7 (IQR, 6-8.25) out of 9, and outperformed trainee doctors on all criteria (P < .001). Experts also judged ChatGPT’s responses to be more human-like than those of the other chatbots; they correctly identified the Bard and Bing responses as nonhuman.
Notably, none of the chatbots showed signs of hallucination, a phenomenon in which LLMs generate seemingly accurate but false information. However, both the chatbots and the trainee doctors occasionally gave irrelevant responses, and experts should remain alert to the potential for hallucinations.
According to Narayanan, this is the first study to test LLMs against trainee doctors in scenarios reflecting real-life clinical practice. Although more research is needed, the results suggest that artificial intelligence (AI) could play a crucial role in alleviating pressure on health care systems.
“We have not directly tested how LLMs would work in patient facing roles,” Narayanan noted. “However, it could be used by triage nurses, trainee doctors, and primary care physicians, who are often the first to review a patient.”
Future studies will focus on comparing chatbot performance with that of more experienced doctors and on exploring the capabilities of newer LLMs. The research team is also considering investigating how chatbots can assist with more complex cases, as well as further testing of their accuracy and safety in real-world clinical environments.
Hilary Pinnock, MD, chair of the ERS Education Council and professor of primary care respiratory medicine at The University of Edinburgh, called the study “fascinating” while also expressing caution.
“It is encouraging, but maybe also a bit scary, to see how a widely available AI tool like ChatGPT can provide solutions to complex cases of respiratory illness in children,” she said. “It certainly points the way to a brave new world of AI-supported care.”
However, as the researchers highlighted, it is crucial to ensure that these chatbots and other generative AI tools will not introduce errors before they are implemented in everyday clinical practice. Such errors can include fabricated, or “hallucinated,” information and can stem from training data that inadequately represent the diverse populations the tools are meant to serve.
“As the researchers have demonstrated, AI holds out the promise of a new way of working, but we need extensive testing of clinical accuracy and safety, pragmatic assessment of organizational efficiency, and exploration of the societal implications before we embed this technology in routine care,” she added.
As AI continues to advance, this study signals a potential shift in health care, in which LLMs could be integrated into the clinical workflow and help professionals deliver faster and more accurate diagnoses. However, the path toward full adoption will require careful evaluation of clinical accuracy, organizational efficiency, and ethical considerations.
References
1. Juan J, Duverger K, Armstrong D, et al. Clinical scenarios in paediatric pulmonology: can large language models fare better than trainee doctors? Presented at: ERS Congress; September 7-11, 2024; Vienna, Austria. https://k4.ersnet.org/prod/v2/Front/Program/Session?e=549&session=17916
2. ChatGPT outperformed trainee doctors in assessing complex respiratory illness in children. News release. ERS. September 9, 2024. Accessed September 9, 2024. https://www.ersnet.org/news-releases/chatgpt-outperformed-trainee-doctors-in-assessing-complex-respiratory-illness-in-children/