Popular artificial intelligence (AI) model GPT-4 answered 88% of questions correctly on a standardized patient safety exam.
A new study from Boston University highlighted the potential of generative artificial intelligence (AI) to improve patient safety in health care.
Published in The Joint Commission Journal on Quality and Patient Safety, the study tested the widely used AI model GPT-4 on the Certified Professional in Patient Safety (CPPS) exam, where it answered 88% of questions correctly.1 Researchers believe AI could help reduce medical errors, estimated to cause 400,000 deaths annually, by assisting clinicians in identifying and addressing safety risks in hospitals and clinics.
The study marks the first in-depth test of GPT-4’s capabilities in patient safety, focusing on its performance in key domains, including patient safety risks and solutions, measuring performance, and systems thinking. GPT-4 excelled in the risks and solutions domain but showed weaknesses in culture and leadership, especially on questions where multiple answers could be considered correct.
The study authors suggested that AI has promise in helping doctors better recognize, address, and prevent mistakes or accidental harm in hospitals and clinics.
“While more research is needed to fully understand what current AI can do in patient safety, this study shows that AI has some potential to improve healthcare by assisting clinicians in addressing preventable harms,” said Nicholas Cordella, MD, MSc, assistant professor of medicine at Boston University Chobanian & Avedisian School of Medicine and medical director for quality and patient safety at Boston Medical Center.2
However, the study also highlighted critical limitations of current AI technologies, including the risk of bias, fabricated data, and false confidence in responses.1 Although the exact passing score for the CPPS exam is not disclosed, the researchers believe GPT-4's score aligned with the performance of skilled human patient safety practitioners. Notably, GPT-4 displayed high confidence even when it was wrong, rating its certainty as "high" for 5 of the 6 questions it answered incorrectly.
"Our findings suggest that AI has the potential to significantly enhance patient safety, marking an enabling step towards leveraging this technology to reduce preventable harms and achieve better healthcare outcomes,” said Cordella.2 “However, it's important to recognize this as an initial step, and we must rigorously test and refine AI applications to truly benefit patient care.”
Integrating AI into patient care is a growing topic of discussion. In a separate study presented at the European Respiratory Society Congress, researchers found that ChatGPT outperformed trainee doctors in assessing pediatric respiratory diseases, such as cystic fibrosis and asthma.3 Trainee doctors and 3 large language models—ChatGPT version 3.5, Google’s Bard, and Microsoft Bing’s chatbot—gave responses to clinical scenarios that were scored out of 9 based on correctness, comprehensiveness, usefulness, plausibility, and coherence. Trainee doctors and Bing’s chatbot each scored 4 points, Bard scored 6, and ChatGPT scored highest with 7.
Notably, judges believed ChatGPT gave more human-like responses than the other chatbots, though none of the chatbots showed signs of hallucination. Both studies suggest that while AI can greatly assist clinicians, extensive testing and safeguards are needed to ensure the technology's reliability in preventing harm and optimizing care delivery.