LLMs provide valuable answers to common Mohs surgery questions, supporting postoperative care and patient education without replacing physician guidance.
This article was originally published on Dermatology Times®.
Large language models (LLMs) can provide generally accurate and patient-appropriate information for common postoperative Mohs surgery questions, particularly when guidance is straightforward and standardized, according to a study published in the International Journal of Dermatology.1

Artificial intelligence–based tools, including LLMs, are increasingly used by patients to seek medical information outside the clinical setting. Patients undergoing Mohs micrographic surgery often look online for additional guidance after their procedure, particularly regarding wound care, pain control, bleeding, and how to recognize complications, despite routinely receiving written and verbal postoperative instructions.2
While prior work has examined the use of AI in dermatology education, little is known about how well these tools answer real-world postoperative questions asked by Mohs surgery patients. To address this knowledge gap, a recent study evaluated how accurately, appropriately, and completely several commonly used LLMs respond to typical postoperative concerns.1
Three board-certified Mohs surgeons developed 12 postoperative questions that reflect issues frequently raised by patients following surgery. These questions addressed pain management, wound care, bleeding, activity restrictions, signs of infection, scarring, and postoperative expectations. Each question was submitted separately, in a fresh session to avoid carryover effects, to 3 widely available LLMs: ChatGPT-4o, Gemini 2.0 Flash, and LLaMA 4.
The responses were compiled into a blinded and randomized survey. Eight board-certified Mohs surgeons, including the original 3 authors, reviewed each response. Reviewers rated how well each answer addressed the patient’s concern (sufficiency), whether the information was medically correct (accuracy), and whether it was suitable for a patient-facing setting (appropriateness). Ratings were provided using a 5-point scale, with higher scores indicating better performance. Reviewers could also note specific problems, such as missing information, unclear advice, or incorrect statements.
Across all questions, the LLMs generally produced responses that were medically reasonable and appropriate for patients. However, differences were seen in how complete the answers were. Gemini 2.0 received the highest overall scores for addressing patient concerns, followed by ChatGPT-4o, with LLaMA 4 scoring lowest; these differences were statistically significant.
Researchers found all 3 models performed best on questions with well-established and consistent guidance. Questions about signs of infection, when to call the doctor, and what to do for bleeding received high scores across all models. These topics are typically covered by standard postoperative instructions and rely on widely accepted clinical principles, which likely explains the strong performance.
Accuracy scores were high overall. Gemini 2.0 and ChatGPT-4o performed similarly and scored higher than LLaMA 4. All models were most accurate when answering infection-related questions and questions about expected postoperative sensations, such as numbness. Accuracy tended to be lower for questions involving scarring and stitch management, areas where recommendations can vary between surgeons.
Appropriateness scores were consistently strong, indicating that most responses were written in a way that would be understandable and acceptable for patients. Gemini 2.0 again scored highest overall. Lower appropriateness ratings occurred when responses were either too vague to be helpful or included unnecessary technical detail.
When reviewers flagged problems in the responses, missing information was the most common issue. Nearly half of all identified deficiencies involved omissions, such as failing to mention when a patient should seek medical attention or not acknowledging that recommendations may vary depending on the type of repair. Ambiguous guidance and factual inaccuracies were less common but still present. Unsafe advice was rare, and readability problems were uncommon.
LLaMA 4 accounted for the largest share of missing or unclear information, while inaccuracies were distributed more evenly across all 3 models.
This study highlights that LLMs can provide generally accurate and patient-appropriate information for common postoperative Mohs surgery questions, particularly when guidance is straightforward and standardized. However, responses were often less complete for topics that require individualized recommendations, such as wound care routines, scar management, and activity restrictions. These areas depend on factors like repair type, anatomic location, and surgeon preference, which LLMs cannot reliably account for.
The reviewers showed variability in how they rated responses, reflecting differences in how Mohs surgeons counsel patients rather than a flaw in the study design. This variability reinforces the importance of individualized postoperative communication.
Importantly, researchers noted that the goal of this work is not to endorse LLMs as replacements for surgeon guidance. Many patients already consult AI tools after surgery, particularly those who live far from their treating physician or seek reassurance outside clinic hours. Understanding the strengths and limitations of this information helps clinicians anticipate patient questions and identify areas where standard instructions may benefit from additional clarity.
At present, LLMs may serve only as supplemental educational tools and should not replace direct communication or personalized postoperative care.