Expert Warns of AI Pitfalls in Dermatology

HUNTINGTON BEACH, Calif. — When Roxana Daneshjou, MD, PhD, began reviewing responses to an exploratory survey she and her colleagues created on dermatologists’ use of large language models (LLMs) such as ChatGPT in clinical practice, she was both surprised and alarmed.
Of the 134 respondents who completed the survey, 87 (65%) reported using LLMs in a clinical setting. Of those 87 respondents, 17 (20%) used LLMs daily, 28 (32%) weekly, 5 (6%) monthly, and 37 (43%) rarely. That represents “pretty significant usage,” Daneshjou, assistant professor of biomedical data science and dermatology at Stanford University, Palo Alto, California, said at the annual meeting of the Pacific Dermatologic Association.
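As a quick check on the arithmetic, note that the frequency-of-use percentages are computed against the 87 clinical users rather than all 134 respondents. A minimal Python sketch, using only the counts reported above, reproduces the rounded figures:

# Survey counts as reported above; frequency percentages are relative
# to the 87 clinical users, not all 134 respondents.
total, clinical = 134, 87
frequency = {"daily": 17, "weekly": 28, "monthly": 5, "rarely": 37}

assert sum(frequency.values()) == clinical
print(f"clinical use: {clinical / total:.0%}")    # 65%
for label, n in frequency.items():
    print(f"{label}: {n / clinical:.0%}")         # 20%, 32%, 6%, 43%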
Most of the respondents reported using LLMs for patient care (79%), followed by administrative tasks (74%), medical records (43%), and education (18%), “which can be problematic,” she said. “These models are not appropriate to use for patient care.”
When asked about their thoughts on the accuracy of LLMs, 58% of respondents deemed them to be “somewhat accurate” and 7% viewed them as “extremely accurate.”
The overall survey responses raise concern because LLMs “are not trained for accuracy; they are trained initially as a next-word predictor on large bodies of text data,” Daneshjou said. “LLMs are already being implemented but have the potential to cause harm and bias, and I believe they will if we implement them the way things are rolling out right now. I don’t understand why we’re implementing something without any clinical trial or showing that it improves care before we throw untested technology into our healthcare system.”
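To make the “next-word predictor” description concrete, the toy Python sketch below uses an invented four-word vocabulary and made-up logits (a real model computes its logits from billions of parameters) to show the decoding step: scores become a probability distribution, and a likely token is emitted. Nothing in that step checks facts.

import math

# Toy vocabulary and made-up logits for illustration only.
vocab = ["benign", "malignant", "rash", "biopsy"]
logits = [2.1, 0.3, 1.5, 0.9]

# Softmax turns logits into a probability distribution over the vocabulary.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# The model emits a probable next token; plausibility, not verified fact.
next_word = vocab[probs.index(max(probs))]
print({w: round(p, 3) for w, p in zip(vocab, probs)}, "->", next_word)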
Meanwhile, Epic and Microsoft are collaborating to bring AI technology to electronic health records, she said, and Epic is building more than 100 new AI features for physicians and patients. “I think it’s important for every physician and trainee to understand what is going on in the realm of AI,” said Daneshjou, who is an associate editor for the monthly journal NEJM AI. “Be involved in the conversation because we are the clinical experts, and a lot of people making decisions and building tools do not have the clinical expertise.”
To further illustrate her concerns, Daneshjou referenced a red-teaming event she and her colleagues held with computer scientists, biomedical data scientists, engineers, and physicians across multiple specialties to identify safety issues, bias, factual errors, and security problems in GPT-3.5, GPT-4, and GPT-4 with internet. The goal was to mimic clinical health scenarios, ask the LLMs to respond, and have team members review the accuracy of the responses.
The participants found that nearly 20% of LLM responses were inappropriate. For example, in one task, an LLM was asked to calculate a RegiSCAR score for Drug Reaction With Eosinophilia and Systemic Symptoms (DRESS) for a patient, but the response included an incorrect score for eosinophilia. “That’s why these tools can be so dangerous, because you’re reading along and everything seems right, but there might be something so minor that can impact patient care and you might miss it,” Daneshjou said.
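For context, the RegiSCAR validation score sums small integer components, one of which is eosinophilia. A minimal sketch of just that component, using the cutoffs as commonly published (700–1,499 cells/µL for 1 point, ≥1,500 for 2) rather than figures from the red-teaming event, shows how mechanically checkable the rule is and why a silently wrong LLM-generated score is easy to read past:

def eosinophilia_points(eos_per_uL: float) -> int:
    """Eosinophilia component of the RegiSCAR DRESS validation score:
    700-1,499 cells/uL scores 1 point; >=1,500 cells/uL scores 2."""
    if eos_per_uL >= 1500:
        return 2
    if eos_per_uL >= 700:
        return 1
    return 0

# A lookup like this is trivial to verify line by line; an LLM generating
# the same number token by token can be silently wrong.
assert eosinophilia_points(1600) == 2
assert eosinophilia_points(900) == 1
assert eosinophilia_points(400) == 0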
On a related note, she advised against dermatologists uploading images into GPT-4 Vision, an LLM that can analyze images and provide textual responses to questions about them, and recommended against using GPT-4 Vision for any diagnostic support. At this time, “GPT-4 Vision overcalls malignancies, and the specificity and sensitivity are not very good,” she explained.
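Her sensitivity-and-specificity point is easy to see with a hypothetical confusion matrix; the numbers below are invented for illustration and are not measured results for GPT-4 Vision. A model that overcalls malignancy buys sensitivity at the cost of specificity:

# Invented confusion matrix for a classifier that overcalls malignancy;
# illustrative numbers only, not measured results for GPT-4 Vision.
tp, fn = 45, 5    # malignant lesions flagged / missed
fp, tn = 40, 10   # benign lesions flagged as malignant / correctly cleared

sensitivity = tp / (tp + fn)  # share of true malignancies caught
specificity = tn / (tn + fp)  # share of benign lesions correctly cleared

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
# sensitivity=0.90, specificity=0.20: flagging nearly everything as
# malignant is what "overcalling" looks like in these two metrics.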
Daneshjou disclosed that she has served as an advisor to MDalgorithms and Revea and has received consulting fees from Pfizer, L’Oréal, Frazier Healthcare Partners, and DWA and research funding from UCB.
 
