Salmi L, Lewis DM, Clarke JL, Dong Z, Fischmann R, McIntosh EI, Sarabu CR, DesRoches CM
doi: https://doi.org/10.1093/jamiaopen/ooaf021
Objectives
The use of large language models (LLMs) is growing among both clinicians and patients. While researchers and clinicians have explored LLMs to manage patient portal messages and reduce burnout, there is less documentation of how patients use these tools to understand clinical notes and inform decision-making. This proof-of-concept study examined the reliability and accuracy of LLMs in responding to patient queries based on an open visit note.
Materials and Methods
In a cross-sectional proof-of-concept study, 3 commercially available LLMs (ChatGPT 4o, Claude 3 Opus, Gemini 1.5) were evaluated using 4 distinct prompt series (Standard, Randomized, Persona, and Randomized Persona) with multiple patient-designed questions, posed in response to a single neuro-oncology progress note. LLM responses were scored by the note author (a neuro-oncologist) and a patient who receives care from the note author, using an 8-criterion rubric that assessed Accuracy, Relevance, Clarity, Actionability, Empathy/Tone, Completeness, Evidence, and Consistency. Descriptive statistics were used to summarize the performance of each LLM across all prompts.
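As an illustration of how rubric scores can be summarized descriptively by LLM and prompt series, the following is a minimal sketch; it is not the authors' analysis code, and all data values, scales, and field names shown are hypothetical.

```python
# Minimal sketch: descriptive statistics over rubric scores grouped by LLM and prompt series.
# All score values below are hypothetical placeholders, not study data.
from statistics import mean
from collections import defaultdict

CRITERIA = ["Accuracy", "Relevance", "Clarity", "Actionability",
            "Empathy/Tone", "Completeness", "Evidence", "Consistency"]

# Each record: (LLM, prompt series, {criterion: score}) -- illustrative values only.
scores = [
    ("ChatGPT 4o", "Persona", {c: 4 for c in CRITERIA}),
    ("ChatGPT 4o", "Standard", {c: 3 for c in CRITERIA}),
    ("Claude 3 Opus", "Persona", {c: 3 for c in CRITERIA}),
]

# Group scores by (LLM, prompt series) and criterion, then report the mean per criterion.
grouped = defaultdict(lambda: defaultdict(list))
for llm, series, rubric in scores:
    for criterion, value in rubric.items():
        grouped[(llm, series)][criterion].append(value)

for (llm, series), by_criterion in grouped.items():
    summary = {c: round(mean(v), 2) for c, v in by_criterion.items()}
    print(llm, series, summary)
```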
Results
Overall, the Standard and Persona-based prompt series yielded the best results across all criteria regardless of LLM. ChatGPT 4o using Persona-based prompts scored highest in all categories. All LLMs scored low on the Evidence criterion.
Discussion
This proof-of-concept study highlighted the potential for LLMs to assist patients in interpreting open notes. The most effective LLM responses were elicited by applying Persona-based prompts to a patient's question.
Conclusion
Optimizing LLMs for patient-driven queries, together with patient education and counseling on the use of LLMs, has the potential to enhance patients' use and understanding of their health information.