medRxiv

Guidelines For Rigorous Evaluation of Clinical LLMs For Conversational Reasoning

Shreya Johri, Jaehwan Jeong, Benjamin A. Tran, Daniel I. Schlessinger, Shannon Wongvibulsin, Zhuo Ran Cai, Roxana Daneshjou†, Pranav Rajpurkar†
doi: https://doi.org/10.1101/2023.09.12.23295399
Shreya Johri
1 Department of Biomedical Informatics, Harvard Medical School, Boston, United States
Jaehwan Jeong
1 Department of Biomedical Informatics, Harvard Medical School, Boston, United States
4 Department of Computer Science, Stanford University, Stanford, United States
Benjamin A. Tran, MD
5 Medstar Georgetown University Hospital/Washington Hospital Center, Department of Dermatology, Washington, DC, United States
Daniel I. Schlessinger, MD
6 Department of Dermatology, Northwestern University, Chicago, IL, United States
Shannon Wongvibulsin, MD PhD
7 Division of Dermatology, David Geffen School of Medicine at the University of California, Los Angeles, California, United States
Zhuo Ran Cai, MD
3 Department of Dermatology, Stanford University, Stanford, United States
Roxana Daneshjou, MD PhD
2 Department of Biomedical Data Science, Stanford University, Stanford, United States
3 Department of Dermatology, Stanford University, Stanford, United States
Pranav Rajpurkar, PhD
1 Department of Biomedical Informatics, Harvard Medical School, Boston, United States
  • For correspondence: pranav_rajpurkar{at}hms.harvard.edu

Abstract

The integration of Large Language Models (LLMs) like GPT-4 and GPT-3.5 into clinical diagnostics has the potential to transform patient-doctor interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD), a novel approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical exams, CRAFT-MD focuses on natural dialogues, using simulated AI agents to interact with LLMs in a controlled, ethical environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4 and GPT-3.5 in the context of skin diseases. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history taking, and diagnostic accuracy. Based on these findings, we propose a comprehensive set of guidelines for future evaluations of clinical LLMs. These guidelines emphasize realistic doctor-patient conversations, comprehensive history taking, open-ended questioning, and a combination of automated and expert evaluations. The introduction of CRAFT-MD marks a significant advancement in LLM testing, aiming to ensure that these models augment medical practice effectively and ethically.
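
The conversational evaluation loop described in the abstract can be illustrated with a minimal Python sketch. It is not the CRAFT-MD implementation: the class and function names (`CaseVignette`, `PatientAgent`, `ScriptedClinician`, `run_conversation`, `grade`) and the toy case are hypothetical, and both agents are simple scripted stand-ins rather than LLMs. The sketch only shows the general shape of the idea: a clinician model questions a simulated patient over several turns, commits to a diagnosis, and an automated grader compares it with the reference answer.

```python
from dataclasses import dataclass


@dataclass
class CaseVignette:
    """Hypothetical toy case: facts the patient can reveal, plus a reference diagnosis."""
    facts: dict      # keyword in a question -> patient's answer
    diagnosis: str   # reference (ground-truth) label


class PatientAgent:
    """Scripted stand-in for a simulated patient; reveals a fact only when asked about it."""

    def __init__(self, case):
        self.case = case

    def reply(self, question):
        lowered = question.lower()
        for keyword, answer in self.case.facts.items():
            if keyword in lowered:
                return answer
        return "I'm not sure."


class ScriptedClinician:
    """Scripted stand-in for the clinical LLM under test (an assumption for this sketch)."""

    def __init__(self, questions, final_diagnosis):
        self._questions = list(questions)
        self._final = final_diagnosis

    def next_turn(self, history):
        # Ask the remaining scripted questions, then commit to a diagnosis.
        asked = sum(1 for speaker, _ in history if speaker == "clinician")
        if asked < len(self._questions):
            return self._questions[asked]
        return "Final diagnosis: " + self._final


def run_conversation(clinician, patient, max_turns=10):
    """Alternate clinician and patient turns until a diagnosis is given or turns run out."""
    history = []
    for _ in range(max_turns):
        utterance = clinician.next_turn(history)
        history.append(("clinician", utterance))
        if utterance.startswith("Final diagnosis:"):
            break
        history.append(("patient", patient.reply(utterance)))
    return history


def grade(history, case):
    """Automated grading by exact match; expert review would complement this step."""
    _, last = history[-1]
    if not last.startswith("Final diagnosis:"):
        return False
    predicted = last.split(":", 1)[1].strip().lower()
    return predicted == case.diagnosis.lower()


# Toy usage with made-up data, for illustration only.
case = CaseVignette(
    facts={"itch": "Yes, mostly at night.", "how long": "About two weeks."},
    diagnosis="scabies",
)
clinician = ScriptedClinician(
    ["How long have you had the rash?", "Does it itch?"], "scabies"
)
history = run_conversation(clinician, PatientAgent(case))
for speaker, text in history:
    print(f"{speaker}: {text}")
print("correct:", grade(history, case))
```

In a realistic setup the two scripted classes would be replaced by calls to the models being evaluated, and exact-match grading would likely be too strict for free-text diagnoses, which is one reason the abstract pairs automated evaluation with expert evaluation.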

Competing Interest Statement

D.I.S. is the co-founder of FixMySkin Healing Balms, a shareholder in Appiell Inc., and a consultant with LuminDx. R.D. reported receiving personal fees from DWA, personal fees from Pfizer, personal fees from L'Oreal, personal fees from VisualDx, stock options from MDAlgorithms and Revea outside the submitted work, and a patent for TrueImage pending.

Funding Statement

S.J. is supported by the 2023 Quad Fellowship. We acknowledge support in the form of computational credits from the Microsoft Accelerating Foundation Models Research (AFMR) grant.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

  • † These authors share senior authorship: Roxana Daneshjou, Pranav Rajpurkar

  • The manuscript has been revised to include proposed guidelines for comprehensive evaluation of clinical LLMs based on empirical evidence presented in the study.

Data Availability

The data used in the study are available in the following repository: https://github.com/rajpurkarlab/craft-md

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Posted January 23, 2024.
Citation
Guidelines For Rigorous Evaluation of Clinical LLMs For Conversational Reasoning
Shreya Johri, Jaehwan Jeong, Benjamin A. Tran, Daniel I. Schlessinger, Shannon Wongvibulsin, Zhuo Ran Cai, Roxana Daneshjou, Pranav Rajpurkar
medRxiv 2023.09.12.23295399; doi: https://doi.org/10.1101/2023.09.12.23295399


Subject Area

  • Dermatology