medRxiv

Multimodal Large Language Model Passes Specialty Board Examination and Surpasses Human Test-Taker Scores: A Comparative Analysis Examining the Stepwise Impact of Model Prompting Strategies on Performance

Jamil S. Samaan, Samuel Margolis, Nitin Srinivasan, Apoorva Srinivasan, Yee Hui Yeo, Rajsavi Anand, Fadi S. Samaan, James Mirocha, Seyed Amir Ahmad Safavi-Naini, Bara El Kurdi, Ali Soroush, Rabindra Watson, Srinivas Gaddam, Joann G. Elmore, Brennan M.R. Spiegel, Nicholas P. Tatonetti
doi: https://doi.org/10.1101/2024.07.27.24310809
Jamil S. Samaan
1Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048
MD
Samuel Margolis
2David Geffen School of Medicine, University of California, Los Angeles, 10833 Le Conte Ave, Los Angeles, CA 90095
BS
Nitin Srinivasan
3Keck School of Medicine of USC, Health Care Consultation Center, 1510 San Pablo St #514, Los Angeles, CA 90033
BA
Apoorva Srinivasan
4Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, California, USA
MS
Yee Hui Yeo
1Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048
MD, MSc
Rajsavi Anand
1Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048
MD
Fadi S. Samaan
1Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048
MS
James Mirocha
5Cedars-Sinai Medical Center, Biostatistics and Bioinformatics Research Center, Los Angeles, CA, USA
MS
Seyed Amir Ahmad Safavi-Naini
6Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY USA
MD
Bara El Kurdi
7Division of Gastroenterology and Hepatology, Department of Medicine, Virginia Tech Carilion School of Medicine, Roanoke, VA, USA
MD
Ali Soroush
6Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY USA
8Henry D. Janowitz Division of Gastroenterology, Icahn School of Medicine at Mount Sinai, New York, NY USA
MD, MS
Rabindra Watson
1Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048
MD
Srinivas Gaddam
1Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048
MD, MPH
Joann G. Elmore
2David Geffen School of Medicine, University of California, Los Angeles, 10833 Le Conte Ave, Los Angeles, CA 90095
MD, MPH
Brennan M.R. Spiegel
1Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048
MD, MSHS
Nicholas P. Tatonetti
4Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, California, USA
9Cedars-Sinai Cancer, Cedars-Sinai Medical Center, 8700 Beverly Blvd. Los Angeles, CA, USA
10Department of Biomedical Informatics, Columbia University, New York, New York, USA
PhD
  • For correspondence: nicholas.tatonetti{at}cshs.org

ABSTRACT

Background Large language models (LLMs) have shown promise in answering medical licensing examination-style questions. However, there is limited research on the performance of multimodal LLMs on subspecialty medical examinations. Our study benchmarks the performance of multimodal LLMs enhanced by model prompting strategies on gastroenterology subspecialty examination-style questions and examines how these prompting strategies incrementally improve overall performance.

Methods We used the 2022 American College of Gastroenterology (ACG) self-assessment examination (N=300). This test is typically completed by gastroenterology fellows and established gastroenterologists preparing for the gastroenterology subspecialty board examination. We implemented model prompting strategies sequentially: prompt engineering, retrieval-augmented generation (RAG), five-shot learning, and an LLM-powered answer validation revision model (AVRM). GPT-4 and Gemini Pro were tested.
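The sequential layering of strategies described in the Methods can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not describe the authors' implementation, so the names `Question`, `build_prompt`, and `answer_with_review`, the prompt wording, and the modelling of an LLM as a plain callable are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-in: an "LLM" is any callable mapping a prompt to a reply.
LLM = Callable[[str], str]


@dataclass
class Question:
    stem: str
    options: List[str]


def build_prompt(
    q: Question,
    instructions: str = "",                 # prompt engineering
    context: Optional[List[str]] = None,    # passages retrieved for RAG
    examples: Optional[List[str]] = None,   # worked examples (five in five-shot)
) -> str:
    """Assemble a prompt, adding each strategy's material only when supplied."""
    parts = []
    if instructions:
        parts.append(instructions)
    if context:
        parts.append("Reference material:\n" + "\n".join(context))
    if examples:
        parts.append("Worked examples:\n" + "\n\n".join(examples))
    parts.append(q.stem + "\n" + "\n".join(q.options))
    return "\n\n".join(parts)


def answer_with_review(q: Question, llm: LLM, reviewer: LLM, **strategies) -> str:
    """Draft an answer, then let a second model confirm or revise it.

    This mimics the idea of an answer validation revision step; the actual
    AVRM design is not specified in the abstract.
    """
    prompt = build_prompt(q, **strategies)
    draft = llm(prompt)
    review_prompt = (
        prompt
        + "\n\nProposed answer: " + draft
        + "\nConfirm or revise. Reply with the final answer letter."
    )
    return reviewer(review_prompt)
```

Calling `answer_with_review` with only `instructions` corresponds to the prompt-engineering stage, and passing `context` and then `examples` adds the later stages one at a time, which is how a stepwise comparison of this kind could be set up.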

Results Implementing all prompting strategies improved the overall score of GPT-4 from 60.3% to 80.7% and that of Gemini Pro from 48.0% to 54.3%. GPT-4’s score surpassed both the 70% passing threshold and the 75% average human test-taker score, whereas Gemini Pro’s did not. Stratification of questions by difficulty showed that the accuracy of both LLMs mirrored that of human examinees, with higher accuracy on questions that human test-takers also answered correctly more often. Adding the AVRM to prompt engineering, RAG, and five-shot learning increased GPT-4’s accuracy by 4.4%. The incremental addition of model prompting strategies improved accuracy for both non-image (57.2% to 80.4%) and image-based (63.0% to 80.9%) questions for GPT-4, but not for Gemini Pro.
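As a quick check on the headline figures, the short Python sketch below recomputes each model's gain in percentage points and compares the final scores with the two benchmarks. The percentages are taken from the Results above; the dictionary layout and the `summarize` helper are illustrative assumptions, not part of the study.

```python
# Overall scores (percent correct) as reported in the Results.
scores = {
    "GPT-4": {"baseline": 60.3, "all_strategies": 80.7},
    "Gemini Pro": {"baseline": 48.0, "all_strategies": 54.3},
}

PASSING_THRESHOLD = 70.0  # passing threshold stated in the abstract
HUMAN_AVERAGE = 75.0      # average human test-taker score stated in the abstract


def summarize(model: str) -> dict:
    """Return the percentage-point gain and benchmark comparisons for a model."""
    s = scores[model]
    return {
        # Rounded to one decimal to avoid floating-point noise.
        "gain_points": round(s["all_strategies"] - s["baseline"], 1),
        "passes": s["all_strategies"] > PASSING_THRESHOLD,
        "beats_human_average": s["all_strategies"] > HUMAN_AVERAGE,
    }


for name in scores:
    print(name, summarize(name))
```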

Conclusions Our results underscore the value of model prompting strategies in improving LLM performance on subspecialty-level licensing exam questions. We also present a novel implementation of an LLM-powered reviewer model in the context of subspecialty medicine, which further improved model performance when combined with other prompting strategies. Our findings highlight the potential future role of multimodal LLMs, particularly when multiple model prompting strategies are implemented, as clinical decision support systems for healthcare providers in subspecialty care.

Competing Interest Statement

Conflict of Interest: Jamil S. Samaan declares that they have no conflict of interest. Samuel Margolis declares that they have no conflict of interest. Nitin Srinivasan declares that they have no conflict of interest. Yee Hui Yeo declares that they have no conflict of interest. Rajsavi Anand declares that they have no conflict of interest. Fadi S. Samaan declares that they have no conflict of interest. James Mirocha declares that they have no conflict of interest. Seyed Amir Ahmad Safavi-Naini received non-significant financial compensation as an R&D associate from AryaspCo. Bara El Kurdi declares that they have no conflict of interest. Ali Soroush declares that they have no conflict of interest. Rabindra Watson declares that they have no conflict of interest. Srinivas Gaddam declares that they have no conflict of interest. Joann G. Elmore declares that they have no conflict of interest. Brennan M.R. Spiegel declares that they have no conflict of interest. Nicholas P. Tatonetti declares that they have no conflict of interest.

Funding Statement

None

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

2022 American College of Gastroenterology (ACG) self-assessment examination. Available at https://education.gi.org/satest/satest_18

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

2022 American College of Gastroenterology (ACG) self-assessment examination. Available at https://education.gi.org/satest/satest_18


  • Abbreviations

    ChatGPT: Chat Generative Pre-trained Transformer
    LLM: Large language model
    AI: Artificial Intelligence
    USMLE: United States Medical Licensing Examination
    RAG: Retrieval Augmented Generation
    AGA: American Gastroenterological Association
    ASGE: American Society for Gastrointestinal Endoscopy
    AASLD: American Association for the Study of Liver Diseases
    AVRM: Answer Validation Revision Model
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
    Posted July 29, 2024.
    Subject Area

    • Health Informatics