medRxiv

Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments

Brendin R Beaulieu-Jones, Sahaj Shah, Margaret T Berrigan, Jayson S Marwaha, Shuo-Lun Lai, Gabriel A Brat
doi: https://doi.org/10.1101/2023.07.16.23292743
Brendin R Beaulieu-Jones
1Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
2Department of Biomedical Informatics, Harvard Medical School, Boston, MA
MD MBA
Sahaj Shah
3Geisinger Commonwealth School of Medicine, Scranton, PA
BS
Margaret T Berrigan
1Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
MD
Jayson S Marwaha
4Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
MD MBI
Shuo-Lun Lai
4Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
MD
Gabriel A Brat
1Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
2Department of Biomedical Informatics, Harvard Medical School, Boston, MA
MD, FACS, MPH
  • For correspondence: gbrat{at}bidmc.harvard.edu

Abstract

Background Artificial intelligence (AI) has the potential to dramatically alter healthcare by enhancing how we diagnose and treat disease. One promising AI model is ChatGPT, a large general-purpose language model trained by OpenAI. The chat interface has shown robust, human-level performance on several professional and academic benchmarks. We sought to probe its performance and stability over time on surgical case questions.

Methods We evaluated the performance of ChatGPT-4 on two surgical knowledge assessments: the Surgical Council on Resident Education (SCORE) and a second commonly used knowledge assessment, referred to as Data-B. Questions were entered in two formats: open-ended and multiple choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and assessed the stability of performance on repeat queries.

Results A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71% and 68% of multiple-choice SCORE and Data-B questions, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained non-obvious insights. Common reasons for inaccurate responses included: inaccurate information in a complex question (n=16, 36.4%); inaccurate information in a fact-based question (n=11, 25.0%); and accurate information with circumstantial discrepancy (n=6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of the questions it had answered inaccurately; the response accuracy changed for 6/16 questions.
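
The error-category percentages above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the authors' analysis. It assumes the three reported percentages share a single denominator (the total number of inaccurate responses, which the abstract does not state) and searches for the totals consistent with the reported counts. The names `reported` and `consistent_totals` are hypothetical and exist only for this illustration.

```python
# Counts and percentages reported in the Results for the three most common
# error categories. The shared denominator is NOT given in the abstract;
# this sketch infers it under the assumption that one exists.
reported = {
    "inaccurate information in a complex question": (16, 36.4),
    "inaccurate information in a fact-based question": (11, 25.0),
    "accurate information with circumstantial discrepancy": (6, 13.6),
}


def consistent_totals(categories, max_total=500):
    """Return every denominator for which each count rounds to its reported percentage."""
    return [
        total
        for total in range(1, max_total + 1)
        if all(
            round(100 * count / total, 1) == percent
            for count, percent in categories.values()
        )
    ]


totals = consistent_totals(reported)
print(totals)

# Repeat-query figures: share of the questions with a varied answer (16)
# whose accuracy also changed (6), taken from the "6/16" in the Results.
changed_share = round(100 * 6 / 16, 1)
print(changed_share)
```

Under that assumption the only consistent denominator is 44 inaccurate responses. This also agrees with the repeat-query figures: 36.4% of 44 is 16 questions, and 6 of those 16 (37.5%) changed in accuracy.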

Conclusion Consistent with prior findings, we demonstrate robust, near- or above-human-level performance of ChatGPT within the surgical domain. Unique to this study, we demonstrate a substantial inconsistency in ChatGPT responses with repeat query. This finding warrants future consideration and presents an opportunity to further train these models to provide safe and consistent responses. Without mental and/or conceptual models, it is unclear whether language models such as ChatGPT would be able to safely assist clinicians in providing care.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

Dr. Beaulieu-Jones is supported by National Library of Medicine/NIH grant [T15LM007092].

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

  • Disclosure Information: Nothing to disclose

Data Availability

All input to the ChatGPT interface and the associated output were recorded. Due to copyright laws, these data are not presented in the current manuscript. However, pending requisite approval from the respective organizations, these data may be shared upon reasonable request.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted July 24, 2023.
Subject Area

  • Surgery