medRxiv

EndoGPT: A Proof-of-concept Large Language Model Based Assistant for the Management of Thyroid Nodules

Meghal Shah, Eric J. Kuo, Jennifer H. Kuo, Shawn Hsu, Catherine McManus, Rachel Liou, James A. Lee, Tejas S. Sathe
doi: https://doi.org/10.1101/2024.05.29.24308002
Affiliation and degree (all authors): Columbia University Irving Medical Center; MD
For correspondence: Meghal Shah, ms5835{at}cumc.columbia.edu

Abstract

Large language models (LLMs) are increasingly being explored for their potential to simulate clinical reasoning. Here, we demonstrate our initial experience using the GPT-4o LLM along with prompt engineering and knowledge retrieval to develop EndoGPT, a clinical decision support tool for the management of thyroid nodules. In a pilot study of 50 cases, EndoGPT demonstrated an 83% concordance rate with expert surgeons’ assessments and plans. The highest concordance was in diagnosis (93%), followed by the need for an operation (82%) and type of operation (69%). This work suggests that LLM-based assistants may play a useful role in assisting clinicians in the future.

Introduction

Although large language models (LLMs) can answer medical questions, their ability to simulate clinical reasoning is still being explored. Recent technical advances allow LLMs to be optimized using prompt engineering and knowledge retrieval from data sources, even without specific fine-tuning.1,2 Here, we describe our implementation of these techniques to prototype an LLM-based clinical decision support tool for the management of thyroid nodules.

Methods

We abstracted deidentified data from clinic notes of patients referred for evaluation of thyroid nodules or thyroid cancer. We built an assistant (EndoGPT) based on the GPT-4o LLM that could ingest these data and output a predicted assessment and plan (A&P). To provide EndoGPT with additional context, we uploaded the 2015 American Thyroid Association Management Guidelines for Thyroid Nodules and Differentiated Thyroid Cancer as a reference.3 EndoGPT could then retrieve relevant components of the guidelines using vector embeddings and similarity search techniques.4 For each patient scenario, we generated five predicted A&Ps and ensembled them into a compound A&P using a second assistant. After pre-testing EndoGPT on 25 patient scenarios, we analyzed its errors, wrote instructions to avoid them, and added these instructions to EndoGPT’s prompt as additional context before testing it on new scenarios (Figure 1).
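The retrieval step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a learned embedding model, the function names are hypothetical, and the guideline chunks are placeholder strings rather than text from the ATA guidelines. It only shows the general idea of ranking guideline passages by cosine similarity to a clinic note.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Hypothetical placeholder passages, NOT quotations from the guidelines.
guideline_chunks = [
    "placeholder passage about fine needle aspiration of a thyroid nodule",
    "placeholder passage about lymph node dissection during thyroid surgery",
    "placeholder passage about follow up ultrasound imaging intervals",
]
note = "Patient referred with a thyroid nodule; fine needle aspiration was performed."

top_chunks = retrieve(note, guideline_chunks, k=1)
print(top_chunks)
```

In a real pipeline the retrieved passages would be placed in the prompt alongside the clinic note; a learned embedding model would replace `embed`, while the ranking logic would stay much the same.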

Figure 1:

We built an LLM-based assistant called EndoGPT. The input to EndoGPT is a deidentified clinic note excluding the expert surgeon’s assessment and plan. EndoGPT was built using the GPT-4o LLM. We generated vector embeddings from the 2015 American Thyroid Association Management Guidelines for Thyroid Nodules and Differentiated Thyroid Cancer and used vector similarity to determine which components of the guidelines would generate the most useful context for the introductory prompt based on the patient scenario. We also provided feedback generated from a pretest of 25 cases. After running the first assistant five times, we provided all five responses to a compounding assistant which took the most commonly appearing components of each and composited them together. We then evaluated the similarity between the expert A&P and the predicted A&P across the domains of (1) diagnosis, (2) the need for an operation, and (3) type of operation.
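The compounding step can be approximated with a simple majority vote. In the study a second LLM assistant performed this step; the sketch below swaps in a deterministic vote over structured fields purely for illustration. The field names and the values (`dx_a`, `op_x`, and so on) are hypothetical labels, not outputs from EndoGPT.

```python
from collections import Counter


def ensemble(predictions):
    """Keep the most common value of each field across repeated runs.

    Assumption: each prediction is a dict with the same keys. Ties go to the
    value that appeared first, which is how Counter.most_common orders them.
    """
    compound = {}
    for field in predictions[0]:
        votes = Counter(prediction[field] for prediction in predictions)
        compound[field] = votes.most_common(1)[0][0]
    return compound


# Five hypothetical runs of the first assistant on one patient scenario.
runs = [
    {"diagnosis": "dx_a", "operate": "yes", "operation": "op_x"},
    {"diagnosis": "dx_a", "operate": "yes", "operation": "op_y"},
    {"diagnosis": "dx_b", "operate": "yes", "operation": "op_x"},
    {"diagnosis": "dx_a", "operate": "no", "operation": "op_x"},
    {"diagnosis": "dx_a", "operate": "yes", "operation": "op_x"},
]

compound = ensemble(runs)
print(compound)
```

A vote like this only works when answers are already normalized into comparable labels. Free-text A&Ps are not, which is one plausible reason to delegate the step to an LLM instead.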

To evaluate EndoGPT, we measured concordance between the expert-generated and the predicted A&Ps across three domains: (1) diagnosis, (2) need for an operation, and (3) type of operation (Figure 1). This study was deemed exempt by the Columbia University Institutional Review Board (Protocol AAAV1151). Our code is available on GitHub.

Results

We tested EndoGPT on 50 patient scenarios and achieved an overall concordance of 83%. EndoGPT agreed with the expert’s diagnosis completely in 44/50 cases and partially in 5/50 cases (93% concordant). Moreover, the assistant agreed with the expert’s need for an operation in 41/50 cases (82% concordant). When the expert recommended surgery (n=36 cases), the assistant agreed with the expert’s choice of operation completely in 24 cases and partially in two cases (69% concordant) (Figure 2). Details on the differences in A&Ps are described in Table S1.
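The reported percentages can be reproduced from the counts above under two assumptions that the text does not state explicitly: a partially concordant response earns half credit, and the overall score pools all judgments across the three domains (50 + 50 + 36) rather than averaging the three domain percentages. A short sketch of that arithmetic:

```python
# Assumption: partial concordance is worth half a point (not stated in the text).
PARTIAL_CREDIT = 0.5


def concordance(complete, partial, total):
    """Fraction of concordant judgments, with partial credit."""
    return (complete + PARTIAL_CREDIT * partial) / total


# (complete, partial, total) counts taken from the Results paragraph.
domains = {
    "diagnosis": (44, 5, 50),
    "need_for_operation": (41, 0, 50),
    "type_of_operation": (24, 2, 36),
}

scores = {name: concordance(*counts) for name, counts in domains.items()}

# Assumption: the overall score pools every judgment across the three domains.
points = sum(c + PARTIAL_CREDIT * p for c, p, _ in domains.values())
judgments = sum(total for _, _, total in domains.values())
overall = points / judgments

for name, score in scores.items():
    print(f"{name}: {score:.0%}")
print(f"overall: {overall:.0%}")
```

Under these assumptions the script gives 93%, 82%, 69%, and 83%, matching the reported figures. A plain average of the three domain percentages would give about 82% instead, so pooling appears to be the more likely method.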

Figure 2:

EndoGPT concordance scores in the domains of diagnosis, need for an operation, type of operation, and overall. When assessing concordance in diagnosis and operation type, we allowed partial credit for partially concordant responses.

Discussion

Our early experience with EndoGPT suggests that surgeons who may not have the technical resources to build their own LLMs can still use general-purpose models like GPT-4o to develop clinical decision support tools. We achieved 83% concordance with expert A&Ps using knowledge retrieval and prompt engineering.

Our model was most concordant when predicting a diagnosis and least concordant when suggesting a specific operation. Recurring discordance involved the type of lymph node dissection (LND) recommended (e.g., EndoGPT did not assign a laterality to central LND) and the recommendation of surgery for benign nodules causing compressive symptoms (rather than performing fine needle aspiration). The latter may have occurred because we gave EndoGPT specific feedback during pretesting to consider surgery for benign, compressive nodules, which highlights the risk of over-prompting the model. Because we tested concordance against a single expert A&P, EndoGPT may in some cases have suggested a safe alternative approach that was scored as discordant. Thus, we may be underestimating EndoGPT’s overall accuracy. In future experiments, a panel of experts could assess EndoGPT’s responses for accuracy.

Though not intended to replace physician evaluation, tools like EndoGPT may help train surgical residents, assist non-specialist providers with initial workup and management, or make technical documents such as guidelines more accessible to patients. Utility will likely be greatest in areas of medicine where clear guidelines already exist. Further studies will be needed to fully optimize this system for patient care.

Data Availability

Our data and code are available on GitHub.

https://github.com/tsathe/endogpt

Supplementary Tables

Table S1:

EndoGPT concordance scores in the domains of diagnosis (Dx), need for an operation (Op?), and type of operation (Type). When EndoGPT achieved a less than perfect score, we explain the areas of discordance. FNA = fine needle aspiration; PTC = papillary thyroid carcinoma; LND = lymph node dissection.

References

  1. Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, and Eric Horvitz. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. November 2023. URL http://arxiv.org/abs/2311.16452.
  2. Tejas S Sathe, Joshua Roshal, Ariana Naaseh, Joseph C L’Huillier, Sergio M Navarro, and Caitlin Silvestri. How I GPT it: Development of custom artificial intelligence (AI) chatbots for surgical education. J. Surg. Educ., 81(6):772–775, June 2024. ISSN 1931-7204, 1878-7452. doi: 10.1016/j.jsurg.2024.03.004. URL http://dx.doi.org/10.1016/j.jsurg.2024.03.004.
  3. Bryan R Haugen, Erik K Alexander, Keith C Bible, Gerard M Doherty, Susan J Mandel, Yuri E Nikiforov, Furio Pacini, Gregory W Randolph, Anna M Sawka, Martin Schlumberger, Kathryn G Schuff, Steven I Sherman, Julie Ann Sosa, David L Steward, R Michael Tuttle, and Leonard Wartofsky. 2015 American Thyroid Association management guidelines for adult patients with thyroid nodules and differentiated thyroid cancer: The American Thyroid Association Guidelines Task Force on Thyroid Nodules and Differentiated Thyroid Cancer. Thyroid, 26(1):1–133, January 2016. ISSN 1050-7256, 1557-9077. doi: 10.1089/thy.2015.0020. URL http://dx.doi.org/10.1089/thy.2015.0020.
  4. Underfitted. Building a RAG application from scratch using Python, LangChain, and the OpenAI API, March 2024. URL https://www.youtube.com/watch?v=BrsocJb-fAo.
Posted May 31, 2024.

Subject Area

  • Surgery