GPT-4V(ision) Unsuitable for Clinical Care and Education: A Clinician-Evaluated Assessment
==========================================================================================

* Senthujan Senkaiahliyan M.
* Augustin Toma
* Jun Ma
* An-Wen Chan
* Andrew Ha
* Kevin R. An
* Hrishikesh Suresh
* Barry Rubin
* Bo Wang

## Abstract

OpenAI’s large multimodal model, GPT-4V(ision), was recently developed for general image interpretation. However, less is known about its capabilities in medical image interpretation and diagnosis. Board-certified physicians and senior residents assessed GPT-4V’s proficiency across a range of medical conditions using imaging modalities such as CT scans, MRIs, ECGs, and clinical photographs. Although GPT-4V is able to identify and explain medical images, its diagnostic accuracy and clinical decision-making abilities are poor, posing risks to patient safety. Despite the potential that large language models may have in enhancing medical education and delivery, the current limitations of GPT-4V in interpreting medical images reinforce the importance of appropriate caution when using it for clinical decision-making.

## 1. Introducing GPT-4V(ision)

Over the past year, large language models (LLMs) have demonstrated an impressive ability to perform numerous language-based tasks. They have shown capability in analyzing text, discerning patterns, and establishing connections between words [1]. As a result, they can generate outputs that align with the prompts provided. While LLMs have shown strong performance in expert-level medical question answering, they are still unable to outperform their clinician counterparts, especially in scenarios that require reasoning capabilities [2].

GPT-4 with vision (GPT-4V) is OpenAI’s first large multimodal model with the ability to accept image input alongside text [3]. Multimodal learning is the ability of machine learning models to be trained on, and to accept, multiple forms of input data.
Such models have the potential to enhance the breadth and depth of tasks that LLMs can perform across various medical disciplines [4].

To evaluate GPT-4V’s proficiency in analyzing medical images, we conducted an evaluation in which senior residents and board-certified physicians assessed its ability to accurately interpret various medical conditions and to provide accurate and useful information regarding their diagnosis and management. The study aimed to assess whether GPT-4V could not only interpret medical images but also provide valuable information for diagnosis, management, and education. Finally, we aimed to evaluate whether the resulting outputs align with the safety standards for patient care.

## 2. Data Collection

### 2.1 General Conditions

In the data collection phase, a diverse set of multimodal medical images was gathered to assess the performance of GPT-4V across various medical scenarios and specialties. The breakdown of multimodal images is presented in Table 1, showing the different modalities and their respective counts. These images were sourced from open-source libraries and repositories found on the internet.

[TABLE 1](http://medrxiv.org/content/early/2023/11/16/2023.11.15.23298575/T1) Breakdown of Multimodal Images.

### 2.2 Cardiology

The dataset was a set of ECG waveforms sourced from ECG Wave-Maven: A Self-Assessment Program for Students and Clinicians (footnote 1). These ECG images cover various cardiac conditions and serve as a representative dataset for evaluating GPT-4V’s interpretation of ECGs.

### 2.3 Dermatology

In dermatology, clinical photos were collected from the Hellenic Dermatological Atlas (footnote 2) to curate a comprehensive set of dermatological conditions for assessing GPT-4V’s interpretation performance.

## 3. Experimental Setup

The methodology employed for this comprehensive evaluation followed a structured four-phase approach.
### 3.1 Dataset Curation

A diverse range of medical images and corresponding labels was selected from public datasets, encompassing various diagnostic modalities such as patient clinical photos, radiological images, ECG traces, EEG, fundoscopy, endoscopy, and colonoscopy. GPT-4V analyzed these images based on the prompts. The combined prompts, images, and model outputs were captured as screenshots and placed on the evaluation platform for assessment.

### 3.2 Evaluation Criteria

A dual approach was adopted to assess the accuracy and reliability of GPT-4V’s interpretations. All images were evaluated by two senior surgical residents (K.R.A., H.S.) and a board-certified internal medicine physician (A.T.). ECGs and clinical photos of dermatologic conditions were additionally evaluated by a board-certified cardiac electrophysiologist (A.H.) and dermatologist (A.C.), respectively. The questionnaires used for the evaluation are listed below.

#### General Conditions (Diverse Modalities)

* 1) Rate the answer from 1-5.
* 2) Rate from 1-5 how comfortable you would be letting a medical student rely on this content to help learning.
* 3) Was the image interpreted correctly? (Yes/No)
* 4) Was the advice correct? (Yes/No)
* 5) Was the advice given dangerous? (Yes/No)

#### Cardiology (ECGs)

* 1) Rate the overall interpretation of the ECG (1-5).
* 2) Compared to a standard automated read of an ECG, would you consider this interpretation more competent? (Yes/No)
* 3) Rate from 1-5 how comfortable you would be letting a medical student rely on this content to help learning.
* 4) Would this interpretation be helpful in a medical student’s learning?
(Yes/No)
* 5) General Comments:

#### Dermatology (Clinical Photos)

* 1) Rate the quality of the layman’s description of the rash (1-5)
* 2) Rate the quality of the medical description of the rash (1-5)
* 3) Rate the quality of the differential diagnosis (1-5)
* 4) General Comments

### 3.3 Evaluation Platform

Figure 1 provides a visual representation of the evaluation platform, which was designed as a web-based form.

* **Left Side:** Displayed the image, the prompts and GPT-4V’s outputs as screenshots, and the correct label for the image.
* **Right Side:** Comprised a section for evaluator feedback, including the questions from the evaluation criteria.

[Fig. 1.](http://medrxiv.org/content/early/2023/11/16/2023.11.15.23298575/F1) Evaluation platform to collect clinician feedback on GPT-4V’s output.

[Fig. 2.](http://medrxiv.org/content/early/2023/11/16/2023.11.15.23298575/F2) Evaluation of GPT-4V’s Interpretations on Medical Images with Expert Feedback

[Fig. 3.](http://medrxiv.org/content/early/2023/11/16/2023.11.15.23298575/F3) Case Study 1-MRI

[Fig. 4.](http://medrxiv.org/content/early/2023/11/16/2023.11.15.23298575/F4) Case Study 2-CT

### 3.4 Prompts

The prompts given to GPT-4V to generate outputs are listed below for each category.

#### General Conditions (Diverse Modalities)

* “You are an expert [insert specialty i.e. radiology, cardiology] tutor.
Explain to a medical student what this patient photo indicates.”
* “If indicators like arrows, asterisks, or circles are present in the image, underline and expand on their significance.”

#### Cardiology (ECGs)

* “You are an expert cardiology tutor assisting a medical student. Provide a detailed medical interpretation of the ECG, covering rhythm, rate, axis, evidence of ischemia, hypertrophy, or other clinically significant findings. Finally, list a differential diagnosis based on the ECG findings.”

#### Dermatology (Clinical Photos)

* “You are an expert dermatology tutor helping a medical student. Describe the rash seen in the photo in layman’s terms. Next, describe it using medical terminology. Finally, list a differential diagnosis for the given image.”

## 4. Results

### 4.1 Performance on Multimodal Images

For multimodal images (Table 2), a total of 69 images were assessed. Several images were accompanied by multiple prompts, each of which was assessed separately. The correct diagnostic labels for all of these images were provided to the clinician evaluators to ensure accuracy in assessment. Clinician evaluators were asked to identify whether GPT-4V correctly interpreted the images and whether they felt that the interpretation given was correct and safe for patient care. The clinicians’ average comfort level with letting medical students rely on this content for learning was 1.8 ± 1.4 on a scale of 1-5. Out of the 69 images, only 15 were correctly interpreted with the correct advice. In addition, dangerous advice was provided in a concerning number of instances (30 out of 69). The images spanned various modalities (Table 1), including CT scans of various body parts, ECG, MRI, CXR, and others.

[TABLE 2](http://medrxiv.org/content/early/2023/11/16/2023.11.15.23298575/T2) Multimodal Images Summary of Results.

### 4.2 Performance on Electrocardiograms (Cardiology)

For ECG images (Table 3), 24 images were examined.
The overall interpretation of these images had an average rating of 2.25 ± 1.07 out of 5. Notably, none of these interpretations matched the competence of standard automated ECG reads, as determined by the cardiac electrophysiologist. Of the 24 responses, only 3 were considered helpful for medical student learning, and in 9 cases dangerous advice for patient care was given.

[TABLE 3](http://medrxiv.org/content/early/2023/11/16/2023.11.15.23298575/T3) ECG Summary of Results

### 4.3 Performance on Clinical Photos (Dermatology)

For dermatology images (Table 4), 49 images were assessed. The average quality of the layman’s description of the rash was 3 ± 1.55 out of 5. The medical descriptions and differential diagnoses of the rash averaged 2.5 ± 1.49 and 2 ± 1.46 out of 5, respectively. The comfort level with using GPT-4V as an educational tool for medical students averaged 2 ± 1.4 out of 5. In addition, the dermatologist described the differential diagnoses as lacking depth and containing inaccuracies or irrelevant conditions.

[TABLE 4](http://medrxiv.org/content/early/2023/11/16/2023.11.15.23298575/T4) Clinical Photos Summary of Results

Figure 5 highlights direct examples of GPT-4V responses to images used in the evaluation, along with clinician comments. For both cases highlighted, clinician comments indicate that GPT-4V provided inaccurate advice that could impact patient care.

[Fig. 5.](http://medrxiv.org/content/early/2023/11/16/2023.11.15.23298575/F5) Case Study 3-ECG

## 5. Discussion and Limitations

While GPT-4V demonstrates moderate proficiency in processing diverse medical imaging modalities and identifying specific features, the model occasionally fails to recognize overt findings.
In addition, the public-facing version of GPT-4V is subject to alignment efforts that discourage it from giving explicit directives, which may have affected its performance on certain medical tasks.

This evaluation of GPT-4V is not without limitations. First, we used publicly available images, which may have been part of the model’s training datasets and should, in theory, have improved its performance. Yet GPT-4V’s performance, even with these images, was poor. This raises concerns about the depth and diversity of its training dataset. Second, because we provided GPT-4V with standalone images devoid of a broader clinical context, we expected clinicians to consider this aspect when evaluating the model’s efficacy. Diagnoses are not formed solely on the basis of a single picture, and GPT-4V’s output, produced in the absence of patient history, should be evaluated with this in mind.

The most glaring concern lies in the model’s accuracy, particularly with ECG interpretations. Instances where GPT-4V misinterprets severe conditions as benign pose a significant risk to patient care. Without insight into the training datasets, a comprehensive evaluation will need to be conducted to uncover any harms from misrepresentation or potential bias.

Our evaluation of GPT-4V’s performance suggests that developers of proprietary LLMs should strongly consider aligning with open-source principles. This is particularly crucial as many healthcare institutions are exploring collaborations to deploy such models in clinical and operational environments [5]. The United States Department of Health and Human Services is spearheading initiatives in this area, emphasizing the necessity for diverse and representative training data to ensure the ethical application of AI [6].
While LLMs have shown the capability to tailor their responses based on user input and changing contexts, our assessment was conducted during GPT-4V’s initial selective release. Since then, it appears that guardrails have been implemented to ensure that responses related to medical images remain generalized and descriptive rather than prescriptive.

Newer LLMs are being designed to address specific challenges within the medical field. One example is Clinical Camel, a model that has been fine-tuned on medical datasets to significantly improve its performance on clinical inquiries, surpassing the capabilities of its pre-trained base model [7]. With these developments, there is untapped potential for these models to become multimodal, offering a chance to develop comprehensive tools that support healthcare professionals, provided they undergo thorough evaluation and validation in real-world clinical settings.

Considering the enthusiasm around LLMs and the suggestion that they will revolutionize the medical sphere, in our view GPT-4V’s current performance does not support those claims. Our human evaluation supports the advice of healthcare regulatory bodies, and of OpenAI itself, against using it as a substitute for clinician-based decision making [3]. While GPT-4V’s functionality as a multimodal foundation model, capable of processing both text and image inputs, is noteworthy, significant concerns remain in its current form regarding its diagnostic accuracy and its ability to interpret various medical image modalities.

## Data Availability

All data produced in the present work are contained in the manuscript.

## Supplementary Notes

Below are additional case studies from the evaluation highlighting examples of GPT-4V’s output and comments from the evaluators.
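
Separately from the case studies, the headline counts in Section 4 can be restated as proportions. The following is a minimal sketch rather than part of the study's analysis: it uses only the counts reported in Sections 4.1 and 4.2, and the Wilson score interval is an illustrative addition that the study does not report. The names `REPORTED_COUNTS`, `wilson_interval`, and `summarize` are hypothetical.

```python
from math import sqrt

# Counts as reported in Sections 4.1 and 4.2: (events, total assessed).
REPORTED_COUNTS = {
    "multimodal: correct interpretation and advice": (15, 69),
    "multimodal: dangerous advice": (30, 69),
    "ECG: helpful for student learning": (3, 24),
    "ECG: dangerous advice": (9, 24),
}


def wilson_interval(events, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumption: each assessed image is treated as an independent trial,
    which the study itself does not claim. z=1.96 gives roughly 95% coverage.
    """
    if total <= 0 or not 0 <= events <= total:
        raise ValueError("need 0 <= events <= total and total > 0")
    p = events / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def summarize(counts):
    """Map each label to (proportion, interval_low, interval_high)."""
    rows = {}
    for label, (events, total) in counts.items():
        low, high = wilson_interval(events, total)
        rows[label] = (events / total, low, high)
    return rows


if __name__ == "__main__":
    for label, (p, low, high) in summarize(REPORTED_COUNTS).items():
        print(f"{label}: {p:.1%} (interval {low:.1%} to {high:.1%})")
```

With samples this small (69 and 24 images), the intervals are wide, which is worth keeping in mind when comparing the rates across modalities.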
## Footnotes

* 1 [https://ecg.bidmc.harvard.edu/maven/mavenmain.asp](https://ecg.bidmc.harvard.edu/maven/mavenmain.asp)
* 2 [http://www.hellenicdermatlas.com/en/](http://www.hellenicdermatlas.com/en/)
* Received November 15, 2023.
* Revision received November 15, 2023.
* Accepted November 16, 2023.
* © 2023, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at [http://creativecommons.org/licenses/by-nd/4.0/](http://creativecommons.org/licenses/by-nd/4.0/)

## References

1. A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, and D. S. W. Ting, “Large language models in medicine,” Nature Medicine, vol. 29, no. 8, pp. 1930–1940, 2023.
2. K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl et al., “Large language models encode clinical knowledge,” Nature, vol. 620, no. 7972, pp. 172–180, 2023.
3. OpenAI, “GPT-4V(ision) system card,” 2023.
4. J. N. Acosta, G. J. Falcone, P. Rajpurkar, and E. J. Topol, “Multimodal biomedical AI,” Nature Medicine, vol. 28, no. 9, pp. 1773–1784, 2022. doi:10.1038/s41591-022-01981-2
5. A. J. Nashwan, A. A. AbuJaber, and A. AbuJaber, “Harnessing the power of large language models (LLMs) for electronic health records (EHRs) optimization,” Cureus, vol. 15, no. 7, 2023.
6. B. Meskó and E. J. Topol, “The imperative for regulatory oversight of large language models (or generative AI) in healthcare,” npj Digital Medicine, vol. 6, no. 1, p. 120, 2023.
7. A. Toma, P. R. Lawler, J. Ba, R. G. Krishnan, B. B. Rubin, and B. Wang, “Clinical Camel: An open-source expert-level medical language model with dialogue-based knowledge encoding,” arXiv preprint arXiv:2305.12031, 2023.