Evaluating the Performance of ChatGPT-4o Vision Capabilities on Image-Based USMLE Step 1, Step 2, and Step 3 Examination Questions
==================================================================================================================================

* Avi A. Gajjar
* Harshitha Valluri
* Tarun Prabhala
* Amanda Custozzo
* Alan S Boulos
* John C. Dalfino
* Nicholas C. Field
* Alexandra R. Paul

## ABSTRACT

**Introduction** Artificial intelligence (AI) has significant potential in medicine, especially in diagnostics and education. ChatGPT has achieved levels comparable to medical students on text-based USMLE questions, yet there is a gap in its evaluation on image-based questions.

**Methods** This study evaluated ChatGPT-4’s performance on image-based questions from USMLE Step 1, Step 2, and Step 3. A total of 376 questions, including 54 image-based, were tested using an image-captioning system to generate descriptions for the images.

**Results** The overall performance of ChatGPT-4 on USMLE Steps 1, 2, and 3 was evaluated using 376 questions, including 54 with images. The accuracy was 85.7% for Step 1, 92.5% for Step 2, and 86.9% for Step 3. For image-based questions, the accuracy was 70.8% for Step 1, 92.9% for Step 2, and 62.5% for Step 3. Text-based accuracy was 89.5% for Step 1, 92.5% for Step 2, and 90.1% for Step 3, which is higher than image-based accuracy in Steps 1 and 3. Performance dropped significantly for difficult image-based questions in Steps 1 and 3 (p=0.0196 and p=0.0020 respectively), but not in Step 2 (p=0.9574). Despite these challenges, the AI’s accuracy on image-based questions exceeded the passing rate for all three exams.

**Conclusions** ChatGPT-4 can handle image-based USMLE questions above the passing rate, showing promise for its use in medical education and diagnostics. Further development is needed to improve its direct image processing capabilities and overall performance.

Keywords

* Artificial Intelligence
* ChatGPT
* USMLE
* Medical Education
* Image-Based Questions
* Natural Language Processing
* Diagnostics

## INTRODUCTION

Artificial intelligence (AI) has emerged as a transformative technology in medicine, offering significant potential to enhance clinical practice, diagnostics, and medical education.1,2 The deployment of AI models in various domains of healthcare has demonstrated promising results, particularly in the realm of natural language processing (NLP) for medical question answering. Among these, large language models such as Chat Generative Pre-trained Transformer (ChatGPT) have shown considerable success by achieving performance levels comparable to medical students on standardized exams like the United States Medical Licensing Examination (USMLE).3–5

Numerous studies have explored the capabilities of AI in handling medical board examinations. These investigations have primarily focused on text-based questions, consistently demonstrating that AI can provide accurate and contextually relevant answers.6–8 For example, recent research indicates that ChatGPT surpasses previous models in logical justification and coherence when responding to these exams’ questions.1,9 However, a critical limitation of these studies is the exclusion of image-based questions, which form an integral part of medical education and clinical assessment. Medical licensing exams frequently include visual data such as radiographs, histopathology slides, and clinical photographs to test competency in interpreting diagnostic images.6 To date, no studies have comprehensively evaluated AI models’ performance on these image-based questions, leaving a significant gap in understanding the full capabilities and limitations of AI in the medical field.

This study aims to address this gap by investigating whether AI can handle images in addition to text, thereby providing a more comprehensive evaluation of its potential in medical applications.
Specifically, we will assess the performance of AI models on image-based questions from a standardized medical question bank.

## METHODS

### Study Design

This study evaluates the performance of ChatGPT-4, specifically its ChatGPT-4o version, on USMLE image-based questions. The objective is to determine whether the AI can effectively interpret and respond to image-embedded queries, thereby exploring the potential for multimodal AI applications in medical education. We used a comprehensive set of test questions to evaluate AI performance on the USMLE Step 1, Step 2, and Step 3 exams. These questions were sourced from a joint program of the Federation of State Medical Boards of the United States, Inc. (FSMB), and the National Board of Medical Examiners® (NBME®).10–12

### Data Preparation

To prepare each image-based question for input into ChatGPT, we employed an image-captioning system due to ChatGPT’s current limitations in directly processing visual data. This involved generating captions for the images using convolutional neural networks (CNNs) trained to produce medical image descriptions. These captions were then integrated into the question prompts, accompanied by the original question text and multiple-choice options.

### Prompt Engineering

Prompt engineering was standardized to ensure consistency in AI responses. Each question was formatted to include an image description, followed by the original question text and multiple-choice options. For example, the prompt might read: “The following image shows a chest radiograph with bilateral patchy infiltrates. Based on the information provided in the image, answer the following question: What is the most likely diagnosis? A) Tuberculosis B) Pneumonia C) Lung cancer D) Pulmonary embolism E) Sarcoidosis.”

The testing of the ChatGPT-4o version involved manually entering each question into the ChatGPT interface as a new prompt, ensuring no memory carryover between successive questions.
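
The prompt format described above can be illustrated with a short Python sketch. It is a minimal example and not the tooling used in the study, where prompts were entered manually into the ChatGPT interface. The function name `build_prompt` and its parameters are hypothetical, and the caption is assumed to come from the separate image-captioning step.

```python
from string import ascii_uppercase
from typing import Sequence


def build_prompt(caption: str, question: str, options: Sequence[str]) -> str:
    """Assemble one prompt: image caption, then question stem, then lettered options.

    Hypothetical helper for illustration only; it mirrors the example prompt
    quoted in the Prompt Engineering section.
    """
    if not options or len(options) > len(ascii_uppercase):
        raise ValueError("expected between 1 and 26 answer options")

    # Label the options A), B), C), ... in the order they were given.
    labelled = " ".join(
        f"{letter}) {text}" for letter, text in zip(ascii_uppercase, options)
    )

    # Normalize the caption so the sentence ends with exactly one period.
    caption_text = caption.strip().rstrip(".")

    return (
        f"The following image shows {caption_text}. "
        "Based on the information provided in the image, "
        f"answer the following question: {question.strip()} {labelled}"
    )


example = build_prompt(
    caption="a chest radiograph with bilateral patchy infiltrates",
    question="What is the most likely diagnosis?",
    options=[
        "Tuberculosis",
        "Pneumonia",
        "Lung cancer",
        "Pulmonary embolism",
        "Sarcoidosis",
    ],
)
print(example)
```

Building every prompt through one function keeps the wording identical across questions, which is the consistency the standardized format is meant to provide.
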
The model’s answer to each question was documented and analyzed.

### Statistical Analysis

All statistical analyses were conducted using Stata software (version 18.0 SE; StataCorp LLC, College Station, TX, USA). Independent t-tests were applied to determine the significance of performance differences across question difficulty levels and between correct and incorrect responses. A threshold of P<.05 was set for determining statistical significance.

The study did not involve human subjects or patient data and therefore did not require ethical approval. All test questions were sourced from publicly available or licensed question banks.

## RESULTS

### Overall Performance

The overall performance of ChatGPT-4 on USMLE Step 1, Step 2, and Step 3 was evaluated. The data set included 376 questions in total, comprising 54 questions with images. Specifically, 11.6% of USMLE Step 2 questions featured images, and up to 20.2% of USMLE Step 1 questions included visual data. The accuracy for Step 1 was 85.71%, Step 2 was 92.50%, and Step 3 was 86.86%.

When comparing text-based and image-based questions, significant differences were observed. For Step 1, the accuracy for image-based questions was 70.83%, while text-based questions achieved 89.47%. In Step 2, the accuracy was consistent for both types of questions, with 92.86% for image-based and 92.45% for text-based questions. However, in Step 3, image-based questions had an accuracy of 62.50%, compared to 90.08% for text-based questions.

### Performance by Question Type

For text-based questions, ChatGPT-4 showed high accuracy across all steps: 89.47% for Step 1, 92.45% for Step 2, and 90.08% for Step 3. Conversely, image-based questions presented a greater challenge. Step 1 achieved an accuracy of 70.83%, Step 2 reached 92.86%, and Step 3 managed 62.50%. These results highlight a noticeable performance gap in Steps 1 and 3 compared to text-based questions.
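
The accuracy figures above follow from simple proportions, which the Python sketch below makes explicit. The paper does not report per-category counts of correct answers. The counts in `COUNTS` are an assumption: one set of values, back-calculated from the reported percentages, that reproduces every reported accuracy along with the totals of 376 questions and 54 image-based questions. They should not be read as the study's data.

```python
# ASSUMPTION: (correct, total) counts back-calculated from the reported
# percentages. They are illustrative and are NOT reported in the paper.
COUNTS = {
    "Step 1": {"image": (17, 24), "text": (85, 95)},
    "Step 2": {"image": (13, 14), "text": (98, 106)},
    "Step 3": {"image": (10, 16), "text": (109, 121)},
}


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage."""
    return 100.0 * correct / total


def summarize(counts):
    """Compute image, text, and overall accuracy plus the text-image gap per step."""
    summary = {}
    for step, by_type in counts.items():
        img_correct, img_total = by_type["image"]
        txt_correct, txt_total = by_type["text"]
        row = {
            "image": accuracy(img_correct, img_total),
            "text": accuracy(txt_correct, txt_total),
            "overall": accuracy(img_correct + txt_correct, img_total + txt_total),
        }
        # Positive gap means text-based questions were answered more accurately.
        row["gap"] = row["text"] - row["image"]
        summary[step] = row
    return summary


summary = summarize(COUNTS)
total_questions = sum(n for t in COUNTS.values() for _, n in t.values())
image_questions = sum(t["image"][1] for t in COUNTS.values())

for step, row in summary.items():
    print(
        f"{step}: image {row['image']:.2f}%, text {row['text']:.2f}%, "
        f"overall {row['overall']:.2f}%, gap {row['gap']:+.2f} points"
    )
print(f"total questions: {total_questions}, image-based: {image_questions}")
```

Under these assumed counts, the text-image gap is roughly 19 percentage points for Step 1 and 28 for Step 3, while for Step 2 it is slightly negative, consistent with the pattern described above.
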
The analysis of performance by difficulty level for image-based questions revealed significant drops in accuracy for more difficult questions. T-tests indicated notable differences in performance between easy and difficult image-based questions in Steps 1 and 3. For Step 1, there was a significant performance disparity (p=0.0196), and Step 3 also exhibited a significant difference (p=0.0020). In contrast, Step 2 maintained high accuracy across all difficulty levels, showing no significant difference (p=0.9574).

## DISCUSSION

This study provides important insights into the performance of AI, particularly ChatGPT-4, on the USMLE image-based questions. Previous literature has extensively evaluated AI’s capabilities in answering text-based medical questions, demonstrating proficiency levels comparable to human medical students.3,13 However, there has been a considerable gap in assessing AI’s competencies on image-based questions, which are critical for diagnosing and understanding medical conditions. By addressing this gap, our study offers a comprehensive evaluation and highlights the potential utility and limitations of AI in medical education and diagnostic applications.

### Performance Disparity between Text-Based and Image-Based Questions

Our results revealed a notable performance disparity between text-based and image-based questions on the USMLE. ChatGPT-4 demonstrated high accuracy for text-based questions, attaining 89.47% for Step 1, 92.45% for Step 2, and 90.08% for Step 3. These findings are consistent with prior studies where AI models have shown significant capability in handling text-based inquiries, often surpassing human benchmarks.3,13 However, the accuracy for image-based questions was considerably lower, with 70.83% for Step 1, 92.86% for Step 2, and 62.50% for Step 3. The largest discrepancies were observed in Steps 1 and 3, underscoring the difficulties AI faces in interpreting complex visual data.
These results underscore a critical need to enhance AI’s image-processing systems to achieve more balanced performance across different question types.

### Challenges in Interpreting Visual Data

Our analysis highlighted specific challenges AI faces in interpreting visual data. Although errors were not formally categorized by type, common issues included difficulty recognizing fine details and accurately interpreting complex medical images. These difficulties often led to incorrect or incomplete diagnoses, reflecting a significant gap in AI capabilities compared to human experts. The problems were most pronounced in Step 3 questions, where the accuracy dropped to 62.50%. Advanced image-processing algorithms may be required to improve AI’s proficiency. Prior studies suggest that incorporating more sophisticated machine learning techniques, like convolutional neural networks (CNNs), might enhance AI’s ability to analyze images more accurately.14,15 Moving forward, it will be crucial to develop AI systems that can maintain high accuracy in both text and image-based questions to provide reliable support in medical diagnostics and education.

### Implications for Medical Education and Diagnostics

Enhancing AI capabilities to interpret medical images could revolutionize medical training by providing more accurate feedback on diagnostic questions that involve visual data.15 This could improve the learning experience for medical students and ensure they are better prepared for real-world clinical scenarios. Additionally, improved AI performance in diagnostic imaging may augment clinical practice by offering a reliable second opinion, potentially reducing diagnostic errors and improving patient outcomes.
The integration of AI into medical education tools has the potential to create a more interactive and engaging learning environment, which could lead to better retention and understanding of complex medical information.3,16

### Future Directions

To address the current limitations and enhance AI performance, several recommendations for future research are proposed.

There is a need to develop sophisticated algorithms specifically designed for interpreting medical images. Techniques such as convolutional neural networks (CNNs) and generative adversarial networks (GANs) could be leveraged to improve the accuracy and interpretability of AI models.15

AI models should be trained on extensive and diverse datasets that include a broad range of medical images and scenarios. This would increase the robustness and generalizability of AI systems across different types of medical imaging modalities and clinical contexts.

Integrating text and image data in AI models could provide a more comprehensive approach to medical diagnostics and education. Multimodal AI would leverage the strengths of both data types, resulting in more accurate and contextually relevant outputs, thus enhancing clinical decision-making and learning outcomes.17

Collaboration between AI developers and medical educators is essential to ensure the developed AI systems are aligned with practical needs. Such partnerships can refine AI models to make them more applicable and beneficial in real-world medical settings.16

By following these recommendations, we can significantly improve the capabilities of AI in medical applications, ensuring it meets the rigorous demands of both educational and clinical environments. Future research should aim to continuously refine and test these AI models to maximize their potential and utility in the medical field.

## CONCLUSIONS

This study demonstrates that ChatGPT-4, particularly in its ChatGPT-4o version, can handle image-based questions at an accuracy that exceeds the passing rate, thereby offering a comprehensive example of its ability to succeed in the USMLE exams. The AI’s performance on both text-based and image-based questions highlights its potential as a tool to assist medical students and professionals. Despite the current limitations in directly processing visual data, ChatGPT-4’s ability to interpret image captions and respond accurately underscores its promise in enhancing medical education and diagnostic processes. These findings suggest that with further development, AI could play a significant role in training and assessment within the medical field, providing robust support for both learners and practitioners.

## Data Availability

All data produced in the present study are available upon reasonable request to the authors.

## FIGURE LEGENDS

[Figure 1.](http://medrxiv.org/content/early/2024/06/19/2024.06.18.24309092/F1) Example of prompted USMLE Step 1 question to ChatGPT

[Figure 2.](http://medrxiv.org/content/early/2024/06/19/2024.06.18.24309092/F2) Accuracy of image, non-image, and overall questions for all USMLE Exams.

[Table 1.](http://medrxiv.org/content/early/2024/06/19/2024.06.18.24309092/T1) Accuracy Metrics and Statistical Significance for USMLE Steps

## Footnotes

* The authors have no conflicts of interest to disclose.

## Abbreviations

AI
: Artificial Intelligence

ChatGPT
: Chat Generative Pre-trained Transformer

USMLE
: United States Medical Licensing Examination

FSMB
: Federation of State Medical Boards

NBME
: National Board of Medical Examiners

CNN
: Convolutional Neural Networks

GAN
: Generative Adversarial Networks

* Received June 18, 2024.
* Revision received June 18, 2024.
* Accepted June 19, 2024.
* © 2024, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## REFERENCES

1. Garin D. Unleashing the potential of AI: a deeper dive into GPT prompts for medical research. BMJ Health Care Inform. 2023;30(1):e100857. doi:10.1136/bmjhci-2023-100857
2. Gajjar AA, Kumar RP, Paliwoda ED, et al. Usefulness and Accuracy of Artificial Intelligence Chatbot Responses to Patient Questions for Neurosurgical Procedures. Neurosurgery. Published online February 14, 2024. doi:10.1227/neu.0000000000002856
3. Haddad F, Saade JS. Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study. JMIR Med Educ. 2024;10:e50842. doi:10.2196/50842
4. Takagi S, Watari T, Erabi A, Sakaguchi K. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study. JMIR Med Educ. 2023;9:e48002. doi:10.2196/48002
5. Roos J, Kasapovic A, Jansen T, Kaczmarczyk R. Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany. JMIR Med Educ. 2023;9:e46482. doi:10.2196/46482. PMID:37665620
6. Knoedler L, Alfertshofer M, Knoedler S, et al. Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis. JMIR Med Educ. 2024;10:e51148. doi:10.2196/51148
7. Abdullahi T, Singh R, Eickhoff C. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models. JMIR Med Educ. 2024;10:e51391. doi:10.2196/51391
8. Sevgi M, Antaki F, Keane PA. Medical education with large language models in ophthalmology: custom instructions and enhanced retrieval capabilities. Br J Ophthalmol. Published online May 7, 2024:bjo-2023-325046. doi:10.1136/bjo-2023-325046
9. Armitage R. Performance of GPT-4 in Membership of the Royal College of Paediatrics and Child Health-style examination questions. BMJ Paediatr Open. 2024;8(1):e002575. doi:10.1136/bmjpo-2024-002575
10. Federation of State Medical Boards of the United States, Inc. (FSMB), and National Board of the United States, National Board of Medical Examiners (NBME). Sample Test Questions Step 1. 2024. [https://www.usmle.org/sites/default/files/2021-10/Step\_1\_Sample\_Items.pdf](https://www.usmle.org/sites/default/files/2021-10/Step_1_Sample_Items.pdf)
11. Federation of State Medical Boards of the United States, Inc. (FSMB), and National Board of the United States, National Board of Medical Examiners (NBME). Sample Test Questions Step 2. 2024. [https://www.usmle.org/sites/default/files/2021-10/Step2\_CK\_Sample\_Questions.pdf](https://www.usmle.org/sites/default/files/2021-10/Step2_CK_Sample_Questions.pdf)
12. Federation of State Medical Boards of the United States, Inc. (FSMB), and National Board of the United States, National Board of Medical Examiners (NBME). Sample Test Questions Step 3. 2024. [https://www.usmle.org/sites/default/files/2021-10/Step3\_Sample\_Items.pdf](https://www.usmle.org/sites/default/files/2021-10/Step3_Sample_Items.pdf)
13. Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. doi:10.2196/45312. PMID:36753318
14. Safrai M, Azaria A. Does small talk with a medical provider affect ChatGPT’s medical counsel? Performance of ChatGPT on USMLE with and without distractions. PLoS One. 2024;19(4):e0302217. doi:10.1371/journal.pone.0302217
15. Chaudhari GR, Liu T, Chen TL, et al. Application of a Domain-specific BERT for Detection of Speech Recognition Errors in Radiology Reports. Radiol Artif Intell. 2022;4(4):e210185. doi:10.1148/ryai.210185
16. Salt J, Harik P, Barone MA. Leveraging Natural Language Processing: Toward Computer-Assisted Scoring of Patient Notes in the USMLE Step 2 Clinical Skills Exam. Acad Med. 2019;94(3):314–316. doi:10.1097/ACM.0000000000002558
17. Abbas A, Rehman MS, Rehman SS. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. Cureus. 2024;16(3):e55991. doi:10.7759/cureus.55991