Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam
================================================================================================================================================

* Stefan Morreel
* Veronique Verhoeven
* Danny Mathysen

## Abstract

Recently developed chatbots based on large language models (further called bots) have promising features which could facilitate medical education. Several bots are freely available, but their proficiency has been insufficiently evaluated. In this study the authors have tested the current performance of six widely used bots on the multiple-choice medical licensing exam of the University of Antwerp (Belgium): ChatGPT (OpenAI), Bard (Google), New Bing (Microsoft), Claude Instant (Anthropic), Claude+ (Anthropic) and GPT-4 (OpenAI). The primary outcome was the performance on the exam expressed as a proportion of correct answers. Secondary analyses were done for a variety of features in the exam questions: easy versus difficult questions, grammatically positive versus negative questions, and clinical vignettes versus theoretical questions. Reasoning errors and untruthful statements (hallucinations) in the bots’ answers were examined. All bots passed the exam; Bing and GPT-4 (both 76% correct answers) outperformed the other bots (62-67%, p=0.03) and students (61%). Bots performed worse on difficult questions (62%, p=0.06), but outperformed students (32%) on those questions even more (p<0.01). Hallucinations were found in 7% of Bing’s and GPT-4’s answers, significantly lower than Bard (22%, p<0.01) and Claude Instant (19%, p=0.02). Although the creators of all bots try to some extent to avoid their bots being used as a medical doctor, none of the tested bots succeeded as none refused to answer all clinical case questions.

Bing was able to detect weak or ambiguous exam questions. Bots could be used as a time efficient tool to improve the quality of a multiple-choice exam.

**Author Summary** Artificial chatbots such as ChatGPT have recently gained a lot of attention. They can pass exams for medical doctors, and sometimes they even perform better than regular students. In this study, we have tested ChatGPT and five other (newer) chatbots on the multiple-choice exam that students in Antwerp (Belgium) must pass to obtain the degree of medical doctor. All bots passed the exam with results similar to or better than those of the students. Microsoft Bing scored the best of all tested bots but still produced hallucinations (untruthful statements or reasoning errors) in seven percent of the answers. Bots performed worse on difficult questions, but they outperformed students on those questions even more. Maybe they are most useful when humans don’t know the answer themselves? Although the creators of the bots try to some extent to avoid their bots being used as a medical doctor, none of the tested bots succeeded as none refused to answer all clinical case questions. Microsoft Bing also turns out to be useful for finding weak questions and as such improving the exam.

## Introduction

The development of artificial intelligence (AI) applications announces a new era in many fields of society including medicine and medical education. AI chatbots based on large language models (further called bots) in particular have promising features which could facilitate education by offering simulation training, by personalizing learning experiences with individualised feedback, or by acting as a decision support in clinical training situations. However, before this technology is adopted in the medical curriculum, its capabilities must be thoroughly tested.[1, 2]

Soon after the first bots became publicly available, higher medical education institutes started to report on their performance in medical exam simulations.[3]

Whereas bots seem to be informative and logical in many of their responses, in others they answer with obvious, sometimes dangerous, hallucinations (confident responses which contain reasoning errors or are unjustified by the current state of the art).[4] They will reproduce flaws in the datasets they are trained on; they may reflect or even amplify societal inequality or biases or generate inaccurate or fake information.[5]

Mostly, bots perform near the passing mark,[5-8] although they outperform students in some reports.[9, 10] Performance is in general better on easier questions and when the exam is written in English.[11, 12] Notably, their scores are generally worse on exams offered at more advanced stages of the medical curriculum. However, bots seem to learn rapidly, and new versions do considerably better than their prototypes.[13-15] As bots evolve, their proficiency needs continuous monitoring and updating.

Whereas media articles state that higher education institutes already anticipate the dangers of bots in terms of possible exam fraud, bots also offer opportunities to assist in developing exams, for example by identifying ambiguous or badly formulated exam questions.

Very few comparisons between different bots have been made, and those that do exist only compare two or three bots and do not report hallucination rates.[16, 17]

In this study, we use the final theory exam that all medical students need to pass to obtain the degree of Medical Doctor. It is followed by an oral exam which is not part of this study. The current exam was used in 2021 at the University of Antwerp, Belgium. It is similar to countrywide exams used in other countries, such as the United States Medical Licensing Examination Step 1 and Step 2 CK.[18]

In this study we have tested the current performance of six publicly available bots on the University of Antwerp medical licensing exam. The primary outcomes concern the performance of each bot on the exam. Secondary outcomes include performance on subsets of questions, interrater variability, proportion of hallucinations and the detection of possible weak exam questions.

## Material and Methods

### Ethics

This experiment has been approved by the Ethics Committee of the University of Antwerp and the Antwerp University Hospital (reference number MDF 21/03/037, amendment number 5462).

### Materials

At the end of the undergraduate medical training at the University of Antwerp, medical students must pass a general medical knowledge examination before being licensed as a medical doctor. Besides an oral viva examination, this general medical knowledge examination contains 102 multiple-choice questions covering the entire range of curricular courses. In this study, the exam as it was presented to the students in their second master year (before their final year of clinical training) was used. The scoring system was adapted afterwards, so the students’ scores in this paper do not reflect the actual grades given to the students. The questions were not available online, so they were not used for the training of the studied bots.

### Bot selection

Six bots that are publicly available and can currently be used by teachers and students were tested. The most widely used free bots were selected: ChatGPT (OpenAI), Bard (Google), and New Bing (Microsoft). Claude Instant (Anthropic), Claude+ (Anthropic) and GPT-4 (OpenAI) were added to the list because they allow for an evaluation of the difference between a free and a paid version. Even though Bing is based on the GPT-4 large language model, it also uses other sources such as Bing Search, so it is a customized version of the pure GPT-4 bot.[19]

### Data extraction

The exam was translated using DeepL (DeepL SE), a neural machine translation service. Clear translation errors were corrected by author SM, but the writing style and grammar were not improved in order to mimic an everyday testing situation. Questions containing images or tables (N=2) and local questions (N=5) were excluded. Local questions were excluded because they concern theories, frameworks or models that have only been described in Dutch and are only applicable to Belgium and the Netherlands. Literal translation of these questions leads to nonsense questions in English.

Details on how and when the bots were used can be found in Table 1. By coincidence, the authors found that when Bard refuses to answer a medical question, prompting it with “please regenerate draft” may force it to answer the question anyway. This was not the case for the other bots. In all cases where Bard refused to answer, this additional prompt was used.

[Table 1:](http://medrxiv.org/content/early/2023/08/21/2023.08.18.23294263/T1)

Table 1: overview of the tested generative chat bots.

### Outcomes

The primary outcome was the performance on the exam expressed as a proportion of correct answers (score). This outcome was also measured in the same way as the students were rated on this exam (adapted score): eleven questions contained a second-best answer (an acceptable alternative to the best answer), for which a score of 0.33 was awarded when this option was chosen; twenty questions contained a fatal answer (an option that is dangerous for the patient), leading to a score of -1. For calculation of the students’ scores, the image, table, and local questions were excluded as well.
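The two scores described above can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure: the function names, the answer-key layout and the toy data are hypothetical, and dividing the summed points by the number of questions is an assumption, since the text does not spell out how the adapted score is normalised.

```python
from typing import Optional, Sequence

# Points per answer type as stated in the text.
SECOND_BEST_POINTS = 0.33
FATAL_POINTS = -1.0


def adapted_points(chosen: Optional[str], best: str,
                   second_best: Optional[str] = None,
                   fatal: Optional[str] = None) -> float:
    """Adapted points for one answer; None means the bot gave no answer."""
    if chosen == best:
        return 1.0
    if chosen is not None and chosen == second_best:
        return SECOND_BEST_POINTS
    if chosen is not None and chosen == fatal:
        return FATAL_POINTS
    return 0.0  # ordinary wrong answer or refusal


def exam_scores(answers: Sequence[Optional[str]], key: Sequence[dict]):
    """Return (score, adapted score), both as a proportion of all questions.

    Assumption: the adapted score is the summed points divided by the
    number of questions.
    """
    n = len(key)
    plain = sum(1 for a, q in zip(answers, key) if a == q["best"]) / n
    adapted = sum(
        adapted_points(a, q["best"], q.get("second_best"), q.get("fatal"))
        for a, q in zip(answers, key)
    ) / n
    return plain, adapted


# Hypothetical four-question key and one bot's answers (not study data).
toy_key = [
    {"best": "a", "second_best": "b"},
    {"best": "c", "fatal": "d"},
    {"best": "b"},
    {"best": "d", "second_best": "a", "fatal": "c"},
]
toy_answers = ["a", "d", "c", "a"]  # correct, fatal, wrong, second-best
plain, adapted = exam_scores(toy_answers, toy_key)
print(f"score={plain:.2f}, adapted score={adapted:.4f}")
```

In this toy case the single fatal answer cancels the point earned for the correct answer, which shows how the penalty for fatal answers can offset the partial credit for second-best answers.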

The primary outcomes were assessed in four subsets of answers. Firstly, the difficulty of the questions: thirteen questions were difficult (recorded P-value in the question bank below 0.30, meaning that less than 30% of the students answered the question correctly[20]), 36 easy (recorded P-value in the question bank above 0.80) and 46 moderate (recorded P-value in the question bank between 0.30 and 0.80). Secondly, the grammar of the questions: negatively formulated questions (e.g., “which statement is not correct?”) versus positively formulated questions. Five questions were negatively formulated. Thirdly, the type of question: theory (50 questions) or describing a patient (clinical vignette, 45 questions). Finally, questions with versus without fatal answers.
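The difficulty classification can be illustrated with a short sketch. The function names are hypothetical, and assigning the boundary values 0.30 and 0.80 themselves to the moderate category is an assumption, because the text only specifies "below", "above" and "between".

```python
from typing import Sequence


def item_p_value(student_correct: Sequence[bool]) -> float:
    """Item P-value: the proportion of students who answered correctly."""
    if not student_correct:
        raise ValueError("at least one student response is required")
    return sum(student_correct) / len(student_correct)


def classify_difficulty(p_value: float) -> str:
    """Map an item P-value to the three categories used in the text."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("an item P-value must lie between 0 and 1")
    if p_value < 0.30:
        return "difficult"
    if p_value > 0.80:
        return "easy"
    return "moderate"  # assumption: 0.30 and 0.80 themselves count as moderate


# Hypothetical responses of ten students to one question (not study data).
responses = [True, False, False, True, False, False, False, False, False, False]
p = item_p_value(responses)
print(p, classify_difficulty(p))
```

Note that this P-value is a question difficulty index from test theory and is unrelated to the p-values reported for the statistical tests.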

In those cases where a bot answered a question incorrectly with a fatal answer, the proportion of selected fatal answers among all wrong answers was calculated.

The primary outcome was also assessed for a virtual bot (called Ensemble Bot). The answer of this bot was the mode (most common value) of the answers of all six bots.[21]
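A mode-based ensemble of this kind can be sketched in a few lines. The text does not say how ties or refusals were handled, so two assumptions are made here and marked in the code: refusals (`None`) are ignored, and ties are resolved in favour of the answer that appears first in the bot order. The function name and the example answers are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def ensemble_answer(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most common answer among the bots for one question.

    Assumptions (not stated in the text): refusals (None) are ignored and
    ties go to the answer that appears first in the given bot order.
    """
    votes = [a for a in answers if a is not None]
    if not votes:
        return None  # every bot refused to answer
    counts = Counter(votes)
    top = max(counts.values())
    for answer in votes:
        if counts[answer] == top:
            return answer
    return None  # unreachable, kept for completeness


# Hypothetical answers of six bots to three questions (not study data).
print(ensemble_answer(["a", "b", "a", "c", "a", "b"]))   # clear majority
print(ensemble_answer(["a", "b", "b", "a", None, "c"]))  # tie between a and b
print(ensemble_answer([None, None, None, None, None, None]))
```

A simple mode is only one possible aggregation rule; weighting each bot by its individual accuracy would be an alternative design.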

Three additional outcomes were assessed. Firstly, the proportion of hallucinations as rated by the authors among the incorrect answers of the best scoring bot. Authors VV and DM read all incorrect answers and judged them as containing a hallucination or not. In case of discordance, author SM made a final decision. A hallucination was previously defined as content that is nonsensical or untruthful in relation to certain sources.[22] This definition is not usable for the current research, so the authors defined a hallucination as content that either contains clear reasoning errors or is untruthful in relation to current evidence-based medical literature. To detect reasoning errors, no medical knowledge is required. For example: “the risk is about 1 in 100 (3%)”. To detect untruthful answers, the authors had to use their own background knowledge combined with common online resources to verify the AI answers. One clear example of an untruthful answer given by several bots: “This is a commonly used mnemonic to remember the order: “NAVEL” - Nerve, Artery, Vein, Empty space (from medial to lateral).” This mnemonic does exist, but it should be used from lateral to medial. Because a multiple-choice exam was studied, the hallucination could not be found in the answer itself but in the arguments supporting the selected answer. Bots never answer with a simple letter; they all produce written-out answers of varying length. The authors wanted to report reasoning errors and untruthful answers separately but found that these two were often both present in a bot’s answer, so this outcome was dropped.

Secondly, the proportion of possible weak questions among the incorrect answers of the best scoring bot. For this outcome, all authors discussed all incorrect answers of the best scoring bot and reached unanimous consensus.

Thirdly, the interrater variability was examined. Originally, the authors planned to test whether user interpretation of the answers would be different from strict interpretation of the bot’s answer, as this difference was significant in a previous study.[8] This outcome was dropped because such cases occurred only in ChatGPT and Bard.

### Analysis

The differences in performance among the bots/students, differences in performance among categories of questions, and differences in the proportion of hallucinations were tested with a one-way ANOVA test and pairwise unpaired two-sample T-tests. P-values were 2-tailed where applicable, and a p-value of less than 0.05 was considered statistically significant. A p-value between 0.05 and 0.10 was considered a trend. For the wrong answers on questions with a fatal answer, a chi2 test was used to assess the difference between the bot’s proportion of fatal answers and the random proportion of fatal answers (which equals 0.33). Fleiss’ Kappa was used to assess the overall agreement among the bots. Cohen’s kappa was used to assess pairwise interrater agreement between the different bots. Raw data was collected using Excel 2023 (Microsoft). JMP Pro version 17 (JMP Statistical Discovery LLC) was used for all analyses except Fleiss’ kappa which was calculated in R version 4.31 (DescTools package).
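The chi2 comparison of an observed proportion against a fixed chance proportion can be sketched as a goodness-of-fit test with one degree of freedom. This is a minimal standard-library illustration and not the authors' JMP procedure: it applies no continuity correction, the function name and the counts are hypothetical, and it should not be expected to reproduce the p-values reported in this paper.

```python
import math


def chi2_proportion_test(observed_hits: int, total: int, expected_prop: float):
    """Chi-square goodness-of-fit test of one proportion (1 degree of freedom).

    Compares observed hits and misses with the counts expected under
    expected_prop. No continuity correction is applied (an assumption).
    """
    if total <= 0 or not 0.0 < expected_prop < 1.0:
        raise ValueError("need total > 0 and 0 < expected_prop < 1")
    observed = (observed_hits, total - observed_hits)
    expected = (total * expected_prop, total * (1.0 - expected_prop))
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the upper tail is P(Z^2 > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts (not study data): 30 hits in 100 trials, chance level 0.5.
chi2, p_value = chi2_proportion_test(30, 100, 0.5)
print(f"chi2={chi2:.2f}, p={p_value:.5f}")
```

For the fatal-answer analysis, `observed_hits` would be the number of wrong answers that were fatal, `total` the number of wrong answers on questions with a fatal option, and `expected_prop` the chance proportion stated in the text.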

## Results

### Exam performance

See Table 2 for an overview of the scores of the tested bots. Bing and GPT-4 scored the best with 76% correct answers and an adapted score (the way students were rated) of 76% as well. The mean score of all bots was 68%; the scores of the individual bots were not significantly different from this mean (p=0.12). However, Bing and GPT-4 scored significantly better than Bard (p=0.03) and Claude Instant (p=0.03). GPT-4 had the same score as Bing but had more wrong answers (25 versus 13). Claude+ did not score significantly better than Claude Instant. All bots gave one fatal answer (on different questions) except Bard, which did not give any fatal answers. Bing gave four second-best answers, ChatGPT, Bard and GPT-4 three each, Claude+ two and Claude Instant only one. For thirteen questions, Bard refused to answer. After prompting Bard up to five times with “regenerate draft”, it still refused to answer four questions; seven were answered correctly and two wrongly. The performance of the bots using the adapted score was very similar because the points added for second-best answers were offset by the points lost due to fatal answers. The mean score of the 95 students was 61% (standard deviation 9), and the mean adapted score for students was 60% (standard deviation 21). The Ensemble Bot (which answers with the most common answer among the six bots) scored the same as Bing (72 correct answers, 76%).

[Table 2.](http://medrxiv.org/content/early/2023/08/21/2023.08.18.23294263/T2)

Table 2. Performance of generative chat bots on the University of Antwerp Medical License Exam (95 questions)

To illustrate this performance, S1 Table contains a question and the responses from all selected bots.

### Performance for subsets of questions

The bots scored on average 73% for easy questions and 62% for difficult questions (p=0.06). The students scored on average 75% for easy questions and 32% for difficult questions (p<0.01). Assessing difficult questions only, ChatGPT performed best with a score of 77%, while Bing and GPT-4 scored 69%. The students scored 32% on difficult questions, which is significantly lower than ChatGPT, Bing, and GPT-4 (p<0.01). A similar but smaller effect was found for moderate questions (Bing versus students, 72% versus 59%, p=0.07) but not for easy questions (69% versus 74%, p=0.30).

No significant difference in performance on negative versus positive questions (p=0.16) and on clinical vignettes versus theory questions (p=0.16) was found. Such a difference was not found for the students either (p=0.54 and 0.38 respectively). When examining individual questions, errors on clinical vignette questions often arose because Bing missed an important clue in the context or the history of the patient. For example, in a question concerning the timing of a flu vaccine for a pregnant patient consulting in August, Bing answered that the flu vaccine was necessary now. Bing missed the clue about August: flu vaccines should be given later and are generally not available yet in August.[23]

The bots scored on average 72% on questions with a fatal answer which is not significantly different from questions without a fatal answer (68%, p=0.39). Among the 34 wrong answers, the fatal answer was chosen five times (15%) which is lower than can be expected by chance only (11 wrong answers or 33%, p=0.09). The students did perform worse on these questions (mean 64% versus 52%, p=0.03). Among the 843 wrong student answers, the fatal option was chosen in 111 answers (13%).

### Hallucinations

Hallucinations were found in 7% of Bing’s and GPT-4’s answers. This was significantly lower than Bard (22%, p<0.01) and Claude Instant (19%, p=0.02). ChatGPT had 15% hallucinations and Claude+ 12%; this was not significantly different from Bing and GPT-4 (p>0.10 for all these comparisons).

See Table 3 for a question on which five bots hallucinated (reasoning errors).

[Table 3.](http://medrxiv.org/content/early/2023/08/21/2023.08.18.23294263/T3)

Table 3. Example of all generative chatbot hallucinations on one question. Reasoning errors are indicated in bold.

### Detection of weak questions

Among the 23 incorrect answers of Bing, three questions were unclearly written and two were not in line with current literature. An example of a detected weak question is one concerning renal replacement therapy: “*Complete. Renal function replacement therapy is indicated … a) in any symptomatic patient with an eGFR <15 ml/min/1.73m². b) only in patients under 65 years of age. c) in anyone with an eGFR < 6 l/min/1.73m². d) only when urea is elevated.*” Bing answered “a)”. After review of current literature, the authors judge that an eGFR below 15 is indeed a commonly used cut-off value for starting renal replacement therapy, but it is not the only reason to start dialysis. Because statement a contains “any”, Bing’s answer is wrong, but the authors do understand why Bing gave this answer and why a student might give this answer as well. The same argument applies to answer c, which is supposed to be the correct answer. Moreover, the eGFR cut-off of six is odd. This question needs improvement.

### Interrater variability

For 34 questions (36%), all bots agreed. Fleiss’ Kappa for all raters was 0.54 (moderate agreement). The agreement between ChatGPT and GPT-4 was the highest (Cohen’s Kappa=0.66, substantial agreement). The agreement between Bing and Bard was the lowest (Cohen’s Kappa= 0.48, moderate agreement).
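For readers unfamiliar with the pairwise statistic reported above, Cohen's kappa corrects the observed agreement between two raters for the agreement expected by chance. The sketch below is a plain-Python illustration with hypothetical answers, not the JMP computation used in the study.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater1: Sequence[str], rater2: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater1)
    # Observed agreement: share of items on which both raters agree.
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of the raters' marginal proportions per label.
    counts1, counts2 = Counter(rater1), Counter(rater2)
    labels = set(counts1) | set(counts2)
    p_chance = sum((counts1[c] / n) * (counts2[c] / n) for c in labels)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical answers of two bots to four questions (not study data).
bot_a = ["a", "a", "b", "b"]
bot_b = ["a", "a", "b", "a"]
kappa = cohens_kappa(bot_a, bot_b)
print(f"kappa={kappa:.2f}")
```

Here the two bots agree on three of four questions (0.75) while chance agreement is 0.50, giving a kappa of 0.50. Fleiss' kappa generalises the same idea to more than two raters.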

## Discussion

In this study, significant differences in the performance of publicly available AI chatbots on the Antwerp Medical License Exam were found. Both GPT-4 and Bing scored the best, but Bing turns out more reliable as it produces fewer wrong answers. This performance is in line with previous research.[13-15] An ensemble bot which combines all tested bots scored equally. The proportion of hallucinations was much lower for Bing than for Bard and Claude+/Claude Instant.

The improvement of these new bots, both in scores and in proportion of hallucinations, sounds impressive. It might, however, increase the risk, as users will have more confidence in wrong or even dangerous answers when the bots (in general) answer more correctly. The risk of replicating biases in the data on which these models are trained remains. Other authors have already pointed out the meaning of these results: bots can pass exams, but this does not make them medical doctors, as this requires far more capacities than reproduction of knowledge alone. The current study raises the question of whether a multiple-choice exam is a useful way to assess the competencies modern doctors need (mostly concerning human interactions).[24] Bing performed as well as GPT-4 but with fewer wrong answers, so currently it is not worth paying for a bot in order to test a medical exam, nor is it useful to create an ensemble bot based on the mode of all bots’ answers. Ensemble bots based on more complex rules than just the mode of all answers should be studied further.

We can recommend the use of Bing to detect weak questions among the wrong answers. This is a time-efficient way to improve the quality of a multiple-choice exam.

The trend we found towards better bot performance on easy questions is in line with previous research.[11] However, the difference in performance between students and bots was large for difficult questions and absent for easy questions. This compelling new finding demands further research. Maybe bots are most useful in those situations that are difficult for humans?

The lack of a significant difference in performance between positive and negative questions, and between clinical vignettes and theory questions needs confirmation on larger datasets and on other exams.

Although the creators of all bots try, to a certain extent, to avoid their bots being used as a medical doctor, none of the tested bots succeeded as none refused to answer all clinical case questions. Only Claude+ and Claude Instant refused (at times) to answer the question and closed the conversation. For all other bots, users can try to persuade them to answer the question anyway. This finding was most compelling for Bard: after the same question was entered repeatedly, Bard did answer it in nine out of thirteen cases.

The rise of generative AI also raises many ethical and legal issues: enormous energy consumption, use of data sources without permission, use of sources protected by copyright, lack of reporting guidelines and many more. Before AI is widely implemented in medical exams, more legislation and knowledge are necessary on these topics.[25, 26]

The strengths of this study mainly concern its novelty: a comparison of six different bots had not been published yet. The bots tested are available to the public, so our methodology can easily be re-used. This study, however, has several limitations as well. It only concerned one exam with a moderately sized set of questions. Neither a usable definition of hallucinations nor a validated approach to detect them was available at the time of writing. The definition we have used (chatbot generated content that either contains clear reasoning errors or is untruthful in relation to current evidence-based medical literature) might inspire other authors, although we found that a distinction between reasoning errors and untruthful statements was not feasible. The exclusion of tables, local questions and images reduces the value of the comparison to real students. Future bots will most likely be able to process such questions as well. Finally, the exam was translated into English to make the current paper understandable for a broad audience. Further research on other languages is necessary.

## Conclusion

Six generative AI chatbots passed the Antwerp multiple-choice exam necessary for obtaining a license as an MD. Bing (and to a lesser extent GPT-4) outperformed all other bots and students. Bots performed worse on difficult questions but outperformed students on those questions even more. Bing can be used to detect weak multiple-choice questions. Creators should improve their bots’ algorithms if they do not want them to be used as a medical doctor.

## Data Availability

The questions of this exam cannot be made public because they will be used again in future exams. Consequently, the authors cannot share all the AI responses. Upon request we can provide all raw data, the questions and the responses, as long as the requestor can guarantee that they will not be made public and no students will have access to them. As supplementary material, we do provide a datasheet with our raw data excluding the answers and the questions (S2 Selected Study Data and S3 Study Data Variables Overview). Individual student results, even anonymised, will never be shared as it is impossible to ask permission from all students.

## Supplementary material captions

S1 Table. Responses from all selected bots on an example question

S2 Selected Study Data. Study data excluding selected columns. See Data Availability Statement for more information.

S3 Study Data Variables Overview. An overview of the properties of all variables used in file S2 Selected Study Data.

## Acknowledgements

The authors would like to thank Professor David Martens for proofreading this manuscript.

*   Received August 18, 2023.
*   Revision received August 18, 2023.
*   Accepted August 21, 2023.


*   © 2023, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/)

## References

1.  Rudolph J, Tan S, Tan S. ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? Journal of Applied Learning and Teaching. 2023;6(1).

2.  Chatterjee J, Dethlefs N. This new conversational AI model can be your friend, philosopher, and guide… and even your worst enemy. Patterns. 2023;4(1).

3.  Kung TH, Cheatham M, Medinilla A, Chat GPT, Sillos C, De Leon L, et al. Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models. medRxiv. 2022:2022.12.19.22283643.

4.  Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12):1–38.

5.  Lum ZC. Can artificial intelligence pass the American Board of Orthopaedic Surgery examination? Orthopaedic residents versus ChatGPT. Clinical Orthopaedics and Related Research®. 2022:10.1097.

6.  Huh S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: a descriptive study. J Educ Eval Health Prof. 2023;20(1).

7.  Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: Insights into current strengths and limitations. Radiology. 2023:230582.

8.  Morreel S, Mathysen D, Verhoeven V. Aye, AI! ChatGPT passes multiple-choice family medicine exam. Med Teach. 2023;45(6):665-6. Epub 20230311. doi: 10.1080/0142159x.2023.2187684. PubMed PMID: 36905610.

9.  Li SW, Kemp MW, Logan SJ, Dimri PS, Singh N, Mattar CN, et al. ChatGPT outscored human candidates in a virtual objective structured clinical examination in obstetrics and gynecology. American Journal of Obstetrics and Gynecology. 2023.

10. Subramani M, Jaleel I, Krishna Mohan S. Evaluating the performance of ChatGPT in medical physiology university examination of phase I MBBS. Advances in Physiology Education. 2023;47(2):270–1.

11. Wang YM, Shen HW, Chen TJ. Performance of ChatGPT on the pharmacist licensing examination in Taiwan. J Chin Med Assoc. 2023;86(7):653-8. Epub 20230705. doi: 10.1097/jcma.0000000000000942. PubMed PMID: 37227901.

12. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations. Radiology. 2023;307(5):e230582. doi: 10.1148/radiol.230582.

13. Moshirfar M, Altaf AW, Stoakes IM, Tuttle JJ, Hoopes PC. Artificial Intelligence in Ophthalmology: A Comparative Analysis of GPT-3.5, GPT-4, and Human Expertise in Answering StatPearls Questions. Cureus. 2023;15(6):e40822. Epub 20230622. doi: 10.7759/cureus.40822. PubMed PMID: 37485215; PubMed Central PMCID: PMCPMC10362981.

14. Ali R, Tang OY, Connolly ID, Fridley JS, Shin JH, Sullivan PLZ, et al. Performance of ChatGPT, GPT-4, and Google bard on a neurosurgery oral boards preparation question bank. Neurosurgery. 2022:10.1227.

15. Oh N, Choi G-S, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Annals of Surgical Treatment and Research. 2023;104(5):269.

16. Oh N, Choi GS, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res. 2023;104(5):269-73. Epub 20230428. doi: 10.4174/astr.2023.104.5.269. PubMed PMID: 37179699; PubMed Central PMCID: PMCPMC10172028.

17. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. doi: 10.2196/45312.

18. Rashid H, Coppola KM, Lebeau R. Three Decades Later: A Scoping Review of the Literature Related to the United States Medical Licensing Examination. Acad Med. 2020;95(11S Association of American Medical Colleges Learn Serve Lead: Proceedings of the 59th Annual Research in Medical Education Presentations):S114-s21. doi: 10.1097/acm.0000000000003639. PubMed PMID: 33105189.

19. Mehdi Y. Confirmed: the new Bing runs on OpenAI’s GPT-4 2023 [09/08/2023]. Available from: https://blogs.bing.com/search/march_2023/Confirmed-the-new-Bing-runs-on-OpenAI%E2%80%99s-GPT-4.

20. Miller MD, Linn RL. Measurement and assessment in teaching. 11th ed. Boston: Pearson; 2013. xviii, 538 p.

21. Dietterich TG. Ensemble Methods in Machine Learning. Berlin, Heidelberg: Springer Berlin Heidelberg; 2000.

22. OpenAI. GPT-4 technical report. arXiv. 2023:2303.08774.

23. Centers for Disease Control and Prevention. Key Facts About Seasonal Flu Vaccine 2022 [11/08/2023]. Available from: [https://www.cdc.gov/flu/prevent/keyfacts.htm](https://www.cdc.gov/flu/prevent/keyfacts.htm).

24. Mbakwe AB, Lourentzou I, Celi LA, Mechanic OJ, Dagan A. ChatGPT passing USMLE shines a spotlight on the flaws of medical education. PLOS Digit Health. 2023;2(2):e0000205. Epub 20230209. doi: 10.1371/journal.pdig.0000205. PubMed PMID: 36812618; PubMed Central PMCID: PMCPMC9931307.

25. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614(7947):224–6. doi: 10.1038/d41586-023-00288-7. PubMed PMID: 36737653.

26. Cacciamani GE, Collins GS, Gill IS. ChatGPT: standard reporting guidelines for responsible use. Nature. 2023;618(7964):238. doi: 10.1038/d41586-023-01853-w. PubMed PMID: 37280286.