Skip to main content
medRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search

Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam

View ORCID ProfileStefan Morreel, View ORCID ProfileVeronique Verhoeven, Danny Mathysen
doi: https://doi.org/10.1101/2023.08.18.23294263
Stefan Morreel
1Department of Family Medicine and Population Health, University of Antwerp, Antwerp, Belgium
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Stefan Morreel
  • For correspondence: stefan.morreel{at}uantwerpen.be
Veronique Verhoeven
1Department of Family Medicine and Population Health, University of Antwerp, Antwerp, Belgium
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Veronique Verhoeven
Danny Mathysen
1Department of Family Medicine and Population Health, University of Antwerp, Antwerp, Belgium
2Dean’s Department, University of Antwerp, Antwerp, Belgium
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Data/Code
  • Preview PDF
Loading

Abstract

Recently developed chatbots based on large language models (further called bots) have promising features which could facilitate medical education. Several bots are freely available, but their proficiency has been insufficiently evaluated. In this study the authors have tested the current performance on the multiple-choice medical licensing exam of University of Antwerp (Belgium) of six widely used bots: ChatGPT (OpenAI), Bard (Google), New Bing (Microsoft), Claude instant (Anthropic), Claude+ (Anthropic) and GPT-4 (OpenAI). The primary outcome was the performance on the exam expressed as a proportion of correct answers. Secondary analyses were done for a variety of features in the exam questions: easy versus difficult questions, grammatically positive versus negative questions, and clinical vignettes versus theoretical questions. Reasoning errors and untruthful statements (hallucinations) in the bots’ answers were examined. All bots passed the exam; Bing and GPT-4 (both 76% correct answers) outperformed the other bots (62-67%, p= 0.03) and students (61%). Bots performed worse on difficult questions (62%, p= 0.06), but outperformed students (32%) on those questions even more (p<0.01). Hallucinations were found in 7% of Bing’s and GPT4’s answers, significantly lower than Bard (22%, p<0.01) and Claude Instant (19%, p=0.02). Although the creators of all bots try to some extent to avoid their bots being used as a medical doctor, none of the tested bots succeeded as none refused to answer all clinical case questions.

Bing was able to detect weak or ambiguous exam questions. Bots could be used as a time efficient tool to improve the quality of a multiple-choice exam.

Author Summary Artificial chatbots such as ChatGPT have recently gained a lot of attention. They can pass exams for medical doctors, sometimes they even perform better than regular students. In this study, we have tested ChatGPT and five other (newer) chatbots in the multiple-choice exam that students in Antwerp (Belgium) must pass to obtain the degree of medical doctor. All bots passed the exam with results similar or better than the students. Microsoft Bing scored the best of all tested bots but still produces hallucinations (untruthful statements or reasoning errors) in seven percent of the answers. Bots performed worse on difficult questions but they outperformed students on those questions even more. Maybe they are most useful when humans don’t know the answer themselves? The creators of the bots try to some extent to avoid their bots being used as a medical doctor, none of the tested bots succeeded as none refused to answer all clinical case questions. Microsoft Bing also turns out to be useful to find weak questions and as such improve the exam.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

The authors did not receive any funding for this study.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This experiment has been approved by the Ethics Committee of the University of Antwerp and the Antwerp University Hospital (reference number MDF 21/03/037, amendment number 5462).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The questions of this exam cannot be made publicly because they will be used again in future exams. Consequently, the authors cannot share all the AI responses. Upon request we can provide all raw data, the questions and the responses as long as the requestor can guarantee that they will not be made publicly and no students will have access to them. As supplementary material, we do provide a datasheet with our raw data excluding the answers and the questions (S2 Selected Study Data and S3 Study Data Variables Overview). Individual student results, even anonymised will never be shared as it is impossible to ask permission to all students.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Back to top
PreviousNext
Posted August 21, 2023.
Download PDF
Data/Code
Email

Thank you for your interest in spreading the word about medRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam
(Your Name) has forwarded a page to you from medRxiv
(Your Name) thought you would like to see this page from the medRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam
Stefan Morreel, Veronique Verhoeven, Danny Mathysen
medRxiv 2023.08.18.23294263; doi: https://doi.org/10.1101/2023.08.18.23294263
Twitter logo Facebook logo LinkedIn logo Mendeley logo
Citation Tools
Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam
Stefan Morreel, Veronique Verhoeven, Danny Mathysen
medRxiv 2023.08.18.23294263; doi: https://doi.org/10.1101/2023.08.18.23294263

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Health Informatics
Subject Areas
All Articles
  • Addiction Medicine (349)
  • Allergy and Immunology (668)
  • Allergy and Immunology (668)
  • Anesthesia (181)
  • Cardiovascular Medicine (2648)
  • Dentistry and Oral Medicine (316)
  • Dermatology (223)
  • Emergency Medicine (399)
  • Endocrinology (including Diabetes Mellitus and Metabolic Disease) (942)
  • Epidemiology (12228)
  • Forensic Medicine (10)
  • Gastroenterology (759)
  • Genetic and Genomic Medicine (4103)
  • Geriatric Medicine (387)
  • Health Economics (680)
  • Health Informatics (2657)
  • Health Policy (1005)
  • Health Systems and Quality Improvement (985)
  • Hematology (363)
  • HIV/AIDS (851)
  • Infectious Diseases (except HIV/AIDS) (13695)
  • Intensive Care and Critical Care Medicine (797)
  • Medical Education (399)
  • Medical Ethics (109)
  • Nephrology (436)
  • Neurology (3882)
  • Nursing (209)
  • Nutrition (577)
  • Obstetrics and Gynecology (739)
  • Occupational and Environmental Health (695)
  • Oncology (2030)
  • Ophthalmology (585)
  • Orthopedics (240)
  • Otolaryngology (306)
  • Pain Medicine (250)
  • Palliative Medicine (75)
  • Pathology (473)
  • Pediatrics (1115)
  • Pharmacology and Therapeutics (466)
  • Primary Care Research (452)
  • Psychiatry and Clinical Psychology (3432)
  • Public and Global Health (6527)
  • Radiology and Imaging (1403)
  • Rehabilitation Medicine and Physical Therapy (814)
  • Respiratory Medicine (871)
  • Rheumatology (409)
  • Sexual and Reproductive Health (410)
  • Sports Medicine (342)
  • Surgery (448)
  • Toxicology (53)
  • Transplantation (185)
  • Urology (165)