Skip to main content
medRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search

Assessing ChatGPT’s Mastery of Bloom’s Taxonomy using psychosomatic medicine exam questions

View ORCID ProfileAnne Herrmann-Werner, Teresa Festl-Wietek, Friederike Holderried, Lea Herschbach, Jan Griewatz, Ken Masters, Stephan Zipfel, View ORCID ProfileMoritz Mahling
doi: https://doi.org/10.1101/2023.08.18.23294159
Anne Herrmann-Werner
1Tübingen Institute for Medical Education (TIME), Faculty of Medicine, University of Tübingen, Germany
2Department of Psychosomatic Medicine and Psychotherapy, University Hospital Tübingen, Tübingen, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Anne Herrmann-Werner
  • For correspondence: anne.herrmann-werner{at}med.uni-tuebingen.de
Teresa Festl-Wietek
1Tübingen Institute for Medical Education (TIME), Faculty of Medicine, University of Tübingen, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Friederike Holderried
1Tübingen Institute for Medical Education (TIME), Faculty of Medicine, University of Tübingen, Germany
3University Department of Anesthesiology and Intensive Care Medicine, University Hospital Tübingen, Tübingen, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Lea Herschbach
1Tübingen Institute for Medical Education (TIME), Faculty of Medicine, University of Tübingen, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Jan Griewatz
1Tübingen Institute for Medical Education (TIME), Faculty of Medicine, University of Tübingen, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Ken Masters
4Medical Education and Informatics Department, College of Medicine and Health Sciences, Sultan Qaboos University, Muscat, Sultanate of Oman
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Stephan Zipfel
2Department of Psychosomatic Medicine and Psychotherapy, University Hospital Tübingen, Tübingen, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Moritz Mahling
1Tübingen Institute for Medical Education (TIME), Faculty of Medicine, University of Tübingen, Germany
5Department of Diabetology, Endocrinology, Nephrology, Section of Nephrology and Hypertension, University Hospital Tübingen, Tübingen, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Moritz Mahling
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Data/Code
  • Preview PDF
Loading

Abstract

Introduction Large language models (LLMs) such as GPT-4 are increasingly used in medicine and medical education. However, these models are prone to “hallucinations” – outputs that sound convincing while being factually incorrect. It is currently unknown how these errors by LLMs relate to the different cognitive levels defined in Bloom’s Taxonomy.

Methods We used a large dataset of psychosomatic medicine multiple-choice questions (MCQ) (N = 307) with real-world results derived from medical school exams. GPT-4 answered the MCQs using two distinct prompt versions – detailed and short. The answers were analysed using a quantitative and qualitative approach. We focussed on incorrectly answered questions, categorizing reasoning errors according to Bloom’s Taxonomy.

Results GPT-4’s performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty compared to questions that GPT-4 answered incorrectly (p=0.002 for the detailed prompt and p<0.001 for the short prompt). Independent of the prompt, GPT-4’s lowest exam performance was 78.9%, always surpassing the pass threshold. Our qualitative analysis of incorrect answers, based on Bloom’s Taxonomy, showed errors mainly in the “remember” (29/68) and “understand” (23/68) cognitive levels. Specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.

Discussion GPT-4 displayed a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated against Bloom’s hierarchical framework, our data revealed that GPT-4 occasionally ignored specific facts (“remember”), provided illogical reasoning (“understand”), or failed to apply concepts to a new situation (“apply”). These errors, though confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.

Conclusion While GPT-4 mostly excels at medical exam questions, discerning its occasional cognitive errors is crucial.

1 Introduction

The recent developments in artificial intelligence (AI) hold transformative potential for various fields, including medicine1 and medical education.2 In November 2022, OpenAI launched GPT-3, a large language model (LLM).3 Its high-quality performance surprised even experts and led to huge public interest. This is particularly true for school and higher education settings, where GPT-3 prompted manifold discussions on potential benefits and harms.4,5 In medical education, LLMs have the potential to revolutionize the current teaching approaches and thus ultimately improve physician performance and healthcare outcomes. However, before LLMs are deeply integrated into medical education, their performance must be comprehensively assessed in the context of medical education. It is especially critical to evaluate the capabilities of AI and LLMs within the educational theoretical frameworks.

One of the most-used frameworks in medical education is Benjamin Bloom’s taxonomy6,7 of learning outcomes. Briefly, Bloom and subsequent colleagues developed a hierarchical classification of cognitive processes, ordered from lower-order cognitive skills, such as knowledge recall (“remember”) and comprehension (“understand”), to higher-order thinking such as application (“apply”), analysis (“analyse”), evaluation (“evaluate”) and creation (“create”).8 From its first publication in 1956 until now, this taxonomy has been used as a common language for educational instructors and still influences the field of medical education.8

Although derived from human learning processes, the taxonomy developed by Bloom provides an ideal framework to describe cognitive processes that underlay success and failure. Recently, LLMs have been assessed for their (mostly surprisingly good) performance in various fields of medicine, ranging from specific subjects to board exams.9–11 However, the errors that LLMs make have not been evaluated in detail. For example, while LLMs might successfully recall facts (“remember”), they might struggle to apply those facts to a different context (or vice versa). At this stage, the authors acknowledge that applying terms like “remember” and “struggle” are anthropomorphisms for ease of reading, as an LLM currently does neither and merely generates responses based on language usage statistical probabilities, using a “next-word prediction paradigm”.12

Our intention was thus to use Bloom’s Taxonomy in order to gain a better understanding of the failures of LLMs. For studies of human medicine and the aforementioned use cases for LLMs, multiple-choice questions (MCQ) remain a primary written exam form and are used for summative and formative assessments.13 In Bloom’s Taxonomy, MCQs are often used to assess lower-order cognitive skills, such as knowledge recall (“remember”) and comprehension (“understand”), but may also probe higher-order thinking such as application (“apply”), analysis (“analyse”) and evaluation (“evaluate”).14 The form of MCQs thus offers a good window for evaluating different cognitive processes.

To elucidate cognitive processes and correct/incorrect reasoning, a medical field that heavily relies on language and factual understanding is instrumental. Given the interplay of psychological, social, and biological factors, psychosomatic medicine presents such a challenge. The field’s heavy reliance on verbal and written communication for diagnosis and treatment makes it particularly challenging. Additionally, the combination of strict diagnostic criteria with a nuanced understanding of the patient’s language positions it as an ideal testing ground for the capabilities of language models.

This paper presents a mixed-methods study designed to explore how GPT-4 performs (and fails) with regard to Bloom’s Taxonomy. First, we assessed the performance of GPT-4 in a large set of psychosomatic medicine exam questions and compared the results to responses from our cohort of medical students, providing a human comparison and quality indicators. For a deeper understanding of the results, we employed qualitative methods to understand the model’s performance and to assess the strengths and weaknesses of LLMs in relation to Bloom’s Taxonomy. The findings of this study will provide critical insights into the practical applications and limitations of LLMs, such as GPT-4, in medical education.

2 Methods

2.1 Exams

A total of 16 examinations from winter term 2014/2015 to summer term 2022 were retrieved from the Department of Psychosomatic Medicine and Psychotherapy faculty’s online exam programme integrated management system (IMS) (https://www.ucan-assess.org/ims/). In addition to question stems, answers, and distractors, the system also offers quality criteria for each individual question (discriminatory power and level of difficulty).

Each examination consists of 20 MCQs, each with one answer and four distractors. Diagnostic and therapeutic questions cover topics concerning anxiety disorders, depression, eating, and somatoform and trauma disorders. Questions concerning motivational interviewing techniques are also part of the examination.

Undergraduate medical students take this examination at the end of their third year after having attended 7.5 hours of lectures and 18 hours of practical classes on psychosomatic medicine and psychotherapy.

To pass, a student is usually required to correctly answer 12 out of 20 questions; an adjustment of this passing score is possible if, for example, a question has a too-poor performance.

We assessed the questions for their suitability to be analysed by GPT-4. From a total of 320 questions, 13 (4%) were excluded. Excluded were questions that were not single choice (n = 6), ambiguous questions (n = 3), questions featuring a graphic that had to be analysed (n = 2), and questions that covered a case represented in multiple single questions (n = 2).

We used GPT-4 to answer every single question (model “gpt-4”, OpenAI LP, San Francisco, California, USA). For each question, we generated a detailed and a short prompt version. The prompts were created by the authors in an iterative process using the web-interface ChatGPT Plus to achieve ideal performance. The most relevant difference between versions was that the detailed version included the command to critically reflect upon the answers and justify the choice made. For a detailed example, please see Table 1. We used the application programming interface (API) provided by OpenAI to post the question to GPT-4 and retrieve the answer. While all data are presented in English, the data acquisition was performed in German.

View this table:
  • View inline
  • View popup
Table 1: Examples of detailed and short prompt versions.

2.2 Data analysis

The responses given by GPT-4 were compared to the answer indicated by the answer index (e.g., “A” or “C”) and stored in Microsoft Excel (Version 16.0.10394.20022).

2.2.1 Quantitative data analysis

Quantitative analyses and figure generation were performed using R Statistical software (R version 4.3.1).15 Briefly, we combined all tables with relevant data, i.e., from GPT-4 and the aggregated results of the students’ exams (such as item difficulty). For each prompt version, we analysed the ratio of correctly and incorrectly answered questions. We further compared the question difficulty (from the aggregated student data) between questions answered correctly and incorrectly by GPT-4. The difficulty of a question can be understood as the fraction of students answering a question correctly, with 0 representing a very difficult question and 1 a very easy question.16 To test for statistical significance, a Wilcoxon rank sum test was used. A p-level < 0.05 was considered statistically significant. If not stated otherwise, medians and 25%/75% quartiles are given (Q25–Q75) in the results.

2.2.2 Qualitative data analysis

Two authors (TFW, FH) separately coded each text response. The answers from GPT-4 were analysed inductively and iteratively according to Mayring’s qualitative content analysis17 and as described previously by our group.18 The goal of the analysis were defined according to the answers to the examination questions. As the main category, we used the correct or incorrect answering of the question. We further focussed mainly on incorrect answers.

In the answer text, individual reasoning was categorized according to Bloom’s Taxonomy as revised by Krathwohl.8 In a second step, each of the raters coded the answers using MAXQDA (VERBI software, version 12.3.2). In order to obtain the same level of abstraction in building the categories, the raters revised them together and agreed on the final categories by paraphrasing representative examples and building a hierarchy of categories. Following this, both raters independently worked through the material again. When GPT-4’s responses were wrong, the explanation was analysed using Bloom’ s Taxonomy levels, including remember, understand, apply, analyse, evaluate, and create.8

2.3 Ethics

The Ethics Committee of the Faculty of Medicine at University Hospital Tübingen approved the study (number 076/2023A). All data were kept anonymous and were not associated with students.

3 Results

3.1 Quantitative Results

3.1.1 Distribution of correctly and incorrectly answered questions

For the detailed prompt, the success rate was 93% (284 of 307 questions). For the short prompt, GPT-4 answered 91% of the questions correctly (278 of 307 questions). The distribution is shown in Figure 1.

Figure 1:
  • Download figure
  • Open in new tab
Figure 1:

Distribution of correctly and incorrectly answered questions by prompt version.

3.1.2 Question difficulty

Among all questions, the median difficulty was 0.892 (Q25-Q75: 0.705–0.949). The distribution of the question difficulty for correctly and incorrectly answered questions is displayed in Figure 2.

Figure 2:
  • Download figure
  • Open in new tab
Figure 2:

Question difficulty by prompt version and answer correctness.

Figure 3:
  • Download figure
  • Open in new tab
Figure 3:

Performance of GPT-4 for all relevant terms for the detailed prompt (red) and short prompt (blue).

For the detailed prompt, questions answered correctly had a higher difficulty (0.900, Q25-Q75: 0.737–0.952) when compared to questions answered incorrectly (0.705, Q25-Q75: 0.380–0.885). This difference was statistically significant (p = 0.002).

When analysing the short prompt version, we also found a lower difficulty among incorrectly answered questions (0.708, Q25-Q75: 0.500–0.864) when compared to correctly answered questions (0.904, Q25-Q75: 0.741–0.953). A significant difference was detected between the correctly and incorrectly answered questions (p < 0.001).

3.1.3 Exam scores

We further analysed the performance of GPT-4 for all 16 individual terms/exams (Table 1). Regardless of the prompt version, GPT-4 never performed below 78.9% and thus always passed the exams. A total of three exams (one exam for both prompt versions and two exams in the detailed prompt version only) were passed with a score of 100%. The exam rates are displayed in Table 2.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 2: Results of GPT-4 for all relevant terms (WS: winter term, SS: summer term).

3.2 Qualitative results

After the quantitative results were obtained, we analysed the answers given by GPT-4 in detail by applying content analysis.

3.2.1 Correctly answered questions

When GPT-4 answered a question correctly, it provided either a detailed or short explanation (Table 3). In the detailed version, GPT-4 also commented on the distractors. For correctly answered questions, the content analysis revealed logical reasoning and correct utilization of facts. No answers were found to be either implausible or incorrect regarding the facts cited.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 3: (Correct) Answers by GPT-4 for the question described inTable 1 for the detailed and short prompt version.

3.2.2 Incorrectly answered questions

When a question was answered incorrectly by GPT-4, the explanation’s reasoning was analysed based on Bloom’s Taxonomy, including remember, understand, apply, analyse, evaluate, and create (Table 4). The most mistakes were found for “remember” (n = 29), followed by “understand” (n = 23) and “apply” (n = 15). Only one erroneous reasoning example was found for “evaluate”. In the following sections, examples for each category are given in detail.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 4: Count of incorrect reasoning by GPT-4 for each category of Bloom’s Taxonomy. For this analysis, all answers (detailed and short prompt) were analysed together.
3.2.2.1 Bloom’s Taxonomy: Remember

In this section, we provide an example of a question for which GPT-4 failed to remember facts. The questions presented the case of a man with recurrent physical complaints without organic cause. The complaints persisted for ten months, making it appropriate for an “undifferentiated somatoform disorder” (F45.1 according to the International Statistical Classification of Diseases and Related Health Problems 10th Revision (ICD-10)). A “somatization disorder” (F45.0) cannot be chosen, as it requires “symptoms of at least two years duration”.19

However, GPT-4 incorrectly chose “somatization disorder”. In its reasoning, GPT-4 stated:

“… I would choose answer C: Somatization disorder (F45.0) as the most appropriate diagnosis for this patient. Somatization disorder (F45.0) is characterized by persistent, multiple, and recurrent physical complaints for which no sufficient organic cause can be found. The symptoms can occur in different organ or functional systems and are usually severe enough to interfere with the patient’s daily life.”

In its recall, GPT-4 did not mention the required duration of the symptoms, as stated in the ICD-10. From what can be observed in the answer, the concept used by GPT-4 thus missed at least one important fact, and the failure could thus be attributed to GPT-4’s not “remembering”.

3.2.2.2 Bloom’s Taxonomy: Understand

The second level of Bloom’s Taxonomy covers the understanding of facts, defined as “determining the meaning of instructional messages, including oral, written, and graphic communication”8. An example of GPT-4’s not understanding some basic concepts can be found in a question concerning anorexia nervosa. The answer covers the fact that anorexia nervosa can be accompanied by transient hypercholesterolemia that does not necessitate treatment.20,21 However, GPT-4 picked one of the distractors. The distractor chosen by GPT-4 states that the basic metabolic rate is increased during phases of low body weight and decreases after refeeding22 – but the opposite is true. In its explanation, GPT-4 thought about these facts (and remembered them correctly in sentence 3!) but did not seem to understand how they relate and interact:

“In patients with anorexia nervosa, the basal metabolic rate is initially elevated at the onset of weight gain and decreases dramatically during refeeding. Food intake should be adjusted accordingly. Basal metabolic rate here refers to the amount of energy the body needs at rest to maintain its basic functions. In people with anorexia nervosa, the body is initially very efficient at obtaining energy due to persistent nutrient deficiency. However, as refeeding progresses, the basal metabolic rate drops dramatically as the body re-learns to store energy.”

3.2.2.3 Bloom’s Taxonomy: Apply

The third level of Bloom’s Taxonomy refers to the application of previously learned content to other areas. An error in this domain became evident when we asked GPT-4 a question about the diagnosis and treatment of a depressive episode. The answer was that symptoms/criteria must be present for at least 14 days, which is clearly required by ICD-10.19 However, GPT-4 rates this answer as incorrect by implying that this time frame might be more of a rough guideline, not a strict minimum requirement.

“In the diagnosis of depressive episodes, the time criterion is an important factor. However, many experts believe that a period of at least two weeks is rather a rough guideline. Some patients can also go through a depressive episode in a shorter time or fulfil the criteria for a longer time.”

This reveals an issue at the “application” level of Bloom’s Taxonomy. GPT-4 seemed to remember the fact about the temporal criterion (“remember”) and also understand its concept (“understand”). However, it struggled with the application of this rule, suggesting flexibility where there is conventionally rigidity. This can be interpreted as a misapplication of the learned rule and thus a mistake made at the “application” level of Bloom’s learning hierarchy.

3.2.2.4 Bloom’s Taxonomy: Evaluate

The process of evaluating is found at the fifth level of Bloom’s Taxonomy. It is defined as “making judgments based on criteria and standards”.8 GPT-4 made a mistake at Bloom’s “evaluate” level in a more complex question testing the judgement regarding the ideal therapy for a patient. Due to the complexity, the question is presented in full length:

A 57-year-old ready-to-wear clothing maker is referred to the psychosomatic outpatient clinic by her family doctor because of persistent back pain. She reports “terrible” pain in her lower back. She has been unable to work for seven months because of this and can hardly sleep. No one wanted to help her; the family doctor only sent her from doctor to doctor. (The findings from orthopaedics, neurology, neurosurgery, and radiology that she brought with her show minor disc changes on several levels, but no neurological deficits). She does not want to undergo surgery because “they probably could not earn enough money from her as a patient with public insurance”. Now the family doctor has prescribed her an antidepressant, which she will not take. She still has hope of improvement, but if it continues like this, she does not know whether she will put an end to her life. What is the first course of action indicated?

A: Immediate admission to psychiatric ward, against the patient’s will if necessary.

B: Initiate inpatient psychosomatic treatment.

C: Observation of the symptoms under medication with NSAIDs.

D: Obtain a neurosurgical second opinion after current imaging.

E: Clarification of secondary disease gain and immediate termination of incapacity for work.

When we used the short prompt version, GPT-4 chose the answer: Plan a psychosomatic inpatient treatment. In its brief explanation, GPT-4 correctly judged why this was the most likely answer.

However, when we used the detailed prompt, GPT-4 chose the distractor E and sent the patient immediately back to work. GPT-4 argues:

After careful consideration, answer E is most appropriate, but with modification. A detailed explanation of psychosomatic correlations and the possible role of secondary illness gain might help the patient to better understand her situation and is thus indicated. However, immediate cessation of incapacity should not be forced, as this could potentially exacerbate the patient’s pain and frustration. It would make sense to first seek outpatient psychosomatic treatment to help the patient cope with her pain and focus on long-term recovery.

In this example, it is evident that GPT-4 knows the facts (“remember”), understands them (“understand”), applies them to this situation (“apply”), and correctly analysis the situation (“analyse”). GPT-4 can even be observed to evaluate but fails at evaluating the consequences of its decision. While GPT-4 seems to be aware of this and thus wants to modify the answer, it incorrectly evaluates the answer options and does not recommend inpatient psychosomatic treatment.

4 Discussion

Tools based on LLMs are rapidly evolving and influencing medicine1 and medical education.2 In this article, we show that GPT-4 yields a high accuracy in answering psychosomatic medicine questions with success rates well above 90%. This confirms results from other groups, which also show high exam scores, e.g., in USMLE10 and other board exams.23 Although we detected high correct rates for our detailed and short prompt versions, GPT-4 answered more questions correctly when we used the detailed prompt (93% vs. 91%). While our study was not designed to assess reasons for the difference, there are two likely explanations for this difference that deserve attention. First, the difference rates for the two prompt versions could represent a true difference in the LLM’s performance with respect to the prompt version.24 Second, the difference could be a random variation that is known to occur even when the same prompt is used more than once.25,26

We were further interested in GPT-4’s performance compared to that of medical students. Here, our analysis revealed that questions answered correctly by GPT-4 were significantly easier than questions that were incorrectly answered. This difference could be observed for the detailed and short prompt versions. For further comparison, it has to be noted that the question difficulty is not a fixed/static variable but is dependent and calculated on the basis of the responses from human students.16

However, in order to understand why GPT-4 fails at some questions, we further analysed those using a qualitative approach. It is well known that incorrect/inaccurate information is an important issue with LLMs.27 To describe the cognitive process underlying learning, Bloom’s Taxonomy has emerged as a frequently used standard.8 To the best of our knowledge, the levels at which cognitive errors occur for GPT-4 have not yet been elucidated. We thus performed a detailed assessment of the answers and reasoning GPT-4 provided.

In our analysis, we found that most errors were made at the lowest level of Bloom’s Taxonomy, “remember”. In these answers, GPT-4 failed at naming or using a specific fact, as evident from the text response. In the example presented above, GPT-4 named most of the diagnostic criteria for a somatization disorder but did not mention the time domain. In this context, it is important to note that GPT-4 has been trained with publicly available and licensed data (not specified in detail by OpenAI).28 The information required in this example is publicly available in the ICD-1019 and thus expected to be included in the GPT-4 training data.

In a recent study, Johnson and colleagues evaluated ChatGPT for its accuracy in medical information.29 Using a quantitative approach, they found that GPT-3.5 provided medical answers that were between mostly and almost correct. It is important to note that these results were generated using GPT-3.5, the older model, compared to the model we used, GPT-4 (which is claimed to be “40% more likely to produce factual responses than GPT-3.5”28 and shows a better performance in medical exams23). However, the results reported by Johnson et al. are in line with our findings: while GPT-4 uses most facts correctly and completely, it sometimes fails at a specific detail. In psychosomatic medicine, we observed this to be a diagnostic criterion. While this can be an issue, too, missing some specific facts in other areas can make all the difference for patient outcomes. Thus, it is important for those using GPT-4 in medicine to keep in mind that specific facts can be wrong and/or missing.

Some errors were found for the second level of Bloom’s Taxonomy, “understand”. While GPT-4 generally showed a good reasoning capability28 and errors were sparse, we were surprised that some answers yielded obvious logical flaws, as presented in the example above. In this response, GPT-4 confidently presents subsequent sentences that don’t correlate logically. Mechanistically, language models such as GPT utilize likelihood maximization, generating text based on what most likely follows.12,30 However, this approach can result in what is called hallucination: “content that is nonsensical or unfaithful to the provided source content”.30 The resulting medical information might sound very confident but be incorrect,31 thereby posing a significant threat for medical applications.5 This does raise ethical concerns on the use of AI systems for patient-related work, particularly as GPT-4’s algorithms and ethical models are unknown and variable.4 Because GPT-4 is not considered sentient, it neither knows nor cares about the accuracy of its responses.

We also detected some mistakes that represented Bloom’s level of “application”. In our representative example, GPT-4 was quite flexible in applying a very strict time criterion (see example above). This can be interpreted in the context of the process of training LLMs. Although little has been published about this process, classification systems probably represent only a small amount of the data available on a certain subject. It can be further assumed that information designed for the public might not be as specific as strict diagnostic criteria because it serves another audience. Following GPT-4’s likelihood maximization approach, LLMs might thus neglect the specific but probably underrepresented piece of information. Furthermore, GPT-4 has been observed to perform poorly at pure calculation tasks,31 probably also challenging strictly numerical criteria. This is not entirely surprising, as GPT-4 is a large language model, with an emphasis on language, and is not intended to be used as a calculator. We found only one mistake for Bloom’s “evaluate” level, in which GPT-4 incorrectly judged a medically complex situation.

Our study has some limitations that deserve discussion. First, we focussed on Bloom’s Taxonomy. While Bloom provided good operationalization for cognitive processes, the taxonomy represents a continuum where more than one level can be activated in a single question.8 However, we observed that most errors could be attributed to one cognitive level. Moreover, we chose questions from psychosomatic medicine because many aspects can be covered by written language and do not require images or many numbers. However, psychosomatic medicine is a specialty in which treatment can be individual and especially complex, making categorical judgement harder. Furthermore, our data was acquired once and at a specific date. As the performance of GPT-4 varies over time, this could reduce generalizability.32

In summary, we found that GPT-4 performs extremely well on psychosomatic medicine questions. Questions answered correctly by GPT-4 were also easier for human students compared to incorrectly answered questions. When analysing the mistakes GPT-4 made, we found that most errors corresponded to lower-order cognitive levels, such as “remember” and “understand”. While we still found some mistakes for “apply”, very few errors were found for “analyse” and “evaluate” (“create” was not assessed in our study). To the best of our knowledge, we are the first to describe at which cognitive levels GPT-4 makes mistakes in the context of (psychosomatic) medicine.

Our study has important implications. First, GPT-4 is already capable of answering many questions in (psychosomatic) medicine and could thus (given the technology is made available) reduce the effectiveness of summative assessment. Second, GPT-4 sometimes fails with exact facts, correct understanding, and application of knowledge. However, without exact knowledge, these failures are hard to recognize. Thus, output generated by GPT-4 has to be checked for correctness, especially in those domains. Our research can also help in model training, and further studies can use our results to correlate model training and LLM outcomes.

Data Availability

All data produced in the present study are available upon reasonable request to the authors

References

  1. 1.↵
    Reddy S. Explainability and artificial intelligence in medicine. Lancet Digit Health. 2022 Apr 1;4(4):e214–5.
    OpenUrl
  2. 2.↵
    Eysenbach G. The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers. JMIR Med Educ. 2023 Mar 6;9(1):e46885.
    OpenUrl
  3. 3.↵
    ChatGPT: Optimizing Language Models for Dialogue [Internet]. 2022 [cited 2023 Aug 7]. Available from: https://web.archive.org/web/20221130180912/ https://openai.com/blog/chatgpt/
  4. 4.↵
    Masters K. Ethical use of Artificial Intelligence in Health Professions Education: AMEE Guide No. 158. Med Teach. 2023 Jun 3;45(6):574–84.
    OpenUrl
  5. 5.↵
    Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023 Mar 30;388(13):1233–9.
    OpenUrlCrossRefPubMed
  6. 6.↵
    Bloom B, Furst E, Engelhart M, Hill W, Krathwohl D. Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. London: Longman Green & Co.; 1956.
  7. 7.↵
    Anderson LW, Krathwohl DR. A taxonomy for learning, teaching, and assessing: a revision of Bloom’s taxonomy of educational objectives [Internet]. Complete ed. New York: Longman; 2001 [cited 2023 Aug 15]. 352 p. Available from: http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=009385460&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA
  8. 8.↵
    Krathwohl DR. A Revision of Bloom’s Taxonomy: An Overview. Theory Pract. 2002 Nov 1;41(4):212–8.
    OpenUrlCrossRefWeb of Science
  9. 9.↵
    Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023 Feb 8;9(1):e45312.
    OpenUrl
  10. 10.↵
    Kung TH, Cheatham M, Medenilla A, Sillos C, Leon LD, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023 Feb 9;2(2):e0000198.
    OpenUrlCrossRef
  11. 11.↵
    Wang X, Gong Z, Wang G, Jia J, Xu Y, Zhao J, et al. ChatGPT Performs on the Chinese National Medical Licensing Examination [Internet]. In Review; 2023 Feb [cited 2023 Aug 5]. Available from: https://www.researchsquare.com/article/rs-2584079/v1
  12. 12.↵
    Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4 [Internet]. arXiv; 2023 [cited 2023 Aug 14]. Available from: http://arxiv.org/abs/2303.12712
  13. 13.↵
    Epstein RM. Assessment in Medical Education. N Engl J Med. 2007 Jan 25;356(4):387–96.
    OpenUrlCrossRefPubMedWeb of Science
  14. 14.↵
    Case S, Swanson D. Constructing Written Test Questions For the Basic and Clinical Sciences. Natl Board Exam. 2002 Jan 1;
  15. 15.↵
    R Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2023. Available from: https://www.R-project.org/
  16. 16.↵
    Möltner A, Schellberg D, Jünger J. Grundlegende quantitative Analysen medizinischer Prüfungen. GMS Z Med Ausbild. 2006;23(3):2006–23.
    OpenUrl
  17. 17.↵
    Mayring P. Qualitative inhaltsanalyse. Handbuch qualitative Forschung in der Psychologie. Handb Qual Forsch Psychol Mey G Mruck K Eds. 2010;601–13.
  18. 18.↵
    Moldt JA, Festl-Wietek T, Madany Mamlouk A, Nieselt K, Fuhl W, Herrmann-Werner A. Chatbots for future docs: exploring medical students’ attitudes and knowledge towards artificial intelligence and medical chatbots. Med Educ Online. 2023 Dec 31;28(1):2182659.
    OpenUrl
  19. 19.↵
    World Health Organization. ICD-10: international statistical classification of diseases and related health problems: tenth revision [Internet]. World Health Organization; 2004 [cited 2023 Aug 5]. Available from: https://apps.who.int/iris/handle/10665/42980
  20. 20.↵
    Ohwada R, Hotta M, Oikawa S, Takano K. Etiology of hypercholesterolemia in patients with anorexia nervosa. Int J Eat Disord. 2006;39(7):598–601.
    OpenUrlCrossRefPubMedWeb of Science
  21. 21.↵
    Weinbrenner T, Züger M, Jacoby GE, Herpertz S, Liedtke R, Sudhop T, et al. Lipoprotein metabolism in patients with anorexia nervosa: a case–control study investigating the mechanisms leading to hypercholesterolaemia. Br J Nutr. 2004 Jun;91(6):959–69.
    OpenUrlCrossRefPubMed
  22. 22.↵
    Kosmiski L, Schmiege SJ, Mascolo M, Gaudiani J, Mehler PS. Chronic Starvation Secondary to Anorexia Nervosa Is Associated With an Adaptive Suppression of Resting Energy Expenditure. J Clin Endocrinol Metab. 2014 Mar;99(3):908–14.
    OpenUrlPubMed
  23. 23.↵
    Takagi S, Watari T, Erabi A, Sakaguchi K. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study. JMIR Med Educ. 2023 Jun 29;9(1):e48002.
    OpenUrlCrossRef
  24. 24.↵
    Lester B, Al-Rfou R, Constant N. The Power of Scale for Parameter-Efficient Prompt Tuning. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing [Internet]. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics; 2021 [cited 2023 Aug 17]. p. 3045–59. Available from: https://aclanthology.org/2021.emnlp-main.243
  25. 25.↵
    Lyu Q, Tan J, Zapadka ME, Ponnatapura J, Niu C, Myers KJ, et al. Translating Radiology Reports into Plain Language using ChatGPT and GPT-4 with Prompt Learning: Promising Results, Limitations, and Potential [Internet]. arXiv; 2023 [cited 2023 Aug 17]. Available from: http://arxiv.org/abs/2303.09038
  26. 26.↵
    Dale R. GPT-3: What’s it good for? Nat Lang Eng. 2021 Jan;27(1):113–8.
    OpenUrl
  27. 27.↵
    Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare. 2023 Jan;11(6):887.
    OpenUrl
  28. 28.↵
    GPT-4 [Internet]. [cited 2023 Aug 9]. Available from: https://openai.com/gpt-4
  29. 29.↵
    Johnson D, Goodman R, Patrinely J, Stone C, Zimmerman E, Donald R, et al. Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model. Res Sq. 2023 Feb 28;rs.3.rs-2566942.
  30. 30.↵
    Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of Hallucination in Natural Language Generation. ACM Comput Surv. 2023 Mar 3;55(12):248:1-248:38.
    OpenUrl
  31. 31.↵
    Bhayana R, Bleakney RR, Krishna S. GPT-4 in Radiology: Improvements in Advanced Reasoning. Radiology [Internet]. 2023 May 16 [cited 2023 Aug 9]; Available from: https://pubs.rsna.org/doi/10.1148/radiol.230987
  32. 32.↵
    Chen L, Zaharia M, Zou J. How is ChatGPT’s behavior changing over time? [Internet]. arXiv; 2023 [cited 2023 Aug 9]. Available from: http://arxiv.org/abs/2307.09009
Back to top
PreviousNext
Posted August 21, 2023.
Download PDF
Data/Code
Email

Thank you for your interest in spreading the word about medRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Assessing ChatGPT’s Mastery of Bloom’s Taxonomy using psychosomatic medicine exam questions
(Your Name) has forwarded a page to you from medRxiv
(Your Name) thought you would like to see this page from the medRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Assessing ChatGPT’s Mastery of Bloom’s Taxonomy using psychosomatic medicine exam questions
Anne Herrmann-Werner, Teresa Festl-Wietek, Friederike Holderried, Lea Herschbach, Jan Griewatz, Ken Masters, Stephan Zipfel, Moritz Mahling
medRxiv 2023.08.18.23294159; doi: https://doi.org/10.1101/2023.08.18.23294159
Twitter logo Facebook logo LinkedIn logo Mendeley logo
Citation Tools
Assessing ChatGPT’s Mastery of Bloom’s Taxonomy using psychosomatic medicine exam questions
Anne Herrmann-Werner, Teresa Festl-Wietek, Friederike Holderried, Lea Herschbach, Jan Griewatz, Ken Masters, Stephan Zipfel, Moritz Mahling
medRxiv 2023.08.18.23294159; doi: https://doi.org/10.1101/2023.08.18.23294159

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Medical Education
Subject Areas
All Articles
  • Addiction Medicine (349)
  • Allergy and Immunology (668)
  • Allergy and Immunology (668)
  • Anesthesia (181)
  • Cardiovascular Medicine (2648)
  • Dentistry and Oral Medicine (316)
  • Dermatology (223)
  • Emergency Medicine (399)
  • Endocrinology (including Diabetes Mellitus and Metabolic Disease) (942)
  • Epidemiology (12228)
  • Forensic Medicine (10)
  • Gastroenterology (759)
  • Genetic and Genomic Medicine (4103)
  • Geriatric Medicine (387)
  • Health Economics (680)
  • Health Informatics (2657)
  • Health Policy (1005)
  • Health Systems and Quality Improvement (985)
  • Hematology (363)
  • HIV/AIDS (851)
  • Infectious Diseases (except HIV/AIDS) (13695)
  • Intensive Care and Critical Care Medicine (797)
  • Medical Education (399)
  • Medical Ethics (109)
  • Nephrology (436)
  • Neurology (3882)
  • Nursing (209)
  • Nutrition (577)
  • Obstetrics and Gynecology (739)
  • Occupational and Environmental Health (695)
  • Oncology (2030)
  • Ophthalmology (585)
  • Orthopedics (240)
  • Otolaryngology (306)
  • Pain Medicine (250)
  • Palliative Medicine (75)
  • Pathology (473)
  • Pediatrics (1115)
  • Pharmacology and Therapeutics (466)
  • Primary Care Research (452)
  • Psychiatry and Clinical Psychology (3432)
  • Public and Global Health (6527)
  • Radiology and Imaging (1403)
  • Rehabilitation Medicine and Physical Therapy (814)
  • Respiratory Medicine (871)
  • Rheumatology (409)
  • Sexual and Reproductive Health (410)
  • Sports Medicine (342)
  • Surgery (448)
  • Toxicology (53)
  • Transplantation (185)
  • Urology (165)