Evaluation of Large Language Model Performance on the Biomedical Language Understanding and Reasoning Benchmark: Comparative Study

Hui Feng, Francesco Ronzano, Jude LaFleur, Matthew Garber, Rodrigo de Oliveira, Kathryn Rough, Katharine Roth, Jay Nanavati, Khaldoun Zine El Abidine, Christina Mack
doi: https://doi.org/10.1101/2024.05.17.24307411
Author affiliation (all authors): Real world solutions, IQVIA
For correspondence: Hui Feng, hui.feng{at}iqvia.com

Abstract

Background The availability of increasingly powerful large language models (LLMs) has attracted substantial interest in their potential for interpreting and generating human-like text for biomedical and clinical applications. However, adopting LLMs for specific use cases raises demands for high accuracy, concerns about balancing generalizability and domain specificity, and questions about robustness to prompting. A framework or method for choosing which LLMs (or prompting strategies) to adopt for specific biomedical or clinical tasks is also lacking.

Objective To address open questions about applying LLMs to biomedical applications, this study aims to 1) propose a framework to comprehensively evaluate and compare the performance of a range of LLMs and prompting techniques on a suite of biomedical natural language processing (NLP) tasks; and 2) use the framework to benchmark several general-purpose LLMs and biomedical domain-specific LLMs.

Methods We evaluated and compared six general-purpose LLMs (GPT-4, GPT-3.5-Turbo, Flan-T5-XXL, Llama-3-8B-Instruct, Yi-1.5-34B-Chat, and Zephyr-7B-Beta) and three healthcare-specific LLMs (Medicine-Llama3-8B, Meditron-7B, and MedLLaMA-13B) on a set of 13 datasets – referred to as the Biomedical Language Understanding and Reasoning Benchmark (BLURB) – covering six commonly needed medical natural language processing tasks: named entity recognition (NER); relation extraction (RE); population, interventions, comparators, and outcomes (PICO); sentence similarity (SS); document classification (Class.); and question-answering (QA). All models were evaluated without further training or fine-tuning. Model performance was assessed according to a range of prompting strategies (formalized as a systematic, reusable prompting framework) and relied on the standard, task-specific evaluation metrics defined by BLURB.
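
As an illustration of how per-dataset metrics can roll up into a single benchmark score, the following is a minimal sketch. It assumes the macro-averaging scheme used by the BLURB leaderboard (average the datasets within each task, then average the task means); the abstract itself does not spell out the aggregation. The dataset names are BLURB datasets, but the numbers are made-up placeholders, not results from this study, and `macro_average` is a hypothetical helper name.

```python
from statistics import mean


def macro_average(scores_by_task):
    """Average datasets within each task, then average the task means.

    scores_by_task maps a task name to a dict of {dataset name: score}.
    Returns (overall score, per-task means).
    """
    task_means = {
        task: mean(dataset_scores.values())
        for task, dataset_scores in scores_by_task.items()
    }
    return mean(task_means.values()), task_means


# Placeholder scores for illustration only (NOT results from the study).
placeholder_scores = {
    "NER": {"BC5-chem": 60.0, "BC5-disease": 80.0},
    "QA": {"PubMedQA": 50.0},
}

overall, per_task = macro_average(placeholder_scores)
print(per_task)
print(overall)
```

Under this scheme a task with many datasets (such as NER) does not outweigh a task with a single dataset, because each task contributes one mean to the overall score.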

Results Across all tasks, GPT-4 outperformed other LLMs, achieving a score of 64.6 on the benchmark, though other models, such as Flan-T5-XXL and Llama-3-8B-Instruct, demonstrated competitive performance on multiple tasks. We found that general-purpose models achieved better overall scores than domain-specific models, sometimes by significant margins. We observed a substantial impact of strategically editing the prompt describing the task and a consistent improvement in performance when including examples semantically similar to the input text. Additionally, the most performant prompts for nearly half the models outperformed the previously reported best results for the PubMedQA dataset from the BLURB leaderboard.
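
The idea of including examples semantically similar to the input text can be sketched as follows. This is a simplified illustration, not the authors' implementation: the abstract does not say how similarity was computed, so a token-count cosine similarity stands in for whatever embedding model was used. The example pool, its labels, the prompt layout, and the helper names (`select_examples`, `build_prompt`) are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_examples(query, pool, k=2):
    """Return the k pool items whose text is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        pool, key=lambda ex: cosine(q, bag_of_words(ex["text"])), reverse=True
    )
    return ranked[:k]


def build_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    parts = [instruction]
    for ex in examples:
        parts.append(f"Text: {ex['text']}\nAnswer: {ex['label']}")
    parts.append(f"Text: {query}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical labeled pool; labels are placeholders, not clinical claims.
pool = [
    {"text": "Does aspirin reduce the risk of myocardial infarction?", "label": "yes"},
    {"text": "Is metformin associated with weight loss in diabetes?", "label": "maybe"},
    {"text": "Do statins lower the risk of stroke in adults?", "label": "no"},
]
query = "Does aspirin lower the risk of stroke?"

chosen = select_examples(query, pool, k=2)
prompt = build_prompt("Answer yes, no, or maybe.", chosen, query)
print(prompt)
```

In practice the similarity function would likely be swapped for a dense sentence-embedding model; the selection and prompt-assembly steps stay the same.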

Conclusions These results provide evidence of the potential LLMs may have for biomedical applications and highlight the importance of robust evaluation before adopting LLMs for any specific use case. Notably, performant open-source LLMs such as Llama-3-8B-Instruct and Flan-T5-XXL show promise for use cases where trustworthiness and data confidentiality are concerns, as these models can be hosted locally, offering better security, transparency, and explainability. Continuing to explore how these emerging technologies can be adapted for the healthcare setting, paired with human expertise, and enhanced through quality control measures will be important for enabling responsible innovation with LLMs in the biomedical area.

Competing Interest Statement

All authors are employees of IQVIA. This study is funded by IQVIA. FR received research funding from Torres Quevedo R&D Contractor, Spanish Ministry of Science, Innovation and Universities (up to 11/2021). HF, KRough, JN, CM, and KZ have stock in IQVIA. RO has stock in Arria NLG. KRough has stock in Google. JN has stock in Microsoft, AZ, Nvidia, and Meta. CM has stock in AZ, J&J, and MindMed. FR was previously employed by Medbioinformatics Solutions SL. RO was previously employed by Arria NLG. JN was previously employed by AZ. KRough was previously employed by Google.

Funding Statement

This study was funded by IQVIA.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

  • Disclosure: No AI-assisted technologies were used to assist with the writing of the submitted work.

  • We have added evaluation results on additional models, and have updated our analysis and discussions based on the additional observations.

Data Availability

All underlying data used in this study are available online at https://microsoft.github.io/BLURB/tasks.html

  • Abbreviations

    BERT: Bidirectional Encoder Representations from Transformers
    BLURB: Biomedical Language Understanding and Reasoning Benchmark
    Class.: document classification
    LLM: large language model
    NER: named entity recognition
    NLP: natural language processing
    PICO: population, interventions, comparators, and outcomes
    QA: question answering
    RE: relation extraction
    SS: sentence similarity
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
    Posted August 20, 2024.
    Subject Area

    • Health Informatics