
Overcoming Underrepresentation in Clinical Datasets for Accurate Subpopulation-specific Prognosis

Sharmin Afrose, Wenjia Song, Charles B. Nemeroff, Chang Lu, Danfeng (Daphne) Yao
doi: https://doi.org/10.1101/2021.03.26.21254401
Sharmin Afrose, BS (1)
Wenjia Song, BS (1)
Charles B. Nemeroff, MD PhD (2)
Chang Lu, PhD (3)
Danfeng (Daphne) Yao, PhD (1)

1 Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
2 Department of Psychiatry and Behavioral Sciences, the University of Texas at Austin Dell Medical School, Austin, TX, USA
3 Department of Chemical Engineering, Virginia Tech, Blacksburg, VA, USA

For correspondence: danfeng{at}vt.edu

Abstract

Clinical datasets are intrinsically imbalanced and dominated by overwhelming majority groups. Off-the-shelf machine learning models optimize prognosis for the majority patient types (e.g., the healthy class), causing substantial errors on the minority prediction class (e.g., the disease class) and on minority subpopulations (e.g., Black or young patients). For example, in a mortality benchmark, the missed-prediction rate for death cases is 36.6 times higher than that for non-death cases. Racial and age disparities also exist. Conventional metrics such as AUC-ROC do not reflect these deficiencies. We design a double prioritized (DP) sampling technique to improve accuracy for underrepresented subpopulations. We report our findings on four prediction tasks over two clinical datasets, with comparisons against eight existing sampling solutions. With DP, the recall of minority classes improves by 35.4–130.4%. Compared with state-of-the-art methods, DP sampling gives 1.2–58.8 times more balanced recall and precision. Our method trains customized models for specific race or age groups, a departure from the one-model-fits-all-demographics paradigm. Because underrepresented groups are encountered daily in clinical medicine, our contributions likely have broad implications.
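
The gap between an aggregate score and the recall of a minority class within a minority subpopulation can be illustrated with a small sketch. Everything below is hypothetical: the toy labels, the group names "A" and "B", and the helper functions recall_by_class and recall_by_group are assumptions made for illustration and are not taken from the study, its datasets, or the DP sampling procedure itself.

```python
from collections import defaultdict


def recall_by_class(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    hits = defaultdict(int)    # correctly predicted samples per true label
    totals = defaultdict(int)  # all samples per true label
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in totals}


def recall_by_group(y_true, y_pred, groups):
    """Return {group: {label: recall}}, computed separately per subgroup."""
    split = defaultdict(lambda: ([], []))
    for truth, pred, group in zip(y_true, y_pred, groups):
        split[group][0].append(truth)
        split[group][1].append(pred)
    return {g: recall_by_class(t, p) for g, (t, p) in split.items()}


# Hypothetical toy data: 1 = death (minority class), 0 = survival.
# Group "A" is the majority subpopulation, group "B" the minority one.
groups = ["A"] * 16 + ["B"] * 4
y_true = [0] * 14 + [1, 1] + [0, 0, 1, 1]
y_pred = [0] * 14 + [1, 1] + [0, 0, 0, 0]  # every death in group "B" is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
overall = recall_by_class(y_true, y_pred)
per_group = recall_by_group(y_true, y_pred, groups)

print("accuracy:", accuracy)
print("overall recall by class:", overall)
print("recall by group and class:", per_group)
```

In this toy case the overall accuracy looks high while the death-class recall for group "B" is zero, which is the kind of deficiency that a single aggregate metric can hide. Reporting recall per class and per subgroup, as sketched here, is one simple way to surface it.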

Competing Interest Statement

Charles B. Nemeroff (CBN) declares consulting for the following companies in the last 12 months: ANeuroTech (division of Anima BV), Taisho Pharmaceutical, Inc., Takeda, Signant Health, Sunovion Pharmaceuticals, Inc., Janssen Research & Development LLC, Magstim, Inc., Navitor Pharmaceuticals, Inc., Intra-Cellular Therapies, Inc., EMA Wellness, Acadia Pharmaceuticals, Axsome, Sage, BioXcel Therapeutics, Silo Pharma, XW Pharma, Neuritek, Engrail Therapeutics, Corcept Therapeutics Pharmaceuticals Company. CBN owns stock in Xhale, Seattle Genetics, Antares, BI Gen Holdings, Inc., Corcept Therapeutics Pharmaceuticals Company, EMA Wellness. CBN serves on the scientific advisory boards of ANeuroTech (division of Anima BV), Brain and Behavior Research Foundation (BBRF), Anxiety and Depression Association of America (ADAA), Skyland Trail, Signant Health, Laureate Institute for Brain Research (LIBR), Inc., Magnolia CNS. CBN serves on the boards of directors of Gratitude America, ADAA, Xhale Smart, Inc. CBN has patents in antipsychotic drug delivery. The other authors have no competing interests.

Funding Statement

No external funding was received.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

We submitted applications/forms to access the MIMIC III dataset from the PhysioNet team at the MIT Laboratory for Computational Physiology and the SEER dataset from the National Cancer Institute. We were granted access to the MIMIC III and SEER datasets after completing the registration procedures.

All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Data Availability

The MIMIC III and SEER data used in this study are not publicly downloadable but can be requested at their original sites. Parties interested in data access should visit the MIMIC III website (https://mimic.physionet.org/gettingstarted/access/) and the SEER website (https://seer.cancer.gov/data/access.html) to submit access requests.


Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted April 04, 2021.
Subject Area

  • Health Informatics