medRxiv

Inferring Gender from First Names: Comparing the Accuracy of Genderize, Gender API, and the gender R Package on Authors of Diverse Nationality

Alexander D. VanHelene, Ishaani Khatri, C. Beau Hilton, Sanjay Mishra, Ece D. Gamsiz Uzun, Jeremy L. Warner
doi: https://doi.org/10.1101/2024.01.30.24302027
Alexander D. VanHelene
1Lifespan Cancer Institute, Rhode Island Hospital, Providence, Rhode Island
2Center for Clinical Cancer Informatics and Data Science, Legorreta Cancer Center, Brown University, Providence, Rhode Island
BS
Ishaani Khatri
3Warren Alpert Medical School, Brown University, Providence, Rhode Island
BS
C. Beau Hilton
4Department of Internal Medicine, Vanderbilt University, Nashville, Tennessee
MD
Sanjay Mishra
1Lifespan Cancer Institute, Rhode Island Hospital, Providence, Rhode Island
2Center for Clinical Cancer Informatics and Data Science, Legorreta Cancer Center, Brown University, Providence, Rhode Island
3Warren Alpert Medical School, Brown University, Providence, Rhode Island
MS, PhD
Ece D. Gamsiz Uzun
2Center for Clinical Cancer Informatics and Data Science, Legorreta Cancer Center, Brown University, Providence, Rhode Island
3Warren Alpert Medical School, Brown University, Providence, Rhode Island
5Center for Computational Molecular Biology, Brown University, Providence, Rhode Island
6Department of Pathology and Laboratory Medicine, Brown University, Providence, Rhode Island
MS, PhD, FAMIA
Jeremy L. Warner
1Lifespan Cancer Institute, Rhode Island Hospital, Providence, Rhode Island
2Center for Clinical Cancer Informatics and Data Science, Legorreta Cancer Center, Brown University, Providence, Rhode Island
3Warren Alpert Medical School, Brown University, Providence, Rhode Island
MD, MS, FAMIA, FASCO
  • For correspondence: jeremy_warner{at}brown.edu

Abstract

Meta-researchers commonly use tools that infer gender from first names, especially when studying gender disparities. However, these tools vary in accuracy, ease of use, and cost. The objective of this study was to compare the accuracy and cost of the commercial software Genderize and Gender API and the open-source gender R package. Binary gender prediction accuracy was tested on a multi-national dataset of 32,968 gender-labeled clinical trial authors. Additionally, two datasets from previous studies, with 5,779 and 6,131 names, respectively, were re-evaluated with current implementations of Genderize and Gender API. The accuracy of Genderize and Gender API was compared both with and without supplying trialists’ country of origin in the API call. The accuracy of the gender R package was evaluated only without supplying countries of origin. Accuracy was defined as the percentage of correct gender predictions, and accuracy differences between methods were evaluated using McNemar’s test. Genderize and Gender API demonstrated overall accuracies of 96.6% and 96.1%, respectively, when countries of origin were not supplied in the API calls. Both services were most accurate when predicting the gender of German authors, with accuracies greater than 98%, and least accurate with South Korean, Chinese, Singaporean, and Taiwanese authors, with accuracies below 82%. The gender R package achieved below 86% accuracy on the full dataset. In the replication studies, Genderize and Gender API performed better than in the original publications. Our results indicate that Genderize and Gender API are highly accurate, except when evaluating South Korean, Chinese, Singaporean, and Taiwanese names. We also demonstrated that Genderize can provide similar accuracy to Gender API while being 4.85 times less expensive.
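The two statistics named in the abstract, accuracy as the percentage of correct predictions and McNemar’s test for paired differences between tools, can be illustrated with a minimal sketch. The abstract does not state which variant of McNemar’s test was used, so the continuity-corrected chi-square form below is an assumption, and the function names and toy labels are hypothetical rather than taken from the study.

```python
import math
from typing import Sequence, Tuple


def accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Percentage of predictions that match the gold-standard label."""
    correct = sum(p == t for p, t in zip(predicted, truth))
    return 100.0 * correct / len(truth)


def mcnemar(pred_a: Sequence[str], pred_b: Sequence[str],
            truth: Sequence[str]) -> Tuple[float, float]:
    """Continuity-corrected McNemar's test on paired predictions.

    Only discordant pairs matter: names that exactly one tool got right.
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    b = 0  # tool A correct, tool B wrong
    c = 0  # tool A wrong, tool B correct
    for a, bb, t in zip(pred_a, pred_b, truth):
        if a == t and bb != t:
            b += 1
        elif a != t and bb == t:
            c += 1
    if b + c == 0:
        return 0.0, 1.0  # the tools never disagree on correctness
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Toy data (hypothetical): tool A misses one name, tool B misses six.
truth = ["female", "male"] * 5
flip = {"female": "male", "male": "female"}
tool_a = [flip[t] if i == 0 else t for i, t in enumerate(truth)]
tool_b = [flip[t] if i < 6 else t for i, t in enumerate(truth)]

stat, p_value = mcnemar(tool_a, tool_b, truth)
print(accuracy(tool_a, truth), accuracy(tool_b, truth), stat, p_value)
```

Because the test uses only the names on which the two tools disagree, it suits comparisons like this one, where every tool is run on the same set of authors.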

Author Summary

Gender disparities in academia have prompted researchers to investigate gender gaps in professorship roles and publication authorship. Of particular concern are the gender gaps in cancer clinical trial authorship. Methodologies that evaluate gender disparities in academia, such as those that determine the gender ratios of academic departments or of publishing authors in a discipline, often rely on tools that infer gender from first names. However, researchers must choose between gender prediction tools that vary in accuracy, ease of use, and cost. We evaluated the binary gender prediction accuracy of Genderize, Gender API, and the gender R package on a gold-standard dataset of 32,968 clinical trialists from around the world. Genderize and Gender API cost money to use, while the gender R package is free and open source. We found that Genderize and Gender API were more accurate than the gender R package. In addition, Genderize is cheaper than Gender API but is more sensitive to inconsistencies in name formatting and the presence of diacritical marks. Both Genderize and Gender API were most accurate with Western names.
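Since one tool is described as sensitive to name formatting and diacritical marks, a preprocessing step of the following kind might be tried before querying a service. This is a hypothetical sketch, not the authors’ procedure: the function name is invented for illustration, and whether removing diacritics helps or hurts accuracy for a given tool would need to be checked empirically.

```python
import unicodedata


def normalize_first_name(raw: str) -> str:
    """Collapse whitespace, strip combining diacritics, and lowercase.

    Hypothetical helper: the source does not describe any preprocessing.
    """
    # Collapse leading, trailing, and repeated internal whitespace.
    collapsed = " ".join(raw.split())
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", collapsed)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


for name in ["  José ", "ZOË", "Jean-Luc", "María  José"]:
    print(repr(name), "->", repr(normalize_first_name(name)))
```

Letters that have no Unicode decomposition, such as "ø", pass through unchanged, so this approach handles only diacritics encoded as combining marks.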

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

Yes

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study did not require IRB approval because we did not deal with private health information.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The data that support the findings of this study are publicly available from the Harvard DataVerse: https://dataverse.harvard.edu/privateurl.xhtml?token=6d620f82-5ef2-4ea6-90a5-19fb8ca4fe80

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted January 31, 2024.

Subject Area

  • Health Informatics