Skip to main content
medRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search

CarD-T: Interpreting Carcinomic Lexicon via Transformers

Jamey O’Neill, Gudur Ashrith Reddy, Nermeeta Dhillon, Osika Tripathi, View ORCID ProfileLudmil Alexandrov, View ORCID ProfileParag Katira
doi: https://doi.org/10.1101/2024.08.13.24311948
Jamey O’Neill
1Mechanical Engineering Department, San Diego State University, San Diego, CA, USA
2Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: joneilliii{at}sdsu.edu pkatira{at}sdsu.edu
Gudur Ashrith Reddy
1Mechanical Engineering Department, San Diego State University, San Diego, CA, USA
2Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Nermeeta Dhillon
1Mechanical Engineering Department, San Diego State University, San Diego, CA, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Osika Tripathi
3Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, La Jolla, CA, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Ludmil Alexandrov
2Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
4Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
5Moores Cancer Center, University of California San Diego, La Jolla, CA, USA
6Sanford Stem Cell Institute, University of California San Diego, La Jolla, CA, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Ludmil Alexandrov
Parag Katira
1Mechanical Engineering Department, San Diego State University, San Diego, CA, USA
7Computational Science Research Center, San Diego State University, San Diego, CA, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Parag Katira
  • For correspondence: joneilliii{at}sdsu.edu pkatira{at}sdsu.edu
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Supplementary material
  • Data/Code
  • Preview PDF
Loading

Abstract

The identification and classification of carcinogens is critical in cancer epidemiology, necessitating updated methodologies to manage the burgeoning biomedical literature. Current systems, like those run by the International Agency for Research on Cancer (IARC) and the National Toxicology Program (NTP), face challenges due to manual vetting and disparities in carcinogen classification spurred by the volume of emerging data. To address these issues, we introduced the Carcinogen Detection via Transformers (CarD-T) framework, a text analytics approach that combines transformer-based machine learning with probabilistic statistical analysis to efficiently nominate carcinogens from scientific texts. CarD-T uses Named Entity Recognition (NER) trained on PubMed abstracts featuring known carcinogens from IARC groups and includes a context classifier to enhance accuracy and manage computational demands. Using this method, journal publication data indexed with carcinogenicity & carcinogenesis Medical Subject Headings (MeSH) terms from the last 25 years was analyzed, identifying potential carcinogens. Training CarD-T on 60% of established carcinogens (Group 1 and 2A carcinogens, IARC designation), CarD-T correctly to identifies all of the remaining Group 1 and 2A designated carcinogens from the analyzed text. In addition, CarD-T nominates roughly 1500 more entities as potential carcinogens that have at least two publications citing evidence of carcinogenicity. Comparative assessment of CarD-T against GPT-4 model reveals a high recall (0.857 vs 0.705) and F1 score (0.875 vs 0.792), and comparable precision (0.894 vs 0.903). Additionally, CarD-T highlights 554 entities that show disputing evidence for carcinogenicity. These are further analyzed using Bayesian temporal Probabilistic Carcinogenic Denomination (PCarD) to provide probabilistic evaluations of their carcinogenic status based on evolving evidence. Our findings underscore that the CarD-T framework is not only robust and effective in identifying and nominating potential carcinogens within vast biomedical literature but also efficient on consumer GPUs. This integration of advanced NLP capabilities with vital epidemiological analysis significantly enhances the agility of public health responses to carcinogen identification, thereby setting a new benchmark for automated, scalable toxicological investigations.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

P.K. and J.O acknowledge funding support from the NIH: NIMHD U54MD012397 and NCI U54CA285117.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

  • Figure 3 in the manuscript has been revised to reflect the correct values

Data Availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

https://huggingface.co/jimnoneill/CarD-T

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Back to top
PreviousNext
Posted August 31, 2024.
Download PDF

Supplementary Material

Data/Code
Email

Thank you for your interest in spreading the word about medRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
CarD-T: Interpreting Carcinomic Lexicon via Transformers
(Your Name) has forwarded a page to you from medRxiv
(Your Name) thought you would like to see this page from the medRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
CarD-T: Interpreting Carcinomic Lexicon via Transformers
Jamey O’Neill, Gudur Ashrith Reddy, Nermeeta Dhillon, Osika Tripathi, Ludmil Alexandrov, Parag Katira
medRxiv 2024.08.13.24311948; doi: https://doi.org/10.1101/2024.08.13.24311948
Twitter logo Facebook logo LinkedIn logo Mendeley logo
Citation Tools
CarD-T: Interpreting Carcinomic Lexicon via Transformers
Jamey O’Neill, Gudur Ashrith Reddy, Nermeeta Dhillon, Osika Tripathi, Ludmil Alexandrov, Parag Katira
medRxiv 2024.08.13.24311948; doi: https://doi.org/10.1101/2024.08.13.24311948

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Health Informatics
Subject Areas
All Articles
  • Addiction Medicine (349)
  • Allergy and Immunology (668)
  • Allergy and Immunology (668)
  • Anesthesia (181)
  • Cardiovascular Medicine (2648)
  • Dentistry and Oral Medicine (316)
  • Dermatology (223)
  • Emergency Medicine (399)
  • Endocrinology (including Diabetes Mellitus and Metabolic Disease) (942)
  • Epidemiology (12228)
  • Forensic Medicine (10)
  • Gastroenterology (759)
  • Genetic and Genomic Medicine (4103)
  • Geriatric Medicine (387)
  • Health Economics (680)
  • Health Informatics (2657)
  • Health Policy (1005)
  • Health Systems and Quality Improvement (985)
  • Hematology (363)
  • HIV/AIDS (851)
  • Infectious Diseases (except HIV/AIDS) (13695)
  • Intensive Care and Critical Care Medicine (797)
  • Medical Education (399)
  • Medical Ethics (109)
  • Nephrology (436)
  • Neurology (3882)
  • Nursing (209)
  • Nutrition (577)
  • Obstetrics and Gynecology (739)
  • Occupational and Environmental Health (695)
  • Oncology (2030)
  • Ophthalmology (585)
  • Orthopedics (240)
  • Otolaryngology (306)
  • Pain Medicine (250)
  • Palliative Medicine (75)
  • Pathology (473)
  • Pediatrics (1115)
  • Pharmacology and Therapeutics (466)
  • Primary Care Research (452)
  • Psychiatry and Clinical Psychology (3432)
  • Public and Global Health (6527)
  • Radiology and Imaging (1403)
  • Rehabilitation Medicine and Physical Therapy (814)
  • Respiratory Medicine (871)
  • Rheumatology (409)
  • Sexual and Reproductive Health (410)
  • Sports Medicine (342)
  • Surgery (448)
  • Toxicology (53)
  • Transplantation (185)
  • Urology (165)