medRxiv

Developing and validating a pancreatic cancer risk model for the general population using multi-institutional electronic health records from a federated network

Kai Jia, Steven Kundrot, Matvey Palchuk, Jeff Warnick, Kathryn Haapala, Irving Kaplan, Martin Rinard, Limor Appelbaum
doi: https://doi.org/10.1101/2023.02.05.23285192
Kai Jia
1 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge MA 02139 USA
Steven Kundrot
2 TriNetX, LLC, Cambridge MA 02140 USA
Matvey Palchuk
2 TriNetX, LLC, Cambridge MA 02140 USA
Jeff Warnick
2 TriNetX, LLC, Cambridge MA 02140 USA
Kathryn Haapala
2 TriNetX, LLC, Cambridge MA 02140 USA
Irving Kaplan
3 Beth Israel Deaconess Medical Center, Boston MA 02215 USA
Martin Rinard
1 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge MA 02139 USA
Limor Appelbaum
3 Beth Israel Deaconess Medical Center, Boston MA 02215 USA
  • For correspondence: lappelb1{at}bidmc.harvard.edu

Abstract

Purpose Pancreatic Ductal Adenocarcinoma (PDAC) screening can enable detection of early-stage disease and long-term survival. Current guidelines are based on inherited predisposition; only about 10% of PDAC cases meet screening eligibility criteria. Electronic Health Record (EHR) risk models for the general population could identify a high-risk cohort and thereby expand the currently screened population. Using EHR data from a multi-institutional federated network, we developed and validated a PDAC risk prediction model for the general US population.

Methods We developed Neural Network (NN) and Logistic Regression (LR) models on structured, routinely collected EHR data from 55 US Health Care Organizations (HCOs). Our models used sex, age, frequency of clinical encounters, diagnoses, lab tests, and medications to predict PDAC risk 6 to 18 months before diagnosis. Model performance was assessed using Receiver Operating Characteristic (ROC) curves and calibration plots. Models were externally validated using location, race, and temporal validation, with performance assessed using Area Under the Curve (AUC). We further simulated model deployment, evaluating sensitivity, specificity, Positive Predictive Value (PPV), and Standardized Incidence Ratio (SIR). We calculated SIR against SEER data for the general population with matched demographics.
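The deployment metrics named above can be illustrated with a minimal Python sketch. It assumes the standard textbook definitions (sensitivity, specificity, and PPV from confusion-matrix counts; SIR as observed cases divided by the cases expected from reference incidence rates applied to matched person-years). All names, strata, counts, and rates below are hypothetical placeholders chosen for illustration; they are not the study's data, SEER values, or the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class DeploymentCounts:
    """Confusion-matrix counts at one risk threshold (hypothetical inputs)."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int


def sensitivity(c: DeploymentCounts) -> float:
    # Fraction of actual cases flagged as high risk.
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: DeploymentCounts) -> float:
    # Fraction of controls correctly left unflagged.
    return c.true_neg / (c.true_neg + c.false_pos)


def ppv(c: DeploymentCounts) -> float:
    # Fraction of flagged individuals who are actual cases.
    return c.true_pos / (c.true_pos + c.false_pos)


def standardized_incidence_ratio(observed_cases, person_years, reference_rates):
    """Observed cases in the flagged group divided by expected cases.

    Expected cases are obtained by applying reference incidence rates
    (cases per person-year) to the flagged group's person-years,
    stratum by stratum.
    """
    expected = sum(person_years[s] * reference_rates[s] for s in person_years)
    return observed_cases / expected


# Illustrative numbers only.
counts = DeploymentCounts(true_pos=355, false_pos=4400, true_neg=95600, false_neg=645)
person_years = {"40-59": 2000.0, "60+": 3000.0}    # hypothetical strata
reference_rates = {"40-59": 0.0001, "60+": 0.0006}  # hypothetical rates
sir = standardized_incidence_ratio(30, person_years, reference_rates)
```

Raising the risk threshold generally shrinks the flagged group, which tends to trade sensitivity for higher specificity and a higher SIR; this is why the abstract reports SIR as a range that depends on the chosen threshold.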

Results The final dataset included 63,884 PDAC cases and 3,604,863 controls aged 40 to 97.4 years. Our best-performing NN model obtained an AUC of 0.829 (95% CI: 0.821 to 0.837) on the test set. Calibration plots showed good agreement between predicted and observed risks. In race-based external validation (trained on four races, tested on the fifth), NN AUCs were 0.836 (95% CI: 0.797 to 0.874), 0.838 (95% CI: 0.821 to 0.855), 0.824 (95% CI: 0.819 to 0.830), 0.842 (95% CI: 0.750 to 0.934), and 0.774 (95% CI: 0.771 to 0.777) for AIAN, Asian, Black, NHPI, and White, respectively. In location-based external validation (trained on three locations, tested on the fourth), NN AUCs were 0.751 (95% CI: 0.746 to 0.757), 0.749 (95% CI: 0.745 to 0.753), 0.752 (95% CI: 0.748 to 0.756), and 0.722 (95% CI: 0.713 to 0.732) for Midwest, Northeast, South, and West, respectively. In temporal external validation (trained on data before certain dates, tested on data after a given date), the average NN AUC was 0.784 (95% CI: 0.763 to 0.805). Simulated deployment on the test set, with a mean follow-up of 2.00 (SD 0.39) years, yielded SIRs for NN ranging from 2.42 to 83.5, depending on the chosen risk threshold. At an SIR of 5.44, which exceeds the current threshold for inclusion in PDAC screening programs, NN sensitivity was 35.5% (specificity 95.6%), which is 3.5 times the sensitivity of current screening of those with an inherited predisposition to PDAC. At a chosen high-risk threshold with a lower SIR, specificity was about 85%, and both models exhibited sensitivities above 50%.
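The "train on all groups but one, test on the held-out group" design used for the race-based and location-based validation can be sketched as follows. This is a minimal illustration under stated assumptions: records are plain dictionaries with a hypothetical `region` field, model fitting is omitted, and AUC is computed with the generic pairwise (Mann-Whitney) definition rather than any particular library routine. It is not the authors' pipeline.

```python
def leave_one_group_out(records, group_key):
    """Yield (held_out_group, train, test) for each distinct group value."""
    groups = sorted({r[group_key] for r in records})
    for held_out in groups:
        train = [r for r in records if r[group_key] != held_out]
        test = [r for r in records if r[group_key] == held_out]
        yield held_out, train, test


def auc(scores, labels):
    """Probability that a random case outscores a random control.

    Ties count as one half. This pairwise form is quadratic in the
    sample size and is meant for illustration on small inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy records; "region" and "label" are illustrative field names.
records = [
    {"region": "Midwest", "label": 1},
    {"region": "Northeast", "label": 0},
    {"region": "South", "label": 1},
    {"region": "West", "label": 0},
]
folds = list(leave_one_group_out(records, "region"))

# Hypothetical scores for two cases (label 1) and two controls (label 0).
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In a real evaluation, a model would be fitted on each `train` split and scored on the matching `test` split, giving one AUC per held-out group, as in the per-race and per-region figures reported above.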

Conclusions Our models demonstrate good accuracy and generalizability across populations from diverse geographic locations, races, and over time. At comparable risk levels, these models can predict up to three times as many PDAC cases as current screening guidelines. These models can therefore be used to identify high-risk individuals, overlooked by current guidelines, who may benefit from PDAC screening or inclusion in an enriched group for further testing such as biomarker testing. Our integration with the federated network provided access to data from a large, geographically and racially diverse patient population as well as a pathway to future clinical deployment.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

LA acknowledges support from the Prevent Cancer Foundation for this work. MR, LA, KJ acknowledge the contribution of resources by TriNetX, including secured laptop computers, access to the TriNetX EHR database, and clinical, technical, legal, and administrative assistance from the TriNetX team of clinical informaticists, engineers, and technical staff. MR and KJ received funding from DARPA and Boeing. MR also received funding from the NSF, Aarno Labs, and Boeing. During the time the research was performed MR consulted for Comcast, Google, Motorola, and Qualcomm.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Any data displayed on the TriNetX federated network database platform in aggregate form, or any patient-level data provided in a data set generated by the federated network database platform, only contains de-identified data as per the de-identification standard defined in the Health Insurance Portability and Accountability Act Privacy Rule. The process by which the data is de-identified is attested to through a formal determination by a qualified expert as defined in the Health Insurance Portability and Accountability Act Privacy Rule. This formal determination by a qualified expert supersedes the need for the previous waiver of TriNetX from the Western Institutional Review Board. Because this study used only de-identified patient records and did not involve the collection, use, or transmittal of individually identifiable data, this study was exempted from Institutional Review Board approval.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Footnotes

  • jiakai{at}mit.edu

  • rinard{at}csail.mit.edu

  • steve.kundrot{at}trinetx.com

  • matvey.palchuk{at}trinetx.com

  • jeff.warnick{at}trinetx.com

  • kathryn.haapala{at}trinetx.com

  • ikaplan{at}bidmc.harvard.edu

  • lappelb1{at}bidmc.harvard.edu

  • ⋆ Co-senior authors.

Data Availability

The de-identified data in the TriNetX federated network database can only be accessed by researchers who are either part of the network or have a collaboration agreement with TriNetX. As stated in the manuscript, we accessed data as part of a no-cost collaboration agreement between BIDMC, MIT, and TriNetX.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted February 08, 2023.

Subject Area

  • Oncology