Abstract
Background and Aims Hepatocellular Carcinoma (HCC) is often diagnosed late, limiting curative treatment options. Conversely, early detection in cirrhotic patients through screening offers high cure rates but is underutilized and misses cases occurring in individuals without cirrhosis. We aimed to build, validate, and simulate the deployment of models for HCC risk stratification using routinely collected Electronic Health Record (EHR) data from a geographically and racially diverse U.S. population.
Methods We developed Logistic Regression (LiricLR) and Neural Network (LiricNN) models for the general population (GP) and the cirrhosis population utilizing EHR data from 46,79 HCC cases and 1,128,202 controls aged 40-100 years. Data were sourced from 64 Health Care Organizations (HCOs) in a federated network spanning academic medical centers, community hospitals, and outpatient clinics nationwide. We evaluated model performance using AUC, calibration plots, and Geometric Mean of Overestimation (GMOE), the geometric mean of the ratios of predicted to actual risks. External validation involved HCO location, race, and temporal factors. Simulated deployment assessed sensitivity, specificity, Positive Predictive Value, and Number Needed to Screen at each risk threshold.
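As an illustration of the GMOE definition above, the following is a minimal sketch rather than the authors' implementation. It assumes the ratios are taken per equal-sized bin of patients sorted by predicted risk (mean predicted risk divided by observed event rate); the binning scheme, the function name `gmoe`, and the `n_bins` parameter are hypothetical and not stated in the source.

```python
import math


def gmoe(predicted, outcomes, n_bins=10):
    """Geometric mean of per-bin ratios: mean predicted risk / observed event rate.

    ASSUMPTION: patients are sorted by predicted risk and split into n_bins
    groups of (nearly) equal size. The source does not specify the grouping.
    """
    pairs = sorted(zip(predicted, outcomes))
    n = len(pairs)
    log_ratios = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        if mean_pred <= 0 or observed <= 0:
            # Ratio is undefined for this bin; skipping it is an assumption.
            continue
        log_ratios.append(math.log(mean_pred / observed))
    if not log_ratios:
        raise ValueError("no bin with positive predicted and observed risk")
    # Geometric mean = exp(mean of logs).
    return math.exp(sum(log_ratios) / len(log_ratios))


if __name__ == "__main__":
    # Toy data: predictions are half the observed rate in both bins.
    preds = [0.05] * 10 + [0.25] * 10
    events = [1] + [0] * 9 + [1] * 5 + [0] * 5
    print(gmoe(preds, events, n_bins=2))
```

Under this reading, a value below 1 indicates that predicted risks are on average lower than observed risks, and a value above 1 indicates overestimation.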
Results LiricLR and LiricNN (GP) achieved test set AUCs of 0.8968 (95% CI: 0.8925, 0.9010) and 0.9254 (95% CI: 0.9218, 0.9289), respectively, leveraging 46 established (cirrhosis, hepatitis, diabetes) and novel (frequency of clinical encounters, platelet, albumin, aminotransferase values) features. Average external validation AUCs of LiricNN were 0.9274 (95% CI: 0.9239, 0.9308) for locations and 0.9284 (95% CI: 0.9247, 0.9320) for races. The average GMOE was 0.887 (95% CI: 0.862, 0.911). Simulated model deployment of LiricNN provides performance metrics across multiple risk thresholds.
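The per-threshold deployment metrics can be sketched as follows. This is a hedged example, not the study's code: it assumes a patient is flagged for screening when the risk score is at or above the threshold, and it takes Number Needed to Screen to be the number of flagged patients per true HCC case (the reciprocal of Positive Predictive Value). The function name `screening_metrics` and the toy data are hypothetical.

```python
def screening_metrics(scores, labels, threshold):
    """Confusion-matrix metrics when patients with score >= threshold are flagged.

    ASSUMPTION: Number Needed to Screen (NNS) = flagged patients / true
    positives, i.e. 1 / PPV. The source does not give its exact definition.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "ppv": tp / (tp + fp) if (tp + fp) else nan,
        "nns": (tp + fp) / tp if tp else float("inf"),
    }


if __name__ == "__main__":
    # Toy risk scores and outcomes (1 = developed HCC), for illustration only.
    scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.05]
    labels = [1, 0, 1, 0, 1, 0]
    for t in (0.15, 0.5, 0.85):
        print(t, screening_metrics(scores, labels, t))
```

Sweeping the threshold in this way makes the trade-off explicit: a lower threshold flags more patients and raises sensitivity, typically at the cost of a lower Positive Predictive Value and a higher Number Needed to Screen.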
Conclusions Liric models utilize routine EHR data to accurately predict risk of HCC development. Their scalability, generalizability, and interpretability set the stage for future clinical deployment and the design of more effective screening programs.
Lay Summary Hepatocellular Carcinoma (HCC), the most common liver cancer, is often diagnosed in late stages, limiting treatment options. Early detection through screening is essential for effective intervention and potential cure. However, current screening mostly targets patients with liver cirrhosis, many of whom do not get screened, while missing others who could develop HCC even without cirrhosis.
To improve screening, we created and tested Liric (LIver cancer RIsk Computation) models. These models use routine medical records from across the country to identify people at high risk of developing HCC.
Liric models have several benefits. Firstly, they can increase awareness among primary care physicians (PCPs) nationwide, improving the utilization of HCC screening. This is particularly crucial in areas with socio-demographic disparities, where access to specialist physicians may be limited. Additionally, Liric models can identify patients who would be missed by current screening guidelines, ensuring a more comprehensive approach to HCC detection.
Liric can be integrated into EHR systems to automatically generate a risk score from routinely collected patient data. This risk score can provide valuable information to physicians and caregivers, helping them make informed decisions about the need for HCC screening. It can also be used to develop cost-effective screening programs by identifying populations in which screening is effective.
Highlights
Screening detects HCC early but is underutilized and misses cases without cirrhosis
We developed, validated, and simulated deployment of Liric to identify individuals at high risk for HCC
Liric uses routinely collected clinical and lab data from a diverse US population
Liric accurately predicts risk of HCC 6-36 months before it occurs
Liric can assist PCPs in identifying individuals most in need of screening
Impacts and implications Effective screening for hepatocellular carcinoma (HCC) is vital to achieve early detection and improved cure rates. However, the existing screening approach primarily targets patients with liver cirrhosis; it is underutilized and fails to identify those without underlying cirrhosis.
Implementation of Liric models has the potential to enhance nationwide awareness among primary care physicians (PCPs) and improve screening utilization for HCC, particularly in regions characterized by socio-demographic disparities. Furthermore, these models can help identify patients who are currently overlooked by existing screening guidelines and aid in the development of new, more effective guidelines.
Integration of Liric models into EHR systems via a federated network would enable automatic generation of risk scores using unfiltered patient data. This approach could more accurately identify at-risk patients, providing valuable information to caregivers for HCC screening.
Competing Interest Statement
KJ and MR are not aware of any payments or services, paid to themselves or MIT, that could be perceived to influence the submitted work. IK and LA are not aware of any payments or services, paid to themselves or BIDMC, that could be perceived to influence the submitted work. During the time the research was performed MR received consulting fees and payment for expert testimony for Comcast, Google, Motorola, Qualcomm, and IBM, is a member of the scientific advisory board and owns stock at Vali Cyber, and acknowledges support from Boeing, DARPA, and the NSF for salary and research support including meeting attendance and travel. MR has the following patents: United States Patent 10,539,419. Method and apparatus for reducing sensor power dissipation. Phillip Stanley-Marbell, Martin Rinard. United States Patent 10,135,471. System, method, and apparatus for reducing power dissipation of sensor data on bit-serial communication interfaces. Phillip Stanley-Marbell, Martin Rinard. United States Patent 9,189,254. Translating text to, merging, and optimizing graphical user interface tasks. Nathaniel Kushman, Regina Barzilay, Satchuthananthavale Branavan, Dina Katabi, Martin Rinard. United States Patent 8,839,221. Automatic acquisition and installation of software upgrades for collections of virtual machines. Constantine Sapuntzakis, Martin Rinard, Gautam Kachroo. United States Patent 8,788,884. Automatic correction of program logic. Jeff Perkins, Stylianos Sidiroglou, Martin Rinard, Eric Lahtinen, Paolo Piselli, Basil Krikeles, Timothy Anderson, Greg Sullivan. United States Patent 7,260,746. Specification based detection and repair of errors in data structures. Brian Demsky, Martin Rinard. SK, MP, and JW are full-time employees of TriNetX, LLC. SK and JW own TriNetX stock. The remaining authors declare no competing interests.
Funding Statement
This work was supported by the Coverys Community Healthcare Foundation and TriNetX. TriNetX contributed resources including secured laptop computers, access to the TriNetX EHR database, and clinical, technical, legal, and administrative assistance from the TriNetX team of clinical informaticists, engineers, and technical staff. SK, MP, and JW are employees of TriNetX and were involved in study design and data acquisition. The Coverys Community Healthcare Foundation was not involved in study design, collection, analysis, interpretation of data, writing of the report, or in the decision to submit the article for publication.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
All EHR data were obtained through the federated global health research network, TriNetX, and de-identified by TriNetX. We accessed the data under a no-cost collaboration agreement. Based on a determination by the Western IRB, studies using TriNetX data are not considered to be human subject research, and are therefore exempt from IRB review.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
* Co-senior authors
(jiakai{at}mit.edu)
(bogu{at}bwh.harvard.edu)
(zoomswk{at}mit.edu)
(steve.kundrot{at}trinetx.com)
(matvey.palchuk{at}trinetx.com)
(jeff.warnick{at}trinetx.com)
(ikaplan{at}bidmc.harvard.edu)
(rinard{at}csail.mit.edu)
Data Availability
The de-identified data in the TriNetX federated network database can only be accessed by researchers who are either part of the network or have a collaboration agreement with TriNetX. As stated in the manuscript, we accessed data as part of a no-cost collaboration agreement between BIDMC, MIT, and TriNetX.
Abbreviations
- EHR: electronic health record
- HCC: hepatocellular carcinoma
- GP: general population
- NAFLD: non-alcoholic fatty liver disease
- HCO: health care organization
- LR: logistic regression
- NN: neural network
- AI: artificial intelligence