A variant prioritization tool leveraging multiple instance learning for rare Mendelian disease genomic testing

Ho Heon Kim; Ju Yeop Baek; Heonjong Han; Won Chan Jeong; Dong-Wook Kim; Kisang Kwon; Yongjun Song; Hane Lee; Go Hun Seo; Jungsul Lee; Kyoungyeul Lee

doi:10.1101/2024.04.18.24305632

Abstract

Background Genomic testing, such as exome sequencing and genome sequencing, is widely used to diagnose rare Mendelian disorders. Because these tests identify a large number of variants, interpreting the final variant list and identifying the disease-causing variant can be labor-intensive and time-consuming, even after likely benign variants have been filtered out. The burden grows when other variant types, such as structural variants, must be considered together with small variants. One way to accelerate interpretation is to prioritize all variants accurately so that the most likely diagnostic variant(s) are clearly distinguished from the rest.

Methods To comprehensively predict genomic test results, we developed a deep learning-based variant prioritization system that leverages multiple instance learning and takes multiple variant types as input. We additionally adopted learning to rank (LTR) for optimal prioritization. We retrospectively developed and validated the model with 5-fold cross-validation in 23,115 patients with suspected rare diseases who underwent whole exome sequencing. Furthermore, we conducted an ablation test to confirm the effectiveness of LTR and assessed permutation-based feature importance for model interpretation. We also compared the prioritization performance with that of publicly available variant prioritization tools.
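As a rough illustration of how multiple instance learning and learning to rank fit together, the following sketch treats each patient as a bag of variants, scores every variant, pools the scores into a patient-level prediction, and ranks variants by score. The linear scorer, max pooling, ListNet-style top-1 loss, feature values, and all function names are assumptions chosen for brevity. They are not taken from the authors' model, whose architecture is not described in this abstract.

```python
import math


def instance_scores(bag, weights, bias=0.0):
    """Score each variant (instance) in a patient's bag with a linear model."""
    return [sum(w * x for w, x in zip(weights, feats)) + bias for feats in bag]


def bag_probability(scores):
    """Max-pooling MIL: the bag-level logit is the best instance score."""
    return 1.0 / (1.0 + math.exp(-max(scores)))


def listwise_loss(scores, causal_index):
    """ListNet-style top-1 loss: negative log-softmax of the causal variant."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[causal_index]


# Hypothetical bag: three variants with two made-up features each.
bag = [[0.1, 0.2], [2.0, 1.5], [0.3, 0.1]]
weights = [1.0, 1.0]  # hypothetical weights, not learned here

scores = instance_scores(bag, weights)
# Variant indices ordered from highest to lowest score.
ranking = sorted(range(len(bag)), key=lambda i: scores[i], reverse=True)
patient_probability = bag_probability(scores)
```

In this toy setup, the bag-level output plays the role of the predicted test result, while the per-variant ordering plays the role of the prioritized list. The listwise loss is smaller when the causal variant already has the highest score, which is the behavior a ranking objective rewards.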

Results The model achieved a mean AUROC of 0.92 in predicting genomic test results. Further, the model had a top-5 hit rate (hit rate at 5) of 96.8% when prioritizing single nucleotide variants (SNVs)/small insertions and deletions (INDELs) and copy number variants (CNVs) together, and a top-5 hit rate of 95.0% when prioritizing CNVs alone. In SNV/INDEL-only prioritization, our model outperformed publicly available variant prioritization tools. In addition, the ablation test showed that the model using LTR significantly outperformed the baseline model without LTR in variant prioritization (p=0.007).
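The hit rate at k reported above can be read as the fraction of patients for whom at least one causal variant appears among the top k ranked variants. A minimal sketch of that computation follows, assuming this common definition; the function name, variant identifiers, and toy data are hypothetical and are not drawn from the study.

```python
def hit_rate_at_k(ranked_lists, causal_sets, k=5):
    """Fraction of patients with at least one causal variant in the top k."""
    hits = 0
    for ranked, causal in zip(ranked_lists, causal_sets):
        if any(variant in causal for variant in ranked[:k]):
            hits += 1
    return hits / len(ranked_lists)


# Hypothetical ranked variant lists for three patients (best first).
ranked_lists = [
    ["v3", "v1", "v2"],
    ["v9", "v8", "v7"],
    ["v4", "v5", "v6"],
]
# Hypothetical causal variant(s) per patient.
causal_sets = [{"v1"}, {"v7"}, {"v4"}]

top1 = hit_rate_at_k(ranked_lists, causal_sets, k=1)
top2 = hit_rate_at_k(ranked_lists, causal_sets, k=2)
top3 = hit_rate_at_k(ranked_lists, causal_sets, k=3)
```

Here only the third patient has the causal variant ranked first, the first patient's causal variant enters at rank 2, and the second patient's at rank 3, so the hit rate rises with k.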

Conclusion A deep learning model leveraging multiple instance learning accurately predicted genomic test conclusions while prioritizing multiple types of variants. This model is expected to accelerate variant interpretation by helping identify disease-causing variants more quickly in rare genetic diseases.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2022-0-00333, Multi-faceted analysis of pediatric rare disease AI integrated SW solution development).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The study was approved by the institutional review board at the Korea National Institute for Bioethics Policy (P01-202308-02-001). Informed consent was deemed unnecessary because the study involved only anonymized, de-identified retrospective data.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The model and experimental data are available at https://github.com/4pygmalion/ASC3.

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.