medRxiv

Genetic Transformer: An Innovative Large Language Model Driven Approach for Rapid and Accurate Identification of Causative Variants in Rare Genetic Diseases

Lungang Liang, Yulan Chen, Taifu Wang, Dan Jiang, Jishuo Jin, Yanmeng Pang, Qin Na, Qiang Liu, Xiaosen Jiang, Wentao Dai, Meifang Tang, Yutao Du, Dirong Peng, Xin Jin, Lijian Zhao
doi: https://doi.org/10.1101/2024.07.18.24310666
Lungang Liang
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
Yulan Chen
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
Taifu Wang
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
Dan Jiang
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
Jishuo Jin
1BGI Genomics, Shenzhen 518083, China
3Clin Lab, BGI Genomics, Beijing 100000, China
Yanmeng Pang
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
Qin Na
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
Qiang Liu
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
Xiaosen Jiang
1BGI Genomics, Shenzhen 518083, China
4BGI Research, Shenzhen 518083, China
Wentao Dai
1BGI Genomics, Shenzhen 518083, China
3Clin Lab, BGI Genomics, Beijing 100000, China
Meifang Tang
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
Yutao Du
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
7Medical Technology College, Hebei Medical University, Shijiazhuang 050017, China
Dirong Peng
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
Xin Jin
4BGI Research, Shenzhen 518083, China
5School of Medicine, South China University of Technology, Guangzhou, China
6Shenzhen Key Laboratory of Transomics Biotechnologies, BGI Research, Shenzhen, China
  • For correspondence: jinxin{at}genomics.cn zhaolijian{at}genomics.cn
Lijian Zhao
1BGI Genomics, Shenzhen 518083, China
2Clin Lab, BGI Genomics, Shenzhen 518083, China
7Medical Technology College, Hebei Medical University, Shijiazhuang 050017, China
  • For correspondence: jinxin{at}genomics.cn zhaolijian{at}genomics.cn

Abstract

Background Identifying causative variants is crucial for the diagnosis of rare genetic diseases. Over the past two decades, genome sequencing technologies have significantly improved diagnostic outcomes in this field. However, the complexity of data analysis and interpretation continues to limit the efficiency and accuracy of these applications. Various genotype- and phenotype-driven filtering and prioritization strategies are used to generate a candidate list of variants, and the variants that appear in the final report are then determined through knowledge-intensive and labor-intensive expert review. Despite these efforts, current methods fall short of meeting the growing demand for accurate and efficient diagnosis of rare diseases. Recent developments in large language models (LLMs) suggest that they have the potential to augment or even supplant human labor in this context.

Methods In this study, we developed Genetic Transformer (GeneT), an innovative large language model (LLM) driven approach to accelerate the identification of candidate causative variants for rare genetic diseases. We comprehensively compared the fine-tuned LLMs with four phenotype-driven methods (Xrare, Exomiser, PhenIX and PHIVE) and with six pre-trained LLMs (Qwen1.5-0.5B, Qwen1.5-1.8B, Qwen1.5-4B, Mistral-7B, Meta-Llama-3-8B and Meta-Llama-3-70B). The evaluation focused on performance and hallucinations.
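As an illustration of the kind of metrics such a comparison relies on, the following minimal sketch scores a variant filter on toy data. Everything in it is an assumption made for illustration: the `Case` structure, the variant identifiers, and in particular the hallucination proxy (returned variants that were never in the input), since the abstract does not define how hallucinations were measured.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One toy sample; all identifiers are made up for illustration."""
    input_variants: set[str]  # variants passed to the filter
    kept_variants: set[str]   # variants the filter returned
    causative: str            # known causative variant for this sample


def evaluate(cases: list[Case]) -> dict[str, float]:
    # Recall: fraction of cases whose causative variant survives filtering.
    hits = sum(c.causative in c.kept_variants for c in cases)
    # Mean size of the candidate list left for expert review.
    mean_kept = sum(len(c.kept_variants) for c in cases) / len(cases)
    # Assumed hallucination proxy: returned variants absent from the input.
    invented = sum(len(c.kept_variants - c.input_variants) for c in cases)
    return {
        "recall": hits / len(cases),
        "mean_kept": mean_kept,
        "invented": float(invented),
    }


cases = [
    Case({"v1", "v2", "v3", "v4"}, {"v1", "v2"}, "v1"),  # causative kept
    Case({"v5", "v6", "v7"}, {"v6", "v9"}, "v5"),        # causative lost, "v9" invented
]
metrics = evaluate(cases)
print(metrics)
```

In this sketch, recall and the mean candidate-list size pull in opposite directions, which is why the two are usually reported together when comparing filtering methods.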

Results Genetic Transformer (GeneT) demonstrated strong performance in identifying candidate causative variants: it reduced the average number of candidate causative variants from 418 to 8 while achieving a recall of 99% on synthetic datasets. Application in a real-world clinical setting demonstrated the potential for a 20-fold increase in processing speed, reducing the time required to analyze each sample from approximately 60 minutes to around 3 minutes. At the same time, recall improved from 94.36% to 97.85%. An online analysis platform, iGeneT, was developed to integrate GeneT into the workflow of rare genetic disease analysis.
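The headline figures above can be checked with simple arithmetic. The short sketch below uses only the numbers quoted in the abstract; the variable names are illustrative, and the derived ratios are not reported by the authors in this form.

```python
# Figures quoted in the abstract.
candidates_before, candidates_after = 418, 8
minutes_before, minutes_after = 60, 3
recall_before, recall_after = 94.36, 97.85  # percent

# How many times shorter the candidate list becomes.
list_reduction = candidates_before / candidates_after
# Share of candidates removed before expert review.
fraction_removed = 1 - candidates_after / candidates_before
# Processing speed-up per sample (matches the stated 20-fold figure).
speedup = minutes_before / minutes_after
# Recall change in percentage points.
recall_gain = recall_after - recall_before

print(f"list shortened {list_reduction:.2f}x, {fraction_removed:.1%} removed")
print(f"speed-up {speedup:.0f}x, recall +{recall_gain:.2f} points")
```

Note that the 418-to-8 reduction refers to the synthetic datasets, while the timing and the 94.36% to 97.85% recall figures come from the real-world clinical setting, so the ratios should not be combined across the two.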

Conclusion Our study represents the first application of fine-tuned LLMs to identifying candidate causative variants. It introduces GeneT as an innovative LLM-driven approach and demonstrates its superiority in both simulated data and a real-world clinical setting. The approach represents a paradigm shift in handling the complexity of variant filtering and prioritization for whole exome or genome sequencing data, effectively addressing a challenge akin to finding a needle in a haystack.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Ethics committee/IRB of BGI gave ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data produced in the present study are available upon reasonable request to the authors.

https://ftp.ncbi.nlm.nih.gov/pub/clinvar

https://omim.org

https://ftp.ncbi.nih.gov/repository/OMIM/ARCHIVE/

https://www.hgmd.cf.ac.uk/ac/index.php

https://hgdownload.soe.ucsc.edu/gbdb/hg19/1000Genomes/

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted July 19, 2024.
Subject Area

  • Genetic and Genomic Medicine