medRxiv

AMELIE 3: Fully Automated Mendelian Patient Reanalysis at Under 1 Alert per Patient per Year

Johannes Birgmeier, Ethan Steinberg, Ethan E. Bodle, Cole A. Deisseroth, Karthik A. Jagadeesh, Jennefer N. Kohler, Devon Bonner, Shruti Marwaha, Julian A. Martinez-Agosto, Stan Nelson, Christina G. Palmer, Joy D. Cogan, Rizwan Hamid, Joan M. Stoler, Joel B. Krier, Jill A. Rosenfeld, Paolo Moretti, David R. Adams, Vandana Shashi, Elizabeth A. Worthey, Christine M. Eng, Euan A. Ashley, Matthew T. Wheeler, Undiagnosed Diseases Network, Peter D. Stenson, David N. Cooper, Jonathan A. Bernstein, Gill Bejerano
doi: https://doi.org/10.1101/2020.12.29.20248974
Johannes Birgmeier
1Department of Computer Science, Stanford University, Stanford, CA, USA
Ethan Steinberg
1Department of Computer Science, Stanford University, Stanford, CA, USA
Ethan E. Bodle
2Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA
Cole A. Deisseroth
1Department of Computer Science, Stanford University, Stanford, CA, USA
Karthik A. Jagadeesh
1Department of Computer Science, Stanford University, Stanford, CA, USA
Jennefer N. Kohler
3Center for Undiagnosed Diseases, Stanford University, Stanford, CA, USA
Devon Bonner
3Center for Undiagnosed Diseases, Stanford University, Stanford, CA, USA
Shruti Marwaha
3Center for Undiagnosed Diseases, Stanford University, Stanford, CA, USA
4Department of Medicine, School of Medicine, Stanford University, Stanford, CA, USA
Julian A. Martinez-Agosto
5Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
6Department of Pediatrics, Division of Medical Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
8Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
Stan Nelson
5Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
7Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
Christina G. Palmer
5Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
8Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
Joy D. Cogan
9Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, USA
Rizwan Hamid
9Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, USA
Joan M. Stoler
10Division of Genetics and Genomics, Department of Medicine, Boston Children’s Hospital, Harvard Medical School, Boston, MA, USA
Joel B. Krier
11Dept of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
Jill A. Rosenfeld
12Dept of Molecular and Human Genetics Baylor College of Medicine, Houston, TX, USA
Paolo Moretti
13Department of Neurology University of Utah and George E. Wahlen VA Medical Center Salt Lake City, UT, USA
David R. Adams
14NIH Undiagnosed Diseases Program, Common Fund, Office of the Director, NIH, Bethesda, MD, USA
15Office of the Clinical Director, National Human Genome Research Institute, NIH, Bethesda, MD, USA
Vandana Shashi
16Division of Medical Genetics, Department of Pediatrics, Duke Health, Durham, NC, USA
Elizabeth A. Worthey
17Software Development and Informatics, HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
18Departments of Pediatric and Pathology, University of Birmingham School of Medicine, Birmingham, AL, USA
Christine M. Eng
12Dept of Molecular and Human Genetics Baylor College of Medicine, Houston, TX, USA
Euan A. Ashley
3Center for Undiagnosed Diseases, Stanford University, Stanford, CA, USA
4Department of Medicine, School of Medicine, Stanford University, Stanford, CA, USA
19Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA
Matthew T. Wheeler
3Center for Undiagnosed Diseases, Stanford University, Stanford, CA, USA
4Department of Medicine, School of Medicine, Stanford University, Stanford, CA, USA
Peter D. Stenson
20Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK
David N. Cooper
20Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK
Jonathan A. Bernstein
2Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA
Gill Bejerano
1Department of Computer Science, Stanford University, Stanford, CA, USA
2Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA
21Department of Developmental Biology, Stanford University, Stanford, CA, USA
22Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
  • For correspondence: bejerano{at}stanford.edu

Abstract

Background Many thousands of patients with a suspected Mendelian disease have their exomes/genomes sequenced every year, but only about 30% receive a definitive diagnosis. Since a novel Mendelian gene-disease association is published on average every business day, thousands of undiagnosed patient cases could receive a diagnosis each year if their genomes were regularly compared to the latest literature. With millions of genomes expected to be sequenced for rare disease analysis by 2025, and considering the current publication rate of 1.1 million new articles per annum in PubMed, manually reanalyzing the growing number of undiagnosed patient cases is not sustainable.

Methods We describe a fully automated reanalysis framework for patients with suspected, but undiagnosed, Mendelian disorders. The presented framework was tested by automatically parsing all ∼100,000 newly published peer-reviewed papers every month and matching them on genotype and phenotype with all stored undiagnosed patients. If a new article contains a possible diagnosis for an undiagnosed patient, the system sends a notification. We tested the accuracy of the automatic reanalysis system on 110 patients, including 61 with available trio data.

Results Even when trained only on older data, our system identifies 80% of reanalysis diagnoses, while sending only 0.5-1 alerts per patient per year, a 100-1,000-fold efficiency gain over manual literature surveillance of equivalent yield.

Conclusion We show that automatic reanalysis of patients with suspected Mendelian disease is feasible and has the potential to greatly streamline diagnosis. Our system is not intended to replace clinical judgment. Rather, clinical diagnostic services could greatly benefit from a modest re-allocation of time from manual literature exploration to review of automated reanalysis alerts. Our system additionally supports a new paradigm for medical IT systems: proactive, continuously learning and consequently able to autonomously identify valuable insights as they emerge in digital health records. We have launched automated patient reanalysis, trained on the latest data, with user accounts and daily literature updates at https://AMELIE.stanford.edu.

Introduction

Severe genetic diseases affect tens of thousands of infants born every year worldwide. Many Mendelian conditions, such as intellectual disability, are diagnosed later in life, bringing the total estimated prevalence to 0.5-1% of the 7.8 billion world population [1,2]. Millions of such patients are projected to be sequenced over the next few years [3]. Currently, for an estimated 30% of patients [4] with a presumed Mendelian disease, a definitive diagnosis is arrived at immediately after exome sequencing [5]. Conversely, 70% of patients do not receive a diagnosis (for a variety of reasons [6]). However, approximately 250 novel gene-disease associations are identified every year [6–8]. Reanalysis of exomes of patients with previously undiagnosable genetic conditions results in a significant fraction (4%-30%) of these cases becoming diagnosable in a period of 1 to 5 years after the initial negative analysis [6,9–17]. PubMed grows by over 1 million publications each year. Thus, the lack of capacity [18] to regularly reassess non-diagnostic clinical exome or genome sequencing in the light of newly published literature necessarily results in delayed diagnoses.

We have previously developed AMELIE [19–21] (Automatic MEndelian LIterature Evaluation), a natural language processing and machine learning framework that automatically analyzes literature about Mendelian diseases and matches it to patients with undiagnosed Mendelian diseases to prioritize candidate causative genes in the patients’ genomes. Here, we adapted AMELIE to perform continuous reanalysis of undiagnosed patients with suspected Mendelian disease. The AMELIE-based reanalysis framework automatically compares all new literature to all undiagnosed patients and notifies clinicians (or diagnosticians; we use these interchangeably here) about newly published, likely diagnostic articles. To estimate the diagnostic rate and clinician burden of the reanalysis system, we performed a “time machine” experiment: first, we trained the reanalysis system only on Mendelian disease data available until December 2011. Subsequently, we assembled a cohort of 110 Mendelian singleton patients, of whom 61 also had trio sequencing data available, who gradually became diagnosable after January 2012. Using this system, we performed an automatic reanalysis experiment in monthly intervals from 2012 to 2018, demonstrating a high diagnostic yield at very low clinician burden.

Methods

AMELIE-based automatic reanalysis

The automatic reanalysis framework presented here takes as input exome or genome sequencing data and a list of phenotypic abnormalities per patient, created either manually or automatically by ClinPhen [22]. User-parameterized filtering of exome or genome sequencing data yields a list of patient variants that are rare (e.g., ≤0.5% minor allele frequency [23]) in the general population and hence potentially disease-causing. These are termed “candidate causative” variants. After sequencing, the patient’s candidate causative variants are analyzed for the presence of causative mutations using all knowledge available at the time. If the patient cannot be diagnosed shortly after sequencing, the patient’s relevant data (minimally consisting of a list of candidate causative variants and a list of phenotypic abnormalities observed in the patient) are added to a database of undiagnosed patients. Each patient is then reanalyzed automatically at monthly intervals until a diagnosis is successfully identified (Figure 1).

Figure 1. Automatic reanalysis of patients with undiagnosed Mendelian diseases.

After sequencing, clinicians examine the automated AMELIE analysis in search of a diagnosis. If a diagnosis is not available (currently in ∼70% of all cases), the patient’s information is entered into a reanalysis database. Every month, AMELIE matches all newly published literature against every patient’s candidate causative variants and phenotypes to seek new diagnoses. If a newly published article is flagged as being possibly diagnostic, it is reviewed by clinicians, resulting in either diagnosis or continuation of AMELIE-based automatic reanalysis. See example reanalysis notifications in Table 3.

AMELIE

AMELIE [21] performs two tasks: (1) it automatically discovers and parses literature about Mendelian diseases to construct an “AMELIE knowledgebase”, and (2) it estimates the likelihood that a given article contains a diagnosis for a patient through an “AMELIE classifier”. Here we build a computational framework around AMELIE that performs automatic reanalysis of undiagnosed patients with suspected Mendelian disease (Figure 1). For a detailed description of AMELIE, see Supplementary Methods and ref. 21.

AMELIE knowledgebase

The AMELIE knowledgebase is automatically constructed from articles about Mendelian diseases. Briefly, AMELIE knowledgebase construction is performed using a series of machine-learning classifiers [21] operating on text data. First, all available PubMed abstracts (currently more than 30 million) are classified in terms of their likelihood to discuss monogenic diseases. The full-text articles of potentially relevant abstracts are retrieved directly from the publishers. From each article’s full text, disease-causing genes and resulting clinical phenotypes are extracted. Mentioned genetic variants are retrieved using AVADA [24]. In addition, a set of full-text classifiers assigns scores to each article indicating whether it is most likely to be about a dominant or a recessive disease, and about protein-truncating (frameshift indel, stopgain, splicing) pathogenic variants or non-truncating (missense, nonframeshift indel) pathogenic variants. Information about mentioned phenotypic abnormalities, disease-causing genes, and disease inheritance modes is extracted from these full-text articles into the knowledgebase.

AMELIE classifier

The AMELIE classifier estimates the likelihood that a given article contains a diagnosis for a particular patient. Given an article A, a patient’s list of phenotypic abnormalities P, and a gene G containing candidate causative variants in the patient’s genome, the AMELIE classifier [21] returns a diagnostic probability score between 0 and 100 (low to high) indicating how well the article A explains the patient’s phenotypes P in light of the patient-specific variants in gene G.

Automatic reanalysis using AMELIE

The automatic reanalysis framework takes a single parameter as input, termed “notification threshold”, a number (score) between 0 and 100. When a new article A about a disease-causing gene G is published and added to the AMELIE knowledgebase, the AMELIE classifier compares all known undiagnosed patients with a candidate causative variant in G to the article A and automatically sends a notification about the article if our “notification criterion” applies. We define the “notification criterion” as (1) article A’s diagnostic probability score is greater than or equal to the (global) notification threshold, and (2) article A’s diagnostic probability score is greater than or equal to the diagnostic probability score of previously published articles about the candidate gene G for the undiagnosed patient.
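The two-part notification criterion above can be sketched in a few lines. This is a minimal illustration, not the AMELIE implementation: the class `PatientGeneState` and the function `should_notify` are hypothetical names, and the diagnostic probability score is assumed to be supplied by the classifier.

```python
from dataclasses import dataclass


@dataclass
class PatientGeneState:
    """Hypothetical record: best score seen so far for one (patient, gene) pair."""
    best_previous_score: float = float("-inf")  # no earlier article yet


def should_notify(score: float, threshold: float, state: PatientGeneState) -> bool:
    """Apply the notification criterion to a newly published article.

    (1) the article's score is >= the global notification threshold, and
    (2) the score is >= the score of every earlier article about the same
        candidate gene for this patient.
    """
    meets_threshold = score >= threshold
    meets_previous = score >= state.best_previous_score
    # Remember the best score so later articles are compared against it.
    state.best_previous_score = max(state.best_previous_score, score)
    return meets_threshold and meets_previous
```

With a threshold of 50, a first article scoring 70 triggers a notification, while a later article scoring 60 for the same patient and gene does not, because it scores below the earlier article.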

Patients who are successfully diagnosed after such notifications are removed from the database of undiagnosed patients. If a notification sent by the automatic reanalysis framework contains an article that, after clinician review, enables patient diagnosis, the notification is counted as “diagnostic”, or a “true positive”; if not, it is considered a “false positive” (Figure 1).

Patients

To retrospectively test AMELIE-based automatic reanalysis, we assembled a cohort of 110 diagnosed patients with diseases where the causative gene was first published between January 2012 and May 2018 (Table 1, Supplementary Table S1). Patient data were obtained from the Deciphering Developmental Disorders (DDD) project [25], the clinical genetics service at Stanford Children’s Health (SCH), and the Undiagnosed Diseases Network (UDN) [26]. From these sources, we included all available patients who had a single causative gene disease diagnosis for which the first supporting literature appeared after January 2012, had available exome or genome sequencing data containing the causative variant(s), and had a list of phenotypes noted by a clinician or extracted by ClinPhen [22] (Supplementary Methods). De-identified data from the DDD project were accessed via the European Genome-Phenome Archive [27] (study EGAS00001000775). As applicable to the participating patients, the study protocol was reviewed and approved by the Stanford University Institutional Review Board (IRB) and the central IRB at the NIH National Human Genome Research Institute for the Undiagnosed Diseases Network. Written informed consent was obtained from all participants. For each of the 110 patients, a clinician reviewed the literature about the patient’s disease and manually identified a subset of articles, each with sufficient information to diagnose the case. The year and month in which the first article linking the patient’s disease to the patient’s causative gene was published were tagged as the patient’s earliest possible date of literature-based diagnosis.

Table 1. Clinical characteristics of patient cohort

We defined candidate causative variants in singleton patient genomes as rare (≤0.5% minor allele frequency in a large healthy control cohort [23]), non-silent exonic or core splice-site variants in protein-coding genes. For 61 of the 110 test patients, exome or genome sequencing data of 2 of the patient’s unaffected relatives (usually parents) were available and the patient’s causative variants were not identically observed in an unaffected relative. For trio patients, candidate variants were further filtered by segregation with the disease in the family (Table 1, Supplementary Table S1).
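The candidate variant definition above can be sketched as a simple filter. This is an illustration under stated assumptions: the `Variant` record, its field names, and the effect labels are hypothetical, and the trio step is reduced to dropping variants identically observed in an unaffected relative, which is a simplified stand-in for full segregation analysis.

```python
from dataclasses import dataclass

MAX_CONTROL_MAF = 0.005  # 0.5% minor allele frequency cutoff from the text

# Hypothetical labels for non-silent exonic or core splice-site effects.
CANDIDATE_EFFECTS = {
    "missense", "nonframeshift_indel", "frameshift_indel", "stopgain", "splicing",
}


@dataclass(frozen=True)
class Variant:
    gene: str
    effect: str
    control_maf: float
    protein_coding: bool = True
    seen_in_unaffected_relative: bool = False  # only meaningful for trios


def is_candidate(variant: Variant, trio: bool = False) -> bool:
    """Return True if the variant passes the (simplified) candidate filter."""
    if not variant.protein_coding or variant.control_maf > MAX_CONTROL_MAF:
        return False
    if variant.effect not in CANDIDATE_EFFECTS:
        return False
    # Simplified trio filter (assumption): drop variants shared with an
    # unaffected relative.
    if trio and variant.seen_in_unaffected_relative:
        return False
    return True


def candidate_genes(variants, trio: bool = False):
    """Sorted list of genes carrying at least one candidate causative variant."""
    return sorted({v.gene for v in variants if is_candidate(v, trio)})
```

The extra trio filter shrinks the candidate gene list, which is consistent with the smaller surveillance lists and lower notification burden reported for trios in the Results.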

Experimental design

For our time machine experiment, we built a version of the AMELIE knowledgebase and trained all machine learning components using only article data from 2011 or before. We then ran this AMELIE classifier, in monthly steps, on all PubMed data from January 2012 through May 2018, noting every notification generated at different notification thresholds (Figure 1).
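The monthly replay can be sketched as a loop over calendar months. This is a simplified illustration: `monthly_steps`, `replay`, and the `score_fn` callback are hypothetical names, the classifier is assumed to be a function returning a score between 0 and 100, and only the threshold part of the notification criterion is applied here.

```python
def monthly_steps(start=(2012, 1), end=(2018, 5)):
    """List every (year, month) from start to end, inclusive."""
    year, month = start
    steps = []
    while (year, month) <= end:
        steps.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return steps


def replay(articles_by_month, score_fn, thresholds):
    """Record, per threshold, which articles would have triggered a notification.

    articles_by_month maps (year, month) to the articles published that month.
    score_fn is a stand-in for the classifier (assumption).
    """
    notifications = {t: [] for t in thresholds}
    for year_month in monthly_steps():
        for article in articles_by_month.get(year_month, []):
            score = score_fn(article)
            for t in thresholds:
                if score >= t:
                    notifications[t].append((year_month, article))
    return notifications
```

Recording notifications for several thresholds in one pass is what allows the different operating points in the Results to be compared from a single replay.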

Performance Measures

We define the number of diagnosed patients as the number of test cohort patients who received a diagnostic notification within the experiment timeframe. The wait time for diagnosis after publication of the first diagnostic article is the number of months between the publication of the first diagnostic article and the sending of a diagnostic notification by AMELIE.
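These two measures are straightforward to compute once each patient has a first-diagnostic-article date and, if applicable, a notification date. The sketch below assumes dates are stored as (year, month) tuples; the function and key names are hypothetical.

```python
def months_between(earlier, later):
    """Whole months from `earlier` to `later`, both given as (year, month)."""
    return (later[0] - earlier[0]) * 12 + (later[1] - earlier[1])


def summarize(patients):
    """Summarize diagnosed patients and wait times.

    Each patient is a dict with 'first_diagnostic' (year, month) and
    'notified' (year, month), or None if no diagnostic notification was sent.
    """
    waits = [
        months_between(p["first_diagnostic"], p["notified"])
        for p in patients
        if p["notified"] is not None
    ]
    return {
        "diagnosed": len(waits),
        "fraction_diagnosed": len(waits) / len(patients),
        "wait_months": waits,
    }
```

A wait time of 0 corresponds to a diagnostic notification sent in the same month the first diagnostic article was published.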

In a typical undiagnosed patient set, only a small fraction of patients become diagnosable every year [6,9–17]. Since our test patient cohort consists only of patients who become diagnosable within the experiment timeframe, reporting the number of false positives per diagnostic notification purely from the test cohort data would underestimate the number of false positive notifications per diagnostic notification in a cohort including patients not diagnosable before May 2018. We conducted a meta-analysis of manual reanalysis studies of undiagnosed patients with suspected Mendelian disease [6,9–15]. For each study, we collected the total number of patients, the number of patients receiving a reanalysis diagnosis due to updated literature (rather than other factors like improved variant calling pipelines), and the reanalysis timeframe. Based on these data, we used a meta-analysis statistic implemented by the R function “metarate” to estimate the expected fraction of undiagnosed patients that become newly diagnosable per year through growth of knowledge about Mendelian diseases. This rate was estimated as 6.74% (Supplementary Methods and Supplementary Table S2).

To calculate the number of false positive notifications per diagnostic notification and total clinician burden, we assume the existence of a typical undiagnosed patients’ database containing n patients. We estimate the average number of false positive notifications per patient per month f as the number of false positive notifications (FPs) per patient per month during the reanalysis experiment, calculated as the mean over test patients, $f = \frac{1}{|\text{patients}|} \sum_{\text{patient}} \text{mean\_FPs\_per\_month}(\text{patient})$. Further, we estimate the fraction p of diagnosable patients who receive a diagnostic notification by automatic reanalysis as the fraction of diagnosable test patients who receive a diagnostic notification in the reanalysis experiment timeframe. Based on these estimates, the expected annual number of diagnostic notifications equals 6.74% · n · p and the expected annual number of false positive notifications equals 12 · f · n. Thus, given a scenario in which 6.74% of patients in an undiagnosed patients database become diagnosable within a year, the expected number of false positive notifications per diagnostic notification equals $\frac{12 \cdot f \cdot n}{6.74\% \cdot n \cdot p} = \frac{12 \cdot f}{6.74\% \cdot p}$ and the total evaluation burden on clinicians, per patient per year, is 6.74% · p + 12 · f.
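The burden arithmetic above can be written as a small function. The function name and the example inputs are illustrative assumptions, not values from the study; only the 6.74% rate and the formulas come from the text. The database size n cancels, so it does not appear.

```python
DIAGNOSABLE_RATE = 0.0674  # fraction becoming diagnosable per year (from the text)


def expected_burden(f, p, diagnosable_rate=DIAGNOSABLE_RATE):
    """Per-patient, per-year expectations.

    f: mean false positive notifications per patient per month
    p: fraction of diagnosable patients who receive a diagnostic notification
    """
    diagnostic_per_year = diagnosable_rate * p   # 6.74% * p
    false_positive_per_year = 12 * f             # 12 * f
    return {
        "fp_per_diagnostic": false_positive_per_year / diagnostic_per_year,
        "burden_per_patient_year": diagnostic_per_year + false_positive_per_year,
    }


# Illustrative inputs only (not taken from the paper).
result = expected_burden(f=0.01, p=0.8)
```

Because both terms scale linearly with n, the per-patient burden depends only on f and p, which is why the Results can report clinician burden per patient per year independent of database size.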

Comparison of AMELIE-based reanalysis to a simple abstract-based approach

To estimate the efficiency gain of AMELIE-based reanalysis over a manual abstract-based reanalysis approach, we defined the 20 most cited Mendelian disease journals as the most-cited journals in the Human Gene Mutation Database (HGMD), which aims to comprehensively curate Mendelian disease-causing mutations from the primary literature [28] (Supplementary Table S3 and Supplementary Methods). For each patient, we assembled a surveillance list of all articles mentioning at least one patient candidate causative gene in the 20 most cited Mendelian disease journals that were published between the start of the reanalysis experiment and the publication of the first diagnostic article for the patient. The first diagnostic article was contained in this surveillance list for 82-83% of patients (91 of 110 singleton patients and 51 of 61 trio patients). Consequently, we estimated the efficiency gain of automatic reanalysis for a patient, compared to tracking the 20 most cited Mendelian disease journals, as the number of articles about any of the patient’s candidate causative genes in the 20 most cited journals about Mendelian disease until publication of the first diagnostic article, divided by the number of AMELIE-based automatic reanalysis notifications for the patient.
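The surveillance list and the resulting efficiency gain can be sketched as follows. The article record layout (a dict with `journal`, `genes`, and `published` keys), the function names, and the journal and gene names in any example are hypothetical; publication dates are assumed to be (year, month) tuples.

```python
def surveillance_list(articles, candidate_genes, top_journals, start, first_diagnostic):
    """Articles a clinician would have to read when tracking the top journals.

    Keeps articles that appear in one of the tracked journals, mention at
    least one of the patient's candidate causative genes, and were published
    between `start` and `first_diagnostic` (inclusive).
    """
    genes = set(candidate_genes)
    return [
        a for a in articles
        if a["journal"] in top_journals
        and genes & set(a["genes"])
        and start <= a["published"] <= first_diagnostic
    ]


def efficiency_gain(n_surveillance_articles, n_notifications):
    """Ratio of manually tracked articles to automatic notifications.

    Returns None when the patient received no notifications, since the
    ratio is undefined in that case.
    """
    if n_notifications == 0:
        return None
    return n_surveillance_articles / n_notifications
```

For example, a patient with 200 surveillance-list articles and 2 automatic notifications would have an efficiency gain of 100.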

Notification threshold calibration

The automatic (global) reanalysis notification threshold can be adjusted to achieve high sensitivity (aiming for a large fraction of diagnosed patients), or high precision (aiming for a low number of false positives per diagnostic notification). We report the measures defined above for 3 differently calibrated notification thresholds: (a) a “high-sensitivity” notification threshold, in which the clinician receives diagnostic notifications for at least 80% of diagnosable patients, comparable in recall to tracking the top 20 journals above, at the lowest possible clinician burden, (b) a “high-precision” approach, in which at most 3 false positives per diagnostic notification are sent on average at the highest possible true positive rate, and (c) a “minimal interruptions” (even higher precision) approach, in which the majority of notifications sent are diagnostic, at the highest possible true positive rate.
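The three calibrations can be read as selection rules over per-threshold outcomes. The sketch below assumes the replay has already produced, for each candidate threshold, the fraction of diagnosable patients notified, the expected false positives per diagnostic notification, and the expected burden; the `Outcome` record and `calibrate` function are hypothetical names, and "majority of notifications are diagnostic" is interpreted as fewer than 1 false positive per diagnostic notification.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    threshold: float
    sensitivity: float        # fraction of diagnosable patients notified
    fp_per_diagnostic: float  # expected false positives per diagnostic notification
    burden: float             # expected notifications per patient per year


def calibrate(outcomes):
    """Pick one operating point per calibration, or None if none qualifies."""
    high_sens = [o for o in outcomes if o.sensitivity >= 0.80]
    high_prec = [o for o in outcomes if o.fp_per_diagnostic <= 3]
    minimal = [o for o in outcomes if o.fp_per_diagnostic < 1]
    return {
        # (a) at least 80% of diagnosable patients, at the lowest burden
        "high_sensitivity": min(high_sens, key=lambda o: o.burden, default=None),
        # (b) at most 3 false positives per diagnostic, at the highest yield
        "high_precision": max(high_prec, key=lambda o: o.sensitivity, default=None),
        # (c) majority of notifications diagnostic, at the highest yield
        "minimal_interruptions": max(minimal, key=lambda o: o.sensitivity, default=None),
    }
```

Returning None when no threshold satisfies a rule makes it explicit that a given calibration may not be attainable for a particular cohort.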

Results

Table 2 summarizes the outcomes of the reanalysis experiment. The fraction of diagnosed patients and total number of notifications per patient per year are shown in Figure 2. The automatic reanalysis timeline of three examples of singleton patients is presented in Table 3. Automatic reanalysis on singleton data could be calibrated for high sensitivity or high precision; achieving high sensitivity and precision simultaneously was possible with trio data. Both modes of operation resulted in between 86 and 893 times fewer abstracts to consider compared to manual reanalysis by tracking abstracts in the 20 most cited Mendelian disease journals.

Table 2. Reanalysis experiment outcomes
Table 3. Automatic reanalysis notifications of three singleton patients, starting January 2012
Figure 2. Fraction of diagnosed patients and average clinician burden per patient per year across notification thresholds.

Both panels have the same x-axis so that matching values can be read simultaneously from both. (Upper panel) The fraction of diagnosable test cohort patients who received a diagnostic notification (i.e., true positive) during the 6.5 year reanalysis experiment timeframe across notification thresholds. (Lower panel) The expected total number of notifications (or clinician burden) per patient per year across notification thresholds, including both diagnostic notifications and false positive notifications. For example, the system detects 80% of diagnosable singletons (trios) at the low burden of 1 (0.5) notification per patient per year.

Singletons

We ran singleton analysis on all 110 patients. By manually tracking articles (only) about patient candidate causative genes in the 20 most cited Mendelian disease journals, clinicians would need to evaluate an average of 892 articles per diagnosable patient from the start of the reanalysis experiment until the publication of the first diagnostic article.

In contrast, our automatic reanalysis system is powerful enough to attain “high sensitivity”, where 80% of all diagnosable patients trigger a diagnostic notification, 58% of them immediately upon publication of the first diagnostic article, at an average of only 1.05 notifications per patient per year (Figure 2 and Table 2). In “high precision” mode, false positive notifications are reduced by 80%, while 44% of diagnosable singleton patients receive a diagnostic alert, at an average rate of only 0.17 notifications per patient per year. And in “minimal interruptions” mode, only 22% of diagnosable singleton patients receive a diagnostic notification, but the majority of notifications sent by the system are diagnostic, at a minimal 0.05 notifications per patient per year.

Thus, automatic reanalysis with the above notification thresholds for high sensitivity, high precision, or minimal interruptions, requires following up on 361-893 times fewer articles compared to manual reanalysis surveillance overall, amounting to only a couple of article alerts per patient.

Trios

In the case of manual reanalysis for our 61 trio patients, clinicians would examine an average of 131 articles about candidate causative genes per patient by tracking abstracts in the 20 most cited Mendelian disease journals from start of the reanalysis experiment to the publication of the first diagnostic article.

In contrast, automatic trio reanalysis in “high sensitivity” mode resulted in an 82% diagnosis rate, at 0.53 notifications per patient per year, or half the clinician burden of comparable singleton reanalysis. “High precision” mode was very similar, resulting in over 75% of diagnoses. And in “minimal interruptions” mode, the diagnosis rate was still 46% of diagnosable patients with the majority of notifications leading to diagnosis, at an impressive 0.12 notifications per patient per year.

Thus, automatic reanalysis as presented here requires following up on 86-145 times fewer articles per patient compared to manual reanalysis by tracking abstracts in the 20 most cited Mendelian disease journals.

Web portal

We have launched a web portal containing a working implementation of AMELIE analysis [21] followed by automatic reanalysis at https://amelie.stanford.edu. The updated website is trained on current PubMed (as opposed to 2011 in the above experiment), and it performs daily literature updates by automatically parsing and classifying newly indexed PubMed entries, downloading full text of relevant articles, and inserting extracted knowledge from full-text articles into the AMELIE knowledgebase. For demonstration purposes, users can sign up for individual accounts and enable automatic reanalysis notifications (delivered by email) for selected patients at user-defined notification thresholds. Customizable singleton and trio variant filtering based on gnomAD variant frequency data [29] is supported.

Discussion

We present here a retrospective analysis of an automatic reanalysis framework on both singleton patients and trios diagnosed with Mendelian disorders over a span of more than six years. We showed that automatic reanalysis can already reveal diagnoses, at a very acceptable notification burden, for patients with suspected Mendelian disease who could not previously be diagnosed, while requiring dramatically less work from clinicians than manual reanalysis. By simply tracking abstracts pertaining to patient candidate causative genes in the 20 most cited Mendelian disease journals, clinicians have to review hundreds of articles per diagnosable patient from the start of our reanalysis experiment to diagnosis.

In 2016 we were among the first to publish on the value of reanalysis [6]. From 40 cases we were able to diagnose 4. This 10% yield (on cases accumulated over multiple years) has since held up for a great number of similar studies by other groups over their undiagnosed patients. Here our sample size is bigger, and we expect it to be similarly representative of continuous patient reanalysis at under 1 notification per patient per year. Moreover, AMELIE’s “time machine” performance here was obtained while training only on 2011 data, not long after next generation sequencing became available in the clinic. It should be seen as a lower bound on AMELIE’s actual performance, as the AMELIE web portal is trained on nearly a decade of additional accumulated knowledge. Performance would improve further if the true annual rate of reanalysis diagnoses exceeds our conservative estimate of 6.7%.

A mass of sequenced but undiagnosed patients is already accruing17. CLIA-certified exome data production now costs only a few hundred dollars. A wave of data – millions of sequenced patients3, and tens of thousands of articles on Mendelian disease genes21 – is coming the way of fewer than a thousand clinical laboratory geneticists in the U.S.30 and their peers worldwide. Germline exome and genome sequencing data, in contrast to results from many other diagnostic tests, do not expire. As our knowledge about disease-causing genetic variation constantly grows, manual reinterpretation of patient sequencing data can at best be done periodically. In Mendelian diagnosis alone, a substantial 70% of cases will not be diagnosed at initial analysis5, and yet, as estimated here, a meaningful ∼6.7% will become diagnosable with each subsequent year that passes on new knowledge alone. This accumulating load will greatly weigh on any interpretation service. Automation, as we show, can realize the promise of continuous reanalysis and timely diagnosis for all, and will be essential to handle the incoming flood of healthcare data and insights.
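The accumulating load can be illustrated with a short calculation. This is a sketch under a stated assumption, not a result from the study: it assumes the ∼6.7% yearly rate applies to the patients still undiagnosed at the start of each year, so that the diagnosable fraction compounds. The function name and parameters are hypothetical.

```python
def cumulative_diagnosable(initial_undiagnosed=0.70, yearly_rate=0.067, years=5):
    """Fraction of all sequenced patients who have become diagnosable by each year.

    Assumption (not stated in the paper): yearly_rate applies to the pool of
    patients who are still undiagnosed at the start of each year.
    """
    remaining = initial_undiagnosed
    totals = []
    for _ in range(years):
        newly_diagnosable = remaining * yearly_rate
        remaining -= newly_diagnosable
        totals.append(initial_undiagnosed - remaining)
    return totals


if __name__ == "__main__":
    for year, fraction in enumerate(cumulative_diagnosable(), start=1):
        print(f"year {year}: {fraction:.2%} of all sequenced patients")
```

Under this assumption, 0.70 × 0.067 ≈ 4.7% of all sequenced patients would become diagnosable in the first year after a non-diagnostic analysis, and each later year adds slightly less as the undiagnosed pool shrinks.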

Our AMELIE-based reanalysis framework has limitations, catching only 80% of diagnosable cases even in high sensitivity mode. But the diagnoses it does find, it offers with an efficiency gain of ∼100- to 1000-fold over the current, unsustainable standard of manual curation. Importantly, our system does not replace clinicians, but rather augments their capabilities. If a medical institute or lab devotes a certain number of work hours to analysis and reanalysis, a small fraction of this time should be devoted to resolving our system’s notifications. The remainder can be spent on more open-ended explorations, and all lessons learned (both inside and outside the system) can be incorporated to make such resident clinical support systems better over time.

Traditionally, patient cases are most often reassessed at the time of a new clinical encounter. The rapid accumulation of medical knowledge pressures this paradigm as the significance of one’s health record can change dramatically between visits. On any given day, a patient may become diagnosable and a portion of such diagnoses are expected to be immediately actionable. At the same time, logistical and cost constraints currently prevent the regular reanalysis of many patient cases following non-diagnostic sequencing. Together with automated phenotype extraction tools from the electronic medical record, like ClinPhen22, AMELIE demonstrates the potential of a scalable means of regular reanalysis for undiagnosed patients, which can also encompass emerging incidentals. This has implications for the care of patients with undiagnosed genetic disease and more broadly. The promise of efficient, continuous, automated identification of latent, actionable diagnoses in patient data has the potential to significantly improve health outcomes across care settings.

Data Availability

A portion of the data we use is available from EGA. Another portion comes from consented Stanford or UDN patients. Some of the latter can be shared while respecting consent conditions.

Funding

All computational work was funded only by a Bio-X SIGF fellowship (JB), the Stanford Department of Pediatrics (JAB, GB), a Packard Foundation Fellowship (GB), and a Microsoft Faculty Fellowship (GB). UDN curated data used in this manuscript was supported by the NIH Common Fund, through the Office of Strategic Coordination/Office of the NIH Director under Award Numbers U01HG007709, U01HG007672, U01HG007690, U01HG007708, U01HG007674, U01HG007942 and U01HG007943. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. A list of UDN collaborators is available in Supplementary Table S4.

Author contributions

JB and GB designed the study and analyzed the results. JB and ES implemented the text mining software, website and associated databases. EEB verified diagnostic articles for the purposes of the reanalysis experiment. CAD and KAJ processed patient data. JNK, DB, SM, JAMA, SN, CGP, JDC, RH, JMS, JBK, JAR, PM, DRA, VS, EAW, CME, EAA, MTW, and UDN provided curated patient data. PDS and DNC curated HGMD. JAB provided guidance on clinical aspects of study design, testing set construction and interpretation of results. JB, JAB, and GB wrote the manuscript. All authors commented on and approved the manuscript. GB guided the study.

Conflict of Interest

DNC and PDS acknowledge the receipt of financial support from Qiagen Inc through a License Agreement with Cardiff University. The Department of Molecular and Human Genetics at Baylor College of Medicine receives revenue from clinical genetic testing completed at Baylor Genetics. EAA is advisor to Apple, co-founder of Personalis Inc., and of DeepCell Inc. MTW is a stockholder of Personalis. The remaining authors declare no conflict of interest.

Acknowledgments

We would like to thank Erich Weiler for continuous support and guidance. We thank the members of the Bejerano lab for technical advice and helpful discussions. We thank Victoria Wang, Max Haeussler, Mark E. Diekhans, Natalie T. Deuitch, and Laura E. Hayward for helpful input. We thank Elijah Kravets, Julia Buckingham and Kirstie MacMillan for study coordination. We thank the European Genome-Phenome Archive27 (EGA) and the Deciphering Developmental Disorders (DDD) project25 for data sharing. The DDD study presents independent research commissioned by the Health Innovation Challenge Fund [grant HICF-1009-003], a parallel funding partnership between the Wellcome Trust and the Department of Health, and the Wellcome Trust Sanger Institute [grant WT098051]. The views expressed in this publication are those of the author(s) and not necessarily those of the Wellcome Trust or the Department of Health. The study has UK Research Ethics Committee approval (10/H0305/83, granted by the Cambridge South REC, and GEN/284/12 granted by the Republic of Ireland REC). Deidentified DDD data was obtained through EGA. The research team acknowledges the support of the National Institute for Health Research, through the Comprehensive Clinical Research Network. The authors would like to thank the Genome Aggregation Database (gnomAD) and the groups that provided exome and genome variant data to this resource. A full list of contributing groups can be found at http://gnomad.broadinstitute.org/about. UDN data were obtained directly from the UDN.

References

  1. Church G. Compelling Reasons for Repairing Human Germlines. N Engl J Med 2017;377(20):1909–11.
  2. Blencowe H, Moorthie S, Petrou M, et al. Rare single gene disorders: estimating baseline prevalence and outcomes worldwide. J Community Genet 2018;9(4):397–406.
  3. Birney E, Vamathevan J, Goodhand P. Genomics in healthcare: GA4GH looks to 2022. bioRxiv 2017;203554.
  4. Dragojlovic N, Elliott AM, Adam S, et al. The cost and diagnostic yield of exome sequencing for children with suspected genetic disorders: a benchmarking study. Genet Med 2018;20(9):1013.
  5. Yang Y, Muzny DM, Reid JG, et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N Engl J Med 2013;369(16):1502–11.
  6. Wenger AM, Guturu H, Bernstein JA, Bejerano G. Systematic reanalysis of clinical exome data yields additional diagnoses: implications for providers. Genet Med 2017;19(2):209–214 (ePub 2016).
  7. Bamshad MJ, Nickerson DA, Chong JX. Mendelian Gene Discovery: Fast and Furious with No End in Sight. Am J Hum Genet 2019;105(3):448–55.
  8. Stenson PD, Mort M, Ball EV, et al. The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum Genet 2020;
  9. Nambot S, Thevenon J, Kuentz P, et al. Clinical whole-exome sequencing for the diagnosis of rare disorders with congenital anomalies and/or intellectual disability: substantial interest of prospective annual reanalysis. Genet Med 2018;20(6):645–54.
  10. Need AC, Shashi V, Schoch K, Petrovski S, Goldstein DB. The importance of dynamic reanalysis in diagnostic whole exome sequencing. J Med Genet 2017;54(3):155–6.
  11. Costain G, Jobling R, Walker S, et al. Periodic reanalysis of whole-genome sequencing data enhances the diagnostic advantage over standard clinical genetic testing. Eur J Hum Genet 2018;26(5):740–4.
  12. Xiao B, Qiu W, Ji X, et al. Marked yield of re-evaluating phenotype and exome/target sequencing data in 33 individuals with intellectual disabilities. Am J Med Genet A 2018;176(1):107–15.
  13. Ewans LJ, Schofield D, Shrestha R, et al. Whole-exome sequencing reanalysis at 12 months boosts diagnosis and is cost-effective when applied early in Mendelian disorders. Genet Med 2018;20(12):1564.
  14. Eldomery MK, Coban-Akdemir Z, Harel T, et al. Lessons learned from additional research analyses of unsolved clinical exome cases. Genome Med 2017;9(1):26.
  15. Shashi V, Schoch K, Spillmann R, et al. A comprehensive iterative approach is highly effective in diagnosing individuals who are exome negative. Genet Med 2018;161–72.
  16. Baker SW, Murrell JR, Nesbitt AI, et al. Automated Clinical Exome Reanalysis Reveals Novel Diagnoses. J Mol Diagn JMD 2019;21(1):38–48.
  17. Liu P, Meng L, Normand EA, et al. Reanalysis of Clinical Exome Sequencing Data. N Engl J Med 2019;380(25):2478–80.
  18. Maiese DR, Keehn A, Lyon M, Flannery D, Watson M, Working Groups of the National Coordinating Center for Seven Regional Genetics Service Collaboratives. Current conditions in medical genetics practice. Genet Med Off J Am Coll Med Genet 2019;21(8):1874–7.
  19. Birgmeier J, Haeussler M, Deisseroth CA, et al. AMELIE accelerates Mendelian patient diagnosis directly from the primary literature. bioRxiv 2017;171322.
  20. Birgmeier J, Haeussler M, Deisseroth CA, et al. AMELIE 2 speeds up Mendelian diagnosis by matching patient phenotype & genotype to primary literature. bioRxiv 2019;839878.
  21. Birgmeier J, Haeussler M, Deisseroth CA, et al. AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature. Sci Transl Med 2020;12(544).
  22. Deisseroth CA, Birgmeier J, Bodle EE, et al. ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis. Genet Med 2018;1.
  23. Karczewski KJ, Francioli LC, Tiao G, et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv 2019;531210.
  24. Birgmeier J, Deisseroth CA, Hayward LE, et al. AVADA: toward automated pathogenic variant evidence retrieval directly from the full-text literature. Genet Med 2019;1–9.
  25. Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature 2015;519(7542):223–8.
  26. Ramoni RB, Mulvihill JJ, Adams DR, et al. The Undiagnosed Diseases Network: Accelerating Discovery about Health and Disease. Am J Hum Genet 2017;100(2):185–92.
  27. Lappalainen I, Almeida-King J, Kumanduri V, et al. The European Genome-phenome Archive of human data consented for biomedical research. Nat Genet 2015;47(7):692–5.
  28. Stenson PD, Mort M, Ball EV, et al. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet 2017;136(6):665–77.
  29. Lek M, Karczewski KJ, Minikel EV, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016;536(7616):285–91.
  30. Providers begin to use genomic testing in mapping patient care. Health Data Manag [Internet] 2018; Available from: https://www.healthdatamanagement.com/news/providers-begin-to-use-genomic-testing-in-mapping-patient-care
  31. Gray KA, Yates B, Seal RL, Wright MW, Bruford EA. Genenames.org: the HGNC resources in 2015. Nucleic Acids Res 2015;43(Database issue):D1079–1085.
  32. Bateman A, Martin MJ, O’Donovan C, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2017;45(D1):D158–69.
  33. Jurafsky D, Martin JH. Speech and Language Processing (2nd Edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc.; 2009.
  34. Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res 2015;43(Database issue):D789–798.
  35. Haeussler M. Download, convert and process the full text of scientific articles: maximilianh/pubMunch3 [Internet]. 2018. Available from: https://github.com/maximilianh/pubMunch3
  36. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python. J Mach Learn Res 2011;12:2825–2830.
  37. 1000 Genomes Project Consortium, Auton A, Brooks LD, et al. A global reference for human genetic variation. Nature 2015;526(7571):68–74.
  38. Landrum MJ, Lee JM, Benson M, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 2018;46(D1):D1062–7.
  39. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv13033997 Q-Bio [Internet] 2013; Available from: http://arxiv.org/abs/1303.3997
  40. Broad Institute, Picard Tools. Picard Tools - By Broad Institute [Internet]. 2017; Available from: http://broadinstitute.github.io/picard/
  41. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43(5):491–8.
  42. Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat 2013;34(8):1057–65.
  43. Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics 2011;27(15):2156–8.
  44. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38(16):e164.
  45. Schwarzer G. meta: An R package for meta-analysis. R News 2007;7(3):40–45.
  46. Wei C-H, Kao H-Y, Lu Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res 2013;41(Web Server issue):W518–522.
Posted January 04, 2021.

Subject Area

  • Genetic and Genomic Medicine