ABSTRACT
INTRODUCTION Given the established association between DNA methylation and the pathophysiology of dementia and its plausible role as a molecular mediator of lifestyle and environment, blood-derived DNA methylation data could enable early detection of dementia risk.
METHODS In conjunction with an extensive array of machine learning techniques, we employed whole blood genome-wide DNA methylation data as a surrogate for 14 modifiable and non-modifiable factors in the assessment of dementia risk in two independent cohorts of Alzheimer’s disease (AD) and Parkinson’s disease (PD).
RESULTS We established a multivariate methylation risk score (MMRS) to identify the status of mild cognitive impairment (MCI) cross-sectionally, independent of age and sex. We further demonstrated significant predictive capability of this score for the prospective onset of cognitive decline in AD and PD.
DISCUSSION Our work shows the potential of employing blood-derived DNA methylation data in the assessment of dementia risk.
1. BACKGROUND
As a result of population aging, the global number of dementia cases has drastically increased over the past decade and is estimated to keep increasing to more than 150 million patients worldwide by 2050 [1]. Despite the seemingly ever-increasing global burden, there is currently no availability of drugs or other treatment options that can halt or fully reverse the cognitive decline in dementia [2]. This emphasizes the importance of pre-clinical prevention strategies that require the identification of persons at risk of developing dementia before the onset of irreversible cognitive decline.
Over the past decades, the most precise models for predicting dementia and Alzheimer’s disease (AD) have relied on molecular information derived from cerebrospinal fluid (CSF) and neuroimaging modalities [3, 4]. In addition to that, several studies have identified genetic and environmental (lifestyle-related) factors that are associated with dementia risk, such as smoking [5], alcohol intake [6], plasma cholesterol levels [7], physical activity [8], education [9], and diet [10]. Accordingly, several of these unmodifiable and modifiable (lifestyle-related) dementia risk factors have been combined into a single score, such as the ‘Cardiovascular Risk Factors, Aging, and Dementia’ (CAIDE) [11] and ‘Lifestyle for Brain Health’ (LIBRA) [12] scores, respectively. However, in recent years, there has been a significant shift towards prioritizing blood-based biomarkers for AD [13]. This is due to several factors, including the invasive nature of CSF sampling, the lack of objective quantitative assessment of lifestyle-related factors, and the high cost and limited availability of specialized neuroimaging facilities.
It has been well-established that epigenetic mechanisms and in particular, DNA methylation, are involved in the molecular pathology of various neurodegenerative diseases, including dementia [14]. DNA methylation is a molecular mechanism that can mediate the impact of lifestyle and environmental factors on the genome, regulating the expression of the genes [15]. Notably, prior research has demonstrated associations between peripheral DNA methylation patterns and risk factors for dementia, such as smoking [16], obesity [17], and blood pressure [18]. Moreover, recent comprehensive DNA methylation Quantitative Trait Locus (mQTL) analyses have confirmed a substantial genetic influence on methylation patterns [19]. This collective evidence positions DNA methylation as an intriguing molecular biomarker with the potential to capture both genetic and environmental information at the individual level. However, previous endeavors to establish blood-derived DNA methylation-based predictions for AD have encountered challenges in external validation, likely attributed to the heterogeneity of the disease among different cohorts [20].
Therefore, in this study, rather than adopting a direct approach to predict the risk of dementia using blood DNA methylation, we sought to utilize the large-scale nature of a general population cohort to first develop objective and precise DNA methylation-based prediction models for the CAIDE and LIBRA scores and multiple modifiable and non-modifiable dementia risk factors. Next, along with the epigenetic score for CAIDE and LIBRA (i.e., epi-CAIDE and epi-LIBRA), we employed the methylation profile scores (MPSs) of the individual dementia risk factors to generate a multivariate methylation risk score (MMRS) for cognitive impairment and dementia.
2. METHODS
The applied methodology consists of four main steps: (1) the model generation by two distinct approaches using the DNA methylation data of the Exeter 10,000 project (EXTEND) [21] and the European Medical Information Framework for Alzheimer’s Disease (EMIF-AD) [22] cohorts (Fig. 1), (2) model validation in the independent test set of the EMIF-AD, Parkinson’s Progression Markers Initiative (PPMI) [23] and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [24] cohorts, (3) model interpretation in terms of variable importance, gene ontology (GO) overrepresentation analysis, and the influence of genetic variation, and (4) model extension by adding genetic and CSF biomarkers to the model.
Overview of the model generation. The model generation workflow consists of model training in the EXTEND and EMIF-AD cohorts using two different approaches. In the first approach, models for the prediction of CAIDE and LIBRA were trained in the EXTEND cohort (section 2.3.1.). Furthermore, in approach 2, the EXTEND cohort was used to predict 14 known dementia risk factors. The predicted risk scores by these 14 models (i.e., methylation profile scores; MPSs) were used as variables for the prediction of AD and MCI status in the training set of the EMIF-AD cohort (section 2.3.2.). Both approaches were evaluated in terms of AD and MCI classification performance in the independent test set in the EMIF-AD cohort. The model with the best performance was also validated in the PPMI and ADNI longitudinal cohorts (section 2.4.).
2.1. Study cohorts
2.1.1. The EXTEND cohort
The EXTEND study is a National Institute for Health and Care Research (NIHR) funded project aiming to collect blood samples along with extensive health information from people with and without health issues [21]. In the current study, a subset of the EXTEND cohort, consisting of individuals with available genotyping and blood-derived DNA methylation data (n = 1076) was used. The dataset exclusively comprised phenotype data from individuals aged 40 to 75, denoted as the midlife age group.
2.1.2. The EMIF-AD cohort
The EMIF-AD cohort is a research initiative focused on collecting and collating comprehensive medical and health-related information from individuals affected by AD and related cognitive disorders [22]. In this study, besides excluding the individuals outside the midlife age range (i.e., age < 40 or age > 75), cognitively healthy individuals who were recorded to have converted to MCI or AD during the follow-up period (mean follow-up time ± sd ≈ 2.3 ± 1.2 years) were also excluded. This process resulted in a final dataset of 110 individuals with AD, 293 individuals with MCI, and 220 healthy controls with available blood-derived DNA methylation data, genotyping data, and CSF protein markers.
2.1.3. The ADNI cohort
The ADNI cohort [24] contains genetic and blood-derived DNA methylation data of cognitively normal individuals as well as individuals diagnosed with MCI and dementia. In addition to the DNA methylation data, ADNI has also extensive individual-level information on various psychometric biomarkers measured at multiple time points. For the validation of the models, we used the baseline DNA methylation data of 223 midlife (40 ≤ age ≤ 75) individuals with a Mini Mental State Examination (MMSE) ≥ 26 (i.e., cognitively intact) at baseline.
2.1.4. The PPMI cohort
The Parkinson’s Progression Markers Initiative (PPMI) cohort [23] includes the blood-derived DNA methylation data of individuals recently diagnosed with Parkinson’s Disease (PD). Only samples from persons with 40 ≤ age ≤ 75 were included for the validation, resulting in 129 samples for which baseline DNA methylation data and the cognitive impairment outcome information were available. An overview (i.e., description, sample size, and sex and age distribution) of the four cohorts used in the current research is provided in Supplementary Table 1.
2.2 Data preprocessing
2.2.1. Clinical outcomes
The cognitive status of individuals in the EMIF-AD cohort was defined as described previously [22]. In summary, cognitively healthy individuals were defined by a normal neuropsychological assessment score. In nine of the EMIF-AD subcohorts, the MCI diagnosis was based on the criteria of Petersen [25], while for two subcohorts the Winblad et al. criteria [26] was used. Furthermore, AD diagnosis was defined based on the criteria of the National Institute of Neurological and Communicative Disorders and Stroke–Alzheimer’s Disease and Related Disorders Association (NINCDS-ADRDA) [27].
In the ADNI cohort, we only included the data of cognitively intact (i.e., MMSE ≥ 26) individuals at baseline to perform the survival analysis based on the longitudinal outcome of different cognitive domains (i.e., Alzheimer’s Disease Assessment Scale (ADAS), Rey’s Auditory Verbal Learning Test (RAVLT), Wechsler Logical Memory Delay (LDELTOTAL), Trail Making Test Part B Time (TRABSCOR), and MMSE scores) measured over 4 years. For the MMSE score we defined cognitive impairments by a MMSE < 24, while for the other cognitive outcomes, cognitive impairment status was defined by a score of either 2 SD below or above the mean of the control group, depending on the direction of the score (Supplementary Table 2).
Finally, in the PPMI cohort, dementia and MCI diagnosis was based on the Movement Disorders Society (MDS) recommended criteria [28, 29] as done previously [30]. MCI and dementia individuals who have been recorded to have reverted to a cognitively normal status (absence of MCI and dementia) were excluded from the analysis.
2.2.2. Dementia risk factors
In the current study, the dementia risk factors were defined in the EXTEND cohort according to Table 1. Particularly, 15 dementia risk factors were used for the calculation of the CAIDE and LIBRA scores based on the previously established risk factor weights [12, 31] (Supplementary Tables 3 and 4). Additionally, because of the small number of kidney disease cases, only 14 dementia risk factors were used as target variables for the construction and/or validation of the DNA methylation-based risk factor models (section 2.3.2.).
The risk factors in the EXTEND cohort were used for the calculation of the CAIDE and LIBRA scores and as target variables for the training or validation of the DNA methylation-based risk factor models.
2.2.3. Cerebral spinal fluid biomarkers
Three CSF biomarkers scores (amyloid-β (Aβ), phosphorylated tau (p-tau), and total tau (t-tau) z-scores) were defined in the EMIF-AD cohort as described previously [22]. In summary, the t-tau and p-tau z-scores were defined by their local p-tau levels as measured by center-specific ELISAs, standardized within assay according to mean and standard deviation of the healthy controls. Furthermore, the Aβ z-score were specified to be standardized scores for amyloid pathology. Specifically, this score was standardized within assay according to the mean and standard deviation of the control group and was based on CSF Aβ42/40 ratio, local CSF-Aβ42, and the standardized uptake value ratio (SUVR) on an amyloid PET scan.
2.2.4. Whole blood DNA Methylation profiling
Sample filtering was performed as a first step of the DNA methylation data pre-processing pipeline and includes the removal of samples with a median bisulfite conversion rate below 80 percent, incorrect sex labels, and a low median (un)methylated intensity according the minfi package’s guidelines (i.e., median log2 unmethylated intensity + median log2 methylated intensity ≤ 21) [32].
In the normalization procedure, the combination of Noob (minfi package (v1.46.0) [32]) and BMIQ (wateRmelon package (v2.6.0) [33]) normalization was applied. This pipeline has previously been shown to be a high-performing method for reducing type I/type II bias and enhancing reproducibility [34]. Furthermore, both Noob and BMIQ are within-sample normalization methods, which avoid information leakage from the training to the test set.
Before model training in the EXTEND cohort, previously reported cross-reactive probes [35, 36], probes with a detection P value > .01 in at least one sample, sex-chromosomal probes, and probes with single nucleotide polymorphisms (SNPs) at the single base extension and/or CpG interrogation site were removed. Next, density plots and principal component analysis (PCA) score plots were constructed to assess the quality of the pre-processing and to identify possible outliers. Lastly, β-values were converted to M-values to account for the inherently heteroscedastic nature of methylation data. The M-values were used for the subsequent feature selection and machine learning pipeline.
It should be noted that not all features that are incorporated in the generation of the risk factor models passed the described probe filtering steps in the other cohorts. Therefore, when applying the risk factor models in a different cohort, low-quality probes (i.e., detection P value > .1) were imputed using the imputePCA function from the missMDA package (v1.8) [37]. This function uses a regularized iterative PCA algorithm to impute missing values. Specifically, this algorithm first imputes all missing values with the feature’s mean after which it iteratively performs PCA and imputes each missing value using the low-rank representation until convergence is reached (i.e., a difference of less than 10-6 between iterations). The optimal number of principal components (PCs) for the low-rank representation was found by removing and imputing 100 random values and choosing the number of PCs that yields the lowest mean absolute error (MAE). This iterative PCA imputation method has previously been shown to be among the best-performing and computationally efficient algorithms for missing value imputation of DNA methylation data [38].
2.2.5. Polygenic score generation
The genomics data from the EMIF-AD cohort was pre-processed as described previously [39]. In short, this pre-processing pipeline includes filtering of strand-ambiguous SNPs, aligning alleles to the human genome assembly GRCh37/hg19, phasing, imputation based on the HRC reference panel, pre- and post-imputation quality control (QC), and filtering of SNPs with a minor allele frequency (MAF) < 0.01. The genomics data of the EXTEND cohort underwent a similar pre-processing pipeline including SNP filtering and alignment using the HRC/100G imputation preparation and checking pipeline (v4.2.7) [40], imputation using the Michigan Imputation Server with the 1000Genomes reference panel (phase 3 v5 hg19, Population: EUR, Phasing: Eagle, R-squared filter: 0.3) [41], and post-imputation SNP filtering (MAF < 0.01, Hardy-Weinberg equilibrium (HWE) P value < 10-4, missing call rated > 0.1).
Subsequently, the LDAK tool (v5.2) [42] was applied to calculate the polygenic (risk) scores (PGSs) for 11 dementia risk factors as well as AD status (including the APOE region) using the HapMap reference panel and the summary statistics of 12 genome-wide association studies (GWASs) [43–53] (Supplementary Table 5). In short, LDAK splits the summary statistics in pseudo training and test summary statistics and then uses a variational Bayes approach to estimate the regression coefficients of the SNPs [42]. For the hyperparameter optimization, multiple models are trained and evaluated on the test summary statistics for different combinations of prior distribution parameters. This methodology is available for six model types (i.e., bayesR, bayesR-shrink, lasso, lasso-sparse, ridge, and bolt regression models) that differ in the form of the prior distribution for the SNP effect sizes. As the bayesR is the recommended method for PGS generation by the developers of LDAK, all PGSs were generated using the bayesR approach.
2.3. Model generation
The model generation consists of two approaches; the prediction of the CAIDE and LIBRA scores in the EXTEND cohort as well as the prediction of MCI and AD status in the EMIF-AD cohort by 14 methylation profile scores of dementia risk factors (Fig. 1).
2.3.1. Approach 1: Generation of methylation risk scores for CAIDE and LIBRA
In the first approach, we aimed at constructing an epigenetic model for the prediction of the LIBRA score (i.e., epi-LIBRA model) and CAIDE score (i.e., epi-CAIDE model). However, before constructing these models, we evaluated multiple supervised and unsupervised feature selection methods to investigate which method is suitable for reducing the large dimensionality of the DNA methylation data and found superior performance of the “correlation-based feature selection method” (see Supplementary Text 1 for a more detailed description of the applied methodology and results). In short, in the correlation-based feature selection method, the 10,000 CpGs that have the highest absolute Spearman correlation coefficient with the target variable in the training set were selected for model training.
Hence, for the prediction of both the CAIDE and LIBRA scores, we trained an ElasticNet and Random Forest model on the 10,000 features selected by correlation-based feature selection as well as an ElasticNet model trained on all CpGs that passed QC (Table 2). Accordingly, 5-repeated 5-fold cross-validation was applied to find the optimal hyperparameter values that yield the minimal MAE (the searched hyperparameter space is shown in Supplementary Table 6).
For the prediction of nine dementia risk factors and the CAIDE and LIBRA scores in the EXTEND cohort, different combinations of feature selection and machine learning methods were used.
2.3.2. Approach 2: Generation of multivariate methylation risk scores
For the second approach, we aimed at predicting 14 known dementia risk factors that are included in the CAIDE and/or LIBRA scoring systems (Table 1). Specifically, for the prediction of smoking, alcohol consumption, HDL cholesterol, and BMI, the corresponding DNA methylation-based models from Hillary and Marioni (2020) [54] were used, while the epigenetic clock model from Zhang et al. (2019) [55] was used for the prediction of age. For each of the nine remaining risk factors (i.e., low education, physical inactivity, unhealthy diet, depression, type II diabetes, heart disease, sex, systolic blood pressure, and total cholesterol), five models were trained in the EXTEND cohort corresponding to different combinations of feature selection and machine learning algorithms (Table 2). These include an ElasticNet model without prior feature selection, an ElasticNet and Random Forest model with correlation-based feature selection, as well as an ElasticNet and Random Forest model trained on the CpGs that reached genome-wide significance in previously performed epigenome-wide association studies (Supplementary Table 7).
Accordingly, for each of these five models, 5-repeated 5-fold cross-validation was applied to find the optimal hyperparameter values that yield the maximal area under the receiver operating characteristic curve (AUROC) (for discrete risk factors) or minimal MAE (for continuous risk factors) (the searched hyperparameter space is shown in Supplementary Table 6). From the five models per risk factor, the model that achieved the highest AUROC or R2 over all folds was considered as the best-performing risk factor model.
Subsequently, the predicted risk scores of each risk factor model (for binary variables this is defined as log(1/1-p), where p is the estimated class probability), referred to as methylation profile scores (MPSs), were used as variables for the construction of a multivariate methylation risk score (MMRS) for the prediction of ‘MCI versus control’ (i.e., MMRS-MCI model) and ‘AD versus control’ (i.e., MMRS-AD model) in the training set of the EMIF-AD cohort. For this, the Kennard-Stone algorithm [56] was first applied to the MPSs to split the data into a training (n = 436) and an independent test set (n = 187) (Supplementary Table 8). Accordingly, an ElasticNet (EN), sparse partial least squares-discriminant analysis (sPLS-DA), and Random Forest model with recursive feature elimination (RF-RFE) models were trained by 5-repeated 5-fold cross-validation to find the optimal hyperparameter combination that yields the highest AUROC (the searched hyperparameter space is shown in Supplementary Table 9).
2.4. Model validation
In this step, besides the estimation of the model’s performance in the independent test set of the EMIF-AD cohort, the validation of our best-performing model includes the survival analysis of cognitive impairments in the PPMI and ADNI longitudinal cohorts.
Based on the risk scores calculated by our best-performing model among the epi-LIBRA, epi-CAIDE and MMRS models, the individuals of the ADNI and PPMI cohorts were categorized into three equally sized risk categories; low- (n = 74 for ADNI, n = 43 for PPMI), intermediate- (n = 75 for ADNI, n = 43 for PPMI), and high-risk (n = 74 for ADNI and n = 43 for PPMI). We accordingly assessed the statistical significance of the difference in time-dependent conversion to cognitive impairments (as defined in section 2.2.1.) by comparing the low- and high-risk groups using the log-rank test and a Cox regression model (survival package (v3.5.5) [57]). Furthermore, a Kaplan-Meier curve was constructed to visualize the probability of cognitive impairments over time for each of the three risk categories. Please note that in the ADNI cohort, not all cognitive measures were available for each individual and, hence, the number of samples per risk category was different per cognitive outcome (Supplementary Table 2).
2.5. Model interpretation
2.5.1. Variable importance
The importance of the best-performing model’s variables was evaluated using mean absolute SHapley Additive exPlanations (SHAP) values of the test set samples in the EMIF-AD cohort as calculated with the DALEX package (v2.4.3) [58]. Notably, to make the SHAP values comparable between models, the values were normalized such that the absolute sum equals one. The scaled mean absolute SHAP values can therefore be interpreted as the average proportional contribution to the predicted score.
2.5.2. GO overrepresentation analysis
The missMethyl package (v1.34.0) [59] was used to perform GO overrepresentation analysis on the union of the most important features of the risk factor models that are used for the prediction by the best-performing model. For the ElasticNet models, the most important features are the CpGs with a non-zero coefficient, while for the Random Forest models, the 1000 CpGs with the largest Gini index were considered as the most important variables.
2.5.3. Influence of genetic variation
To test the extent to which the variation in the methylation status of the model’s CpGs can be explained by genetic variation, the joint and individual variation explained (JIVE) method (r.jive package (v2.4) [60]) was performed on the risk factor model’s most important features and the corresponding methylation quantitative trait loci (mQTLs) (clumped cis- and trans-mQTLs, P value < 1e-5, GoDMC database [19]). The joint variation, as quantified by the JIVE method, of the risk factor model’s CpG β-values and their mQTLs was recorded.
In addition, colocalization analysis (coloc package (v5.2.2) [61]) was performed for each of the model’s most important features to assess whether its methylation status and Alzheimer’s disease status share a common genetic variant. For this, the GWAS summary statistics of Alzheimer’s disease from Marioni et al. (2018) [45] and the mQTL data (non-clumped cis-mQTLs, P value < 1e-5) from the GoDMC database [19] were used.
2.6. Model extension
Finally, we performed training in the EMIF-AD cohort using the 14 MPSs with 12 PGSs (Supplementary Table 5) and/or three CSF biomarkers (i.e., Aβ, t-tau, and p-tau z-scores) as additional variables. For this, the same machine learning strategy as described in section 2.3.2. was applied.
3. RESULTS
3.1. Generation and validation of the epi-CAIDE and epi-LIBRA scores (approach 1)
While the CAIDE score could be relatively well predicted by the Random Forest model with a cross-validation R2 ≈ 0.47, the LIBRA score was poorly predicted with a maximal cross-validation R2 ≈ 0.04 by the Random Forest model (Supplementary Table 10).
Interestingly, in the EMIF-AD cohort, the epi-CAIDE score was found to be highly correlated with chronological (R2 ≈ 0.45) and epigenetic (R2 ≈ 0.55) age, indicating that age is the main driver of the epi-CAIDE score. Nevertheless, with an AUROC ≤ 0.61, both the epi-CAIDE and epi-LIBRA scores as predicted by the best-performing Random Forest models were shown to be poor estimators of both MCI and AD status in the independent test set of the EMIF-AD cohort (Fig. 2 and Supplementary Table 11).
The MMRS models (red) are trained on the 14 MPSs for the prediction of AD (A) and MCI (B). The epi-LIBRA and epi-CAIDE scores (blue) are both predicted by the Random Forest model with correlation-based feature selection method (i.e., the best performing model) from the EXTEND data. The 95% confidence intervals of the AUROC values are indicated between brackets.
3.2. Generation and validation of MMRS models (approach 2)
Besides age (R2 ≈ 0.92) and sex (AUROC = 1), the best-predicted dementia risk factors by blood-derived DNA methylation data in the EXTEND cohorts include smoking (AUROC ≈ 0.91), type II diabetes (AUROC ≈ 0.89), and heart disease status (AUROC ≈ 0.80) (Supplementary Tables 12 and 13). The performance of 9 MPSs could also be validated in the EMIF-AD cohort, demonstrating a mostly lower but statistically significant predictive performance. Just as in the EXTEND cohort, age (R2 ≈ 0.87), sex (AUROC = 1), smoking (AUROC ≈ 0.80), and heart disease status (AUROC ≈ 0.67) were found to be the best predicted risk factors in the EMIF-AD cohort (Supplementary Table 12).
As shown in Fig. 2 and Supplementary Table 11, our MMRS models, which use the 14 MPSs as variables, could not be used to significantly predict AD status in the independent test set of the EMIF-AD cohort (AUROC ≤ 0.60). Nevertheless, our MMRS model (RF-RFE) for MCI was able to significantly predict MCI status with an AUROC of 0.68 (P value ≈ .002, adjusted for age and sex) and significantly improve upon the MCI prediction by epigenetic age (DeLong’s P value ≈ 8.0e-3).
3.3. Validation in longitudinal dementia cohorts
To validate our best-performing MMRS model, we performed survival analysis in the ADNI and PPMI longitudinal cohorts. Specifically, for each cohort, we divided the individuals into three equally sized risk categories based on the baseline risk score predicted by the MMRS-MCI (RF-RFE) model and recorded their conversion to cognitive impairments over time.
The Kaplan-Meier curve in Fig. 3A shows that, in the PPMI cohort, there is more conversion to cognitive impairments (i.e., MCI or dementia) within the high-risk group as compared to the low-risk category (log-rank test P value ≈ 6.0e-4, AUROC ≈ 0.67, hazard ratio (HR) [95% confidence interval (CI)] ≈ 3.72 [1.66-8.33]), suggesting that the baseline risk score predicted by the MMRS-MCI model in this Parkinson’s disease cohort is predictive of the onset of cognitive impairments. Furthermore, in the ADNI cohort, we also see more conversion to cognitive impairments as measured by several cognitive tests, reaching statistical significance for RAVLT - learning (log-rank test P value ≈ .02, AUROC ≈ 0.62, HR [95% CI] ≈ 2.11 [1.08-4.11]) and TRABSCOR (log-rank test P value ≈ .04, AUROC ≈ 0.60, HR [95% CI] ≈ 1.94 [1.01-3.73]) (Fig. 3B, Supplementary Table 14, and Supplementary Fig 1.).
The risk categories were defined based on the baseline score predicted by the MMRS-MCI (RF-RFE) model. The shaded area around the line indicates the 95% confidence interval.
3.4. Model interpretation
The best performing MMRS-MCI (RF-RFE) model uses only 10 of the 14 MPSs for its prediction. Furthermore, the mean absolute SHAP values indicate that the MMRS is mainly driven by the depression, HDL cholesterol, physical inactivity, and low education MPSs (Fig. 4). The distribution of the MPSs among the diagnostic groups is shown in Supplementary Fig 2.
The scaled mean absolute SHAP values indicate the variable importance of the 14 MPSs in the three MMRS-MCI models.
Interestingly, ‘AMPA glutamate receptor clustering (GO:0097113)’ is the most overrepresented GO term by the union of the CpGs used in these 10 risk factor models (7571 CpGs, unadjusted P value ≈ 4.2e-4) (Supplementary Table 15). Notably, the model’s CpGs that are associated with this GO term are located in the untranslated region (UTR) or gene body of the APOE, DLG, NLGN1, SHANK3, SHISA6, SHISA7, SLC7A11, and SSH1 genes (Supplementary Fig. 3). However, JIVE indicates that there is only minimal joint information captured by the model’s CpGs and their mQTLs. Particularly, for each risk factor model, less than three percent of the variance in DNA methylation data is explained by genetic variation (Supplementary Fig. 4). Finally, the colocalization analysis shows that only a few CpGs are likely to share a genetic variant with AD status, including cg19514613 in the 5’UTR of the APOE gene (posterior probability > 0.99) and cg08374890 located approximately 2 kb upstream of the BCKDK gene (posterior probability ≈ 0.94) (Supplementary Fig. 5). Together, these results indicate that our best performing model encompasses information unique to DNA methylation data with only a limited genetic influence.
3.5. Model extension
Although most PGSs are significantly associated with their corresponding risk factor in the EXTEND and EMIF-AD cohorts, their predictive power is relatively low with an R2 ≤ 0.1 and AUROC ≤ 0.67 (Supplementary Table 16). In addition, most of the PGSs and their corresponding MPSs have only low correlations (i.e., Pearson correlation coefficient < 0.1) (Supplementary Fig. 6).
Furthermore, with an AUROC of 0.69, the addition of 12 PGSs to the 14 MPSs as additional features did not yield a significant improvement in MCI predictive performance above the prediction by MPSs only (Fig. 5). However, the addition of CSF biomarkers to the 14 MPSs could increase the MCI predictive performance to an AUROC of 0.88.
MCI status was predicted in the independent test set of the EMIF-AD cohort using the MPSs, PGSs, and/or CSF biomarkers as variables.
4. DISCUSSION
In this study, we used whole-blood-derived DNA methylation data quantified in a general population of individuals at midlife stage (EXTEND) to create a molecular score as a proxy for CAIDE and LIBRA dementia risk scores (referred to as the epi-CAIDE and epi-LIBRA models, respectively). Alternatively, we generated individual MPSs for 14 modifiable and non-modifiable risk factors contributing to the formation of CAIDE and LIBRA scores. Next, along with epi-CAIDE and epi-LIBRA scores, we trained MMRS models using the 14 MPSs in the EMIF-AD training set to predict the AD and MCI status. Subsequently the performance of each model was evaluated in an independent test set. Although our epi-CAIDE and epi-LIBRA models were shown to perform poorly for both AD and MCI prediction in the EMIF-AD cohort, the MMRS model was demonstrated to significantly predict MCI status with an AUROC of 0.68. The potential of this model was further demonstrated by its usefulness for prospectively predicting the onset of cognitive impairments in two independent cohorts of AD (ADNI) and PD (PPMI) as well as improving upon its cross-sectional prediction in combination with established genetic and CSF biomarkers in the EMIF-AD cohort.
The poor performance of our epi-CAIDE and epi-LIBRA models might be explained by the fact that the risk factor’s weights in the CAIDE and LIBRA scores do not consider the extent to which the risk factors can be estimated by DNA methylation data, whereas an ideal epigenetic model would give more weight to the factors best predicted by the DNA methylation data. In addition, some people may have the same CAIDE or LIBRA score but different contributing risk factors. For example, one can have a high CAIDE score because of having a high BMI and low physical activity, while another person may have the same CAIDE score but because of having a high age. These difference in contribution to the total scores may not be effectively captured by our molecular-based epi-CAIDE and epi-LIBRA models, possibly explaining their poor performance.
Similarly, our MMRS model – a model generated from 14 dementia risk factor models – could neither be used to significantly predict AD status in the independent test set of the EMIF-AD cohort (Fig. 2). This poor performance might be explained by the fact that many of the risk factors included in our analysis are cardiovascular-related risk factors (e.g., BMI, systolic blood pressure, heart disease, type II diabetes, physical activity, diet, smoking, as well as HDL and total cholesterol) which might be more predictive of MCI and non-AD types of dementia, like vascular dementia, than AD.
Nevertheless, the MMRS model for MCI, was demonstrated to be able to significantly predict MCI status with an AUROC of up to 0.68 (Fig. 2) as well as to prospectively predict cognitive impairments in two external cohorts (Fig. 3). Noteworthily, despite not being trained in the PPMI cohort, the performance of our best performing MMRS model in prospectively predicting cognitive impairments in this cohort (AUROC ≈ 0.67) was even better than the previously reported prediction by genetic and epigenetic variables (AUROC ≈ 0.64) [30].
Furthermore, compared to using only CSF biomarkers as variables, the combination of the CSF biomarkers and the 14 MPSs resulted in a consistent improvement in the prediction of MCI status (Fig. 5). This combination outperforms previously reported MRI-based classification of MCI in late-life depression [62] and at midlife [63] which have been shown to yield an AUROC of 0.77 and 0.85, respectively. However, the addition of the PGSs as additional features to the 14 MPSs, did not lead to a significant improvement in predictive performance. This is, however, not unforeseen given the relatively poor performance of the PGSs in predicting their corresponding risk factors (Supplementary Table 16) and MCI status (Fig. 5).
Although the addition of PGSs yielded only limited improvement in MCI predictive performance, matching pairs of PGSs and MPSs were found to have only a relatively low correlation (Supplementary Fig. 6), meaning that the overlapping information between matching pairs of MPSs and PGSs is very limited. Furthermore, JIVE and the colocalization analysis also suggested relatively little influence of genetic variation on the model’s prediction (Supplementary Fig. 4-5). Finally, the fact that the addition of MPSs to the PGSs and/or CSF biomarkers consistently improves the model’s performance (Fig. 5), indicates that the MPSs provide unique information on top of the already established genetic and CSF biomarkers.
The ‘AMPA glutamate receptor clustering (GO:0097113)’ term was shown to be the most significantly overrepresented GO term by the union of the CpGs included in our best-performing MMRS model (Supplementary Table 15). Interestingly, AMPA glutamate receptors are known to play an important role in synaptic transmission, and Alzheimer’s disease pathology has previously been shown to be associated with an enhanced removal of these receptors from the post-synaptic membrane [64]. Although the methylation changes in these processes were measured in the blood, they may resemble the alterations that occur in an AD-affected brain. Specifically, the model’s CpGs associated with this GO term are located in known dementia-related genes, including APOE [65], DLG1 [66], NLGN1 [67], SHANK3 [68], SHISA6 [69], SHISA7 [70], and SSH1 [71]. Hence, this AMPA receptor-associated process might be one way by which the model is able to estimate a person’s dementia risk.
Besides the validation of our blood-derived DNA methylation-based model in multiple independent cohorts, a strength of the present study is the use of 14 MPSs as variables for dementia risk prediction. Particularly, our approach utilizes the large-scale nature of a general population cohort to reduce the dimensionality of the DNA methylation data into 14 interpretable latent features (i.e., the 14 MPSs), allowing for the construction of a robust and replicable model and overcoming the lack of replication of CpG-level models as described previously [20]. In addition, the MPSs have a clear biological meaning which is in contrast to the more difficult-to-interpret latent features generated by other commonly used dimensionality reduction methods such as (s)PLS, PCA, and autoencoders. This way, our model can provide direct information about which risk factors contribute to an (elevated) dementia risk. As the DNA methylation profile has previously been shown to be modifiable through lifestyle changes (e.g., smoking cessation [72] and weight restoration [73], alcohol withdrawal [74], and physical activity [8]), the information provided by our model possibly allows for targeted intervention strategies, aiming at maximally reducing the patient-specific risk scores. However, to which extent and how these MPSs can be best modified by, for example, lifestyle interventions could be investigated in future studies.
Furthermore, it should be noted that, although MCI is a well-established dementia risk factor, it is not a perfect predictor of the future onset of dementia. Specifically, a significant proportion of MCI individuals revert to a cognitively healthy status [75]. So, instead of using the cross-sectional MCI status as the dependent variable, modeling the prospective cognitive outcome (e.g., the trajectory of cognitive decline) may result in a better model for the prediction of the future onset of cognitive impairments. The lack of large-scale prospective data, however, makes this approach currently impossible, and it is up to future initiatives to collect such data for more sophisticated analyses.
Nevertheless, our established MMRS model may act as a starting point for future studies, aiming at further improving the model’s predictive performance by testing novel feature selection and machine learning methods, incorporating more omics layers, as well as performing model training on larger (prospective) datasets. In the end, this might bring us closer to the establishment of a reliable blood-based model for the early identification of people at risk of developing dementia, an essential requirement for successful intervention strategies.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
DECLARATION OF INTEREST
Declarations of interest: none.
CODE AVAILABILITY STATEMENT
All codes and an R Shiny application to run the models are available at https://github.com/jarnokoetsier/DementiaRiskPrediction
ACKNOWLEDGMENTS
This work was supported through a ZonMw Memorabel/Alzheimer Nederland Grant (733050516) to E.P. The current study was conducted as part of the EMIF-AD MBD project, which has received support from the Innovative Medicines Initiative Joint Undertaking under EMIF grant agreement number 115372, the resources of which are composed of financial contribution from the European Union’s Seventh Framework Program (FP7/2007–2013) and EFPIA companies’ in kind contribution. The analysis of the DNA methylation data was in part funded by a major project grant from the Alzheimer’s Society UK (AS-PG-14-038) to KL, a project grant from the Medical Research Council (MRC) (MR/N027973/1) to KL as part of the EPI-AD consortium through the Joint Programme—Neurodegenerative Disease Research (JPND) initiative. Part of this work was supported by the EU’s Horizon2020 program as part of the “Lifebrain” consortium project (to LB). RV acknowledges the support by the Stichting Alzheimer Onderzoek (#13007, #11020, #2017-032) and the Flemish Government (VIND IWT 135043). HZ is a Wallenberg Academy Fellow supported by grants from the Swedish Research Council (#2018-02532), the European Research Council (#681712) and Swedish State Support for Clinical Research (#ALFGBG-720931). SJBV received funding from the Innovative Medicines Initiative 2 Joint Undertaking under ROADMAP grant agreement No. 116020 and from ZonMw during the conduct of this study. No conflict of interest exists. Research at VIB-UAntwerp was in part supported by the University of Antwerp Research Fund. The research was supported by ALF clinical grants from Region Västra Götaland to AW and to PK. The authors acknowledge the assistance of Ellen De Roeck, Naomi De Roeck and Hanne Struyfs (UAntwerp) with data collection. The Lausanne study was funded by a grant from the Swiss National Research Foundation (SNF 320030_141179) to JP. We thank Mrs. Tanja Wesse and Mrs. Sanaz Sedghpour Sabet at the Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany for technical assistance with the EPIC methylation profiling. The LIGA team acknowledges computational support from the OMICS compute cluster at the University of Lübeck. We thank all participants and teams who contributed data to PPMI. PPMI, a public-private partnership, is funded by the Michael J. Fox Foundation for Parkinson’s Research and funding partners including 4D Pharma, AbbVie, AcureX Therapeutics, Allergan, Amathus Therapeutics, Aligning Science Across Parkinson’s (ASAP), Avid Radiopharmaceuticals, Bial Biotech, Biogen, BioLegend, Bristol Myers Squibb, Calico Life Sciences LLC, Celgene Corporation, DaCapo Brainscience, Denali Therapeutics, The Edmond J. Safra Foundation, Eli Lilly and Company, GE Healthcare, GlaxoSmithKline, Golub Capital, Handl Therapeutics, Insitro, Janssen Pharmaceuticals, Lundbeck, Merck & Co., Meso Scale Diagnostics LLC, Neurocrine Biosciences, Pfizer, Piramal Imaging, Prevail Therapeutics, F. Hoffmann-La Roche and its affiliated company Genentech, Sanofi Genzyme, Servier, Takeda Pharmaceutical Company, Teva Neuroscience, UCB, Vanqua Bio, Verily Life Sciences, Voyager Therapeutics and Yumanity Therapeutics.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.;Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.;Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Abbreviations
- Aβ
- amyloid-β
- AD
- Alzheimer’s disease
- ADAS
- Alzheimer’s disease assessment scale
- AUROC
- area under the receiver operating curve
- CI
- confidence interval
- CSF
- cerebral spinal fluid
- EN
- ElasticNet
- GO
- gene ontology
- HR
- hazard ratio
- HWE
- Hardy-Weinberg equilibrium
- LDELTOTAL
- Wechsler logical memory delay
- MAE
- mean absolute error
- MAF
- minor allele frequency
- MCI
- mild cognitive impairments
- mQTL
- methylation quantitative trait loci
- MMRS
- multivariate methylation risk score
- MMSE
- Mini Mental State Examination
- MPS
- methylation profile score
- PCA
- principal component analysis
- PD
- Parkinson’s disease
- PGS
- polygenic (risk) score
- p-tau
- phosphorylated tau
- QC
- quality control
- RAVLT
- Rey’s auditory verbal learning test
- RF-RFE
- random forest with recursive feature elimination
- SHAP
- Shapley additive explanations
- SNP
- single nucleotide polymorphism
- sPLS-DA
- sparse partial least squares discriminant analysis
- TRABSCOR
- trail making test part B time
- t-tau
- total tau
- UTR
- untranslated region.
REFERENCES
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].
- [45].↵
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].↵
- [70].↵
- [71].↵
- [72].↵
- [73].↵
- [74].↵
- [75].↵