Abstract
Background Breast cancer risk assessment typically relies on consideration of mammography, family history, reproductive history, in addition to assessment of major mutations. However, quantifying the impact of environmental factors, such as lifestyle, can be challenging. DNA methylation (DNAm) presents a promising opportunity for additional information, as it captures effects from both genetics and environment. Previous studies have identified associations and predicted the risk of breast cancer using DNAm in blood, however, these studies did not distinguish between genetic and environmental influences.
Results Pre-diagnosis blood samples were obtained from 32 monozygotic (MZ)and 76 dizygotic (DZ) female twin pairs discordant for breast cancer. DNAm was profiled using Illumina 450k and EPIC platform. Samples were collected at the mean age of 56 years (standard deviation (SD) =9.90), while the mean age at diagnosis was 66.76 years (SD=6.64). The mean age for censoring controls was 75.20 years (SD=9.45). To identify individual DNAm sites (Cytosine-phosphate-Guanine, CpG) and differentially methylated regions (DMRs) associated with breast cancer risk, survival analysis using paired Cox proportional hazard modeling was performed. Due to the paired modeling, shared genetic and environmental effects, age at entry, and age at sampling were controlled for.
We identified 212 CpGs (p<6.4*10^-8) and 15 DMRs associated with breast cancer risk across all pairs, with three DMRs overlapping with the individual CpGs. All but one of the 212 CpGs had lower DNAm in cases, suggesting a prevailing trend of hypomethylation of blood DNA prior to breast cancer diagnosis. Altogether 197/212 significant CpGs were also differentially methylated within the 32 MZ twin pairs discordant for breast cancer, suggesting that DNAm at these CpGs is likely independent of genetic effects. Prior research suggests that estrogen regulates at least five of the top CpGs identified. Hence, methylation at these sites may reflect individual differences in estrogen exposure.
Conclusion In conclusion, most of the identified CpGs associated with future breast cancer diagnosis appear to be independent of genetic effects. This suggests that DNAm could potentially serve as a valuable biomarker for environmental risk factors for breast cancer and may offer potential benefits as a complementary tool to current risk assessment procedures.
Background
Breast cancer (BC) risk assessment tools rely on physiological or genetic screening methods such as mammography, family history and information on reproductive history. In addition to assessment of major mutations, polygenic risk scores are increasingly used to assess overall genetic risk. However, accurately quantifying the impact of environmental factors, including health-related behaviors and occupational exposures, can be challenging, and these factors are often not included in the risk assessment models. In recent years, DNA methylation (DNAm) has emerged as a promising biomarker for BC risk assessment. Therefore, incorporating DNAm information into existing risk assessment models has the potential to improve their accuracy and effectiveness in identifying individuals at increased risk of developing BC.
BC has been linked to DNAm alterations in blood, as evidenced by specific DNAm sites (1–16) and overall average DNAm levels (17) associating with BC. In addition, the studies conducted by Kresovich et al. (2021,(4)), Xiong et al. (2022, (13)) and Chung et al. (2023, (18)) have shown that blood-derived DNAm can be used to predict an individual’s risk of developing BC. This presents an opportunity for blood-derived DNAm to serve as a complementary measure to current standard BC risk assessment tools (19,20). However, these predictors, as well as earlier work on DNAm and overall BC risk (2,4,6,11,13,14,16) have not differentiated between genetic and environmental risk factors for BC.
Health-related behaviors and environmental exposures are the main environmental factors that can increase the risk of BC (21), and many of these factors have been shown to affect DNAm as well; e.g. alcohol (22,23), obesity(24,25), physical inactivity(26), hormonal exposure (27), and reproduction-related factors (25,27–29). In addition, genetic variants, including those linked to BC risk, may affect DNAm.
Studies using discordant monozygotic (MZ) twins provide a valuable approach for investigating the impact of environmental factors, e.g., DNAm, on BC risk because the genetic components are fully controlled for in MZ twins, who are genetically identical at the germline. In contrast, genetic effects are partially controlled for in dizygotic (DZ) twin pairs, in which the twins share approximately 50% of their segregating genetic background. Moreover, the twin design effectively controls for all shared environmental influences between twin pairs, regardless of whether they are DZ or MZ twin pairs. Focusing on discordant twin pairs, in which one twin is diagnosed with BC while the other remains cancer-free, helps to minimize the influence of confounding factors. Therefore, the discordant twin design, functioning as a case-control design, dramatically boosts the statistical power and provide more accurate estimates to explore the association between BC risk and DNAm by controlling for shared environmental factors and genetic background. We can then attribute the observed associations to non-shared environmental factors, or in other words, within pair differences in exposure to known or unknown environmental risk factors for BC. In this study, our objective is to assess the potential impact of DNAm as a biomarker on environmental BC risk. To accomplish this, we leverage a BC discordant twin cohort, utilizing DNAm data collected prior to BC diagnosis in the Finnish Twin Cohort sample. Additionally, we aim to validate our findings by examining an independent BC discordant twin dataset from the Danish Twin Study.
Results
Discovery
In the discovery Model 1 (MZ and DZ pairs), 212 DNAm (Cytosine-phosphate-Guanine, CpG) sites were significantly associated (p < 6.4*10−8) with BC (Supplementary table 1, Figure 1A). Among these CpG sites, all except one (cg00550725, in gene CPNE3) showed negative association, as indicated by Hazard Ratio (HR) below one. The BC associated hypomethylated CpG sites had HRs ranging from 0.01 to 0.49, while the hypermethylated CpG site had an HR of 3.07. TDRD1 was the only gene with two significant BC associated CpG sites (cg14779973 and cg27547703).
Results on the survival modeling for individual CpG sites associated with breast cancer; A) Volcano plot of Model 1 (MZ+DZ) with CpG sites significantly associated with breast cancer marked in red; B) Comparison between Model 1 (MZ+DZ) and Model 2 (MZ) with significant CpG sites from Model 1 validated in Model 2 marked in red; C) Comparison between Model 2 (MZ) and Model 3 (DZ) with a regression line in blue (regression coefficient = 1.31, p=0.001); D) Comparison between Model 1 (MZ+DZ, Finnish data) and Model 2R (MZ+DZ, Danish data).
In addition to the 212 individual CpG sites, 15 DMRs significantly associated with BC in Model 1 (Supplementary table 2). Among these, three DMRs (in genes SCMH1, PXDNL and GNAS/RP1-309F20.3) contain CpG sites that are also significant as single hits in Model 1. Out of the 15 DMRs, 14 exhibit an average HR<1 and lower DNAm in the cases.
Validation in MZ data and comparison between the MZ and DZ model
To explore the significance of the findings in relation to environmental factors, we sought to validate the significant CpG sites from Model 1 using only the MZ twin pairs (Model 2). MZ pairs are fully matched for genetic background, but the sample size is smaller. Model 2 was fitted on the 212 significant CpG sites, hereby 197 CpG sites (93%) meet the Benjamini-Hochberg FDR for the association with BC (Supplementary table 1, Figure 1B). This indicates that the majority of the identified CpG sites are genuinely associated with environmental (non-genetic) risk factors for BC.
Model 3 (containing DZ twin pairs only) was also fitted for the 212 significant CpG sites identified, all passed the Benjamini-Hochberg FDR. Additionally it is to note that the 212 CpG sites had higher effect sizes in the Model 2 compared to Model 3 (regression coefficient 1.31, p = 0.001), suggesting that a higher level of genetic matching is associated with a stronger observed effect size (Supplementary table 1, Figure 1C).
Sensitivity analyses
Two sensitivity analyses were done (Supplementary Table 3). The 212 significant CpG sites have Harrell’s C indices ranging from 0.59 to 0.72 (mean= 0.65, standard deviation (SD) = 0.02), i.e. notably higher than the expected value of 0.5 for null effects. The E-values were high (E-value of 5.59 for cg00550725) or close to theoretical limits (for HRs <1), indicating that unknown or unmeasured covariates are unlikely to account for the association between the CpG sites and BC.
Comparison of the significant CpG sites with the Danish Twin study
The Finnish and Danish datasets shared 98 out of the 212 CpG sites identified in the Finnish analysis. The remaining sites were unavailable due to platform differences, with the Danish data solely based on the 450K platform. Out of these shared CpG sites 22 had the same effect direction in the Finnish data (Model 1 and Model 2) and the Danish data (Model 2R). However, the Danish data (Model 2R) did not yield any replication of Finnish data (Model 1 or Model 2) under Benjamini-Hochberg FDR (Supplementary table 4, Figure 1D).
Discussion
In this study 212 CpG sites and 15 DMRs associated with BC using a discordant twin design, matching for familial confounders. Among these DMRs are three that contain CpG sites with an individually significant p-value. These DMRs are in the genes SCMH1, PXDNL, and GNAS/RP1-309F20.3. Two individually significant CpG sites were found to be located in the gene TDRD1. The majority of the CpG sites (197 out of 212) were identified also in the model using MZ pairs only, which is fully matched for genetic confounding. This implies that these 197 CpG sites associate with BC independent of genetic factors and are likely attributed to environmental effects. In addition, significantly higher effect sizes were observed across the 212 CpG sites among the MZ pairs, compared to DZ pairs. No CpG sites was replicated in the independent dataset from the Danish twin study.
DNAm may change at specific CpG sites due to exposure to a BC risk factor. Several environmental and health related behavioral factors, such as alcohol consumption (22,23), obesity (24,25), physical inactivity (26), hormonal exposure (27), hormonal disruption (30), and multiple reproduction-related factors (25,27–29) have been found to associate with DNAm. It is important to note that these are also known risk factors for BC (21). Based on this, the association between DNAm and specifically environmental BC risk which was observed in this study could be hypothesized as a summary of exposure to these factors. Hereby, the twin with BC has potentially been exposed to a risk factor with greater extent compared to her co-twin who has not had BC, leading to within-pair difference in DNAm. However, determining the specific contribution of individual factor is not possible within the scope of this study. Interestingly, we identified several genes (TDRD1, SCMH1, PXDNL and GNAS), each containing multiple sites associating positively with BC, which have been shown to be regulated by estrogen (see Textbox 1). In addition, the only gene (CPNE3) that has a significantly hypermethylated CpG site in cases in our study, has been indicated to be under the regulation of estrogen in BC (Textbox 1). Identifying potentially estrogen regulated genes exhibiting differential methylation in relation to environmental factors supports our hypothesis that DNAm can summarize environmental exposure or disruption of hormones, which are known risk factors for BC (21).
Altogether 68 CpG sites identified in the current study Model 1 are located in the same genes that have been previously associated with BC through DNAm (2,4,6,11,13,14,16) (Supplementary table 5). In six of these seven studies, blood samples were collected before breast cancer diagnosis (average time to diagnosis reported only for five studies: 1.3 years (14), 3.8 years (11), 5.2 years (4), 5.6 years (6), and 7.2 years (2)), while one study obtained samples at the time of diagnosis, but prior to treatment (16). Two studies focused on case cohorts using time to diagnosis as a variable (4,11), while the other studies included both cases and controls (2,6,13,14,16). In each of the seven studies, the focus was on overall breast cancer risk, not distinguishing between genetic and environmental risk factors.
Among these 68 CpG derived from these seven studies, four CpG sites passed Benjamini-Hochberg FDR in the Finnish data of Model 1 (Supplementary table 6), with cg21769444 being one of the 212 discovery CpG sites. The effects sizes reported in the literature vary due to differences in analysis approaches, including fold change in DNAm between cases and controls, Hazard Ratio, and differences in DNAm between cases and controls. Notably, cg01259104/ANKLE2, cg04248461/DIP2C, and cg21769444/NUDT3 showed similar effect directions as in this study, while cg05375728/DAB1 exhibited an effect in the opposite direction.
Further, Chung et al. (2023, (18)) conducted a prospective EWAS on blood samples for breast cancer (BC), identifying 187 DMRs associated with future BC diagnosis. None of these DMRs replicated in our study, however, four individual CpG sites within five different DMRs replicated under Benjamini-Hochberg FDR in our Model 1 (Supplementary table 7).
Strengths and limitations
A notable strength of this study is that the DNAm data used for analysis was collected on average, almost 11 years prior to BC diagnosis, showing that DNAm associates with BC already well before the cancer is diagnosed. Hereby, DNAm has potential for identifying individuals at a high risk of developing BC at an early stage, which could be used to implement proactive interventions, such as more frequent screenings or targeted preventive measures. A further strength of our study lies in its comprehensive approach, accounting for both known and unknown factors that could confound DNAm analysis, including age, technical variation, and familial effects. Additionally, the sampling strategy implemented prior to diagnosis is a crucial advantage, as it effectively minimizes the potential influence of breast cancer as a disease and its treatment, thus significantly enhancing the reliability and validity of our findings.
Nevertheless, several potential limitations in this study should be noted. Firstly, the absence of data on BC subtypes could be crucial in refining the outcomes. Distinct BC subtypes require different treatments and have varying susceptibilities to environmental risk factors, mainly in the case of as hormone receptor-positive versus hormone receptor-negative BC (42). Secondly, the sample size of MZ twin pairs is small, which reduces statistical power. The inclusion of DZ twins can partly address this issue; however, it introduces bias based on genetic factors. Thirdly, replication in the Danish data may not be optimal due to differences in sample size and age at sampling. However, these limitations are beyond control as they depend on the data availability.
Conclusion
We demonstrate the presence of DNAm alterations in blood on average over 11 years before the actual BC diagnosis, likely independent of familial factors, like shared early life environment and germline genetics. Individual environmental exposures or de novo mutations are likely contributing factors that could explain this observation. The study identified associations between BC risk and DNAm of genes involved in BC biology, specifically of estrogen-related genes. Furthermore, our findings suggest that DNAm could be a promising addition to BC risk assessment toolset for identifying individuals who have a higher likelihood of developing BC due to environmental experiences and exposures. Our findings warrant future investigations in much larger prospective cohorts to clarify which environmental factors are most relevant, and associate with BC risk through DNAm.
Methods
The Finnish Twin Cohort
The Finnish Twin Cohort, consisting of individuals born before 1958, was established in 1975, recruitment was completed by May 1, 1976, and the followed-up period lasted until December 31, 2018. To obtain cancer diagnosis data during the study period, the Finnish Twin Cohort was linked to the Finnish Cancer registry. Information on death and emigration was obtained from the Finnish population registry. In the 1990s, blood samples were collected from a subset of individuals and DNA was extracted and stored in the Biobank of the Finnish Institute for Health and Welfare. DNAm data was subsequently generated from these samples.
A group of 108 pairs of female twins who showed discordance for BC at the end of the follow-up period were selected from among all pairs with DNA from the Finnish Twin Cohort 32 pairs were MZ, and 76 pairs were DZ (Table 1). Among the cases, BC was either their first or only cancer diagnosis, while the controls remained cancer-free during the follow-up. The follow-up period was considered to end for cases at the time of diagnosis and for controls either at death (n=18) or latest at the end of the study in 2018 (n=90).
The average age of study entry was 33.78 years (SD = 9.53 years) and the average age for blood sample collection was 56.00 years (SD = 9.90 years). For cases, the age at diagnosis was on average 66.76 years (SD = 6.64 years), and for controls, the end of follow-up was at an average age of 75.20 years (SD = 9.45 years). The mean time between blood sampling and diagnosis was 10.76 years (SD = 6.65).
Data on epidemiological risk factors for BC were obtained from a health-related questionnaire collected in 1975 (Table 2), while information on the number of children and age at first birth was obtained from the Finnish Population Register (Table 2) (46). The association between these variables and breast cancer discordance was examined using conditional logistic regression. None of these variables showed a significant association with BC discordance (Supplementary Table 8).
The Danish Twin Study
The Danish sample is selected from two Danish twin cohorts, including the Longitudinal Study of Aging in Danish Twins (LSADT) study and the Middle Age Danish Twins (MADT) study. Details about these twin cohorts are described previously in (47,48). We only included those twins that have both DNAm (1180 samples) and the information about BC diagnosis, which was retrieved through the link between the twin registry and the cancer registry in the NorTwinCan database (49). Eighty-six twins in LSADT were measured twice in 1997 and 2007, respectively, for their DNAm and we included only the early measurement in 1997 to increase the sample size for the analysis of pre-diagnosis DNAm. Among these twins, 11 twin pairs (eight MZ and three DZ pairs) that are discordant for BC and of whom the methylation was measured prior to the diagnosis were included in the analysis. The mean age at the diagnosis of these 11 twin pairs is 78.09 (SD=9.62 years) and the mean age at the DNAm profiling is 71.32 (SD=6.61 years). The end of follow-up was defined the same way as for the Finnish dataset.
DNA methylation data
Among the 108 pairs from the Finnish Twin Cohort, DNAm was measured for four pairs using the Illumina Infinium HumanMethylation450 (450k) platform and for 104 pairs the Illumina Infinium Methylation (EPIC) platform (Illumina, San Diego, CA, USA), (Table 1). Twins in a pair shared the same platform technology and were sampled and processed at the same time. Preprocessing of the DNAm data was done using the meffil R package (50). Sample quality was assessed based on the following criteria: (1) mean difference between X and Y (technical noise in female samples) chromosome signals was less than -2, (2) mean methylated signal did not deviate from the regression line (mean methylated signal linear regressed over mean unmethylated signal) by more than three standard deviations, (3) sample was not an outlier based on Illumina control probes, (4) percentage of probes with only background signal was less than 20, and (5) percentage of probes with less than three beads was less than 20. All samples that met all the above criteria passed the quality control. In addition, ambiguous mapping and poor-quality probes based on Zhou et al. (2017, (51)) and probes binding to sex chromosomes were removed to address ambiguity of the DNAm signal based on X-chromosome inactivation (52,53).
Following the quality control, the DNAm data underwent preprocessing in two separate batches due to use of two different types of microarray platforms, EPIC and 450K. The CpG sites shared across the platforms were preprocessed together, while the EPIC specific CpG sites were preprocessed separately using the same approach.
The preprocessing was performed by functional normalization using the first 15 principal components of the control probes, to eliminate unwanted technical variation using the meffil R package (50). Additionally, to reduce probe bias, beta-mixture quantile normalization (54) using the wateRmelon R package (55) was applied. Next, the CpG probes exclusively present on the EPIC platform were merged with the set of probes common between the EPIC and 450k platforms. Finally, the beta values were scaled based on the standard deviation of each CpG probe across all samples.
After performing quality control and preprocessing, a total of 52333 probes were removed due to their insufficient quality, and 9918 probes were removed due to their binding to sex chromosomes. A final set of 778861 probes were retained, consisting of 336849 probes (43%) shared by the 450k and EPIC data and 442012 probes exclusive to the EPIC platform.
In the Danish twin cohorts, DNAm was measured from the buffy coat samples stored at -80°C in 24 hours after the blood was collected using the HumanMethylation450 (450k) platform (Illumina, San Diego, CA, USA). Quality control for sample and probe exclusion were conducted with the MethylAid (56) and Minfi (57) R packages. More detailed steps of quality control and criteria for excluding samples and probes are described in the previous study (58). After further excluding any CpG sites that had a missing rate of >10% across the whole 1180 samples, 451,471 CpG sites remained in the survival analysis.
Survival modelling
To investigate the association between DNAm and BC risk, a survival analysis was conducted using Cox proportional hazard regression models in the R package survival (59). To ensure that all CpG sites meet the model assumptions, the proportional hazard assumption of time independence of the hazard for the within-pair difference in methylation beta value was tested using the cox.zph() function, and Schoenfeld residuals were examined for the significant CpG sites. For the analysis, four different models with the same survival term and covariates, but different test population were run (Equation 1).
Equation 1: Survival term (age at censoring | BC status) ∼ within-pair difference in methylation beta value + pairwise mean in methylation beta value + frailty term of pair identifier.
The survival term in this model refers to the survival outcome observed during the follow-up period. To adjust for the methylation levels specific to each twin pair, the mean beta values for each pair were included as a covariate. The pair identifier is considered as a random effect. Additionally, based on the study design the pairs were matched on potential confounding variables, such as chip platform, age at entry, age at sampling, sex, and early life environment.
The primary analysis involved 108 twin pairs that were sampled before BC diagnosis (Model 1). This was considered as the discovery model, with the significance threshold defined as Bonferroni p < 6.4*10-8, which corresponds to a p-value of 0.05 divided by the number of CpG sites (n = 778861). The statistically significant CpG sites were validated in zygosity-specific analysis of MZ pairs (n=32, Model 2) and DZ pairs (n=76, Model 3), to assess the role of genetic vs environmental effects among the significant methylation sites. The resulting p-values from the follow-up models (Model 2 and 3) were corrected by Benjamini-Hochberg procedure, and the significance was set to FDR < 0.05.
To replicate our significant findings in an independent twin sample, a comparable survival model was fitted using the Danish data of 11 twin pairs discordant for BC (Model 2R). The resulting p-values were corrected by Benjamini-Hochberg procedure, and the significance was set to FDR < 0.05.
Sensitivity analyses
To assess the robustness of the identified significant DNAm sites, a sensitivity analysis was conducted by performing a 10-fold cross-validated calculation of the Harrell’s Concordance index for within-pair difference in DNAm at the significant CpG sites. These results were compared with the concordance of the fitted covariates (60).
To evaluate the possible effect of unmeasured confounding variables on the association between DNAm and survival outcomes, E-values were computed for each significant CpG site using the Evalue R package (61,62). To simplify the E-values comparison of CpG sites with different effect (HR) directions, E-values are represented in a simplified manner. For CpG sites with HR<1, the E-values are provided as a fraction with the numerator set as 1.
DMR analysis
Differentially methylated regions (DMR) were identified using the ipDMR() function (63) of the ENmix R-package (64). Hereby neighboring CpG sites within a maximum distance of 500 base pairs were identified, and their combined p-values were calculated using the p-values derived from Model 1. Using a Benjamini-Hochberg FDR < 0.001, all significant pairs of CpG sites were selected. These significant CpG pairs were then merged into broader regions using an approach like single linkage clustering, where neighboring CpG sites within a maximum distance of 500 base pairs were linked together. The association between each broader region and the survival outcome was then calculated by summarizing the associations of all CpG sites present in such region, based on the individual CpG sites p-values derived from Model 1. A DMR was considered significant if it contained three or more CpG sites with unidirectional methylation association and had a Benjamini-Hochberg FDR < 0.001 combined across all CpG sites.
Data Availability
The Finnish Twin Cohort datasets utilized in the current study are stored in the Biobank of the Finnish Institute for Health and Welfare, Helsinki, Finland. The data is publicly available for use by qualified researchers through a standardized application procedure.
Declarations
Ethics approval and consent to participate
Informed consent was obtained before the beginning of the studies in 1975, and upon every contact with the study subjects. When clinical investigations were undertaken with sampling of biological material, written informed consent was obtained. Ethics approval of these procedures was provided in multiple studies, the last one on the transfer of biological samples to the THL Biobank by the Hospital District of Helsinki and Uusimaa ethics board in 2018.
The two Danish twin studies (LSADT and MADT) were approved by the Regional Committees on Health Research Ethics for Southern Denmark (S-VF-19980072), and written informed consents were obtained from all participants.
Availability of data and materials
The Finnish Twin Cohort datasets utilized in the current study are stored in the Biobank of the Finnish Institute for Health and Welfare, Helsinki, Finland. The data is publicly available for use by qualified researchers through a standardized application procedure (https://thl.fi/en/web/thl-biobank/forresearchers).
Competing interests
The authors declare that they have no competing interests.
Funding
This research was funded by the European Union’s Horizon 2020 Research and Innovation Programme, Marie Skłodowska-Curie (JK, grant number 859860). This project has received further funding from the Academy of Finland (MO, grant numbers 297908, 328685 & JK, grant 336823) and the Sigrid Juselius Foundation (MO and JK).
Authors’ contributions
All authors contributed to conceptualization and methodology. HB and LH contributed to the bioinformatics and statistical analyses. HB, MO and JK contributed to the interpretation. HB contributed to writing of the original draft; all authors contributed to reviewing and editing of the manuscript. JH, LH, JK and MO contributed to supervision of the project. All authors read and approved the final manuscript.
Acknowledgements
The authors thank the participants for their invaluable contribution to the study. The technical staff at the Finnish Twin Cohort stud and the Danish twin studies are acknowledged for their help in collecting the data. The authors also wish to thank Mia Urjansson and Teemu Palviainen, from the University of Helsinki, for their valuable help with the data collection.
List of abbreviations
- BC
- breast cancer
- CI
- confidence interval
- CpG
- Cytosine-phosphate-Guanine
- DNAm
- DNA methylation
- DZ
- dizygotic
- FDR
- false discovery rate
- HR
- Hazard Ratio
- LSADT
- Longitudinal Study of Aging in Danish Twins Study
- MADT
- Middle Age Danish Twins Study
- MZ
- monozygotic
- OR
- Odds Ratio
- SD
- standard deviation