ABSTRACT
In this study, we present the most extensive dataset of chromatin conformation data with matching gene expression and chromatin accessibility from primary T cells to date. We use this data to enhance our understanding of the different mechanisms by which GWAS variants impact gene regulation and revealing how natural genetic variation alter chromatin accessibility and structure in primary cells at an unprecedented scale. Capitalizing on this vast dataset, we refine the mapping of GWAS loci to implicated regulatory elements, such as CTCF binding sites and other enhancer elements, aiding gene assignment. Importantly, we uncover BCL2L11 as the probable causal gene within the RA locus rs13396472, despite the GWAS variants’ intronic positioning relative to ACOXL and we identify mechanisms involving SESN3 dysregulation in the RA locus rs4409785. Given these genes’ significant role in T cell development and maturation, our work is vital for deepening our comprehension of autoimmune disease pathogenesis and suggesting potential treatment targets.
INTRODUCTION
Genome wide association studies (GWAS) have uncovered a large proportion of the genetic basis underlying most common traits and diseases (Buniello et al. 2019), such as psoriatic arthritis (PsA) (Soomro et al. 2022). It is widely established that these genetic associations are implicated mainly in the regulation of genes rather than altering coding sequences directly (Farh et al. 2015; Finucane et al. 2015). Because regulatory elements perturbed by these variants can affect genes located far away from their genomic location with complex, cell-type specific, mechanisms (Nolis et al. 2009; Shlyueva et al. 2014; Kundaje et al. 2015; Simeonov et al. 2017; Alasoo et al. 2018), understanding the functional role of these variants involves significant effort and currently remains the main challenge in the field (Cano-Gamez and Trynka 2020; Orozco et al. 2022).
Many studies have used functional genomics techniques to study risk loci, both to identify the causal variants among the ones in strong linkage disequilibrium (LD), and to assign candidate target genes (Orozco 2022). These include using epigenomic markers to identify the variants that are likely to affect regulatory elements in disease relevant cell types (Kundaje et al. 2015; Robertson et al. 2021), eQTL methods (Schmiedel et al. 2018; Võsa et al. 2021), quantifying co-activation of enhancers and promoters (Delaneau et al. 2019; Boix et al. 2021) and using chromatin conformation capture methods to assign enhancers to genes (Martin et al. 2015; Mifsud et al. 2015; Cairns et al. 2016; McGovern et al. 2016; Mumbach et al. 2017; Choy et al. 2018; Montefiori et al. 2018; Greenwald et al. 2019; Miguel-Escalada et al. 2019; Ray-Jones et al. 2020; Shi et al. 2021; Ge et al. 2021). Most of these studies have, however, used cell lines or healthy lineages due to the difficulty of generating Hi-C libraries from primary cells and tissues, but there is extensive evidence that gene regulation and chromatin conformation is highly cell state specific (Schmitt et al. 2016; Yang et al. 2020; Iqbal et al. 2021). These studies also typically used small samples sizes due to the cost of these techniques, and the only report examining the effect of natural DNA variation on 3D chromatin structure used cell lines and a technique with low resolution (Gorkin et al. 2019). Finally, many of these studies used capture techniques to reduce cost, such as ChIA-PET (Fullwood et al. 2010), HiChIP (Mumbach et al. 2017) or capture Hi-C (Mifsud et al. 2015), which limit the view of the local chromatin structure.
Here we address these issues by generating the largest collection of high quality, high resolution, Hi-C maps in primary T cells using the commercial Arima Hi-C protocol, which allowed for higher quality and reproducibility whilst reducing the input material required, enabling the use of primary cells isolated from patients. We chose PsA as an exemplar disease as this autoimmune condition has a strong, complex genetic component and high heritability (O’Rielly and Rahman 2011; Veale and Fearon 2018). Additionally, the mechanisms underlying this disease are not very well understood and have not been the focus of previous functional genomics studies.
We interpret the results from our study using the latest understanding of how chromatin conformation is linked to gene regulation, validating recently published gene regulatory mechanisms (Rowley and Corces 2018; Hsieh et al. 2020; Krietenstein et al. 2020; Hua et al. 2021; Aljahani et al. 2022; Goel et al. 2023). We study how chromatin conformation varies across a population of individuals in different cell types and conditions and with genotype, at a significantly higher resolution to what has been previously achieved (Gorkin et al. 2019). Finally, we show how this rich dataset allows us to decipher the mechanisms by which disease associated variants increase the risk of developing disease by studying a selection of GWAS loci from a variety of immune conditions.
RESULTS
A compendium of functional genomics data in primary T cells
In this study, we present the largest dataset to date of matching chromatin conformation, chromatin accessibility and gene expression from freshly isolated primary CD4+ and CD8+ T cells from 55 PsA patients and 19 Healthy controls (Figure 1A). This consists of 108 Hi-C libraries (49 billion reads), 128 RNA-seq libraries and 126 ATAC-seq libraries. Uniform Manifold Approximation and Projection (UMAP) plots show that the samples separate by cell type, with the synovial fluid samples forming a separate cluster from the peripheral blood samples (Figure 1B-E). We do not detect a clear separation between samples derived from PsA patients and healthy controls (Figure 1B-E), but we see a slight separation by sex of the individuals (Supplementary Figure 1).
A) Summary of the study design. B) UMAP of RNA-seq data, counts from DESeq2 C) UMAP of the ATAC-seq data, counts from DiffBind. D) UMAP of the Hi-C loops data. E) Multidimensional Scaling plot (MDS) of HiCRep analysis of the Hi-C data. F) Pearson’s correlation between gene expression and insulation score of the bin overlapping promoter of the gene. Each dot is a gene. G) Pearson’s correlation between gene expression and loop strength. Left: comparison for loops that surround the gene vs loops that are nearby the gene. Right: comparison between loops that surround the gene with loops for which loop anchor overlaps the promoter of the gene. H) Pearson’s correlation between height of ATAC-seq peaks and loop strength. Left: Comparison between loops that surround the peak vs loops that are nearby the peak. Middle: Comparison between loops that surround the peak with loops for which loop anchor overlaps the peak. Right: Comparison between loops that are nearby the peak and loops that are more than 250kb away.
We identify chromatin loops using a recently developed loop caller specifically designed for high resolution contact maps, Mustache (Ardakany et al. 2020). We identify 69,303, 72,884, 30,415 and 12,000 loops from merged maps from CD4+, CD8+, CD8+ synovial fluid and CD4+ synovial fluid T cells respectively, which are then used for further analysis. As expected, 75% of these loops overlap a CTCF peak in at least one anchor, with stronger loops more likely to overlap a CTCF peak, although only 42% overlap a CTCF peak at both anchors (Supplementary Figure 2A-C). Additionally, we find that 89% of loops overlap a chromatin accessibility peak at one end, 75% of loops overlap one at both ends and 47% of loops overlap a transcription start site (TSS).
Although it is known that immortalized cell lines differ significantly from primary cells, it has never been shown how altered is the chromatin conformation. Here, we generated matching Hi-C libraries from two cell lines, Jurkat CD4+ T cells and MyLa CD8+ T cells to compare them to our primary T cells datasets. Previous studies have shown that topologically associating domains (TADs) are generally stable between different cell types and differentiation states (Dixon et al. 2015). In line with these findings, we find that around 95% of TAD boundaries from MyLa and Jurkat cells are present in their corresponding primary cells and 99% of TAD boundaries from primary cells are present in the cell lines. For reference, two primary T cell samples share around 97% of TAD boundaries. However, we can identify large differences when looking at the compactness of these TADs. We estimate, using an outlier detection technique (see methods), that 45% of the genome has significant differences in insulation score (Supplementary Figure 2D). We also see large differences when we analyse high resolution loops. We find that up to 20% of all loops are differentially interacting in cell lines compared to primary cells (Supplementary Figure 2E). These regions with altered chromatin conformation include genes that are important in T-cell function such as the ANKRD44, KLRF1, ITK and GZMB genes (Supplementary Figure 3-4) and regions that overlap disease associated GWAS loci such as the PsA loci rs11743851 (SLC22A5) or rs13203885 (FYN) (Supplementary Figure 5Error! Reference source not found.).
Chromatin conformation is highly associated with gene regulation
Next, we explored how chromatin conformation varies across the cell populations studied. Out of 105,956 loops, 29,454 were differentially interacting between CD8+ and CD4+ T cells (FDR<0.1), and out of 113,350 insulation score bins at 25kb, 41,614 bins have differential compactness. Out of the 86,720 combined loops, 15,597 were differentially interacting between CD8+ T cells isolated from blood and synovial fluid (FDR<0.1), together with 30,408 bins with differential compactness.
Although we could identify 166 (CD4+) and 38 (CD8+) genes differentially expressed between patients with active disease and remission and 110 (CD4+) and 437 (CD8+) genes between active disease and healthy (Table 1), we could not identify any significant differential looping or differential insulation between these subgroups. This is likely because the differences between these subgroups are significantly smaller compared to the differences between cell types (Table 1), and the techniques used did not allow for the identification of these subtle changes between heterogenous groups.
We next probed the interplay between gene expression and chromatin conformation. Our data shows a strong positive correlation between the insulation score of the chromatin domain and gene expression (Figure 1F). Expanding on this finding, we examined the correlation between gene expression and the strength of the chromatin loops. As expected, we observed a strong positive correlation when a loop overlapped the gene promoter on one of the anchors. Interestingly, loops that encompassed a gene promoter, but not at the loop anchors, also showed correlation, albeit to a lesser extent (Figure 1G). This indicates that indirect spatial relationships between promoters and loop anchors are associated with gene expression. Next, we included chromatin accessibility in the analysis. In this context, ATAC-seq peaks within a loop but that did not directly overlap the loop anchors exhibited a significant positive correlation with the loops, surprisingly even stronger than that observed for TSS (Figure 1H). This emphasizes the possible regulatory role of accessible chromatin regions, irrespective of their spatial relation to loop anchors, further extending our understanding of the chromatin conformation’s influence on gene expression. These findings collectively suggest a nuanced role for chromatin architecture in transcriptional regulation.
Delving deeper into specific regions, our data enables us to visualize intricate local shifts in chromatin conformation. For instance, within the IL7R gene region (Figure 2A), which is located in a large TAD which contains SPEF2 and CAPSL genes as well, we observed complex changes associated with increased interactions around the subTAD encompassing IL7R. This increase in compactness is associated with a decrease of interactions outside of this subTAD, but an increase of interactions upstream to an intron of SPEF2. We find that the chromatin accessibility peaks located in this region are correlated with the expression of IL7R, suggesting a complex orchestration of chromatin interactions which regulate gene expression. We find that these complex changes in local chromatin conformation are very common across many important genes and our data can help understand the mechanisms involved in the regulation of these genes.
A) Visualization of the region surrounding IL7R. Top: ATAC-seq peaks. Intensity of the colour indicates the average peak-height of the peak, whilst position across the Y axis indicates the pearson’s correlation with the expression of IL7R. Middle: Genes from Gencode v29. The red dots indicate the transcription start sites. Bottom left: Correlation between the Hi-C contacts and the expression of IL7R. Bottom right: Merged Hi-C map. B) Venn diagram showcasing the overlap between chromatin accessibility allelic imbalance results from CD4+ T cells, CD8+ T cells and merged dataset. C) Venn diagram showcasing the overlap between loop allelic imbalance results from CD4+ T cells and CD8+ T cells. D) Accessibility of CTCF site located at chr11:95578036 vs the genotype of rs4409785. E) Visualization of the region surrounding rs4409785. Top: ATAC-seq peaks. Intensity of the colour indicates the average peak-height of the peak, whilst position across the Y axis indicates the pearson’s correlation with the genotype of rs4409785. Middle: Genes from Gencode v29. The red dots indicate the transcription start sites. Bottom left: Correlation between the Hi-C contacts and the genotype of rs4409785. Bottom right: Merged Hi-C map.
Natural genetic variation is strongly associated with alterations in gene expression, chromatin accessibility and chromatin conformation
Building upon these findings, we proceeded to assess the impact of common genetic variants on gene expression, chromatin accessibility, and chromatin conformation. To this aim, we implemented insulation score QTL (insQTL) and loop strength QTL (loopQTL) in the context of chromatin conformation, to investigate the influence on the strength and insulation of topologically associating domains (TADs) and the strength of chromatin loops respectively. In conjunction, chromatin accessibility QTL (caQTL) was utilized to discern how genetic variation affects the functionality of enhancers and other regulatory elements and eQTL for gene expression. The number of significant loops, insulation score bins, chromatin accessibility peaks, and genes with a significant QTL are presented in table 2.
Next, we explored how these different QTL results intersect with one another. Initially, we compared the results from the same modality across the two cell types (CD8+ and CD4+ T cells). For eQTLs and caQTLs we found an overlap of 27.2% and 26.6% respectively, with 100% concordance in effect directionality for shared QTLs. In the case of insQTLs, a smaller portion of significant QTLs (∼19%) were shared, while for loopQTLs, the overlap further diminished to 6%, with a maintained concordance at 100%. This suggests that loopQTL discovery may be more challenging than eQTLs and caQTLs due to subtler changes in chromatin conformation. Additionally, these results indicate that, whilst our study is the largest of its kind for chromatin conformation and is among the largest for chromatin accessibility QTLs, the sensitivity remains significantly limited.
Continuing our investigation, we looked into how QTLs from different modalities overlapped each other. We could recover approximately 10% of eQTLs from the caQTLs that overlapped the promoters of the eQTL genes, with a concordance rate of about 91%. Similar levels of overlap and concordance were found between eQTLs and insQTLs and caQTLs and insQTLs. These findings are in line with the results of our previous section, corroborating the notion that stronger gene expression is associated with increased chromatin domain insulation.
When we considered the overlap between eQTLs and caQTLs with loopQTLs, the concordance dropped to 67% and 78% respectively, if we included all loops for which the promoter or the ATAC-peak was located between the two loop anchors. However, the concordance increased, reaching 85% and 90% respectively, when we included only the loops that overlapped the promoter or the ATAC-peak at loop anchors. This suggests that loops that directly overlap the promoter of the genes or the ATAC-peak are more likely to be positively correlated with the expression of the genes and the activity of the peak. Meanwhile, loops that don’t overlap the promoter or the ATAC-peak, whilst still significant QTLs, don’t necessarily have a positive correlation, indicating different mechanisms at play.
An alternative method to study the effect of a genetic variant with a molecular phenotype is allelic imbalance. This approach calculates the proportion of signal derived from each allele in heterozygous individuals and can also be helpful in pinpointing the causal variants within the ones in LD. Because a large proportion of GWAS SNPs are predicted to affect enhancers, we apply this technique to chromatin accessibility regions to identify enhancers altered by these variants. We identify 40,997 variants with allelic imbalance in chromatin accessibility (FDR<0.1) out of 243,664 variants overlapping open chromatin regions in the merged dataset (Figure 2B). We find that allelic imbalance produces slightly different results although we find that it is generally more sensitive than QTL analysis (see methods for more information). We could recover 38% of caQTL that directly overlap their feature, but caQTL could only recover 13% of significant allelic imbalance variants. For the variants that overlap between the two analyses, we find 98% concordance in their directionality. Additionally, we have applied this technique to our chromatin conformation data, identifying 2,764 loops with allelic imbalance across the two cell types (Figure 2C). In this case we find that only 5% of the loopQTLs have their associated variant within their loop anchors and the majority of the loops don’t have enough sequencing depth to call allelic imbalance significantly (albeit with 100% concordance in directionality for the significant ones). These results suggest that although the QTL and allelic imbalance strongly agree on the effect, they can be used to interrogate a different set of mechanisms and can be combined to increase our understanding of GWAS loci.
Unveiling the consequence of GWAS variants that affect the binding of CTCF at an unprecedent resolution
One of the most important proteins in the regulation of chromatin architecture is CTCF. For this reason, we searched for variants within CTCF binding sites in the allelic imbalance loops and loopQTLs. We find that 15% of loops with allelic imbalance have one variant located within a CTCF peak. We also find that 7% of loopQTLs and 10% of insQTL have at least one variant overlapping a CTCF peak. These results indicate that even though a majority of these structures are defined by CTCF related mechanisms, the way genetic variants affect these interactions is not primarily driven by alteration of the CTCF binding. Finally, we find that about 20% of GWAS loci have one variant overlapping a CTCF peak.
For example, in the RA locus tagged by rs4409785, we find that the lead SNP is a strong caQTL for the peak it maps to (FDR 5.52e-07) (Figure 2D), as well as being in allelic imbalance (FDR 1.01e-03). Previous studies have revealed that the risk allele (ALT) of this SNP creates a new CTCF binding site (Sadowski et al. 2019). Using our data we can visualize, for the first time, how the creation of this new CTCF binding site alters the local chromatin structure (Figure 2E). This SNP is located at the centre of a large chromatin domain, which contains the gene SESN3 and a large gene desert. The risk allele of rs4409785 is associated with the creation of two new loops, not visible in the merged high resolution Hi-C, which connect this element to the downstream boundary of the chromatin domain. This results in an increase of the contacts within this new subTAD, and a reduction of interactions between this region and the upstream region which includes the SESN3 gene. In this way, this reduces the ability of regulatory elements positioned downstream of rs4409785 to activate the gene SESN3. Public eQTL data shows that the risk allele of rs4409785 is associated with a reduction of expression of SESN3 (Võsa et al. 2021). This risk locus had been previously assigned to the CEP57 gene (Ishigaki et al. 2022). However, SESN3 could have an important role in arthritis. A recent study suggests that SESN3+ memory T cells, a subset not extensively studied before, may play a significant role in the development of arthritis or autoimmune diseases. It was found that these cells, which were identified in skin biopsies from patients with psoriasis, could differentiate into cytotoxic CD4+ T cells, potentially contributing to the pathogenesis of autoimmune diseases such as rheumatoid arthritis (Argyriou et al. 2022).
Another interesting example is the IKZF3/GSDMB/ORMDL3 GWAS locus, which is linked to many autoimmune diseases, including RA, Asthma, Systemic Sclerosis (SSc) and others (Ferreira et al. 2019; López-Isac et al. 2019; Ishigaki et al. 2022). Previous studies have localized the causal SNP as rs12936231 (Verlaan et al. 2009; Schmiedel et al. 2016; Sadowski et al. 2019), which creates a novel CTCF binding motif. We find in our dataset that this SNP is a strong caQTL for the ATAC-peak (p-value 5.02e-09 for CD4 and 2.38e-10 for CD8) (Figure 3A) and displays strong loop allelic imbalance (FDR 4.48e-25 for CD4 and 3.55e-33 for CD8) and multiple loopQTLs (one example plotted in Figure 3B). Although it is known that this novel CTCF binding site causes a change in chromatin looping, using our dataset, we can visualize the effect on this locus at an unprecedented detail. We find significant changes in the insulation of two contact domains (subTADs), associated with a reduction of interactions between them, whilst significantly increasing the interactions that go downstream of the new CTCF site (Figure 3C). Our allelic imbalance data also shows that other variants upstream of rs12936231, and in LD with it, have increased chromatin accessibility which indicates that this reduction of interactions results in an increase in activity of these enhancers. Together with eQTL data, which indicate upregulation of genes upstream of the CTCF motif (IKZF3) and downregulation of genes downstream (GSDMB, ORMDL3), we can conclude that this variant increases disease risk by disrupting the ability of upstream regulatory elements to contact these downstream genes.
A) Accessibility of CTCF site located at chr17:39872642 vs the genotype of rs12936231. B) Representative loopQTL with the genotype of rs12936231 in the region. Loop displayed is chr17:39872500-39977500. C) Visualization of the region surrounding rs12936231. Top: ATAC-seq peaks. Intensity of the colour indicates the average peak-height of the peak, whilst position across the Y axis indicates the pearson’s correlation with the genotype of rs12936231. Middle: Genes from Gencode v29. The red dots indicate the transcription start sites. Bottom left: Correlation between the Hi-C contacts and the genotype of rs12936231. Bottom right: Merged Hi-C map.
Another region with a similar mechanism is a locus associated with white blood cell count and neutrophil counts indexed by the SNP rs10424217. We find that the SNP in LD rs2353678, directly overlaps a predicted CTCF binding motif, with our data showing a strong increase in chromatin accessibility both through allelic imbalance (p-value 2.99e-16) and caQTL (FDR 2.37e-175) (Supplementary Figure 6A). Similarly, to the previous region, this new CTCF site increases the separation between two chromatin domains (Supplementary Figure 6B). In this case, this is associated with an increased expression of most genes in this region, including RPS16 (upstream), SUPT5H, EID2, EID2B, TIMM50 (downstream), but a decrease of expression of PLEKHG2 (upstream).
In summary, we show how some variants can affect the regulation of multiple genes, by affecting key regulatory elements that define chromatin architecture at an unprecedent detail and resolution. However, the majority of the GWAS loci do not act through this mechanism. We explore how our data can help us investigate GWAS loci with other mechanisms in the following section.
Uncovering the mechanisms by which GWAS loci affect gene regulation by altering enhancer function
Previous studies have indicated that the majority of GWAS loci affect enhancers elements (Farh et al. 2015). Motivated by these findings, we utilized our dataset to investigate the mechanisms involved behind this class of loci by first using chromatin accessibility allelic imbalance as a tool to functionally fine-map GWAS loci (supplementary tables). Our analysis led to the identification of one or more allelic imbalance SNPs in about a third of the loci. For instance, in the KLF13 psoriasis locus linked by rs28624578, out of the 4 SNPs in LD, 3 of them overlap chromatin accessibility regions. However, our data shows that only rs28510484 exhibits strong allelic imbalance (FDR 9.11E-36) with a 2.2 fold increase in chromatin accessibility. These SNPs are located in an intron of KLF13 and eQTLgen data (Võsa et al. 2021) show an increase in KLF13 expression associated with the risk (ALT) allele of rs28510484. Another example is the RAB2A-CHD7 SSc locus linked by rs685985, where the SNPs are located between RAB2A and CHD7. Here, although 5 of the 7 SNPs in LD overlap open chromatin regions, only two have significant allelic imbalance, with one of the two having significantly stronger allelic imbalance (rs612558 FDR 3.98E-40, rs686852 FDR 5.4E-3). Interestingly, although the risk (ALT) allele of rs612558 is associated with a 25% reduction in chromatin accessibility, eQTLgen data indicate increased expression of RAB2A.
Remarkably, certain regions also display alterations in chromatin conformation. The RA locus linked by SNP rs7943728 serves as an example, wherein the SNP in LD rs968567 is the only SNP that overlaps chromatin accessibility regions and showcases pronounced allelic imbalance accompanied by an 8.2-fold increase in chromatin accessibility (FDR 0) for the protective (ALT) allele. This SNP is a strong caQTL for 4 separate ATAC-peaks, indicating that the alternate form of rs968567 causes an increased activity of other regulatory elements in the locus as well (Figure 4A-D). This strong increase in the activity of these elements results in the increase in interactions in the local chromatin architecture, particularly those bringing the genes FADS2, FADS1, FADS3 and FTH1 in closer contact with the enhancer elements affected by rs968567 (Figure 4E). Moreover, this variant is correlated with an increase in the expression of FADS2, FADS1 (Figure 4F-G). These genes are crucial for the pathogenesis of RA as they play a key role in the biosynthesis of long-chain polyunsaturated fatty acids, like omega-3 and omega-6 (Hastings et al. 2001), known to influence inflammation and immune response mechanisms pivotal in the disease’s pathogenesis (Kostoglou-Athanassiou et al. 2020).
A) Chromatin accessibility of peak chr11:61816418 vs the genotype of rs968567. B) Chromatin accessibility of peak chr11:61827746 vs the genotype of rs968567. C) Chromatin accessibility of peak chr11:61834520 vs the genotype of rs968567.D) Chromatin accessibility of peak chr11:61871089 vs the genotype of rs968567. E) Visualization of the region surrounding rs968567. Top: ATAC-seq peaks. Intensity of the colour indicates the average peak-height of the peak, whilst position across the Y axis indicates the pearson’s correlation with the genotype of rs968567. Middle: Genes from Gencode v29. The red dots indicate the transcription start sites. Bottom left: Correlation between the Hi-C contacts and the genotype of rs968567. Bottom right: Merged Hi-C map. F) Expression of FADS1 with the genotype of rs968567. G) Expression of FADS2 with the genotype of rs968567.
Lastly, in the RA locus linked by rs13396472, only one SNP, rs13401811, shows strong allelic imbalance (FDR 1.86E-96), with a reduction of 40% in chromatin accessibility associated with the protective (ALT) allele. This SNP is also a strong caQTL for the overlapping ATAC-seq peak (p-value 1.43e-06, Figure 5A). This locus has been previous linked to ACOXL because the SNP in LD rs1554005, is a missense variant for ACOXL. Additionally, rs13401811 is located in an intron of ACOXL. However, the function of this gene is not especially intriguing for RA pathogenesis and is not expressed in T cells (less than 5 reads in our data) or other immune cells according to the DICE database (Schmiedel et al. 2018). Our chromatin conformation data reveals a loop connecting the enhancer affected by rs13401811 to the promoter of BCL2L11 a gene located more than 300kb downstream of the GWAS SNPs. In fact, the activity of this enhancer appears to be highly correlated with the strength of the loop (p-value 1.25E-15, R2 0.47, Figure 5B-C).
A) Chromatin accessibility of peak chr2:110858286 vs the genotype of rs13401811. B) Loop strength of the loop connecting the enhancer with the promoter of BCL2L11 vs the peak height of the enhancer. C) Visualization of the region surrounding rs13401811. Top: ATAC-seq peaks. Intensity of the colour indicates the average peak-height of the peak, whilst position across the Y axis indicates the pearson’s correlation with the peak-height of the enhancer chr2:110858286. Middle: Genes from Gencode v29. The red dots indicate the transcription start sites. Bottom left: Correlation between the Hi-C contacts and the peak height of the enhancer chr2:110858286. Bottom right: Merged Hi-C map. The green box indicates the loop that connects the enhancer with the promoter of BCL2L11. D) Proportion of double-positive T cells from mouse spleen with deleted enhancer region.
BCL2L11 has a critical role within the immune system, acting as a pro-apoptotic stimulator and modulating thymic negative selection. Knockout mice for BCL2L11 display progressive autoimmune disease (Bouillet et al. 1999). To further study the in-vivo effects of the altered enhancer activity we decided to create a deletion of this regulatory region (hg38: chr2:110843317-110860033) located in the ACOXL intron in mouse (mm39: chr2: 127755856-127771494). Region comparison using Ensembl illustrates that the whole region, containing the genes BUB1, ACOXL and BCL2L11, as well as the regulatory elements deleted, are highly conserved between humans and mouse (Supplementary Figure 7). Preliminary results show that in mice of approximately 7.5 months of age there is a marked increase in the presence of a population of CD4/CD8 double-positive splenic T cells in both heterozygous and knockout mice when compared with their wild-type equivalents (Figure 5D). This indicates a possible evasion of central tolerance mechanisms in T lymphocytes. A smaller region overlapping our targeted regulatory locus has been previously knocked out in mice (Hojo et al. 2019). showing that this enhancer specifically alters the expression of BCL2L11 in TCR-dependent activation of T cells, further strengthening our finding that rs13401811 increases the risk of developing RA by altering activation of BCL2L11.
DISCUSSION
In this study, we have produced the most extensive dataset to date combining chromatin conformation, chromatin accessibility, and gene expression data from primary T cells isolated from Psoriatic Arthritis (PsA) patients and healthy individuals. Our data allowed us to clearly distinguish T cell samples into the different populations. However, we found that chromatin conformation variations between patients and controls were subtle and hard to detect, in line with previous attempts in the field (González-Serna et al. 2023).
Our work corroborates the recent findings about gene regulatory mechanisms and the relationship between chromatin conformation and gene regulation (Rowley and Corces 2018; Hsieh et al. 2020; Krietenstein et al. 2020; Hua et al. 2021; Aljahani et al. 2022; Goel et al. 2023). In that regard, we observed a strong correlation between chromatin compactness, as measured by insulation score, and gene expression, alongside looping strength. Additionally, our dataset enabled us to discern subtle alterations in local chromatin conformation linked to genotype, gene expression, and chromatin accessibility peaks, thus revealing intricate dynamics that remain hidden in conventional merged data.
We also present the most extensive analysis to date on the impact of genetic variation on chromatin conformation at an unprecedent resolution. By using complementary results from chromatin accessibility and gene expression we illustrate how variants that increase gene expression and chromatin accessibility also increase chromatin compactness and have intricate relationships with chromatin loops. Our results, however, indicate that our sample size is not sufficiently large to detect all QTLs, highlighting the ongoing need for larger-scale studies in this domain.
Finally, we leveraged our data to functionally refine GWAS loci. The interpretation of risk variants from GWAS has often been a stumbling block in their application to drug development. Compared to other studies, ours provides the most comprehensive functional context around these variants and underlines the complex nature of the mechanisms by which GWAS loci can influence gene regulation. The majority of prior studies relied on simple chromatin loops or raw contact frequency to associate genes with enhancers (Martin et al. 2015; Mifsud et al. 2015; Javierre et al. 2016; McGovern et al. 2016; Mumbach et al. 2017; Choy et al. 2018; Montefiori et al. 2018; Greenwald et al. 2019; Miguel-Escalada et al. 2019; Ray-Jones et al. 2020; Ge et al. 2021; Shi et al. 2021). However, this approach poses several challenges. Firstly, the majority of chromatin loops observed in Hi-C maps are structural loops mediated by CTCF, which do not link promoters with enhancers. Moreover, the majority of chromatin conformation methods are bulk methods, which means that enhancer-promoter interactions that appear only in a subpopulation of the cells, or that are more transient (van Staalduinen et al. 2023), might be significantly harder to detect. In that regard, our correlation method allows for the identification of dynamic structures and changes in interaction frequency that are not visible in the merged Hi-C map, such as those depicted in the FADS1/2 locus. Other techniques, for example those based on Micro-C (Krietenstein et al. 2020; Aljahani et al. 2022; Goel et al. 2023) have lower background noise compared to Hi-C but are significantly more challenging technically, requiring large number of cells, and, to our knowledge, no study to date has been published that uses primary cells. Moreover, difficulties in the reproducibility of the libraries limit the kind of correlation and QTL analysis that have been presented here. Secondly, focusing on enhancer promoter interactions overlooks the exploration of other mechanisms by which genetic variants impact gene regulation and chromatin conformation, such as the CTCF binding altering variants presented in our study. These regions represent just the beginning of understanding the intricate mechanisms at play, such as the ability of specific variants to affect multiple genes at once and for regulatory elements to affect genes across TAD boundaries. This highlights the need to move beyond strict, static definitions of TADs, loops, and correlations in assigning genes to GWAS loci. Additionally, we identified notable differences between primary cells and cell lines, which suggest that caution needs to be taken when using cell lines to draw conclusions.
The mechanisms and genes we identified here through our dataset, such as BCL2L11 and SESN3, present promising targets for novel, more effective therapeutics, and aid in patient stratification. New treatments that have genetic support are twice as likely to be approved (King et al. 2019). BCL2L11, is a key player in the apoptosis pathway and fundamental for T cell maturation (Bouillet et al. 1999; Hojo et al. 2019). Methotrexate, a widely used drug in rheumatology, has an inhibitory effect on dihydrofolate reductase (DHFR) which can influence various pathways, including those involving apoptosis where BCL2L11 plays a significant role. In this context, our findings about BCL2L11 regulation or expression could provide novel avenues for future research, including the exploration of how variations in BCL2L11 regulation might affect therapeutic outcomes or the development of new treatments that target BCL2L11 directly. Another breakthrough from our study is the identification of the potential mechanism by which the RA locus rs4409785 affects the expression of SESN3, a gene that plays a pivotal role in a specific subset of T cells implicated in the pathogenesis of autoimmune diseases (Argyriou et al. 2022). This discovery suggests the potential for novel therapies that target these cells directly.
In conclusion, our research elucidates the complex dynamics of chromatin conformation and gene regulation, offering potential implications for disease understanding and therapeutic development. We trust that our dataset will be a valuable resource for researchers in this field. To facilitate further research, we have made available all pre-processed data, including precomputed chromatin conformation maps, correlations with gene expression, chromatin accessibility, genotype, QTL datasets and code to replicate our analysis and annotate further GWAS results at http://bartzabel.ls.manchester.ac.uk/orozcolab/SNP2Mechanism/
ONLINE METHODS
Sample collection and cell isolation
Between 40ml and 100ml of peripheral blood was collected following informed consent from 55 PsA patients and 19 healthy controls and processed on the same day. Patients and healthy controls were recruited under the National Repository study ethics approval number IRAS 32231, MREC 99/8/084. PBMCs were isolated using Ficoll density gradient centrifugation. EasySep Human CD4+ and CD8+ T cell negative selection kits (StemCell technologies #17952 and #17953) were used to isolate CD4+ and CD8+ T cells respectively. Cells were then aliquoted and prepared for long-term storage separately for each technique used.
Synovial fluid extracted from patients with PsA was first treated with hyaluronidase enzyme (Sigma Aldrich #H3884) and then processed in a similar way as the peripheral blood samples.
To confirm that there was no sample mislabelling we checked the concordance of the genotype with the bam file by using the QTLtools v1.3.1 mbv function (Fort et al. 2017). To identify mismatches between CD4+ T cells and CD8+ T cells we used the gene expression of the CD4 and CD8A genes for the RNA-seq data, the chromatin accessibility of the promoters of CD4 and CD8A for ATAC-seq, and the loop strength of the loops surrounding the CD4 gene for the Hi-C data. Sample mismatches were then corrected in the metadata files.
Hi-C library preparation and sequencing
Approximately 3-5 million cells were crosslinked in 2% formaldehyde for 10 minutes at room temperature and the reaction was then quenched using a solution glycine + BSA. Samples were then washed in PBS and snap frozen on dry ice and then transferred to −80°C for long-term storage.
2 million cells were then used for library preparation following the two-restriction enzyme Arima Hi-C kit (Arima Genomics) and the KAPA HyperPrep kit (Roche #KK8504) manufacturer’s protocol using Illumina TruSeq dual indexes.
Library size was checked by TapeStation 4200 (Agilent) and QC was done by quantstudio (Life technologies) using the NEBNext® Library Quant Kit for Illumina (E7630). Sequencing was performed on the NovaSeq6000 platform using NovaSeq S4 flow cells at a target depth of 350 million reads per library and generating 150bp paired-end reads.
RNA-seq library preparation and sequencing
Approximately 500 thousand to 1 million cells were lyzed and stored in 700uL Qiazol lysis reagent (QIAGEN, ref: 79306). To isolate RNA, 140 uL of chloroform was added. After centrifugation at 12000 x g for 15 min, approximately 350 uL of the upper layer containing the RNA was transferred and mixed with 350 uL of 70% ethanol. RNA isolation was continued from this point using the RNeasy MinElute kit (QIAGEN, ref: 74204) protocol. RNA libraries were then generated by NovoGene UK using a custom protocol based on cDNA synthesis using random hexamer primers and polyA mRNA enrichment. Sequencing was performed on the NovaSeq6000 platform using NovaSeq S4 flow cells at a target depth of 40 million reads per library and generating 150bp paired-end reads.
ATAC-seq library preparation and sequencing
Approximately 500 thousand cells were frozen in Recovery™ Cell Culture Freezing Medium (Thermofisher #12648010) and stored at −80°C. An aliquot of 50 thousand cells was then used for ATAC-seq library preparation following the Omni-ATAC protocol as released by the Kaersten lab (Corces et al. 2017; Amanda Ackermann 2019).
Library size was checked by TapeStation 4200 (Agilent) and QC was done by quantstudio (Life technologies) using the NEBNext® Library Quant Kit for Illumina (E7630). Sequencing was performed on the NextSeq2000 platform at a target depth of 50 million reads per library and generating 50bp paired-end reads.
Hi-C data processing, loop calling and clustering
Hi-C reads were quality filtered and trimmed and adapters were removed using fastp v0.20.1 (Chen et al. 2018). Reads were then processed and mapped to the GRCh38 genome using HiC-pro v3.0.0 (Servant et al. 2015), using default settings. Hic juicebox files for visualization and analysis were generated using the hicpro2juicebox.sh script using juicer tools v1.22.01. Cool files were generated from the juicebox files using hic2cool (https://github.com/4dn-dcic/hic2cool). TAD calling was done using onTAD (An et al. 2019) with default settings and ICE normalized 40kb resolution maps. Visualizations were made using a modified version of the python package coolbox (Xu et al. 2021) or through custom scripts available at (#####). The Hi-C processing pipeline is available on GitHub (https://github.com/ChenfuShi/hic_master_pipeline).
Loop calling was performed on merged libraries by cell population (CD4+, CD8+ and CD8+ synovial fluid) using Mustache (Ardakany et al. 2020) at a resolution of 2.5kb and SCALE normalization. This tool allows the detection of locally enriched loops (similar to HICCUPS (Rao et al. 2014)) rather than defining a model of contact decay based on genomic distance (such as FitHiC or CHiCAGO (Cairns et al. 2016; Bhattacharyya et al. 2019)) but it has shown to have improved sensitivity, although the number of loops identified strongly correlates with the depth of the contact maps. A common list of loops was obtained by merging the loops lists and removing duplicates. Counts for each loop from each sample were extracted at 5kb resolution and including the surrounding pixels, in such a way that counts were extracted from a 15kb-by-15kb region around the centre of the loop. These counts were extracted from SCALE normalized maps and library normalization between samples was computed by extracting the expected cis counts based on genomic distance decay and multiplying the counts by the ratio between the sample’s decay distribution with a reference sample. This distance decay function was calculated using cooltools v0.5.1 (Venev et al. 2022) api function “expected_cis” at a 10kb resolution and calculated for each chromosome. Except these normalization steps, no batch effect corrections were found to be necessary or influenced the results significantly.
Clustering of Hi-C libraries was performed using two separate methods. The first method is based on the the loops called. Normalized counts were simply treated as data matrix and standard scaled, after which a UMAP transformation (McInnes et al. 2018) was applied to the matrix to obtain the 2 component representation of the data. Alternatively, a PCA transformation could also be applied obtaining similar results. The second method was based on HiCrep (Yang et al. 2017). We used a fast python implementation (Lin et al. 2021) which allowed the computation of pairwise distances across all samples. Distances were calculated using a bin size of 50kb, a smoothing factor h of 5, and a maximum genomic distance of 5mb. We then applied MDS scaling on the pairwise distance matrix to generate the 2D plot in the results. Results are visualized for chromosome 1.
RNA-seq data processing and analysis
RNA-seq reads were quality filtered and trimmed and adapters were removed using fastp v0.20.1 (Chen et al. 2018). Counts per transcript were generated using salmon v1.6.0 (Patro et al. 2017), with a decoy aware transcriptome (Gencode hg38 v29) and the validate mappings option enabled. Counts were then aggregated per gene and differential analysis was performed using DESeq2 (Love et al. 2014). Genes with average counts lower than 5 were removed. In all cases, unless otherwise specified, sex was used as a covariate. For other analysis, normalized counts were extracted from DESeq2 and used. No batch effect correction was found to be necessary.
ATAC-seq data processing and analysis
ATAC-seq reads were quality filtered and trimmed and adapters were removed using fastp v0.20.1 (Chen et al. 2018). Reads were then mapped using bowtie2 v2.4.4 (Langmead and Salzberg 2012) using the very-sensitive pre-set. Duplicates were removed using the MarkDuplicates functionality of picard tools (Institute 2019) and mitochondrial reads were removed, as well as removing low quality alignments (MAPQ<30) and keeping only properly paired reads. Coverages were generated from cleaned alignments using the bamCoverage functionality from DeepTools2 (Ramírez et al. 2016). Peaks were called using macs2 v2.2.7.1 (Zhang et al. 2008). The ATAC-seq processing pipeline is available on GitHub (https://github.com/ChenfuShi/ATAC_ChIP_pipeline).
Data for all samples was then analysed in R using DiffBind (Stark and Brown 2011). A consensus peakset was generated using peaks that were present in at least 5 samples. Peak size was set to 500bp. Cross library normalization was performed using the DESeq2 normalization method (DBA_NORM_NATIVE) but using only reads in peaks (DBA_LIBSIZE_PEAKREADS), as different samples had significantly different proportion of reads in peaks, which would bias the default normalization method. Counts were extracted at this point for further analysis, except for differential binding analysis, which was performed within DiffBind, using the DESeq2 method. No batch effect correction was found to be necessary. We found that quantile normalization slightly improved results, such as separation of cell types on the PCA plot and correlation between expression and peak height of promoter peaks, and as such we used quantile normalized counts for all downstream analysis.
We have also used the ATAC-seq data to call genotypes for the samples for which we did not have Hi-C data. This was done using a similar method as the Hi-C data using GLIMPSE v1.1.1 (Rubinacci et al. 2021).
Hi-C loop differential analysis
We developed a novel method based on linear regression to perform differential analysis from a large dataset of Hi-C data. All previous tools failed to compute the large number of samples used in our experiment. We extracted and normalized counts as described in the previous sections using a distributed system across our computational cluster. For each loop we then applied an ordinary least square regression using the python package statsmodels (Seabold and Perktold 2010). In the comparisons presented here the gender of the individual was set as a co-variate. Adjusted p-values were calculated using FDR correction (Benjamin-Hochberg) as implemented in the statsmodels package. A permutation run of the data (shuffling the cell type labels) resulted in less than 3 significant differential loops.
Differential insulation score analysis
We calculated insulation score using the cooltools v0.5.1 (Venev et al. 2022) api function “insulation” using SCALE normalized matrixes at 25kb resolution and a window size of 100kb. Insulation scores were then corrected across libraries using quantile normalization. Regions with differential insulation score were then identified using a linear regression model using the gender of the individual as a co-variate using the python package statsmodels (Seabold and Perktold 2010). Adjusted p-values were calculated using FDR correction (Benjamin-Hochberg) as implemented in the statsmodels package.
Identification of loops and insulation bins altered in cell lines
To quantify how altered the chromatin conformation is in cell lines compared to primary cells we generated high quality Arima Hi-C libraries for two cell lines typically used as proxy for T cells. CD4+ T cells we used Jurkat E6-1 cells and for CD8+ T cells we used MyLa cells (Sigma-Aldrich, 95051033). Library generation was carried out using the same protocol as the primary cells. Sequencing was carried out to a final depth of 785 million reads for Jurkat cells and 385 million reads for MyLa cells and then processed using the same methodology as primary cells.
Loops and normalized counts were extracted as described in the previous sections. Because we had multiple replicates for the primary samples but only one replicate for the cell lines we had to test for differences in a different way. For each loop we calculated the mean and standard deviation of this loop from our dataset, and then tested if each sample was considered an outlier compared to the distribution of the counts for all samples. In practice this is calculated by calculating the z-score for each sample, then converting this to a p-value. The loop was considered an outlier if the unadjusted p-value was lower than 0.05.
Similarly, we carried out the same analysis for the insulation score. Insulation scores for all samples were pre-processed as described in the previous sections. Then for each genomic bin we calculated the mean and standard deviation of this bin in our dataset and then tested if each sample was considered an outlier in the same way as the loops.
Loops correlation with gene expression and chromatin accessibility
Normalized counts for each loop, gene and ATAC-seq peak was extracted as previously described. Pearson’s correlation coefficient was calculated using the pearsonr function from scipy (Virtanen et al. 2020) between peak height or gene expression and interaction strength for each pairwise comparison for each sample. Because only loops that vary between the samples can show correlation, we only show the results for loops that are differentially interacting between CD4+ and CD8+ T cells (FDR<0.01). To limit computation time, we also only show results for chromosome 1, but similar results can be shown for any other chromosome. Additionally, we only test genes that are expressed in our samples (mean count > 5), loops that have a mean interaction strength of at least 10, and peaks with at least mean accessibility of 10 reads. The background used for peaks and genes all elements within 4 mb from the ends of the loop tested.
Insulation score correlation with gene expression
Normalized insulation scores and counts for each gene were extracted as previously described. We then selected the genes that were differentially expressed (LFC>1, FDR<0.1) between CD4+ and CD8+ T cells and correlated their expression level with the insulation score of the bin which overlapped the promoter of the gene. Finally, the Pearson’s correlation coefficient was calculated using the pearsonr function from scipy (Virtanen et al. 2020).
CTCF tracks
Imputed CTCF tracks for primary CD8+ and CD4+ T cells were obtained from ENCODE (ENCFF148DDI and ENCFF811QTU). These have been imputed using a multi-scale deep tensor factorization method, Avocado (Schreiber et al. 2020a, 2020b). To obtain peaks, we applied a threshold to the signal of 5, corresponding to genome-wide significance of a p-value of 10−5. Overlaps with loops and SNPs were obtained using pybedtools (Dale et al. 2011) and bedtools v2.30.0 (Quinlan and Hall 2010).
Identification of loops with allelic imbalance
To identify loops displaying allelic imbalance we first had to generate phased genotypes for each individual. To obtain genotypes we first mapped all Hi-C reads for each individual to the GRCh38 genome using bwa-mem v0.7.17 using the settings -SP5M. Genotypes were then called using GLIMPSE v1.1.1 (Rubinacci et al. 2021) and the 1000 genomes phase 3 reference, making sure that all known restriction sites were excluded from the first step of the genotype calling. Genotype phasing was then carried out using the integrated phasing pipeline (Bansal 2019) which integrates population phasing using SHAPEIT2 (Delaneau et al. 2014) with reads phasing using the Hi-C using HAPCUT2 (Edge et al. 2017), resulting in around 2 million heterozygous SNPs phased per sample. We then reprocessed all Hi-C reads using Hi-C pro to a masked GRCh38 genome (all known SNPs masked with Ns) to reduce mapping bias. Aligned reads for each library were then split using SNPsplit v0.5.0 (Krueger and Andrews 2016) generating allele specific alignments. Counts for each allele for each significant loop (with a slop of 10kb) were then calculated using bedtools pair_to_pair (Quinlan and Hall 2010) and data was integrated in Python 3.9. Allelic imbalance of the reads for each loop that overlapped a SNP was tested using a one-sided binomial test (each SNP was tested twice for both directions separately) as implemented in scipy (Virtanen et al. 2020) for each sample which was heterozygous for that SNP. In addition, we filtered SNP/loop pairs so that at least 2 samples were heterozygous and phased for the SNP, at least 10 reads were present for each SNP and at least 1 read for each allele had to be present. All the resulting p-values were then merged using the Firsher’s method for meta-analysis as implemented in scipy (separately for each directionality). Resulting p-values were then corrected using Benjamini-Hochberg method as implemented in statsmodels (Seabold and Perktold 2010).
Identification of variants with allelic imbalance in ATAC-seq
To identify variants displaying allelic imbalance in chromatin accessibility we first reprocessed all samples by mapping them to a masked GRCh38 genome (all known SNPs masked with Ns) to reduce mapping bias. For each sample we then identified the sites which contain variants using the bcftools v1.15.1 commands mpileup and call. Specifically, we used the following settings: mpileup -q 10 -I -E -a ‘FORMAT/DP,FORMAT/AD’ --ignore-RG - d 10000, call -Aim -C alleles. Additionally, we supplied the GRCh38 fasta references as well as the 1000 genomes sites, to limit the calling to known variants. Next, we filtered variants that had a read depth of at least 10 and had at least 2 reads from each allele or 5% of the reads, whichever was higher (REF and ALT). Read depth for each allele was extracted from the INFO field which contains the number of high-quality reads for the REF and the ATL alleles. Allelic imbalance for each SNP was tested using a one-sided binomial test (each SNP was tested twice for both directions separately) as implemented in scipy (Virtanen et al. 2020) for each sample. All the resulting p-values were then merged using the Firsher’s method for meta-analysis as implemented in scipy (separately for each directionality). Resulting p-values were then corrected using Benjamini-Hochberg method as implemented in statsmodels (Seabold and Perktold 2010).
Overlap with caQTL was done by first selecting the caQTLs for which the variant is directly overlapping the target ATAC-peak. Then overlaps were carried out by matching the rsID of the variant. Finally, we compared the directionality of the allelic imbalance with the slope of the caQTL for the variants that were significant both for allelic imbalance and for caQTL. Of note, a partial explanation for the limited overlap between caQTLs and allelic imbalance SNPs is the fundamental difference in the processing of the data. In particular, we found the following limitations:
- To compare read counts for ATAC-peaks across samples, a consensus peak set needs to be generated first. This limits the extent of the size of a promoter or enhancer to 500bp. Moreover, multiple enhancers within a certain size will be merged into one enhancer. We find that in some cases, we can split larger enhancers into multiple smaller sections, and these would correlate with genotype, whilst the overall called peak would not.
- Allelic imbalance can only be calculated if there are enough reads directly overlapping a SNP to support a heterozygous call.
- However, allelic imbalance intrinsically corrects for batch effects and other biological variability, as the imbalance is called from within the same sample.
Genotyping array
Genotyping was carried out using the Illumina Infinium HumanCoreExome 24 BeadChip kit (Illumina, San Diego, California, USA). 250 ng of DNA was used, according to the manufacturer’s guidance. Genotype calling was carried out using GenomeStudio software (Illumina, San Diego, California, USA). Standard QC was conducted on each individual array using PLINK v1.9 (Purcell et al. 2007): SNPs and samples were excluded if there was >2% missing data, and SNPs with MAF□<□0.01 and Hardy Weinberg Equilibrium (HWE) p□<□1□×□10−4 were also excluded. Population stratification adjustment was done using HapMap 3 reference panel to determine genetic ancestry of each individual, followed by Principal Component Analysis (PCA) analysis. Full genotype imputation was carried out using HRC v1.1 reference panel on the Michigan Imputation Server (Das et al. 2016). Following imputation SNPs with imputation r2 score lower than 0.5 were removed.
QTL analysis
Quantitative Trait Locus (QTL) analysis was performed to investigate the genetic basis of molecular phenotypes. The cis QTL analysis was carried out using QTLtools v1.3.1 (Delaneau et al. 2017), a versatile software that allows the use of arbitrary molecular phenotypes. We utilized genotypes called from the Hi-C data and the ATAC-seq data due to their lower error rates and fewer missing samples compared to those imputed from the genotyping array, although results were similar to the ones done using the array genotypes. Variants were filtered based on a minor allele frequency (MAF) greater than 0.05 in our samples.
For each molecular phenotype, samples that were identified as outliers through principal component analysis (PCA) were eliminated for each technique and condition. Four types of molecular phenotypes were investigated: gene expression QTL, chromatin accessibility QTL, insulation score QTL, and loop QTL.
Gene expression QTL analysis was conducted using gene expression data normalized with DESeq2 (Love et al. 2014) and log2 transformed. Chromatin accessibility QTL analysis was performed using counts from DiffBind (Stark and Brown 2011), which were subsequently quantile normalized and log2 transformed. Insulation score QTL analysis was conducted using insulation scores calculated as described previously and quantile normalized. Finally, loop QTL analysis was performed using loop counts extracted as described previously, followed by log2 transformation.
All QTL analyses were executed using the --normal parameter and tested regions within 1mb of the variant. The number of PCA components to be used as covariates for genotype and phenotype for each modality was determined by conducting a permutation pass on chromosome 1. The combination that yielded the most significant QTLs was then selected for further analysis. To calculate corrected p-values, we performed a permutation pass with 1000 permutations. However, corrected p-values are only calculated on the top hit. We find that for all our modalities an equivalent uncorrected p-value to a corrected p-value of 0.1 is approximately 1*10-4. For the calculation of the overlaps between the modalities, we decided to use this latter p-value as the top hit would rarely overlap between the different modalities.
For the overlaps between the same modality, we merged based on selecting one condition (CD4 or CD8) as the reference, which is filtered by corrected p-value of 0.1. using this as the lead signal for that eQTL we then searched the other condition for the same variant-phenotype pair at an uncorrected p-value of 1*10-4. We find that the concordance of the directionality is maintained at an uncorrected p-value of 1*10-3. For overlaps between eQTLs and caQTLs, we matched the ATAC-seq peaks that overlapped the TSS sites for the genes in the eQTL dataset. For overlaps with insQTLs we matched the insulation score bin to the position of the TSS or the ATAC-peak. For overlaps with loopQTLs we matched the TSS and the ATAC-seq peaks that overlapped either within the loop anchors or within the actual loop anchors, based on the description in the main text. When reporting results, we ran both CD8+ and CD4+ separately, then reported the average result between the two.
Since our dataset is relatively small compared to other publicly available eQTL datasets, we have integrated the eQTLgen dataset (Võsa et al. 2021) when discussing our results, where stated in the text.
GWAS datasets and linkage disequilibrium SNPs identification
Leads SNPs for selected GWAS studies were downloaded as follows: PsA (Soomro et al. 2022), Ps (Tsoi et al. 2017), SSc (López-Isac et al. 2019), atopic dermatitis (Paternoster et al. 2015) and rheumatoid arthritis (Ishigaki et al. 2022), JIA (López-Isac et al. 2021).
SNPs in high linkage disequilibrium (R^2 > 0.8) with the lead SNPs were identified using plink v1.90b3.39 on the 1000 genomes data v3 with population set to EUR.
Generation of enhancer deletion mouse
Two sgRNA for the upstream cut site (g483 – ggttacagtccgttaccgcc, g484 – tacgttcttcctggcggtaa) and two sgRHA downstream (g324 – gtgtgacatgagttgtcact, g325 – catgagttgtcactgggatt) were used as guides for the CRISPR-Cas9 deletion system. sgRNA were designed with stringent criteria for off target predictions (guides with mismatch (MM) of 0, 1 or 2 for elsewhere in the genome were discounted, and MM3 were tolerated if predicted off targets were not exonic according to http://www.sanger.ac.uk/htgt/wge/). The sgRNA were purchased as full length Alt-R sgRNA (IDT; Coralville, USA) and resupsended in sterile, RNase free injection buffer (TrisHCl 1 mM, pH 7.5, EDTA 0.1 mM). 1 ug each sgRNA were combined with 1 ug EnGen Cas9 (New England Biolabs). A final embryo injection mix (concentrations; each sgRNA 20 ng/ul, Cas9 protein 80 ng/ul) was directly microinjected into C57BL6/J (Envigo) zygote pronuclei using standard protocols. Zygotes were cultured overnight and the resulting 2 cell embryos surgically implanted into the oviduct of day 0.5 post-coitum pseudopregnant mice. Potential founder mice were identified by extraction of genomic DNA from ear clips, followed by PCR using primers that, upon an NHEJ mediated deletion event, are brought into proximity and generate an amplicon (Geno F1 GCTCTGCCACACTTGCTTAG, Geno R2 AACTGCACACGTGAACATGT). Sanger sequencing of the amplicon and alignment to the mouse genome verified the enhancer deletion in each pup.
DATA AVAILABILITY STATEMENT
All processed data, differential gene expression and chromatin accessibility analysis, QTLs and allelic imbalance datasets are available online at:
http://bartzabel.ls.manchester.ac.uk/orozcolab/SNP2Mechanism/
All the code to reproduce the results presented here is available in the same webpage and on the GitHub repository:
https://github.com/ChenfuShi/PsA_cleaned_analysis
Precomputed Hi-C correlation maps with gene expression, chromatin accessibility and various variants are available at:
http://bartzabel.ls.manchester.ac.uk/orozcolab/SNP2Mechanism/PsA_output_hic_plots/main.html
Raw reads are available upon reasonable request.
CONFLICTS OF INTERESTS
The authors declare no conflict of interest.
CONTRIBUTIONS
Example: DZ, SR, AF, JD, CF, CW, RH, ER and MG collected and processed the samples. CS, DZ, SR, AF, JD, ER, MG and AA designed and performed the experiments. CS and CY analysed the data. DP, PM, SE, JB, AB, PH, MR and GO acquired the funding and oversaw the project. CS and GO conceived the work, generated the figures, and wrote the manuscript. All authors have read and approved the final manuscript.
Supplementary Figure 1. A) UMAP of RNA-seq data, counts from DESeq2, symbol shape is sex of individual B) UMAP of the ATAC-seq data, counts from DiffBind, symbol shape is sex of individual. C) UMAP of the Hi-C loops data, symbol shape is sex of individual. D) MDS of HiCRep analysis of the Hi-C data, symbol shape is sex of individual.
Supplementary Figure 2. A) APA plot of the loops that overlap CTCF at both ends. Numbers at bottom left corner and upper right corner indicate the ratio between the central pixel and the mean of the pixels at those corners. All loops and Hi-C map are from merged CD8+ T cells Hi-C. B) APA plot of the loops that overlap CTCF at least at one end. C) APA plot of the loops that do not overlap CTCF. D) Boxplot with number of insulation score bins that were considered outliers, per sample. MyLa and Jurkat cells are highlighted. E) boxplot with number of loops that were considered outliers, per sample. MyLa and Jurkat cells are highlighted.
Supplementary Figure 3. A) Hi-C maps from the ANKRD44 locus, showing differences in chromatin conformation between Jurkat CD4+ T cells and primary CD4+ T cells. B) Hi-C maps from the KLRF1 locus, showing differences in chromatin conformation between Jurkat CD4+ T cells and primary CD4+ T cells.
Supplementary Figure 4. A) Hi-C maps from the ITK locus, showing differences in chromatin conformation between Jurkat CD4+ T cells and primary CD4+ T cells. B) Hi-C maps from the GZMB locus, showing differences in chromatin conformation between Jurkat CD4+ T cells and primary CD4+ T cells.
Supplementary Figure 5. A) Hi-C maps from the SLC22A5 PsA GWAS locus, showing differences in chromatin conformation between MyLa CD8+ T cells and primary CD8+ T cells. B) Hi-C maps from the FYN PsA GWAS locus, showing differences in chromatin conformation between Jurkat CD4+ T cells and primary CD4+ T cells.
Supplementary Figure 6. A) Accessibility of CTCF site located at chr19:39440211 vs the genotype of rs2353678. B) Visualization of the region surrounding rs2353678. Top: ATAC-seq peaks. Intensity of the colour indicates the average peak-height of the peak, whilst position across the Y axis indicates the pearson’s correlation with the genotype of rs2353678. Middle: Genes from Gencode v29. The red dots indicate the transcription start sites. Bottom left: Correlation between the Hi-C contacts and the genotype of rs2353678. Bottom right: Merged Hi-C map.
Supplementary Figure 7. A) Region view from ensembl genome browser with homology between Homo Sapiens and Mus Musculus for the BUB1, ACOXL, BCL2L11 region. B) same as before but zoomed in view of the regulatory region, with the 3 chromatin accessibility peaks that were deleted in mouse.
ACKNOWLEDGEMENTS
The authors would like to acknowledge the assistance given by IT Services and the use of the Computational Shared Facility at The University of Manchester, the study coordinators and the research nurses, the flow cytometry facility and the patients and healthy volunteers who contributed their time and donated the samples.
This work was funded by the Wellcome Trust (award references 207491/Z/17/Z and 215207/Z/19/Z), Versus Arthritis (award reference 21754), NIHR Manchester Biomedical Research Centre and the Medical Research Council (award reference MR/N00017X/1).
REFERENCES
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.↵
- 96.↵