Cross-Population Genetic Variation of Loci Identified by Genome-Wide Association Studies conducted in British participants of European-descent from the UK Biobank
====================================================================================================

* Antonella De Lillo
* Salvatore D’Antona
* Maria Fuciarelli
* Renato Polimanti

## Abstract

To provide novel insight regarding the inter-population diversity of loci associated with complex traits, we integrated genome-wide data from UK Biobank (UKB) with 1,000 Genomes Project (1KG) data representative of the genetic diversity among worldwide populations. We investigated genome-wide data of 4,359 traits from 361,194 UKB participants of European descent. Using 1KG data, we explored the allele frequency differences and linkage disequilibrium (LD) structure of UKB genome-wide significant (GWS) loci across worldwide populations. Functional annotation data were used to identify regulatory elements and evaluate the tagging properties of GWS variants. No significant difference was observed in allele frequency between UKB and 1KG GBR (British in England and Scotland). Considering other population groups, we identified genome-wide significant alleles with frequencies different from what is expected by chance: UKB vs. 1KG Europeans without GBR (rs74945666; allele=T [0.908 vs. 0.03], standing height pGWAS=1.48×10⁻¹⁷), UKB vs. 1KG African (rs556562; allele=A [0.942 vs. 0.083], platelet count pGWAS=4.84×10⁻¹⁵), UKB vs. 1KG Admixed Americans (rs1812378; allele=T [0.931 vs. 0.089], standing height pGWAS=4.23×10⁻¹²), UKB vs. 1KG East Asian (rs55881864; allele=T [0.911 vs. 0.001], monocyte count pGWAS=7.29×10⁻¹³), and UKB vs. 1KG South Asian (rs74945666; allele=T [0.908 vs. 0.061], standing height pGWAS=1.48×10⁻¹⁷). LD-structure analysis and computational prediction showed differences in how these alleles tag functional elements across human populations.
In conclusion, the human diversity of certain GWS loci appears to be affected by local adaptation, while in other cases the associations may be biased by residual population stratification.

Keywords

* Ancestry
* complex traits
* 1,000 Genomes Project
* GWAS
* phenome
* UK Biobank

## Introduction

Genome-wide association studies (GWAS) are a powerful tool to identify genetic variants associated with human traits and diseases (Visscher et al. 2017). Since the first GWAS conducted in 2005 (Klein et al. 2005), 4,671 GWAS reporting >19,813 associations have been listed in the GWAS Catalog (Buniello et al. 2019) as of August 13, 2020. This unprecedented amount of information has revolutionized our understanding of the predisposition to complex phenotypes, demonstrating that a large portion of the heritability of complex traits resides in common genetic variation (i.e., polymorphisms in the human genome that show a minor allele frequency (MAF) greater than 1%) (Visscher et al. 2017). In recent years, investigations of massive cohorts of 100,000 to more than 1,000,000 participants have become possible because of large collaborative projects combining numerous studies (Colodro-Conde et al. 2017; Kim et al. 2017; Sullivan et al. 2018; Thompson et al. 2014), the availability of biobanks enrolling an unprecedented number of participants (Fan et al. 2008; Kubo and Guest 2017; Sudlow et al. 2015), and collaboration with direct-to-consumer genetic testing companies (Check Hayden 2017). These large-scale GWAS, identifying ever-greater numbers of risk loci with ever-smaller individual effects, demonstrated that the genetic architecture of common diseases is highly polygenic and that their heritability is likely due to the contribution of several thousand (or even more) risk loci across the human genome (Evangelou et al. 2018; Karlsson Linner et al. 2019; Lee et al. 2018; Timmers et al. 2019).
One of the main promises of GWAS is that the knowledge gained can be used to develop genetic instruments useful to predict disease risk, treatment response, and disease prognosis. Leveraging data generated by large-scale GWAS, a growing number of studies are developing approaches to test the utility of polygenic information with respect to the human phenotypic spectrum (Inouye et al. 2018; Khera et al. 2019; Sparano et al. 2019; Weigl et al. 2018). Although these successful experiments strongly support the movement towards the application of GWAS data to develop new strategies to prevent and treat human diseases, important challenges remain. Among them, one of the most pressing is the limited ancestry and ethnic diversity of large-scale GWAS, which has created a large gap between the genetic data available for populations of European descent and for non-European human groups (Sirugo et al. 2019). Applying GWAS data generated from European-ancestry cohorts to non-European individuals raises serious issues, including much lower predictive power than that observed in comparisons between like populations (Martin et al. 2019; Mostafavi et al. 2019) and possible biases (e.g., reflecting unaccounted population stratification rather than the phenotype of interest) due to the genetic diversity among human populations (Duncan et al. 2019; Martin et al. 2017). The most reliable solution to this problem is to conduct large-scale GWAS in populations with non-European ancestry. Ongoing efforts such as the Million Veteran Program (Gaziano et al. 2016) and the All of Us Research Program (Sankar and Parker 2017) are investigating multiple ancestry groups representative of the US population to reduce this gap. Although these kinds of projects are expected to eliminate the population disparities in human genetic research, this is likely to be a long-term outcome.
In the meantime, to contribute to a more comprehensive understanding of human genetic diversity, we can leverage the data already available, combining large-scale genome-wide association datasets generated from cohorts including mainly participants of European descent with reference panels representative of the genetic diversity among worldwide populations (Daub et al. 2013; Hofer et al. 2009; Iorio et al. 2017; Polimanti et al. 2015). In the present study, we focused our attention on the UK Biobank (UKB). This large cohort includes more than 500,000 participants, approximately 90% of whom are British individuals of European descent (Bycroft et al. 2018). Based on UKB participants of European descent, GWAS have been conducted with respect to the human phenome spectrum, identifying a large number of risk loci surviving the genome-wide significance threshold (p<5×10⁻⁸). Using 1,000 Genomes Project (1KG) data, we explored the diversity of these loci, comparing allele frequency differences across worldwide populations. The results showed that allele frequency differences in certain risk loci are significantly different from those expected for randomly selected variants with similar genomic characteristics (i.e., minor allele frequency (MAF), gene density, distance to nearest gene, and linkage disequilibrium (LD) proxies). In some cases, these population differences appear to be due to evolutionary events related to local adaptation (i.e., adaptation in response to selective pressure related to the local environment), while other cases may be related to the residual effect of population stratification in UKB GWAS.

## Materials and Methods

### UK Biobank

The present study was conducted leveraging UKB genome-wide association data. UKB is a large population-based prospective study designed to investigate life-threatening disorders using environmental and genetic information, with the aim of improving diagnosis and treatment (Sudlow et al. 2015).
A wide variety of phenotypic information, including socio-demographic and lifestyle factors, electronic health records data, and physiological conditions, has been collected for more than 500,000 UKB participants (Bycroft et al. 2018). The whole cohort was genotyped using a bespoke genome-wide DNA microarray containing about 850,000 genetic variants (including rare, intermediate, and common variants) (Allen et al. 2014). Genetic data were then used to generate genome-wide association datasets that can be employed to explore the genetics of complex traits. The genome-wide datasets used in the present study were derived from the analysis of 361,194 unrelated British participants of European descent. Genome-wide association analyses for over 4,000 phenotypes were conducted using appropriate regression models available in Hail (available at [https://github.com/hail-is/hail](https://github.com/hail-is/hail)), including the first 20 ancestry principal components, sex, age, age², sex × age, and sex × age² as covariates. The principal components included in the regression model were generated by the UKB investigators using the fastPCA algorithm (Galinsky et al. 2016), considering unrelated subjects and genetic markers pruned for linkage disequilibrium (Bycroft et al. 2018). Details regarding QC criteria, GWAS methods, and the original data are available at [https://github.com/Nealelab/UK_Biobank_GWAS/tree/master/imputed-v2-gwas](https://github.com/Nealelab/UK_Biobank_GWAS/tree/master/imputed-v2-gwas).

### 1000 Genomes Project Phase 3

To dissect the genetic differences of UKB participants with respect to other European samples and other worldwide populations, we used data derived from 1KG Phase 3. The 1KG project aims to provide information about common and rare human genetic variation by applying whole-genome sequencing to a large cohort of individuals derived from different populations (Genomes Project et al. 2010; Genomes Project et al.
2012; Genomes Project et al. 2015). Phase 3 of the 1KG project includes data on 2,504 individuals sampled from 26 populations representative of Africa (AFR), East Asia (EAS), Europe (EUR), South Asia (SAS), and the Americas (admixed; AMR) (Genomes Project et al. 2015). Details regarding alignment, mapping algorithm, SNP (single nucleotide polymorphism) calling, and the data of the project are available at [https://www.internationalgenome.org/analysis](https://www.internationalgenome.org/analysis).

### Variant filtering and clumping

We considered genetic association results generated from 361,194 UKB participants of European descent tested with respect to 4,359 phenotypic outcomes including physiological, health, and lifestyle conditions (Supplementary File 1). We focused on variants meeting the genome-wide significance threshold (P ≤ 5×10⁻⁸) with MAF ≥ 5%. Furthermore, to control potential inflation in the test statistics, as suggested by the investigators who generated the data (details at [http://www.nealelab.is/blog/2017/9/11/details-and-considerations-of-the-uk-biobank-gwas](http://www.nealelab.is/blog/2017/9/11/details-and-considerations-of-the-uk-biobank-gwas)), we selected high-confidence association results generated from variants with at least 25 minor alleles in the smaller of the case and control groups. To find independent association signals among the selected variants, we conducted P-value-informed clumping with an LD cut-off of R² = 0.1 within a 1000 kb window.

### Allele frequency differences among human populations

We calculated the allele frequency of the index variants identified from the LD-clumping in the AFR, EAS, EUR, SAS, and AMR 1KG superpopulations. Specifically, we tested the following comparisons: i) UKB *vs*. 1KG GBR (British in England and Scotland) reference sample; ii) UKB vs. 1KG EUR reference panel (excluding GBR sample); iii) UKB vs. each of the non-European 1KG superpopulations (AFR, AMR, EAS, and SAS).
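Each comparison listed above amounts to computing, for every index variant, the difference between its UKB allele frequency and its frequency in a 1KG reference group. The following is a minimal sketch of that step, not the authors' pipeline: the function name, the dictionary inputs, and the choice to rank by absolute difference are assumptions made for illustration.

```python
from math import ceil


def rank_by_af_difference(ukb_af, ref_af, fraction=0.01):
    """Return the top `fraction` of variants by |AF_UKB - AF_reference|.

    ukb_af, ref_af: dicts mapping variant ID -> frequency of the same allele.
    Assumption: differences are ranked by absolute value.
    """
    shared = sorted(ukb_af.keys() & ref_af.keys())
    deltas = {v: abs(ukb_af[v] - ref_af[v]) for v in shared}
    # Keep at least one variant so small toy inputs still return a result.
    n_keep = max(1, ceil(len(deltas) * fraction))
    ranked = sorted(deltas, key=deltas.get, reverse=True)
    return [(v, deltas[v]) for v in ranked[:n_keep]]


# rs556562 frequencies are those reported in the abstract (UKB vs. 1KG AFR);
# the "toy_*" entries are made-up values for illustration only.
ukb = {"rs556562": 0.942, "toy_1": 0.50, "toy_2": 0.30}
afr = {"rs556562": 0.083, "toy_1": 0.45, "toy_2": 0.38}
print(rank_by_af_difference(ukb, afr))
```

Variants present in only one of the two inputs are dropped, since a frequency difference is only defined when the same allele is observed in both datasets.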
For the subsequent analyses, we considered the loci showing allele frequency differences in the top 1% of all index variants investigated with respect to each comparison conducted.

### Comparisons with respect to randomly-selected variants matched by genomic characteristics

To verify whether the allele frequency of each variant identified was different from what is expected by chance, we generated a control set of matched variants using the SNPsnap tool (Pers et al. 2015). This permitted us to identify sets of randomly selected variants matched to the index variants on the basis of four genomic characteristics: i) MAF, ii) LD proxies, iii) distance to nearest gene, and iv) gene density. Thus, variants in the top 1% were used as inputs with the following parameters: 1KG EUR population (which is the closest reference panel among those available in SNPsnap); LD distance cut-off of R²=0.5; MAF deviation of ±5 percentage points; ±50% relative deviation in gene density; ±50% relative deviation in the distance to nearest gene; ±50% relative deviation in LD proxies. For each index variant identified in the initial screening described in the section above, we extracted up to 10,000 matched SNPs, excluding the HLA region due to its complex LD structure. Based on the corresponding randomly-selected genomically-matched sets, we calculated empirical p values for each index variant tested and considered a type I error rate of 1% as the significance threshold. Finally, we checked whether the significant index variants showed allele frequency mismatches and mismapping using previously generated data available at [http://kunertgraf.com/data/biobank.html](http://kunertgraf.com/data/biobank.html) (Kunert-Graf et al. 2020).

### Cross-Ancestry LD comparison and Functional Annotation

For the index variants with empirical p values surviving statistical significance, we conducted computational analyses to explore their functional consequences.
Using LDlink (Machiela and Chanock 2015, 2018), we tested the effect of the LD structure variability across human populations on the ability of differentiated index variants to tag (measured as LD R²) functional variants in the surrounding regions (±500 kb). RegulomeDB (Boyle et al. 2012) was used to score the regulatory effect of the tagged variants on the basis of high-throughput experimental data sets as well as computational predictions and manual annotations. LD R²>0.50 and RegulomeDB score = 1a-f (Supplementary File 2) were used as criteria to identify functional tag SNPs.

### Enrichment analysis for significant phenotypic traits

To test whether traits related to differentiated loci were overrepresented with respect to certain phenotypic domains, we performed a χ² test of whether the distribution of phenotypic domains among the identified loci differed significantly from the overall distribution across the 4,000+ UKB phenotypes analyzed.

### Pan-UK Biobank data

To investigate the loci identified in non-European ancestral groups, we used the newly-released Pan-UKB genome-wide association statistics for 7,221 phenotypes, generated from 6,636 AFR individuals, 980 AMR individuals, 8,876 individuals of Central/South Asian ancestry (CSA), and 2,709 EAS individuals. A detailed description of the methods used to generate these data is available at [https://pan.ukbb.broadinstitute.org/](https://pan.ukbb.broadinstitute.org/). Using these data, we investigated whether the EUR associations of the index variants were concordant in AFR, AMR, CSA, and EAS. Pan-UKB data are available at [https://pan.ukbb.broadinstitute.org/downloads](https://pan.ukbb.broadinstitute.org/downloads).

## Results

Based on genome-wide significant associations (P ≤ 5×10⁻⁸) across the UKB phenotypic spectrum assessed (4,359 traits), we identified a total of 15,327 LD-independent risk alleles.
Among these, we identified 154 index variants showing allelic frequency differences in the top 1% with respect to the three comparisons conducted: i) UKB *vs*. 1KG GBR; ii) UKB vs. 1KG EUR (excluding GBR sample); iii) UKB vs. each of the non-European 1KG superpopulations (AFR, AMR, EAS, and SAS) (Figure 1; Supplementary File 3). To test whether the allele frequency differences were significantly different from what is expected by chance, we generated a control set of 10,000 variants matched by genomic characteristics (i.e., gene density, distance to the nearest gene, and the number of LD proxies) for each of the index variants (Supplementary File 4). For all significant index variants, their phenotypic associations and those of the variants in LD with them are reported in Supplementary File 5. In line with the fact that both samples are representative of the genetic variability of British populations, no significant difference was observed in the allele frequency of index variants between the UKB cohort and the 1KG GBR panel (Supplementary File 4). Conversely, when comparing UKB with other population groups, allele frequency differences were observed in loci associated with several traits. The differentiated loci appear to be associated mainly with anthropometric traits and hematologic parameters. Across multiple population comparisons, we observed that the phenotypic enrichments were significantly different from what is expected by chance (5.39×10⁻⁷
0.5) with functional elements in both populations (Supplementary File 11; Supplementary File 12-FigureS12.4-5).
### UK Biobank British participants vs. 1KG South Asians
Similar to what was observed in the other ancestry comparisons, allele frequency differences between UKB and SAS were observed in variants associated with anthropometric traits and hematologic parameters. These included *immature reticulocyte fraction* (rs34690548, allele=CA [0.904 vs. 0.025], *p*=2.85×10⁻³²⁰); *standing height* (rs74945666, allele=T [0.909 vs. 0.061], *p*=1.48×10⁻¹⁷); *eosinophil percentage* (rs200725444, allele=A [0.927 vs. 0.030], *p*=8.60×10⁻¹⁴); and *heel broadband ultrasound attenuation* (rs200033476, allele=C [0.917 vs. 0.013], *p*=4.01×10⁻⁹) (Figure 2; Table 1). The UKB-SAS differentiated loci did not show evidence of regulatory function or tagging of regulatory SNPs in either of the two populations (Supplementary File 13).
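Loci such as these were called differentiated on the basis of empirical p values computed against SNPsnap-matched variant sets (see Materials and Methods). The sketch below shows one common way to compute such a value. The text does not spell out the exact formula, so the (r + 1) / (n + 1) estimator, the use of absolute differences, and the function name are all assumptions made for illustration.

```python
def empirical_p_value(index_delta, matched_deltas):
    """Empirical p value of an index variant's allele frequency difference.

    index_delta: AF difference (e.g., UKB minus 1KG SAS) for the index variant.
    matched_deltas: the same difference for each matched control variant.
    Assumption: p = (r + 1) / (n + 1), where r counts matched variants whose
    absolute difference is at least as large as the index variant's.
    """
    if not matched_deltas:
        raise ValueError("need at least one matched variant")
    r = sum(1 for d in matched_deltas if abs(d) >= abs(index_delta))
    return (r + 1) / (len(matched_deltas) + 1)


# rs74945666 frequencies are taken from the text (UKB 0.909 vs. 1KG SAS 0.061);
# the matched differences below are hypothetical placeholders.
index_delta = 0.909 - 0.061
matched = [0.02, 0.11, 0.35, 0.07, 0.26, 0.15, 0.04, 0.51, 0.09]
print(empirical_p_value(index_delta, matched))
```

With up to 10,000 matched variants per index variant, the smallest value this estimator can return is 1/10,001, which is well below the 1% threshold used in the study.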
### Cross-ancestry association analysis in non-European UK Biobank participants
Considering Pan-UK Biobank data related to non-European populations, we tested whether the differentiated variants and their tagged functional SNPs were associated with their related phenotypic traits in AFR, AMR, EAS, and CSA participants from UKB. Due to the dramatic difference in sample size between UKB participants of European descent (N=361,194) and UKB participants of non-European descent (980