Abstract
Large medical centers located in urban areas such as Los Angeles care for a diverse patient population and offer the potential to study the interplay between genomic ancestry and social determinants of health within a single medical system. Here, we introduce the UCLA ATLAS Community Health Initiative – a biobank of genomic data linked with de-identified electronic health records (EHRs) of UCLA Health patients. We leverage the unique genomic diversity of the patient population in ATLAS to explore the interplay between self-reported race/ethnicity and genetic ancestry within a disease context using phenotypes extracted from the EHR. First, we identify an extensive amount of continental and subcontinental genomic diversity within the ATLAS data that is consistent with the global diversity of Los Angeles; this includes clusters of ATLAS individuals corresponding to individuals with Korean, Japanese, Filipino, and Middle Eastern genomic ancestries. Most importantly, we find that common diseases and traits stratify across genomic ancestry clusters, thus suggesting their utility in understanding disease biology across diverse individuals. Next, we showcase the power of genetic data linked with EHR to perform ancestry-specific genome and phenome-wide scans to identify genetic factors for a variety of EHR-derived phenotypes (phecodes). For example, we find ancestry-specific associations for liver disease, and link the genetic variants with neurological and neoplastic phenotypes primarily within individuals of admixed ancestries. Overall, our results underscore the utility of studying the genomes of diverse individuals through biobank-scale genotyping efforts linked with EHR-based phenotyping.
Introduction
Linking electronic health records (EHRs) to patient genomic data within biobanks in a de-identified fashion has the potential to significantly advance genomic discoveries and precision medicine efforts (e.g., population screening, identifying drug targets)[1]–[4]. However, the underrepresentation of minoritized populations in biomedical research [5]–[11]raises concerns that advancements in precision medicine may widen disparities in access to high-quality health care [12]–[14]. For example, European-ancestry individuals constitute approximately 16% of the global population, yet account for almost 80% of all genome-wide association study (GWAS) participants [13]. As a direct result of this imbalance, existing methods to predict disease risk from genetics (e.g., polygenic risk scores) are vastly inaccurate in individuals of non-European ancestry [13], [15] thus forming a barrier for advancing genomic medicine to benefit patients of all ancestries.
The UCLA Health medical system is located in Los Angeles, one of the most ethnically diverse cities in the world. There is no ethnic majority: 48.5% of Los Angeles residents self-identify as Hispanic or Latino, 11.6% as Asian, and 8.9% as Black or African American; additionally, 37% of Los Angeles residents are neither U.S. nationals, nor U.S. citizens at birth [16]. Therefore, the UCLA Health patient population and the availability of digital health data captured in EHRs from a single medical system present a unique opportunity to increase the inclusion of underrepresented minorities in biomedical research. We introduce the UCLA ATLAS Community Health Initiative (or ATLAS for brevity), a biobank embedded within the UCLA Health medical system composed of de-identified, EHR-linked genomic data from a diverse patient population. The current initiative aims to collect data from over 150,000 individuals; currently this consists of 26,414 individuals genotyped at 673,148 variants genome-wide each using the Illumina global screening array (GSA) [17]. The EHR contains a de-identified extract of medical records (billing codes, laboratory values, etc.) as well as demographic information such as self-reported race and ethnicity information. It is important to note that self-reported race and ethnicity (SIRE) represent social constructs that capture shared values, cultural norms, and behaviors of subgroups [18] that are distinct concepts from genetic ancestry, which refers to the history of one’s genome with little to no relation to cultural aspects of identify. This difference is even more relevant for individuals self-describing as multi-racial (and/or admixed) where genetic ancestry bears little correlation to SIRE [19], [20]. Understanding the interplay of genetic factors (such as genetic ancestry) with social determinants of health (as inferred from self-reports) is still mired in the confounding overlaps between race, socioeconomic status, and disease, but serves as a critical step in mapping and predicting disease risk across individuals of all ancestries, thus enabling equitable genomic medicine to individuals of all ancestries.
In this work, we leverage the unique genomic diversity of the patient population in ATLAS to explore the interplay between self-reported race/ethnicity and genetic ancestry within a disease context using phenotypes extracted from the EHR within a single medical system. We cluster individuals by genetic ancestry within the EHR-linked biobank, systematically construct phenotypes from EHR, and compute disease associations using multi-ethnic pipelines for both genome-wide and phenome-wide association studies. We find that genetic ancestry and self-reported ancestry yield distinct subpopulations thus emphasizing the distinction between genetic ancestry and self-reported race and ethnicities. We leverage genetic and self-reported data to find extensive variation of sub-continental ancestry within ATLAS across European, Asian, and American ancestries. For example, we find clusters of individuals with recent ancestry from Filipino, Japanese, and Korean ancestries. Such sub-continental clusters also stratify individuals according to disease groups thus emphasizing their utility in biomedical research. We perform genome-wide and phenome-wide association studies to recapitulate known genomic risk regions; as an example, focusing on chronic nonalcoholic liver disease, we recapitulate the 22q13.31 locus and perform a phenome-wide association study across 1,330 EHR-derived phenotypes at the lead SNP, rs2294915, across multiple populations. We describe genetic associations for liver-related phenotypes in multiple ancestry groups as well as associations with neurological and neoplastic phenotypes that are associated exclusively in the Admixed American group. These results underscore how the utility of large-scale genetic analyses and deep phenotyping in diverse populations have substantial medical relevance for population health.
Results
ATLAS includes individuals of diverse continental ancestries
The UCLA Health patient population is diverse, with 63.3% self-reporting their race as White or Caucasian, 6.7% as Black or African American, 10.5% as Asian, 0.6% as American Indian or Alaska Native, 0.3% as Pacific Islander, and 18.6% identify as one of the additional races listed in detail in the Supplementary Materials (Figure 1A, Supplementary Table S1, S3). 15.8% of individuals self-report their ethnicity as Hispanic or Latino; the remaining individuals self-report as non-Hispanic/Latino (Figure 1A, Supplementary Table S2, S3). We investigated genetically inferred ancestry through principal component analysis (PCA)[21], [22], to identify population clusters according to the five continental “superpopulations” defined in the 1000 Genomes reference panel [23] (see Methods; Figure 1B, 2A; Supplementary Figure S1, S2; Supplementary Table S4). Although we broadly find that SIRE is concordant with the inferred continental genetic ancestry, we find marked differences between genetically defined ancestry groups and SIRE, further emphasizing that genetic ancestry is a distinct concept from self-reported race and ethnicity. For example, we find >10% of individuals within the European genetic ancestry group do not identify as Non-Hispanic/Latino – White/Caucasian (NH-WC) SIRE; 10% of individuals within the African genetic ancestry group do not self-report as Non-Hispanic/Latino – Black/African American (NH-AfAm), and >25% of the Admixed American genetic ancestry group do not identify as Hispanic/Latino – Other Race (HL-Other) or Hispanic/Latino – White/Caucasian (HL-WC) (Supplementary Table S5).
We show the percentage breakdown of (A) all SIREs and (B) continental genetic ancestry for all unrelated individuals in ATLAS (N= 25,842). We exclude individuals whose self-reported race and/or ethnicity are unknown.
Further making the distinction between genetic ancestry and SIRE, we reveal extensive genetic heterogeneity both between and within SIREs within orthogonal spectra from PCA (Figure 2A and 2B). For example, most individuals who self-report as NH-AfAm lie along a cline between the African and European genetic ancestry clusters. We also observe that the cluster of individuals with inferred African ancestry from PCA form a considerably smaller cluster than the group of individuals in the NH-AfAm SIRE in ATLAS. Within ATLAS, we find that 1,426 individuals self-identify as NH-AfAm, but only 1,233 individuals are grouped into the African genetic ancestry cluster. This difference is likely because many individuals in ATLAS identify as African American, which suggests genetic admixture between African and European ancestry in this group. Conversely, there are fewer individuals in the Non-Hispanic/Latino – Asian (NH-Asian) SIRE (N=2,469) than those grouped into the East Asian and South Asian ancestry clusters (N= 2,611). A similar trend follows for the NH-WC SIRE (N= 14,328) and the European ancestry cluster (N= 14,800). The majority of individuals who are included in the genetic ancestry clusters, but not the corresponding SIREs, had either unknown SIRE information or reported their race as ‘Other Race’, demonstrating how genetic ancestry inference can be advantageous when self-reported information is not known or individuals’ race/ethnicity are not represented in patient questionnaires. However, 14% of individuals still have unclassifiable genetic ancestry (Supplementary Table S4) either because they are clustered into multiple ancestry groups or no ancestry group as all. The latter could be due to extensive admixture in their genomes or the absence of relevant ancestral groups in the chosen reference panels.
(A) Genetic PCs 1 and 2 of individuals in ATLAS (N=25,842) shaded by inferred continental genetic ancestry (European, African, Admixed American, East Asian, and South Asian) as inferred from 1000 Genomes. (B) and (C) show the first two genetic PCs of the ATLAS participants shaded by SIRE and preferred language, respectively. To improve visualization in (C), only languages with >10 responses are assigned a color.
Labeling individuals by self-reported preferred language, we observe trends that are consistent with both SIRE and continental genetic ancestry (Figure 2C). For example, out of all individuals who report Spanish as their primary language, 96.1% of these individuals were estimated to have Admixed American genetic ancestry. Additionally, 98.5% of individuals who report Japanese, Korean, Tagalog, Vietnamese, Mandarin Chinese, and Cantonese as their primary languages were inferred to have East Asian genetic ancestry. We also see clusters of individuals who speak Armenian, Arabic, and Farsi/Persian; we find that 26.2% of the individuals that speak these languages could not be classified into one of the five continental ancestry groups within 1000 Genomes. This discrepancy is likely because the 1000 Genomes reference panel does not contain samples from regions where these languages are primarily spoken. These findings showcase the limitation of current reference panels of genetic diversity and demonstrate the value of characterizing individuals using both genetic ancestry and self-reported information.
Fine-scale subcontinental ancestry within ATLAS individuals
Next, we inferred the ancestry of individuals within the ATLAS East Asian ancestry group (EAS) and identified 5 subcontinental genetic ancestry groups (Figure 3A, Supplementary Figure S3) [23]. First, we clustered individuals in ATLAS according to 3 different subgroups of Chinese ancestry (Han Chinese, Southern Han Chinese, and Dai Chinese). Additionally, when projecting individuals’ preferred language onto the PCs, two distinct clusters are delineated according to Chinese Mandarin and Chinese Cantonese/Toishanese (Supplementary Figure 5B). The Southern Han Chinese cluster of individuals highly correlates with individuals speaking Chinese Cantonese/Toishanese, where 91.7% of individuals who speak Chinese Cantonese/Toishanese are within this cluster. Conversely, only 8.3% of individuals who speak Chinese Mandarin are within the Southern Han Chinese cluster. Furthermore, the Han Chinese cluster correlates with Chinese Mandarin, although to a lesser extent, where 66.4% of individuals who speak Chinese Mandarin fall within the Han Chinese ancestry cluster and 26.2% within the Southern Han Chinese cluster. Additionally, from Figure 3A, there are two notable clusters that do not match any of the East Asian subcontinental populations within 1000 Genomes. Projecting individuals’ self-identified race over the PCs shows that the majority of individuals in these two clusters identify as ‘Asian: Korean’ and ‘Asian: Filipino’ respectively (Supplementary Figure 5A). This pattern is similarly reflected by the self-identified preferred languages where many of these individuals speak Korean and Tagalog. This clustering not only characterizes the fine-scale genetic and ethnic diversity of ATLAS, but also emphasizes how the concepts of genetic ancestry and self-reported constructs, such as primary spoken language, can be combined to identify and label distinct genetic clusters that would not have been characterized based on a single criterion alone.
Principal component analysis performed separately on each continental ancestry group (East Asian, European, Admixed American) in ATLAS with corresponding subcontinental ancestry samples from 1000 Genomes. Cluster annotation labels were determined using a combination of known genetic ancestry samples from 1000 Genomes and self-reported race, ethnicity, and language information from the EHR.
Next, we identify clusters of individuals with subcontinental genetic ancestry of European descent, but due to limitations in reference panels, we were unable to describe the origins of the majority of the observed genetic variation within the ATLAS European continental ancestry cluster (Figure 3B). Comparing self-reported race and ethnicity information does not delineate any subgroups since most individuals are within the NH-WC SIRE (Supplementary Figure S6A). Instead, we overlay individuals’ self-reported preferred language over the projected PCs and observe clusters of individuals whose preferred languages are Arabic, Armenian, and Farsi/Persian; notably the primary populations that speak these languages are not present in the current 1000 Genomes reference panel (Supplementary Figure S6B). Although not definitive about ancestral origins, these results suggest that individuals in these clusters may have cultural ties and/or genetic origins relating to the Middle East. We also observe two distinct clusters of individuals who speak Farsi/Persian (labeled as ‘Farsi, Persian I’ and ‘Farsi, Persian II’), suggesting that although these groups may share cultural ties, the groups could have varying ancestral origins.
We perform a similar analysis for the Admixed American cluster of individuals. We are able to cluster individuals according to Mexican, Peruvian, Columbian, and Puerto Rican ancestry, where 66.4% of individuals are within the Mexican ancestry cluster (Figure 3C, Supplementary Figure S4). Comparing self-reported race/ethnicity and language did not reveal any additional subclusters for this population (Supplementary Figure S7A,B). However, shading the PCs by estimated genetic ancestry proportions (see Methods), we see a cline between European and Native American ancestries, demonstrating that although we cannot determine further clusters within our data, there is still substantial population substructure present (Supplementary Figure S7C,D). Corresponding analyses were also performed for the African ancestry group (Supplementary Figure S8), but clear subcontinental clusters could not be constructed. We omitted the subcontinental analysis for the South Asian ancestry group due to the small sample size.
IBD sharing reveals communities of recent shared ancestry within ATLAS
A complementary method to principal components for inferring fine-scale ancestry is identical-by-descent (IBD) analysis [24]–[26]. Using pair-wise IBD estimates for all individuals in ATLAS and reference population information from the 1000 Genomes Project [23], Simons Genome Diversity Project [27], and Human Genome Diversity Project [28], we describe fine-scale populations based on total pairwise IBD (Figure 4; see Methods). Each subgroup is annotated according to a combination of genetic ancestry from reference populations as well as self-reported race, ethnicity, and language information. Many subgroups have similar characteristics to those defined from PCA-based clustering, such as the Filipino and Dai Chinese clusters. We can also characterize subgroups not previously identified through the previous PCA analysis. For example, PCA-based clustering was only able to distinguish clusters at the level of continental African ancestry, whereas IBD clustering identified West African, East African, and Ethiopian subgroups. In contrast, Japanese and Korean individuals form a single subgroup when estimated by the IBD clustering approach, whereas PCA-based clustering delineated these individuals into two separate groups. Note that both IBD and PCA-based clustering granularity is dependent on the clustering algorithm used and here we report at only a single level of resolution. For further discussion of PCA and IBD for fine-scale population analyses, see Belbin et al 2021[29]. Our results show that fine-scale population identification is specific to each genetic ancestry inference method, as well as how the combination of multiple methods can maximize the number of identified subgroups.
InfoMap community membership is indicated by color for all communities with greater than 100 individuals (20 communities total) and individuals with a degree greater than 30. Community membership indicates elevated shared IBD within that community. Community identity is labelled adjacent to the network plot in the corresponding color.
Admixture describes genetic variation within self-reported race/ethnicity groups
Many individuals do not fall within a single genetic ancestry cluster, but instead lie on the spectrum between multiple ancestry groups. We can characterize this variation through genetic admixture, the exchange of genetic information across two or more populations [30]. We estimate genetic ancestry proportions using k=4, 5,or 6 ancestral populations and visualize groups of individuals by SIRE (see Methods; Supplementary Figure S9). For the following analyses, we use k=4 ancestral populations where the clusters correspond to European, African, East Asian, and Native American ancestry. Among individuals in the HL-Other SIRE, the estimated average proportion of European ancestry is 49%, 6% African ancestry, and 44% Native American ancestry (Supplementary Table S6). We also observe that the HL-Other and HL-WC (White or Caucasian) SIREs have approximately the same admixture profile, where the proportion of European ancestry is 49% and 58% respectively, 6% and 5% African ancestry, and 44% and 36% Native American ancestry. However, there is also a large amount of variation within SIREs, where for example, individuals who identify as Hispanic or Latino ethnicity are estimated to have European ancestry percentages ranging from nearly 0% to almost 100%.
Genetic ancestry groups correlate with disease prevalence within ATLAS
Understanding how disease prevalence varies across populations is integral to understanding how the interplay of genetic factors and social determinants of health contribute to disease risk. We investigate 1,330 EHR-derived phenotypes (phecodes) [31] spanning a wide range of disease categories (see Methods) and identify 1,401 total significant phecode-ancestry associations (p < 3.8e-5) across the 5 continental ancestry groups after adjusting for age and sex (Supplementary Table S7). Overall, there are 659 phenotypes that show cross-ancestry differences, where a phenotype is significantly associated with a particular ancestry group compared to the rest of the population. From this set, the highest number of phecodes are from the circulatory (N=84), endocrine/metabolic (N=74), and digestive (N=80) system-related groups. Specifically, we recapitulate many known associations such as liver and intrahepatic bile duct cancer (p=6.97e-35) within the East Asian ancestry population [32]–[34], skin cancer (p=2.02e-162) in the European ancestry population [35], [36], hereditary hemolytic anemia (p=2.4e-22) [37] and primary open-angle glaucoma (POAG) (p=5.33e-12) [38], [39] within the African ancestry population, as well as both alcoholic liver damage (p=2.0e-47) and cirrhosis of liver without mention of alcohol (p=4.84e-70) [40]–[43] in the Admixed American population (Figure 5).
We show the odds ratio computed from associating each phenotype with individuals’ continental genetic ancestry in ATLAS (N=25,872) under a logistic regression model. Error bars represent 95% confidence intervals.
Next, as an example, we analyze phecodes spanning different traits where we observe a significantly higher prevalence for at least one continental ancestry group per trait. For example, we observe that the prevalence of both schizophrenia (freq=0.02, SE=0.004) and sickle cell anemia (freq=0.03, SE=0.005) have the highest prevalence in the African ancestry group in ATLAS, which is consistent with previous findings [44], [45]. We also observe substantial disease risk heterogeneity across subgroups of the same continental ancestry. We compute the prevalence for the same set of diseases across subgroups within the East Asian ancestry group (Korean, Japanese, Filipino, Chinese, and Vietnamese) in ATLAS and compare this with the aggregated East Asian ancestry group. The estimated prevalence of type 2 diabetes from the East Asian ancestry group is 0.26 (SE=0.009). However, analysis of specific subgroups shows a significant increase in the prevalence of type 2 diabetes for individuals in the Japanese (freq=0.33, SE=0.03) and Filipino (freq=0.32, SE=0.02) subgroups compared to the Chinese subgroup (freq=0.21, SE=0.01). These results indicate that genetically grouping individuals across sub-continental ancestries yield meaningful interpretation of disease risk across groups of individuals.
We also investigated disease prevalence within admixed individuals, where variation in genetic ancestry across individuals in the population allows for the correlation of disease risk with the proportion of genetic ancestry from any given continental group. Within each SIRE group, we perform an association test between the proportions of inferred ancestry estimated from ADMIXTURE [46] and each phenotype (see Methods; Supplementary Table S8). After correcting for the number of tested phenotypes, we find numerous significant phenotype-ancestry associations: 113 associations within the HL-Other SIRE, 62 within the NH-WC SIRE, and 48 within the NH-Asian SIRE. However, we do not find any significant associations within the NH-AfAm SIRE, which could be due to the smaller sample size. Across SIREs, both the top associated phenotype categories as well as the direction of the associations greatly vary. Out of the top 3 phenotype categories with the most associations in each SIRE group, only the endocrine/metabolic category is shared across all 3 tested SIREs (HL-Other, NH-WC, NH-Asian). Even within this category, looking at the statistics quantifying the association of the proportion of European ancestry with endocrine/metabolic phenotypes, there are exclusively 5 negative associations within the NH-WC group, exclusively 16 negative associations within the HL-Other group, but 8 positive associations and no negative associations within the NH-Asian group. The other top phenotype categories for each SIRE are also unique, where the HL-Other SIRE’s top categories include digestive and respiratory phenotypes, the NH-WC SIRE’s top categories include neoplasms and dermatologic phenotypes, and the NH-Asian SIRE’s top categories includes mental disorders and infectious diseases. Specifically, we find that within the HL-Other population, the overall proportion of European ancestry is significantly negatively associated (p=9.2e-14) with type 2 diabetes and the proportion of Native American ancestry is significantly positively associated (p=1.6e-13) (Figure 6A), which is consistent with previous studies [47], [48]; we additionally find a similar trend for ‘other chronic nonalcoholic liver disease’ (Figure 6B). These results suggest that not only are some disease statuses associated with SIRE and continental genetic ancestry, but the specific ancestry proportions may also correlate with disease risk.
Individuals who self-report as “Hispanic/Latino – Other Race” (HL-Oth) (N=2,206) and have had a diagnosis of “Other chronic nonalcoholic liver disease” or (B) type 2 diabetes are binned by their proportions of European, African, and Native American ancestry estimated using ADMIXTURE. Bins are defined by the proportion of each ancestral population in increments of 0.20. Within each bin, we plot the prevalence of the diagnoses and provide standard errors (+/-1.96 SE) of the computed frequencies.
Genome and phenome-wide association scans identify known risk regions and elucidates correlated phenotypes
EHR-linked biobanks also offer the opportunity of investigating genetic associations with traits across the genome. These efforts impose special challenges, such as adjusting for population stratification and cryptic relatedness in health systems that serve entire families as well as extracting phenotypes from EHR, namely due to inconsistencies in mapping diagnosis codes (ICD codes) to phenotypes and difficulties in defining appropriate controls for specific phenotypes. Here, we implemented the phecode system (v1.2) [31], [49] within a GWAS pipeline that accounts for population stratification (see Methods). As an example, we present results for phecode 571.5 ‘other chronic nonalcoholic liver disease’ (see Methods) across 15,439 unrelated individuals of European (EUR) ancestry and 4,472 unrelated individuals of Admixed American (AMR) ancestry (Figure 7). GWAS associations are well-calibrated for both populations (Supplementary Figure S10), with little evidence of test statistic inflation (AMR genomic inflation factor λGC = 1.01; EUR genomic inflation factor λGC = 1.02).
GWAS Manhattan plots for “Other chronic nonalcoholic liver disease” in the (A) AMR genetic ancestry group (N-Case: 919, N-Controls: 3,262) and (B) EUR genetic ancestry group (N-Case: 2,275, N-Controls: 14,155). The red dashed line denotes genome-wide significance (p<5e-08). We recapitulate a known association at the 22q13.31 locus in both populations.
In the EUR study, we find three SNPs that pass genome-wide significance (p < 5e-8) and 70 SNPs that reach significance in the AMR study (Supplementary Table S9). All genome-wide significant SNPs from both studies fall within the 22q13.31 locus, which contains the PNPLA3 gene. This gene has been extensively studied for its role in the risk of various liver diseases such as nonalcoholic fatty liver disease [50], [51]. The lead SNP from both analyses, rs2294915, is an intronic variant in the PNPLA3 gene and has MAF=0.45 in the AMR group but only MAF=0.24 in the EUR group. A nearby SNP, though not directly tested due to quality control filtering, is rs738409, a missense variant for PNPLA3 that has been well-documented for its role in the susceptibility of several types of liver disease [52]. Using measurements of LD from the 1000 Genomes reference panel, we find that rs2294915 is in high LD with rs738409 in the AMR analysis (R2 =0.94) as well as in the EUR analysis, although to a slightly lesser extent (R2 =0.85) [53].
Next, we leverage GWAS for all existing phecodes to investigate the association of the lead variant, rs2294915, across all 1,330 EHR-derived phenotypes (i.e. a phenome-wide association study: PheWAS). After adjusting for both genome-wide significance and the number of phenotypes (p < 3.8e-11), we find that only the liver-related phenotypes within the AMR study reach significance (Figure 8). Additionally, multiple neoplastic and neurological phenotypes, which are comorbidities with severe liver disease [50], [54]–[56], are nominally significant only in the AMR study after adjusting for the number of tested phenotypes (p < 3.8e-5). These findings suggest possible differential genetic architecture across these two populations, as well as variation even at the phenotype level, reflecting possible genetic or environmental modifiers of important comorbidities.
PheWAS plot at rs2294915 for the Admixed American (top) and European (bottom) ancestry groups across 1,330 phecode phenotypes. The red dotted line denotes p=3.8e-11, the significance threshold after adjusting the genome-wide significance threshold for 1,330 tested phenotypes. The red dashed line denotes p=3.8e-5, the significance threshold after correcting for only the 1330 tested phenotypes.
Discussion
In this work, we introduce the ATLAS Community Health Initiative, a biobank embedded within the UCLA Health medical system comprising of de-identified EHR-linked genomic data from a diverse patient population. The UCLA Health system serves Los Angeles County, leading to a study population of great demographic, genetic, and phenotypic diversity. We investigate ancestry both on the continental as well as the subcontinental population level and find that genetic ancestry and self-reported demographic information yield distinct subpopulations in the ATLAS biobank. We present a collection of results cataloguing the associations between genetic ancestry and EHR-derived phenotype where we find that disease status is not only associated with continental genetic ancestry but also associated with the specific admixture profile describing an individual. We use multi-ethnic pipelines to recapitulate known associations for chronic nonalcoholic liver disease at the 22q13.31 locus and perform a phenome-wide association study at the lead SNP, where we find associations with neurological and neoplastic phenotypes exclusively in the Admixed American ancestry group. As the sample size increases, the ATLAS Community Health Initiative will enable rigorous genetic and epidemiological studies to further understand the role of genetic ancestry in disease etiology, with a specific aim to accelerate genomic medicine in diverse populations. Already, the ATLAS biobank accounted for 73.4% of the Admixed American samples utilized in the primary analysis from the COVID-19 Host Genetics Initiative [57].
As the field moves forward with increased collaboration between the genetics and healthcare communities, it is of utmost importance to also be aware of potential pitfalls that may occur when translating research findings into actual clinical populations. Currently, many clinical protocols are deeply ingrained with racial bias, no matter how benign the original goal was intended[58]–[62]. Many of these flawed policies stemmed from erroneously linking race, a social rather than biological construct, with disease risk despite not presenting any biological justification. Although race and genetic ancestry are correlated [63], [64], our work shows that populations constructed from these two concepts are not analogous. We encourage protocol decisions that are rooted in actual biological phenomena, such as genetic markers, providing transparent, immutable criteria. For example, Benign Ethnic Neutropenia (BEN) is observed predominantly in African Americans, but specifically is strongly associated with the variant at rs2814778 [65], [66]. Recent studies have suggested that genotype screening at rs2814778 could aid in the interpretation of neutropenia in African Americans and avoid unnecessary invasive procedures as well as lead to an increase of the inclusion of these individuals to various treatments [67]. Additionally, the Kidney Donor Risk Index (KDRI) equation uses race as a risk factor [68], but it has been recently proposed to use the presence of APOL1 variants as a factor instead [69]. Discovered after the creation of the KDRI, the presence of these variants was shown to be associated with shorter allograft longevity [70]. Despite this finding, the original KDRI score is still commonly used. In order to remedy and not perpetuate current healthcare inequalities, we underscore the importance of favoring transparent clinical protocols with clear biological justification instead of race-adjusted formulas that leverage convenience at the expense of potential inequities.
There are various limitations within our study, and we describe a few of these in detail as follows. First, the phenotypes are based on ICD codes, and due to the nature of billing codes, this form of labeling does not constitute a formal patient diagnosis and may contain individuals who do not have the specific disease. This uncertainty in phenotyping likely limits the power of our study to find disease associations. For the further investigation into specific phenotypes, we recommend refining each phenotype definition based on additional disease-specific factors and metrics. Additionally, although ICD codes are an international standard, there are still deviations between different institutions in how specific diagnoses are recorded. This adds further heterogeneity in phenotyping and could present future challenges when replicating studies or porting algorithms to other institutions. Second, due to the de-identified nature of the data, we lack information that could help us better describe the fine-scale population groups. For example, location of birth, zip code, and family history has been shown to be useful descriptors for determining subgroups of genetic ancestry [29]. This geographic information could also be used as a proxy for various environmental exposures such as pollution. Additional socioeconomic information, such as income and availability of health insurance, could likely account for a portion of observed associations as well as provide more insight into the socioeconomic determinants of health. Third, our findings within the African and South Asian ancestry populations are limited due to the smaller sample sizes. As sample sizes increase, we hope to further refine population substructure within these initial continental ancestry groups and have the power to detect novel disease associations that have previously been mired by lack of statistical power.
We conclude by discussing directions for future work. Although we investigate admixed populations, such as African American and Hispanic/Latino populations, admixed individuals who do not fall under these groups are excluded from downstream analyses due to concerns over population structure. In the future, we hope to incorporate methods and pipelines that properly control for population structure in all types of admixed populations. Additionally, we plan to compute polygenic risk scores (PRS) across all 5 continental ancestry groups. PRS has already shown modest clinical utility for diseases such as breast cancer [71] and cardiovascular disease [72], but has proven difficult to perform accurate predictions across populations [13]. The genetic diversity within the ATLAS Community Health Initiative biobank partnered with the longitudinal clinical data provides a unique resource to further explore the role of ancestry in PRS prediction. Furthermore, as the size of the biobank grows and more data is collected over time, we hope to explore even more individualized health solutions and interventions.
Methods
Study population
The UCLA Health System includes two hospitals (520 and 281 inpatient beds) and 210 primary and specialty outpatient locations predominantly located in Los Angeles County. The UCLA Data Discovery Repository (DDR) contains de-identified patient electronic health records (EHR) that have been collected since March 2nd, 2013, under the auspices of the UCLA Health Office of Health Informatics Analytics and the UCLA Institute of Precision Health. Currently, the DDR contains longitudinal records for more than 1.5 million patients, including basic patient information (height, weight, gender), diagnosis codes, laboratory tests, medications, prescriptions, hospital admissions, and procedures. The UCLA ATLAS Community Health Initiative includes the EHR-linked biobank within the UCLA Health System. Currently, there are more than 27,000 genotyped participants with their de-identified EHR linked through the DDR. Patients’ participation is voluntary and their privacy is protected by de-identifying the samples.
Self-reported demographic information
Participants self-report their race and ethnicity via two distinct fields where there are multiple-choice fields for race and ethnicity (see Supplementary Table S1, S2 for full list). Only one selection from each category can be chosen as a patient’s primary race and ethnicity. We group together race/ethnicity pairings to form 21 ‘self-reported race/ethnicity’ (SIRE) groupings (see Supplementary Table S3). Patients also report their ‘Preferred Language’ from multiple-choice fields.
Genotype quality control
Bio-samples collected from the UCLA ATLAS Community Health Initiative in the form of blood samples, were de-identified and then processed for DNA extraction and genotyping. ATLAS participants (N= 26,439) were genotyped on a custom Illumina Global Screening (GSA) array [17] that included a standard GWAS backbone and an additional set of pathogenic variants selected from ClinVar [73]. We filtered out poor quality markers by removing variants with >5% missingness (M= 9,318 variants removed) and then removed strand-ambiguous SNPs (M= 7,686). We excluded samples with missingness >5% (N=3 individuals removed) and kept one individual from each set of twins or duplicated individuals (N=22 individuals removed). All quality control steps were conducted using PLINK 1.9 [74] and duplicate individuals/twins were determined using KING [75]. These steps resulted in a total of 673,148 variants and 26,414 individuals.
Genetic relatedness inference
With the M=673,148 variants that passed quality control, we used KING [75] to compute pairwise kinship coefficients to determine family relationships. We identified a set of unrelated individuals (N=25,842) where individuals with kinship coefficient <0.0884 were included (‘king --unrelated’). Additionally, we identified 22 duplicate individuals or twins, 213 parent-offspring relatives, and 117 first-degree relatives. This level of relatedness is expected since families will often be members of the same healthcare system.
Genotype imputation
After performing array-level genotype quality control, the PLINK-formatted genotypes are converted to VCF format and uploaded to the Michigan Imputation Sever [76]. On a variant level, the server removes the variant if it is not an A, C, G, T allele, monomorphic, a duplicate, an allele mismatch between the reference panel and provided data, an insertion-deletion, or if the SNP call rate is less than 90%. The server will additionally correct for any necessary strand flips or allele switches needed to match the reference panel. The server additionally phases the data using Eagle v2.4 [77] and imputation is performed against the TOPMed Freeze5 imputation panel [78] using minimac4 [79]. In summary, the explicit parameters used on the server are “TOPMed Freeze5” for the reference panel, “GRCh38/hg38” for the array build, “off” for the rsq filter, “Eagle v2.4” for phasing, no QC frequency check, and “quality control & imputation” for the mode. After we filtered by R2>0.90 and MAF>1%, the final set of variants contained M=7,973,837 sites.
Genetic ancestry PCA analysis
We limited analyses to unrelated individuals (2nd degree relatives) and performed the ancestry-related PCA analyses on only the typed data. Genotypes were filtered by Mendel error rate (‘plink --me 1 1 –set-me-missing’), founders (‘--filter-founders’), minor allele frequency (‘–maf 0.15’), genotype missing call rate (‘--geno 0.05’), and Hardy-Weinberg equilibrium test p-value (‘–hwe 0.001’). Genotypes were then merged with the 1000 Genomes phase 3 dataset [23]. LD pruning was then performed on the merged dataset (‘--indep 200 5 1.15 --indep-pairwise 100 5 0.1’). We computed the top 10 PCs using the FlashPCA software [80] with all default settings.
We use the superpopulations from the 1000 Genomes dataset to define continental genetic ancestry groups (European, African, East Asian, South Asian). Ancestry clusters were determined by visually defining PC thresholds based on the labeled individuals from the 1000 Genomes dataset. The European group was determined by PC thresholds with PC1 and PC2 (Supplementary Figure S1A), the African group with PC1 and PC2 (Supplementary Figure S1B), the Admixed American group with PC2 and PC3 (Supplementary Figure S1C), the East Asian group with PC1 and PC2 (Supplementary Figure S1D), and the South Asian group with PC4 and PC5 (Supplementary Figure S1E). ATLAS participants that fell within the set of thresholds for each ancestry group were assigned to that ancestry. Individuals who were classified into multiple ancestry groups or individuals that could not be classified into any of the ancestry groups were given an ancestry label of ‘Admixed or other ancestry’.
Subcontinental PCA analysis
We ran PCA on individuals within the East Asian continental genetic ancestry group (N=2,242) and the individuals from the East Asian superpopulation within 1000 Genomes using FlashPCA. We define clusters of hypothesized Chinese, Vietnamese, Korean, Japanese, and Filipino ancestry. Ancestry groups were determined by visually defining PC thresholds based on the self-reported race information of the ATLAS participants: Korean ancestry based on PC2 (Supplementary Figure S3A), Chinese ancestry based on PC1 (Supplementary Figure S3B), Vietnamese ancestry based on PC1 (Supplementary Figure S3C), Filipino ancestry based on PC1 (Supplementary Figure S3D), and Japanese ancestry based on PC1 and PC3 (Supplementary Figure S3E, S3F).
We additionally ran PCA on individuals from the initial Admixed American ancestry cluster in ATLAS and individuals of Mexican (MXL), Puerto Rican (PUR), Columbian (CLM), and Peruvian (PEL) ancestry from 1000 Genomes. The Mexican ancestry cluster in ATLAS is described by PC1 (Supplementary Figure 4A) and the Puerto Rican cluster is determined by PCs 1, 2, and 8 (Supplementary Figure 4B, C, D). Explicit thresholds could not confidently be drawn for the Columbian and Peruvian ancestry clusters in ATLAS. Additionally, we did not compute explicit PC thresholds for the European subcontinental clusters.
GWAS quality control per ancestry
We limited our analyses to N=25,842 unrelated individuals and then performed additional quality control steps within each continental ancestry groups for GWAS (European, African, Admixed American, East Asian, South Asian). SNPs that violated Hardy-Weinberg equilibrium (HWE) with p<1e-12 were excluded. Individuals with a heterozygosity rate that surpassed +/-3 standard deviations from the ancestry-specific mean were also excluded. Analyses were restricted to common SNPs per ancestry group where MAF>1%.
IBD Calling
For IBD calling, an interim version of the ATLAS data comprising of 24,318 individuals was used. ATLAS data was merged with the 1000 Genome Project [23], the Simons Genome Diversity Project [27], and the Human Genome Diversity Project [28]. In total, 418,195 SNPs were kept for IBD analysis after filtering by missingness >=10% and MAF>1%. The merged dataset was then statistically phased using Shapeit4 [81]. IBD was called using iLASH using default parameters [82]. For downstream analysis, IBD segments were summed between individuals to create an adjacency matrix, where each row represented a pair of individuals, and each column represented the total genome-wide IBD between those two individuals. Using KING [75], the adjacency matrix was filtered to remove rows representing individuals who were third degree relatives or closer. Communities are annotated using the presence of reference individuals in a cluster and EHR characteristics, such as preferred language and self-reported race/ethnicity.
Genetic admixture analysis
We inferred the proportion of genetic ancestry by using the AMIXTURE software [46] under the unsupervised clustering mode with the number of clusters k=4, 5, 6. For each SIRE, we compare the ancestry proportions from the clusters. For k=4, we assign the cluster with the majority of NH-WC individuals as European ancestry, the cluster with the majority of NH-AfAm individuals as African ancestry, the cluster with the majority of NH-Asian individuals as East Asian ancestry, and the cluster with the highest number of HL-Other and HL-WC individuals as Native American ancestry.
Phecodes
We aggregated billing (ICD9/ICD10) codes into more clinically informative groupings known as phecodes [31]. We derived phecodes from ICD codes in the EHR using mappings described in the PheWAS catalogue (Phecode Map 1.2) [83]. Using phecodes to define case/control phenotypes, we treated individuals with the occurrence of a specific phecode as a case and a control otherwise. We restricted our analyses to phecodes that had >100 cases present in ATLAS, yielding a total of 1330 phenotypes.
Association between phecodes and genetic ancestry
We performed a marginal association between each phecode and continental genetic ancestry group under a logistic regression model while also adjusting for age and sex. Statistical significance was determined after correcting for the number of tested phecodes (p<3.8e-5).
Association between genetic admixture proportions and phecodes
We perform a marginal regression between each of the ancestry proportions estimated from ADMIXTURE where k=4 (European, African, East Asian, and Native American ancestry) and 1,300 EHR-derived phenotypes (phecodes) within each of the 7 ATLAS SIRE groups groups (NH-WC, NH-AfAm, HL-Other, HL-WC, NH-Asian, NH-PI, NH-AmIn). Additional details on computing admixture proportions can be found under the section Genetic admixture analysis. We additionally adjust for age and sex in the regression. Only traits with >10 cases per SIRE were tested. Significance is determined after adjusting for the number of tested phenotypes (p<3.8e-5).
GWAS for ‘Other chronic nonalcoholic liver disease’
We performed an association between all imputed autosomal variants and ‘Other chronic nonalcoholic liver disease’ within the European (N-Case: 2,275, N-Controls: 14,155) and Admixed American (N-Case=919; N-Controls=3,262) continental ancestry groups. Using PLINK 1.9, we performed a marginal association test at each SNP using a logistic regression model (‘plink --logistic beta’) where we adjusted for age, sex, and PCs 1-10. Quantile-quantile plots and genomic inflation factors (EUR λGC = 1.02; AMR λGC = 1.01) provide evidence that both analyses are well-calibrated (Supplementary Figure S10).
PheWAS
We performed an association between each typed SNP and 1330 phecodes. Due to the number of tests, we only perform associations at genotyped SNPs. To determine significance, we used a stringent threshold that corrects for both the number of tested phenotypes as well as genome-wide significance (p<3.8e-11) and a less stringent threshold that only corrects for genome-wide significance (p<5e-08).
Data Availability
Individual electronic health record data and genomic data are not publicly available due to patient confidentiality and security concerns.
References
- [1].↵
- [2].
- [3].
- [4].↵
- [5].↵
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].
- [42].
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].
- [56].↵
- [57].↵
- [58].↵
- [59].
- [60].
- [61].
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].↵
- [70].↵
- [71].↵
- [72].↵
- [73].↵
- [74].↵
- [75].↵
- [76].↵
- [77].↵
- [78].↵
- [79].↵
- [80].↵
- [81].↵
- [82].↵
- [83].↵