Abstract
We undertook a genome-wide association study of susceptibility to invasive group A streptococcal (GAS) disease combining data from distinct clinical manifestations and ancestral populations. Amongst other signals, we identified a susceptibility locus located 18kb from PAX5, an essential B-cell gene, which conferred a nearly two-fold increased risk of disease (rs1176842, odds ratio 1.8, 95% confidence intervals 1.5-2.3, P=3.2×10−7). While further studies are needed, this locus could plausibly explain some inter-individual differences in antibody-mediated immunity to GAS, perhaps providing insight into the effects of intravenous immunoglobulin in streptococcal toxic shock.
Main
Invasive group A streptococcal (GAS) disease is defined by the isolation of Streptococcus pyogenes from a normally sterile site. While various manifestations can occur, infections of the skin and soft tissues are most frequent. Relative to other infections, invasive GAS is associated with a high case fatality rate[1], with an estimated 163,000 deaths attributable to the condition each year worldwide.[2]
Over time the majority of the population experience some degree of exposure to GAS yet invasive infections are rare.[3] Moreover, while the majority of invasive infection occur at the extremes of age, it remains unclear why a minority of otherwise healthy children and young adults develop devastating invasive infections.[4] While several pathogen and environmental factors are implicated, a role for host genetic variation is certainly plausible given the heritability of other GAS diseases, including rheumatic fever[5] and recurrent pharyngitis.[6] Indeed, host genetic variation in the human leukocyte antigen (HLA) locus has previously been linked to susceptibility to invasive GAS disease,[7,8] perhaps reflecting preferential binding of certain alleles to GAS superantigens.[8-10] Additionally, a number of reports have now emerged linking host genetic variation elsewhere in the genome to susceptibility to invasive bacterial disease more broadly.[11-13] Accordingly, we undertook a genome-wide association study (GWAS) of susceptibility to invasive GAS disease, combining the results of four case-control studies covering distinct clinical manifestations and ancestral populations.
This study used genotyping data from several projects, brought together to form four case-control analyses (Suppl. Figure 1). Ethical approval was granted to each project by the relevant institutional authorities.
(a) Regional association for the MAGI2 locus with (b) a forest plot for rs35089407, the lead variant. (c) Regional association plot for the PAX5 locus with (d) a forest plot for rs1176844. In the regional association plots, genomic position is plotted against the negative common logarithm of the p-value from the fixed-effects meta-analysis for all variants within 250kb of the variant most associated with susceptibility. Variants are coloured by linkage disequilibrium with the most associated variant averaged across the entire data set (estimated r2: dark blue, 0– 0.2; light blue, 0.2–0.4; green, 0.4–0.6; yellow, 0.6–0.8; red, 0.8–1.0). In the forest plots, the effect size estimates for the lead variant at each locus is shown under an additive genetic model with association statistics from each analysis combined by fixed-effects meta-analysis.
In the first, cases were children and adults of European ancestry with GAS-associated necrotising fasciitis (NF) (n=34) or other manifestations of invasive GAS (n= 9) treated at hospitals across the UK. We have previously investigated the relationship between the variants in HLA locus and risk of invasive GAS disease amongst these individuals finding the DQA1*01:03 associated with susceptibility.[14] Individuals aged 65 years or more or with co-morbidity were excluded. The youngest patient was 18 months of age and oldest was aged 63 years (median 35 years). In the second analysis, cases were children of European ancestry with GAS empyema thoracis (n=36) recruited from hospitals across the UK. The youngest patient was aged four months and the oldest 17 years (median 4 years). None had comorbidities or established immunodeficiency.
Both sets of cases were genotyped using the HumanCore platform (Illumina Inc., USA) or the Global Screening Array (Illumina) specifically for this analysis. Controls for both analyses were healthy adolescents of European ancestry recruited to vaccine studies based in Oxfordshire (n=1540) and had previously been genome-wide genotyped using the HumanOmniExpressExome platform (Illumina).
In the third analysis, cases were children of East African ancestry with GAS bacteraemia (n=66) recruited at a single rural hospital in Kilifi, Kenya, as part of a broader study of childhood bacteraemia in this setting. [15] The youngest patient was aged one day and the oldest nine years (median 162 days). Controls for this analysis were children recruited to a birth cohort set in the area surrounding that hospital (n=1689). Both cases and controls had been genotyped previously by the Wellcome Trust Case Control Consortium 2 using the Affymetrix 6.0 platform (Affymetrix, USA).[12,13]
In the fourth analysis, cases were children and adults without comorbidity of Oceanian ancestry with various invasive GAS phenotypes, including bacteraemia without source (n=7), soft tissue infection (n=3) and meningitis (n=2), treated at hospitals in New Caledonia (NC), a French territory in the South Pacific. The youngest patient was aged two months and the oldest patient 24 years (median three years). Controls for this analysis were healthy adult volunteers reporting Oceanian ancestry recruited in NC to a previously reported study of rheumatic heart disease (n=193).[16] Both cases and controls were genotyped using the HumanCore platform (Illumina).
Quality control was undertaken using standard approaches[17] but with an additional step to minimise confounding due to the use of multiple genotyping platforms (Suppl. Figure 1 & 2).[18] Crucially, after these steps, the resulting case-control datasets were well matched for genetic ancestry (Suppl. Fig. 3). Next, we randomly selected ten controls for each case carrying forward 161 cases and 1610 controls for analysis, equating to an effective sample size of 585 individuals. We then performed genotype imputation, run separately for the UK, Kenyan and NC datasets, before running case-control association analyses in each dataset using linear mixed models implemented in GCTA software[19] (v1.26.0) with no evidence of residual confounding based on the distribution of test statistics (Suppl. Figure 4). We then performed genome-wide meta-analysis using Metasoft software[20] (v2.0.1) for 4,234,532 variants present at minor allele frequency greater than 5% in all four datasets. Further analyses including estimation of effect sizes by transformation[21] were performed in R (v3.0). Finally, in the absence of a further dataset in which to confirm or refute our findings, we designed a three-step process to identify the most robust association signals that would be suitable for further investigation (Suppl. Fig. 5). This involved an initial screening genome-wide meta-analysis under the standard fixed-effects model followed by further analysis of putatively associated loci using the more stringent random-effects model to assess heterogeneity. Taking forward genomic regions containing at least one variant at suggestive significance in the random-effects meta-analysis (P<10−5), we performed a sensitivity analysis in which we reran the random selection of controls 100 times over before repeating the individual case-control analyses and rerunning the fixed-effects meta-analysis, prioritising those loci at which the association with susceptibility showed the greatest consistency.
Projection of the samples on to (a) the first and second, (b) first and third and (c) first and fourth principal components of genetic variation. Cases indicated by gold-coloured empty squares while controls are indicated by empty diamonds coloured by self-reported ancestry (GBR, British European; KEN, Kenyan African; MEL, Melanesian; POL, Polynesian).
For each analysis, directly genotyped and imputed variants were tested for association with invasive GAS susceptibility using linear mixed models. Each point represents a single variant. An estimate of the genomic inflation factor (λ) is shown. Plots are shown for: (a) Invasive infection, UK, (b) Childhood empyema, UK, (c) Childhood bacteraemia, Kenya, and (d) Invasive infection, New Caledonia.
In our initial genome-wide screen, we identified one genomic region that contained a variant associated with susceptibility at genome-wide significance (PFE<5×10−8) and 24 genomic regions that contained at least one variant associated at suggestive significance (PFE<10−5) (Suppl. Table 1; Suppl. Figure 5a). Thirteen of these regions also contained at least one variant at suggestive significance in the random-effects meta-analysis (Suppl. Table 2; Suppl. Figure 5b). In our sensitivity analysis, we found only five regions (nine variants) had no PFE values greater than 1×10−4 while only two regions (five variants), which we took forward for further analysis, had no PFE values greater than 5×10−5 (Suppl. Table 3).
The first of two regions shortlisted in the sensitivity analysis (maximum PFE<5×10−5) was on chromosome 7 in an intron of the MAGI2 gene. The lead variant (rs35089407, PFE=5.3×10−8) was well supported by surrounding variants (Figure 1a). The effect was strongest in the NC analysis where the minor allele (thymidine) was associated with a three-fold increased risk of disease declining to a 1.5-fold increase in the UK invasive disease analysis (combined odds ratio, OR, 2.25, 95% confidence interval, CI, 1.68– 3.02; Figure 1b). The overall minor allele frequency was 14.7% (range 7.5-21.9%) while the imputation accuracy was high (information metric, 0.96–0.97).
The second region was on chromosome 9 located 18kb from the PAX5 gene in the intron of a non-coding long RNA (termed AL450267.2). The peak compromised four neighbouring variants (PFE=3.2×10−7–7.3×10−7; Figure 1c) of which one was at suggestive significance in the random-effects analysis (rs1176844, PRE=7.6×10−6). The minor allele of rs1176844 (thymidine) was associated with an almost two-fold increased risk of the disease in the combined analysis (combined OR 1.9, 95% CI 1.4– 2.4; Figure 1d). Notably this variant, which had a minor allele frequency of 43.3%, had been directly genotyped in both the UK and NC datasets while the imputation accuracy in the Kenyan dataset was very high (information metric, 0.997).
We next investigated overlap between the five shortlisted variants in these two loci and known or predicted regulatory elements (Suppl. Table 3) using the RegulomeDB tool.[22] Overall evidence of regulatory function was minimal with nothing to suggest altered transcription factor binding or gene expression. However, rs1327501 within the PAX5 signal overlapped histone modifications consistent with enhancer elements found specifically in analyses of primary hematopoietic stem cells and primary B cells from cord blood but no other cell types. Additionally, the other three PAX5 variants overlapped an area of weak transcriptional activity in primary hematopoietic stem cells in short term culture but no other cell types (Suppl. Figure 6).
Directly genotyped and imputed variants at MAF greater than 5% were tested for association with invasive GAS susceptibility in all four case-control analyses using linear mixed models. The association statistics were combined by (a) fixed-effects or (b) random-effects meta-analysis. Each point represents a single variant. The blue horizontal line indicates the suggestive significance threshold (P = 10−5) and the red horizontal line indicates the genome-wide significance threshold (P = 5 x 10−8).
Genomic position is plotted against: (a) the negative common logarithm of the p-value from the fixed-effects meta-analysis; (b) the Combined Annotation Dependent Depletion (CADD) score, a scaled measure of pathogenicity in which scores of 10 and 20, respectively, indicate that a variant is among the most 10% and 1% deleterious in the human genome; [32] (c) the chromatin states of nine cell types annotated using the core 15-state state model (ChromHMM), which predicts regulatory annotations based on histone modifications. [33] E034, primary T-cells from peripheral blood; E033, primary T cells from cord blood; E029, primary monocytes from peripheral blood; E031, primary B cells from cord blood; E035, primary hematopoietic stem cells; E051, primary hematopoietic stem cells G-CSF-mobilised (male); E050, primary hematopoietic stem cells G-CSF-mobilised (female); E036, primary hematopoietic stem cells in short term culture; E032, primary B cells from peripheral blood. The plot was generated using the FUMA web application.[34]
Finally, we reviewed the human leukocyte antigen (HLA) locus in light of our earlier report of an association between the DQA1*01:03 allele and susceptibility in the UK invasive disease component of our dataset. [14] While we did not perform HLA imputation for this analysis, it is noteworthy that the two missense variants that define the DQA1*01:03 allele (i.e. rs36219699 and rs41547417) were excluded from this analysis because their minor allele frequency was less than 5% in the Kenyan dataset. However, neither were associated with susceptibility in the UK empyema or NC analyses (P=0.50–0.65). Despite this, the strongest signal in the HLA locus mapped to within 1kb of HLA-DQA1 (rs17612669, OR 1.5, 95% 1.1–2.1, PFE=0.005). While this signal is difficult to interpret in the context of a GWAS, it is noteworthy that it showed negligible heterogeneity despite the diverse genetic ancestry of the study population.
DISCUSSION
This preliminary investigation of susceptibility to invasive GAS disease provides the first evidence that common host genetic variants may influence the risk of this infection across a range of populations, age groups and sub-phenotypes.
The signal located 18kb from PAX5 is of particular interest because this gene is widely considered a key regulator of B cell maturation and fundamental to the genetic processes that lead to antibody production.[23] This is relevant because deficient antibody-mediated immunity to key GAS virulence factors has long been proposed as a key determinant of invasive GAS susceptibility in the general population. [24,25] Indeed, antibody-mediated neutralisation of virulence factors including superantigens may drive the protective effects of intravenous immunoglobulin, which, when administered in streptococcal toxic syndrome, was recently shown to be associated with a two-fold reduction in mortality at 30 days.[26] Moreover, the overlap with putative regulatory elements in haemopoietic stem cells, while far from definitive in the absence of experimental validation, raises the possibility these variants somehow impact differentiation of haemopoietic stem cells into B cell precursors, which might have downstream effects on antibody-mediated immunity.
In contrast, the MAGI2 signal is harder to relate to invasive GAS disease. MAGI2 plays a fundamental role in maintenance of synapses, endocytosis and postendocytic trafficking. [27] The gene is large, stretching 1.4Mb, and it is of interest that the locus has previously been to linked to infantile spasms. [28] Additionally, the gene has been linked by GWAS to traits including blood lipid and bone metabolism phenotypes[29] but overall it remains unclear how this locus could influence susceptibility to invasive GAS disease.
This study has a number of limitations, not least of which is the small sample size relative to most modern genetic association studies. Consequently, our analysis has limited power to detect all but the strongest signals while our estimates of effect size are imprecise and should be interpreted with caution. While follow-up of these preliminary findings is a vital next step, it is important to emphasise that invasive GAS is a rare disease and sporadic which renders recruitment challenging. Indeed, despite our focus on children and young adults, this report is, to our knowledge, the largest study of genetic susceptibility to invasive GAS to date.
Moreover, we deliberately set out to identify variants influencing susceptibility to GAS across a range of clinical manifestations and ancestral populations, as a consequence of which our study largely ignores the heterogeneous nature of this disease. Indeed, such heterogeneity might explain our findings in the HLA locus since it is possible the DQA1*01:03 allele effects predisposition to only a subset of clinical phenotypes or is dependent on the presence of specific bacterial factors such superantigens.[14] Future larger-scale prospective studies with access to more precise clinical data coupled with information about the pathogen should help clarify this issue. Nonetheless, our use of data from three distinct ancestral populations has substantial advantages, not least of which is improved detection of underlying causal variants.[30]
Overall, this is the first time a GWAS of susceptibility to invasive GAS has been undertaken. While the small sample size necessitates caution, these results suggest common genetic variation in the PAX5 locus, as well as potentially elsewhere in the genome, impacts susceptibility to invasive GAS disease. While follow-up of our findings in independent datasets is warranted, this analysis underscores the potential of investigating genetic susceptibility to GAS disease to bring about an improved understanding of pathogenesis.
Data Availability
Several components of the data referred to in the manuscript are available through the European Genome-phenome Archive including: the UK Invasive GAS study data (accession number EGAS0000100342); the Kenyan Bacteraemia Study Group data (EGAS00001001756); the Pacific Islands Rheumatic Heart Disease Genetics Network data (EGAS00001001881). Other components will be submitted to the European Genome-phenome Archive over the coming months and/or for the most part be available from the authors on reasonable request.
https://www.ebi.ac.uk/ega/studies/EGAS00001003421
Acknowledgments
This research was supported by grants awarded to T.P. from the Medical Research Council (G1100449), the European Society of Clinical Microbiology & Infectious Diseases (Research Grant 2013), the National Institute for Health Research (ACF-2016-20-001) and the British Heart Foundation (PG/14/26/30509). In addition, the genotyping undertaken by the Kenyan Bacteraemia Study Group was funded as part of the Wellcome Trust Case Control Consortium 2 (084716/Z/08/Z, 085475/B/08/Z and 085475/Z/08/Z). The genotyping undertaken by the Oxford Vaccine Group as part of EUCLIDS was funded through the European Union’s Seventh Framework Programme (EC-GA no. 279185). A.J.M. was supported by a Wellcome Trust Clinical PhD Fellowship (106289/Z/14/Z) while A.J.M., S.J.C., D.O. and A.J.P. were supported by the National Institute for Health Research Oxford Biomedical Research Centre, Oxford, UK. Additionally, M.B. was supported by a Medical Research Council Clinician Scientist Fellowship (MR/M008797/1), J.J.G. was supported by a Wellcome Trust Clinical PhD Fellowship (102342/Z/13/Z), J.A.G.S. was supported by a Wellcome Trust Senior Research Fellowship (098532), T.N.W. was supported by a Wellcome Trust Senior Research Fellowship (091758), J.C.K. was supported by a Wellcome Trust Investigator Award (204969/Z/16/Z) and A.V.S.H. was supported by a Wellcome Trust Senior Investigator Award (HCUZZ0) and by a European Research Council Advanced Grant (294557). S.S. was supported by the National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance, Imperial College London, London, UK.
We thank the Oxford Genomics Centre at the Wellcome Centre for Human Genetics for generating the genotyping data, subsidized by Wellcome Trust Core Awards (090532/Z/09/Z and 203141/Z/16/Z). We also thank the Oxford Biomedical Research Computing facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the National Institute for Health Research Oxford Biomedical Research Centre with funding from the Wellcome Trust Core Award Grant Number (203141/Z/16/Z). We also acknowledge the support of the National Institute for Health Research Oxford Biomedical Research Centre, Oxford, the National Institute for Health Research Imperial Biomedical Research Centre, London, UK, and the National Institute for Health Research Newcastle Biomedical Research Centre, Newcastle upon Tyne, UK. None of these funders had any role in study design, data collection and analysis, decision to publish or preparation of the manuscript. Moreover, the views expressed here are those of the authors and not necessarily those of the National Health Service, the National Institute for Health Research or the Department of Health.