Abstract
Type 2 diabetes (T2D) is a disease for which both common genetic variants and environmental factors influence risk. A few genes have been identified in which very rare variants have large effects on risk and here we carry out a weighted burden analysis of rare variants in a sample of over 200,000 exome-sequenced participants in the UK Biobank project, of whom over 13,000 have T2D. Variant weights were allocated based on allele frequency and predicted effect, as informed by a previous analysis of hyperlipidaemia. There was an exome-wide significant increased burden of rare, functional variants in three genes, GCK, HNF4A and GIGYF1. GIGYF1 has not previously been identified as a diabetes risk gene but its product is plausibly involved in the modification of insulin signalling. A number of other genes did not attain exome-wide significance but were highly ranked and potentially of interest, including ALAD, PPARG, GYG1 and GHRL. Loss of function (LOF) variants were associated with T2D in GCK and GIGYF1 whereas nonsynonymous variants annotated as probably damaging were associated in GCK and HNF4A. Overall, fewer than 1% of T2D cases carried one of these variants. In two genes previously implicated in diabetes aetiology, HNF1A and HNF1B, there was an excess of LOF variants among cases but the small numbers of these fell well short of statistical significance, suggesting that even larger datasets will be helpful for more fully elucidating the contribution of rare genetic variants to T2D risk.
Introduction
Genome-wide association studies of type 2 diabetes (T2D) have implicated a large number of common genetic variants (Xue et al., 2018). In the UK Biobank, a genetic risk score derived from common variants was associated with T2D and incorporating it alongside conventional risk factors in order to predict T2D increased the area under the curve (AUC) from 0.851 to 0.855 (Chen et al., 2021). Common variants have individually modest effects on risk but a small number of genes have been identified in which rare variants with large effects can result in a phenotype of non-insulin dependent diabetes. Maturity onset diabetes of the young (MODY) is caused by mutations in a number of genes including HNF1A, HNF4A, GCK, HNF1B, KCNJ11, ABCC8 (Murphy, Ellard and Hattersley, 2008). Loss of function (LOF) and nonsynonymous variants in PPARG can cause a familial lipodystrophy with insulin resistant diabetes (Agostini et al., 2018). Recessively acting mutations in the INSR gene can cause insulin resistance with hyperglycaemia (Semple et al., 2011). These rare variants have typically been identified by targeted studies of familial cases and individuals with severe phenotypes but the availability of large samples of exome sequenced subjects now allows the possibility to explore the effects of variations in these genes in the population more broadly. Exome sequence data is now available for 200,000 of the 500,000 UK Biobank subjects (Szustakowski et al., 2020). We have recently analysed this in order to illuminate the effect of rare, coding variants on susceptibility to hyperlipidaemia and we now apply the same approach to study T2D (Curtis, 2021a).
Methods
The UK Biobank dataset was downloaded along with the variant call files for 200,632 subjects who had undergone exome-sequencing and genotyping by the UK Biobank Exome Sequencing Consortium using the GRCh38 assembly with coverage 20X at 95.6% of sites on average (Szustakowski et al., 2020). UK Biobank had obtained ethics approval from the North West Multi-centre Research Ethics Committee which covers the UK (approval number: 11/NW/0382) and had obtained informed consent from all participants. The UK Biobank approved an application for use of the data (ID 51119) and ethics approval for the analyses was obtained from the UCL Research Ethics Committee (11527/001). All variants were annotated using the standard software packages VEP, PolyPhen and SIFT (Kumar, Henikoff and Ng, 2009; Adzhubei, Jordan and Sunyaev, 2013; McLaren et al., 2016). To obtain population principal components reflecting ancestry, version 2.0 of plink (https://www.cog-genomics.org/plink/2.0/) was run with the options --maf 0.1 --pca 20 approx (Chang et al., 2015; Galinsky et al., 2016).
To define cases, a similar approach was used as was previously implemented for the investigation of hyperlipidaemia (Curtis, 2019, 2021a). The T2D phenotype was determined from three sources in the dataset: self-reported diabetes or type 2 diabetes (but not type 1 or gestational diabetes); reporting taking any of a list of named medications commonly used to treat T2D in the UK (https://www.diabetes.co.uk/Diabetes-drugs.html); having an ICD10 code for non-insulin-dependent diabetes mellitus in hospital records or as a cause of death. Subjects in any of these categories were deemed to be cases while all other subjects were taken to be controls.
The SCOREASSOC program was used to carry out a weighted burden analysis to test whether, in each gene, sequence variants which were rarer and/or predicted to have more severe functional effects occurred more commonly in cases than controls. Attention was restricted to rare variants with minor allele frequency (MAF) <= 0.01 in both cases and controls. As previously described, variants were weighted by overall MAF so that variants with MAF=0.01 were given a weight of 1 while very rare variants with MAF close to zero were given a weight of 10 (Curtis, 2021b). Variants were also weighted according to their functional annotation using the GENEVARASSOC program, which was used to generate input files for weighted burden analysis by SCOREASSOC (Curtis, 2012, 2016). The weights were informed from the analysis of the effects of different categories of variant in LDLR on hyperlipidaemia risk (Curtis, 2021a). Variants predicted to cause complete loss of function (LOF) of the gene were assigned a weight of 100. Nonsynonymous variants were assigned a weight of 5 but if SIFT annotated them as possibly or probably damaging then 5 or 10 was added to this and if SIFT annotated them as deleterious then 20 was added. In order to allow exploration of the effects of different types of variant on disease risk the variants were also grouped into broader categories to be used in multivariate analyses as described below. The full set of weights and categories is displayed in Table 1. As described previously, the weight due to MAF and the weight due to functional annotation were multiplied together to provide an overall weight for each variant. Variants were excluded if there were more than 10% of genotypes missing in the controls or if the heterozygote count was smaller than both homozygote counts in the controls. If a subject was not genotyped for a variant then they were assigned the subject-wise average score for that variant. For each subject a gene-wise weighted burden score was derived as the sum of the variant-wise weights, each multiplied by the number of alleles of the variant which the given subject possessed. For variants on the X chromosome, hemizygous males were treated as homozygotes.
The table shows the weight which was assigned to each type of variant as annotated by VEP, Polyphen and SIFT as well as the broad categories which were used for multivariate analyses of variant effects (Kumar, Henikoff and Ng, 2009; Adzhubei, Jordan and Sunyaev, 2013; McLaren et al., 2016).
For each gene, logistic regression analysis was carried out including the first 20 population principal components and sex as covariates and a likelihood ratio test was performed comparing the likelihoods of the models with and without the gene-wise burden score. The statistical significance was summarised as a signed log p value (SLP), which is the log base 10 of the p value given a positive sign if the score is higher in cases and negative if it is higher in controls.
Gene set analyses were carried out as before using the 1454 “all GO gene sets, gene symbols” pathways as listed in the file c5.all.v5.0.symbols.gmt downloaded from the Molecular Signatures Database at http://www.broadinstitute.org/gsea/msigdb/collections.jsp (Subramanian et al., 2005). For each set of genes, the natural logs of the gene-wise p values were summed according to Fisher’s method to produce a chi-squared statistic with degrees of freedom equal to twice the number of genes in the set. The p value associated with this chi-squared statistic was expressed as a minus log10 p (MLP) as a test of association of the set with the hyperlipidaemia phenotype.
For selected genes, additional analyses were carried out to clarify the contribution of different categories of variant. As described previously, logistic regression analyses were performed on the counts of the separate categories of variant as listed in Table 1, again including principal components and sex as covariates, to estimate the effect size for each category (Curtis, 2021a). The odds ratios associated with each category were estimated along with their standard errors and the Wald statistic was used to obtain a p value, except for categories in which variants occurred fewer than 50 times in which case Fisher’s exact test was applied to the variant counts. The associated p value was converted to an SLP, again with the sign being positive if the mean count was higher in cases than controls.
Data manipulation and statistical analyses were performed using GENEVARASSOC, SCOREASSOC and R (R Core Team, 2014).
Results
There were 13,938 cases of T2D and 186,694 controls. There were 20,384 genes for which there were qualifying variants. Given that there were 20,384 informative genes, the critical threshold for the absolute value of the SLP to declare a result as formally statistically significant is-log10(0.05/20384) = 5.61 and this was achieved by three genes, GCK (SLP = 22.24), HNF4A (SLP = 6.82) and GIGYF1 (SLP = 6.22). The quantile-quantile (QQ) plot for the SLPs obtained for all genes except GCK is shown in Figure 1. This shows that the test appears to be well-behaved and conforms well with the expected distribution. Omitting the genes with the 100 highest and 100 lowest SLPs, which might be capturing a real biological effect, the gradient for positive SLPs is 1.06 with intercept at −0.007 and the gradient for negative SLPs is 1.02 with intercept at −0.009, indicating only modest inflation of the test statistic.
QQ plot of SLPs obtained for weighted burden analysis of association with hyperlipidaemia showing observed against expected SLP for each gene, omitting results for GCK, which has SLP = 22.25.
Variants in GCK are recognised as the cause of up to half of cases of MODY, itself accounting for around 1-2% of cases of all diabetes diagnoses (Bishay and Greenfield, 2016). Likewise, HNF4A variants cause 5-10% of cases of MODY (Naylor, Johnson and Gaudio, 2018). By contrast, GIGYF1 has not previously been implicated in the aetiology of diabetes although it is known that its product binds to growth factor receptor-bound protein 10 (GRB10) and has a role in modulating the insulin-like growth factor (IGF1) receptor signalling pathway (Giovannone et al., 2003; Zhou et al., 2018). A variant in GRB10 has been reported to be associated with decreased early-phase insulin secretion and the muscle-specific ablation of Grb10 in mice causes increased glucose uptake into muscles with increased insulin signalling (Lyssenko, Groop and Prasad, 2015; Holt et al., 2018). However GRB10 itself showed no evidence for association in T2D in the current analysis (SLP = −0.02).
Table 2 shows all the genes achieving SLP with absolute value greater than 3, equivalent to an uncorrected p value of 0.001. Given that 20,384 genes were tested, one would expect that by chance about 20 would reach this level of significance whereas in fact there are 32. Thus it is possible that some of these highly ranked genes do demonstrate a biological signal which fails to reach statistical significance after correction for multiple testing and some of them seem worth commenting on. The expression of ALAD (SLP = 3.63) is reduced in obese subjects while the expression of Alad is reduced in rats with high-fat diet-induced weight gain (Moreno-Navarrete et al., 2017). Additionally, inhibition of ALAD with aminotriazole led to reduced glucose uptake in cultured human adipocytes. The common P12A variant of PPARG (SLP = 3.45) reduces risk of T2D whereas rare LOF variants and nonsynonymous variants which cause reduced activity (occurring in approximately 1 in 1,000 individuals) substantially increase risk (Majithia et al., 2014). Damaging variants in GYG1 (SLP = 3.22) cause deficiency of glycogenin 1, resulting in glycogen storage myopathies, but have not been reported to be associated with diabetes (Ben Yaou et al., 2017). GHRL (SLP = −3.15) encodes the ghrelin-obestatin preproprotein which is cleaved to yield two peptides, ghrelin and obestatin, which are involved in appetite and energy metabolism and there have been some studies which have claimed that the common Leu72Met (rs696217) variant is associated with reduced risk of T2D although the effect does not seem to be consistent and the gene was not highlighted in a large GWAS meta-analysis (Xue et al., 2018; Rivera-León et al., 2020). The results for all sets are provided in Supplementary Table S1.
Genes with absolute value of SLP exceeding 3 or more (equivalent to p<0.001) for test of association of weighted burden score with T2D.
In order to see if any additional genes were highlighted by analysing gene sets, gene set analysis was performed as described above after first removing all genes with absolute SLP value greater than 3. Given that 1,454 sets were tested a critical MLP to achieve to declare results significant after correction for multiple testing would be log10(1454*20) = 4.46 and this was not achieved by any set. There were two sets with MLP > 3, EXTERNAL SIDE OF PLASMA MEMBRANE (MLP = 3.29) and MONOSACCHARIDE TRANSMEMBRANE TRANSPORTER ACTIVITY (MLP = 3.06). The latter is of some interest because it consists of 10 genes, of which three were individually significant at p<0.05, these being SLC2A2 (SLP = 2.54), SLC2A3 (SLP = −2.37) and SLC2A4 (SLP = 1.70). SLC2A2, previously known as GLUT2, codes for a glucose transporter expressed by beta cells which senses glucose levels and recessively acting variants in it can cause neonatal diabetes (Sansbury et al., 2012). A common intronic variant of SLC2A2, rs8192675, is associated with the glycaemic response to metformin (Zhou et al., 2016). SLC2A4 codes for a glucose transporter whose levels in cell membranes increase in response to insulin but although candidate gene studies claim that common variants in it are associated with T2D these results are not supported by properly powered GWAS metanalysis (Xue et al., 2018; Hu et al., 2019). The results for all sets are provided in Supplementary Table S2.
For the genes of possible interest listed above, a logistic regression analysis of different categories of variant was carried out to elucidate their relative contributions. The results for the three exome-wide significant genes are shown in Table 3, which shows differences between the genes relating to the implicated pattern of variants. The results for GCK demonstrate that splice site variants and gene disruptive variants, comprising frameshift and stop variants, are associated with large effects on risk. These occur a total of 17 times among the 13,938 cases. However of note is that nonsynonymous variants annotated as probably damaging by PolyPhen are also associated with increased risk, with OR = 2.97 (1.59 - 5.54), and these occur 33 times among cases. The situation for HNF4A is quite different. There are no splice site variants and only 6 gene disruptive variants and these all occur in controls. Only probably damaging nonsynonymous variants show an effect, with OR = 2.97 (1.61 - 5.50), and these occur 34 times among cases. Finally, for GIGYF1 probably damaging nonsynonymous variants have no discernible effect and it is only the splice site (OR = 7.70 (2.62 - 22.67)) and disruptive (OR = 5.65 (3.07 - 10.40)) variants which increase risk and these occur 24 times in cases.
Results from logistic regression analysis showing the effects on risk of T2D of different categories of variant within exome-wide significant genes and genes of interest with absolute value of gene-wise SLP > 3. Odds ratios for each category are estimated including principal components and sex as covariates.
Table 3A Results for GCK.
Results for HNF4A.
Results for GIGYF1.
Results for ALAD.
Results for PPARG.
Results for GYG1.
Results for GHRL.
Table 3 also shows the results for the genes which, while not exome-wide significant, had SLP with magnitude >3 and which appeared to be potentially of interest, ALAD, PPARG, GYG1 and GHRL. The signal for ALAD seems to be driven by the fact that although splice site and disruptive variants occur only a total of 5 times in cases this still makes them about 7 times as frequent as in controls. The signal for PPARG arises from the fact that disruptive variants occur 4 times in cases and 7 times in controls, OR = 8.23 (2.29 - 29.63). The findings for GYG1 seem somewhat more robust, being based on an excess of both disruptive (OR = 1.98 (1.40 - 2.81)) and splice site (OR = 3.08 (0.96 - 9.81)) variants in cases, with a total of 37 occurrences. For GHRL, which has SLP = −3,15, implying that variants in it might be protective, no individual category has a statistically significant effect and there is more a general tendency for there to be fewer variants among cases which is spread over a number of categories, including 3 prime UTR, protein altering, indel, disruptive and deleterious.
The analyses described above failed to highlight a number of genes which have previously been implicated in T2D, comprising HNF1A (SLP = 1.66), HNF1B (SLP = −0.28), ABCC8 (SLP = 1.94) and INSR (SLP = −0.25). The association with T2D for each category of variant within these genes is shown in Table 4. From this it can be seen that LOF variants in HNF1A and HNF1B are commoner in cases than controls but that their absolute numbers are too small to produce statistically significant effects. By contrast, LOF variants have higher overall frequency in ABCC8 and they have approximately equal frequencies in cases and controls. These results are consistent with reports that activating variants in ABCC8 result in diabetes whereas LOF variants cause hyperinsulinaemia (De Franco et al., 2020). The overall frequencies of different category of nonsynonymous variant do not vary between cases and controls, possibly reflecting the inability of SIFT and PolyPhen to distinguish variants which have a gain of function effect. The frequency of LOF variants in INSR are similar in cases and controls, indicating that, although recessively acting variants can cause infantile hyperinsulinaemia followed by insulin dependent diabetes, the loss of function of a single copy of this gene has little discernible effect on risk of T2D (Semple et al., 2011).
Results from logistic regression analysis showing effect on risk of T2D of different categories of variant within genes previously implicated in diabetes pathogenesis.
Table 4A Results for HNF1A.
Results for HNF1B.
Results for ABCC8.
Results for INSR.
Discussion
These analyses provide a broad overview of the way very rare genetic variants contribute to risk of type 2 diabetes in a large sample broadly representative of the population. The weighted burden analysis successfully identifies two known diabetes genes, GCK and HNF4A, and implicates a novel gene, GIGYF1. A few other biologically plausible genes do not reach formal levels of statistical significance after correction for multiple testing but might be worthy of further investigation, including ALAD, PPARG, GYG1, GHRL, SCL2A2 and SLC2A4. Any possible role for these genes will become clearer as additional data becomes available, for example from the remaining 300,000 UK Biobank subjects for whom exome sequence data has not yet been released. Typically, hundreds of variants are identified per gene, mostly occurring in only a handful of subjects each. The findings for HNF1A and HNF1B demonstrate that even with over 13,000 cases it can be problematic to produce statistically significant results and it is reasonable to be optimistic that studying still larger datasets will be worthwhile.
Together, variants which can be identified as having large effects on risk occur in fewer than 1% of the cases of T2D in this sample. There is no doubt that identifying specific genetic causes may be useful to guide treatment for some patients (Agostini et al., 2018). However it needs to be acknowledged that for the vast majority of patients with T2D exome sequencing will be not be helpful in terms of identifying specific subtypes of disease which might benefit from specific treatments. Thus, the potential to apply a personalised medicine approach to T2D based on genetic testing seems to be somewhat limited.
The main potential utility of genetic investigations such as this might be to better characterise the mechanisms which can lead to disease, identify novel drug targets and develop improved therapeutic approaches which would benefit T2D patients in general, rather than only the small number carrying the relevant genetic variant. If the tentative findings reported here can be replicated then the genes identified could become the objects of more intensive investigation.
Data Availability
The raw data is available on application to UK Biobank. Detailed results with variant counts cannot be made available because they might be used for subject identification. Scripts and relevant derived variables will be deposited in UK Biobank.
Conflicts of interest
The author declares he has no conflict of interest.
Data availability
The raw data is available on application to UK Biobank. Detailed results with variant counts cannot be made available because they might be used for subject identification. Scripts and relevant derived variables will be deposited in UK Biobank.
Acknowledgments
This research has been conducted using the UK Biobank Resource. The author wishes to acknowledge the staff supporting the High Performance Computing Cluster, Computer Science Department, University College London. This work was carried out in part using resources provided by BBSRC equipment grant BB/R01356X/1. The author wishes to thank the participants who volunteered for the UK Biobank project.