Skip to main content
medRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search

Bayesian Shrinkage Priors in Zero-Inflated and Negative Binomial Regression models with Real World Data Applications of COVID-19 Vaccine, and RNA-Seq

View ORCID ProfileArinjita Bhattacharyya, Riten Mitra, Shesh Rai, Subhadip Pal
doi: https://doi.org/10.1101/2022.07.13.22277610
Arinjita Bhattacharyya
1Department of Bioinformatics & Biostatistics, University of Louisville, KY, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Arinjita Bhattacharyya
Riten Mitra
1Department of Bioinformatics & Biostatistics, University of Louisville, KY, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Shesh Rai
1Department of Bioinformatics & Biostatistics, University of Louisville, KY, USA
2Biostatistics & Bioinformatics Facility, JG Brown Cancer Center, University of Louisville, KY, USA
3The Christina Lee Brown Envirome Institute, University of Louisville, KY, USA
4University of Louisville Alcohol Research Center, University of Louisville, KY, USA
5University of Louisville Hepatobiology & Toxicology Center, University of Louisville, KY, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: shesh.rai{at}louisville.edu
Subhadip Pal
1Department of Bioinformatics & Biostatistics, University of Louisville, KY, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Data/Code
  • Preview PDF
Loading

Abstract

Background Count data regression modeling has received much attention in several science fields in which the Poisson, Negative binomial, and Zero-Inflated models are some of the primary regression techniques. Negative binomial regression is applied to modeling count variables, usually when they are over-dispersed. A Poisson distribution is also utilized for counting data where the mean is equal to the variance. This situation is often unrealistic since the distribution of counts will usually have a variance that is not equal to its mean. Modeling it as Poisson distributed leads to ignoring under- or overdispersion, depending on if the variance is smaller or larger than the mean. Also, situations with outcomes having a larger number of zeros such as RNASeq data require Zero-inflated models. Variable selection through shrinkage priors has been a popular method to address the curse of dimensionality and achieve the identification of significant variables.

Methods We present a unified Bayesian hierarchical framework that implements and compares shrinkage priors in negative-binomial and zero-inflated negative binomial regression models. The key feature is the representation of the likelihood by a Polya-Gamma data augmentation, which admits a natural integration with a family of shrinkage priors. We specifically focus on the Horseshoe, Dirichlet Laplace, and Double Pareto priors. Extensive simulation studies address the efficiency of the model and mean square errors are reported. Further, the models are applied to data sets such as the Covid-19 vaccine, and Covid-19 RNA-Seq data among others.

Results The models are robust enough to address variable selection, and MSE decreases as the sample size increases, having lower errors in p > n cases. The noteworthy results showed that the adverse events of Covid-19 vaccines were dependent on age, recovery, medical history, and prior vaccination with a remarkable reduction in MSE of the fitted values. No. of publications of Ph.D. students were dependent on the no. of children, and the no. of articles in the last three years.

Conclusions The models are robust enough to conduct both variable selections and produce effective fit because of their high shrinkage property and applicability to a broad range of biometric and public health high dimensional problems.

  • shrinkage priors
  • negative binomial regression
  • horseshoe
  • Dirichlet Laplace
  • MCMC
  • Polya-Gamma
  • vaccine
  • RNASeq
  • Covid-19 vaccine
  • data augmentation

1 Introduction

The extension of linear models to generalized linear models (GLMs) introduced by [1] framework to handle data that are not typically modeled using a normal distribution (e.g., binary, count data) is a moment of immense success in the history of statistics. Most of the real-life datasets have number of explanatory variables (p) greater than the number of observations (n).The methodology for analyzing RNA sequencing data is rapidly expanding. Methods having application of shrinkage, flexibility of the designs, are highly in demand [2]. High-dimensional predictor selection and sparse signal recovery are routine statistical and machine learning practices.Sparsity relies on the property of a few large signals among many (nearly) zero noisy observations. A common goal in high-dimensional inference is to recover the low-dimensional signals observed in noisy observations[3]. The idea of global-local shrinkage hierarchies[4] has become the foremost research areas in Bayesian literature that incorporates heavy tailed prior distributions for coefficients in generalized linear regression models. In the exponential rise in the development of research in shrinkage priors, works that have gained mass popularity are [5], [6], [7], [8], [9], [10], [11], [12], [13], [14] among many. An overview of several shrinkage priors with several data applications is given in [15]. Here we discuss the posterior simulation for negative binomial (NB) regression and Zero-inflated Negative binomial (ZINB) for count data. Our main focus is the utilization of the Polya-Gamma (PG) data augmentation strategy [16] which utilizes Polya-Gamma random variables to enhance posterior simulations. The performance of three different priors Horseshoe (H)[17],Dirichlet Laplace (DL) [14], Double Pareto (DP)[13] are measured and also the method is applied to benchmark data sets. Bayesian global-local (GL) shrink-age estimation is the state-of the art for Gaussian regression models, extension to non-standard regression techniques such as Poisson and NB are the ones that we concentrate in our methodology. Here two extensions of the global-local shrinkage framework is implemented. Firstly, the utilization of the PG data augmentation technique to generate simple algorithms for sampling with NB and ZINB regression likelihoods. Results show that the priors are highly competitive on the basis of mean square errors (MSE). Extensive simulation studies and real data applications are conducted to evaluate the performance of the these priors with respect to prediction accuracy and MSE for variable selection.

The rest of the section follows as section 2 describes in detail the tradition methods such as Poisson and NB and their Bayesian counterpart such as BNB and BZINB.Section3 explains the simulation details and parameters, followed by results 4. Real data scenarios are explained in Section 5, finally summarizing manuscript with a discussion 6.

2 Method

2.1 Poisson Regression

In modeling the number of times an event occurs a generalized linear model (GLM) such as Poisson or negative binomial regression is commonly applied.Let y = (y1, y2, …, yn) denote the vector of n count measurements of a dependent variable of interest, and xi = (xi,1, …, xi,p) denote the vector of predictors (explanatory variables, covariates) associated with the response yi. Let X = (x1, …, xn) be the n × p matrix of explanatory variables. The GLM contains Poisson distribution which has a probability function Embedded Image where the mean and variance are equal. E(yi) = var(yi) = λi. The log-likelihood is given by Embedded Image. In GLM, a non-linear transformation or a link function of the mean response λi is applied which is a linear function of the covariates X [18],[19]. The link function for Poisson regression is Embedded Image, where β0 is the intercept and β is the vector regression coefficients. Poisson distribution depending on single parameter λ is equi-dispersed. If the Poisson mean is assumed to have a random intercept term and this term enters the conditional mean function in a multiplicative manner, the following relationship is achieved [20]: Embedded Image Here, exp(β0 + ϵi) is defined as the random intercept; Embedded Imageis the log-link function between the mean E(yi) and the independent fixed covariate matrix X. The real data, however, often shows that the variance is larger than the mean, which is called overdispersion. The two-parameter negative binomial distribution has more flexible and can efficiently model overdispersed count data leading to correct standard errors and inferences [21]. Many parametric models for count data are obtained by additionally introducing a heterogeneity term in the Poisson model. Unobserved heterogeneity is usually included as a multiple of the Poisson mean. y | µ, ν ∼ Poi(µν) and the random heterogeneity term ν ≥ 0 is integrated out to obtain the distribution of y | µ. These model structures are well-known as doubly stochastic Poisson by Cox [22] and a Cox process by Kingman [23]. In general, E(ν) = 1 is the setting for several leading models. Different distributions of v leads to various generalizations of Poisson and here the Poisson-Gamma mixture is explained which leads to Negative Binomial distribution.

2.2 Negative Binomial as Poisson-Gamma Mixture

The NB model is derived from a Poisson-gamma mixture distribution [18]. The interpretation and derivation of NB from Poisson-Gamma is detailed in [24]. The heterogeneity parameter is assumed to have a Gamma distribution. νi ∼ Gamma(d, b). The two-parameter Gamma distribution is represented as Embedded Image Embedded Image, setting E(νi) = 1 we get d = b, leading to a one-parameter Gamma distribution with Embedded Image. Now the Poisson model Poi(µν) can have easier interpretation if worked with the transformation λ = μν, i.e. Embedded Image. The Jacobian is obtained as Embedded Image. The probability density function (p.d.f) of λ is then given as Embedded Image The Poisson-gamma mixture is Embedded Image The property of the gamma function is utilized in getting the above form: Embedded Image for any Embedded Image for any c ≤ 0.

The equation (4) can also be represented as Embedded Image Taking Embedded Imagewe get the probability mass function (p.m.f) of the NB distribution as Embedded Image Here, E(y | µ, α) = µ, V ar(y | µ, α) = µ(1 + αµ) ≥ µ, since α ≥ 0. The variance can also be represented as Embedded Image is the dispersion parameter. Most often it is expressed as Embedded Image, α is the parameter responsible for heterogeneity and models the over-dispersion amount in the data An alternative parametrization of NB used in many references as well as algorithms is Embedded Image where 0 < p < 1 and h ≥ 0. Then Embedded Imageand Embedded Image Letting Embedded Image, yields the same parametrization of NB distribution in (6) and h = d in (5).

The log-likelihood of NB can be expressed as Embedded Image The mean of (7) is Embedded Image The link function in this parametrization of (7) is log-odds, i.e. Embedded Image. So, the likelihood for the Negative Binomial distribution ignoring the constant term is Embedded Image , where Embedded Image. This β vector is assumed to include the intercept term, i.e. β = (β0, β1, …, βp).

2.3 Bayesian Negative Binomial Regression with Hierarchical Prior Structures (BNB)

The likelihood of NB (10) is not in a closed form and will require Polya-Gamma (PG) data augmentation. The auxillary variables will be sampled from Polya-gamma distribution with parameter yi + h and Embedded Image. The Polya-Gamma distribution is the exact distribution needed to augment this posterior for simulation thus obtaining closed form posterior distributions that can be easily handled via Gibbs sampling. The method is useful when modeling proportions on the log-odds scale. Binary logistic regression [?] and negative binomial regression (NB) are the two fore frontiers that meets the criteria. To facilitate posterior sampling, we introduce a set of auxillary variables that follow Polya-Gamma distribution that are represented as scale mixtures of normals. Conditional on the latent variables, inference proceeds via straightforward Gibbs sampling. A Polya-Gamma variable, w ∼ PG(b, c) with b > 0 and c ∈ ℝ, can be defined as follows. Embedded Image The variable distribution is similar to Gamma distribution and, as b increases, it becomes approximately normal [25]. So here Embedded Image and Embedded Image. Similar, to logistic regression case, the joint posterior density of the parameter β and W with prior π(β) is obtained as Embedded Image Here c(Y) is the marginal distribution of Y. We can see that Embedded Image which is our targeted density. The conditional independence of Yi and Wi implies that π(w | β, y) = f (w | β). Thus, we can draw from π(w | β, y) by making n independent draws from the Polya-Gamma distribution. The other conditional density π(β | w, y) is multivariate normal.

A hierarchical representation of the Horseshoe prior [9] is stated as Embedded Image Here, Σ of the distribution of β is a diagonal matrix with elements Embedded Image. From equations (12), and the above hierarchical prior structure, the full posterior distribution is given by: Embedded Image where h(wi) is obtained from Embedded Image The conditional distributions required for our analysis follows:

The conditional density of β given y, w is Embedded Image where, WD and Σ are diagonal matrices where the elements are (w1, w2, …, wn), Embedded Imagerespectively and, Embedded Image.

The conditional density of wi given xi, β is Embedded Image The conditional density of the hyper-parameters are as follows Embedded Image Again, here all the posterior densities are in the closed form, and follow simple densities like Normal, Polya-Gamma and Inverse-Gamma making sampling from them trivial. Exploiting the scale-mixture representation of the global-local shrinkage priors, it is straightforward to formulate the Gibbs sampler.

The hierarchical structure of the Dirichlet Laplace prior Bhattacharya:2015 is Embedded Image The conditional posterior distributions remain same for β | yi and wi | β is similar to that of equations (16) and (17).

The conditional density of the hyper-parameters as obtained similar to Theorem 2.2 in [14] are as follows: Embedded Image To sample π(ϕ | βj) sample Embedded Image. where GIG(a, b, c) is the Generalized Inverse Gaussian distribution with densityEmbedded Image.

The hierarchical structure of Double Pareto prior[26] is Embedded Image Again, the conditional densities of β | yi and wi | β remains same as (??). Here Σ = Dτ is a diagonal matrix with elements (τ1, τ2, …, τp).

The conditional density of rest of the hyper-parameters are as follows: Embedded Image

2.4 Bayesian Zero-Inflated Model with Hierarchical Prior Structures (BZINB)

Zero-inflated models represents the excess zeros, and a count distribution for the remaining values, thus forming a mixture of zeros. The model is very useful when there is an excess number of two types of zeros in the concerned response variable. By construction, zero-inflated models partition zeros into two types. The first type, typically referred to as a “structural” zero, corresponds to individuals who are not at risk for an event, and therefore have no opportunity for a positive count. The second type, termed the “at-risk” or “chance” zero, applies to a latent class of individuals who are at risk for an event but nevertheless have an observed response of zero[27]. For example, in our application with Covid-19 vaccine data set, examining the number of adverse events, the structural zeros might represent patients who had no adverse event thus have no recorded adverse event. In contrast, the at-risk zeros might correspond to patients with a single occurrence of adverse event which has been determined not clinically significant, thus contributes to at-risk zero. Similarly for RNA Seq datasets, where the genes are counts containing a high proportion of zeros, the zero-inflated models can be viewed as latent class models in which the classes are formed by the two types of zeros. The zero-inflated model has two parts that models consisting of negative binomial distribution, and the logit distribution. Embedded Image Embedded Image ,where Embedded Image where NB(µi, r) takes the form of equation 6. Here yi is the count response for the ith individual. The latent indicator variable wi that takes values 1 and 0 with probabilities pi and 1 − pi where yi ∼ NB(µi, r) and yi = 0. The indicatorwi variable is binary, thus modelled with logistic regression as follows: Embedded Image The negative binomial distribution has likelihood similar to equation 10 and the posterior distribution of the coefficients β are obtained from equation 12. With the hierarchical prior structures on the coefficients β from section 2.3, The posterior distribution for both alpha and β are modelled similarly as 2.3 citebhattacharyya2021applications, Neelon:2019.

3 Simulation

Here each of the simulation structure along with their parameters are defined.

3.1 BNB

The data is generated utilizing the Poisson-gamma mixture representation of NB as expressed in (4). In the data generation process, the true values of β, are defined. The data is generated as follows:

  1. Calculate Embedded Image

  2. Get Embedded Image, where νi ∼ Gamma(h, 1)

  3. Generate yi ∼ Poi(λi)

The covariates are generated from multivariate normal distribution with mean vector 0 and covariance matrix Σ. 80% of the data set is reserved for the training set, and 20% for the test data set.

  • S1: n = 200, p = 10 with a correlation of about ρ = 0.5 among the covariates.

    The coefficient vector is given by Embedded Image with 5 non-zero coefficients.

  • S2: n = 120, p = 10, ρ = 0.2 Five non-zero coefficients.

    Embedded Image

  • S3: n = 50, p = 10, coefficients similar to S2.

  • Embedded Image

  • Embedded Image

Here we have considered two broad settings of design matrices comprising p = 200 and p = 500 covariates. The sampling of β for p > n with p = 500 is conducted from a Gaussian distribution and follows the fast sampling algorithm of [28]. For prediction accuracy, the mean squared error (MSE) is used as a prediction accuracy criteria, which is defined as Embedded Image. In terms of variable selection performance, the number of the truly nonzero coefficients which are incorrectly set to zero (FP), and the number of the true zero coefficients which are correctly set to zero (FN). The higher the values of FP, and the lower the values of FN, the better the variable selection performance is. The variable selection is determined by posterior credible intervals. For each simulation setting, 100 datasets were generated, and MSE, results were calculated by averaging over these 100 datasets. MCMC is used for sampling which relies on block-updating the sets of parameters. The no. of simulation runs is 10000 with 6000 as burn-in. Trace plots, auto-correlation plots, Geweke z-statistics [29] were some of the diagonistic criterion. R pacakge coda [30] was used to conduct the posterior sample diagnosis and pgdraw [31] [32] was used to generate the Polya-Gamma random variates. All the computations have been carried out in RStudio. The R package coda [30] was used to conduct the posterior sample diagnosis and pgdraw [31] [32] was used to generate the Polya-Gamma random variates.

3.2 BZINB

The following simulation scenarios were set up to understand the behaviour of the model. We followed similar data generation process as of [27].

  • Sim1: n = 1000, p = 4, α = (0.5, 0.5, −0.25, 0.25), β = (0.5, −1.00, 0.75, −0.25), r = 1

  • Sim2: n = 500, p = 10, α = (0.5, 0.5, −0.25, 0.25, 0.5, 0.5, −0.25, 0.5, 0.5, −0.25),β = (0.5, 0.5, 0.75, −0.25, 0.5, 0.75, 0.25, 0.75, −0.25), r = 1

  • Sim3: Embedded Image

4 Simulation Results

Table 1 shows the variable selection performance and MSE along with their standard deviations for these simulation scenarios for BNB model. Comparing S1, S2, and S3 accuracy and sensitivity were similar with the decrease in sample size for all the three priors. The MSE for β and their standard errors increases for all the three priors as the sample size decreases. From S1 and S2 and S3 are three scenarios having similar settings with the N/P ratio changing, there is a clear trend that with the decrease in sample size increases the MSE, S1 and S2 has 100% sensitivity implying all the non-zero β can be identified. There is not much of an impact of correlation on the results. S5 has 1% non-zero β with low effect sizes but almost 67% of the time they are identified by the 2 priors DL and Horseshoe. DP fails to identify the non-zeros, though keeping the MSEs low.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 1:

Variable Selection Performance among BNB Simulation Scenarios

In summary, it is obvious that the simulation results has demonstrated the use of these three priors unanimously with the low standard errors across datasets. Trace plots, Geweke diagnostics and Monte Carlo standard errors were indicative of convergence and showed reasonable mixing fora range of model parameters. We tested the sampler for various settings of n, p and coefficients. Comparing S2, and S3 the MSE for β and their standard errors increases for all the three priors and drastically reduces the MSE of the fitted values as well. The increase in correlation in S4 than S3 doesn’t seem have much effect. As nandp increases from S1 to S2 there is a substantial decrease in MSE of β but not for the fitted values. S5, S7 and S12 compare different sample size under similar coefficient and simulation settings. MSE and their standard error decreases as their sample size increases. S9, S10 and S11 again compare sample sizes n = 200, 100, 8 for a different set of coefficient settings. Here all the βs were non-zero. Again the decrease in MSE with the increase of sample size is noticeable and also in the MSE of fitted values. In the all the above n > p scenarios we see that Horseshoe, and Double Pareto perform comparatively better than Dirichlet Laplace. For p > n scenarios S13 and S6, with the increase in the n/p ratio increases MSE increases. In summary, it is obvious that the simulation results has demonstrated the use of these three priors unanimously. We build on the R codes from this package and extend it with the three different prior structures.

In BZINB model, Horseshoe prior clearly outperforms the other two priors across simulation scenarios. The Sim3 p > n case has similar performance across priors. The variable selection performances are also measured with the MSEs and all the three priors performed reasonably Table 2.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 2:

Comparison of MSE among BZINB Simulation Scenarios

5 Real Data Application

Here three real data applications are considered. In all the applications the count data and the excess amounts of zeros in the outcome made BNB and BZINB regression a best fit for the data. The variables selected by the three priors for BNB and BZINB, Poisson regression and NB models for each of the datasets is in the table 3. For the three priors, the starting values are chosen as the maximum likelihood estimates from NB Regression. The MSE for the coefficients for the tree priors are obtained by Embedded Image, p is the number of coefficients, βi: estimates obtained from NB regression and Embedded Image are the posterior mean estimates. This MSE measure can also be interpreted as a departure from the frequentist estimates.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 3:

Variable Selection in Real World Data

5.1 No. of PhD publications

The first one deals with the number of publications produced by Ph.D. biochemists of [33]. It is available in the R package “pscl”[34].The response variable is the number of articles in the last three years of Ph.D. Five explanatory variables were used. They are: the gender (x1), the marital status (x2), the number of children under age six (x3), prestige of Ph.D. program (x4), and the number of articles by the mentor in last three years (x5). In this application, the response variable is following NB distribution. (x3) and (x5) were selected by the Bayesian methods. The NB, poisson followed by BNB methods were suitable for the data set, as per the MSE of fitted values.

5.2 Nuts data

In the second real data application, we considered the nuts dataset [18]. Here, n = 52 and p = 7. The nuts dataset defines the squirrel behavior and several features of the forest across different plots in Scotland’s Abernathy Forest. It is available in the R package “COUNT”[35].The response variable is the number of cones stripped follows negative binomial distribution. The explanatory variables are: the number of trees per plot (x1), the number of DBH per plot (x2), mean tree height per plot (x3), canopy closure (as a percentage) (x4), standardized number of trees per plot (x5), standardized mean tree height per plot (x6), standardized canopy closure (as a percentage) (x7). Here we use x5, x6, x7.

The variables chosen by all the methods except NB were x5, x6, x7. So all the models seem reasonable except NB. The variables selected in the first two datasets (Biochemists, and NUTS) are included withing the set of the variables selected either by the traditional Poisson or NB regression.

5.3 US National Medical Expenditure Survey

The third data set originated from the US National Medical Expenditure Survey (NMES) conducted in 1987 and 1988. The NMES is based upon a sample of the civilian non-institutionalized population and individuals admitted to long-term care facilities during 1987. The data are a sub-sample of individuals ages 66 and over all of whom are covered by Medicare (a public insurance program providing substantial protection against health-care costs). It is available in the R package “AER”[36]. It is a data frame containing 4,406 observations on 19 variables. The response variable considered is the number of physician office visits among other type of count variables present (emergency visits, no. of non-physician hospital outpatient visits etc.). We have considered 14 dependent variables hospital: number of hospital stays x1; health:self-perceived health status x2, levels are “poor”, “average”, “excellent”; chronic:number of chronic conditions x3; adl: indicator of whether individual having limits in activities of daily living x4 levels: “limited”,”normal”; region: indication region of the individual x5 levels northeast, midwest, west, other; age x6; afam (race): If African-American x7; gender:male/female x8; married: marital status x9; school:number of years of education x10; income (USD) x11; employment x12; insurance: whether the individual is covered by private insurance yes/No x13; medicaid x14. The MSE for the fitted values are also given. Figure 1 the posterior distribution of the 14 variables and their respective confidence intervals. For the NMES data, the three priors selects 10 important out of 14 variables, Poisson selects 13 and NB 8 of them. It seems that the three priors perform better than NB where they do include the relevant variables such as limited activity level x4 and race x7 but doesn’t do over-fitting such as Poisson.

Figure 1:
  • Download figure
  • Open in new tab
Figure 1:

Posterior distribution with 95% credible intervals: NMES Data

5.4 PBMC RNA-Seq data

The data set that is analyzed here is taken from https://satijalab.org/seurat/articles/pbmc3ktutorial.html. The RNASeq data is analyzed as a per gene model where each gene is the outcome and the cell type as the covariate. The top 10 genes selected by BNB and BZINB are in Table 3. The genes common between H and DL are 2742, 6369, 9338, 3665,7203,11458, 2217, 4360,4368.The genes common between DL and DP are 2742,13706,6369,9338,11458,7203, 2217, and 4368. The genes common between DP and H are 2742,6369,9338,3665,11458,7203,2217,3665,4368,4368. The genes common between three priors are 2742, 6369, 9338,7203,11458,3665, 2217, 4368. The common genes between H and DL are 74, 267, 1804,3417,6049,8146. The genes that belong to the intersection of DL and DP are 74, 267,1804, 3417, and 6049. The genes common between DP and H are 74, 267, 1804, 3417, and 6049. The genes common between the three priors are 74, 267, 1804, 3417, and 6049. The genes selected by both the models are selected by ranking them by the least MSE.

5.5 Covid-19 Vaccine Data

This data set consisted of the Covid-19 vaccine administered over the year 2021 and the symptoms (adverse events) gathered from the administration of vaccines. This data is a part of the Vaccine Adverse Event Reporting System (VAERS) which was created by the Food and Drug Administration (FDA) and Centers for Disease Control and Prevention (CDC) to receive reports about adverse events that may be associated with vaccines.The variables that are considered are age (x1), sex (x2), if there is life threat or not (x3), if there was emergency room visit (x4), if hospitalized (x5), number of hospital days (x6), no. of extended stay (x7), disability status (x8), recovery status (x9), medical history (x10), other medications (x11), laboratory data (x12), disease during vaccination (x13), prior vaccination status (x14), allergy status (x15), doctor’s office visit (x16), emergency visit (x17), vaccination route (x18), vaccination dose (x19), vaccine manufacturer (x20).The dimension of the cleaned data set post-processing was having 100 samples and 20 variables. The number of days between vaccination and onset of adverse symptoms is treated as the response variable (y) that had a median of 1 day, mean of 7 days, maximum minimum of 0, and 2.43 years respectively. The results from the BNB model, BZINB model, and both these models with respective priors are given in Table 3. We applied the region of practical equivalence (ROPE) method [37] was utilized to select the variables after obtaining the posterior samples. The variables that were significant by the ROPE method with 5% cut-off for ROPE by mainly the Horseshoe and the DP priors for both the BNB and BZINB models belonged to the super set of the set of the variables. The variables age (x1), sex (x2), if there is life threat or not (x3), if there was emergency room visit (x4), no. of extended stay (x7), other medications (x11), laboratory data (x12), disease during vaccination (x13), prior vaccination status (x14), allergy status (x15) belonged to the intersection of the two sets of variables identified by the BNB and BZINB models with Horseshoe prior, which was a significant overlap along with matching the variables identified by the traditional methods without shrinkage priors. In general, the BZINB with DP and Horseshoe priors were able to select more variables than the BNB model. Specifically, variables such as medical history (x10), prior vaccination status (x14), vaccination route (x18), vaccine manufacturer (x20) were some of the interesting features that seem to influence the results. All the shrinkage prior models surpassed the traditional models in their MSE of fitted values.

5.6 Covid-19 RNASeq Data

The data is taken from the article [38]. We selected the GSE152075 dataset from Gene Expression Omnibus (GEO: https://www.ncbi.nlm.nih.gov/gds), which contained RNA-seq data from 430 SARS-CoV-2 positive and 54 negative patients [?]. The data is analyzed similarly with the help of per gene model where each RNA-seq (count variable) is modelled against the covid-19 positivity status which is a binary variable. Upon pre-processing the following top 10 genes were selected by ranking the genes with the lowest MSE with the BZINB and BNB methods are in Table 3. The BZINB method was able to select about 30 genes that were not selected by the top BNB method. With the BNB method,there were 4 genes commonly selected by DL and H which are 4903, 7767, 11222, and 12202.The common genes between DL and DP are 8523,12202,13109 and between DP and H are 8998, 9016, 9856, 12202. Gene no. 12202 was selected by all three priors with BNB model.

6 Discussion

Our simulations showed that the approach performs well across a range of scenarios. The numerical results provide additional numerical and theoretical insights into the properties of global-local shrinkage priors including high-dimension case. Variable selection is a very helpful procedure for improving computational speed and prediction accuracy by identifying the most important variables that related to the response variable. The number of counts for each observation if large can make the sampler perform poorly as the Polya-Gamma augmentation will not be efficient as it require generation of Polya-Gamma random variates equal to the number of observations[16]. The generation of such random variates is also time consuming.The Polya-Gamma procedure is in general fast, easy to implement and flexible. The rate at which one can generate Polya-Gamma random variates is a key factor in the efficiency of the Polya-Gamma scheme; hence, building fast samplers is essential. R packages such as “bayesreg”[31] deals with high-dimensional Bayesian regularised regression. Alternative models such as the BNB model, BZINB models [27], truncated models, or quantile count models provide potential future research guidelines. The use of shrinkage priors with next generation sequencing data such as RNA-Seq data with Zero-Inflated or Negative Binomial models are also areas that needs further exploration [2]. All the three priors have their own advantage and caveats. A computationally efficient Bayesian approach for variable selection is proposed here that performs quite well in simulation scenarios and provide consistent results in these different case studies. One disadvantage of data augmentation schemes is that the number of latent variables is of the order of sample size. Hence for large n, the computation can slow down, as we have seen in settings with sample sizes greater than 1000. However, this disadvantage is offset by the gains we have over the traditionally used Metropolis-Hastings, which requires choosing proposal distributions and would probably generate a considerable number of rejection steps. Additionally, the model can be extended with other shrinkage priors, and other models such as hurdle models. More generally, the proposed method can be applied in scenarios where interest lies in modeling count data within a Bayesian inferential framework and exhaustive comparison of existing shrinkage priors in the literature. We believe that the rigorous, yet simple and systematic nature of Bayesian inference coupled with the latest advances in technology in high dimensional and next generation sequencing with RNA Seq data might strongly help in contributing to expanding the research field.

Data Availability

The datasets used and/or analysed are publicly available and information about it is included in this article.

Funding

This work was supported by the National Institute of Health grant P42 ES023716 to principal investigator: Dr S Srivastava and the National Institute of Health grant 1P20 GM113226 to principal investigator: Dr C McClain. Dr. Shesh Rai was also partially supported by Wendell Cherry Chair in Clinical Trial Research.

Availability of data and materials

The datasets used and/or analysed are publicly available and information about it is included in this article. Not applicable

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable

Authors’ contributions

A.B. has contributed to the methodology, data collection, analysis, and writing of the manuscript. R.M. and S.P. has contributed to the methodology. S.N.R. has contributed to developing ideas, analysis and valuable comments. All authors have contributed to the final preparation of the manuscript.

7 Acknowledgements

This work was supported by the National Institute of Health grant P42 ES023716 to principal investigator: Dr S Srivastava and the National Institute of Health grant 1P20 GM113226 to principal investigator: Dr C McClain. Dr. Rai was also partially supported by Wendell Cherry Chair in Clinical Trial Research.

Abbreviations

SSVS
Stochastic Search Variable Selection
GL
global-local
HS
Horseshoe
DL
Dirichlet Laplace
DP
Double Pareto
PG
Polya-Gamma
DA
Data-Augmentation
MCMC
Markov Chain Monte Carlo
MSE
Mean Squared Error
VS
variable selection
BZINB
Bayesian Zero-Inflated Negative Binomial
BNB
Bayesian Negative Binomial
NB
Negative Binomial
ZINB
Zero-Inflated Negative Binomial;

References

  1. 1.↵
    Nelder, J.A., Wedderburn, R.W.: Generalized linear models. Journal of the Royal Statistical Society: Series A (General) 135(3), 370–384 (1972)
    OpenUrlCrossRefWeb of Science
  2. 2.↵
    Van De Wiel, M.A., Leday, G.G., Pardo, L., Rue, H., Van Der Vaart, A.W., Van Wieringen, W.N.: Bayesian analysis of rna sequencing data by estimating multiple shrinkage priors. Biostatistics 14(1), 113–128 (2013)
    OpenUrlCrossRefPubMedWeb of Science
  3. 3.↵
    Bhadra, A., Datta, J., Polson, N.G., Willard, B., et al: Lasso meets horseshoe: A survey. Statistical Science 34(3), 405–427 (2019)
    OpenUrl
  4. 5.↵
    Polson, N.G., Scott, J.G.: Shrink globally, act locally: Sparse bayesian regularization and prediction. Bayesian statistics 9, 501–538 (2010)
    OpenUrl
  5. 5.↵
    Bae, K., Mallick, B.K.: Gene selection using a two-level hierarchical bayesian model. Bioinformatics 20(18), 3423–3430 (2004)
    OpenUrlCrossRefPubMedWeb of Science
  6. 6.↵
    Bhadra, A., Datta, J., Polson, N.G., Willard, B., et al: The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis 12(4), 1105–1131 (2017)
    OpenUrl
  7. 7.↵
    Griffin, J., Brown, P., et al: Hierarchical shrinkage priors for regression models. Bayesian Analysis 12(1), 135–159 (2017)
    OpenUrl
  8. 8.↵
    Ishwaran, H., Rao, J.S., et al: Spike and slab variable selection: frequentist and bayesian strategies. The Annals of Statistics 33(2), 730–773 (2005)
    OpenUrl
  9. 9.↵
    Makalic, E., Schmidt, D.F.: A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters 23(1), 179–182 (2015)
    OpenUrl
  10. 10.↵
    Park, T., Casella, G.: The bayesian lasso. Journal of the American Statistical Association 103(482), 681–686 (2008)
    OpenUrlCrossRefWeb of Science
  11. 11.↵
    Piironen, J., Vehtari, A.: On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. arXiv preprint arxiv:1610.05559 (2016)
  12. 12.↵
    Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288 (1996)
    OpenUrlWeb of Science
  13. 13.↵
    Armagan, A., Dunson, D.B., Lee, J.: Generalized double pareto shrinkage. Statistica Sinica 23(1), 119 (2013)
    OpenUrl
  14. 14.↵
    Bhattacharya, A., Pati, D., Pillai, N.S., Dunson, D.B.: Dirichlet–laplace priors for optimal shrinkage. Journal of the American Statistical Association 110(512), 1479–1490 (2015)
    OpenUrlCrossRefPubMed
  15. 15.↵
    Van Erp, S., Oberski, D.L., Mulder, J.: Shrinkage priors for bayesian penalized regression. Journal of Mathematical Psychology 89, 31–50 (2019)
    OpenUrlCrossRef
  16. 16.↵
    Polson, N.G., Scott, J.G., Windle, J.: Bayesian inference for logistic models using pólya–gamma latent variables. Journal of the American statistical Association 108(504), 1339–1349 (2013)
    OpenUrlCrossRef
  17. 17.↵
    Carvalho, C.M., Polson, N.G., Scott, J.G.: The horseshoe estimator for sparse signals. Biometrika 97(2), 465–480 (2010)
    OpenUrlCrossRefWeb of Science
  18. 18.↵
    Hilbe, J.M.: Negative Binomial Regression. Cambridge University Press, ??? (2011)
  19. 19.↵
    McCullagh, P.: Generalized Linear Models. Routledge, ??? (2018)
  20. 20.↵
    Cameron, A.C., Trivedi, P.: Regression analysis of (1998)
  21. 21.↵
    Lehman, R.R., Archer, K.J.: Penalized negative binomial models for modeling an overdispersed count outcome with a high-dimensional predictor space: Application predicting micronuclei frequency. PloS one 14(1) (2019)
  22. 22.↵
    Cox, D.R.: Some statistical models related with series of events. Journal of the Royal Statistical Society Series B 17, 129–164 (1955)
    OpenUrl
  23. 23.↵
    Kingman, J.: On doubly stochastic poisson processes. In: Mathematical Proceedings of the Cambridge Philosophical Society, vol. 60, pp. 923–930 (1964). Cambridge University Press
    OpenUrl
  24. 24.↵
    Greenwood, M., Yule, G.U.: An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. Journal of the Royal statistical society 83(2), 255–279 (1920)
    OpenUrlCrossRef
  25. 25.↵
    Glynn, C., Tokdar, S.T., Howard, B., Banks, D.L., et al: Bayesian analysis of dynamic linear topic models. Bayesian Analysis 14(1), 53–80 (2019)
    OpenUrl
  26. 26.↵
    Albert, J.H., Chib, S.: Bayesian analysis of binary and polychotomous response data. Journal of the American statistical Association 88(422), 669–679 (1993)
    OpenUrlCrossRefWeb of Science
  27. 27.↵
    Neelon, B., et al: Bayesian zero-inflated negative binomial regression based on pólya-gamma mixtures. Bayesian Analysis 14(3), 849–875 (2019)
    OpenUrl
  28. 28.↵
    Bhattacharya, A., Chakraborty, A., Mallick, B.K.: Fast sampling with gaussian scale mixture priors in high-dimensional regression. Biometrika, 042 (2016)
  29. 29.↵
    Geweke, J., et al: Evaluating the Accuracy of Sampling-based Approaches to the Calculation of Posterior Moments vol. 196. Federal Reserve Bank of Minneapolis, Research Department Minneapolis, MN, ??? (1991)
  30. 30.↵
    Plummer, M., Best, N., Cowles, K., Vines, K.: Coda: convergence diagnosis and output analysis for mcmc. R news 6(1), 7–11 (2006) 1.
    OpenUrl
  31. 31.↵
    Makalic, E., Schmidt, D.F.: High-dimensional bayesian regularised regression with the bayesreg package. arXiv preprint arxiv:1611.06649 (2016)
  32. 32.↵
    Makalic, E., Schmidt, D.: High-Dimensional Bayesian Regularised Regression with the BayesReg Package. arxiv:1611.06649v3
  33. 33.↵
    Long, J.S.: The origins of sex differences in science. Social forces 68(4), 1297–1316 (1990)
    OpenUrlCrossRefWeb of Science
  34. 34.↵
    Jackman, S., Tahk, A., Zeileis, A., Maimone, C., Fearon, J., Meers, Z., Jackman, M.S., Imports, M.: Package ‘pscl’. See http://github.com/atahk/pscl (2017)
  35. 35.↵
    Hilbe, J.M.: COUNT: Functions, Data and Code for Count Data. (2016). R package version 1.3.4. https://CRAN.R-project.org/package=COUNT
  36. 36.↵
    Kleiber, C., Zeileis, A., Zeileis, M.A.: Package ‘aer’. R package version 1.2 4 (2020)
  37. 37.↵
    Kelter, R.: Analysis of bayesian posterior significance and effect size indices for the two-sample t-test to support reproducible medical research. BMC Medical Research Methodology 20(1), 1–18 (2020)
    OpenUrlCrossRef
  38. 38.↵
    Sanchis, P., Lavignolle, R., Abbate, M., Lage-Vickers, S., Vazquez, E., Cotignola, J., Bizzotto, J., Gueron, G.: Analysis workflow of publicly available rna-sequencing datasets. STAR protocols 2(2), 100478 (2021)
    OpenUrl
Back to top
PreviousNext
Posted July 15, 2022.
Download PDF
Data/Code
Email

Thank you for your interest in spreading the word about medRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Bayesian Shrinkage Priors in Zero-Inflated and Negative Binomial Regression models with Real World Data Applications of COVID-19 Vaccine, and RNA-Seq
(Your Name) has forwarded a page to you from medRxiv
(Your Name) thought you would like to see this page from the medRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Bayesian Shrinkage Priors in Zero-Inflated and Negative Binomial Regression models with Real World Data Applications of COVID-19 Vaccine, and RNA-Seq
Arinjita Bhattacharyya, Riten Mitra, Shesh Rai, Subhadip Pal
medRxiv 2022.07.13.22277610; doi: https://doi.org/10.1101/2022.07.13.22277610
Twitter logo Facebook logo LinkedIn logo Mendeley logo
Citation Tools
Bayesian Shrinkage Priors in Zero-Inflated and Negative Binomial Regression models with Real World Data Applications of COVID-19 Vaccine, and RNA-Seq
Arinjita Bhattacharyya, Riten Mitra, Shesh Rai, Subhadip Pal
medRxiv 2022.07.13.22277610; doi: https://doi.org/10.1101/2022.07.13.22277610

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Public and Global Health
Subject Areas
All Articles
  • Addiction Medicine (349)
  • Allergy and Immunology (668)
  • Allergy and Immunology (668)
  • Anesthesia (181)
  • Cardiovascular Medicine (2648)
  • Dentistry and Oral Medicine (316)
  • Dermatology (223)
  • Emergency Medicine (399)
  • Endocrinology (including Diabetes Mellitus and Metabolic Disease) (942)
  • Epidemiology (12228)
  • Forensic Medicine (10)
  • Gastroenterology (759)
  • Genetic and Genomic Medicine (4103)
  • Geriatric Medicine (387)
  • Health Economics (680)
  • Health Informatics (2657)
  • Health Policy (1005)
  • Health Systems and Quality Improvement (985)
  • Hematology (363)
  • HIV/AIDS (851)
  • Infectious Diseases (except HIV/AIDS) (13695)
  • Intensive Care and Critical Care Medicine (797)
  • Medical Education (399)
  • Medical Ethics (109)
  • Nephrology (436)
  • Neurology (3882)
  • Nursing (209)
  • Nutrition (577)
  • Obstetrics and Gynecology (739)
  • Occupational and Environmental Health (695)
  • Oncology (2030)
  • Ophthalmology (585)
  • Orthopedics (240)
  • Otolaryngology (306)
  • Pain Medicine (250)
  • Palliative Medicine (75)
  • Pathology (473)
  • Pediatrics (1115)
  • Pharmacology and Therapeutics (466)
  • Primary Care Research (452)
  • Psychiatry and Clinical Psychology (3432)
  • Public and Global Health (6527)
  • Radiology and Imaging (1403)
  • Rehabilitation Medicine and Physical Therapy (814)
  • Respiratory Medicine (871)
  • Rheumatology (409)
  • Sexual and Reproductive Health (410)
  • Sports Medicine (342)
  • Surgery (448)
  • Toxicology (53)
  • Transplantation (185)
  • Urology (165)