Molecular Group and Correlation Guided Structural Learning for Multi-Phenotype Prediction
=========================================================================================

* Xueping Zhou
* Manqi Cai
* Molin Yue
* Juan Celedón
* Ying Ding
* Wei Chen
* Yanming Li

## Abstract

We propose a supervised learning algorithm to perform feature selection and outcome prediction for genomic data with multi-phenotypic responses. Our algorithm particularly incorporates the genome and/or phenotype grouping structures and phenotype correlation structures in feature selection, effect estimation, and outcome prediction under a penalized multi-response linear regression model. Extensive simulations demonstrate its superior performance over its competing methods. We apply the proposed algorithm to two omics studies. In the first study, we identified novel association signals between multivariate gene expressions and high-dimensional DNA methylation profiles, providing biological insights into how CpG sites regulate gene expressions. The second study is for cell type deconvolution. Using the proposed algorithm, we were able to achieve better cell type fraction predictions using high-dimensional gene expression data.

Key words
*   association study
*   cell-type deconvolution
*   DNA methylation
*   feature selection
*   genomic grouping structure
*   multi-type prediction
*   multivariate regression

## Introduction

Rapid advances in modern bio-technologies have yielded massive omics data, including genetics, transcriptomics, proteomics, and epigenetics among many others. Such datasets are high-dimensional, where the number of features exceeds the sample size. In association studies, it is often that only a small proportion of the features are associated with the outcomes of interest. To identify the truly associated features among high-dimensional candidates, regularization methods are commonly employed to shrink the coefficients of noises to zeros. Another intrinsic characteristic of omics data is the genome hierarchy grouping structures. For example, single nucleotide polymorphisms (SNPs) can be grouped by their harboring genes, and genes can be grouped by functional pathways. Regularized regression methods with high-dimensional predictors that take into consideration of predictor grouping structures have been developed [30, 35], and they have been shown to provide better prediction performance compared to their counterparts without considering the predictor grouping structures.

Recent arising research interest has been focused on omics studies with multivariate responses. Two distinct approaches are usually adopted in multivariate response association studies. The first approach involves carrying out separate association studies on one response at a time. The second approach involves conducting one joint association study with all responses together. Statistical methods for the first approach have been widely studied. The top performers include, but not limited to, the least absolute shrinkage and selection operator (Lasso) [31], the elastic net [37], the group Lasso [35] and the sparse group Lasso [30]. One drawback of such an approach is that it treats response variables as independent and overlooks the correlation and grouping structures among the responses. Moreover, the first approach is typically followed by multiple comparisons for post-selection hypothesis testing, which can be stringent and result in excessive false negatives, particularly when the dimension of the responses is high.

The available methods for the second approach are still relatively limited. Li, Nan and Zhu [21] introduced the multivariate sparse group Lasso (MSGLasso), which performs signal selection between high-dimensional multivariate responses and multiple predictors while taking into account the grouping structures on both. MSGLasso outperforms the univariate response approaches in terms of outcome prediction and feature selection. However, MSGLasso does not explicitly consider the response covariance structure. Wilms and Croux [32] later proposed a multivariate group Lasso with covariance estimation that takes into account the covariance structure of the responses. However, it only considered the group-level sparsity, but not the within-group sparsity.

In this study, we propose an extension of MSGLasso – the **Brilliant**: Biological gRoup guIded muLtivariate muLtiple lIneAr regression with peNalizaTion. Same to MSGLasso, Brilliant utilizes the intrinsic genome hierarchy grouping structures in model fitting and predictions. It selects both the important group-level and therein the important individual-level association signals. Different to MSGLasso, Brilliant explicitly incorporates the covariance structure of multivariate responses. By doing so, the proposed Brilliant method improves the performance of feature selection, outcome prediction, and coefficient estimation significantly compared to MSGLasso and other univariate response approaches.

The rest of the paper is structured as follows. In section 2, we introduce the detailed algorithm of Brilliant. Section 3 uses simulation studies to demonstrate the performance of Brilliant. Section 4 presents two real data applications. A discussion is given in Section 5.

## Method

Let *n* denote the number of samples, consider multivariate linear regression model ![Formula][1]</img>  where **Y** = (**y**1, …, **y***q*) ∈ ℝ*n*×*q* is the response data matrix of *n* samples and *q* responses, and **X** = (**x**1, …, **x***p*) ∈ ℝ*n*×*p* is the predictor matrix of *n* samples and *p* predictors, **B** ∈ ℝ*p*×*q* is the coefficient matrix, and **E** ∈ ℝ*n*×*q* is the matrix of random errors. Assume that each row vector of **E** follows a multivariate normal distribution *N* (**0, Σ**) with a zero mean vector and a *q* ×*q* covariance matrix **Σ**.

Suppose **X** has *P* groups and **Y** has *Q* groups. They together introduce *P* ×*Q* intersection block groups [21] on the coefficient matrix **B**. Let 𝒢 be the set of all *P* × *Q* groups and **B***g* be a single block group in 𝒢, *g* ∈ 1, …J, *G* = *P* × *Q*. For a group **B***g*, denote its *L*2 norm by ![Graphic][2]</img>.

Brilliant considers the following multi-responses optimization problem. ![Formula][3]</img>  where ![Graphic][4]</img> is a working version of the precision matrix **Ω** = **Σ**−1. When **Ω** is a diagonal matrix, Brilliant reduces to MSGLasso. Non-negative tuning parameters *λ*1 and *λ*2 are for the Lasso penalty and group Lasso penalty, respectively. The non-negative *w**g*s are adaptive weights for each individual group. In this paper, we set ![Graphic][5]</img> equal to total number of entries in the group *B**g*.

The working precision matrix ![Graphic][6]</img> can be estimated from the data. For example, using the centralized responses, one can calculate ![Graphic][7]</img> by directly invert the sample covariance matrix in a low dimensional setting or using graphical lasso [12] in a high-dimensional setting. In this paper, we propose to use a block diagonal working precision matrix with the diagonal blocks obtained by inverting the sample variance-covariance matrices of responses within each groups and the off-diagonal blocks set to be zero. We show that using a adequitely specified ![Graphic][8]</img>, Brilliant can outperform MSGLasso.

Let ![Graphic][9]</img> be the solution to (2) and *β**jk* be its entry on the *j*th row and *k*th column. For a given grouping structure 𝒢 on **B** and a given set of tuning parameters, an iterative algorithm adopted from the mixed coordinate descent algorithm [21] for coefficient estimation is summarized as in Algorithm 1. In the original mixed coordinate descent algorithm [21], the outcome covariance structure was not explicitly incorporated into the estimation. The Brilliant specifically incorporates the outcome covariance structure by integrating the user-specified working precision matrix into the mixed coordinating algorithm. In Algorithm 1, the estimates of *β**jk* by Brilliant depend on ![Graphic][10]</img> and *S**jk*, where ![Graphic][11]</img> is the *kk*-the element (*k*-th diagonal element) of the working precision matrix ![Graphic][12]</img> and ![Graphic][13]</img> with **B**(−*jk*) being the *jk*-th element of **B** replaced by zero and ![Graphic][14]</img> being the *k*th column vector of the working precision matrix. Incorporating the outcome covariance structure can be beneficial especially when the covariance structure of the outcomes is priory known or when the outcomes are highly correlated within or beteen groups. It is important to note that Brilliant incorporates both outcome correlations from within group and across groups, which is evidented in *S**jk*. Please refer to Appendix 6 for a detailed proofs of the updating formules in algorithm 1.

To reduce false positives, we further threshold the estimated coefficient matrix ![Graphic][15]</img> with a pre-defined non-negative threshold *b**thr*. That is, ![Graphic][16]</img>, where 1 ≤ *j* ≤ *p*, 1 ≤ *k* ≤ *q*. Here *b**thr* is a hyper-parameter to be tuned from the data.

## Simulation studies

In our simulation studies, we compared the performance of Brilliant with that of the MSGLasso using the R package *MSGLasso* [21], and of the univariate method Lasso using the package *glmnet* [11]. The total number of predictors was set to 500. These predictors were separated into 10 non-overlapping groups with each group containing 50 predictors. Predictors within the same group were correlated with either a compound symmetry (CS) or a first-order auto-regression (AR1) structure with a correlation coefficient *ρ* = 0.5. Predictors from different groups were set to be independent. The predictor vector for each subject was independently generated from multivariate normal distributions with the specified correlation structures, marginal variances 1, and zero means. The total number of responses was set to be 100. Responses were separated into four non-overlapping groups and each group included 20 responses. For a given grouping structure G, we further assume the variables within a group are correlated, while variables from different groups are uncorrelated. That is equivalent to say that the true covariance matrix **Σ** assumes the form of a block diagonal matrix. The error vector for each subject was also independently generated from multivariate normal distributions with the same correlation structure as the predictor vector and a correlation coefficient *ρ* taking value in {0.1, 0.3, 0.5, 0.7, 0.9}. The covariance structures of each response group are specified in Table 1.

View this table:
[Table 1.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/T1)

Table 1. Within-group covariance structures of **X** and **E** in each simulation setting.

The coefficient matrix **B** was created to have a group structure with *P* × *Q* intersection groups. To achieve group-level sparsity of **B**, we randomly selected 1*/*5 of all groups to be non-zero groups, or groups with informative features. To attain within-group sparsity, we randomly selected 1*/*2 of the entries within a non-zero group and set to 0s. The other 1*/*2 non-zero values were draw from a uniform distribution U{[−3, 0) ∪ (0,, 3]}. **Y** was generated based on the multivariate multiple linear regression framework described in (1).

Algorithm 1
Brilliant mixed coordinate descent algorithm

![Figure1](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F1.medium.gif)

[Figure1](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F1)

For each of the simulation scenarios, 100 experiments were independently replicated to evaluate the performance. Each experiment contained a training, a validation and a test dataset. We set the training sample size to 200, and the validation and test sample size to 100. We applied Brilliant on the training data using a block-diagonal working precision matrix with each diagonal block being the inverse of a within-group sample covariance matrix. A grid-search of selecting the optimal hyperparameters based on the prediction performance on the validation dataset was performed. The final prediction performance was evaluated on the test dataset.

The performance of outcome prediction, parameter estimation and feature selection was assessed. Outcome prediction was evaluated by average of the squared differences between the referenced cell fractions and their predicted values, or the mean square error (MSE), defined in (3), where *n* denotes the sample size in the test set. Feature selection and coefficient estimation were evaluated based on the estimated regression coefficient matrix ![Graphic][17]</img>. For coefficient estimation, mean absolute estimation error (MAEE) (4) was assessed. For feature selection, we calculated precision, recall and F1 score as formulated in (5)-(7), where |𝒮| is cardinality of a set 𝒮. ![Formula][18]</img>  ![Formula][19]</img>  ![Formula][20]</img>  ![Formula][21]</img>  ![Formula][22]</img>  

Figure 1 and Figure 7 summarize the performance of Brilliant, MSGLasso and Lasso when the covariance structure is CS or AR1, respectively. Brilliant outperforms MSGLasso and univariate lasso when the correlation *rho* within the same error group is 0.3 or greater. The other multivariate approach MSGLasso gives the second best prediction performance (Figure 1A) after Brilliant. This confirms that taking into account the multivariate response grouping structure can improve prediction compared to univariate response approach. It also shows that incoporating the responses’ covariances can further improve prediction. For coefficient estimation (Figure 1B), it shows that by explicitly incorporating the responses’ covariances, Brilliant obtains the lowest estimation error, especially when the responses are highly correlated. For feature selectio, as compared to MSGLasso and Lasso, Brilliant also gives higher precision, recall and F1 scores when response correlations range from moderate to high (Figure 1C-E). MSGLasso outperforms Brilliant when the correlation *rho* is low (*rho* = 0.1). This is due to the estimation error from calculting the precision matrix (inverting the working covariance matrix) in Brilliant.

![Fig. 1.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F2.medium.gif)

[Fig. 1.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F2)

Fig. 1. 
Simulation result of *p* = 500 with 10 **X** groups, *q* = 100 in 5 𝒴 groups. Covariance structure in each **X** group follows CS with correlation coefficient *ρ* = .5. Covariance structure in each 𝒴 group follows CS with *ρ* = 0.1, 0.3, 0.5, 0.7 or 0.9. Training sample size is 200 and test sample size is 100. **A**. Mean squared prediction error (MSE) with lower value indicating better prediction performance; **B**. Mean absolute estimation error of coefficient matrix (MAEE) with lower value suggesting better estimation; **C**. F1 score for feature selection; **D**. Precision for feature selection; **E**. Recall for feature selection; For **C**-**E**, a higher value indicates better feature selection performance.

## Real Data Application

We demonstrate the potential use and performance of Brilliant using the real data from our in-house multi-omics Epigenetic Variation and Childhood Asthma in Puerto Ricans (EVAPR) study [10] and the Framingham Heart Study (FHS) [24, 22]. In the EVAPR study, we collect phenotype, bulk RNA-sequencing (RNA-seq) gene expression and DNA methylation (DNAm) data from Puerto Rican children aged 9-20 years. Here, we focus on the 218 non-asthma controls, who have no missing data on all the predictors and outcomes. The FHS dataset is downloaded from dbGaP (phs000007.v32.p13), which contains blood microarray expression for 4,100 samples.

### Association analysis of DNA methylation and gene expression

In the first real data example, we applied Brilliant and its competing methods to the bulk RNA-seq gene expression and DNAm data from the nasal epithelium tissue samples of the 218 subjects to identify the susceptability CpG site associated with expressions of atopy-related genes.

The multivariate responses were expressions of 152 genes in four atopy-related biological pathways from the Gene Ontology Consortium [3]. Previous studies have shown that cytokine [26] and histone [1] play important roles in atopy. Here, we include 37 genes from the “positive regulation of cytokine production” pathway and 22 genes from the “histone methylation” pathway. We also include 55 genes in “energy derivation by oxidation of organic compounds” pathway since oxidative stress is directly related to atopic diseases [6]. Moreover, 39 genes from the “transport along mricrotuble” pathway were included due to their potential role in atopic diseases [15]. The average of within-pathway pairwise Pearson correlations was 0.715 (standard error= 0.168, details are given in Table 2 and Figure 2). DNAm information on 1,897 CpG sites around the 152 genes (within the region of 1, 000 base pairs up/down streams) were extracted. We further chosed the top 500 CpGs with the largest variances in their methylation levels (*M* values). The log-transformed gene expression TPM values and log-transformed methlylation *M* values were then used as the input responses and predictors for the Brilliant and MSGLasso. The groups of the gene responses were based on their belonging pathways. The groups of CpG sites were also based on the belonging pathways of their harboring or closest nearby genes. As a result, there were four gene groups and four CpG groups, which together yield 16 (= 4 × 4) interaction block groups on the underlying to be estimated regression coefficient matrix. The sizes of CpG groups vary from 91 to 143.

View this table:
[Table 2.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/T2)

Table 2. Average of pairwise Pearson correlations for genes within each outcome group.

![Fig. 2.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F3.medium.gif)

[Fig. 2.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F3)

Fig. 2. 
Pearson correlation heatmap of 152 genes. The vertical axis is indexed by gene names. The horizontal axis is indexed by the pathway names that the gene belongs to.

A three-fold nested cross-validation (CV) procedure was implemented for Brilliant and the competing methods. The inner CV was for selecting the optimal tuning parameters and the outer CV was for generating prediction results and avoiding overfitting. The Brilliant method outperforms both Lasso and MSGLasso in terms of giving the smallest predcition error per gene (Table 3). This is partially due to the the relatively high strength correlations between genes within each pathway group, and their un-ignorable roles in generating accurate prediction outcomes.

View this table:
[Table 3.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/T3)

Table 3. Prediction performance of Brilliant and its competing methods in the EVAPR Nasal gene expression and methylation association study.

Brilliant detected 7,600 potential associations between CpG methylations and gene expressions. MSGLasso and Lasso identified 2,575 and 1,728 associations, respectively. Among the 7,600 potential associations selected by Brilliant, 408 were also detected by both Lasso and MSGLasso (Figure 3). In the association block between the CpG group – “histone methylation” pathway and the gene group – “energy derivation by oxidation of organic compounds” pathway, Brilliant detected 96 association signals between DNAm and gene expression (Figure 9). There were 143 CpG sites in the “histone methylation” pathway and 54 genes in the “energy derivation by oxidation of organic compounds” pathway. The top association signals uniquely detected by Brilliant are listed in Table 4. Among them, majority of the associated CpG sites are located near genes known as key transcriptional and/or epigenetic regulators, with the associating genes being their targeted regulating genes. For instance, the association between *cg05608790* and *NDUFB9* was detected by Brilliant (*β**brilliant* = −0.0095). This reflects the regulation of *cg05608790* on its neighboring gene *AUTS2* (*β**brilliant* = −0.0150), which is an epigenetic regulator of a transcriptional network including *NDUFB9* [16]. Similarly, the association between *cg19832721* and *NDUFA4* (*β**brilliant* = −0.0104) reflects the regulational effects of *cg19832721* on its nearby gene *KANSL1* (*β**brilliant* = 0.0114), which is a key gene for producing the histone acetyltransferase 8 (KAT8, also known as MOF) regulatory nonspecific lethal complex. This complex is critical for acetylation of nucleosomal histone H4 at lysine 16, a unique histone mark controlling chromatin structure, and therefore involved in the regulation of transcription [28, 8, 29]. For this coefficient block, Lasso detects an additional 237 potential CpG-Gene associations. In contrast, MSGLasso detected only 7 association signals within this block.

View this table:
[Table 4.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/T4)

Table 4. Top association signals between DNA methylation and gene expression uniquely selected by Brilliant.

![Fig. 3.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F4.medium.gif)

[Fig. 3.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F4)

Fig. 3. 
Venn diagram of numbers of association signals detected by Brilliant and the competing methods in EVAPR study.

### Cell-type deconvolution for bulk gene expression in blood cell sample

Tissue-level omics studies are based on the average effects across heterogeneous collection of cells, such as studies on peripheral blood include signals from all underlying cell types. Therefore, the tissue-level association studies are prone to be confounded by the variation in cell composition across samples [19]. To eliminate such confounding, different algorithms of cell-type deconvolution have been proposed to infer the fractions of cell types in bulk gene expression data collected from tissue samples [4]. Among available methods, the reference-based cellular deconvolution are believed to be the most accurate and reliable approach [4, 17]

In the second real data application, we used both the EVAPR and FHS datasets to predict cell type proportions for blood neutrophil (Neutro), monocyte (Mono), lymphocyte (Lymph) and eosinophil (Eosino) using the bulk gene expression data from blood cell samples. The multivariate responses for both datasets were the fractions of the four cell types for each subject, the values of which were measured through complete blood cell counts [20]. We put all four cell type fractions into one response group for both datasets. Predictors in EVAPR dataset were expression levels of the signature genes of each cell type based on the widely used leukocyte gene signature matrix reference (LM22) [25]. Those genes were put into four predictor groups based on their reference cell type groups. More specifically, there were 64 signature genes in neutrophil group, 53 in lymphocyte group, 40 in monocyte group, and 39 genes in eosinophil group. The transcripts per million (TPM) counts of predictors were first undertaken a log transformation before applying the Brilliant algorithm. Predictors in FHS dataset contains blood microarray gene expressions from 4,100 samples Those genes were also grouped into four groups based on the leukocyte signature genes available in the dataset (Details in Appendix 9). The microarray gene expression data were also log transformed.

Prediction performance of Brilliant were evaluated within each and also across the two datasets. For internal prediction within each datasets, a three-fold nested cross-validation was perfromed on each dataset to select the optimal hyperparameters and avoid overfitting. For cross dataset prediction, FHS data were used as the training data and EVAPR data were used as the test data. A three-fold cross-validation was performed on the FHS data to select the optimal hyperparameters. The prediction results are shown in Figure 4, where the scatter plots between the predicted and referenced cell fractions and their Pearson correlations were presented. The correlations range from 0.692 to 0.896.

![Fig. 4.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F5.medium.gif)

[Fig. 4.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F5)

Fig. 4. 
Scatter plots of predicted v.s. referenced cell type fractions. The left panel plots are from CV using the EVAPR data only. The middle panel are from CV using the FHS data only. The right panel are from using EVAPR data as the test set while using the FHS data as the training set.

![Fig. 5.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F6.medium.gif)

[Fig. 5.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F6)

Fig. 5. 
Heatmap of coefficient matrix estimated by Brilliant using the FHS data. The matrix has been divided into four blocks based on the signature genes of each cell type. The four blocks contain the signature genes of Neutrophil, Eosinophil, Lymphocyte, and Monocyte cells, respectively.

The predicted cell type fractions from Brilliant were compared with other commonly used reference-based cell-type deconvolution methods, as in the study by Cai et al. [7], including CIBERSORT/CIBERSORTx [25], DeconRNASeq [13], DCQ [2], DeCompress [5], DSA [36], dtangle [17], EnsDeconv (Ensemble Deconvolution) [7], GEDIT (Gene Expression Deconvolution Interactive Tool) [23], EPIC (Estimating the Proportions of Immune and Cancer cells) [27], FARDEEP (Fast And Robust DEconvolution of Expression Profiles) [14], hspe (hybrid-scale proportions estimation) [18], ICeDT (Immune Cell Deconvolution in Tumor tissues) [33], and SCDC methods including non-negative least squares (SCDC NNLS) and inverse sum of squares (SCDC ISSE) [9]. All the results were based on the LM22 reference. The FHS dataset was used as the training data, and the EVARP dataset was used as the test data. The comparison results are given in Figure 6, where the Spearmans’s rank correlation, a convention performance evaluation metric in the cell-type deconvolution field, between the predicted and referenced cell fractions for each of the aforementioned methods. Although not designed as a cell-type deconvolution algorithm, Brilliant achieves the highest average Spearman’s correlation (the vertical bars in Figure 6) across the four cell types. We also calculated the mean square of errors (MSE) between the predicted and referenced cell fractions for each of the methods (Figure 10 in the Appendix). Brilliant gave the smallest average MSE across cell types among all methods.

![Fig. 6.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F7.medium.gif)

[Fig. 6.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F7)

Fig. 6. 
Comparison of cell-type fraction prediction using Brilliant and different deconvolution methods. Neutro: neutrophil; Mono: monocyte; Lymph: lymphocyte; Eosino: eosinophil. The black vertical bar on each row shows the mean of Spearman’s correlations, and the horizontal line presents the range of one standard error below/above the mean.

![Fig. 7.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F8.medium.gif)

[Fig. 7.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F8)

Fig. 7. 
Simulation result of *P* = 500 in 10 groups, *Q* = 100 in 5 groups, covariance structure of each predictor group is first-order auto-regressive (AR1) with associated correlation coefficient *ρ* = .5, covariance structure of each error group is CS with*ρ* = .1, 0.3 .5, .7, or .9. Training sample size is 200. **A**. Mean squared error (MSE) metric of prediction error, and lower value means better prediction performance; **B**. Mean absolute estimation error (MAEE) of coefficient matrix with lower value suggesting better estimation **C**; F1 score for feature selection; **D**. Precision for feature selection. **E**. Recall for feature selection; For **C**-**E** higher value indicates better feature selection performance.

![Fig. 8.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F9.medium.gif)

[Fig. 8.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F9)

Fig. 8. 
Heatmaps of estimated coefficient to demonstrate Lasso select more false positives when the multivariate outcomes are correlated. Note: Predictors without any nonzero estimated coefficients by both Lasso and Brilliant are omitted in the graph.

![Fig. 9.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F10.medium.gif)

[Fig. 9.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F10)

Fig. 9. 
Estimations of one regression coefficient block by Brilliant. Note: This example coefficient block corresponds to the associations between the 143 Cpg sites near genes from the “histone methylation” pathway and the 54 genes from the “energy derivation by oxidation of organic compounds” pathway. Only CpG sites and genes with at least one non-zero coefficient estimate are visualized in the graph.

![Fig. 10.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/12/29/2023.12.26.23300559/F11.medium.gif)

[Fig. 10.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/F11)

Fig. 10. 
Comparison of cell-type fraction prediction mean squared error (MSE) using Brilliant and different deconvolution methods on white blood bulk RNA-sequencing data in the EVA-PR study and blood microarry data in the FHS study. Neutro is neutrophil; Mono is monocyte; Lymph is lymphocyte; and Eosino is eosinophil. Dot denotes the MSEs between the predicted fraction and truth for one specific cell type. The black vertical bar shows the mean of cell-type specific MSEs, and the horizontal line presents mean *±* standard error of the mean.

The feature selection performance of Brilliant was evaluated by comparing the estimated coefficient matrix in the final model (built on FHS data) with the LM22 reference gene matrix. The heatmap in Figure 5 displays the coefficient matrix estimated by Brilliant. The matrix has been divided into four sub-matrices based on the signature genes of each cell type. As anticipated, for each cell type, the signature genes in the LM22 reference have the highest estimate magnitudes in their corresponding cell type categories (indicated by darker red color in the corresponding column). Among the 188 genes selected in the final model, Brilliant correctly identified cell types for 144 signature genes. In other words, the highest proportioned cell type in the LM22 matrix is also estimated by Brilliant to have the highest coefficient magnitude across four cell types for these 144 genes. For another 20 genes, the cell type with the second-largest estimated coefficient magnitude is the highest proportioned cell type in the LM22 reference. These demonstrate that Brilliant is capable of identifying the most informative features associated with the outcomes of interest.

## Conclusion

In this paper, we proposed an analytical tool – Brilliant for variable selection and outcome prediction on multivariate response and high-dimensional predictor data. Brilliant not only utilizes the intrinsic grouping structures of both predictors and responses and performs both group-level and within group-level variable selection, but also explicitly incorporates the covariance structure of the multivariate responses. Brilliant significantly improves the coefficient estimation accuracy, feature selection and outcome prediction performance compared to the competing methods. Brilliant does not require iteratively estimating the precision matrix as in [34]. It only requires a one-step estimation of a working covariance structure according to the user specified grouping structure on the responses.

We applied Brilliant to a methylation–to–gene expression association study and a study using gene expressions to predict cell type fractions. Brilliant can be easily extended to solve many other multivariate-response and multiple-predictor problems, including but not limited to single cell multiomics and expression quantitative trait loci (eQTL) studies, which are beyond the scope of this paper. We will explore its wide applications in the future.

## Data Availability

All data used in the present study are available upon reasonable request to the authors.

## Competing interests

The authors declare no competing interest is declared. This study was partially supported by National Science Foundation (award number 2225775).

## Author contributions statement

X.Z., Y.L. and W.C. conceived and designed the study. M.Y. and J.C. was responsible for data management. X.Z. and M.C. were responsible for data cleaning and analysis. X.Z., Y.D., Y.L. and W.C. were responsible for interpretation. X.Z. and Y.L. wrote the paper. All authors read and approved the final article.

## Acknowledgments

The authors thank all the participants and their families in the Epigenetic Variation and Childhood Asthma in Puerto Ricans study and Framingham Heart Study for enabling this research and further our understanding of childhood asthma and allergy by their participation in the studies. We appreciate the researchers of the Framingham Heart Study for making the data available on dbGAP. This research is supported in part by the University of Pittsburgh Center for Research Computing through the resources provided. Specifically, this work used the HTC cluster, which is supported by NIH award number S10OD028483, and the H2P cluster, which is supported by NSF award number OAC-2117681.

## Appendix A Proofs of technical results

### Brilliant objective function

Assume **Y**|**X** ∼ *N* (**XB, Σ**), the log-likelihood of data can be written as follows. ![Formula][23]</img>  

Given the working precision matrix ![Graphic][24]</img>, the penalized log likelihood can be written as below. ![Formula][25]</img>  

### Derivative of Brilliant objective function

The objective function consists of three terms. We take the derivative of each term with respect to *β**jk* separately. Then, we will combine the gradient or sub-gradient together.

#### Take derivative of the first term with respect to βjk

![Formula][26]</img>  

Let **B** = **B**(−*jk*) + **B**(*jk*), where **B**(−*jk*) being the *jk*-th entry of **B** replaced by zero and **B**(*jk*) being all but the *jk*-th entry of **B** replaced by zeros. Define *S**jk* as shown below in equation ![Formula][27]</img>  

We can re-write equation (10) as ![Formula][28]</img>  

#### Take derivative of the second and third term with respect to βjk

For a coordinate *β**jk*, when *L*(**B**) is differentiable at *β**jk*, ![Formula][29]</img>  

For general case, the derivative of the second term or |·| penalty can be written as follows. ![Formula][30]</img>  

For general case, the derivative of the third term in the objective function or the derivative of ∥ · ∥2 penalty are shown below. ![Formula][31]</img>  

#### Combine the derivative of the three terms

![Formula][32]</img>  

### Solution to the Brilliant objective function

Set derivative to 0 and solve for *β**jk* with the corresponding constraint, we can get ![Formula][33]</img>  

Or, in one unified form: ![Formula][34]</img>  

For those who wish to delve deeper into the topic, we have included a comprehensive explanation below.

#### Details when βij < 0

![Formula][35]</img>  

With the constraint *β <* 0, or ![Graphic][36]</img>, we have ![Formula][37]</img>  

#### Details when βij > 0

![Formula][38]</img>  

With the constraint *β >* 0, or ![Graphic][39]</img>, we can get ![Formula][40]</img>  

#### Details on βjk = 0

We will see different scenarios involving *β**ij* = 0.

S1. The group containing *β**jk* is not zero groups. This suggests that the Lasso (*L*1) | · | penalty is not differentiable, but the ∥ · ∥ penalty is differentiable for all groups containing *β**jk*. Set the derivative to 0, we get ![Formula][41]</img>  where |*u*| *<* 1. When *u* ≥ 0, *S**jk* ≥ *nλ*1. When *u <* 0, *S**jk* *<* −*nλ*1. So the solution is the same as that in the above session for *β**jk* ≠ 0) or Equation (14).

S2. The group containing *β**jk* is a zero group. Let ![Graphic][42]</img>. Let **B***g* be the zero group containing *β**jk*. Let |*u*| *<* 1, and ∥**v**∥2 *<* 1. Set the derivative to 0, we have ![Formula][43]</img>  

For each (*jk* : *β**jk* ∈ *B**gi*), ![Formula][44]</img>  

We can see ![Graphic][45]</img> if ![Formula][46]</img>  

Discuss on a case by case basis, ![Formula][47]</img>  

We can get that ![Graphic][48]</img> ![Formula][49]</img>  

## Appendix B: Simulation Result

For *p* = 500 and *q* = 100, the simulation results under compound symmetry (CS) covariance structure (setting 1-5 in Table 1) are presented in the main article. For results under first order autoregressive (AR1) covariance structure (setting 6-10 Table 1), we observe very similar performance of Brilliant as in CS settings. More specifically, Brilliant achieves improved prediction and coefficient estimation compared to MSGLasso and univariate Lasso. Brilliant also shows significant improved feature selection performance, especially in F1 score and precision.

## Appendix C: Further explanation of Lasso feature selection performance

We use a simply simulation study to demonstrate that Lasso is susceptible to selecting more FP signals when there exist correlated outcomes. In this example, we simulated a group of 100 predictors and 40 outcomes. The predictors were generated using the multivariate normal distribution with covariance matrix following compound symmetry (cs) structure with *ρ* = 0.2. The outcomes were simulated from the multivariate normal distribution with covariance matrix following compound symmetry structure and *rho* = 0.9. Then we replaced *Y* 1 with *Y* 1 + *X*1 + *X*2. There were 25 observations in the training dataset. In this case, although both Lasso and Brilliant select the true signals, namely *Y* 1 − *X*1 and *Y* 2 − *X*2. Lasso identify an additional 331 false positives (FPs), while Brilliant only have 6 more FPs (Figure 8).

## Appendix D: Details on cell-type deconvolution for blood cell samples

In the LM22 reference matrix, all entries are positives with higher value indicating higher expression level of the gene in the corresponding cell type. For each gene, we set it as the signature gene for the cell type with the largest value in the reference matrix. The number of available signature genes in the EVAPR and FHS dataset are summarized in Table 5. Table 6 summarizes the top performers of cell type deconvolution for the EVAPR dataset.

View this table:
[Table 5.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/T5)

Table 5. Signature genes available in EVAPR and FHS dataset.

View this table:
[Table 6.](http://medrxiv.org/content/early/2023/12/29/2023.12.26.23300559/T6)

Table 6. Top performers of cell type deconvolution for EVAPR dataset.

*   Received December 26, 2023.
*   Revision received December 26, 2023.
*   Accepted December 28, 2023.


*   © 2023, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## References

1.  1. Bilal Alaskhar Alhamwe,  Razi Khalaila,  Johanna Wolf, Verena von Bülow,  Hani Harb,  Fahd Alhamdan,  Charles S Hii,  Susan L Prescott,  Antonio Ferrante,  Harald Renz, et al. Histone modifications and their role in epigenetics of atopy and allergic diseases. Allergy, Asthma & Clinical Immunology, 14(1):1–16, 2018.
    
    
2.  2. Zeev Altboum,  Yael Steuerman,  Eyal David,  Zohar Barnett-Itzhaki,  Liran Valadarsky,  Hadas Keren-Shaul,  Tal Meningher,  Ella Mendelson,  Michal Mandelboim,  Irit Gat-Viks, et al. Digital cell quantification identifies global immune cell dynamics during influenza infection. Molecular systems biology, 10(2):720, 2014.
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoibXNiIjtzOjU6InJlc2lkIjtzOjg6IjEwLzIvNzIwIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjMvMTIvMjkvMjAyMy4xMi4yNi4yMzMwMDU1OS5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 

3.  3. Michael Ashburner,  Catherine A Ball,  Judith A Blake,  David Botstein,  Heather Butler,  J Michael Cherry,  Allan P Davis,  Kara Dolinski,  Selina S Dwight,  Janan T Eppig, et al. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25–29, 2000.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/75556&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=10802651&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F29%2F2023.12.26.23300559.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000086884000011&link_type=ISI) 

4.  4. Francisco Avila Cobos, José Alquicira-Hernandez,  Joseph E Powell,  Pieter Mestdagh, and  Katleen De Preter. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nature communications, 11(1):1–14, 2020.
    
    
5.  5. Arjun Bhattacharya,  Alina M Hamilton,  Melissa A Troester, and  Michael I Love. Decompress: tissue compartment deconvolution of targeted mrna expression panels using compressed sensing. Nucleic acids research, 49(8):e48–e48, 2021.
    
    
6.  6. Russell P Bowler and  James D Crapo. Oxidative stress in allergic respiratory diseases. Journal of Allergy and Clinical Immunology, 110(3):349–356, 2002.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1067/mai.2002.126780&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=12209079&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F29%2F2023.12.26.23300559.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000177936900002&link_type=ISI) 

7.  7. Manqi Cai,  Molin Yue,  Tianmeng Chen,  Jinling Liu,  Erick Forno,  Xinghua Lu,  Timothy Billiar, Juan Celedón,  Chris McKennan,  Wei Chen, et al. Robust and accurate estimation of cellular fraction from tissue omics data via ensemble deconvolution. 3010, 2022. Bioinformatics, 38(11):3004–
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/bioinformatics/btac279&link_type=DOI) 

8.  8. Yong Cai,  Jingji Jin,  Selene K Swanson,  Michael D Cole,  Seung Hyuk Choi,  Laurence Florens,  Michael P Washburn,  Joan W Conaway, and  Ronald C Conaway. Subunit composition and substrate specificity of a mof-containing histone acetyltransferase distinct from the male-specific lethal (msl) complex. Journal of Biological Chemistry, 285(7):4268–4272, 2010.
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiamJjIjtzOjU6InJlc2lkIjtzOjEwOiIyODUvNy80MjY4IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjMvMTIvMjkvMjAyMy4xMi4yNi4yMzMwMDU1OS5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 

9.  9. Meichen Dong,  Aatish Thennavan,  Eugene Urrutia,  Yun Li,  Charles M Perou,  Fei Zou, and  Yuchao Jiang. Scdc: bulk gene expression deconvolution by multiple single-cell rna sequencing references. Briefings in bioinformatics, 22(1):416–427, 2021.
    
    
10. 10. Erick Forno,  Ting Wang,  Cancan Qi,  Qi Yan,  Cheng-Jian Xu,  Nadia Boutaoui,  Yueh-Ying Han,  Daniel E Weeks,  Yale Jiang,  Franziska Rosser, et al. Dna methylation in nasal epithelium, atopy, and atopic asthma in children: a genomewide study. The Lancet Respiratory Medicine, 7(4):336–346, 2019.
    
    
11. 11. Jerome Friedman,  Trevor Hastie, and  Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010.
    
    
12. 12. Jerome Friedman,  Trevor Hastie, and  Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/biostatistics/kxm045&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=18079126&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F29%2F2023.12.26.23300559.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000256977000005&link_type=ISI) 

13. 13. Ting Gong and  Joseph D Szustakowski. Deconrnaseq: a statistical framework for deconvolution of heterogeneous tissue samples based on mrna-seq data. Bioinformatics, 29(8):1083–1085, 2013.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/bioinformatics/btt090&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23428642&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F29%2F2023.12.26.23300559.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000318109300019&link_type=ISI) 

14. 14. Yuning Hao,  Ming Yan,  Blake R Heath,  Yu L Lei, and  Yuying Xie. Fast and robust deconvolution of tumor infiltrating lymphocyte from expression profiles using least trimmed squares. PLoS computational biology, 15(5):e1006976, 2019.
    
    
15. 15. Irene H Heijink,  Virinchi NS Kuchibhotla,  Mirjam P Roffel,  Tania Maes,  Darryl A Knight,  Ian Sayers, and  Martijn C Nawijn. Epithelial cell dysfunction, a major driver of asthma development. Allergy, 75(8):1902–1917, 2020.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/all.14421&link_type=DOI) 

16. 16. Kei Hori,  Kazumi Shimaoka, and  Mikio Hoshino. Auts2 gene: keys to understanding the pathogenesis of neurodevelopmental disorders. Cells, 11(1):11, 2021.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3390/cells11010011&link_type=DOI) 

17. 17. Gregory J Hunt,  Saskia Freytag,  Melanie Bahlo, and  Johann A Gagnon-Bartsch. Dtangle: accurate and robust cell type deconvolution. Bioinformatics, 35(12):2093–2099, 2019.
    
    
18. 18. Gregory J Hunt and  Johann A Gagnon-Bartsch. The role of scale in the estimation of cell-type proportions. The Annals of Applied Statistics, 15(1):270–286, 2021.
    
    
19. 19. Andrew E Jaffe and  Rafael A Irizarry. Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome biology, 15(2):1–9, 2014.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/gb-2014-15-1-r1&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24393533&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F29%2F2023.12.26.23300559.atom) 

20. 20. Yale Jiang,  Olena Gruzieva,  Ting Wang,  Erick Forno,  Nadia Boutaoui,  Tao Sun,  Simon K Merid, Edna Acosta-Pérez,  Inger Kull,  Glorisa Canino, et al. Transcriptomics of atopy and atopic asthma in white blood cells from children and adolescents. European Respiratory Journal, 53(5), 2019.
    
    
21. 21. Yanming Li, Bin Nan, and  Ji Zhu. Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics, 71(2):354–363, 2015.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/biom.12292&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25732839&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F29%2F2023.12.26.23300559.atom) 

22. 22. Syed S Mahmood,  Daniel Levy,  Ramachandran S Vasan, and  Thomas J Wang. The framingham heart study and the epidemiology of cardiovascular disease: a historical perspective. The lancet, 383(9921):999–1008, 2014.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/s0140-6736(13)61752-3&link_type=DOI) 

23. 23. Brian B Nadel,  David Lopez,  Dennis J Montoya,  Feiyang Ma,  Hannah Waddel,  Misha M Khan,  Serghei Mangul, and  Matteo Pellegrini. The gene expression deconvolution interactive tool (gedit): accurate cell type quantification from gene expression data. GigaScience, 10(2):giab002, 2021.
    
    
24. 24. Brian B Nadel,  Meritxell Oliva,  Benjamin L Shou,  Keith Mitchell,  Feiyang Ma,  Dennis J Montoya,  Alice Mouton,  Sarah Kim-Hellmuth,  Barbara E Stranger,  Matteo Pellegrini, et al. Systematic evaluation of transcriptomics-based deconvolution methods and references using thousands of clinical samples. Briefings in Bioinformatics, 22(6):bbab265, 2021.
    
    
25. 25. Aaron M Newman,  Chih Long Liu,  Michael R Green,  Andrew J Gentles,  Weiguo Feng,  Yue Xu,  Chuong D Hoang,  Maximilian Diehn, and  Ash A Alizadeh. Robust enumeration of cell subsets from tissue expression profiles. Nature methods, 12(5):453–457, 2015.
    
    
26. 26. Ly P Ngoc,  Diane R Gold,  Arthur O Tzianabos,  Scott T Weiss, and  Juan C Celedón. Cytokines, allergy, and asthma. Current opinion in allergy and clinical immunology, 5(2):161–166, 2005.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1097/01.all.0000162309.97480.45&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=15764907&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F29%2F2023.12.26.23300559.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000232293000009&link_type=ISI) 

27. 27. Julien Racle,  Kaat de Jonge,  Petra Baumgaertner,  Daniel E Speiser, and  David Gfeller. Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. elife, 6, 2017.
    
    
28. 28. Sunil Jayaramaiah Raja,  Iryna Charapitsa,  Thomas Conrad,  Juan M Vaquerizas,  Philipp Gebhardt,  Herbert Holz,  Jan Kadlec,  Sven Fraterman,  Nicholas M Luscombe, and  Asifa Akhtar. The nonspecific lethal complex is a transcriptional regulator in drosophila. Molecular cell, 38(6):827–841, 2010.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.molcel.2010.05.021&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=20620954&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F29%2F2023.12.26.23300559.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000279297500008&link_type=ISI) 

29. 29. Michael Shogren-Knaak,  Haruhiko Ishii,  Jian-Min Sun,  Michael J Pazin,  James R Davie, and  Craig L Peterson. Histone h4-k16 acetylation controls chromatin structure and protein interactions. Science, 311(5762):844–847, 2006.
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEyOiIzMTEvNTc2Mi84NDQiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMy8xMi8yOS8yMDIzLjEyLjI2LjIzMzAwNTU5LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 

30. 30. Noah Simon,  Jerome Friedman,  Trevor Hastie, and  Robert Tibshirani. A sparse-group lasso. Journal of computational and graphical statistics, 22(2):231–245, 2013.
    
    
31. 31. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/J.2517-6161.1996.TB02080.X&link_type=DOI) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1996TU31400017&link_type=ISI) 

32. 32. Ines Wilms and  Christophe Croux. An algorithm for the multivariate group lasso with covariance estimation. Journal of Applied Statistics, 45(4):668–681, 2018.
    
    
33. 33. Douglas R Wilson,  Chong Jin,  Joseph G Ibrahim, and  Wei Sun. Iced-t provides accurate estimates of immune cell abundance in tumor samples by allowing for aberrant gene expression patterns. Journal of the American Statistical Association, 115(531):1055–1065, 2020.
    
    
34. 34. Jianxin Yin and  Hongzhe Li. Adjusting for highdimensional covariates in sparse precision matrix estimation by ℓ1-penalization. Journal of multivariate analysis, 116:365–381, 2013.
    
    
35. 35. Ming Yuan and  Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.1467-9868.2005.00532.x&link_type=DOI) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000234129000003&link_type=ISI) 

36. 36. Yi Zhong,  Ying-Wooi Wan,  Kaifang Pang,  Lionel ML Chow, and  Zhandong Liu. Digital sorting of complex tissues for cell type-specific gene expression profiles. BMC bioinformatics, 14(1):1–10, 2013.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/1471-2105-14-1&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23323762&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F29%2F2023.12.26.23300559.atom) 

37. 37. Hui Zou and  Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2):301–320, 2005.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.1467-9868.2005.00503.x&link_type=DOI) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000227498200007&link_type=ISI)

 [1]: /embed/graphic-1.gif
 [2]: /embed/inline-graphic-1.gif
 [3]: /embed/graphic-2.gif
 [4]: /embed/inline-graphic-2.gif
 [5]: /embed/inline-graphic-3.gif
 [6]: /embed/inline-graphic-4.gif
 [7]: /embed/inline-graphic-5.gif
 [8]: /embed/inline-graphic-6.gif
 [9]: /embed/inline-graphic-7.gif
 [10]: /embed/inline-graphic-8.gif
 [11]: /embed/inline-graphic-9.gif
 [12]: /embed/inline-graphic-10.gif
 [13]: /embed/inline-graphic-11.gif
 [14]: /embed/inline-graphic-12.gif
 [15]: /embed/inline-graphic-13.gif
 [16]: /embed/inline-graphic-14.gif
 [17]: /embed/inline-graphic-15.gif
 [18]: /embed/graphic-5.gif
 [19]: /embed/graphic-6.gif
 [20]: /embed/graphic-7.gif
 [21]: /embed/graphic-8.gif
 [22]: /embed/graphic-9.gif
 [23]: /embed/graphic-23.gif
 [24]: /embed/inline-graphic-16.gif
 [25]: /embed/graphic-24.gif
 [26]: /embed/graphic-25.gif
 [27]: /embed/graphic-26.gif
 [28]: /embed/graphic-27.gif
 [29]: /embed/graphic-28.gif
 [30]: /embed/graphic-29.gif
 [31]: /embed/graphic-30.gif
 [32]: /embed/graphic-31.gif
 [33]: /embed/graphic-32.gif
 [34]: /embed/graphic-33.gif
 [35]: /embed/graphic-34.gif
 [36]: /embed/inline-graphic-17.gif
 [37]: /embed/graphic-35.gif
 [38]: /embed/graphic-36.gif
 [39]: /embed/inline-graphic-18.gif
 [40]: /embed/graphic-37.gif
 [41]: /embed/graphic-38.gif
 [42]: /embed/inline-graphic-19.gif
 [43]: /embed/graphic-39.gif
 [44]: /embed/graphic-40.gif
 [45]: /embed/inline-graphic-20.gif
 [46]: /embed/graphic-41.gif
 [47]: /embed/graphic-42.gif
 [48]: /embed/inline-graphic-21.gif
 [49]: /embed/graphic-43.gif