Abstract
Renal clear cell carcinoma (RCC), the most common type of kidney cancer, lacks a well-defined collection of biomarkers for tracking disease progression. Although complementary diagnostic and prognostic RCC biomarkers may be beneficial for guiding therapeutic selection and informing clinical outcomes, patients currently have a poor prognosis due to limited early detection. Without a priori biomarker knowledge or histopathology information, we used machine learning (ML) techniques to investigate how mRNA, microRNA, and protein expression levels change as a patient progresses to different stages of RCC. The novel combination of big data with ML enables researchers to generate hypothesis-free models in a fraction of the time used in traditional clinical trials. Ranked genes that are most predictive of survival and disease progression can be used for target discovery and downstream analysis in precision medicine. We extracted clinical information for normal and RCC patients along with their related expression profiles in RCC tissues from three publicly-available datasets: 1. The Cancer Genome Atlas (TCGA), 2. Genotype-Tissue Expression (GTEx) project, 3. Clinical Proteomic Tumor Analysis Consortium (CPTAC). Our study found that among others, gene expression levels (mRNA) from GNG7 and BCR are potential predictors for RCC progression. For microRNA, we found hsa-mir-199a-2 and hsa-mir-129-1 to be potential predictors of RCC progression. Understanding how genes and protein expression levels change as RCC progresses will further guide the development of prognostic biomarkers and targets for RCC therapies.
Introduction
Kidney cancer is among the ten most common cancers. In the US, rates for renal cell cancer continue to rise but mainly for early-stage tumors, with kidney renal clear cell carcinoma (RCC) accounting for 75% of all kidney cancers [1]. In the US, for the year 2020 the number of new cases of kidney cancer was 73,750 and deaths was 14,830 [2]. Although biomarkers for RCC are emerging [3, 4, 5, 6], patients usually have a poor prognosis due to limited early detection. Early identification of metastatic potential may be beneficial for therapeutic selection, as well as informing clinical outcomes. Using integrated diagnostic techniques made possible by lower-cost sequencing and predictive modeling, a more precise treatment approach may emerge by combining multimodal data such as mRNA, microRNA, and protein expression with clinical information [7, 8, 9, 10, 11].
Cancer databases contribute greatly to research that explains the molecular mechanisms underlying tumorigenesis. For example, the Cancer Genome Atlas (TCGA) database [12] consists of over 20,000 primary cancer and matched normal samples spanning 33 cancer types. Similarly, the Genotype-Tissue Expression (GTEx) database [13] contains tissue-specific gene expression for 54 non-diseased sites across nearly 1,000 individuals. The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) consists of over 2,000 proteomes and facilitates the discovery of cancer-specific protein biomarkers that augment DNA and RNA sequencing [14]. Bioinformatics mining—computational methods that are complementary to clinical trials—enables researchers to quickly evaluate concepts and generate hypothesis-free models. With publicly-available datasets becoming increasingly available, bioinformatics mining has led to a burst of research activity related to computational cancer biomarker and target discovery.
Understanding how genes and protein expression change as RCC progresses can further guide the development of prognostic biomarkers and treatment planning for RCC patients. In this study, we investigated how mRNA, microRNA, and protein expression levels were associated with different stages of RCC patients, as well as to their overall survival outcomes. Ranked genes that are most predictive of survival and disease progression can be used for target discovery for drugs and biomarker discovery for clinical decision support. In contemporary practice, VEGF and VEGF receptors are the most common therapy targets in the clinical treatment of RCC [15].
In this paper, we extracted clinical information of normal and RCC patients along with their related expression profiles in RCC tissues. All TCGA patients were clinically and pathologically diagnosed with RCC, and tumors were classified by TCGA into Stages I to IV based on the Fuhrman nuclear grading system [16]. Without a priori biomarker knowledge or histopathology information, we used ML techniques to investigate how clinical and omic data could be combined to identify the pathological stage of the tumor based on normal, early (Stage I+II), and late (Stage III+IV) classes.
Methods
Data collection and preparation
Clinical and genomic data were obtained from the Cancer Genome Atlas (TCGA) -Kidney Renal Carcinoma (KIRC) project, Genotype-Tissue Expression (GTEx), and the Clinical Proteomics Tumor Analysis Consortium (CPTAC). The processed mRNA data was used following the approach of Wang et al that combined data sources from TCGA and GTEx [17]. TCGA was the single source for microRNA data. Protein data and associated clinical data for those patients were obtained from CPTAC. Disease progression stages were defined by the American Joint Committee on Cancer (AJCC). Overall survival, overall survival months, and AJCC-defined pathologic tumor stage were extracted from clinical data and joined with genomic and protein data.
A combination of clinical and genomic data was provided as input to ML methods that included a decision tree model (XGBoost) and an ensemble (AutoGluon) model, which were trained to predict disease progression (Stages I through IV) for each of these datasets. We carried out a stratified 70/30 train-test data split before experiment training. Tables 1 and 2 show distributions of patients for each tumor stage based on the mRNA, microRNA, and protein datasets. A large class imbalance exists in each dataset, which is noted by the “Percentage of cohort” column in each table.
For the different molecule types, the distribution of patients obtained from TCGA and GTEx for mRNA; TCGA for microRNA; and CPTAC for proteins.
Comparison of evaluation metrics of the different models and methods used to predict stages.
Top 20 genes ordered by importance in mRNA multi-class classification: AutoGluon.
Top 20 microRNA ordered by importance in multi-class classification selected: AutoGluon.
Top 20 genes ordered by importance in protein multi-class classification (AutoGluon) without survival analysis (from full set of available proteins).
To increase potential clinical insight, patients were stratified into three distinct groups per dataset. These classes are defined as follows:
Class 0: Normal Samples (includes matched normal samples)
Class 1: Stage I and Stage II
Class 2: Stage III and Stage IV
The stages of the samples were determined at the time of diagnosis by the TCGA consortium and are included in the public TCGA dataset. For this research, tumor stages were grouped together in classes based on biological similarities. For example, Stage I and II kidney tumors are concentrated within the kidney itself and combined into Class 1. Stage III and IV kidney tumors have progressed and spread outside of the kidney and combined into Class 2 [18]. This grouping allows for the identification of relevant potential biomarkers for tumor progression and the determination of patients with the highest risk of tumor progression. The distributions of patients over progression stages are shown in Table 1 by molecule type.
Preprocessing
Patients with missing ground truth entries or genomic features with a high fraction of missing data were removed from the dataset. Genetic expression levels were normalized to zero mean and unit standard deviation.
Survival analysis
The correlation between patient survival and tumor stage is shown in Table 2, which is supported by studies on genetics and survival [18]. Cox’s Proportional Hazard Test (Cox PH) was carried out for both mRNA and microRNA training sets separately based on the time to death in post-diagnosis days in order to determine which genes are significantly related to survival (p < 0.05) using the open-source Python package, lifelines [19]. Downstream models for disease progression were trained on a subset of each dataset containing only those genes significant for survival. This method allowed us to perform informed feature selection, reducing the mRNA feature space from 20,235 genes to 5,569 significant genes, and microRNA from 532 genes to 47 significant genes.
XGBoost
The open-source Python package for XGBoost version 1.4.2 [20] was coupled with an open-source Bayesian-optimization package for hyperparameter tuning [22] over 50 rounds per model. A model with optimal hyperparameters was determined with 10-fold cross-validation on the training set and returned from a Bayesian optimization procedure with 1,000 boosting rounds. The final model with optimal parameter settings was trained on the full training set. Model performance was evaluated with accuracy and ROC-AUC-OVO (Area Under the Curve of the Receiver Operating Curve, using a One-Versus-One evaluation for imbalanced multi-class data) on the test set. We then used the XGBoost feature importance tool to create importance matrices.
AutoGluon
AutoGluon version 0.3.1 [21], an open-source AutoML package, was also used to train a large ensemble model (linear, k-nearest neighbor, random forest, XGBoost, and tabular neural network models, among others) to compare with the custom XGBoost model, and also to compare the internal AutoGluon models that comprise the ensemble. The AutoGluon ensemble was trained for 30 minutes on each dataset, and, like the XGBoost models, the AutoGluon ensemble predictor was trained only on a subset of features deduced from Cox’s Proportional Hazard Test.
Model Tuning
For hyperparameter tuning, we performed Bayesian optimization with 10-fold cross validation using the average ROC-AUC-OVO score as the evaluation metric. Optimization was accomplished using an open-source Python package, BayesianOptimization [22].
Compute Environment
Models were trained using a ml.r5.4xlarge AWS SageMaker notebook instance. The code, models, and results are available in the GitHub repository listed in the Data Usage section. Data cleaning, Cox’s Proportional Hazard Test, XGBoost, and AutoGluon widely varied in time and computational resources due to the complexity of the data used (8,217 mRNA features as opposed to 11,710 protein features).
Results
As described previously, mRNA and microRNA data were obtained from TCGA’s KIRC/RCC dataset, mRNA data for non-RCC patients from GTEx, and protein data (and cohort follow-up) from CPTAC. From the survival analysis, we found 8,218 mRNA, 38 microRNA, and 10 of the top 20 proteins that were significant (p < 0.05). These findings were used as classifier inputs.
Using mRNA, microRNA, and protein input data, we trained a classifier on a 70/30 stratified train-test split to predict three classes:
Class 0: Normal Patients
Class 1: Stage I and Stage II
Class 2: Stage III and Stage IV
On a test dataset of 30% of the data, we noted the evaluation metrics as in Table 2. AutoGluon outperformed XGBoost in every scenario, which is due to AutoGluon’s ensemble of models.
The mRNA, microRNA, and proteins found to be highly predictive of RCC progression are shown in Tables 2 through 5.
Discussion and conclusion
mRNA prognostic biomarkers
For mRNA gene expression potentially predictive of RCC progression, we will discuss the top 10 genes in descending order of importance (GNG7, BCR, B3GALT1, SPTSSB, GLS2, RP11-537L4, TMEM159, SHOX2, PTPRR, AC003006.7). GNG7, a modulator in various transmembrane signaling systems, had the highest ranking. GNG7 is part of the guanine nucleotide-binding protein (G protein) family with downregulation associated with pancreatic and esophageal cancer [23, 24], as well as squamous cell carcinoma of the head and neck [25]. A further study by Xu et al found that GNG7 contributes to the progression of clear cell renal cell carcinoma, which is in agreement with our findings [26]. BCR is a protein with two opposing regulatory activities toward small GTP-binding proteins, and has been associated with leukemia [27]. Studies by Kim et al have shown a relationship between BCR gene expression and RCC [28, 29]. Gene expression of BCR was also upregulated in KIRP, a renal cancer closely related to RCC [30]. B3GALT1 is involved in the biosynthesis of the carbohydrate moieties of glycolipids and glycoproteins, and has been found to be involved in cervical adenosquamous carcinoma [31]. SPTSSB (serine palmitoyltransferase small subunit B) downregulation has been found to be detrimental to RCC patients [32]. GLS2 promotes mitochondrial respiration and increases ATP generation in cells. In a study by Shi et al, GLS2 was found to be upregulated in ccRCC cells; cells with GLS2 shRNA displayed lower survival, lower glutathione levels, and a high lipid peroxide level [33]. RP11-527L4 is a long intergenic non-protein coding RNA more commonly known as LINC01976. The relationship of this ln-RNA has not been studied with respect to renal cancer and searches did not reveal studies regarding its function. TMEM159 plays an important role in the formation of lipid droplets, but we were unable to find literature associating TMEM159 expression with RCC progression. SHOX2 is an important gene implicated in craniofacial, brain, heart, and limb development [34], and a methylation analysis of SHOX2 has been performed for early-stage lung cancer [35]. Further, Jung et al found that SHOX2 gene body methylation was positively correlated with mRNA expression in RCC tissues [36]. PTPRR was found to be related with multiple types of cancer, but searches did not find an association with RCC. This gene is a member of the protein tyrosine phosphatase (PTP) family. PTPs are known to be signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation. Lastly, AC003006.7 is an uncharacterized protein, with one transcript having a KRAB-containing protein. We could not find literature regarding this protein, but include it for completeness.
microRNA prognostic markers
After comparing the top 10 microRNA sequences with search results from the mirbase database [37], several sequence candidates were implicated in RCC. For example, hsa-mir-301a has been cited as a possible biomarker of metastatic RCC [38]. hsa-mir-129-1 has been found to be a biomarker for hepatocellular carcinoma [39, 40], but no known relationship with RCC has been reported in literature. Hsa-mir-23b has been associated with various cancer types and is a promising therapeutic target [41]. Its downregulation has also been found to be a predictor of hepatocellular carcinoma progression. For miR-27b, Ishihara et al found that it functions as a tumor suppressor of RCC [42]. Hsa-mir-200b downregulation has been found to suppress metastasis of RCC [43], and hsa-let-7i has been implicated in RCC due to its relationship with TRIM genes [44]. Hsa-mir-24-1 plays a role in several different types of cancer [45] and may be connected to lupus nephritis [46]. Although hsa-mir-142 and hsa-mir-3676 have not been implicated in renal cancer in the literature, our final candidate, hsa-mir-133b, has been associated with renal cancer through the JAK2/STAT3 pathway [47]. It has also been identified as a biomarker in other studies [48].
Protein prognostic markers
Considering the top 10 proteins identified from AutoGluon’s importance matrix, we first searched for them in the Human Protein Atlas [50]. The Human Protein Atlas performed a detailed analysis of each gene available in open-source TCGA datasets across several different cancer types. From this search, we found that all of the top 10 were considered cancer-related genes. Of these genes, 4 were considered prognostic markers for renal cancer (MFSD4A, CHL1, HAPLN1, UTP11), although 6 genes were not previously described as prognostic (TUBB2B, TRIM37, HLA-DRB1, CRTC3, HLA-A, MT1A). Of the 6 non-prognostic genes for RCC, 3 were considered prognostic for other cancers (TUBB2B, TRIM37, HLA-A), 1 was found to be cancer enriched in liver cancer (MT1A), and the final 2 were not found to be associated with high cancer specificity (HLA-DRB1, CRTC3) [49].
MFSD4A is a protein within the major facilitator superfamily domain (MFSD) that focuses on glucose transmembrane transport, and UTP11 is a protein that processes pre-18S ribosomal RNA [51]. MFSD4A and UTP11, while noted as prognostic markers for renal cancer survival, did not have additional supporting literature beyond citations found in the Human Protein Atlas.
Although both of these genes were among the highest-ranked by the AutoGluon model’s importance matrix—the first and tenth, respectively—more research is required to explore the connection between these genes and RCC patient survival rates. CHL1 is an ATP-dependent DNA helicase that promotes genomic stability through DNA replication, DNA repair, and ribosomal RNA synthesis [51]. CHL1 was found to be a prognostic marker for renal cancer by multiple sources [50, 52], suggesting that under-expression can lead to lower survival rates in renal cancer patients. HAPLN1 is a protein that binds hyaluronic acid with proteoglycan monomers [51]. Other studies have found that high expression of HAPLN1 leads to lower survival rates in RCC patients [53].
About CPTAC Survival Data for Clear Cell Renal Carcinoma Patients and Contro
The results from Cox’s Proportional Hazard test on the CPTAC dataset for feature selection found only 3 significant proteins out of 11,170 (a 99.97% reduction). Due to the small sample size of deceased patients (12/203) and low correlation between tumor stage and vital status in the CPTAC cohort, the performance of XGBoost and AutoGluon models was evaluated on the full feature set of proteins from CPTAC data. XGBoost and AutoGluon performance improved when using all available proteins compared to the small subset found from Cox’s Proportional Hazard Test.
Conclusion
Although our findings may provide new perspectives in understanding the pathogenesis of RCC, some limitations exist in our study. The dataset size was relatively small, and results could be improved with more survival data for CPTAC patients. We used the AutoGluon permutation shuffling approach to assess feature importance. One limitation of permutation shuffling is that it can produce unrealistic inputs as a result of the shuffling procedure [54, 55]. This study is an exploratory analysis using computational techniques and further bench experimental studies are required for validation before clinical use.
In conclusion, the use of ML techniques enabled the evaluation of different models to predict RCC cancer progression and identify potential prognostic markers. Predicting cancer stages from RNA expression data may be beneficial for therapeutic selection, as well as informing clinical outcomes. This work will assist clinicians to inform patient prognosis, and can be extended with other multimodal data such as imaging and DNA methylation datasets. Employing graph-based neural network models could further elucidate the interaction of genes identified in this paper.
Data Availability
All data produced in the present work are contained in the manuscript.
Data Usage
TCGA and GTEx data are from https://doi.org/10.6084/m9.figshare.5330575[17]
CPTAC data used in this publication were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH) [14]. Data were accessed through the Python module cptac, PMID: 33560848 (https://github.com/PayneLab/cptac). Data from clear cell carcinoma (kidney) were originally published in PMID: 31675502.
The code for this study is in the GitHub repository: https://github.com/aws-samples/biomarker-discovery.
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵