Predict multicategory causes of death in lung cancer patients using clinicopathologic factors
=============================================================================================

* Fei Deng
* Haijun Zhou
* Yong Lin
* John Heim
* Lanlan Shen
* Yuan Li
* Lanjing Zhang

## Abstract

**Background** Random forest model is a recently developed machine-learning algorithm, and superior to other machine learning and regression models for its classification function and better accuracy. But it is rarely used for predicting causes of death in lung cancer patients. On the other hand, specific causes of death in lung cancer patients are poorly classified or predicted, largely due to its categorical nature (versus binary death/survival).

**Methods** We therefore tuned and employed a random forest algorithm (Stata, version 15) to classify and predict specific causes of death in lung cancer patients, using the surveillance, epidemiology and end results-18 and several clinicopathological factors. The lung cancer diagnosed during 2004 were included for the completeness in their follow-up and death causes. The patients were randomly divided into training and validation sets (1:1 match). We also compared the accuracies of the final random forest and multinomial regression models.

**Results** We identified and randomly selected 40,000 lung cancers for the analyses, including 20,000 cases for either set. The causes of death were, in descending ranking order, were lung cancer (72.45 %), other causes or alive (14.38%), non-lung cancer (6.87%), cardiovascular disease (5.35%), and infection (0.95%). We found more 250 iterations and the 10 variables produced the best prediction, whose best accuracy was 69.8% (error-rate 30.2%). The final random forest model with 300 iterations and 10 variables reached an accuracy higher than that of multinomial regression model (69.8% vs 64.6%). The top-10 most important factors in the random-forest model were sex, chemotherapy status, age (65+ vs <65 years), radiotherapy status, nodal status, T category, histology type and laterality, which were also independently associated with 5-category causes of death.

**Conclusion** We optimized a random forest model of machine learning to predict the specific cause of death in lung cancer patients using a set of clinicopathologic factors. The model also appears more accurate than multinomial regression model.

## Introduction

Lung cancer is one of the leading causes of death in the U.S. and the world [1]. Basic research has revealed new insights into development and progression of lung cancer [2]. Clinical works also identified the factors associated with the survival outcomes of lung cancer, particularly non-small cell lung cancer, including histology and oncogenic drivers [3]. However, the survival outcomes of the lung cancers are mostly binary, which were either alive versus death or alive versus progression/death. Few studies on lung cancer patients were focused on the specific deaths due to non-lung cancer[4], cardiovascular diseases[5, 6], infection or other causes[4, 7]. Even fewer works investigated the multicategory causes of death (COD).

Random forest (RF) model is a recently developed machine-learning algorithm, and superior to other machine learning and regression models for its classification function and better accuracy [8]. But it is rarely used for predicting causes of death in lung cancer patients, while we have shown a better prediction accuracy in prostate cancers [8]. On the other hand, specific causes of death in lung cancer patients are poorly classified or predicted, largely due to its categorical nature (versus binary death/survival). Therefore, this study was aimed to understand the factors associated with multicategory COD in lung cancer patients using RF and multinomial regression models.

## Material and methods

We extracted clinical, pathologic and socioeconomic data of the patients in the Surveillance, Epidemiology, and End Results-18 (SEER-18) Program ([www.seer.cancer.gov](http://www.seer.cancer.gov)) SEER*Stat Database with Treatment Data, who were diagnosed of lung carcinoma in 2004. The follow up was last conducted in Nov. 2016 (the last in the 2019 data release). The inclusion criteria included the survival time > 1 month, aged 20+ years, first primary-cases only and having a known COD. The average follow-up time was 12.5 years.

The informed consent was not obtained for the SEER patients due to de-identified nature of the dataset. Owing to the use of publicly available, de-identified SEER cases, this study was exempt from an institutional review board approval. However, we have received the approval for using the SEER-18 data under the condition of compliance with their preset terms (user ID lzhang). Moreover, all 50 states in the USA have laws requiring newly diagnosed cancers to be reported to a central registry. The state cancer registries in the SEER program would deposit their extracted, de-identified cancer data to the SEER database after meeting quality control standards ([www.seer.cancer.gov](http://www.seer.cancer.gov)). Thus, the SEER data collection was authorized by the US state laws, and supervised by respective state public-health officials and ethical review committees.

We simplified the SEER COD, which were extracted from death certificates, into 5 categories, namely cardiovascular system disease (CVS), infection, non-breast cancer, breast cancer and others cause (including alive). The alive patients were a small proportion of the population and were thus included in the other causes. We included 40 dichotomized variables in the analyses since dichotomized variables (i.e., one-hot encoding) produced slightly better accuracy as shown before[8] and would not require normalization.

We tuned and employed a random forest algorithm (Stata, version 15) to classify and predict specific causes of death in lung cancer patients. The lung cancer diagnosed during 2004 were included for the completeness in their follow-up and death causes. The patients were randomly divided into training and validation sets (1:1 match). We also compared the accuracies of the final random forest and multinomial regression models using chi-squared test. A p value <0.05 was considered statistically different.

## Results

Among the 44,735 case diagnosed in 2004, we identified and randomly selected 40,000 qualified lung cancers for the analyses, including 20,000 cases for either train or test set (**Table 1**). The causes of death were, in descending ranking order, were lung cancer (72.45%), other causes or alive (14.38%), non-lung cancer (6.87%), cardiovascular disease (5.35%), and infection (0.95%). The mean follow-up time was 114.8 months (standard deviation 49.1).

View this table:
[Table 1.](http://medrxiv.org/content/early/2020/09/27/2020.09.25.20201095/T1)

Table 1. 
Baseline characteristics of the included cases according to the 5-category Cause of death.

We found more 250 iterations and the 10 variables produced the best prediction, whose best accuracy was 69.8% (error-rate 30.2%, **Figure 1**).

![Figure 1.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2020/09/27/2020.09.25.20201095/F1.medium.gif)

[Figure 1.](http://medrxiv.org/content/early/2020/09/27/2020.09.25.20201095/F1)

Figure 1. The error rates in the validation set were reduced as the number of iterations and variables increase. But more than 10 variables were linked to a higher error rate.

The final random forest model with 300 iterations and 10 variables reached an accuracy higher than that of multinomial regression model (69.8% vs 64.6%, P<0.001, **Table 2**). The top-10 most important factors in the random-forest model were sex, chemotherapy status, age (65+ vs <65 years), radiotherapy status, nodal status, T category, histology type and laterality (**Figure 2**).

View this table:
[Table 2.](http://medrxiv.org/content/early/2020/09/27/2020.09.25.20201095/T2)

Table 2. 
Confusion matrices of the random forest and multinomial regression models.

![Figure 2.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2020/09/27/2020.09.25.20201095/F2.medium.gif)

[Figure 2.](http://medrxiv.org/content/early/2020/09/27/2020.09.25.20201095/F2)

Figure 2. In the final random forest model of machine learning with 300 iteration, 10 variables and the highest accuracy (69.72%), the top-10 most important factors in the random-forest model were sex, chemotherapy status, age, radiotherapy status and nodal status. Confirmation by pathology was not in the selected 10 variables/factors.

Interestingly, the pathological confirmation of the diagnosis appeared not very importance in the model. In the multinomial regression model, we also found that sex, chemotherapy status, age (65+ vs <65 years), radiotherapy status, nodal status, T category, histology type and laterality (**Table 3**), while other factors were not, despite being associated with the multicategory COD in univariable regression analysis (Chi-squared test, Table 1). This finding was consistent with that from RF model.

View this table:
[Table 3.](http://medrxiv.org/content/early/2020/09/27/2020.09.25.20201095/T3)

Table 3. 
The factors associated with multicategory causes of death in lung cancer patients as shown in a multinomial regression model.

## Discussion

In this population-based study of 40,000 lung cancer patients with more than 12 years of follow-up, we optimized a RF model of machine learning to predict the specific cause of death in lung cancer patients. The RF model also appears more accurate than multinomial regression model (69.8% vs 64.6%, P<0.001). We also identified the factors linked to the 5-category COD in lung cancer patients, which was not reported by previous works [4, 7].

This study is limited by using the cases of a single diagnosis year, while we feel the large number of cases justified for our findings. Future studies may include the cases of recent diagnosis years so that the findings will be more related to the recent cases. Moreover, the dataset appeared unbalanced in the outcome. Such a nature of the dataset is the real-world evidence, but posed a challenge to modelling. This in part explained why the multinomial regression model had many cases missing an assigned COD. Finally, the findings were not validated by an external dataset. Future works should be carried out using other dataset to validate our findings.

In summary, we here identified the factors that were independently associated with 5-category long-term COD in lung cancer patients in the USA. We also tuned the RF model and showed that it was significantly more accurate in predicting 5-category long-term COD.

## Data Availability

The SEER data were available upon request to the SEER website (www.seer.cancer.gov). All other data are available upon request.

## Data availability

The SEER data were available upon request to the SEER website ([www.seer.cancer.gov](http://www.seer.cancer.gov)). All other data are available upon request.

*   Received September 25, 2020.
*   Revision received September 25, 2020.
*   Accepted September 27, 2020.


*   © 2020, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## References

1.  1.Siegel RL, Miller KD Jemal A. Cancer statistics, 2020. CA Cancer J Clin. 2020; 70: 7–30.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3322/caac.21590&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2020%2F09%2F27%2F2020.09.25.20201095.atom) 

2.  2.Parikh N, Shuck RL, Gagea M, Shen LDonehower LA. Enhanced inflammation and attenuated tumor suppressor pathways are associated with oncogene-induced lung tumors in aged mice. Aging Cell. 2018; 17:
    
    
3.  3.Shang G, Jin Y, Zheng Q, Shen X, Yang M,  Li YZhang L. Histology and oncogenic driver alterations of lung adenocarcinoma in Chinese. Am J Cancer Res. 2019; 9: 1212–1223.
    
    
4.  4.Yin J, Zhao M, Lu T, Huang Y, Sui Q, Xi J, Lin Z, Xu S,  Wang QZhan C. Non-lung cancer specific mortality after lobectomy or sublobectomy in patients with stage IA non-small cell lung cancer ⩽2 cm: A propensity score analysis. J Surg Oncol. 2019; 120: 1486–1496.
    
    
5.  5.Sturgeon KM, Deng L, Bluethmann SM, Zhou S, Trifiletti DM, Jiang C, Kelly SPZaorsky NG. A population-based study of cardiovascular disease mortality risk in US cancer patients. Eur Heart J. 2019; 40: 3889–3897.
    
    
6.  6.Gad MM, Saad AM, Al-Husseini MJ, Abushouk AI, Salahia S, Rehman KA, Riaz HAhmed HM. Temporal trends, ethnic determinants, and short-term and long-term risk of cardiac death in cancer patients: a cohort study. Cardiovasc Pathol. 2019; 43: 107147.
    
    
7.  7.Zaorsky NG, Churilla TM, Egleston BL, Fisher SG, Ridge JA, Horwitz EMMeyer JE. Causes of death among cancer patients. Ann Oncol. 2017; 28: 400–407.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/annonc/mdw604&link_type=DOI) 

8.  8.Wang J, Deng F, Zeng F, Shanahan AJ,  Li WVZhang L. Predicting long-term multicategory cause of death in patients with prostate cancer: random forest versus multinomial model. Am J Cancer Res. 2020; 10: 1344–1355.