Predicting the Risk of Asthma Development in Youth Using Machine Learning Models

Matthew Xie; Chenliang Xu

doi:10.1101/2024.06.24.24309438

Abstract

Asthma is a chronic respiratory disease characterized by wheezing and difficulty breathing. It disproportionately affects children, with 4.7 million affected in the U.S. Currently, well-performing predictive models of asthma in youth are lacking. This study aims to build machine learning models that better predict asthma development in youth using easily accessible national survey data. We analyzed combined cross-sectional data from the 2021 and 2022 National Health Interview Survey (NHIS), covering 9,716 youth subjects and their corresponding parent information. We built several machine learning models for asthma prediction in youth, namely XGBoost, Neural Networks, Random Forest, Support Vector Machine (SVM), and Logistic Regression, each combined with different sampling techniques (under- or over-sampling). We examined the associations between asthma in youth and potential risk factors identified by both Random Forest and the Least Absolute Shrinkage and Selection Operator (LASSO). Among the sampling techniques, under-sampling the majority class (subjects without asthma) yielded the best area under the curve (AUC) and F1 scores across the predictive models. Logistic Regression performed best on the under-sampled data, with an AUC of 0.7654 and an F1 score of 0.3452. In addition, we identified further important factors associated with asthma development in youth, such as a low family poverty ratio and a parental history of asthma. This study successfully built machine learning models to predict asthma development in youth with good model performance. Such models will be important for early screening and detection of asthma in youth.
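To make the sampling and evaluation terms in the abstract concrete, the following is a minimal sketch of random under-sampling of the majority class together with the two reported metrics, F1 and AUC. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`undersample_majority`, `f1_score`, `roc_auc`), the 1:1 class balance, and the toy data are all hypothetical, and the abstract does not specify which under-sampling variant or software was used.

```python
import random


def undersample_majority(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are the same size.

    Assumption: binary labels (1 = asthma, 0 = no asthma) and a 1:1 target
    ratio; the study may have used a different ratio or method.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (pairwise definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 10 positives among 100 subjects, one feature each.
toy_labels = [1] * 10 + [0] * 90
toy_rows = [[float(i)] for i in range(100)]
balanced_rows, balanced_labels = undersample_majority(toy_rows, toy_labels)
```

In this sketch only the training data would be under-sampled, while metrics such as `f1_score` and `roc_auc` would be computed on held-out data with the original class balance. Whether the study followed that split is not stated in the abstract.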

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The study used only openly available human data that were originally located at the NHIS website: https://www.cdc.gov/nchs/nhis/data-questionnaires-documentation.htm.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data produced are available online at https://www.cdc.gov/nchs/nhis/data-questionnaires-documentation.htm.


The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.