Abstract
Background Publicly available multi-sera data have notably increased the chance of discovering the best antibody candidates for explaining naturally acquired protection against malaria and for detecting exposure to malaria parasites. The analysis of these data is typically divided into a feature selection phase followed by a predictive phase in which several models are constructed for the outcome of interest. A key question in the analysis is which features should be included in the predictive stage, and in what form.
Results To answer this question, we developed three approaches for classifying malaria-protected and susceptible groups: (i) a simple approach based on selecting antibodies via the nonparametric Mann-Whitney test; (ii) a dichotomization approach where each antibody was selected according to the optimal cut-off obtained by maximizing the χ2 statistic for two-way tables; (iii) a hybrid parametric/non-parametric approach that applies a Box-Cox transformation followed by a t-test, with finite mixture models and the Mann-Whitney test used as a last resort. We illustrated the application of these three approaches with published serological data for predicting clinical malaria in 121 Kenyan children. The predictive analysis was based on a Super-Learner in which predictions from multiple classifiers were pooled together. The three approaches yielded similar areas under the Receiver Operating Characteristic curve: 0.72 (95% CI = [0.61, 0.82]), 0.80 (95% CI = [0.71, 0.90]) and 0.79 (95% CI = [0.7, 0.88]) for the simple, dichotomization and hybrid approaches, respectively.
Conclusions The three feature selection strategies provided better predictive performance of the outcome than previous results relying on Random Forests alone (AUC=0.68). Given the similar predictive performance, we recommend that the three strategies be used in conjunction on the same data set and selected according to their complexity.
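As an illustration of the dichotomization idea in approach (ii), the following is a minimal sketch and not the authors' implementation. It assumes that candidate cut-offs are the midpoints between consecutive distinct antibody values and that the Pearson χ2 statistic is computed without continuity correction; the function names and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def chi2_two_by_two(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: treated here as showing no association
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def best_cutoff(values: Sequence[float], protected: Sequence[bool]) -> Tuple[float, float]:
    """Return (cutoff, chi2) maximising chi2 over the candidate cut-offs.

    Assumption: candidates are midpoints between consecutive distinct values.
    Ties are resolved in favour of the smallest cut-off.
    """
    distinct = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    best = (float("nan"), 0.0)
    for cut in candidates:
        pairs = list(zip(values, protected))
        a = sum(1 for v, p in pairs if v > cut and p)        # high level, protected
        b = sum(1 for v, p in pairs if v > cut and not p)    # high level, susceptible
        c = sum(1 for v, p in pairs if v <= cut and p)       # low level, protected
        d = sum(1 for v, p in pairs if v <= cut and not p)   # low level, susceptible
        stat = chi2_two_by_two(a, b, c, d)
        if stat > best[1]:
            best = (cut, stat)
    return best


# Toy, made-up antibody levels and protection labels (not data from the study).
levels = [0.2, 0.4, 0.5, 0.9, 1.1, 1.4, 1.6, 2.0]
is_protected = [False, False, True, False, False, True, True, True]
cutoff, stat = best_cutoff(levels, is_protected)
print(cutoff, round(stat, 2))
```

In a full analysis, the maximised statistic would still need a multiple-testing adjustment, because the cut-off is chosen after looking at the outcome; the sketch does not address that step.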
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
Andre Fonseca holds a PhD fellowship from FCT, Fundacao para a Ciencia e Tecnologia (ref. SFRH/BD/147629/2019). Clara Cordeiro and Nuno Sepulveda are partially financed by national funds through FCT under the project UIDB/00006/2020. Nuno Sepulveda is funded by the Polish National Agency for Academic Exchange (grant ref. PPN/ULM/2020/1/00069/U/00001).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
In this version, we made major revisions: we restructured the overall analysis, clarified the objective of the paper and reframed the main ideas in a more intuitive way. The main change was to divide the overall analysis into a feature selection step and a predictive analysis step. Framing the analysis in this way provides a clearer rationale for using different approaches to select the antibodies most associated with malaria disease. A new approach, which uses the chi-squared test to assess associations between antibody values and protection against malaria, was added as a baseline model. A preliminary approach based on the Random Forest was also added to validate previously published results. Finally, a SuperLearner was adopted for the predictive analysis. Compared with fitting several separate models, the SuperLearner is less time-consuming and yields more accurate results.
Data Availability
All data produced are available online at: https://doi.org/10.1371/journal.pcbi.1005812
Abbreviation List
- AIC: Akaike’s Information Criterion
- Ama: Apical membrane antigen 1
- AUC: Area Under the Curve
- EBA: Erythrocyte-binding antigen
- ELISA: Enzyme-linked immunosorbent assay
- FDR: False discovery rate
- GOF: Goodness of fit
- IgG: Immunoglobulin G
- LASSO: Least Absolute Shrinkage and Selection Operator
- LDA: Linear discriminant analysis
- Log: Logarithmic
- LRM: Logistic regression model
- MSP: Merozoite Surface Protein
- MSRP: MSP7-related proteins
- np: Number of Protected individuals
- ns: Number of Susceptible individuals
- Pf: Plasmodium falciparum
- Prt: Protected
- QDA: Quadratic discriminant analysis
- RF: Random Forest
- ROC: Receiver Operating Characteristic
- rS: Spearman Correlation Coefficient
- SeroTAT: Serological testing and treatment
- SL: Super learner
- sPLS-DA: Sparse partial least squares discriminant analysis
- Sus: Susceptible
- SVM: Support vector machine
- SW: Shapiro-Wilk
- χ2: Chi-square
- XGB: Extreme Gradient Boosting