medRxiv

Feature Selection for an Explainability Analysis in Detection of COVID-19 Active Cases from Facebook User-Based Online Surveys

Jesús Rufino, Juan Marcos Ramírez, Jose Aguilar, Carlos Baquero, Jaya Champati, Davide Frey, Rosa Elvira Lillo, Antonio Fernández-Anta
doi: https://doi.org/10.1101/2023.05.26.23290608
Jesús Rufino
aIMDEA Networks Institute, 28918, Madrid, Spain
Juan Marcos Ramírez
aIMDEA Networks Institute, 28918, Madrid, Spain
  • For correspondence: juan.ramirez{at}imdea.org
Jose Aguilar
aIMDEA Networks Institute, 28918, Madrid, Spain
bCEMISID, Universidad de Los Andes, Mérida, 5101, Venezuela
cCIDITIC, Universidad EAFIT, Medellín, Colombia
Carlos Baquero
dUniversidade do Minho and INESC TEC, Braga, Portugal
Jaya Champati
aIMDEA Networks Institute, 28918, Madrid, Spain
Davide Frey
eInria Rennes, Rennes, France
Rosa Elvira Lillo
fUniversidad Carlos III, Madrid, Spain
Antonio Fernández-Anta
aIMDEA Networks Institute, 28918, Madrid, Spain

ABSTRACT

In this paper, we introduce a machine-learning approach to detecting COVID-19-positive cases from self-reported information. Specifically, the proposed method builds a tree-based binary classification model that includes a recursive feature elimination step. Based on Shapley values, the recursive feature elimination method preserves the most relevant features without compromising the detection performance. In contrast to previous approaches that use a limited set of selected features, the proposed approach constructs a detection engine that considers the full set of features reported by respondents. Various versions of the proposed approach were implemented using three different binary classifiers: random forest (RF), light gradient boosting (LGB), and extreme gradient boosting (XGB). We evaluate all implemented versions consistently on data extracted from the University of Maryland Global COVID-19 Trends and Impact Survey (UMD-CTIS) for four countries (Brazil, Canada, Japan, and South Africa) and two periods (2020 and 2021). We also compare the performance of the proposed approach with that of state-of-the-art methods under various quality metrics: F1-score, sensitivity, specificity, precision, receiver operating characteristic (ROC), and area under the ROC curve (AUC). The proposed approach outperformed state-of-the-art detection techniques in terms of the F1-score. In addition, this work shows the normalized daily case curves obtained by the proposed approach for the four countries and compares these estimated curves with those given in official reports. Finally, we perform an explainability analysis, using Shapley values and the relevance ranking of the classification models, to identify the most significant variables contributing to detecting COVID-19-positive cases. This analysis allowed us to determine the relevance of each feature and its contribution to the detection task.
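
For reference, the threshold-based quality metrics named in the abstract (precision, sensitivity, specificity, and F1-score) can all be derived from the confusion-matrix counts of a binary classifier. The following is a minimal sketch of those standard definitions, not the authors' code. The names `ConfusionCounts`, `confusion_counts`, and `detection_metrics` are hypothetical, labels are assumed to be encoded as 1 (positive) and 0 (negative), and returning 0.0 when a denominator is zero is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of a binary confusion matrix (hypothetical helper type)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def confusion_counts(y_true, y_pred):
    """Tally confusion-matrix counts from 0/1 labels and 0/1 predictions."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    """Divide safely; an undefined ratio is reported as 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(counts):
    """Return precision, sensitivity, specificity, and F1-score as a dict."""
    precision = _ratio(counts.tp, counts.tp + counts.fp)
    sensitivity = _ratio(counts.tp, counts.tp + counts.fn)  # also called recall
    specificity = _ratio(counts.tn, counts.tn + counts.fp)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not survey data.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(detection_metrics(confusion_counts(y_true, y_pred)))
```

Because the F1-score ignores true negatives, it is less affected than accuracy by the large share of negative responses that is typical of survey data. ROC and AUC are not covered by this sketch because they require the classifier's continuous scores rather than thresholded predictions.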

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work was partially supported by grant CoronaSurveys-CM, funded by IMDEA Networks and Comunidad de Madrid, Spain, grants COMODIN-CM and PredCov-CM, funded by Comunidad de Madrid and the European Union through the European Regional Development Fund (ERDF), grants TED2021-131264B-I00 (SocialProbing) and PID2019-104901RB-I00, funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR, and individual donations to the CoronaSurveys Project https://coronasurveys.org.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The Ethics Board (IRB) of IMDEA Networks Institute gave ethical approval for this work on 2021/07/05. IMDEA Networks has signed Data Use Agreements with Facebook, Carnegie Mellon University (CMU) and the University of Maryland (UMD) to access their data, specifically UMD project 1587016-3 entitled C-SPEC: Symptom Survey: COVID-19 and CMU project STUDY2020_00000162 entitled ILI Community-Surveillance Study. The data used in this study was collected by the University of Maryland through The University of Maryland Social Data Science Center Global COVID-19 Trends and Impact Survey in partnership with Facebook. Informed consent has been obtained from all participants in this survey by this institution. All the methods in this study have been carried out in accordance with relevant ethics and privacy guidelines and regulations.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The data presented in this paper (in aggregated form) and the programs used to process it will be openly accessible at https://github.com/GCGImdea/coronasurveys/. The microdata of the CTIS survey from which the aggregated data was obtained cannot be shared, as per the Data Use Agreements signed with Facebook, Carnegie Mellon University (CMU) and the University of Maryland (UMD).


Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted June 05, 2023.
Subject Area

  • Public and Global Health