medRxiv

Predicting infectious disease for biopreparedness and response: A systematic review of machine learning and deep learning approaches

Ravikiran Keshavamurthy, Samuel Dixon, Karl T. Pazdernik, Lauren E. Charles
doi: https://doi.org/10.1101/2022.06.30.22277117
Ravikiran Keshavamurthy
1Pacific Northwest National Laboratory, Richland, WA 99354, USA
2Paul G. Allen School for Global Health, Washington State University, Pullman, WA 99164, USA
Samuel Dixon
1Pacific Northwest National Laboratory, Richland, WA 99354, USA
Karl T. Pazdernik
1Pacific Northwest National Laboratory, Richland, WA 99354, USA
3Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
Lauren E. Charles
1Pacific Northwest National Laboratory, Richland, WA 99354, USA
2Paul G. Allen School for Global Health, Washington State University, Pullman, WA 99164, USA
  • For correspondence: lauren.charles{at}pnnl.gov

Abstract

Despite the complex and unpredictable nature of pathogen occurrence, substantial efforts have been made to better predict infectious diseases (IDs). Following PRISMA guidelines, we conducted a systematic review to investigate the advances in ID prediction capabilities for human and animal diseases, focusing on Machine Learning (ML) and Deep Learning (DL) techniques. Between January 2001 and May 2021, the number of relevant articles published steadily increased, with a significant influx after January 2019. Among the 237 articles included, a variety of IDs and locations were modeled, with the most common being COVID-19 (37.1%) followed by Influenza/influenza-like illnesses (8.9%) and Eastern Asia (32.5%) followed by North America (17.7%), respectively. Tree-based ML models (38.4%) and feed-forward DL neural networks (26.6%) were the most frequent approaches, taking advantage of a wide variety of input features. Most articles contained models predicting temporal incidence (66.7%) followed by disease risk (38.0%) and spatial movement (31.2%). Less than 10% of studies addressed the concepts of uncertainty quantification, computational efficiency, and missing data, which are essential to operational use and deployment. Our study summarizes the broad aspects and current status of ID prediction capabilities and provides guidelines for future works to better support biopreparedness and response.

Introduction

Infectious disease (ID) events have plagued human and animal populations throughout history, resulting in massive numbers of morbidities and mortalities as well as substantial social and economic impacts across the world1. The effects of climate change, urbanization, and globalization have rendered these diseases borderless, enabling them to spread easily across regions and inevitably increasing the risk of epidemics and pandemics. Currently, ID prediction is one of the most important operational epidemiological tools with the potential to provide early warning to actively prevent disease occurrence and spread. By combining robust data collection, engineering, and analysis strategies, predicting disease event information, such as location, timing, intensity, and various other factors responsible for its occurrence, becomes possible. Timeliness of this type of predicted information is crucial for decision makers to effectively mobilize health resources to the area of concern and properly implement control and prevention strategies2.

Predicting ID is a challenging task, mainly due to the complex and unpredictable nature of pathogen ecology and evolutionary dynamics3. These inherent complexities demand robust uncertainty quantification systems for better decision making. The challenge of ID prediction is exacerbated by inadequate and biased disease surveillance initiatives, a lack of disease reporting systems, and incomplete and delayed epidemiological data sharing4,5. Despite these limitations, significant efforts have been made in the past couple of decades to utilize ID prediction models in operational control and prevention strategies. In particular, the emergence of the coronavirus disease 2019 (COVID-19) pandemic has resulted in accelerated development and integration of ID prediction models in worldwide public health decision making6.

In recent years, factors such as an exponential increase in computing power, easy access to large and diverse datasets, and advancements in artificial intelligence have facilitated extraordinary growth in the field of infectious disease prediction7. Machine Learning (ML) and Deep Learning (DL) methods are widely used for a variety of disease intelligence tasks, including temporal, spatial, and risk factor predictions8. ML models have been shown to outperform traditional statistical techniques to give more accurate and reliable predictions9,10. The ML techniques most widely used in the field of ID prediction include tree-based approaches10–12 and Support Vector Machines (SVM)13–15 due to their ease of implementation and interpretability. On the other hand, DL techniques, such as feed-forward neural networks (FNN)16,17 and recurrent neural networks (RNN)18,19, are popular for their ability to integrate large and complex data into their predictions.

Many complex factors contribute to and influence the presence of an ID event, including epidemiologic, geographic, climatic, demographic, behavioral, and sociopolitical factors. Traditional ID prediction models can only process a limited number of explanatory variables and do not perform well on cross-correlated features. On the other hand, ML and DL models excel at processing large amounts of feature data and finding complex and hidden connections amongst data sources. ID prediction modeling has, therefore, greatly benefited from the recent “big data” revolution20. Remote sensing satellite imagery and census data yield high resolution information about critical disease related factors, such as climate, environment, population density, and demography. With the increase in worldwide internet and mobile phone usage, non-traditional information (e.g., internet searches, social media usage, phone call records, news media trends, and population mobility data) is also readily available. The ML and DL approaches have become highly efficient in utilizing large and complex information gathered through multiple channels to provide a unique opportunity to understand and model ID dynamics like never before3,21. However, the utilization of large datasets and increased complexity of the prediction models could lead to an exponential rise in computational requirements. Hence, optimizing the memory and processing requirements of the ML and DL algorithms without compromising their predictive capabilities is crucial.

This study investigates the advances to and quality of ID prediction capabilities, focusing on ML and DL techniques applied over the past two decades. To do this evaluation, we systematically reviewed the scientific literature to identify research that included ML and/or DL models to predict IDs in humans and/or animals. Within the collection, we highlighted specific tasks performed by each prediction model type, input features used for model building, the study spatial and temporal scales, and error metrics applied. We specifically noted if the studies addressed the important issues of uncertainty quantification, computational efficiency, and missing data when building the models. By focusing on the above-mentioned research areas, we identified the best approaches and strategies as well as revealed gaps present in the field of ID prediction modeling. This systematic analysis can be used as a guide to improve future research studies, to better address operational needs for model deployment, and to inform areas where public health and veterinary policies can help improve predictive capabilities.

Methodology

To assess the application of ML and DL techniques in the field of infectious disease prediction, we conducted a systematic review following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines22. A diverse set of subject matter experts, spanning infectious diseases, public health, epidemiology, computer engineering, data science, and statistics, formulated the following specific research questions:

  • Which IDs are modeled using ML and DL techniques?

  • Which global geographic regions are modeled in the ID prediction studies?

  • What is the trend and extent of ML and DL types and sub-types used in ID predictions?

  • What are the various tasks performed by ID prediction models?

  • What are the different input features used for ID predictions?

  • What are the spatial (geographic extent) and temporal (duration) scales of the studies?

  • What are the error metrics used?

  • Is uncertainty quantification, computational efficiency, or missing data handling addressed?

Eligibility criteria

Specific eligibility criteria were developed based on subject matter expert recommendations. Inclusion criteria required that the study (1) must explicitly include temporal, spatial, and/or risk prediction models of infectious diseases; (2) must utilize ML and/or DL techniques for predictions; (3) must be an original study; and (4) must be published in a peer-reviewed journal in the English language between Jan 2001 and May 2021. We excluded prediction studies containing sexually transmitted diseases, cancer, clinical trials, and only biomarker data (e.g., genomics, proteomics, transcriptomics). In addition, we did not include research that primarily utilized traditional statistics-based regression or classification methods (e.g., linear, non-linear, autoregressive moving average, logistic or Poisson models). Preprints, book chapters, conference presentations, reviews, opinions, commentaries, and dissertations were excluded. We also excluded articles with missing or inaccessible full texts.

Search strategy

In May 2021, the scientific literature databases PubMed, Web of Science, Embase, Scopus, and Google Scholar were searched to ensure effective and adequate coverage of targeted studies (Table 1). The literature published between Jan 2001 and May 2021 was searched using the keywords recommended by subject matter experts. We restricted the Google Scholar searches to the first 300 results, which provides acceptable search coverage of academic literature without excluding useful references23. The citation manager Mendeley (https://www.mendeley.com/) was used to manage imported review citations.

Table 1:

Search keywords and scientific literature databases used to identify potentially relevant publications for systematic review.

Selection strategy

Citations were first de-duplicated before proceeding to the manual screening of abstracts. As the first step, each abstract was evaluated by two independent reviewers for possible eligibility in the systematic review based on defined eligibility criteria. Next, the full texts of potential candidate articles were evaluated in detail by the reviewers to ensure all criteria were met. Articles that passed the two-part screening were included in the final publication list and, ultimately, in the systematic review.
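The de-duplication step can be sketched in a few lines. The review does not describe how duplicates were detected (the citations were managed in Mendeley), so the matching rule below, DOI first and normalized title as a fallback, is an assumption, and the records and field names are hypothetical.

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation/whitespace so near-identical exports match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def deduplicate(records):
    """Keep the first record seen for each DOI, or for each normalized title when the DOI is missing."""
    seen = set()
    unique = []
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        key = ("doi", doi) if doi else ("title", normalize_title(rec["title"]))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records, as they might be exported from different databases.
records = [
    {"title": "Forecasting dengue with random forests", "doi": "10.1000/example.1", "source": "PubMed"},
    {"title": "Forecasting Dengue with Random Forests.", "doi": "10.1000/EXAMPLE.1", "source": "Scopus"},
    {"title": "LSTM models for influenza-like illness", "doi": "", "source": "Embase"},
    {"title": "LSTM Models for Influenza-Like Illness", "doi": None, "source": "Google Scholar"},
]
unique_records = deduplicate(records)
print(len(records), "->", len(unique_records))
```

A rule this simple misses a duplicate pair in which only one copy carries a DOI, which is one reason automated de-duplication is normally followed by manual screening, as in the two-reviewer process described above.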

Information extraction

The ML and DL models present in the reviewed literature were classified into the broad categories listed below, based on the tasks they performed.

Temporal prediction models utilize historic disease information to predict future disease events. These models attempt to answer when the next disease outbreak would occur in the future based on past events.

Spatial prediction models utilize historic disease information to predict the geographic distribution of disease events. These models attempt to answer where the next disease outbreak might occur by imputing the locations where disease occurrence information is not available.

Risk prediction models assess the relationship between disease events and various factors associated with their occurrence. These models attempt to estimate spatial and/or temporal risk factors correlated with the disease event.

During the process of full-text review, the reviewers recorded the following information: model types and subtypes, disease names, primary study hosts, input features or explanatory variables used for predictions, study area, study duration, temporal forecasting distance, error metrics used, uncertainty quantification, missing data handling, and computational efficiency. These groupings are not mutually exclusive. For example, Zhang et al.a211 compared the performance of temporal prediction models belonging to FNN and RNN to forecast typhoid fever incidence in China. To evaluate their model performance, they used three error metrics (mean absolute error, mean absolute percentage error, and mean square error). Hence, this citation was placed under multiple prediction model subtype and error metric categories. Similarly, if the models in a publication performed multiple tasks, such as modeling multiple diseases, geographic locations, or prediction categories, the citation was placed in all relevant categories. Any differences in opinion between the independent reviewers raised during the collection, screening, and information recording processes of the review were resolved through internal discussion until consensus was achieved.
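Because the groupings are not mutually exclusive, category counts can sum to more than the number of articles, and each percentage is taken over the article total rather than over the sum of counts. The sketch below illustrates this bookkeeping. The record layout is an assumption; only the a211 entry reflects the example in the text, and the other two entries are hypothetical.

```python
from collections import Counter

def tally(articles, field):
    """Count articles per category of `field`; one article may count in several categories."""
    counts = Counter()
    for art in articles:
        counts.update(set(art[field]))  # set(): at most one count per article per category
    return counts

def share(count, total):
    """Percentage of `total` articles, rounded to one decimal place."""
    return round(100 * count / total, 1)

articles = [
    # From the text: FNN and RNN compared using MAE, MAPE, and MSE.
    {"id": "a211", "model_subtypes": ["FNN", "RNN"], "error_metrics": ["MAE", "MAPE", "MSE"]},
    # Hypothetical entries, for illustration only.
    {"id": "x1", "model_subtypes": ["RF"], "error_metrics": ["RMSE", "MAE"]},
    {"id": "x2", "model_subtypes": ["RNN", "RF"], "error_metrics": ["RMSE"]},
]

subtype_counts = tally(articles, "model_subtypes")
metric_counts = tally(articles, "error_metrics")
print(subtype_counts, sum(subtype_counts.values()), "placements for", len(articles), "articles")
```

The same `share` helper reproduces the percentages reported in the Results from their counts, for example `share(158, 237)` for temporal prediction articles.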

Results

We identified 16,148 articles that were published in peer reviewed journals between January 2001 and May 2021 (Fig. 1). After removing the duplicates and screening the records for inclusion and exclusion criteria, 237 articles were selected for the final systematic review. The complete list of articles that were included in this systematic review is provided in Supplementary Note (a1–a237).

Figure 1: PRISMA flow diagram.

Illustration of the overall selection process.

* Google Scholar searches were restricted to the first 300 results

ML and DL modeling for infectious disease prediction

Among the large and diverse number of ID prediction models applying ML or DL methods found in the literature based on our criteria, COVID-19 undoubtedly received the most attention and was studied in 88 (37.1%) articles. Influenza and influenza-like illnesses were modeled in 22 (9.3%) articles followed by dengue and malaria in 21 (8.9%) and 12 (5.1%) articles, respectively. The complete list of all the infectious diseases identified in the literature review along with their citations is presented in Table 2.

Table 2:

Citations categorized by infectious disease and study host

A large majority (205, 86.5%) of articles focused on modeling only humans followed by only domestic animals (9, 3.8%), only wildlife (6, 2.5%), and only vectors (6, 2.5%) (Fig. 2). There were only a few articles (12, 5.1%) that used more than one host species for modeling IDs.

Figure 2: Venn diagram of articles grouped by host species included in infectious disease modeling using machine learning and deep learning techniques.

Domesticated animals include livestock and companion animals; wildlife includes wild animals and birds.

Regional distribution of studies

Of the 237 included studies, the largest share focused on Eastern Asia (77, 32.5%), followed by North America (42, 17.7%), Southern Asia (31, 13.1%), Latin America (20, 8.4%), and Western Europe (18, 7.6%). There were 36 (15.2%) studies that included multiple regions (more than four), which were grouped as a separate category. A complete breakdown of the articles with ID models belonging to each geographical region grouped by diseases is presented in Fig. 3.

Figure 3: Distribution of articles with infectious disease models built for each geographical region.

If an article included infectious disease models for more than four regions, it was placed in the “multiple regions” category. Similarly, if an article included models for multiple diseases, it was placed in each respective category.

Trend and extent of use of ML and DL in infectious disease prediction models

There has been an increasing trend in the use of ML and DL techniques for ID prediction since 2001, with a substantial rise between January 2019 and May 2021 (Fig. 4). Of the 237 articles included in the study, 127 (53.6%) applied at least one type of ML approach and 129 (54.4%) used at least one DL technique for disease prediction (Fig. 4a). For the DL models, the FNN (63, 26.6%), RNN (48, 20.3%), and DL hybrids/ensembles (27, 11.4%) were the most common approaches (Fig. 4b). Tree-based methods (91, 38.4%) followed by SVM (36, 15.2%) and then likelihood-based methods (22, 9.3%) were the most common ML approaches (Fig. 4c). Within tree-based ML methods, Random Forest (RF) (44, 18.6%) followed by Boosted Regression Trees (BTR) (30, 12.7%) and Extreme Gradient Boosting (XGB) (12, 5.1%) were most often used (Fig. 4d). More details, including the citations grouped by model type and subtype, are presented in Supplementary Table S1.

Figure 4: Trend and extent of ID prediction models published (January 2001-May 2021):

Number of citations placed by a) model types (i.e., ML or DL) b) DL model subtypes c) ML model subtypes d) Tree-based ML model subtypes. Note: if an article contained models from different types or subtypes, it was placed in each respective group.

Utilization of ML and DL approaches for different prediction categories

We grouped the 237 articles into prediction categories based on the tasks they performed. The majority of the articles performed temporal predictions (158, 66.7%) followed by disease risk predictions (90, 38.0%) and spatial predictions (74, 31.2%). COVID-19 was the most frequently modeled disease, with the majority being temporal prediction models (Fig. 5). More details, including the citations grouped by prediction categories and model type, are presented in Supplementary Table S1.

Figure 5: Model prediction categories.

The distribution of disease prediction models grouped by model categories and diseases. If an article contained models that performed multiple prediction tasks and for multiple diseases, it was placed in each respective group.

Spatial and temporal scales of the dataset used in the studies

The spatial scale (geographic extent) and temporal scale (duration) of the datasets used to make predictions were identified through the geographic extent/size and duration of the studies, respectively. Overall, for all ID prediction model categories, most articles predicted ID at the country level (124, 52.3%) using only up to one year of data (132, 55.7%) (Fig. 6a, 6b). Among temporal prediction models, near-term forecasting (up to one month) using one year's worth of data was the most common (53, 33.5%) (Fig. 6c).

Figure 6: Spatial and temporal scale of ID prediction models.

a) Proportion of the spatial scale (geographic extent) of the models grouped by model categories b) Proportion of temporal scale (duration) of the models grouped by model categories c) Among temporal prediction models, proportion of forecasting distance grouped by temporal scale. An article was placed in its respective groups if it utilized ID models with multiple model categories, spatial and/or temporal scales.

Input feature groups utilized for disease prediction

The articles included in the study utilized input features that belonged to the following eight groups: case counts (154, 65.0%), climate/weather (98, 41.4%), demographics/socioeconomics (63, 26.6%), landscape/geography (58, 24.5%), social media/internet searches (18, 7.6%), health and comorbidity (7, 3.0%), human mobility (4, 1.7%), and news (3, 1.3%). Each disease modeled had a unique signature of input feature groups used for prediction (Fig. 7a). Across the model prediction categories, the number of input feature groups used ranged from a minimum of one feature group (n = 151, 63.7%) to a maximum of five groups (n = 3, 1.3%) (Fig. 7b). A complete breakdown of the characteristics of each input feature group utilized for ID prediction is presented in Figure 3.

Figure 7: Characteristics of input feature groups utilized for disease prediction.

Articles (n = 237) categorized by a) input feature groups used by disease type b) number of input feature groups utilized by ID prediction model categories. If an article utilized multiple input features, modeled multiple diseases and/or belonged to multiple model categories, the article was counted within each respective grouping.

Uncertainty quantification, computational efficiency, and missing data

Only 21 (8.9%) of the articles quantified uncertainty in their model predictions. The uncertainty quantification techniques used were frequentist (10, 4.2%)a46, a67, a68, a91, a107, a123, a145, a152, a193, a195, simulation/sampling-based (7, 3.0%)a26, a53, a156, a200, a213, a214, a219, and Bayesian (3, 1.3%)a94, a111, a115.
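As an illustration of the simulation/sampling-based family, the sketch below attaches an approximate prediction interval to a point forecast by resampling past forecast errors (a residual bootstrap). It is not taken from any of the reviewed articles. The function name, the residuals, and the point forecast are hypothetical, and the method assumes past errors are representative of future ones.

```python
import random

def bootstrap_interval(point_forecast, residuals, n_draws=2000, level=0.95, seed=0):
    """Approximate prediction interval from resampled past errors (observed minus predicted)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    draws = sorted(point_forecast + rng.choice(residuals) for _ in range(n_draws))
    lo_idx = int((1 - level) / 2 * (n_draws - 1))
    hi_idx = int((1 + level) / 2 * (n_draws - 1))
    return draws[lo_idx], draws[hi_idx]

# Hypothetical errors from ten earlier weekly forecasts, and a new point forecast of 120 cases.
past_residuals = [-12, 5, 8, -3, 15, -7, 2, -10, 6, 4]
lower, upper = bootstrap_interval(120, past_residuals)
print(f"forecast 120 cases, approximate 95% interval [{lower}, {upper}]")
```

Reporting the interval alongside the point forecast lets a decision maker see how much the forecast could plausibly be off, which a single number hides.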

Only 7 (3%) publicationsa10, a13, a22, a63, a64, a79, a102, a103 meeting the review criteria included information about computational efficiency while evaluating the performance of their models.

We also noted any missing data handling techniques used in model building. The majority of the articles (220, 84.4%) either did not report any missing data or did not explicitly mention how missing data was handled in their work. For the 18 (7.6%) articles that did discuss this topic, the techniques applied included replacement with mean/median or zerosa56, a64, a72, a185, a187, moving averagea136, a96, a128, regressiona103, a108, a185, correlationa220, KNNa103, multivariate imputationa111, a136, a139, exclusiona24, and pixel gap filling a157.
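Two of the simpler techniques named above, replacement with the mean and moving-average replacement, can be sketched as follows. The series is hypothetical, and the trailing three-point window is an assumption rather than a setting reported by the cited articles.

```python
from statistics import mean

def impute_mean(series):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [x for x in series if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in series]

def impute_moving_average(series, window=3):
    """Replace None with the mean of up to `window` preceding values, imputed ones included."""
    out = []
    for x in series:
        if x is None:
            recent = [v for v in out[-window:] if v is not None]
            x = mean(recent) if recent else None  # a leading gap has nothing to average
        out.append(x)
    return out

weekly_cases = [10, 12, None, 16, 20, None, 24]  # hypothetical weekly case counts
print(impute_mean(weekly_cases))
print(impute_moving_average(weekly_cases))
```

On a rising series the two rules disagree noticeably: the global mean ignores the trend, while the trailing average follows it with a lag. This is one reason the choice of imputation method is worth reporting.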

Common error metrics used in ID prediction modeling

Among classification models that predicted discrete values (e.g., presence or absence of a disease), Area Under the Curve - Receiver Operating Characteristic (AUC-ROC) curve (46, 19.4%), accuracy (29, 12.2%), and sensitivity (16, 6.8%) were the top three error metrics (Fig. 8a). Alternatively, among regression models that predicted continuous values (e.g., monthly number of disease cases), Root Mean Square Error (RMSE) (98, 41.4%) followed by Mean Absolute Error (MAE) (67, 28.3%) and Mean Absolute Percentage Error (MAPE) (57, 24.1%) were the most common (Fig. 8b).
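For reference, the metrics named above can be computed with their standard textbook definitions, as in the sketch below. The data are hypothetical. The AUC-ROC is computed in its pairwise form, as the fraction of positive-negative pairs in which the positive case receives the higher score, which is adequate for small samples.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Percentage error; undefined when any true value is zero."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(labels, preds):
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def sensitivity(labels, preds):
    """True positives divided by all actual positives."""
    tp = sum(l == 1 and p == 1 for l, p in zip(labels, preds))
    return tp / sum(l == 1 for l in labels)

def auc_roc(labels, scores):
    """Fraction of positive-negative pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical regression example: monthly case counts.
observed = [100, 120, 80, 150]
predicted = [110, 115, 90, 140]
print(mae(observed, predicted), rmse(observed, predicted), mape(observed, predicted))

# Hypothetical classification example: disease presence (1) or absence (0).
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(labels, preds), sensitivity(labels, preds), auc_roc(labels, scores))
```

RMSE penalizes large misses more heavily than MAE, and MAPE is scale-free but breaks down when true counts are zero, which is common in sparse disease data. These differences are one reason articles often report several metrics together.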

Figure 8: Error metrics utilized in ID prediction models:

Citations grouped by a) Classification error metrics and b) Regression error metrics. If an article used error metrics from different classes, it was placed in each respective group. Abbreviations: AUC-ROC (Area Under the Curve - Receiver Operating Characteristic curve), AIC/BIC (Akaike’s/Bayesian Information Criteria), corr coeff. (Correlation coefficient), MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error), MSE (Mean squared error), RMSE (Root Mean Square Error)

Discussion

ID threats are constantly changing across space and time; hence, accurate and timely estimation of their occurrence is critical to planning and implementing successful disease preparedness and response strategies24,25. This systematic review was conducted to understand the current state and extent of utilization of ML and DL algorithms in ID prediction. Our review showed that overall, there was a constant increase in the number of studies that utilized ML and DL to build ID prediction models between 2005 and 2019. Unsurprisingly, we saw an exponential rise in this trend after the COVID-19 pandemic outbreak. The overall global responses to the COVID-19 pandemic by the scientific community, governments, and non-government agencies have been unprecedented. This collective effort has resulted in increased collaboration among health sectors, large-scale disease surveillance, accessible data, and artificial intelligence technology sharing initiatives26. The availability of crucial epidemiological knowledge through these initiatives, along with the need for an accurate assessment of the disease dynamics, has led to a dramatic increase in the utilization of ML and DL prediction modeling.

Most of the IDs that were modeled were either zoonotic in nature or diseases solely affecting the human population. Apart from COVID-19, influenza and influenza-like illnesses, dengue, malaria, and tuberculosis received major attention. These diseases can spread easily among the human community either directly through aerosolization or contact (influenza and influenza-like illnesses, tuberculosis) or through vectors (dengue, malaria). This potential to spread easily and the ability to cause wide-scale mortalities and morbidities most likely led to increased attention from global health communities. Furthermore, almost all recent pandemics and a large proportion of emerging IDs originated from wildlife spillover and involve complex dynamic interactions within human and domesticated animal populations27. Our review found the majority of zoonotic diseases modeled had humans as their primary host. More efforts are required to integrate other host species that might significantly affect the transmission and persistence of IDs across time and space. We identified only a very small number of publications on non-zoonotic livestock diseases, which could be due to inadequate livestock disease surveillance and the unavailability of reliable epidemiological data for modeling purposes. More efforts should be made to better predict these economically significant veterinary diseases since many of them, such as African swine fever, are highly contagious transboundary diseases with significant global food security and safety impacts.

The articles identified were almost evenly split between ML and DL techniques for ID prediction tasks. Within ML techniques, tree-based methods were popular among all prediction categories. Tree-based methods such as RF, BTR, and XGB are often among the best performing types of prediction models9,28,29. These models are also easy to implement, fast to compute, highly performant, and provide a form of interpretability through input feature importance, which could be the main reasons for their popularity in ID modeling30,31. Alternatively, FNNs and RNNs were the most frequently used DL techniques and were mostly used for temporal prediction. FNNs are artificial neural networks that can learn complex and non-linear patterns without making any prior assumptions concerning data distributions32,33. RNNs (e.g., Long Short-Term Memory and Gated Recurrent Unit networks) are derivatives of FNNs and are known to produce strong predictions with time series or other types of sequential data because of their ability to utilize historic information to predict future values34. Given that ID outbreaks generally follow a non-linear and complex pattern, these neural networks are often shown to produce superior predictions compared to other approaches and are hence commonly used in disease forecasting tasks. It is also worthwhile to note that ML and DL hybrids/ensembles have attracted great attention from the ID communities in the past few years, as evidenced by their increased use in publications. Hybrid and ensemble models are information fusion concepts that combine statistical, mechanistic, ML, and/or DL approaches working together (hybrid) or independently (ensemble) to minimize prediction noise and increase accuracy over the individual models, which could be one of the possible explanations for their increased popularity in recent years8,35.
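The ensemble idea, independent models whose forecasts are combined afterwards, can be illustrated with the simplest possible combiner, an unweighted mean. The forecasts below are hypothetical toy numbers chosen so that the two models err in opposite directions. Real ensembles in the reviewed literature may use weighted or learned combinations.

```python
def mean_ensemble(*forecasts):
    """Average the forecasts of independently built models, time step by time step."""
    return [sum(step) / len(step) for step in zip(*forecasts)]

def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical weekly case counts and forecasts from two independent models.
observed = [50, 60, 70]
model_a = [55, 58, 76]
model_b = [46, 65, 66]

ensemble = mean_ensemble(model_a, model_b)
for name, forecast in [("model A", model_a), ("model B", model_b), ("ensemble", ensemble)]:
    print(name, round(mean_abs_error(observed, forecast), 2))
```

Averaging helps here because the individual errors partly cancel. When member models share the same bias, a mean ensemble inherits that bias, so the gain is not guaranteed. The only general guarantee is that the mean absolute error of the averaged forecast does not exceed the average of the members' errors.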

A wide variety of input features were used for training ID models. Conventional variables (e.g., previous case counts, climate/weather, demographics/socioeconomics, and landscape/geographic data) were routinely utilized to make disease predictions. However, one of the biggest constraints for building a reliable ID prediction model to accurately estimate the progression of the disease is the timeliness of available, essential outbreak-related data. These constraints are aggravated in cases involving a novel disease outbreak or neglected endemic disease where the spatial and temporal patterns of the pathogen emergence are largely unknown. Furthermore, a major outbreak could lead to a significant shift in population social behavior and movement due to public health efforts and government policies, resulting in prediction inaccuracies. Hence, the incorporation of novel data sources that account for these dynamic behaviors is vital for accurate and timely decision making. In our review, we identified studies that utilize news articles, social media or internet search queries, health information collected using phones/wearable devices, and human mobility data. The ML and DL models used in these studies exploited a large quantity of structured and unstructured data with the goal of producing better ID predictions.

Although methods used in ID prediction are becoming more sophisticated, we also identified consistent concerns in the structure of the analyses that could limit their practical use. First, we found that the data collection duration for a large majority of the studies was less than or equal to one year regardless of the prediction category. Secondly, most articles do not include uncertainty quantification or account for missing data. This was apparent especially during the early stages of the COVID-19 pandemic where the availability of data was limited and there was a widespread underreporting of the cases. Since each ID is known to show specific occurrence patterns that change over time and space, such short-term predictions could be subject to biases and estimation inaccuracies, which should be carefully accounted for while deploying the algorithms to an operational environment. Though a short turnaround time could be vital for a real-world ID event, we recommend updating the models regularly with new data and retraining them for better and long-term practical usage.

Another serious limitation common to the literature reviewed is a lack of discussion regarding data quality and the functional deployment of an algorithm. While one algorithm may perform the best in terms of overall tested accuracy, it may overstate its confidence, may be unrealistic to implement due to its computational cost, or may simply fail in the presence of missing data. Since disease prediction models are meant to provide situational awareness, reliable and near-real-time results are necessary36. The fact that so few publications consider the critical aspects of automated algorithm implementation suggests that a greater emphasis should be placed on the operational aspects of epidemiology for biopreparedness and response.

While our systematic review was comprehensive, it still has some limitations. First, we only included peer-reviewed studies that were published in a scientific journal. This could have resulted in a selection bias by excluding important studies disseminated as preprints, conference proceedings, books, dissertations, or theses. Second, we did not include studies that primarily utilized traditional statistics-based regression or classification methods. Considering the amount of literature available about these techniques, they will require a separate literature review of their own.

In summary, we conducted a systematic review to determine the current state of ID prediction capabilities that utilize ML and DL techniques. We specifically examined the IDs modeled, the types of ML and DL techniques used, the geographical distribution of the modeling studies, the prediction tasks performed, the input features used, the spatial and temporal scales of the studies, the error metrics used, the computational efficiency of the models, and the uncertainty quantification and missing data handling methods adopted. We observed a diverse range of IDs modeled, with COVID-19 appearing most frequently in recent literature. We also note a consistent increase over the past two decades in the number of studies that apply ML and DL techniques to ID prediction tasks. However, despite the increased use of data-driven methodology, more studies are needed that include the full disease ecology, specifically for zoonotic and veterinary disease predictions, from the human, animal, vector, and environmental aspects in a One-Health context. Finally, to enable biopreparedness and response, studies should incorporate the assessment of uncertainty in their predictions and the computational requirements of their models, which are crucial for operational deployment.

Data Availability

All data generated or analyzed during this review are included in this article and its supplementary information files.

Author contributions

R.K., L.E.C., K.P., and S.D. designed the search strategy, implemented the study protocol, and retrieved articles. R.K. led the screening and data extraction process and performed data analysis and visualization of the results. R.K., L.E.C., and K.P. wrote the manuscript. L.E.C. and K.P. acquired the funding. All authors reviewed the manuscript and agreed to the published version of the manuscript.

Funding

This work was funded by the Defense Threat Reduction Agency (project number CB11029).

Acknowledgements

The authors wish to thank Nakita Pradhan, Samuel Ortega, and Jaidyn Bryant for their contributions to the initial review process, and Samantha Erwin for reviewing the manuscript and providing general feedback. R.K. acknowledges the support from the Pacific Northwest National Laboratory (PNNL)-Washington State University (WSU) Distinguished Graduate Research Program (DGRP) Fellowship for facilitating this research collaboration.

References

  1. Feldmann, H. et al. Emerging and re-emerging infectious diseases. Medical Microbiology and Immunology 191, 63–74 (2002).
  2. Woolhouse, M. How to make predictions about future infectious disease risks. Philosophical Transactions of the Royal Society B: Biological Sciences 366, 2045–2054 (2011).
  3. Heesterbeek, H. et al. Modeling infectious disease dynamics in the complex landscape of global health. Science 347, 6227 (2015).
  4. Charles-Smith, L. E. et al. Using Social Media for Actionable Disease Surveillance and Outbreak Management: A Systematic Literature Review. PLOS ONE 10, e0139701 (2015).
  5. Keshavamurthy, R., Thumbi, S. M. & Charles, L. E. Digital Biosurveillance for Zoonotic Disease Detection in Kenya. Pathogens 10, 783 (2021).
  6. Becker, A. D. et al. Development and dissemination of infectious disease dynamic transmission models during the COVID-19 pandemic: what can we learn from other pathogens and how can we move forward? The Lancet Digital Health 3, e41–e50 (2021).
  7. Wong, Z. S. Y., Zhou, J. & Zhang, Q. Artificial Intelligence for infectious disease Big Data Analytics. Infection, Disease & Health 24, 44–48 (2019).
  8. Alfred, R. & Obit, J. H. The roles of machine learning methods in limiting the spread of deadly diseases: A systematic review. Heliyon 7, e07371 (2021).
  9. Dixon, S. et al. A Comparison of Infectious Disease Forecasting Methods across Locations, Diseases, and Time. Pathogens 11, 185 (2022).
  10. Kane, M. J., Price, N., Scotch, M. & Rabinowitz, P. Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks. BMC Bioinformatics 15, 276 (2014).
  11. Salami, D., Sousa, C. A., Martins, M. R. O. & Capinha, C. Predicting dengue importation into Europe, using machine learning and model-agnostic methods. Scientific Reports 10, 1–13 (2020).
  12. Herrick, K. A., Huettmann, F. & Lindgren, M. A. A global model of avian influenza prediction in wild birds: The importance of northern regions. Veterinary Research 44, 1–9 (2013).
  13. Zhang, X., Zhang, T., Young, A. A. & Li, X. Applications and Comparisons of Four Time Series Models in Epidemiological Surveillance Data. PLOS ONE 9, e88075 (2014).
  14. da Silva, C. C. et al. Covid-19 Dynamic Monitoring and Real-Time Spatio-Temporal Forecasting. Frontiers in Public Health 9, 641253 (2021).
  15. Darwish, A., Rahhal, Y. & Jafar, A. A comparative study on predicting influenza outbreaks using different feature spaces: Application of influenza-like illness data from Early Warning Alert and Response System in Syria. BMC Research Notes 13, 1–8 (2020).
  16. Mollalo, A., Rivera, K. M. & Vahedi, B. Artificial neural network modeling of novel coronavirus (COVID-19) incidence rates across the continental United States. International Journal of Environmental Research and Public Health 17, 4204 (2020).
  17. Liu, K. et al. Enhancing fine-grained intra-urban dengue forecasting by integrating spatial interactions of human movements between urban regions. PLoS Neglected Tropical Diseases 14, 1–22 (2020).
  18. Bomfim, R. et al. Predicting dengue outbreaks at neighbourhood level using human mobility in urban areas. Journal of the Royal Society Interface 17, 20202691 (2020).
  19. Santosh, T., Ramesh, D. & Reddy, D. LSTM based prediction of malaria abundances using big data. Computers in Biology and Medicine 124, 103859 (2020).
  20. Bansal, S., Chowell, G., Simonsen, L., Vespignani, A. & Viboud, C. Big Data for Infectious Disease Surveillance and Modeling. The Journal of Infectious Diseases 214, S375–S379 (2016).
  21. Milinovich, G. J., Magalhães, R. J. S. & Hu, W. Role of big data in the early detection of Ebola and other emerging infectious diseases. The Lancet Global Health 3, e20–e21 (2015).
  22. Moher, D., Liberati, A., Tetzlaff, J. & Altman, D. G. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ 339, 332–336 (2009).
  23. Haddaway, N. R., Collins, A. M., Coughlin, D. & Kirk, S. The Role of Google Scholar in Evidence Reviews and Its Applicability to Grey Literature Searching. PLOS ONE 10, e0138237 (2015).
  24. Morse, S. S. Public health surveillance and infectious disease detection. Biosecurity and Bioterrorism 10, 6–16 (2012) doi:10.1089/bsp.2011.0088.
  25. Corley, C. D. et al. Disease prediction models and operational readiness. PLOS ONE 9, e91989 (2014).
  26. Luengo-Oroz, M. et al. Artificial intelligence cooperation to support the global response to COVID-19. Nature Machine Intelligence 2, 295–297 (2020).
  27. Allen, T. et al. Global hotspots and correlates of emerging zoonotic diseases. Nature Communications 8, 1–10 (2017).
  28. Schapire, R. The boosting approach to machine learning: an overview. 141–171 (2003).
  29. Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining doi:10.1145/2939672.
  30. James, G., Witten, D., Hastie, T. & Tibshirani, R. Tree-Based Methods. 327–365 (2021) doi:10.1007/978-1-0716-1418-1_8.
  31. Kingsford, C. & Salzberg, S. L. What are decision trees? Nature Biotechnology 26, 1011–1013 (2008).
  32. Gardner, M. W. & Dorling, S. R. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric Environment 32, 2627–2636 (1998).
  33. Eldan, R. & Shamir, O. The Power of Depth for Feedforward Neural Networks. 49, 1–34 (2016).
  34. Che, Z., Purushotham, S., Cho, K., Sontag, D. & Liu, Y. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Scientific Reports 8, 1–12 (2018).
  35. Ardabili, S., Mosavi, A. & Várkonyi-Kóczy, A. R. Advances in Machine Learning Modeling Reviewing Hybrid and Ensemble Methods. Lecture Notes in Networks and Systems 101, 215–227 (2020).
  36. Broadway, K. M. et al. Operational Considerations in Global Health Modeling. Pathogens 10, 1348 (2021).
Posted July 02, 2022.