Machine learning to identify socio-behavioural predictors of HIV positivity in East and Southern Africa

Erol Orel; Rachel Esra; Janne Estill; Stéphane Marchand-Maillet; Aziza Merzouki; Olivia Keiser

doi:10.1101/2020.01.27.20018242

Abstract

Background There is a need for high yield HIV testing strategies to reach epidemic control. We aimed to predict the HIV status of individuals based on socio-behavioural characteristics.

Methods We analysed over 3,200 variables from the most recent Demographic Health Survey from 10 countries in East and Southern Africa. We trained four machine-learning algorithms and selected the best based on the f1 score. Training and validation were done on 80% of the data. The model was tested on the remaining 20% and on a left-out country which was rotated around. The best algorithm was retrained on the variables which were most predictive. We studied two scenarios: one aiming to identify 95% of people living with HIV (PLHIV) and one aiming to identify individuals with 95% or higher probability of being HIV positive.

Findings Overall 55,151 males and 69,626 females were included. XGBoost performed best in predicting HIV with a mean f1 of 76·8% [95% confidence interval 76·0%-77·6%] for males and 78·8% [78·2%-79·4%] for females. Among the ten most predictive variables, nine were identical for both sexes: longitude, latitude and, altitude of place of residence, current age, age of most recent partner, total lifetime number of sexual partners, years lived in current place of residence, condom use during last intercourse and, wealth index. Model performance based on these variables decreased minimally. For the first scenario, 7 males and 5 females would need to be tested to identify one HIV positive person. For the second scenario, 4·2% of males and 6·2% of females would have been identified as high-risk population.

Interpretation We were able to identify PLHIV and those at high risk of infection who may be offered pre-exposure prophylaxis and/or voluntary medical male circumcision. These findings can inform the implementation of HIV prevention and testing strategies.

Funding Swiss National Science Foundation.

Introduction

In order to reach epidemic control by 2030, the Joint United Nations Programme (UNAIDS) have set fast track targets to rapidly scale up effective HIV services.¹ One of the aims is to ensure that 95% of the approximately 38 million people living with HIV (PLHIV) are aware of their HIV status and that 95% of those with HIV positive diagnoses are on treatment.²

People in East and Southern Africa are disproportionately burdened by HIV, constituting more than half of the global PLHIV with 20·6 million people currently estimated to be HIV positive.² As of 2018, 85% of PLHIV in this region were aware of their HIV status, of whom 79% were accessing treatment.³ In addition, 25% of new HIV infections in East and Southern Africa were concentrated among key populations such as female sex workers, men having sex with men, prisoners and, people who inject drugs.³

HIV is transmitted within a complex network that is influenced by biological, behavioural and, social factors. In East and Southern Africa, there is large geographical variation in the distribution of the HIV epidemic.⁴ In order to identify populations at a high risk of infection, global HIV prevention efforts have shifted toward optimizing resource allocation by considering geographical data as a way of increasing program impact and efficiency.⁵

Machine learning algorithms have the power to substantially enhance HIV prevention and detection, increasing the prediction capability by processing large amounts of data of a different nature. This methodology has been implemented to establish patterns of HIV risk behaviour, to optimise HIV treatment modalities and, to identify high-risk individuals for targeted interventions from a number of novel data sources.^6–15

As more PLHIV are diagnosed, finding persons with undiagnosed HIV becomes progressively more difficult and expensive. Hence, resource constraints and potential funding shortages have resulted in demands for differentiated high yield testing strategies in parallel to provider- initiated HIV testing and counselling (PITC).^14,16,17 We therefore aimed to compare different machine learning algorithms to identify new key populations based on a variety of socio- behavioural characteristics. These insights intend to both inform targeted case-finding strategies as well as identify high risk HIV negative individuals eligible for prevention services such as voluntary medical male circumcision (VMMC) and/or pre-exposure prophylaxis (PrEP).

Methods

Data

Since 1984, the Demographic and Health Surveys (DHS) program has provided technical assistance for over 400 surveys in more than 90 countries, advancing global understanding of health and population trends in developing countries.¹⁸ DHS are nationally-representative household surveys that provide data for a wide range of monitoring and impact evaluation indicators on health and nutrition. Standard DHS surveys have large sample sizes (usually between 5,000 and 30,000 households) and are typically conducted every five years.¹⁹ We used the most recent DHS surveys at or after 2013 of ten East and Southern African countries (Table A1) with a generalised HIV epidemic: Angola, Burundi, Ethiopia, Lesotho, Malawi, Mozambique, Namibia, Rwanda, Zambia and, Zimbabwe. Male and female’s datasets were combined separately with geographic position of groups of households where the individuals live and HIV test results. We then merged the ten countries and obtained two datasets containing 68,979 males and 83,910 females with 527 and 3,213 variables, respectively.

Data pre-processing, model validation and, algorithm selection

We compared four machine learning algorithms for the prediction of the HIV status of an individual; a penalized logistic regression (Elastic Net),²⁰ a generalized additive model (GAM), ²¹ a support vector machine (SVM) and,²² a gradient boosting tree (XGBoost).²³ The Elastic Net and the GAM are among the most widely used classification methods in biology and medicine, SVM is a very common machine learning algorithm and XGBoost is a decision-tree based ensemble which has gained a lot of attraction since its development few years ago.

The first part of the analyses was done in several steps for each of the four algorithms, and separately for males and females (Figure 1). In the data pre-processing step (Figure 1, step 1), we first cleaned and transformed data from the ten countries into numerical values (Table A2). Only persons for whom the HIV status was either positive or negative were included in the analysis. The cleaned datasets included 55,151 males and 69,626 females with 84 and 122 variables, respectively; 73 variables were common for both sexes (Table A3). Since we wanted to test the generalizability of our model, one country was left out for later testing, and the left- out country was rotated around. We then split each of the data from nine countries combined in an 80% training sample and a 20% test sample. Missing values were imputed using multiple imputation by chained equations (MICE) (as detailed in appendix) and data were further harmonized and scaled.²⁴

Figure 1: Diagram explaining the first part of the analyses

All steps are detailed in the method section.

Training and validation were done using five-fold cross-validation on 50 sets of hyperparameters randomly chosen from a grid (Figure 1, step 2). For each of these 50 sets, we calculated the mean f1 scores across the five validation sub-samples and selected the set of hyperparameters that gave the highest value. The f1 score combines the sensitivity and the PPV in a harmonic mean.²⁵ Our primary interest was to find a large number of HIV positive persons (sensitivity or recall) with a high yield (precision or positive predictive value (PPV)). The probability threshold to classify if someone is considered HIV positive was set at 50%.

Once the best models of each algorithm were obtained, we calculated the f1 scores on the ten 20% test and left-out country samples which were not used to train and validate the model (Figure 1, step 3). We averaged these f1 scores in order to compare the different algorithms and the maximum mean f1 on the 20% test samples allowed us to select the best one.

Variables selection, direction of association and, calibration of two scenarios

For the second part of the analysis, where no country was left out, only the selected algorithm was trained and validated again using a random search over 250 sets of parameters (instead of 50) with the same five-fold cross-validation scheme than previously. The first estimation was performed using all variables. We compared the f1 score, the sensitivity and, the PPV using MICE imputation (models M1 and F1 for males and females, respectively) with a different imputation method within the algorithm,²³ that considerably simplified the engineering process (models M2 and F2).

We then performed a sequential forward floating selection (SFFS) using the best imputation method on the 80% training samples and calculated the f1 scores for different number of variables. We selected the subset of variables for which the f1 scores plateaued and assessed the direction of the association between these variables and the probability of being HIV- positive using Shapley values.²⁶

We retrained the best algorithm with the above defined subsets of variables (models M3 and F3) and also on a minimal subset common for both sexes (models M4 and F4). The f1 scores, the sensitivity and, the PPV were compared to the ones obtained for M1, M2, F1 and, F2. We computed the Precision-Recall (PR) and the Threshold-Scores curves (TS) for our preferred model per sex. The PR curve displays the PPV for different sensitivities. This curve is not influenced by imbalanced datasets and is therefore preferred over ROC curve.²⁷ The TS curve, highlights the PPV, the sensitivity and, the f1 score for varying thresholds of classifying if someone is HIV positive.

We then tested two scenarios: for the first scenario, the sensitivity was set to 95%, equivalent to 95% of PLHIV knowing their status. We selected the threshold that corresponds to this sensitivity and reported the corresponding precision and number of individuals to be tested. For the second scenario, we identified a population for which the predicted probability of being HIV positive was 95% or higher. We considered that these individuals are either HIV positive targets for high yield testing strategies or HIV negative individuals who would be ideal candidates for prevention services.

All analyses were performed in Python version 3.7.4. The code is available on https://gitlab.com/Triphon/predicting_hiv_status.

Results

Overall, 55,151 males and 69,626 females were analysed with an HIV positivity ranging from 0·8% among males in Ethiopia to 33·3% among females in Lesotho. The overall HIV positivity was 8·0% (4,417 individuals) for males and 11·5% (8,011 individuals) for females. Persons aged 25 to 34 years represented the largest age group representing 35·9% of females and 31·9% of males. About two-thirds of people lived in rural areas (Table A4).

Algorithms’ performance on the test samples

Figure 2 shows the performance of the four algorithms. XGBoost had the highest f1 scores on all 20 test samples (ten per sex) with a mean f1 score of 76·8% [95% confidence interval (CI) 76·0%-77·6%] for males and 78.8% [78·2%-79·4%] for females. For SVM, the mean f1 score for males was 69·2% [68·2%-70·2%] and 74·6% [73·7%-75·5%] for females and for Elastic Net, the mean f1 score was 32·6% [31·8%-33·4%] for males and 41·5% [40·3%-42·7%] for females. GAM performed worst with a mean f1 score of 26·2% [25·0%-27·4%] for males and 39·8% [38·1%-41·5%] for females (see Table A5i to Table A5iv).

Figure 2: Boxplot of the f1 scores for the 4 algorithms on the test and left-out samples per sex

Generalized Additive Model (GAM), Support Vector Machine (SVM).

Algorithms’ performance on the left-out country samples

The performance of the algorithms on the ten left-out samples was substantially lower than on the test samples and the f1 scores varied more widely (Figure 2). The mean f1 score was best for Elastic Net with 21·4% [12·3%-30·5%] for males and 32·6% [21·2%-44·0%] for females followed by XGBoost with 20·9% [14·3%-27·5%] and 29·8% [19·0%-40·6%], respectively. For SVM, the mean f1 score was 15·4% [10·9%-19·9%] for males and 22·3% [14·1%-30·5%] for females. Again, GAM performed worst with a mean f1 scores of 6·6% [0·9%-12·1%] and 17·1% [4·4%-29·8%] (see Table A5i to Table A5iv). The algorithms performed better in countries with higher HIV positivity.

Best algorithm’s performance on the complete datasets

We selected the best performing algorithm, XGBoost, for the second part of the analysis, where no country was left out. The results on all variables using the two different imputation methods are shown in Table 1. For both sexes, the XGBoost imputation (M2 and F2) resulted in slightly higher f1 scores compared to the MICE imputation (M1 and F1). The f1 scores on the validation samples were 75·5% [73·7%-77·3%] vs 74·9% [73·3%-76·5%] for males and 76·1% [74·9%- 77·3%] vs 75·5% [74·6%-76·4%] for females. Given the above results and the simplicity of the XGBoost imputation, we used this imputation for further analyses (i.e. models M3, F3, M4 and, F4).

View this table:

Table 1: Results per sex of the XGBoost algorithm for different imputation methods and sets of variables

True positive (TP), False negative (FN), False positive (FP), True negative (TN), Positive Predictive Value (PPV).

Variables selection and direction of associations

Figure 3A and 3B show the result of the SFFS procedure which was used to select a subset of most relevant variables. The f1 scores plateaued above 99·6% with 15 variables for males and above 97·6% with 27 variables for females. Figure 3C and 3D show the 15 and 27 key variables of individual HIV status for males and females. Among these top ten most predictive variables, nine were identical: geographic position (longitude, latitude and, altitude), current age, age of most recent partner, total lifetime number of sexual partners, years lived in current place of residence, condom used during last sexual intercourse with most recent partner, and, a wealth index from the DHS which combines numerous wealth-related variables such as household assets and utility services.²⁸ The age at first sexual intercourse ranked tenth for males and twentieth for females. The Rohrer’s index (an estimate of obesity) ranked sixth for females, but was not available for males. Among the fifteen most predictive variables, four were specific for either males or females (‘number of women fathered children with’ and ‘respondent circumcised’ for males and ‘currently breastfeeding’ and ‘fertility preference’ for females). Other females-specific characteristics included ‘time to get to water source’ and ‘entries in birth history’.

Figure 3: Sequential floating forward selection (SFFS) and Shapley values

The (SFFS) procedure was implemented (A + B) to determine the saturation point for variable selection base on the f1 score. This resulted in the selection of the 15 and 27 most important variables for males and females, respectively. The variables are displayed below (C + D) sorted by importance from top to bottom (from the highest Shapley value to the lowest). The blue and red colours represent the value range of the variable (e.g. blue represents low value range of the variable). For example, the older the age the more likely the persons will be HIV positive.

Figure 3A and 3B highlight the direction of the association between each variable and the probability of HIV positivity. Among the nine common predictive variables for both sexes, older age, older age of most recent partner, a higher number of total lifetime number of sexual partners, condom used during last sexual intercourse with most recent partner and, longitude were positively associated with the probability of HIV positivity for most individuals. A higher wealth index, a larger latitude coordinate of the residence, altitude and, more years lived in place of residence were mainly negatively associated with HIV positivity.

Performance on subsets of variables

Table 1 shows the results of the XGBoost algorithm on the 15 most important variables for males (M3) and 27 most important variables for females (F3). As expected from the SFFS procedure, the f1 scores for M3 and F3 were similar to M2 and F2. The f1 scores decreased by 1·8 percentage points for males and by 0·5 percentage points for females. Finally, we checked the performance of the algorithm on the nine most predictive common variables for both sexes (M4 and F4). The f1 scores were 72·9% for males and 72·4% for females, decreasing respectively by 2·6 and 3·7 percentage points compared to M2 and F2, and by 0·8 and 3·2 percentage points compared to M3 and F3. M4 and F4 were the models used for the calibration of the two scenarios.

Scenarios

1) 95% PLHIV know their status

Figures 4A and 4B show the PR-curves calculated on the test samples. For males, a sensitivity of 95% would require that 5,450 individuals out of 11,031 (49·4%) would need to be tested to identify 840 HIV positives out of the 883 PLHIV. The corresponding PPV is 15·4%; 7 individuals would therefore need to be tested to find one HIV positive person (number needed to test NNT). For females, 6,696 individuals out of 13,926 (48·1%) would need to be tested to find 1,522 HIV positives out of the 1,602 PLHIV. The PPV is 22·7% and the NNT is 5.

Figure 4: Precision-Recall Curves (A + B) and Threshold-Scores Curves (C+ D) for models with 9 variables (models M4 and F4)

In addition, f1 iso-curves are shown for a typical range of f1 scores that we achieved with the models. Along these lines the f1 scores remain constant.

Positive Predictive Value (PPV)

Figure 4C and 4D: 95% Confidence Interval has been obtained using a bootstrap with n=50

2) 95% or more probability of being HIV positive

Figures 4C and 4D show the TS-curves calculated on the test samples. Out of the 11,031 males and 13,926 females, 461 males (4·2%) and 862 females (6·2%) were identified as high-risk population. For males, 447 would have been correctly identified HIV positive out of the 883 PLHIV. For females, 833 would have been correctly identified HIV positive out of the 1,602 PLHIV.

Discussion

Using large representative datasets with over 120,000 persons from ten East and Southern African countries, we were able to accurately predict the HIV status of individuals using demographic and socio-behavioural characteristics. Using all variables, XGBoost performed better than the three other algorithms on the test samples with a mean f1 score of 76·8% [95% CI 76·0%-77·6%] for males and 78.8% [78·2%-79·4%] for females. Our approach allowed us to select the nine most important predictor variables common for both sexes: geographic position (longitude, latitude and, altitude), current age, age of most recent partner, total lifetime number of sexual partners, years lived in current place of residence, condom used during last sexual intercourse with most recent partner and, wealth index. The performance of the algorithm using only these nine variables to predict HIV positivity was similar to that of the total dataset.

We also determined the direction of the association between predictor variables and HIV status. We confirmed a number of established HIV risk factors such as older age or older age of the most recent partner,²⁹ a large number of sexual partners and, living in an urban area.³⁰ Additionally, circumcision and breastfeeding were associated with a lower risk of HIV positivity. Unlike previous findings,³¹ condom use during the last sexual intercourse increased the probability of HIV positivity in our study. This seemingly counterintuitive finding may be the result of increased condom use in individuals who are already aware of their positive HIV status. The cross-sectional nature of our study limits our ability to investigate this further. We also identified risk factors for HIV infection which have rarely been investigated before. For example, an increased distance to water was positively associated with HIV infection in some persons, and negatively associated in others. This is in line with a previous study which showed that the risk of sexual assault of women, and hence the risk of HIV infection, increased when the time to reach a water source increased.³² However, longer time to get to water sources are more common in rural areas where HIV prevalence is generally lower, hence a decrease in risk of HIV positivity.

When applying these machine learning algorithms in real world settings, the trade-off between sensitivity (% of HIV positives identified) and PPV (yield) needs to be considered. A model with a sensitivity of 95% would be required to ensure that 95% of PLHIV know their status. In this first scenario, using a model with only nine predictors, we showed that the NNT was 5 (PPV of 15·4%) for males and 7 (PPV of 22·7%) for females. This represents approximately twice the PPV that would be achieved by general population testing. A previous systematic review of different testing strategies showed that NNTs ranged between 3 and 86 for community-based testing strategies and between 4 and 154 for facility-based testing strategies.³³

In contrast, if targeted HIV case-finding strategies are implemented to increase the cost- effectiveness of testing strategies, a high PPV is important to ensure that the yield is high, and many of those tested are HIV positive. It is currently unknown if additional behavioural-based testing strategies can enhance or complement current targeted case-finding strategies such as index testing. Acceptable cut-offs for both sensitivity and PPV would need to be adapted for specific settings and for the desired testing coverage. For example, we defined a second scenario to simultaneously identify both high-risk HIV positive individuals for testing and high-risk HIV negative individuals for preventative services such as pre-exposure prophylaxis (PrEP).

To our knowledge, this study is the first to use machine learning methods to predict HIV in generalised HIV epidemic East and Southern African countries using routinely collected survey data. One of the limitations of this study is the generalizability of our findings. The distribution of risk factors varies between countries, and the accuracy of the prediction decreased for countries not used to train the algorithm. It is therefore not surprising, that geographic location of the residence (longitude, latitude and, altitude) were among the strongest predictors, since they were proxies for country-level differences. We were also limited by the available variables in our dataset, and as a result we were unable to consider differences in viral load suppression, health-care expenditure, specific HIV-related interventions and conflicts and wars. Additionally, although HIV testing was laboratory-based and not self-reported, some results were inconclusive and were discarded. A number of variables were self-reported and therefore subject to social desirability and recall bias. Missing values were imputed using multiple imputation, or directly within the extreme gradient boosting algorithm. However, the proportion of missing values was relatively small for most variables and both imputation methods gave similar results.

In conclusion, we were able to identify strong predictors of HIV positivity. Our findings may explain the spatial variability of HIV prevalence and can inform HIV testing strategies in resource-limited settings. While the implementation of a machine learning based risk score for targeted interventions was feasible in rural East Africa,³⁴ the acceptability and use of potentially sensitive behavioural risk factors to directly identify individuals for HIV testing needs to be evaluated. Our algorithm performed well with only a limited number of variables, which do not require extensive interviews or questionnaires. This approach may be implemented by clinicians and community health care workers or utilised through additional HIV case-finding modalities such as call centres, social media and, self-testing initiatives. The availability of individual-level data on the association of various diseases with socio- behavioural characteristics is rapidly increasing. Advanced methods to analyse these large sources of data can help to prevent, diagnose and treat HIV and other diseases more efficiently.

Data Availability

This is a study based on publicly available data from the Demographic and Health Surveys (DHS). The DHS Program is authorized to distribute, at no cost, unrestricted survey data files for legitimate academic research. Registration is required for access to data. https://dhsprogram.com/Data/

https://gitlab.com/Triphon/predicting_hiv_status

Author’s contribution

EO, AM and, OK designed the study with support from SMM. EO wrote the code and performed the analysis with support from AM. EO, AM and, OK interpreted the results with support from JE and SMM. EO and RE reviewed the literature. EO, RE and, OK wrote the manuscript, which was reviewed by JE, SMM, and, AM.

Conflict of interest

We declare no competing interests.

Supplementary material

Selection of variables

Datasets were resampled per country using sample weights from the HIV test results. We excluded individuals whose HIV status was “indeterminate” or “inconclusive” and individuals who reported that they never had sexual intercourse. We then removed variables with no variance and the ones containing more than 30% missing values. Finally, after additional encoding steps (e.g. creation of new aggregated variables or dummy coding of nominal variables), we manually removed an additional 77 non-informative variables for males and 122 for females (e.g. relating to metadata or information on how the survey was conducted), resulting in a final dataset of 55,151 males and 69,626 females with 84 and 122 variables, respectively. Overall 73 variables were common for both sexes (Table A2 and Table A3).

Stratification, MICE imputation and Standardization (Figure 1 - step 1)

The stratification was done based on each sample HIV prevalence to ensure that the percentage of HIV positive individuals in the training and validation samples remained similar to the originals. We imputed missing values of each 80% training sample by multiple imputations using chained equations (MICE), and then applied the same imputation model to the corresponding test and left-out country samples. The regressions on the chained equations have been iterated ten times using the entire set of variables. The imputations have been performed five times and the results were averaged. Finally, the variables were standardized to a variance of one, ensuring that the penalization scheme is fair to all regressors.

View this table:

Table A1: List of the Demographic and Health Surveys (DHS)

View this table:

Table A2: Pre-processing of variables

View this table:

Table A3: List of variables

Variable names correspond to the name in the Demographic and Health Survey (DHS).

View this table:

Table A4: Characteristics of Demographic and Health Survey (DHS) individuals

View this table:

Table A5i: Results of the XGBoost algorithm per sex for the validation, test and, left-out samples

View this table:

Table A5ii: Results of the Support Vector Machine (SVM) algorithm per sex for the validation, test and, left-out samples

View this table:

Table A5iii: Results of the Elastic Net algorithm per sex for the validation, test and, left-out samples

View this table:

Table A5iv: Results of the Generalized Additive Model (GAM) algorithm per sex for the validation, test and, left-out samples

Python libraries

- Matplotlib 3.1.1
- Mlxtend 0.17.0
- Numpy 1.16.5
- Pandas 0.25.1
- Pathlib 1.0.1
- Pyshp 2.1.0
- Pygam 0.8.0
- Scikit-learn 0.21.3
- Scipy 1.3.1
- Seaborn 0.9.0
- Shap 0.30.1
- Xgboost 0.90
- Yellowbrick 1.0.1

Some of the computations were done on the Baobab cluster of the University of Geneva.

Acknowledgements

We acknowledge the support of the Swiss National Science Foundation (SNF professorship grant n° 163878 to O Keiser) which funded this study. We thank Antoine Flahault, Amaury Thiabaud, Danny Sheath and, Isotta Triulzi for helpful discussions and comments.

References

↵
UNAIDS. Understanding Fast-Track - Accelerating action to end AIDS epidemic by 2030. 2015 https://www.unaids.org/sites/default/files/media_asset/201506_JC2743_Understanding_FastTrack_en.pdf.
↵
UNAIDS. Global Aids Update 2019. UNAIDS, 2019.
↵
HIV and AIDS in East and Southern Africa regional overview.;: 20.
↵
Zulu LC, Kalipeni E, Johannes E. Analyzing spatial clustering and the spatiotemporal nature and trends of HIV/AIDS prevalence using GIS: the case of Malawi, 1994-2010. BMC Infect Dis 2014; 14: 285.
OpenUrl
↵
Cuadros DF, Li J, Branscum AJ, et al. Mapping the spatial variability of HIV infection in Sub-Saharan Africa: Effective information for localized HIV prevention and control. Sci Rep 2017; 7: 1–11.
OpenUrl CrossRef PubMed
↵
Ahlström MG, Ronit A, Omland LH, Vedel S, Obel N. Algorithmic prediction of HIV status using nation- wide electronic registry data. EClinicalMedicine 2019; 0. DOI:10.1016/j.eclinm.2019.10.016.
OpenUrl CrossRef
Marcus JL, Hurley LB, Krakower DS, Alexeeff S, Silverberg MJ, Volk JE. Use of electronic health record data and machine learning to identify candidates for HIV pre-exposure prophylaxis: a modelling study. Lancet HIV 2019; 6: e688–95.
OpenUrl
Balzer LB, Havlir DV, Kamya MR, et al. Machine learning to identify persons at high-risk of HIV acquisition in rural Kenya and Uganda. Clin Infect Dis 2019;: ciz1096.
Krakower DS, Gruber S, Hsu K, et al. Development and validation of an automated HIV prediction algorithm to identify candidates for pre-exposure prophylaxis: a modelling study. Lancet HIV 2019; 6: e696–704.
OpenUrl
Huang G, Cai M, Lu X. Inferring Opinions and Behavioral Characteristics of Gay Men with Large Scale Multilingual Text from Blued. Int J Environ Res Public Health 2019; 16. DOI:10.3390/ijerph16193597.
OpenUrl CrossRef
Wray TB, Luo X, Ke J, Pérez AE, Carr DJ, Monti PM. Using Smartphone Survey Data and Machine Learning to Identify Situational and Contextual Risk Factors for HIV Risk Behavior Among Men Who Have Sex with Men Who Are Not on PrEP. Prev Sci 2019; 20: 904–13.
OpenUrl
Bisaso KR, Karungi SA, Kiragga A, Mukonzo JK, Castelnuovo B. A comparative study of logistic regression based machine learning techniques for prediction of early virological suppression in antiretroviral initiating HIV patients. BMC Med Inform Decis Mak 2018; 18: 77.
OpenUrl
Feller DJ, Zucker J, Yin MT, Gordon P, Elhadad N. Using Clinical Notes and Natural Language Processing for Automated HIV Risk Assessment. J Acquir Immune Defic Syndr 1999 2018; 77: 160–6.
OpenUrl
↵
Zheng W, Balzer L, van der Laan M, Petersen M, SEARCH Collaboration. Constrained binary classification using ensemble learning: an application to cost-efficient targeted PrEP strategies. Stat Med 2018; 37: 261–79.
OpenUrl
↵
Young SD, Yu W, Wang W. Toward Automating HIV Identification: Machine Learning for Rapid Identification of HIV-related Social Media Data. J Acquir Immune Defic Syndr 1999 2017; 74: S128–31.
OpenUrl
↵
De Cock KM, Barker JL, Baggaley R, El Sadr WM. Where are the positives? HIV testing in sub- Saharan Africa in the era of test and treat. AIDS Lond Engl 2019; 33: 349–52.
OpenUrl
↵
Ahmed S, Schwarz M, Flick RJ, et al. Lost opportunities to identify and treat HIV-positive patients: results from a baseline assessment of provider-initiated HIV testing and counselling (PITC) in Malawi. Trop Med Int Health 2016; 21: 479–85.
OpenUrl
↵
The DHS Program - Team and Partners. https://dhsprogram.com/Who-We-Are/About-Us.cfm (accessed Dec 9, 2019).
↵
The DHS Program - Demographic and Health Survey (DHS). https://dhsprogram.com/what-we-do/survey-Types/dHs.cfm (accessed Dec 9, 2019).
↵
Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. J R Stat Soc Ser B Stat Methodol 2005; 67: 301–20.
OpenUrl CrossRef
↵
Hastie T, Tibshirani R. Generalized Additive Models. Stat Sci 1986; 1: 297–310.
OpenUrl CrossRef PubMed
↵
Vapnik VN. The Nature of Statistical Learning Theory. Berlin, Heidelberg: Springer-Verlag, 1995.
↵
Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proc 22nd ACM SIGKDD Int Conf Knowl Discov Data Min - KDD 16 2016;: 785–94.
OpenUrl
↵
Buuren S van, Groothuis-Oudshoorn K. mice?: Multivariate Imputation by Chained Equations in R. J Stat Softw 2011; 45. DOI:10.18637/jss.v045.i03.
OpenUrl CrossRef
↵
1. C.J. Van Rijsbergen
Blair DC. Information Retrieval, 2^nd ed. C.J. Van Rijsbergen. London: Butterworths; 1979: 208 pp. Price: $32.50. J Am Soc Inf Sci 1979; 30: 374–5.
OpenUrl CrossRef
↵
1. Guyon I,
2. Luxburg UV,
3. Bengio S
Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. In: Guyon I, Luxburg UV, Bengio S, et al., eds. Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017: 4765–4774.
↵
Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE 2015; 10: e0118432.
OpenUrl CrossRef PubMed
↵
DHS Wealth Index.pdf..
↵
Akullian A, Bershteyn A, Klein D, Vandormael A, Bärnighausen T, Tanser F. Sexual partnership age pairings and risk of HIV acquisition in rural South Africa. AIDS 2017; 31: 1755.
OpenUrl
↵
Temah CT. What Drives HIV/AIDS Epidemic in Sub-Saharan Africa?Rev Econ Dev 2009; Vol. 17: 41–70.
OpenUrl
↵
Pinkerton SD, Abramson PR. Effectiveness of condoms in preventing HIV transmission. Soc Sci Med 1982 1997; 44: 1303–12.
OpenUrl CrossRef PubMed Web of Science
↵
Sommer M, Ferron S, Cavill S, House S. Violence, gender and WASH: spurring action on a complex, under-documented and sensitive topic. Environ Urban 2015; 27: 105–16.
OpenUrl CrossRef
↵
Suthar AB, Ford N, Bachanas PJ, et al. Towards universal voluntary HIV testing and counselling: a systematic review and meta-analysis of community-based approaches. PLoS Med 2013; 10: e1001496.
OpenUrl CrossRef PubMed
↵
Koss CA, Ayieko J, Mwangwa F, et al. Early Adopters of Human Immunodeficiency Virus Preexposure Prophylaxis in a Population-based Combination Prevention Study in Rural Kenya and Uganda. Clin Infect Dis Off Publ Infect Dis Soc Am 2018; 67: 1853–60.
OpenUrl

View the discussion thread.

Posted January 27, 2020.

Download PDF

Data/Code

Citation Tools

Subject Area

HIV/AIDS

Subject Areas

All Articles

Addiction Medicine (349)
Allergy and Immunology (668)
Allergy and Immunology (668)
Anesthesia (181)
Cardiovascular Medicine (2648)
Dentistry and Oral Medicine (316)
Dermatology (223)
Emergency Medicine (399)
Endocrinology (including Diabetes Mellitus and Metabolic Disease) (942)
Epidemiology (12228)
Forensic Medicine (10)
Gastroenterology (759)
Genetic and Genomic Medicine (4103)
Geriatric Medicine (387)
Health Economics (680)
Health Informatics (2657)
Health Policy (1005)
Health Systems and Quality Improvement (985)
Hematology (363)
HIV/AIDS (851)
Infectious Diseases (except HIV/AIDS) (13695)
Intensive Care and Critical Care Medicine (797)
Medical Education (399)
Medical Ethics (109)
Nephrology (436)
Neurology (3882)
Nursing (209)
Nutrition (577)
Obstetrics and Gynecology (739)
Occupational and Environmental Health (695)
Oncology (2030)
Ophthalmology (585)
Orthopedics (240)
Otolaryngology (306)
Pain Medicine (250)
Palliative Medicine (75)
Pathology (473)
Pediatrics (1115)
Pharmacology and Therapeutics (466)
Primary Care Research (452)
Psychiatry and Clinical Psychology (3432)
Public and Global Health (6527)
Radiology and Imaging (1403)
Rehabilitation Medicine and Physical Therapy (814)
Respiratory Medicine (871)
Rheumatology (409)
Sexual and Reproductive Health (410)
Sports Medicine (342)
Surgery (448)
Toxicology (53)
Transplantation (185)
Urology (165)

[1] ↵
UNAIDS. Understanding Fast-Track - Accelerating action to end AIDS epidemic by 2030. 2015 https://www.unaids.org/sites/default/files/media_asset/201506_JC2743_Understanding_FastTrack_en.pdf.

[2] ↵
UNAIDS. Global Aids Update 2019. UNAIDS, 2019.

[3] ↵
HIV and AIDS in East and Southern Africa regional overview.;: 20.

[4] ↵
Zulu LC, Kalipeni E, Johannes E. Analyzing spatial clustering and the spatiotemporal nature and trends of HIV/AIDS prevalence using GIS: the case of Malawi, 1994-2010. BMC Infect Dis 2014; 14: 285.
OpenUrl

[5] ↵
Cuadros DF, Li J, Branscum AJ, et al. Mapping the spatial variability of HIV infection in Sub-Saharan Africa: Effective information for localized HIV prevention and control. Sci Rep 2017; 7: 1–11.
OpenUrl CrossRef PubMed

[6] ↵
Ahlström MG, Ronit A, Omland LH, Vedel S, Obel N. Algorithmic prediction of HIV status using nation- wide electronic registry data. EClinicalMedicine 2019; 0. DOI:10.1016/j.eclinm.2019.10.016.
OpenUrl CrossRef

[7] Marcus JL, Hurley LB, Krakower DS, Alexeeff S, Silverberg MJ, Volk JE. Use of electronic health record data and machine learning to identify candidates for HIV pre-exposure prophylaxis: a modelling study. Lancet HIV 2019; 6: e688–95.
OpenUrl

[8] Balzer LB, Havlir DV, Kamya MR, et al. Machine learning to identify persons at high-risk of HIV acquisition in rural Kenya and Uganda. Clin Infect Dis 2019;: ciz1096.

[9] Krakower DS, Gruber S, Hsu K, et al. Development and validation of an automated HIV prediction algorithm to identify candidates for pre-exposure prophylaxis: a modelling study. Lancet HIV 2019; 6: e696–704.
OpenUrl

[10] Huang G, Cai M, Lu X. Inferring Opinions and Behavioral Characteristics of Gay Men with Large Scale Multilingual Text from Blued. Int J Environ Res Public Health 2019; 16. DOI:10.3390/ijerph16193597.
OpenUrl CrossRef

[11] Wray TB, Luo X, Ke J, Pérez AE, Carr DJ, Monti PM. Using Smartphone Survey Data and Machine Learning to Identify Situational and Contextual Risk Factors for HIV Risk Behavior Among Men Who Have Sex with Men Who Are Not on PrEP. Prev Sci 2019; 20: 904–13.
OpenUrl

[12] Bisaso KR, Karungi SA, Kiragga A, Mukonzo JK, Castelnuovo B. A comparative study of logistic regression based machine learning techniques for prediction of early virological suppression in antiretroviral initiating HIV patients. BMC Med Inform Decis Mak 2018; 18: 77.
OpenUrl

[13] Feller DJ, Zucker J, Yin MT, Gordon P, Elhadad N. Using Clinical Notes and Natural Language Processing for Automated HIV Risk Assessment. J Acquir Immune Defic Syndr 1999 2018; 77: 160–6.
OpenUrl

[14] ↵
Zheng W, Balzer L, van der Laan M, Petersen M, SEARCH Collaboration. Constrained binary classification using ensemble learning: an application to cost-efficient targeted PrEP strategies. Stat Med 2018; 37: 261–79.
OpenUrl

[15] ↵
Young SD, Yu W, Wang W. Toward Automating HIV Identification: Machine Learning for Rapid Identification of HIV-related Social Media Data. J Acquir Immune Defic Syndr 1999 2017; 74: S128–31.
OpenUrl

[16] ↵
De Cock KM, Barker JL, Baggaley R, El Sadr WM. Where are the positives? HIV testing in sub- Saharan Africa in the era of test and treat. AIDS Lond Engl 2019; 33: 349–52.
OpenUrl

[17] ↵
Ahmed S, Schwarz M, Flick RJ, et al. Lost opportunities to identify and treat HIV-positive patients: results from a baseline assessment of provider-initiated HIV testing and counselling (PITC) in Malawi. Trop Med Int Health 2016; 21: 479–85.
OpenUrl

[18] ↵
The DHS Program - Team and Partners. https://dhsprogram.com/Who-We-Are/About-Us.cfm (accessed Dec 9, 2019).

[19] ↵
The DHS Program - Demographic and Health Survey (DHS). https://dhsprogram.com/what-we-do/survey-Types/dHs.cfm (accessed Dec 9, 2019).

[20] ↵
Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. J R Stat Soc Ser B Stat Methodol 2005; 67: 301–20.
OpenUrl CrossRef

[21] ↵
Hastie T, Tibshirani R. Generalized Additive Models. Stat Sci 1986; 1: 297–310.
OpenUrl CrossRef PubMed

[22] ↵
Vapnik VN. The Nature of Statistical Learning Theory. Berlin, Heidelberg: Springer-Verlag, 1995.

[23] ↵
Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proc 22nd ACM SIGKDD Int Conf Knowl Discov Data Min - KDD 16 2016;: 785–94.
OpenUrl

[24] ↵
Buuren S van, Groothuis-Oudshoorn K. mice?: Multivariate Imputation by Chained Equations in R. J Stat Softw 2011; 45. DOI:10.18637/jss.v045.i03.
OpenUrl CrossRef

[25] ↵
C.J. Van Rijsbergen
Blair DC. Information Retrieval, 2^nd ed. C.J. Van Rijsbergen. London: Butterworths; 1979: 208 pp. Price: $32.50. J Am Soc Inf Sci 1979; 30: 374–5.
OpenUrl CrossRef

[26] C.J. Van Rijsbergen

[27] ↵
Guyon I,
Luxburg UV,
Bengio S
Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. In: Guyon I, Luxburg UV, Bengio S, et al., eds. Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017: 4765–4774.

[28] Guyon I,

[29] Luxburg UV,

[30] Bengio S

[31] ↵
Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE 2015; 10: e0118432.
OpenUrl CrossRef PubMed

[32] ↵
DHS Wealth Index.pdf..

[33] ↵
Akullian A, Bershteyn A, Klein D, Vandormael A, Bärnighausen T, Tanser F. Sexual partnership age pairings and risk of HIV acquisition in rural South Africa. AIDS 2017; 31: 1755.
OpenUrl

[34] ↵
Temah CT. What Drives HIV/AIDS Epidemic in Sub-Saharan Africa?Rev Econ Dev 2009; Vol. 17: 41–70.
OpenUrl

[35] ↵
Pinkerton SD, Abramson PR. Effectiveness of condoms in preventing HIV transmission. Soc Sci Med 1982 1997; 44: 1303–12.
OpenUrl CrossRef PubMed Web of Science

[36] ↵
Sommer M, Ferron S, Cavill S, House S. Violence, gender and WASH: spurring action on a complex, under-documented and sensitive topic. Environ Urban 2015; 27: 105–16.
OpenUrl CrossRef

[37] ↵
Suthar AB, Ford N, Bachanas PJ, et al. Towards universal voluntary HIV testing and counselling: a systematic review and meta-analysis of community-based approaches. PLoS Med 2013; 10: e1001496.
OpenUrl CrossRef PubMed

[38] ↵
Koss CA, Ayieko J, Mwangwa F, et al. Early Adopters of Human Immunodeficiency Virus Preexposure Prophylaxis in a Population-based Combination Prevention Study in Rural Kenya and Uganda. Clin Infect Dis Off Publ Infect Dis Soc Am 2018; 67: 1853–60.
OpenUrl

Machine learning to identify socio-behavioural predictors of HIV positivity in East and Southern Africa

Abstract

Introduction

Methods

Data

Data pre-processing, model validation and, algorithm selection

Variables selection, direction of association and, calibration of two scenarios

Results

Algorithms’ performance on the test samples

Algorithms’ performance on the left-out country samples

Best algorithm’s performance on the complete datasets

Variables selection and direction of associations

Performance on subsets of variables

Scenarios

1) 95% PLHIV know their status

2) 95% or more probability of being HIV positive

Discussion

Data Availability

Author’s contribution

Conflict of interest

Supplementary material

Selection of variables

Stratification, MICE imputation and Standardization (Figure 1 - step 1)

Python libraries

Acknowledgements

References

Citation Manager Formats

Subject Area