Seasonal variations in social contact patterns in a rural population in north India: Implications for pandemic control

Sargun Nagpal; Rakesh Kumar; Riz Fernando Noronha; Supriya Kumar; Debayan Gupta; Ritvik Amarchand; Mudita Gosain; Hanspria Sharma; Gautam I. Menon; Anand Krishnan

doi:10.1101/2022.08.19.22278966

Abstract

Social contact mixing patterns are critical to the transmission of communicable diseases and have been employed to model disease outbreaks including COVID-19. Nonetheless, there is a paucity of studies on contact mixing in low and middle-income countries such as India. Furthermore, mathematical models of disease outbreaks do not account for the temporal nature of social contacts. We conducted a longitudinal study of social contacts in rural north India across three seasons and analysed the temporal differences in contact patterns.

A contact diary survey was performed across three seasons between October 2015 and October 2016, in which participants were queried on the number, duration, and characteristics of contacts that occurred on the previous day. A total of 8,421 responses from 3,052 respondents (49% females) recorded characteristics of 180,073 contacts. Respondents reported a significantly higher number and duration of contacts in the winter, followed by the summer and the monsoon season (Nemenyi post-hoc, p<0.001). Participants aged 0-9 and 10-19 years reported the highest median number of contacts (16 [IQR 12-21] and 17 [IQR 13-24], respectively) and were found to have the highest node centrality in the social network of the region (pageranks = 0.20, 0.17). Employed males across all age groups were found to have a higher number of contacts than unemployed males (Negative Binomial Regression: rate ratio 1.18, 95% CI: 1.05-1.31). A large proportion (>80%) of contacts that were reported in schools or on public transport involved physical contact.

To the best of our knowledge, our study is the first from India to show that contact mixing patterns vary by the time of the year and provides useful implications for pandemic control. Our results can be used to parameterize more accurate mathematical models for prediction of epidemiological trends of infections in rural India.

Introduction

The study of contact mixing patterns has received considerable attention for its usefulness in parameterizing mathematical models of disease outbreaks and assessing the impact of intervention strategies. Prior to the COVID-19 pandemic, such studies mostly focused on regions such as Europe [1,2], the USA [3], China [4] and Thailand [5]. These studies analysed the number and duration of contacts in various social settings (home, school, work, other), and whether the contacts were conversational or physical. Further studies [1,6–8] have made use of age and gender stratified contact mixing patterns to simulate infection epidemics and study epidemiological parameters such as disease prevalence, epidemic size, and cumulative infections. More recently, contact tracing and social mixing data from China and Europe have been used to model the transmission of SARS-CoV-2 and understand the effectiveness of measures such as social distancing and school closure [9–12].

While these studies are valuable for understanding the overall contact mixing patterns at a geographical location, they do not account for the temporal aspect of social contacts. Mathematical models of disease transmission have been developed with the assumption that social contacts do not change with time. Information on the timing of contacts can aid our understanding of how an infection might progress in different seasons and whether or not the same control strategy might be appropriate at different times of the year. Eames et al. [13] analysed the dynamic contact patterns in the UK, but only compared contacts during school holidays with those during term-time. Additionally, different disconnected periods were merged to form a single holiday period. Fournet et al. [14] studied the longitudinal changes in social contacts over two years, but their study was limited to a sample of high-school students in France. Another study by Béraud et al. [15] looked at the temporal variations in contact patterns in France, but their study was limited to a period of four months. Jiang et al. [16] collected contact data from a population in China over three visits across several years but did not analyse temporal differences in the contact patterns.

Although some studies have attempted to generate synthetic contact matrices using surveys and demographic data [17,18], there is a paucity of studies collecting and analyzing empirical data from rural India. India is the second most populous country in the world, with a population of over 1.3 billion, of which 65% is rural [19]. During 2015-2016, we collected contact mixing data from Ballabgarh, a rural town in Haryana, India, and published the results of contact mixing for a single season [20]. In this study, we report the findings of a contact diary survey conducted over three seasons across the year in Ballabgarh. We compared the differences in the number, duration and location of contacts by age-group and gender, and studied the impact of the season, age-group, employment and day of the week on the number and duration of contacts using multivariate negative binomial regression. We created a social network to further understand the age and gender specific contact patterns, and used the contact matrices in each season to parameterise a nine-compartment agent-based model for simulating a COVID-19 epidemic in each season.

Materials & methods

Field site

Our data were collected from Ballabgarh, located in the state of Haryana, India - an area characterised by rural agrarian communities and multi-generational households. A convenience sample of households consisting of 3,052 respondents was taken from five villages, which were under an Acute Respiratory Infection (ARI) surveillance program that included weekly household visits by trained healthcare professionals to document ARI and influenza episodes among children aged less than 10 years and adults over 60 years of age.

Ethical Review

This study was reviewed and approved by the ethics boards at the All India Institute of Medical Sciences, New Delhi (IEC/NP-121/10-4-2015), as well as at the University of Pittsburgh (PRO15100147) and the Centers for Disease Control and Prevention, Atlanta (FWA 00014191). We received written consent from all participants over 7 years of age. Caregivers provided written consent for participants below 7 years of age.

Contact diary survey

Respondents completed a structured questionnaire regarding their contacts in the past 24 hours during a face-to-face interview. A caregiver responded on behalf of children under 6 years of age, while children aged 6-10 responded in the presence of a caregiver. An interviewer of the same gender interviewed all respondents aged 11-18 years. A contact was defined as a face-to-face conversation within a distance of three feet, which may or may not have involved a physical touch. Respondents were asked about the age and sex of their contacts, the place of contact (home, school, work, transport, or other), and whether the contacts were conversational or physical. In addition, the respondents were provided the option for reporting encounters with multiple individuals as “group” contacts, including the group size, duration of encounter, and the age range of individuals in the group. Respondents were interviewed three times in a period of 13 months. The data were gathered over three phases: October 2015 - February 2016 (winter); March - June 2016 (summer); and July - October 2016 (monsoon). A small fraction (6.8%) of Phase 1 records were from October 2015, and henceforth we refer to the three phases as seasons.

Data cleaning

The data contained missing values and typographical errors which had to be accounted for by cleaning the dataset. We first sanitised the dates and calculated respondent ages from their date of birth and the date of interview to create a minimally cleaned dataset. We further imputed contact genders by checking instances of the same name; the number, duration, and age range of group contacts by using the median of group contacts at the same location; and the contact ages by sampling from age distributions of similar respondents. This comprised a fully cleaned dataset. Table S1 contains a detailed description of the attributes cleaned and the methodology followed for imputation or correction. Our results were consistent with both datasets, and all results displayed in the manuscript are based on the fully cleaned dataset.

Comparison of contacts within and across the seasons

We performed two types of analyses of the data. The first was an analysis within each season, similar to the work of Kumar et al. [20], where we identified the differences in contact patterns of age groups for each season individually. For this analysis, statistical methods for independent samples were used for comparison. The second was an analysis of the differences in contact patterns across the three seasons. For this analysis, participants who responded in two or more seasons were considered and statistical methods for dependent samples were used.

Number of contacts

For each respondent, we report the number of individual contacts over the previous day, the number of people contacted in a group setting and the total number of contacts by adding the individual and group contacts.

We defined “super-spreaders” as respondents having contacts above the 95th percentile in each season. Since these respondents met a large number of people, they could be potential super-spreaders of a communicable disease.
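The thresholding rule above can be sketched in a few lines. This is an illustrative sketch only: the nearest-rank percentile definition, the function names, and the toy data are assumptions, since the text does not state which percentile convention was used.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumed convention, not stated in the paper)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def flag_high_contact(contacts_by_respondent, pct=95):
    """Return the threshold and the ids of respondents whose contacts exceed it."""
    threshold = percentile_nearest_rank(contacts_by_respondent.values(), pct)
    flagged = {rid for rid, n in contacts_by_respondent.items() if n > threshold}
    return threshold, flagged

# Hypothetical season: 19 respondents with 1..19 contacts and one outlier.
season = {f"r{i}": i for i in range(1, 20)}
season["r20"] = 120
threshold, flagged = flag_high_contact(season)
```

The rule would be applied separately to each season, so a respondent may be flagged in one season and not in another.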

Duration of contacts

The reported durations of contacts were recorded on a categorical scale: “<5 minutes”, “5-14 minutes”, “15-59 minutes”, “1-4 hours” and “>4 hours”. These were converted to numerical values by computing the mean duration for each interval. For instance, for the 15-59 minutes interval, a contact duration of 37.5 minutes was used. For the >4h category, we set the upper limit as 8 hours, the usual maximum for a working day. The total duration of group contact for each respondent was calculated by multiplying the group size with the duration of contact with the group. The total time spent in contact for each respondent was calculated as the sum of individual and group contact durations and reported as person-hours.
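As a rough sketch of this conversion, the snippet below maps each category to the mean of its interval and sums individual and group durations into person-hours. The interval bounds are assumptions chosen to reproduce the 37.5-minute example and the 8-hour cap given in the text; the function names are hypothetical.

```python
# Interval bounds in minutes. The 15-60 reading reproduces the 37.5-minute
# example in the text; the 0 lower bound for "<5 minutes" is an assumption.
DURATION_BOUNDS_MIN = {
    "<5 minutes": (0, 5),
    "5-14 minutes": (5, 15),
    "15-59 minutes": (15, 60),
    "1-4 hours": (60, 240),
    ">4 hours": (240, 480),  # upper limit of 8 hours, as stated in the text
}

def category_minutes(category):
    """Mean duration (minutes) assigned to a categorical response."""
    low, high = DURATION_BOUNDS_MIN[category]
    return (low + high) / 2

def total_person_hours(individual, groups):
    """individual: list of category labels, one per individual contact.
    groups: list of (group_size, category) tuples, one per group contact."""
    minutes = sum(category_minutes(c) for c in individual)
    # Group duration = group size x duration of the encounter.
    minutes += sum(size * category_minutes(c) for size, c in groups)
    return minutes / 60

# Hypothetical respondent: two individual contacts and one group of 10.
hours = total_person_hours(["15-59 minutes", ">4 hours"], [(10, "5-14 minutes")])
```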

Age and gender-stratified contacts

We calculated contact numbers and durations stratified by the respondent age-groups (0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69 and 70+). For analyses comparing respondents within a season, we used the Kruskal-Wallis nonparametric H test [21] to identify differences in the contact rates, followed by Dunn’s test to examine pairwise differences between age-categories, with the Bonferroni correction to account for multiple comparisons. For comparisons across the three seasons, we selected respondents who were present in more than one season and used statistical tests for dependent samples, namely the Friedman non-parametric test, followed by a Nemenyi post-hoc test.
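The two omnibus tests can be illustrated with SciPy, assuming it is available. The contact counts below are invented for illustration, and the Friedman example assumes respondents observed in all three seasons, since that test needs complete blocks. The post-hoc tests (Dunn with Bonferroni correction, Nemenyi) are not part of SciPy and are omitted from this sketch.

```python
from scipy import stats

# Within one season: independent age groups (hypothetical contact counts).
age_0_9 = [16, 17, 19, 21, 24]
age_20_29 = [10, 11, 12, 14, 15]
age_70_plus = [3, 5, 6, 8, 9]
h_stat, h_p = stats.kruskal(age_0_9, age_20_29, age_70_plus)

# Across seasons: the same six respondents, aligned by position
# (hypothetical counts; index i is the same person in each list).
winter = [20, 18, 25, 22, 30, 19]
summer = [15, 14, 20, 17, 24, 13]
monsoon = [10, 9, 12, 11, 16, 8]
f_stat, f_p = stats.friedmanchisquare(winter, summer, monsoon)
```

The Kruskal-Wallis test treats the groups as independent samples, whereas the Friedman test ranks the seasons within each respondent, which is why it suits the repeated-measures comparison.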

Further, we plotted gender stratified boxplots for each age category and analysed if a significant difference existed between the number of contacts for males and females for each age group using the Kruskal-Wallis test. The effect size was calculated to quantify the difference in total contacts using the Cohen’s d [22] metric:

$$ d = \frac{x_1 - x_2}{s} $$

where x₁ and x₂ are the mean contacts for males and females respectively and s is the pooled standard deviation, given by:

$$ s = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}} $$

where n₁, n₂ are the number of males and females, and s₁, s₂ are the standard deviations of the number of contacts of males and females respectively.
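A minimal sketch of the effect-size calculation follows, assuming the standard form of Cohen's d, d = (x₁ − x₂)/s, with s pooled using n₁ + n₂ − 2 degrees of freedom and sample standard deviations. The contact counts are invented for illustration.

```python
import math
import statistics

def cohens_d(group1, group2):
    """Cohen's d with a pooled standard deviation (assumed standard form)."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = statistics.stdev(group1), statistics.stdev(group2)  # sample SDs
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled

# Hypothetical total contacts for males and females in one age group.
males = [12, 14, 16, 18, 20]
females = [10, 12, 14, 16, 18]
d = cohens_d(males, females)
```

A positive value indicates more contacts among males; by Cohen's conventional benchmarks, values near 0.2, 0.5 and 0.8 are usually read as small, medium and large effects.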

Social networks to visualise contact patterns

We created social networks to visualise the contact mixing patterns at Ballabgarh. Since our survey was conducted on a sample of the Ballabgarh population, and not all reported contacts were respondents in the survey, we represented contacts using a directed graph. In our networks, the nodes represent the ten-year age categories with their genders and the node sizes represent the total median contacts. The weighted directed edge from a node A to B represents the median number of contacts A had with B. The visualisation was created with Gephi [23] and the graph layout was generated using the ForceAtlas2 algorithm [24], taking into account the node sizes. The PageRank algorithm [25] was used to calculate the centrality of each node, and nodes were coloured based on their PageRank, with darker colours representing higher values.
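To make the centrality measure concrete, the sketch below runs a weighted PageRank by power iteration on a toy directed graph. It is not Gephi's implementation: the damping factor of 0.85, the node labels, and the edge weights are assumptions for illustration.

```python
def weighted_pagerank(edges, damping=0.85, tol=1e-10, max_iter=1000):
    """edges: list of (source, target, weight). Returns {node: rank}, summing to 1."""
    nodes = sorted({u for u, _, _ in edges} | {v for _, v, _ in edges})
    n = len(nodes)
    out_weight = {u: 0.0 for u in nodes}
    for u, _, w in edges:
        out_weight[u] += w
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing weight is spread uniformly.
        dangling = sum(rank[u] for u in nodes if out_weight[u] == 0)
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u, v, w in edges:
            if w > 0:
                # Each node passes rank to its contacts in proportion to edge weight.
                new[v] += damping * rank[u] * w / out_weight[u]
        converged = sum(abs(new[u] - rank[u]) for u in nodes) < tol
        rank = new
        if converged:
            break
    return rank

# Hypothetical median contacts between three age-gender nodes.
edges = [
    ("M 0-9", "F 30-39", 5),
    ("F 30-39", "M 0-9", 6),
    ("M 10-19", "M 0-9", 4),
    ("M 10-19", "F 30-39", 1),
]
ranks = weighted_pagerank(edges)
```

In this toy graph the node that receives the most weighted contacts from other nodes ends up with the highest rank, which is the sense in which PageRank is used as a centrality measure here.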

Contact setting

To understand if the contact patterns differed inside and outside the home, boxplots for individual, group, and total number and duration of contacts were plotted for each season. The Wilcoxon signed-rank test [26] was used to check if the difference between contacts inside and outside the home was significant.

The percentage of individual contacts involving a physical touch was calculated at each location (home, school, work, transport, other) for every respondent. Bar plots indicating the mean values and 95% confidence intervals were plotted for the three seasons. To analyse if the proportions of physical contacts differed across the seasons, we selected respondents who responded in more than one season and used the Friedman test, followed by the Nemenyi post-hoc test to examine pairwise differences between the seasons. Bar plots of the proportion of physical contacts by contact duration and contact frequency were also plotted for each season.

Group contact purposes

Every respondent specified the purpose of each of their group contacts in a free-write format. These strings were inconsistent and often had spelling mistakes. To understand the difference in group contacts across the three seasons, we created 13 contact reason categories such as school, politics, weddings, work, and worship. A set of keywords was defined for each category (Table S2). Keyword matching was performed to map the contact reason strings to these categories. One string could map to multiple categories and a contact belonging to none of the other categories was mapped to the ‘other’ category. Bar plots were created to highlight the differences in the distributions of group contacts for each category across the three seasons. Both the total size of group contact in each category and the number of people who reported a group contact of each category were visualised.
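The keyword-matching step might look like the sketch below. The categories and keywords shown are hypothetical stand-ins (the actual lists are in Table S2), and plain substring matching is an assumption about how matching was done.

```python
# Hypothetical keyword lists; the study's real lists are given in Table S2.
CATEGORY_KEYWORDS = {
    "school": ["school", "class", "tuition"],
    "weddings": ["wedding", "marriage", "shaadi"],
    "worship": ["temple", "mandir", "prayer", "worship"],
    "work": ["work", "office", "farm"],
}

def categorise(reason):
    """Map a free-text contact reason to every category whose keyword it contains."""
    text = reason.lower()
    matches = [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    # A reason that matches no category falls into 'other'.
    return matches or ["other"]

both = categorise("Went to Temple after school")
single = categorise("marriage function")
fallback = categorise("playing cards")
```

Substring matching tolerates case differences but not misspellings, so in practice common misspelt variants would need to be listed as keywords as well.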

Age-assortative mixing matrices

We calculated age-assortative mixing matrices for both the number and duration of contacts. These matrices represent the average number or duration of contacts between each pair of age categories. For group contacts, the individual age of each member in the group was not known, and only the lowest and highest ages in the group were recorded in the dataset. Therefore, we sampled ages between the lower and upper age based on the census age proportions of rural Faridabad. These proportions were computed using the age-wise population counts for rural Faridabad, Haryana obtained from the 2011 census of India [27].
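The age-sampling step for group contacts can be sketched as weighted sampling within the reported age range. The census counts below are made up for illustration, and the function name and the use of single-year ages are assumptions.

```python
import random

def sample_group_ages(low, high, size, census_counts, rng):
    """Draw `size` ages in [low, high], weighted by census population counts."""
    ages = [a for a in range(low, high + 1) if census_counts.get(a, 0) > 0]
    weights = [census_counts[a] for a in ages]
    return rng.choices(ages, weights=weights, k=size)

# Hypothetical single-year census counts (not the real Faridabad figures).
census_counts = {age: max(1, 2000 - 20 * age) for age in range(0, 91)}

rng = random.Random(0)  # fixed seed so the draw is reproducible
group_ages = sample_group_ages(low=20, high=40, size=8,
                               census_counts=census_counts, rng=rng)
```

Each sampled age can then be binned into its ten-year category and counted in the mixing matrix like an individual contact.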

Observed to Expected Age-assortative mixing matrices

We calculated the Observed to Expected (O/E) mixing matrices to evaluate whether the actual number or duration of contacts between a pair of groups was higher or lower than the number we would expect if mixing were proportional to the population sizes. In a conventional mixing matrix, a high value between two groups could be obtained just because of a high number of individuals in one of the age categories. Therefore, calculating the O/E matrices helps us to account for the population sizes. A value greater than 1 in the matrix signifies a greater than expected mixing between two groups while a value less than 1 signifies the opposite.

To calculate the O/E matrix, 1000 bootstrap samples were drawn from the contact mixing data. The observed number and duration of contacts were calculated for each sample. The expected contacts of a respondent category with other contact categories were calculated by multiplying the total contacts for the respondent category with the census population proportions of the contact categories. We present these observed by expected matrices with 95% confidence intervals for both the number and duration of contacts, stratified by season, as well as by respondent gender. The Q-index [28] was used as a measure of the assortativity of the mixing matrices. It ranges from 0 to 1, with 0 representing random mixing and 1 representing perfect assortativity:

$$ Q = \frac{\mathrm{Tr}(P) - 1}{n - 1} $$

where P is the contact matrix normalised to a left-stochastic matrix and n is the number of age categories.
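A small sketch of both quantities is given below, without the bootstrap step. It assumes that rows index respondent categories and columns index contact categories, and that the Q-index takes the usual trace-based form Q = (Tr(P) − 1)/(n − 1) with P column-normalised; the toy matrices and function names are hypothetical.

```python
def observed_over_expected(observed, census_props):
    """observed[i][j]: contacts reported by respondent category i with category j.
    Expected contacts = row total x census proportion of the contact category."""
    ratios = []
    for row in observed:
        total = sum(row)
        expected = [total * p for p in census_props]
        ratios.append([o / e for o, e in zip(row, expected)])
    return ratios

def q_index(matrix):
    """Trace-based Q-index after normalising each column to sum to 1."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    trace = sum(matrix[j][j] / col_sums[j] for j in range(n))
    return (trace - 1) / (n - 1)

# Hypothetical two-category example with equal census proportions.
oe = observed_over_expected([[8, 2], [3, 3]], [0.5, 0.5])

q_assortative = q_index([[5, 0], [0, 7]])   # all contacts within category
q_random = q_index([[2, 2], [2, 2]])        # contacts spread evenly
```

In the study, the O/E calculation is repeated over 1000 bootstrap samples to obtain 95% confidence intervals; the sketch shows a single pass.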

Regression models to predict contacts and understand the effect of explanatory variables

We fitted univariate regression models stratified by gender, with the age as the independent variable and the median number of contacts as the dependent variable. To avoid outliers, the median contacts for an age were only calculated if there were a minimum of three respondents of that age. A polynomial regression model of degree 5 was fitted on the data.

We also constructed a multivariate negative binomial regression model to understand the independent effect of the season, age-groups (in ten-year categories), employment, and whether the day was a weekend on the number and duration of contacts. Respondents with occupations listed as unemployed, retired, dependents, aged individuals, housewives, and girls doing household chores were treated as unemployed for the regression model. The age category of 10-19 years old (the group with the highest contacts in previous studies [1]) and the winter season were chosen as reference categories. The ‘NegativeBinomial’ GLM family was used from the ‘statsmodels’ package [29] in Python with the default hyperparameter values. We present the adjusted rate ratios for the model, along with their 95% confidence intervals.

Generating a synthetic population that mimics contact patterns

In order to simulate an epidemic in the population, we generated a synthetic population with contact patterns similar to those observed in our study. We generated populations of size 10,000 (one for each season) with an age distribution similar to that of the census data for Faridabad, and used a Markov Chain Monte-Carlo approach to assign every agent a house and a workplace, such that the age-stratified contact pattern in homes and workplaces resembled the observed pattern of contacts within the home and outside the home respectively.

Agent-based modelling to simulate infection spread using contact mixing data

We used the synthetic populations to perform epidemiological simulations using BharatSim [30], an agent-based simulation engine. In an agent-based model, we initialise several ‘agents’ each with their own schedule, household and workplace, and simulate their interactions with one another. The rates and parameters used were derived from Kerr et al. [31] and are presented in Supplementary Tables S4 and S5. We simulated the spread of COVID-19 (as a model for a respiratory disease) using the compartmental model described by Hazra et al. [32] (Supplementary Fig S10) in the population in order to track differences in the spread of the infection due to the seasons.

Results

A total of 8,421 responses were obtained from 3,052 respondents across the three seasons. Table 1 shows the characteristics of those who participated in the survey. Supplementary Fig S1 illustrates the season-wise demographic details along with counts for new enrollments and loss to follow-up. Over 120,000 individual contacts and 58,000 group contacts were reported, for a total of approximately 180,000 contacts. Of the respondents, 2,913 (95.4%) responded in more than one season and were considered for a longitudinal analysis across the seasons.

Table 1: Characteristics of survey participants, stratified by gender.