Prevalence of Non-obese Type 2 Diabetes in economically disadvantaged Indian rural populations

Saptarshi Bej, Jit Sarkar, Saikat Biswas, Pabitra Mitra, Partha Chakrabarti, Olaf Wolkenhauer
doi: https://doi.org/10.1101/2020.09.21.20198598
Saptarshi Bej
1Department of Systems Biology and Bioinformatics, University of Rostock, Germany
  • For correspondence: saptarshibej24{at}gmail.com
Jit Sarkar
2Division of Cell Biology and Physiology, CSIR-Indian Institute of Chemical Biology, Kolkata, India
3Academy of Innovative and Scientific Research, Ghaziabad, India
  • For correspondence: jitnpur{at}gmail.com jit1806{at}csir.iicb.res.in
Saikat Biswas
4Advanced Technology Development Centre, Indian Institute of Technology, Kharagpur, India
Pabitra Mitra
5Department of Computer Science & Engineering, Indian Institute of Technology, Kharagpur, India
Partha Chakrabarti
2Division of Cell Biology and Physiology, CSIR-Indian Institute of Chemical Biology, Kolkata, India
3Academy of Innovative and Scientific Research, Ghaziabad, India
Olaf Wolkenhauer
1Department of Systems Biology and Bioinformatics, University of Rostock, Germany
6Stellenbosch Institute for Advanced Study (STIAS), Wallenberg Research Centre at Stellenbosch University, Stellenbosch, South Africa
  • For correspondence: olaf.wolkenhauer{at}uni-rostock.de

Abstract

Background Studies on Type 2 Diabetes Mellitus (T2DM) have revealed heterogeneous sub-populations in terms of underlying pathologies. However, the identification of sub-populations in epidemiological datasets remains unexplored. We here focus on the detection of T2DM clusters in epidemiological data, specifically analysing the National Family Health Survey-4 (NFHS-4) dataset, which contains a wide spectrum of features, including medical history, dietary and addiction habits, and socio-economic and lifestyle patterns of 10,125 T2DM patients.

Methods Epidemiological data are challenging to analyse because of the diverse types of features they contain. Applying the state-of-the-art dimension reduction tool UMAP in the conventional way was found to be ineffective for the NFHS-4 dataset, which contains continuous, ordinal and nominal feature types. Continuous features, although fewer in number, had an overpowering effect on the distribution of clusters. We implemented a distributed clustering workflow combining different similarity measure settings of UMAP, for clustering continuous, ordinal and nominal features separately. We integrated the reduced dimensions from each feature-type-distributed clustering to obtain interpretable and unbiased clustering of the data.

Findings From a methodological perspective, we show that for the diverse data types that are frequent in epidemiological datasets, feature-type-distributed clustering using UMAP is effective, as opposed to the conventional use of the UMAP algorithm. The application of a UMAP-based clustering workflow to this type of dataset is novel in itself.

Our analysis reveals four significant clusters, two of which consist mainly of non-obese T2DM patients. These non-obese clusters have a lower mean age and consist mostly of rural residents. Surprisingly, in one of the obese clusters about 90% of the T2DM patients practised a vegetarian diet, though they did not show an increased intake of plant-based protein-rich foods.

Interpretation Our findings demonstrate heterogeneity among T2DM patients with regard to socio-demography and dietary pattern. From our analysis, we conclude that the existence of significant non-obese T2DM sub-populations, characterized by younger age and economic disadvantage, raises the need for different screening criteria for T2DM among rural Indian residents.

Funding This work was supported in part by funds from Bioinformatics Infrastructure (de.NBI) and Establishment of Systems Medicine Consortium in Germany e:Med, as well as the German Federal Ministry for Education and Research (BMBF) programs (FKZ 01ZX1709C). The work has also been funded and supported by the Indian Council of Medical Research (ICMR) (No.3/1/3/JRF-2017/HRD-LS/56429/54).

1 Introduction

Type 2 Diabetes Mellitus (T2DM) is a multifactorial disease globally estimated to rise to 629 million cases by 2045 (see IDF Diabetes Atlas) [1, 2]. Though T2DM was long conceived as a homogeneous disease, several recent studies have found it to be a mix of heterogeneous disease subtypes [3, 4, 5]. These studies have reported a varied pathophysiology underlying T2DM and thereby suggest the possibility of a personalised treatment for T2DM.

Besides obesity, other factors such as age, sex, socio-economic status, place of residence (rural/urban), smoking habit, alcohol intake and food frequency are significantly associated with T2DM [6, 7, 8, 9, 10, 11, 12, 13]. Several of these factors are modifiable in nature and hence are important in the management of T2DM [1]. However, the modification of lifestyle-related factors varies between patients and thereby leads to a differential degree of glycaemic control among T2DM patients [14]. Glycaemic control and response to anti-diabetics have also been shown to differ among T2DM sub-groups [15]. To explore whether any particular pattern of patient sub-populations exists within the entire T2DM population based on socio-demographic and lifestyle factors, we used an unsupervised clustering approach on the largest and most comprehensive epidemiological dataset in India, the National Family Health Survey-4 (NFHS-4) dataset. Clusters were subsequently characterised to identify unique socio-demographic and lifestyle patterns associated with these sub-populations.

Epidemiological datasets provide a comprehensive set of information regarding socio-demography, lifestyle, addiction and co-morbidities. Variables containing such information are called features in the language of Machine Learning. In the T2DM-NFHS-4 dataset, there are 36 such features, containing information on each diabetes patient. Moreover, in our dataset, the features can be categorised into three types:

  1. Continuous features: These are features which can assume any numeric value from a continuous range. For example, the BMI of a patient is a continuous feature.

  2. Ordinal features: These are features which assume values from a discrete range such that there is a sense of order in the values assumed by the feature. For example, let us assume that a feature ‘meat consumption by a patient’ assumes the values ‘daily’, ‘weekly’ or ‘monthly’. Clearly the range of the feature ‘meat consumption by a patient’ is discrete, since it can assume any one of the three values. Also, there is a sense of order in the values: if we want to quantify meat consumption, daily meat consumption is the highest and monthly meat consumption is the lowest.

  3. Nominal features: These are features which assume values from a discrete range such that there is no sense of order in the values assumed by the feature. For example, let us assume that a feature ‘Religion of a patient’ assumes the values ‘Hindus’, ‘Muslims’ or ‘Christians’. Clearly the range of the feature ‘Religion of a patient’ is discrete, since it can assume any one of the three values. But there is no sense of order in the possible values assumed by the feature. Yet, this feature draws its importance from the fact that lifestyle patterns or diets vary largely among these religious groups.
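To make the distinction between ordinal and nominal features concrete, the following minimal sketch encodes the two example features from the list above. The integer codes and function names are illustrative assumptions, not the encoding used in the study; the point is only that rank differences are meaningful for ordinal values, whereas for nominal values only equality or inequality can be compared.

```python
# Hypothetical encodings, chosen for illustration only.
ORDINAL_MEAT = {"monthly": 0, "weekly": 1, "daily": 2}        # order is meaningful
NOMINAL_RELIGION = {"Hindu": 0, "Muslim": 1, "Christian": 2}  # codes are arbitrary labels


def ordinal_gap(a, b):
    """Difference in rank between two ordinal values."""
    return abs(ORDINAL_MEAT[a] - ORDINAL_MEAT[b])


def nominal_mismatch(a, b):
    """0 if the two categories are identical, 1 otherwise."""
    return int(a != b)


# 'daily' is further from 'monthly' than from 'weekly' ...
print(ordinal_gap("daily", "monthly"), ordinal_gap("daily", "weekly"))
# ... whereas all distinct nominal categories are equally far apart.
print(nominal_mismatch("Hindu", "Muslim"), nominal_mismatch("Hindu", "Christian"))
```

Subtracting the arbitrary codes in NOMINAL_RELIGION would impose an order that does not exist, which is why a different similarity measure is needed for each feature type.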

Such diverse types of features in epidemiological data create challenges for the analysis. Conventional application of the state-of-the-art dimension reduction tool Uniform Manifold Approximation and Projection (UMAP) was found to be ineffective for the T2DM-NFHS-4 dataset. Continuous features, although fewer in number, had an overpowering effect on the distribution of clusters. To address this problem, we implemented a distributed clustering workflow, combining different similarity measure settings of UMAP, for clustering continuous, ordinal and nominal features separately. We integrated the reduced dimensions from each feature-type-distributed clustering to obtain interpretable and unbiased clustering of the data.

The workflow realised for the present study (Figure 1) involves the investigation of underlying socio-demographic patterns within patient sub-populations using unsupervised learning. Dimension reduction approaches are often used to reduce higher dimensional data to lower dimensions, such that in the lower dimensional embedding of the data one can visualize underlying clusters that are not apparent in the higher dimensions [16]. Several such techniques have been developed over the last few decades. Until recently, the dimension reduction technique t-distributed Stochastic Neighbor Embedding (t-SNE) was a state-of-the-art algorithm in this field, with numerous applications in various fields [17, 18, 19]. t-SNE projects high dimensional data to a lower dimension while maintaining the underlying local manifold structure, in the sense that points that are close to each other on the latent high dimensional manifold are also placed close together in the lower dimension [17].

Figure 1: Workflow describing the analysis of the T2DM NFHS-4 Dataset.

With a rigorous mathematical foundation, considerably high speed and an easy-to-use API in the style of scikit-learn, UMAP has turned out to be one of the most popular choices among data scientists [20, 21, 22]. As opposed to t-SNE, UMAP uses a graph based manifold approximation mechanism which contributes to the preservation of the global as well as the local properties of the latent data manifold in a lower dimensional representation of the data. UMAP builds a graph considering customized neighbourhoods for every data point. This graph is a representation of the higher dimensional data manifold. Given some low dimensional representation of the data, a similar process can be used to construct an equivalent topological representation. The end result is a patchwork of low-dimensional representations of neighbourhoods that groups similar data points on a local scale while better preserving long-range topological connections to more distantly related data points [20, 22]. Because UMAP preserves long-range as well as short-range topological connections, and because of its high computational efficiency, we chose UMAP for our unsupervised clustering approach. Moreover, UMAP allows a user to choose among several similarity measures through the metric parameter. This has been critical in our workflow, since our data contain continuous and categorical features, and choosing suitable similarity measures for continuous and categorical features is crucial for a meaningful and informative clustering [23].

2 Methodology

2.1 Source and Description of the T2DM NFHS-4 Dataset

Data preparation and pre-processing are the key aspects of approaching a problem from a Machine Learning perspective. In this Section we provide the details on the pre-processing approach adopted to generate the T2DM-NFHS-4 dataset.

The NFHS-4 dataset was downloaded from The Demographic & Health Surveys (DHS) Program website. NFHS-4 is the fourth round of the national health survey conducted under the supervision of the Ministry of Health and Family Welfare, Government of India, with the International Institute for Population Sciences (IIPS), Mumbai, serving as the main nodal agency for all the surveys. NFHS-4 followed a stratified two-stage sampling procedure covering all 640 districts of India. The survey was successfully conducted in 601,509 households. In those households, 112,122 men and 699,686 women could be successfully interviewed. Four survey questionnaires (Household Questionnaire, Woman’s Questionnaire, Man’s Questionnaire and Biomarker Questionnaire) were implemented in 17 local languages to collect basic demographic information, socio-economic parameters, family planning issues, nutritional status, health indicators, contact with community health workers etc. A unique aspect of the NFHS-4 study was that it collected data on diabetes status and performed a random blood glucose measurement for individuals (15-54 years) using a finger-stick blood specimen. As a result, biomarker measurements and tests, namely anaemia testing, blood pressure measurement, blood glucose testing and HIV testing, were included in the survey besides anthropometric measurements.

2.2 Dataset Preparation

For dataset preparation and cleaning, three questionnaires were merged: the Woman’s Questionnaire, the Man’s Questionnaire and the Biomarker Questionnaire. The first two contained information about background characteristics (location, age, sex, religion, social group, literacy, wealth status etc.), nutritional practices, addictions and co-morbidities, while the Biomarker Questionnaire contained information on height, weight, blood pressure and random blood glucose. A unique code was generated for each individual in all three questionnaires by appending the country code and phase, cluster number, household number and line number. The three datasets were joined by the unique code to prepare a single dataset of 810,971 individuals consisting of all men and women between 15 and 54 years of age. Pregnant women were then excluded to rule out Gestational Diabetes Mellitus. Individuals with missing diabetic and blood pressure status were also excluded. Variables known to be risk factors for DM (BMI, Age, Place of residence, Wealth Index, Smoking frequency, Alcohol intake frequency, Hypertension), socio-economic factors (Sex, Religion, Social group, Educational status), dietary frequencies and haemoglobin level were selected for the final analysis. BMI, age and haemoglobin level were taken as continuous variables and the rest as categorical variables. Outliers were removed separately for each of the three continuous variables to obtain the final dataset of 610,498 individuals (526,678 females and 83,820 males).
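The construction of the unique code and the join of the questionnaires can be sketched as follows. This is a minimal illustration on toy records: the field names, the "-" separator and the values (including the placeholder country code "XX0") are assumptions for illustration and do not correspond to actual NFHS-4 variable names or data.

```python
KEY_FIELDS = ("country_phase", "cluster", "household", "line")  # hypothetical names


def unique_code(record):
    """Concatenate the four identifiers into one key per individual."""
    return "-".join(str(record[field]) for field in KEY_FIELDS)


def join_by_code(interviews, biomarkers):
    """Inner-join interview records with biomarker records on the unique code."""
    biomarker_by_code = {unique_code(r): r for r in biomarkers}
    merged = []
    for record in interviews:
        code = unique_code(record)
        if code in biomarker_by_code:
            merged.append({**record, **biomarker_by_code[code], "code": code})
    return merged


# Toy records standing in for the Woman's, Man's and Biomarker questionnaires.
women = [{"country_phase": "XX0", "cluster": 1, "household": 12, "line": 2, "age": 34}]
men = [{"country_phase": "XX0", "cluster": 1, "household": 12, "line": 1, "age": 40}]
biomarker = [
    {"country_phase": "XX0", "cluster": 1, "household": 12, "line": 2, "glucose": 110},
    {"country_phase": "XX0", "cluster": 1, "household": 12, "line": 1, "glucose": 150},
]

merged = join_by_code(women + men, biomarker)
print([m["code"] for m in merged])
```

An inner join is used here so that only individuals present in both an interview file and the biomarker file are retained; whether the original preparation used this join type is not stated in the text.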

2.3 Dataset Preprocessing

We were interested in detecting significant T2DM sub-populations in the data and further sought to characterize these subpopulations based on the socio-demographic and co-morbid conditions. For this purpose, we extracted patients with known history of diabetes from the dataset: a total of 10,125 patients. We considered a diverse collection of socio-demographic and co-morbid conditions as ‘features’ in our dataset. Qualitatively our features can be divided into several categories:

  1. Co-morbid conditions: This class of features considers the co-morbid diseases among T2DM patients. We considered whether a T2DM patient had medical conditions such as Asthma, Thyroid disorder, Heart disease, Cancer, Tuberculosis and Hypertension. Thus, there were six features in this category. These features are binary in nature denoting whether a T2DM patient suffered from a given comorbidity or not.

  2. Food habits: This class of features considered the food habits of T2DM patients. The features considered here were how frequently the patient consumed the following food items: Milk or Curd, Pulses or Beans, Dark leafy vegetables, Fruits, Eggs, Fish, Chicken, Fried food and Aerated drinks. Thus, there were nine features in this category. The features were categorical and ordinal in nature, with four possible values: ‘Daily’, ‘Weekly’, ‘Occasionally’ and ‘Never’.

  3. Addiction history: This class of features considered the addiction pattern of T2DM patients. There were two features in this class, both binary in nature encoding whether a patient is a Smoker or whether a patient takes Alcohol.

  4. Socio-demographic features: These included features such as Sex, Age, Wealth index, Education level, Religion and Caste along with Body Mass Index (BMI) and Haemoglobin level of the patient. There were eight features in this category.

  5. Living conditions: This class of features quantifies the living conditions of the patients. The features in this class considered whether a patient lives in a household possessing a refrigerator, bicycle, motorbike, four wheeler vehicle and livestock. Moreover, there were features denoting type of residence, household structure, frequency of household members smoking inside the house, type of cooking fuel used, source of drinking water and time to reach the nearest drinking water source. Thus, there were eleven features belonging to this category.

For our study, 36 features or factors were considered to investigate significant patient sub-populations among the diabetes patients. Note that there are both continuous and categorical features among these thirty-six features. Among the categorical features there are both ordinal features and nominal features. Ordinal features have a sense of order among their values, such as the features from the ‘food habits’ category described before. The nominal features are categorical features with no sense of order, such as the sex of a patient. Note that for our dataset the continuous features are: Age, BMI, Haemoglobin level and Time to get to drinking water source; whereas the nominal features are: Sex, Religion, Caste, Household structure, Type of place of residence, Type of cooking fuel and Source of drinking water. The rest of the features are treated as ordinal features. The categorization of features into continuous, nominal and ordinal is of utmost importance in our clustering paradigm, which we discuss in Section 2.4.1.
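The partition of the 36 features by type can be written down explicitly as a check on the counts given above. The identifiers below are shorthand names chosen for illustration (the dataset's actual column names are not given in the text); the grouping itself follows the description in this section, with every feature that is neither continuous nor nominal placed in the ordinal group.

```python
CONTINUOUS = ["age", "bmi", "haemoglobin", "time_to_water"]

NOMINAL = [
    "sex", "religion", "caste", "household_structure",
    "residence_type", "cooking_fuel", "water_source",
]

COMORBIDITIES = ["asthma", "thyroid_disorder", "heart_disease",
                 "cancer", "tuberculosis", "hypertension"]
FOOD_HABITS = ["milk_curd", "pulses_beans", "leafy_vegetables", "fruits", "eggs",
               "fish", "chicken", "fried_food", "aerated_drinks"]
ADDICTIONS = ["smoker", "alcohol"]
POSSESSIONS = ["refrigerator", "bicycle", "motorbike", "four_wheeler", "livestock"]

# "The rest of the features": everything that is neither continuous nor nominal.
ORDINAL = (COMORBIDITIES + FOOD_HABITS + ADDICTIONS + POSSESSIONS
           + ["wealth_index", "education_level", "indoor_smoking_frequency"])

ALL_FEATURES = CONTINUOUS + NOMINAL + ORDINAL
print(len(CONTINUOUS), len(NOMINAL), len(ORDINAL), len(ALL_FEATURES))
```

The resulting counts (4 continuous, 7 nominal, 25 ordinal) agree with the breakdown quoted in Section 4.1.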

2.4 Identification of T2DM sub-populations using UMAP and DBSCAN

As described above, our dataset has a variety of features, including continuous and categorical features. Further, there are both ordinal and nominal features among the categorical features. A simple UMAP on the entire dataset is depicted in Figure 2(a), revealing two broad clusters. For this clustering, the UMAP parameter n_neighbors was set to 30 and the metric parameter was set to euclidean. However, we have a number of important nominal and ordinal categorical features whose effect would not be apparent from such a clustering. Moreover, the Euclidean distance does not always make sense for categorical features, especially if they are nominal in nature. For example, observe Figure 2(d), where we have used UMAP considering only the nominal features with the metric parameter hamming (based on the Hamming distance). This reveals a completely different picture of the dataset, showing several small clusters. Our clustering paradigm is designed to account for this effect and to find a balance in the clustering such that no particular type of feature has an overpowering effect on the clustering process.

Figure 2: (a) UMAP clusters for all the features with Euclidean metric (b) UMAP clusters for continuous features with Euclidean metric (c) UMAP clusters for ordinal features with Canberra metric (d) UMAP clusters for nominal features with Hamming metric

2.4.1 Clustering paradigm using UMAP

Our clustering paradigm applies UMAP separately to the continuous, nominal and ordinal features. For each of these feature categories we create a lower dimensional embedding of the dataset. Finally, we integrate the lower dimensional embeddings and extract clusters from them using the DBSCAN algorithm, a clustering algorithm that extracts clusters from data based on data density. One advantage of this algorithm is that one does not need to specify the number of clusters beforehand. DBSCAN considers closely or densely located points as clusters [24]. For UMAP, we use the same parameter values, n_neighbors = 30 and min_dist = 0.1, for all the feature types.

  • For the continuous features we use the Euclidean metric. The Euclidean distance between two vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is given by: $$d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

  • For the nominal features we use the Hamming metric. The Hamming distance is defined as: $$d_H(x, y) = \sum_{i=1}^{n} \delta(x_i, y_i)$$ where δ(x_i, y_i) = 0 if x_i = y_i and δ(x_i, y_i) = 1 otherwise. Recall that nominal features are a type of categorical feature with no sense of order associated with them. For such features the Hamming distance is widely used as a similarity measure between data points [23].

  • For the ordinal features we use the Canberra metric. It is a weighted version of the Manhattan measure. The Canberra distance is given by: $$d_C(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}$$

Ordinal features are also a type of categorical feature. However, the Hamming metric cannot capture the inherent ordered relationships and statistical information in categorical values [23]. We therefore tried UMAP with several metric measures and noticed that the Canberra distance measure retains a high variance in the lower dimensions. Thus we chose the Canberra distance measure as the similarity metric for ordinal features.
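The three distance measures named above can be written out in a few lines. The sketch below is a plain-Python illustration of the textbook definitions, not the implementation used inside UMAP; note that some library implementations (for example SciPy's) return the Hamming distance as a fraction of mismatching positions rather than a count, and the handling of 0/0 terms in the Canberra distance (skipped here) is a convention.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def hamming(x, y):
    """Number of positions at which the two vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def canberra(x, y):
    """Manhattan distance with each term divided by |x_i| + |y_i|."""
    total = 0.0
    for a, b in zip(x, y):
        denominator = abs(a) + abs(b)
        if denominator:  # a 0/0 term contributes nothing, by convention
            total += abs(a - b) / denominator
    return total


print(euclidean([0, 0], [3, 4]))      # continuous features
print(hamming([0, 1, 2], [0, 2, 2]))  # nominal codes: only (in)equality matters
print(canberra([1, 0], [3, 0]))       # ordinal codes: relative differences
```

Because each Canberra term is bounded by 1, no single ordinal feature with a wide range of codes can dominate the distance, which is one plausible reason it behaves differently from the Euclidean measure on such data.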

For the continuous and ordinal features we thus produce a two dimensional representation of each data point by taking into consideration the first two UMAP coordinates. For the nominal features we produce a one dimensional representation, since the data points are too scattered in this case, as shown in Figure 2(d), which can lead to too many clusters. Thus, we reduce every data point to a five dimensional representation: two dimensions each for the continuous and ordinal features and one for the nominal features. Finally, we look for clusters in the five dimensional representation using DBSCAN (eps= 1, minpoints= 200). After selecting the final clusters, we characterized them by summarizing all 36 variables separately for each cluster. The continuous variables were summarized by their mean and the standard error of the mean. The categorical variables were summarized by their frequency distribution and the proportion of each value within each cluster.
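The workflow described above could be sketched as follows, assuming the umap-learn and scikit-learn packages. This is an illustrative outline rather than the authors' code: the function name, the fixed random seed, the absence of any feature scaling and the mapping of "minpoints" to scikit-learn's min_samples argument are assumptions made for this sketch.

```python
# Settings taken from the text: feature type -> (UMAP metric, output dimensions).
SETTINGS = {
    "continuous": ("euclidean", 2),
    "ordinal": ("canberra", 2),
    "nominal": ("hamming", 1),
}


def cluster_by_feature_type(blocks, n_neighbors=30, min_dist=0.1,
                            eps=1.0, min_samples=200, seed=0):
    """Embed each feature type separately, stack the embeddings, run DBSCAN.

    `blocks` maps 'continuous', 'ordinal' and 'nominal' to numeric arrays
    with one row per patient (categorical values already integer-coded).
    """
    # Imported lazily so that SETTINGS can be inspected without these packages.
    import numpy as np
    import umap  # provided by the umap-learn package
    from sklearn.cluster import DBSCAN

    embeddings = []
    for name, (metric, dims) in SETTINGS.items():
        reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                            n_components=dims, metric=metric, random_state=seed)
        embeddings.append(reducer.fit_transform(blocks[name]))

    reduced = np.hstack(embeddings)  # shape: (n_patients, 5)
    # DBSCAN labels outliers (noise points) as -1.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reduced)
    return reduced, labels


print(sum(dims for _, dims in SETTINGS.values()))  # total embedding dimensions
```

Keeping the per-type settings in one table makes the central design choice visible: each feature type receives its own metric and its own share of the final five dimensions, so the four continuous features cannot dominate the 32 categorical ones.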

2.4.2 Extraction of T2DM sub-populations using DBSCAN

Using the clustering paradigm described above, we detect seven sub-populations among the patients, with 261 patients considered as outliers. We show the distribution of clusters in Figure 3a. We further perform a UMAP on the five dimensional reduced representation of our data to visualize the clusters detected by DBSCAN. For this, we label the data points using the DBSCAN cluster labels and colour code them in the UMAP representation of the five dimensional reduced data, as shown in Figure 3b. This provides a visual validation that the clustering produced by DBSCAN is sensible. Note that, among our clusters, we can detect four significant patient sub-populations containing 2898, 2301, 2226 and 1315 data points.

Figure 3: (a) Distribution of clusters detected by DBSCAN on the five dimensional reduced representation of the data (b) UMAP clusters for five dimensional reduced representation of the data annotated by the DBSCAN generated clusters

3 Results

3.1 Characterization of clusters

Age and BMI were both found to be lower in Cluster 2 and Cluster 4

Age and obesity are the most important risk factors for T2DM. However, we found heterogeneity in both these variables across the clusters. Interestingly, the mean Age and BMI were both lower in Cluster 2 (Age: 38.3 ± 0.19 years, BMI: 23.9 ± 0.1) and Cluster 4 (Age: 37.9 ± 0.26 years, BMI: 23.6 ± 0.13) compared to Cluster 1 (Age: 41.3 ± 0.14 years, BMI: 26.7 ± 0.09) and Cluster 3 (Age: 39.9 ± 0.18 years, BMI: 26 ± 0.11). However, the distribution of males and females was found to be similar across all the clusters.
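The "mean ± standard error of the mean" summaries quoted here follow the usual formula SEM = s / √n, where s is the sample standard deviation and n the cluster size. A minimal sketch, using made-up values rather than NFHS-4 data:

```python
import math
import statistics


def mean_sem(values):
    """Return (mean, standard error of the mean) of a list of numbers."""
    n = len(values)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


toy_bmi = [21.5, 23.0, 24.2, 25.1, 22.7]  # illustrative values only
mean, sem = mean_sem(toy_bmi)
print(f"{mean:.1f} ± {sem:.2f}")
```

Since the SEM shrinks with √n, the small standard errors reported for clusters of one to three thousand patients are expected and do not by themselves indicate a narrow spread of individual values.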

Higher proportion of rural residents and lower proportion of richest wealth quintile in Cluster 2 and 4

The proportion of rural residents was found to be high in Cluster 2 (69.4% were rural residents) and Cluster 4 (72.02% were rural residents) compared to the other clusters (31.3% in Cluster 1 and 49.19% in Cluster 3). Surprisingly, only 4.3% of people in Cluster 2 and 8.37% in Cluster 4 belonged to the richest quintile of the Wealth Index category, whereas 64.04% in Cluster 1 and 54.9% in Cluster 3 belonged to the same.

Frequency of co-morbid conditions was similar across all the clusters

Co-morbid conditions included history of asthma, thyroid disease, heart disease, cancer, history of tuberculosis, haemoglobin level and hypertension. Though the distribution of disease conditions shows minor variation across the clusters (Table 1), the trend is similar in all the clusters.

Table 1: Detailed cluster-specific analysis for all numerical and categorical variables.

Lifestyle patterns show evidence of a lower quality of life for patient sub-populations in Cluster 2 and 4

Our analysis reveals several other factors that support the observation that T2DM sub-populations from Cluster 2 and Cluster 4 have a considerably lower quality of life.

  1. We observe that only 0.22% and 24.79% of patients belonging to Cluster 2 and Cluster 4 respectively possess a refrigerator compared to 95.48% and 65.77% of patients belonging to Cluster 1 and Cluster 3 respectively.

  2. Only 30.9% and 32.78% of patients belonging to Cluster 2 and Cluster 4 respectively possess a motorbike compared to 71.53% and 67.03% of patients belonging to Cluster 1 and Cluster 3 respectively.

  3. Only 3.26% and 3.19% of patients belonging to Cluster 2 and Cluster 4 respectively possess a car/truck compared to 23.5% and 17.34% of patients belonging to Cluster 1 and Cluster 3 respectively.

  4. 44.24% and 54.98% of patients belonging to Cluster 2 and Cluster 4 respectively, use plant based cooking fuel, which is relatively cheap, compared to 12.22% and 19.63% of patients belonging to Cluster 1 and Cluster 3 respectively. Moreover, only 41.94% and 36.2% of patients belonging to Cluster 2 and Cluster 4 respectively use Gas/Oil based cooking fuel, which is relatively expensive, compared to 84.89% and 70.17% of patients belonging to Cluster 1 and Cluster 3 respectively.

  5. 6.35% and 15.51% of patients belonging to Cluster 2 and Cluster 4 respectively drink water from unprotected sources, compared to 2.62% and 1.98% of patients belonging to Cluster 1 and Cluster 3 respectively.

Intake of non-vegetarian foods is invariably low in Cluster 3

Around 90% or more of the population in Cluster 3 had no intake of egg (89.08%), fish (97.12%), or chicken or meat (97.71%), whereas fewer than 10% of the population in each of the other three clusters had no intake of these non-vegetarian foods (Table 1). Though the Cluster 3 population had the highest daily intake of milk/curd (61.81%) and pulses/beans (50.31%), the other clusters had a similar proportion of people taking milk/curd and pulses/beans daily. Intake of other foods such as dark leafy vegetables, fruits, fried foods and aerated drinks showed a similar distribution across all the clusters.

4 Discussion

4.1 Rationale of the workflow in clustering epidemiological data

The clustering workflow used arises from some important observations that we will discuss here. To begin with we have a population of 10,125 T2DM patients with a diverse ensemble of features accounting for information on medical history, dietary and addiction habits, socio-economic and lifestyle patterns. Moreover, the features in the considered dataset are also diverse in terms of data types. We have a total of 36 features, out of which 4 are continuous features, 7 nominal features and 25 ordinal features, all of equal importance by assumption.

The aim is to find significant sub-populations in our data such that the identified sub-populations are interpretable in terms of the considered features. Note that by a significant sub-population we mean a sub-population consisting of at least 10 percent of the total population. If such sub-populations exist and we can explain them in terms of the considered features, we can argue that these patterns exist in a significant number of patients.
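As a worked check of this 10 percent criterion, the shares of the four large clusters can be computed from the sizes reported in Section 2.4.2. The variable names are illustrative; the numbers are those given in the text.

```python
TOTAL_PATIENTS = 10_125
LARGE_CLUSTER_SIZES = [2898, 2301, 2226, 1315]  # from Section 2.4.2
OUTLIERS = 261

# Share of the total T2DM population in each large cluster, in percent.
shares = [round(100 * size / TOTAL_PATIENTS, 1) for size in LARGE_CLUSTER_SIZES]
print(shares)

# Patients left for the three remaining (smaller) clusters.
remaining = TOTAL_PATIENTS - sum(LARGE_CLUSTER_SIZES) - OUTLIERS
print(remaining)
```

Each of the four clusters holds between roughly 13 and 29 percent of the patients, so all four meet the threshold, and together they account for 8,740 of the 10,125 patients.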

We have already argued in favour of using UMAP for our unsupervised approach to find clusters in the data. However, we observed that applying the UMAP algorithm conventionally, using the Euclidean similarity metric on our entire dataset with 36 features, turns out to be ineffective. The reason is that in this case the continuous features have an overpowering effect over the other feature types in determining the distribution of clusters. This can be observed from Figure 2(a) and 2(b). Note that Figure 2(a) shows UMAP clustering with all 36 features and 2(b) shows UMAP clustering with only the four continuous features. There is a similarity in the clustering distribution of these figures, each containing one major cluster and seven small minor clusters. We observed that this is because UMAP, when applied to all 36 features of the dataset using the Euclidean similarity measure, is largely biased towards finding similarity among data points only in terms of the continuous features. Given that we have only four continuous features out of 36, this poses a problem, as the diverse information present in the dataset in the form of the ordinal and nominal features is largely ignored.

To solve this problem, the continuous, ordinal and nominal features were treated separately by using a different similarity metric for each of them, giving rise to our clustering paradigm. We argued for our choice of similarity measures in Section 2.4.1. This generates for each feature type a data representation of lower dimension, shown in Figure 2(b-d). We finally integrated these lower dimension data representations by taking two dimensional representations for continuous and ordinal features and a one dimensional representation (the one retaining the most variance) for nominal features. The reason for considering a one dimensional representation for nominal features is that using the Hamming metric for such data retains a lot of variance in the data, resulting in multiple clusters, as we observe in Figure 2(d). Considering a two dimensional representation for this data while integrating the lower dimension data representations would carry forward this variance and result in multiple small clusters in the final clustering distribution, which contradicts our aim of finding significantly large sub-populations (of at least 10 percent of the total population).

Finally, clusters are extracted from the integrated five dimensional reduced representation of the dataset using DBSCAN and visualised by applying UMAP with the Euclidean similarity measure to this representation (shown in Figure 3b). Note that in our final clusters we can observe patterns in all of the continuous, ordinal and nominal data types. For example, in Cluster 4 the continuous feature ‘Time to Water source (min)’ shows very high values compared to other clusters. In Cluster 1 and 3, the nominal feature ‘Cooking fuel used’ shows a higher percentage for Gas/Oil users, while in Cluster 2 and 4 the same feature shows a higher percentage for plant-based fuel users. In Cluster 3, the ordinal feature ‘Fish intake frequency’ shows that 97 percent of people never consume fish. Thus, we infer that our clustering paradigm enables us to find significant sub-populations while keeping the clustering distribution unbiased, that is, no feature type (continuous, ordinal or nominal) has an overpowering effect on the others.

4.2 Significance of T2DM clusters

T2DM was traditionally regarded as a homogeneous disease with Insulin Resistance followed by β-cell dysfunction as the underlying pathology. However, recent studies have found T2DM to be a heterogeneous entity in which the relative contributions of Insulin Resistance and β-cell dysfunction differ across T2DM clusters [3]. These studies were performed on clinical and biochemical data with variables of uniform data type. In contrast, our clustering approach takes into account the diverse data types of an epidemiological dataset and discovers clusters within the T2DM population. Interestingly, two of the four clusters obtained in our study belonged to the non-obese T2DM phenotype, with a mean BMI below 25. These two non-obese clusters also had a lower mean age than the other clusters. Both had a larger proportion of rural residents and a lower proportion of people belonging to the highest wealth quintile, suggesting that a large share of people with T2DM in rural India have a lower BMI and are younger. The T2DM patient sub-population belonging to these clusters has a relatively lower quality of life, judging by an analysis of the lifestyle-pattern features. The non-obese phenotype of T2DM has been increasingly reported over the last two decades, raising concern about the uniqueness of its underlying pathophysiology, with a greater contribution of β-cell dysfunction compared to Insulin Resistance [25, 26, 27, 28]. This non-obese T2DM phenotype has been found among Asians, and studies describing and investigating its similarities and differences have been conducted. Studies have concluded that T2DM occurs among Asians at a lower BMI cut-off and at a younger age [29, 30]. Our finding of two non-obese clusters with lower mean age is consistent with this.
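The non-obese label used above is based on the mean BMI of a cluster being below 25. A minimal sketch of that rule follows; the cluster names and BMI values are hypothetical and do not reproduce the study data.

```python
from statistics import mean

BMI_CUTOFF = 25.0  # mean-BMI threshold used in the text for the non-obese phenotype


def label_clusters(bmi_by_cluster):
    """Label each cluster by comparing its mean BMI with the cut-off."""
    return {
        cluster: "non-obese" if mean(values) < BMI_CUTOFF else "obese"
        for cluster, values in bmi_by_cluster.items()
    }


# Hypothetical BMI values for two illustrative clusters.
toy_bmi = {
    "A": [27.5, 29.0, 26.1],
    "B": [21.0, 23.4, 22.2],
}
phenotype = label_clusters(toy_bmi)
```

The label describes the cluster average only; individual members of a non-obese cluster can still have a BMI above the cut-off.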

Though non-obese T2DM is being considered a unique phenotype, epidemiological studies identifying high-risk population groups are still lacking. This is especially important for many Asian countries, where over half of the T2DM population is of the non-obese phenotype [25]. Our analysis, which reports an increased presence of rural residents in both non-obese T2DM clusters, calls for a modification of the BMI and age cut-offs for T2DM screening among rural residents. However, risk factors for T2DM specific to the rural population remain to be identified. Representation of people from the highest wealth quintile was much lower in both non-obese T2DM clusters. T2DM is a multi-factorial disease requiring strict compliance with lifestyle modification, proper diet and anti-diabetic therapy. The reduced representation of the highest wealth quintile in the non-obese T2DM clusters suggests the possibility of unequal access to care for non-obese T2DM patients, and thus the need for a more equitable healthcare policy in terms of prevention and therapy.

On the other hand, both obese T2DM clusters had a higher mean age and more urban residents. The proportion of people from the highest wealth quintile was higher in both obese clusters. Interestingly, one of the obese clusters (Cluster 3) had a consistently low intake of non-vegetarian foods (egg, fish, chicken and meat), indicating that this T2DM cluster consisted mainly of vegetarian people. Dietary requirements for diagnosed T2DM patients involve reduced amounts of carbohydrates and fats and increased amounts of protein-rich foods [31]. Animal products, being rich sources of dietary protein, need to be included in the diet. The existence of an obese T2DM cluster with a strict vegetarian dietary pattern suggests the need to design proper dietary guidelines for this group.

5 Conclusion

From a data science perspective, this analysis addresses the issue of diverse data types. We have shown that, for such data, a conventional application of dimension reduction approaches might not be fruitful. We developed a workflow that finds meaningful and interpretable clusters such that the distribution of clusters is not biased by the data types.

The existence of a significant non-obese T2DM patient sub-population that is younger, has a larger proportion of rural residents and has a lower quality of life indicates the need for different screening criteria for T2DM among rural Indian residents. The obese T2DM cluster in which around 90% of people follow a vegetarian diet calls for dietary guidelines for T2DM patients with a vegetarian dietary pattern.

Data availability

We support the idea of transparency and reproducibility of research. Therefore, all data relevant to this work are made publicly available in the GitHub repository at https://github.com/Saptarshi-Bej/Type-2-Diabetes-Mellitus-T2DM-/blob/master/Preprocessed_DM_xx.zip. Moreover, the Python code (in the form of a Jupyter notebook) implementing our workflow is publicly available at https://github.com/Saptarshi-Bej/Type-2-Diabetes-Mellitus-T2DM-/blob/master/Clustering_paradigm_disc_cont.ipynb.

Author Contributions

Saptarshi Bej and Jit Sarkar are the first authors and contributed equally to this work. Saptarshi Bej, Jit Sarkar, Pabitra Mitra, Partha Chakrabarti and Olaf Wolkenhauer contributed to the study concept and design. Saptarshi Bej, Jit Sarkar and Saikat Biswas performed the data analysis. Saptarshi Bej, Jit Sarkar and Olaf Wolkenhauer wrote the manuscript and are the guarantors of this work; they had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. All authors approved the final version of the article, including the authorship list.

Disclosure Summary

The authors declare no conflict of interest.

Acknowledgements

This work was supported in part by funds from the Bioinformatics Infrastructure (de.NBI) and the Establishment of Systems Medicine Consortium in Germany e:Med, as well as the German Federal Ministry for Education and Research (BMBF) programs (FKZ 01ZX1709C). JS received a research fellowship from the Indian Council of Medical Research (ICMR) (No.3/1/3/JRF-2017/HRD-LS/56429/54).

Footnotes

  • We added some more interpretations on our results

References

  1. Yan Zheng, Sylvia Ley, and Frank Hu. Global aetiology and epidemiology of type 2 diabetes mellitus and its complications. Nature Reviews Endocrinology, 14, 12 2017 doi: 10.1038/nrendo.2017.151.
  2. Lei Chen and Dianna Magliano. The worldwide epidemiology of type 2 diabetes mellitus-present and future perspectives. Nature Reviews Endocrinology, 8:228–36, 11 2011 doi: 10.1038/nrendo.2011.183.
  3. Ranjit Anjana, Viswanathan Baskar, Anand Thakarakkattil, Narayanan Nair, Saravanan Jebarani, Moneeza Kalhan Siddiqui, R. Guha Pradeepa, Ranjit Unnikrishnan, Colin Palmer, Ewan Pearson, and Viswanathan Mohan. Novel subgroups of type 2 diabetes and their association with microvascular outcomes in an Asian Indian population: a data-driven cluster analysis: the INSPIRED study. BMJ Open Diabetes Research & Care, 8:1506, 07 2020 doi: 10.1136/bmjdrc-2020-001506.
  4. Emma Ahlqvist, Petter Storm, Annemari Käräjämäki, Mats Martinell, Mozhgan Dorkhan, Annelie Carlsson, Petter Vikman, Rashmi Prasad B, Dina Mansour Aly, Peter Almgren, Ylva Wessman, Nael Shaat, Peter Spégel, Hindrik Mulder, Eero Lindholm, Olle Melander, Ola Hansson, Ulf Malmqvist, Ake Lernmark, and Leif Groop. Novel subgroups of adult-onset diabetes and their association with outcomes: A data-driven cluster analysis of six variables. The Lancet Diabetes & Endocrinology, 6, 03 2018 doi: 10.1016/S2213-8587(18)30051-2.
  5. Seong Beom Cho, Sang Kim, and Myung Chung. Identification of novel population clusters with different susceptibilities to type 2 diabetes and their impact on the prediction of diabetes. Scientific Reports, 9, 12 2019 doi: 10.1038/s41598-019-40058-y.
  6. Sofia Carlsson, Niklas Hammar, Valdemar Grill, and Jaakko Kaprio. Alcohol consumption and the incidence of type 2 diabetes. Diabetes Care, 26(10):2785–2790, 2003. ISSN 0149-5992 doi: 10.2337/diacare.26.10.2785. URL https://care.diabetesjournals.org/content/26/10/2785.
  7. Madelyn L. Wheeler, Stephanie A. Dunbar, Lindsay M. Jaacks, Wahida Karmally, Elizabeth J. Mayer-Davis, Judith Wylie-Rosett, and William S. Yancy. Macronutrients, food groups, and eating patterns in the management of diabetes. Diabetes Care, 35(2):434–445, 2012. ISSN 0149-5992 doi: 10.2337/dc11-2216. URL https://care.diabetesjournals.org/content/35/2/434.
  8. Emilie Agardh, Anders Ahlbom, Tomas Andersson, S Efendic, Valdemar Grill, Johan Hallqvist, and C Ostenson. Socio-economic position at three points in life in association with type 2 diabetes and impaired glucose tolerance in middle-aged Swedish men and women. International Journal of Epidemiology, 36:84–92, 03 2007 doi: 10.1093/ije/dyl269.
  9. Emilie Agardh, Peter Allebeck, Johan Hallqvist, Tahereh Moradi, and Anna Sidorchuk. Type 2 diabetes incidence and socio-economic position: a systematic review and meta-analysis. International Journal of Epidemiology, 40(3):804–818, 02 2011. ISSN 0300-5771 doi: 10.1093/ije/dyr029. URL https://doi.org/10.1093/ije/dyr029.
  10. Teruo Nagaya, Hideyo Yoshida, Hidekatsu Takahashi, and Makoto Kawai. Heavy smoking raises risk for type 2 diabetes mellitus in obese men; but, light smoking reduces the risk in lean men: A follow-up study in Japan. Annals of Epidemiology, 18:113–8, 02 2008 doi: 10.1016/j.annepidem.2007.07.107.
  11. Lukas Schwingshackl, Georg Hoffmann, Anna-Maria Lampousi, Sven Knüppel, Khalid Iqbal, Carolina Schwedhelm, Angela Bechthold, Sabrina Schlesinger, and Heiner Boeing. Food groups and risk of type 2 diabetes mellitus: a systematic review and meta-analysis of prospective studies. European Journal of Epidemiology, 32, 04 2017 doi: 10.1007/s10654-017-0246-y.
  12. Gang Liu, Geng Zong, Kana Wu, Yang Hu, Yanping Li, Walter C. Willett, David M. Eisenberg, Frank B. Hu, and Qi Sun. Meat cooking methods and risk of type 2 diabetes: Results from three prospective cohort studies. Diabetes Care, 41(5):1049–1060, 2018. ISSN 0149-5992 doi: 10.2337/dc17-1992. URL https://care.diabetesjournals.org/content/41/5/1049.
  13. V Connolly, N Unwin, P Sherriff, Rudy Bilous, and W Kelly. Diabetes prevalence and socioeconomic status: A population based study showing increased prevalence of type 2 diabetes mellitus in deprived areas. Journal of Epidemiology and Community Health, 54:173–7, 03 2000 doi: 10.1136/jech.54.3.173.
  14. Surendra Borgharkar and Soma Das. Real-world evidence of glycemic control among patients with type 2 diabetes mellitus in India: The TIGHT study. BMJ Open Diabetes Research & Care, 7:e000654, 07 2019 doi: 10.1136/bmjdrc-2019-000654.
  15. John Dennis, Beverley Shields, William Henley, Angus Jones, and Andrew Hattersley. Disease progression and treatment response in data-driven subgroups of type 2 diabetes compared with models based on simple clinical features: an analysis using clinical trial data. The Lancet Diabetes & Endocrinology, 7, 04 2019 doi: 10.1016/S2213-8587(19)30087-7.
  16. Zheng Sun, Weiqing Xing, Wenjun Guo, Seungwook Kim, Hongze Li, Wenye Li, Jianru Wu, Yiwen Zhang, Bin Cheng, and Shenghui Cheng. A Survey on Dimension Reduction Algorithms in Big Data Visualization, pages 375–395. Springer, 05 2020. ISBN 978-3-030-48512-2 doi: 10.1007/978-3-030-48513-931.
  17. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html.
  18. Dmitry Kobak and Philipp Berens. The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10:5416, 2019. ISSN 2041-1723. URL https://doi.org/10.1038/s41467-019-13056-x.
  19. Wentian Li, Jane E. Cerise, Yaning Yang, and Henry Han. Application of t-SNE to human genetic data. Journal of Bioinformatics and Computational Biology, 15(04):1750017, 2017 doi: 10.1142/S0219720017500172. URL https://doi.org/10.1142/S0219720017500172. PMID: 28718343.
  20. L. McInnes, J. Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform manifold approximation and projection. J. Open Source Softw., 3:861, 2018.
  21. A-M Galow, M Wolfien, P Müller, M Bartsch, RM Brunner, A Hoeflich, O Wolkenhauer, R David, and T Goldammer. Integrative cluster analysis of whole hearts reveals proliferative cardiomyocytes in adult mice. Cells, 9(5)(1144):1–16, 2020. ISSN 2073-4409.
  22. Alex Diaz-Papkovich, Luke Anderson-Trocmé, Chief Ben-Eghan, and Simon Gravel. UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. PLOS Genetics, 15:1–24, 11 2019 doi: 10.1371/journal.pgen.1008432. URL https://doi.org/10.1371/journal.pgen.1008432.
  23. Sheng Luo, Duoqian Miao, Zhifei Zhang, Yuanjian Zhang, and Shengdan Hu. A neighborhood rough set model with nominal metric embedding. Information Sciences, 520, 02 2020 doi: 10.1016/j.ins.2020.02.015.
  24. Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pages 226–231. AAAI Press, 1996.
  25. Unjali Gujral, Mary Weber, Lisa Staimez, and K M V Narayan. Diabetes among non-overweight individuals: an emerging public health challenge. Current Diabetes Reports, 18:60, 08 2018 doi: 10.1007/s11892-018-1017-1.
  26. Lisa Staimez, Mary Weber, Harish Ranjani, Mohammed Ali, Justin Echouffo-Tcheugui, Lawrence Phillips, Viswanathan Mohan, and K M V Narayan. Evidence of reduced beta cell function in Asian Indians with mild dysglycemia. Diabetes Care, 36, 04 2013 doi: 10.2337/dc12-2290.
  27. Jit Sarkar, Sujay Krishna Maity, Abhishek Sen, Titli Nargis, Dipika Ray, and Partha Chakrabarti. Impaired compensatory hyperinsulinemia among nonobese type 2 diabetes patients: a cross-sectional study. Therapeutic Advances in Endocrinology and Metabolism, 10, 2019.
  28. K M V Narayan. Type 2 diabetes: Why we are winning the battle but losing the war? 2015 Kelly West Award Lecture. Diabetes Care, 39:653–663, 05 2016 doi: 10.2337/dc16-0205.
  29. R. Ma and J. Chan. Type 2 diabetes in East Asians: similarities and differences with populations in Europe and the United States. Annals of the New York Academy of Sciences, 1281:64–91, 2013.
  30. Ji Won R. Lee, Frederick L. Brancati, and Hsin-Chieh Yeh. Trends in the prevalence of type 2 diabetes in Asians versus whites. Diabetes Care, 34(2):353–357, 2011. ISSN 0149-5992 doi: 10.2337/dc10-0746. URL https://care.diabetesjournals.org/content/34/2/353.
  31. Position Statements. Nutrition principles and recommendations in diabetes. Diabetes Care, 27(suppl 1):s36–s36, 2004. ISSN 0149-5992 doi: 10.2337/diacare.27.2007.S36. URL https://care.diabetesjournals.org/content/27/suppl_1/s36.
Posted October 18, 2020.
Prevalence of Non-obese Type 2 Diabetes in economically disadvantaged Indian rural populations
Saptarshi Bej, Jit Sarkar, Saikat Biswas, Pabitra Mitra, Partha Chakrabarti, Olaf Wolkenhauer
medRxiv 2020.09.21.20198598; doi: https://doi.org/10.1101/2020.09.21.20198598

Subject Area

  • Epidemiology