Abstract
Electronic health records (EHRs) provide a rich source of observational patient data that can be explored to infer underlying causal relationships. These causal relationships can be applied to augment medical decision-making or suggest hypotheses for healthcare research. In this study, we explored a large-scale EHR dataset on patients with asthma or related conditions (N = 14,937). The dataset integrated features representing demographic factors, clinical measures, and environmental exposures. The data were accessed via a service named the Integrated Clinical and Environmental Exposures Service (ICEES). We estimated underlying causal relationships from the data to identify significant predictors of asthma attacks. We also performed simulated interventions on the inferred causal network to estimate causal effects, in terms of shifts in the probability distribution for asthma attacks.
1 Introduction
Causal inference [1] is re-emerging as an important tool in the health sciences for informatics work such as finding the effects of a drug or the risk factors for a disease. Causality has traditionally been a core concept across all branches of medical science and is considered when diagnosing patients based on their symptoms, effects of treatment, and years of historical evidence [2]. Electronic health records (EHRs) present a potential data source to analyze digital patient information like medical history, diagnoses, medications, and laboratory results. Inferring causal relationships from these data is useful for important tasks like prediction and explanation. For prediction, we want to measure the likelihood of occurrence of an event as a result of another event, for example, the occurrence of lung cancer based on exposure to smoke in the environment. However, such predictions are susceptible to fallacies if they are based only on associations. For instance, an increase in the sales of matches (e.g., in a blackout-prone area) can also be associated with lung cancer. Most black-box prediction models, unlike causal inference, are not able to identify confounding variables and hence cannot differentiate causal from spurious associations.
Another aspect of causal inference is the ability to provide an explanation for the relationship between two events. For instance, causal inference helps us to unearth why a patient is sick and diagnose them based on the underlying cause of their symptoms and other aspects of their disease. Access to EHR data is thus critical for the advancement of clinical research and practice. However, the many regulations that surround clinical data, while necessary to ensure patient privacy and protect sensitive data, often make access to the data for research challenging.
In this research, we analyzed a patient-level dataset extracted from a regulatory-compliant open service called the Integrated Clinical and Environmental Exposures Service (ICEES). ICEES supports several use cases including asthma. The ICEES data are constructed by integrating clinical data elements derived from patient EHRs with environmental exposures data derived from a variety of public sources; the integrated data are then binned or recoded and stripped of all protected health information.
The ICEES data are then exposed via an open application programming interface (OpenAPI). For our principal application use case, we asked whether there is a causal relationship between asthma attacks and the following features: sex, race, prescriptions for prednisone, diagnoses of obesity, residential proximity to a major roadway or highway, residential density, and exposure to high levels of airborne pollutants. These features were selected because published studies, including our prior work [3, 4], have recognized them to be associated with asthma attacks. We focused on an existing ICEES cohort of patients with asthma or related conditions (see [3] for details), and we considered the number of annual emergency department (ED) or inpatient visits for respiratory issues as the primary outcome measure and indicator of asthma attacks. We used the ICEES OpenAPI to extract features that might be causally related to each other and used the resultant multivariate table for causal inference modeling. Because EHR data are purely observational, we also demonstrate a way to perform simulated external interventions, given a known causal network, to help answer important questions about the effects of clinical interventions. We use subject matter expert knowledge and publication support as our ground truth to measure the correctness of our causal inference modeling. Finally, we discuss our findings, including the benefits and limitations of our causal inference model and approach.
2 Analysis of the Multivariate ICEES Table
We queried the ICEES OpenAPI to generate an eight-feature multivariate table. The multivariate table analyzed in this work comprised data on 14,937 patients (rows represent individual patients in the asthma cohort) and eight ICEES feature variables per patient, namely, TotalEDInpatientVisits, Sex, Race, Prednisone, Obesity, PM2.5Exposure, RoadwayExposure, and EstResidentialDensity, where TotalEDInpatientVisits is our primary outcome variable (Table 1). In Fig. 1, we plot bar charts to compare the number of TotalEDInpatientVisits among the discrete categories of each feature. We can see that the count for zero TotalEDInpatientVisits is the largest among all categories. Upon further analysis, we found that the multivariate data table extracted from the OpenAPI largely consisted of patients who were inactive in the year 2010. Hence, to avoid bias and reduce noise in our analysis, we removed patients who were not active in the year of interest, meaning their EHR did not indicate any healthcare usage. To do so, we applied the “Active In Year” feature as a filter when extracting the multivariate table, with Active In Year = 1 selecting only patients who were active in 2010. We show the bar charts for the number of ED/inpatient visits for each feature in Fig. 1. We can observe that most patients who were active in 2010 visited the ED or an inpatient clinic only once. We can also see that there is an imbalance among the levels in some features, such as Prednisone, Obesity, Race, RoadwayExposure, and PM2.5Exposure.
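The filtering and tabulation described above can be sketched in plain Python. This is only an illustration: the rows are invented, and the column name `ActiveInYear` is a hypothetical stand-in for the "Active In Year" feature rather than the exact ICEES field name.

```python
from collections import Counter

# Hypothetical rows standing in for the ICEES multivariate table;
# none of these values come from the real dataset.
rows = [
    {"ActiveInYear": 1, "Prednisone": 1, "TotalEDInpatientVisits": 2},
    {"ActiveInYear": 0, "Prednisone": 0, "TotalEDInpatientVisits": 0},
    {"ActiveInYear": 1, "Prednisone": 0, "TotalEDInpatientVisits": 1},
    {"ActiveInYear": 1, "Prednisone": 0, "TotalEDInpatientVisits": 1},
    {"ActiveInYear": 0, "Prednisone": 0, "TotalEDInpatientVisits": 0},
]

# Keep only patients with recorded healthcare usage in the year of interest.
active = [r for r in rows if r["ActiveInYear"] == 1]

# Count patients per (feature level, number of visits); these counts are
# what one bar group of a stacked bar chart would display.
counts = Counter((r["Prednisone"], r["TotalEDInpatientVisits"]) for r in active)

for (level, visits), n in sorted(counts.items()):
    print(f"Prednisone={level} visits={visits} patients={n}")
```

Repeating the tabulation for each feature gives the per-level counts that a stacked bar chart summarizes.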
Table 1. Feature variables used to generate the multivariate table.
Fig. 1. Stacked bar chart representing the number of TotalEDInpatientVisits across each level of the feature variables. See Table 1 for feature variable definitions.
2.1 Feature importance
We evaluated the importance of each feature and its contribution to model performance using a tree-based machine learning model, random forest. We used the caret R package [5] to evaluate feature importance. We controlled training with the repeatedcv method, applying ten-fold cross-validation repeated three times. We found Prednisone, Race, and ObesityDx to be the highest contributing features, as shown in Fig. 2.
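The resampling scheme named above (ten-fold cross-validation repeated three times) can be sketched in plain Python. This illustrates only how the train/test splits are formed; it does not reproduce caret's own fold construction or model tuning, and the helper name `repeated_kfold_indices` and the seed are assumptions made for this example.

```python
import random


def repeated_kfold_indices(n_rows, n_folds=10, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_rows))
        rng.shuffle(order)  # a fresh random partition for every repeat
        # Fold f takes every n_folds-th row of the shuffled order.
        folds = [order[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield repeat, f, train_idx, test_idx


splits = list(repeated_kfold_indices(n_rows=14937))
print(len(splits))  # 10 folds x 3 repeats = 30 train/test splits
```

A model would be fitted on each `train_idx` and scored on the matching `test_idx`, with the 30 scores averaged.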
Fig. 2. Relative feature importance for all features with respect to TotalEDInpatientVisits. See Table 1 for feature variable definitions.
2.2 Modeling causal networks
Most of the naturally occurring trends that we come across are simply passive observations of events occurring in the world that are either coincidental or unexplained associations. For example, statements like “Drinking beer every day increases the chance of prostate cancer” are common in the news and scientific reporting and in our day-to-day personal beliefs. These associations can be easily mistaken for causation, making us susceptible to logical fallacies when we do not know the real underlying cause. Causal inference is the science of learning cause from effect [1]. It is an important field of research because it helps us eliminate spurious correlations [6–8]. The primary aim of inferring causal relations from data is to discover interactions between different entities in the form of Vi → Vj, where Vi and Vj are observable features in the domain and the arrow indicates that the state of Vi influences the state of Vj. Causal relations can be discovered either from observational measurements (seeing) or from measurements taken after performing some external manipulation/intervention (doing). A causal network [1, 9] can be represented with a directed acyclic graph (DAG) G = (V, E), where V = {V1, …, Vn} denotes the set of features and E ⊆ V × V denotes the set of edges that are causal in nature. For a causal edge (Vi, Vj), we say that Vi is a cause (parent) of Vj, and Vj is the resulting effect (child) of Vi. Let pa(Vi) denote the set of parents of Vi. The conditional probability distribution Pi defines the probability of Vi given the state of its parents pa(Vi). A causal network represents a joint distribution P over variables V as long as it satisfies two main assumptions:
Causal Markov assumption: Any given variable Vi is independent of its non-descendants, conditioned on all of its direct causes (parents). This implies that the joint distribution P(V) can be factored as:
$$P(V_1, \ldots, V_n) = \prod_{i=1}^{n} P_i\left(V_i \mid pa(V_i)\right).$$
Faithfulness assumption: The joint distribution P(V1, …, Vn) is faithful to G if every conditional independence relation in P is entailed by the Markov assumption applied to G [10].
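The factorization implied by the causal Markov assumption can be illustrated with a small worked example. The sketch below uses a hypothetical three-node network with binary variables and made-up conditional probabilities; it is not the network learned in this study.

```python
from itertools import product

# Hypothetical toy DAG (not the network learned in the paper):
#   Prednisone -> Visits <- Obesity
parents = {"Prednisone": (), "Obesity": (), "Visits": ("Prednisone", "Obesity")}

# Conditional probability tables P_i(V_i = 1 | pa(V_i)); all numbers are
# invented for illustration. Keys are tuples of parent states.
p_one = {
    "Prednisone": {(): 0.2},
    "Obesity": {(): 0.3},
    "Visits": {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6},
}


def joint(assignment):
    """P(V_1, ..., V_n) as the product of P_i(V_i | pa(V_i))."""
    prob = 1.0
    for node, pa in parents.items():
        pa_state = tuple(assignment[p] for p in pa)
        p1 = p_one[node][pa_state]
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob


nodes = list(parents)
total = sum(joint(dict(zip(nodes, states)))
            for states in product((0, 1), repeat=len(nodes)))
print(round(total, 10))  # the factored joint sums to 1 over all states
```

For example, `joint({"Prednisone": 1, "Obesity": 0, "Visits": 1})` multiplies 0.2, 0.7, and 0.4, one factor per node.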
To reconstruct a causal graph from data, we generally start by finding an approximation of the graph, given V, and then optimize based on conditions on data. The two main approaches used for causal network inference are:
Score-based: This approach uses a Bayesian scoring function S(G | D), which estimates the goodness-of-fit of graph G to the data D [11], as the objective function to maximize, while favoring simpler structures. The score function is usually combined with a search heuristic that explores the space of all possible graphs. Score-based methods are robust and can be extended to include interventional studies (if available), but they do not scale well as network or data size increases.
Constraint-based: This method is based on estimating some of the conditional (in)dependencies in the distribution P from the data D by performing hypothesis tests of conditional independence. Constraint-based methods usually start with a fully connected, undirected graph and progressively remove edges whenever a new conditional independence relation is discovered, while satisfying the corresponding d-separation statements. In this work, we use a constraint-based approach called the PC algorithm, given that the dataset is observational. To infer the causal graph, we learn the equivalence class of a DAG from the data with the traditional constraint-based PC algorithm proposed by [9]. Given a dataset D having n features V1, …, Vn, we conduct the following steps:
- Step 1. We start with a complete undirected graph over the n features.
- Step 2. We eliminate edges between variables that are unconditionally independent.
- Step 3. For each pair of variables (Vi, Vj) with an edge between them, and for each variable Vk with an edge connected to either of them, we eliminate the edge between Vi and Vj if Vi ╨ Vj | Vk.
- Step 4. For each pair of variables (Vi, Vj) with an edge between them, and for each pair of variables (Vk, Vl) with edges both connected to Vi or both connected to Vj, we eliminate the edge between Vi and Vj if Vi ╨ Vj | Vk, Vl.
- Step 5. We continue to check independencies conditional on subsets of variables of increasing size m until there are no more adjacent pairs (Vi, Vj) such that there is a subset of variables of size m in which all of the variables are adjacent to Vi or all are adjacent to Vj.
- Step 6. For each triple of variables (Vi, Vj, Vk) such that Vi and Vj are adjacent, Vj and Vk are adjacent, and Vi and Vk are not adjacent, we orient the edges Vi––Vj––Vk as Vi → Vj ← Vk if Vj is not in the conditioning set that rendered Vi and Vk independent and led to the elimination of the edge between them. We call such a triple of variables a v-structure.
- Step 7. For each triple of variables such that Vi → Vj––Vk, and Vi and Vk are not adjacent, we orient the edge Vj––Vk as Vj → Vk. This is called orientation propagation.
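The edge-elimination and v-structure steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the statistical test is replaced by a hand-written independence oracle for a hypothetical DAG (Race → Obesity → Visits ← Prednisone), the names `pc_skeleton` and `oracle` are invented for this example, and orientation propagation is omitted.

```python
from itertools import combinations


def pc_skeleton(nodes, indep):
    """Learn the skeleton and v-structures from a conditional-independence oracle.

    indep(x, y, cond) returns True when x and y are judged independent given
    the set cond; with real data this would be a hypothesis test.
    """
    adj = {v: set(nodes) - {v} for v in nodes}  # complete undirected graph
    sepset = {}
    size = 0
    # Grow the conditioning-set size until no adjacent pair has enough neighbours.
    while any(len(adj[x] - {y}) >= size for x in nodes for y in adj[x]):
        for x in nodes:
            for y in list(adj[x]):
                for cond in combinations(sorted(adj[x] - {y}), size):
                    if indep(x, y, set(cond)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(cond)
                        break
        size += 1
    # Orient x -> z <- y when x and y are non-adjacent and z is not in their sepset.
    arrows = set()
    for z in nodes:
        for x, y in combinations(sorted(adj[z]), 2):
            if y not in adj[x] and z not in sepset[frozenset((x, y))]:
                arrows |= {(x, z), (y, z)}
    return adj, arrows


def oracle(x, y, cond):
    """d-separation facts for the hypothetical DAG Race->Obesity->Visits<-Prednisone."""
    pair = frozenset((x, y))
    if pair == frozenset(("Race", "Visits")):
        return "Obesity" in cond
    if pair == frozenset(("Race", "Prednisone")):
        return "Visits" not in cond or "Obesity" in cond
    if pair == frozenset(("Obesity", "Prednisone")):
        return "Visits" not in cond
    return False  # directly connected pairs are never independent


adj, arrows = pc_skeleton(["Race", "Obesity", "Prednisone", "Visits"], oracle)
print(sorted(arrows))
```

In this toy case the collider at Visits is oriented, while the Race and Obesity edge stays undirected because both orientations belong to the same equivalence class.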
3 Results
3.1 Inferring causal graphs
We first applied the PC algorithm to the ICEES multivariate feature table. We show the inferred causal graphs in Fig. 3, first using the entire table with all eight features (Fig. 3a) and second using only the top four important features with respect to TotalEDInpatientVisits (Fig. 3b), as determined in section 2.1. Expected relationships between features based on subject matter expertise and published literature are represented as black (solid and dashed) lines. There are eight such expected edges, which we use to measure the structure-learning accuracy of the causal algorithm. Solid black lines represent expected edges that were reported by the PC algorithm (true positives), while dashed lines represent edges that were expected but missed (false negatives). Relationships inferred by the PC algorithm that were not expected are represented in red (false positives). We note that a few undirected edges were detected, for which the algorithm was not able to determine directionality.
Fig. 3. Inferred causal graph. Solid black lines represent true positives, dashed lines represent false negatives, and red lines represent false positives.
Three of eight expected edges were inferred. For the false positive edges, we conducted a further literature survey to find multiple (5+) citations in which a relationship between the features was reported. Two of the three additional edges detected were found in the literature; hence, we marked them as reported edges, along with the true positives. The expected directed edge Race → TotalEDInpatientVisits was also missed.
Using the feature importance results from section 2.1, we queried the OpenAPI a second time to generate a multivariate table containing only the top four important features (Prednisone, Race, Obesity, and RoadwayExposure) with respect to TotalEDInpatientVisits, as identified by random forest (Fig. 3b). Accuracy improved: three of five expected edges were detected, compared with three of eight for the full table. An undirected edge between Race and ObesityDx was also detected; these features are reported in the literature as highly associated.
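The structure-learning accuracy used above can be made concrete by comparing an expected edge set with an inferred one. The sketch below is illustrative only: the edge lists are hypothetical, the function name `structure_metrics` is invented, and matching edges regardless of direction is an assumption of this example rather than a rule stated in the text.

```python
def structure_metrics(expected, inferred):
    """Compare two edge lists, ignoring edge direction (an assumption here)."""
    exp = {frozenset(e) for e in expected}
    inf = {frozenset(e) for e in inferred}
    tp, fn, fp = len(exp & inf), len(exp - inf), len(inf - exp)
    return {"tp": tp, "fn": fn, "fp": fp,
            "recall": tp / len(exp), "precision": tp / len(inf)}


# Hypothetical edge sets, not the edges reported in Fig. 3.
expected = [("Obesity", "Visits"), ("Prednisone", "Visits"),
            ("Race", "Visits"), ("PM2.5", "Visits")]
inferred = [("Obesity", "Visits"), ("Prednisone", "Visits"), ("Sex", "Obesity")]

metrics = structure_metrics(expected, inferred)
print(metrics["tp"], metrics["fn"], metrics["fp"], metrics["recall"])
```

Applied to the counts reported above, the fraction of expected edges recovered is 3/8 = 0.375 for the eight-feature graph and 3/5 = 0.6 for the four-feature graph.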
3.2 Effects of Intervention
Having learned a causal network from the data, we now use it to answer relevant questions by making inferences. To evaluate this, we computed the effects of interventions on features by modifying the network to simulate interventions. Firstly, because some of the edges detected in Fig. 3a were undirected, we removed them. We then learned the parameters of our learned causal DAG given the network structure and the data. Next, we constructed a mutilated network to simulate a perfect intervention by setting a target node to a particular value. Finally, we tested the effects of these interventions, while verifying the correctness of the learned causal network, to substantiate some commonly known causal links like the following expected claims:
- Claim (a). Obesity should have a direct effect on TotalEDInpatientVisits. Hence, conducting an intervention on the node “Obesity” should reflect a change (increase or decrease, accordingly) in the probability distribution of TotalEDInpatientVisits.
- Claim (b). Prednisone should have a direct effect on TotalEDInpatientVisits. Hence, conducting an intervention on the node “Prednisone” should reflect a change (increase or decrease, accordingly) in the probability distribution of TotalEDInpatientVisits.
- Claim (c). Sex2 should not have a direct effect on TotalEDInpatientVisits. Hence, conducting an intervention on the node “Sex2” should not reflect a change (increase or decrease, accordingly) in the probability distribution of TotalEDInpatientVisits.
We conducted these three interventions on our learned causal network. To test Claim (a), we created a mutilated network by fixing the state of ObesityDx to 1, which means we forced ObesityDx to be present. For Claim (b), we fixed the state of Prednisone to 1, again meaning that we forced prednisone to be present. For Claim (c), we fixed the state of Sex2 to Male. Next, we compared the probability distribution of TotalEDInpatientVisits before and after these three ad hoc interventions to confirm the expected causal influences. We plotted the changes in the probability distribution of TotalEDInpatientVisits in Fig. 4. As expected, there were changes in the probability distribution of TotalEDInpatientVisits for interventions a and b, reflected in Fig. 4a and b, respectively. For intervention c, the changes before and after intervention were negligible, meaning that Sex2 had no causal effect on the frequency of TotalEDInpatientVisits.
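The mutilated-network procedure described above can be sketched with a toy discrete network. Everything in this example is hypothetical: the DAG (Race → Obesity → Visits ← Prednisone, with Sex left isolated, as happens when undirected edges are dropped), the binary coding of each variable, and every probability. It is meant only to show how a perfect intervention cuts the edges into the target node and clamps its value.

```python
from itertools import product

# Hypothetical toy DAG and CPTs (all numbers invented for illustration).
parents = {
    "Race": (), "Sex": (), "Prednisone": (),
    "Obesity": ("Race",),
    "Visits": ("Obesity", "Prednisone"),
}
p_one = {  # P(node = 1 | parent states), keyed by tuples of parent states
    "Race": {(): 0.4}, "Sex": {(): 0.5}, "Prednisone": {(): 0.2},
    "Obesity": {(0,): 0.2, (1,): 0.5},
    "Visits": {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.6},
}


def mutilate(parents, p_one, target, value):
    """Perfect intervention do(target = value): cut incoming edges, clamp the node."""
    new_parents = dict(parents, **{target: ()})
    new_p_one = dict(p_one, **{target: {(): float(value)}})
    return new_parents, new_p_one


def prob_visits(parents, p_one):
    """P(Visits = 1) by brute-force enumeration of the factored joint."""
    nodes = list(parents)
    total = 0.0
    for states in product((0, 1), repeat=len(nodes)):
        a = dict(zip(nodes, states))
        if a["Visits"] != 1:
            continue
        prob = 1.0
        for node, pa in parents.items():
            p1 = p_one[node][tuple(a[p] for p in pa)]
            prob *= p1 if a[node] == 1 else 1.0 - p1
        total += prob
    return total


baseline = prob_visits(parents, p_one)
after = {t: prob_visits(*mutilate(parents, p_one, t, 1))
         for t in ("Obesity", "Prednisone", "Sex")}
print(round(baseline, 3), {t: round(p, 3) for t, p in after.items()})
```

With these made-up numbers, clamping Obesity or Prednisone shifts P(Visits = 1) away from the baseline, whereas clamping Sex leaves it unchanged because the toy graph has no directed path from Sex to Visits.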
Fig. 4. Effect of intervention on (a) Obesity, (b) Prednisone, and (c) Sex: change in the probability distribution of TotalEDInpatientVisits before (red) and after (blue) intervention.
4 Discussion
We demonstrated the ability to use the ICEES OpenAPI to answer important questions about causal relationships between factors affecting asthma attacks. We focused on a large cohort of patients with asthma or related conditions and a dataset that included data derived from EHRs and a variety of public sources of environmental exposures data. We applied the PC algorithm, a constraint-based causal learning algorithm, to the dataset and identified prednisone, race, and obesity as significant predictors of annual ED or inpatient visits for respiratory issues, followed by residential distance from a major roadway/highway, airborne particulate exposure, and sex. Of those, prednisone and obesity were found to be causally related to annual ED or inpatient visits in our causal inference model, and sex and race were found to be indirectly related to annual ED or inpatient visits via a causal relationship to obesity. On a smaller dataset, comprising only the four most important features as determined by random forest analysis, we identified residential distance from a major roadway/highway as an additional variable that is causally related to annual ED or inpatient visits for respiratory issues.
We validated our findings based on expert knowledge and prior published literature. Most of our results are consistent with previously published literature [12]. For instance, prednisone, which is commonly prescribed for patients who are non-responsive to first-line treatments such as inhaled albuterol [13], has been identified as a factor associated with asthma exacerbations and ED or inpatient visits for respiratory issues [14]. Female sex, obesity, and African American race have previously been identified as factors that contribute to asthma attacks [15]. In another work by our group [4] and others [16], obesity and sex have been found to be highly related to asthma attacks. Several other works [17, 18] have additionally found a significant association between African American race and increased risk of asthma attacks. Exposure to major roadways or highways has also been found to be a risk factor for asthma. Several studies [19, 20] have demonstrated an increase in asthma attacks among patients residing in close proximity to a major roadway or highway. Our findings on the relationship between roadway exposures and asthma exacerbations have been inconsistent, with evidence to support [14] and negate [12] a relationship.
One factor that we expected to find in our model as causally related to asthma attacks, but did not, is exposure to airborne particulate matter. Exposure to airborne particulate matter is a well-established trigger for asthma attacks [3, 12, 14, 15, 21]. The failure to detect a causal relationship between exposure to airborne particulate matter and asthma attacks likely reflects the imbalance in the distribution of patients across bins. Indeed, we are actively refining both our exposure models and our binning strategy. For instance, instead of using a Python algorithm to bin the airborne pollutant exposures, we are considering a binning strategy based on subject matter expertise.
5 Conclusion
EHR data, while rich sources of important clinical information, are mostly observational and generally challenging to access due to regulatory constraints. Performing real-world interventions is not only costly but often impractical, given the need to integrate large data sources across various domains. Causal inference provides an excellent tool to simulate clinical interventions and answer questions about the effects of medical and healthcare interventions. In this study, we used the regulatory-compliant open ICEES service to generate a multivariate feature table, apply a causal inference model, and conduct simulated interventions to explore the influence of key demographic factors and environmental exposures on asthma attacks. Our results were largely consistent with expectations based on subject matter expert opinion and published literature. As part of our future studies, we are expanding our causal inference modeling to include additional features and additional years of data in order to reflect the underlying causal relationships at a larger scale, while supporting additional use cases, including primary ciliary dyskinesia and other rare respiratory disorders.
5.1 Availability
The ICEES asthma OpenAPI can be accessed at https://icees-asthma.renci.org/apidocs. The associated public GitHub repositories include: https://github.com/ExposuresProvider/icees-api; https://github.com/NCATS-Tangerine/FHIR-PIT; https://github.com/NCTraCSIDSci/camp-fhir.
Data Availability
All data produced in the present work are contained in the manuscript.
6 Funding Support
This project was funded with awards from the National Center for Advancing Translational Sciences, National Institutes of Health [OT3TR002020, OT2TR003430, UL1TR002489, UL1TR002489-03S4, OT2TR003428].
7 Acknowledgements
The authors wish to acknowledge Stanley C. Ahalt, Director of the Renaissance Computing Institute, for support and advice on the work described herein; David B. Peden for his expertise on the asthma use case; Emily R. Pfaff and James Champion for their help with the patient data; and Sarav Arunachalam, Stephen A. Appold, Alejandro Valencia Arias, and Lisa Stillwell for their help with the environmental exposures data.