Transport-based transfer learning on Electronic Health Records: Application to detection of treatment disparities
=================================================================================================================

* Wanxin Li
* Yongjin P. Park
* Khanh Dao Duc

## Abstract

Many healthcare systems increasingly recognize the opportunities Electronic Health Records (EHRs) promise. However, EHR data sampled from different population groups can easily introduce unwanted biases, rarely permit individual-level data sharing, and make the data and fitted model hardly transferable across different population groups. In this paper, we propose a novel framework that leverages unbalanced optimal transport to facilitate the unsupervised transfer learning of EHRs between different population groups using a model trained in an embedded feature space. Upon deriving a theoretical bound, we find that the generalization error of our method is governed by the Wasserstein distance and unbalancedness between the source and target domains, as well as their labeling divergence, which can be used as a guide for binary classification and regression tasks. Our experiments, conducted on experimental datasets from the MIMIC-III database, show that our transfer learning strategy significantly outperforms standard and machine learning transfer learning methods with respect to accuracy and computational efficiency. Upon applying our framework to predict hospital duration for populations with different insurance plans, we finally find significant disparities across groups, suggesting our method as a potential tool to assess fairness in healthcare treatment.

Keywords
*   Electronic Health Records
*   Transfer learning
*   Optimal transport
*   Treatment disparities

## 1 Introduction

An Electronic Health Record (EHR) database is a digital platform that securely stores and manages comprehensive health information for individual patients, offering healthcare providers quick and efficient access to crucial medical data. Building a comprehensive and unbiased database of EHRs is a crucial first step to precision and personalized medicine [Allen et al., 2012]. Comprehensive EHR databases generally achieve higher accuracy, avoiding potential issues of duplicate records and providing transparency among healthcare professionals, even across different healthcare systems [Menachemi and Collum, 2011]. Not just providing accurate information for each patient, a compendium of EHRs can serve as an important data set for augmenting human intelligence so that medical professionals can make informed decisions in everyday practice [Wagner et al., 2020].

In practice, it has been suggested that each EHR database was built for a different purpose to maximize its utility based on the needs [Kashani and Herasevich, 2015], making certain medical conditions highly, if not only, prevalent in specific studies and population groups [Woolf et al., 1955]. Having such an unequal distribution of medical conditions across different databases impedes our ability to diagnose rare conditions isolated within a specific population group. For example, alpha-1 antitrypsin deficiency, which affects the lungs and liver, is relatively uncommon in the general population but has higher prevalence rates in certain ethnic groups, such as those of Northern European descent [de Serres et al., 2007]. The variation in prevalence poses diagnostic challenges for patients in areas where the condition is less common. Moreover, unlike any other field of data science, data integration is practically not an option due to patient privacy and some hidden interests of stakeholders [Haas et al., 2011], which makes this problem even more challenging.

Overcoming this challenge introduces the need to transfer knowledge learned from data-rich population groups to data-rare population groups, which can be expressed as unsupervised *transfer learning* (TL) [Ganin and Lempitsky, 2015], in the case where labels are only available for the data-rich population group. Interestingly, methods that leverage tools from *Optimal Transport* (OT) have recently been proposed for unsupervised TL, as it aims to align and solve discrepancies between data distributions [Torres et al., 2021]. In the context of EHRs, unsupervised TL methods should also be careful of their specific characteristics, including heterogeneity, high-dimensionality and sparsity [Gupta et al., 2020]. In this paper, we hence introduce ***O****ptimal* ***T****ransport-based* ***T****ransfer learning for* ***E****lectronic* ***H****ealth* ***R****ecords, OTTEHR*, a novel method that leverages feature embeddings from EHRs and OT to perform unsupervised TL between unbalanced domains. Our contributions are as follows.

1.  We introduce a method *OTTEHR* to enable unsupervised TL by applying barycentric projection from OT between unbalanced domains. Using experimental data from the MIMIC-III database, we show that our method outperforms standard and recent unsupervised TL methods with respect to accuracy and computational efficiency.

2.  We establish a theoretical upper bound for the generalization error of *OTTEHR* and decompose it into a source error and labeling divergence terms that are universal, and a specific transport term, that allows us in practice to assess the suitability of our method on specific datasets.

3.  Upon applying *OTTEHR* in the context of predicting patient duration in hospital using medical codes across different groups (e.g., insurance), we show that our method allows us to quantify treatment disparities, suggesting potential applications to uncovering treatment biases among subgroups and improving patient care.

## 2 Related Work

### Application of OT-based transfer learning in healthcare

To the best of our knowledge, our paper is the first OT-based TL study for EHRs where medical codes (e.g. ICD codes) are used as explanatory variables. OT-based semi-supervised and unsupervised transfer learning has been previously explored in the context of EHRs for sepsis prediction where vital signs and laboratory values are used as explanatory variables [Ding et al., 2023, Wang et al., 2022a]. More generally, Gautheron presented a method for TL using OT in the context of mapping prostate cancer between two magnetic resonance images generated by different scanners [Gautheron, 2017]. Chen *et al*. introduced a semi-supervised transfer learning approach that utilizes OT and frequency mixup to enhance the performance of electroencephalography (EEG)-based motor imagery recognition [Chen et al., 2022]. Liu *et al*. introduced a method that leverages TL and OT to enhance the performance of P300 detection, a widely used marker of cognitive function in brain-computer interfaces [Liu et al., 2020].

### Medical record embedding

As medical records often contain high-dimensional data, it is crucial to embed important information into low-dimensional vectors as a preprocessing step. Medical record embedding methods can be distinguished as supervised and unsupervised [Miotto et al., 2016]: Supervised methods estimate a latent space for a specific task, such as predicting treatment outcomes, the duration of hospitalization, or socioeconomic impacts [Choi et al., 2017, Che et al., 2016, 2018], while unsupervised methods seek to locate patient data vectors without label information, based solely on high-dimensional EHR information. In our proposed framework, we leverage Principal Component Analysis (PCA) to obtain a basic embedding.

### Error upper bound for OT-based TL

Ben-David *et al*. defined a formal model of transfer learning, also known as domain adaptation, in the case of binary classification [Ben-David et al., 2006], and derived a theoretical upper bound on the error of target data [Ben-David and Urner, 2014, Ben-David et al., 2010], which can be determined by various factors, including the error rates inherent to the source data, the divergence between the source and target distributions, and unwanted bias introduced by site-specific labeling mechanisms. Similarly, we extend the formal model of transfer learning to regression tasks and prove the upper bound on the error of the target data in the context of unbalanced OT. Courty *et al*. extended these concepts to explore the target error bound of balanced OT in a joint training process [Courty et al., 2017]. In comparison, our work focuses on a two-phase training process with unbalanced optimal transport. The two-phase training process offers more stability and ease of training. By separating the training into two phases, each loss function can be optimized more efficiently without the need to balance between them, which can be challenging in a joint loss function scenario [Malik et al., 2021, Li et al., 2022]. Also, in the context of EHRs, datasets of different groups are rarely balanced, and the presence of clinical outliers may distort the OT plan. We therefore use “unbalanced” OT to improve the robustness of our framework.

### TL approaches

In the context of standard statistical methods, Correlation Alignment (*CA*) [Sun et al., 2017], Transfer Component Analysis (*TCA*) [Pan et al., 2010], Euclidean space data alignment [He and Wu, 2019] and Geodesic Flow Kernel (*GFK*) [Gong et al., 2012] tackle TL by aligning statistical properties between source and target domains. These methods are suitable for both classification and regression tasks. In the context of machine learning, OT-based TL has, to the best of our knowledge, been studied exclusively for classification tasks, and more specifically tested on image classification tasks. OT-based transfer learning methods such as Joint Distribution Adaptation [Long et al., 2013], Deep Joint Distribution Optimal Transport (*deepJDOT*) [Damodaran et al., 2018], and Class-aware Sample Reweighting [Wang et al., 2023] leverage OT to reduce the Wasserstein distance between source and target domains, thereby aligning their distributions. Additionally, approaches such as Decomposed Transport Distance further refine these methodologies by introducing a decomposed distance metric within the optimal transport framework [Wang et al., 2022b]. On the other hand, non-OT-based TL has been studied for both classification and regression tasks. For classification tasks, moment-matching methods reduce the distribution discrepancies by matching statistics from two distinct distributions [Long et al., 2015, 2017, Maria Carlucci et al., 2017]. Adversarial learning methods minimize the distribution discrepancy by optimizing a selected function over the hypothesis space, concurrently learning feature representations to bridge the gap between domains [Ganin and Lempitsky, 2015, Tzeng et al., 2015, Ganin et al., 2016, Luo et al., 2017, Long et al., 2018, Zhang et al., 2019, Peng et al., 2019]. For regression tasks, the most recent TL methods, using Representation Subspace Distance (*RSD*) [Chen et al., 2021] and inverse Gram matrices (*daregram*) [Nejjar et al., 2023], learn a shared feature extractor by minimizing some discrepancies between source and target features.

## 3 Methods

### 3.1 Background

#### 3.1.1 Optimal Transport

OT aims to solve a general transport problem where we consider moving data points in one distribution of mass to another at a minimal cost. In its discrete version, the OT problem can be formulated as follows: Consider two distributions $\mu_A$ and $\mu_B$ with point supports $\mathbf{A} = \{\mathbf{a}_i \in \mathbb{R}^k,\ i = 1, \dots, n\}$ and $\mathbf{B} = \{\mathbf{b}_j \in \mathbb{R}^k,\ j = 1, \dots, m\}$ and probability mass functions $\phi_A$ and $\phi_B$, respectively. We define a cost function $d : \mathbf{A} \times \mathbf{B} \to \mathbb{R}_{\geq 0}$ (e.g., the Euclidean distance between $\mathbf{a}_i$ and $\mathbf{b}_j$), and the associated objective function ![Formula][1] where $\pi \in \mathrm{Mat}_{m,n}(\mathbb{R}_{\geq 0})$ and $p \in [1, +\infty]$. Minimizing this objective function with the marginal constraints ![Graphic][2] and ![Graphic][3] defines the classical OT problem. To deal with the computational cost of the classical OT formulation and handle distributions with different mass, *unbalanced entropy-regularized OT* was recently introduced [Pham et al., 2020] by adding additional regularization constraints to Equation (1) as ![Formula][4] where ![Graphic][5], $\lambda$ is the regularization parameter; $D_\varphi(\alpha, \beta)$ is the Csiszár $\varphi$-divergence and is given by, assuming that the discrete measures ![Graphic][6] and ![Graphic][7] share the same support $\{x_i : i = 1, \dots, N\}$, ![Formula][8] Minimizing $\Phi$ defines the *regularized unbalanced p-Wasserstein distance*, as ![Formula][9] where $\Pi$ is the set of positive matrices (that in this context define transport plans between $\mu_A$ and $\mu_B$), ![Graphic][10], and the minimizer of Equation (3) gives the *OT plan*. In the rest of the manuscript, we will work with $p = 1$, and will simply refer to the regularized unbalanced 1-Wasserstein distance as the *Wasserstein distance* for simplicity.
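
Written out under the standard conventions for discrete unbalanced entropic OT, the quantities of this subsection take roughly the following form. This is a sketch and not a transcription of the displayed equations: the symbol $\tau$ for the mass regularization weight, the form of the entropy $H(\pi)$, the $1/p$ exponent, and the convention that rows of $\pi$ index $\mathbf{A}$ are all assumptions.

```latex
\begin{align*}
C_p(\pi) &= \sum_{i=1}^{n}\sum_{j=1}^{m} d(\mathbf{a}_i,\mathbf{b}_j)^p\,\pi_{i,j},
  \qquad \sum_{j=1}^{m}\pi_{i,j}=\phi_A(\mathbf{a}_i),
  \quad \sum_{i=1}^{n}\pi_{i,j}=\phi_B(\mathbf{b}_j) \\
\Phi(\pi) &= C_p(\pi) - \lambda H(\pi)
  + \tau\, D_\varphi(\pi_A \mid \mu_A) + \tau\, D_\varphi(\pi_B \mid \mu_B),
  \qquad \pi_A = \pi \mathbf{1}_m,\ \ \pi_B = \pi^{\top}\mathbf{1}_n \\
H(\pi) &= -\sum_{i,j} \pi_{i,j}\,(\log \pi_{i,j} - 1) \\
D_\varphi(\alpha \mid \beta) &= \sum_{i=1}^{N} \varphi\!\left(\frac{\alpha_i}{\beta_i}\right)\beta_i \\
\mathcal{W}_p(\mu_A,\mu_B) &= \Big(\min_{\pi \in \Pi} \Phi(\pi)\Big)^{1/p},
  \qquad \pi^{*} = \operatorname*{arg\,min}_{\pi \in \Pi} \Phi(\pi)
\end{align*}
```

With $\varphi(t) = t\log t - t + 1$ the divergence $D_\varphi$ reduces to the Kullback-Leibler divergence, and for $p = 1$ the exponent $1/p$ plays no role.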

Upon finding $\pi^*$, and assuming ![Graphic][11], the transported features of $\mathbf{a}_i$ can then be derived by *barycentric projection*, as ![Formula][12] *Remark*: When ![Graphic][13], there is no mass transported from $\mathbf{a}_i$, and hence the barycentric projector is not defined.
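
The barycentric projection can be illustrated with a short sketch. It assumes the common weighted-average form, in which a point is mapped to the plan-weighted mean of the points it sends mass to; the function name `barycentric_projection` and the toy plan are hypothetical and are not taken from the *OTTEHR* code.

```python
from typing import List, Optional

Vector = List[float]


def barycentric_projection(plan_row: List[float], targets: List[Vector]) -> Optional[Vector]:
    """Map one point to the plan-weighted average of the points it sends mass to.

    plan_row[j] is the mass sent from the point to targets[j].
    Returns None when the point sends no mass, since the projection
    is then undefined (see the remark above).
    """
    mass = sum(plan_row)
    if mass <= 0.0:
        return None
    dim = len(targets[0])
    return [
        sum(weight * target[d] for weight, target in zip(plan_row, targets)) / mass
        for d in range(dim)
    ]


# Toy example: the first point splits its mass evenly between two targets,
# the second point sends no mass at all.
toy_plan = [[0.2, 0.2], [0.0, 0.0]]
toy_targets = [[0.0, 0.0], [2.0, 4.0]]
projected = [barycentric_projection(row, toy_targets) for row in toy_plan]
```

Returning `None` for a zero-mass row is one way to surface the undefined case; an implementation could equally drop such points or leave them in place.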

#### 3.1.2 Theoretical framework for Transfer Learning

We extend the theoretical framework of TL for binary classification [Ben-David et al., 2010] to include regression tasks as follows: Consider two spaces, a *source space*, represented by a distribution $\mu_S$ on a *source domain* $D_S$ and a labeling function $f_S$, and a *target space*, represented by a distribution $\mu_T$ on a *target domain* $D_T$ and a labeling function $f_T$. The label domain is $\{0, 1\}$ for binary classification and $\mathbb{R}$ for regression. We define a binary classifier/regressor as a function $h : D_S \to \{0, 1\}$ or $\mathbb{R}$, so the probability that the binary classifier/regressor $h$ disagrees with a labeling function $f$ according to the distribution $\mu_S$ is ![Formula][14] where $|\cdot|$ is the $l_1$ norm metric on $\mathbb{R}$. In particular, the source error of $h$ is $\epsilon_S(h) = \epsilon_S(h, f_S)$; and similarly, we define the target error, also known as the generalization error of $h$, as ![Formula][15] Then, the problem of TL via OT aims to fine-tune a regressor $h^*$, learned from the source domain, into $h^{*\prime}$ so that the resulting target error $\epsilon_T(h^{*\prime})$ is minimized.
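
For a discrete distribution, an error of this kind can be evaluated directly. The sketch below assumes the usual form $\epsilon(h, f) = \mathbb{E}_{x \sim \mu}\,|h(x) - f(x)|$, which matches the description above but is not copied from the displayed equation; the function name `expected_abs_error` is hypothetical.

```python
from typing import Callable, Sequence


def expected_abs_error(
    h: Callable[[float], float],
    f: Callable[[float], float],
    points: Sequence[float],
    masses: Sequence[float],
) -> float:
    """Mass-weighted absolute disagreement between a model h and a labeling function f.

    points are the support of the discrete distribution and masses the
    probability mass at each point (assumed to sum to 1).
    """
    return sum(mass * abs(h(x) - f(x)) for x, mass in zip(points, masses))


# Toy example: the model is off by exactly one unit everywhere.
toy_error = expected_abs_error(
    h=lambda x: 2.0 * x,
    f=lambda x: 2.0 * x + 1.0,
    points=[0.0, 1.0],
    masses=[0.25, 0.75],
)
```

For binary labels in $\{0, 1\}$ the same expression gives the probability of disagreement, which is why a single definition covers both classification and regression.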

### 3.2 Our Approach: *OTTEHR*

We propose our method *OTTEHR*, short for *Optimal Transport-based Transfer Learning for Electronic Health Records*, to address TL problems in EHR data using OT. We set up our notations as follows. Let $\mathcal{X}$ be the input domain and $\mathcal{Y}$ be the label domain. Let $(\mathbf{X}_S, \mathbf{Y}_S)$ denote the pair of source features and source labels and $(\mathbf{X}_T, \mathbf{Y}_T)$ denote the pair of target features and target labels, such that ![Graphic][16] where $n$ and $m$ are the numbers of points in source and target domains and ![Graphic][17]. We preprocess the numerical medical codes into indicator arrays so that the $l$-th coordinate of $\mathbf{x}_i$ denotes the presence of the $l$-th code, using 0 or 1. We assume the unknown target labels $\mathbf{Y}_T$ are also in the same space as $\mathbf{Y}_S$.
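
The indicator-array preprocessing can be sketched as follows. The function name `build_indicator_arrays`, the choice to order codes by sorting, and the example code strings are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def build_indicator_arrays(
    admission_codes: Sequence[Sequence[str]],
) -> Tuple[List[str], List[List[int]]]:
    """Turn per-admission code lists into 0/1 indicator rows.

    Returns the code vocabulary (sorted, so column l corresponds to
    vocabulary[l]) and one indicator row per admission.
    """
    vocabulary = sorted({code for codes in admission_codes for code in codes})
    column = {code: l for l, code in enumerate(vocabulary)}
    rows = []
    for codes in admission_codes:
        row = [0] * len(vocabulary)
        for code in codes:
            row[column[code]] = 1  # presence of the code, not its count
        rows.append(row)
    return vocabulary, rows


# Three toy admissions; the last one has no recorded codes.
vocabulary, X = build_indicator_arrays([["4019", "25000"], ["4019"], []])
```

Because the vocabulary is built from the admissions passed in, source and target admissions need to be encoded together (or against a shared vocabulary) for their columns to line up.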

*OTTEHR* proceeds in four main steps, also shown in Figure 1:

![Figure 1:](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2024/03/28/2024.03.27.24304781/F1.medium.gif)

[Figure 1:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F1)

Figure 1: The framework of *OTTEHR*. The four major steps (1) feature embedding, (2) model training, (3) domain transportation and (4) model prediction are in bold. Details of these four steps can be found in Section 3.2. The indicator arrays consist of 0s and 1s where the $(i, l)$-th entry indicates the presence of the $l$-th medical code for the $i$-th source/target feature. The orange dotted lines denote the trained source classifier from step (2). The black dotted lines denote the OT plan learned from step (3). The colors of the points on the embedding spaces represent different classes. Note that when using *OTTEHR*, *(i)* the labels for the target features are not required and are presented here for visualization purposes; *(ii)* the source model can be either a classifier or a regressor.

#### (1) Feature embedding

Feature embedding is important to process EHRs as *(i)* EHRs are of high dimension, and the classifier/regressor $h^*$ in step (2) and the OT plan $\pi^*$ in step (3) cannot be learned well without proper dimensions; *(ii)* EHRs are also sparse, and feature embedding can be done effectively to resolve *(i)*.

To perform feature embedding, we run PCA on the source features, which yields a mapping $g : \mathbb{R}^q \to \mathbb{R}^k$ with $k \ll q$, where $k$ is chosen to explain approximately 75% of the variance in the source domain. Applying $g$ to source features and target features then yields the source embedding and target embedding as ![Formula][18] Note that our procedure allows PCA to be replaced by more advanced dimensionality reduction methods, as discussed in Section 6.
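
One simple way to pick $k$ from a variance target is sketched below. It assumes the per-component variances (PCA eigenvalues) are already available; the function name `choose_num_components` and the default threshold argument are hypothetical, and the rule shown is only one plausible reading of "approximately 75%".

```python
from typing import Sequence


def choose_num_components(component_variances: Sequence[float], target_ratio: float = 0.75) -> int:
    """Smallest k whose leading components explain at least target_ratio of the variance."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    if total <= 0.0:
        raise ValueError("component variances must have a positive sum")
    cumulative = 0.0
    for k, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= target_ratio:
            return k
    return len(ordered)


# Toy spectrum: the first two components explain 70%, the first three 90%.
k = choose_num_components([4.0, 3.0, 2.0, 1.0])
```

The mapping $g$ is fitted on source features only and then applied unchanged to target features, so the same $k$ and the same projection are used on both sides.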

#### (2) Model training

We train a classification or regression model $h^* : D_S \to Y_S$ using the source embedding $D_S$ and source labels $Y_S$.

#### (3) Domain transportation

We transport all target embeddings in $D_T$ onto the source embedding domain $D_S$ using Equation (4), so the set of transported target embeddings is ![Formula][19] Note that in the context of EHRs, datasets of different groups are rarely balanced. In addition, the presence of clinical outliers for EHR may distort the OT plan. Hence, we utilize “unbalanced” OT to improve the robustness of *OTTEHR*. Since the model training and domain transportation are performed separately, our framework is a two-phase training process compared to some previous works [Courty et al., 2017, Damodaran et al., 2018].

#### (4) Model prediction

The source model is applied to the transported target embedding to obtain the “projected” target labels. We predict the label ![Graphic][20] for all ![Graphic][21], using ![Formula][22]

### 3.3 Implementation details

We implemented *OTTEHR* in Python 3, with the source code available at this link. To solve OT problems, our package depends on the *POT* library [Flamary et al., 2021]. We ran our experiments on a workstation with 32 central processing units powered by AMD Ryzen Threadripper 2950X 16-Core Processor, 125GB of RAM, and x86_64 Ubuntu 20.04.5 LTS. We used Euclidean distance ($l_2$ norm) as the cost function for OT and Kullback-Leibler divergence as the mass divergence $D_\varphi$. For OT hyperparameters, we set the entropic regularization parameter $\lambda$ to 0.1 and the mass regularization parameter to 1.
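
To make the role of the two hyperparameters concrete, the following is a minimal pure-Python sketch of the standard scaling iteration for entropic unbalanced OT with Kullback-Leibler mass penalties. It is not the *OTTEHR* implementation, which relies on *POT*; the function name `unbalanced_sinkhorn`, the argument names and the fixed iteration count are assumptions, and only the default values 0.1 and 1 come from the text above.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def unbalanced_sinkhorn(
    a: Sequence[float],
    b: Sequence[float],
    cost: Sequence[Sequence[float]],
    reg: float = 0.1,
    reg_mass: float = 1.0,
    n_iter: int = 1000,
) -> Matrix:
    """Transport plan for entropic OT with KL-relaxed marginals.

    a, b     : masses of the two point sets (need not sum to the same value)
    cost     : cost[i][j] between point i of the first set and point j of the second
    reg      : entropic regularization weight
    reg_mass : weight of the KL penalty on each marginal
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    # Exponent < 1 softens the marginal constraints; it tends to 1 (balanced
    # Sinkhorn) as reg_mass grows.
    exponent = reg_mass / (reg_mass + reg)
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        u = [(a[i] / sum(kernel[i][j] * v[j] for j in range(m))) ** exponent for i in range(n)]
        v = [(b[j] / sum(kernel[i][j] * u[i] for i in range(n))) ** exponent for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem with unequal total mass on the two sides.
toy_cost = [[0.0, 1.0], [1.0, 0.0]]
toy_plan = unbalanced_sinkhorn([0.5, 0.5], [1.0, 1.0], toy_cost)
```

A larger `reg_mass` pushes the plan's marginals toward the prescribed masses, while a smaller value lets the plan discard mass, which is what makes the formulation tolerant of outliers and unequal group sizes.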

## 4 Experiments

In this section, we first introduce our datasets derived from the MIMIC-III database [Johnson et al., 2016], and then describe the setup of our experiments.

### 4.1 Dataset

The MIMIC-III database is a large, open-source database comprising anonymized health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [Johnson et al., 2016]. It is a relational database consisting of 26 tables and contains 8,922 unique ICD codes. We focused on the patient table and admission table. The patient table contains 46,520 patients, with 26,121 males and 20,399 females; each patient record includes patient ID, gender, expire flag, date of birth, etc. The admission table contains 58,976 admissions, and each admission record includes patient ID, admission type, admission time, and discharge time.

We first merged the admission table with the patient table by indexing unique patient IDs. For each admission, we calculated the duration in the hospital by taking the difference between the discharge time and the admission time (in seconds). We then treated the presence of each ICD code as the explanatory variable and used the duration in the hospital as the response. Finally, we divided the admissions into source and target by insurance groups or marital status groups. Insurance groups include “Self_Pay,” “Private,” “Government,” “Medicare,” “Medicaid.” Marital status groups include “Separated,” “Divorced,” “Married,” “Widowed,” “Single.”
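
The duration computation and the split by group can be sketched with the standard library alone. The column names `ADMITTIME`, `DISCHTIME` and `INSURANCE` and the timestamp format are assumed to follow the MIMIC-III admissions table, and the two records below are invented toy values, not data from the database.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Mapping, Sequence

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def duration_seconds(admit_time: str, discharge_time: str) -> float:
    """Duration in hospital: discharge time minus admission time, in seconds."""
    admit = datetime.strptime(admit_time, TIME_FORMAT)
    discharge = datetime.strptime(discharge_time, TIME_FORMAT)
    return (discharge - admit).total_seconds()


def durations_by_group(
    admissions: Sequence[Mapping[str, str]], group_key: str
) -> Dict[str, List[float]]:
    """Collect durations per group (e.g. per insurance type)."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for row in admissions:
        groups[row[group_key]].append(duration_seconds(row["ADMITTIME"], row["DISCHTIME"]))
    return dict(groups)


# Invented toy admissions for illustration only.
toy_admissions = [
    {"ADMITTIME": "2101-01-01 00:00:00", "DISCHTIME": "2101-01-02 12:00:00", "INSURANCE": "Private"},
    {"ADMITTIME": "2101-03-05 08:30:00", "DISCHTIME": "2101-03-05 20:30:00", "INSURANCE": "Medicaid"},
]
toy_groups = durations_by_group(toy_admissions, "INSURANCE")
```

Any group serving as the source contributes both features and durations, whereas a target group contributes features only at prediction time.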

### 4.2 Experimental Setup

We ran *OTTEHR* to predict duration in hospital using linear regression models on all possible pairs within insurance (marital status) groups, considering one group as the source and the other as the target, resulting in 40 distinct experiments (i.e., 5 *×* 4 *×* 2). We validated our choice of group by visualizing in Figure S1 and Figure S2 the t-distributed Stochastic Neighbor Embedding (t-SNE) [Van der Maaten and Hinton, 2008] of the feature embeddings between pairwise groups for all insurance and marital status groups, showing that most of the distinct groups form their own clusters, and differences in the feature embedding space between each pair of source and target groups are significant. For each pair of source and target groups, we randomly sampled 120 admissions (training) from the source group and 100 (testing) admissions from the target group, and we conducted the same type of experiment repeatedly 100 times. The number of ICD codes varies from 700 to 900 for each experiment depending on the selected admissions. During the feature embedding step, we convert these ICD codes into a 50-dimensional space (*k* = 50). We analyzed the empirical relationship between the target error of *OTTEHR* and other terms presented in the upper bound in Section 5.1. Then, we benchmarked *OTTEHR* against existing TL methods.
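
The count of 40 experiments follows from enumerating ordered (source, target) pairs within each grouping, as the sketch below shows. The helper names and the seeded sampler are hypothetical; only the group labels and the sample sizes 120 and 100 come from the text.

```python
import random
from itertools import permutations
from typing import List, Sequence, Tuple

INSURANCE_GROUPS = ["Self_Pay", "Private", "Government", "Medicare", "Medicaid"]
MARITAL_GROUPS = ["Separated", "Divorced", "Married", "Widowed", "Single"]


def ordered_pairs(groups: Sequence[str]) -> List[Tuple[str, str]]:
    """All (source, target) pairs with source != target; order matters."""
    return list(permutations(groups, 2))


def sample_admissions(admission_ids: Sequence[int], size: int, seed: int) -> List[int]:
    """Draw a reproducible random sample of admissions without replacement."""
    return random.Random(seed).sample(list(admission_ids), size)


experiments = ordered_pairs(INSURANCE_GROUPS) + ordered_pairs(MARITAL_GROUPS)

# One repetition for one pair: 120 source admissions and 100 target admissions,
# drawn here from placeholder ID ranges.
source_sample = sample_admissions(range(1000), size=120, seed=0)
target_sample = sample_admissions(range(1000, 2000), size=100, seed=0)
```

Using a different seed for each of the 100 repetitions would give independent resamples while keeping every run reproducible.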

## 5 Results

In this section, we first derive a theoretical upper bound on the target error for our method, as introduced in the methods Section 3.1.2. Using datasets from the MIMIC-III database, we validate the theoretical upper bound with empirical estimates obtained from our experiments. We then benchmark *OTTEHR*’s performance against its competitors, focusing on accuracy and computational cost. Finally, we use *OTTEHR* to reveal notable differences in duration in hospital based on insurance plans, suggesting *OTTEHR* as a potential tool to quantify treatment disparities.

### 5.1 Upper Bound for Binary Classification and Regression

To study the theoretical accuracy of *OTTEHR*, we derived an upper bound on the target error in the case of binary classification and regression. Our analysis focuses on Lipschitz continuous models. In practice, many models, including for EHRs, satisfy this condition [Goldstein et al., 2017, Clegg et al., 2016, Harutyunyan et al., 2019] or can be approximated as Lipschitz continuous [Bartlett et al., 2017, Nair and Hinton, 2010, Ioffe and Szegedy, 2015], and assuming Lipschitz continuity or linearity is common in theoretical studies of transfer learning [Tong et al., 2021, Tian and Feng, 2023, Cai et al., 2021].

As we adapt embeddings from the target domain to the source domain using the barycentric operator $T$, the fine-tuning of the learned function on the source $h^*$ yields $h^{*\prime} = h^* \circ T$. Assuming ![Graphic][23] (see Equation (4)), we can control the target error by the following theorem.

Theorem 1
(Upper bound for target error). *Let* $\mu_T$ ($\mu_S$) *be a discrete target (source) distribution defined on a domain* $D_T$ ($D_S$) *and with probability mass function* $\phi_T$ ($\phi_S$). *Let* $h^{*\prime} = h^* \circ T$, *where* $h^*$ *is Lipschitz continuous and* $T$ *is the barycentric projection (Equation (4)). The target error* $\epsilon_T(h^{*\prime})$ *defined by Equation (5) is bounded by:* ![Formula][24] *where* $K$ *is the Lipschitz constant for* $h^*$, $\pi^*$ *is the OT plan and* ![Graphic][25].

The proof can be found in Appendix A. Ignoring the defined constants $M$ and $K$, we interpret the upper bound as follows. The first term is the source model error evaluated on the target domain, illustrated by Figure S3. The second “transport term” is composed of two parts: *(i)* $\mathcal{W}_1(\mu_T, \mu_S)$, which denotes the Wasserstein distance between $\mu_T$ and $\mu_S$ (see Equation (3)), and *(ii)* ![Graphic][26], which denotes the variance between the mass at $x$ and the transported mass from $x$ to $D_S$, indicating the “unbalancedness” of the OT plan (the larger $D_\varphi(\pi_A|\mu_A)$ and $D_\varphi(\pi_B|\mu_B)$ in Equation (2), the larger this term will be). The last “labeling divergence” term ![Graphic][27] denotes the divergence between the source and target true labeling functions.

Note that when balanced OT is used rather than unbalanced OT, we have the marginal constraints ![Graphic][28], which eliminates the unbalancedness term from Theorem 1. Also note that when $h^*$ is linear, a special case of Lipschitz continuity, $K$ in the theorem can be substituted by ![Graphic][29].

This generalization error bound provides a guide for analyzing and interpreting the performance of our method. More precisely, if the transport term is significantly smaller than the source model error and labeling divergence, then using OT to map individuals to another group should be a suitable option for the transfer learning task. Note that the estimation of the labeling divergence requires access to ground-truth target labels ![Graphic][30] to estimate the labeling function on the target domain $f_T$, which might not be immediately available in practice. Nevertheless, we can mitigate this issue if we have partial access to the target labels or prior knowledge of $f_T$.

### 5.2 Predictive performance is dominated by the labeling divergence

To demonstrate the potential of our method to be applied to real data, we ran our method using an experimental dataset of ICD codes (see Section 4.1). More precisely, we focused on how the duration of hospitalization can be predicted by medical codes for different insurance or marital status groups (as detailed in Section 4.2), as a simple but relevant test case [Goldman and Smith, 2002, Umberson, 1987]. First, we compared the target errors of *OTTEHR*, guided by Theorem 1. Note that in this context, using the same model for estimating ![Graphic][31] as $h^*$ leads to ![Graphic][32] for all $x \in D_T$, and ![Graphic][33], so it was not necessary to study the relationship between the target error and the source model error on the target domain, leaving the transport term and the labeling divergence to be studied.

For insurance group experiments, as shown in Figure 2, we separately plotted the target error against the transport term and against the labeling divergence, combining the results of all insurance group experiments. We observed a strong correlation between the target error and the labeling divergence, with a Pearson correlation coefficient (PCC) of 0.70 (Figure 2.B), in contrast to a weaker correlation between the target error and the transport term, with a PCC of 0.16 (Figure 2.A). Marital status group experiments yield similar patterns, with a PCC of 0.63 for the labeling divergence (Figure S4.B) and 0.09 for the transport term (Figure S4.A). In both cases, we explain these results by the large difference in orders of magnitude between the labeling divergence ($10^8$) and the transport term (10). Further analysis of pairwise insurance (marital status) groups confirmed consistent trends (see Figures S5 to S8), indicating that in our experiments, the target error is overall dominated by the labeling divergence. As previously discussed in Section 5.1, since the transport term is significantly smaller than the labeling divergence, *OTTEHR* is suitable for solving these transfer learning tasks.
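
The Pearson correlation coefficient used in this comparison can be computed as in the sketch below. The function name `pearson_correlation` is hypothetical and the inputs are toy numbers, not values from our experiments.

```python
import math
from typing import Sequence


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Toy check: a perfectly linear relationship.
toy_pcc = pearson_correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Since the PCC is invariant to rescaling either variable, the difference in magnitude between the two bound terms affects how much each contributes to the target error, not how the coefficient itself is computed.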

![Figure 2:](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2024/03/28/2024.03.27.24304781/F2.medium.gif)

[Figure 2:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F2)

Figure 2: Bound analysis for all insurance group experiments. The relationship between target error and (**A**) transport term with a PCC of 0.16, and (**B**) labeling divergence with a PCC of 0.70, combining the results for all pairwise insurance group experiments.

### 5.3 Benchmarking of accuracy and computation time

To assess the empirical performance of *OTTEHR*, we benchmarked *OTTEHR* against the standard statistical methods *Transfer Component Analysis (TCA)* [Pan et al., 2010], *Correlation Alignment (CA)* [Sun et al., 2017], *Geodesic Flow Kernel (GFK)* [Gong et al., 2012], machine learning OT-based method *deepJDOT* [Damodaran et al., 2018], and machine learning non-OT-based methods *Representation Subspace Distance (RSD)* [Chen et al., 2021] and *inverse Gram matrices (daregram)* [Nejjar et al., 2023] (also see Section 2) on the predictions for target groups for the transfer learning tasks detailed in Section 4.2 using mean absolute error (*MAE*) and root mean square error (*RMSE*) [Chai and Draxler, 2014].
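
For reference, the two metrics can be written as a pair of short functions. The names below are hypothetical and the numbers are toy values.

```python
import math
from typing import Sequence


def mean_absolute_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """MAE: average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_square_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE: square root of the average squared difference."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


toy_observed = [1.0, 2.0, 3.0]
toy_predicted = [2.0, 2.0, 5.0]
toy_mae = mean_absolute_error(toy_observed, toy_predicted)
toy_rmse = root_mean_square_error(toy_observed, toy_predicted)
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both helps separate uniformly mediocre predictions from predictions that are mostly good with a few large misses.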

For insurance group experiments, the violin plots in Figure 3 show the log-transformed *MAE* and *RMSE*, with smaller values indicating better performance. We notably observed that *OTTEHR*’s median *MAE*/*RMSE* is smaller than those of *TCA, CA, GFK, RSD* and *daregram*, with comparable standard deviations. Although *OTTEHR*’s median *MAE*/*RMSE* is slightly larger than that of *deepJDOT*, its standard deviation is significantly smaller. We also provide detailed results in Tables S1 and S2 for pairwise insurance groups, showing the medians and standard deviations of *MAE*/*RMSE* for all the methods, and outperformance ratios of *OTTEHR* to other methods. Overall, *OTTEHR* achieved a 14% to 28% reduction in median *MAE/RMSE* compared to *TCA, CA, GFK, RSD* and *daregram*, and a 55% to 68% reduction in the standard deviation of the log-transformed *MAE/RMSE* compared to *deepJDOT*. Similar trends are observed in marital status group experiments, with *OTTEHR* outperforming *TCA, CA, GFK, RSD* and *daregram* by 14% to 26% in median *MAE/RMSE* and by 59% to 76% in the standard deviation of the log-transformed *MAE/RMSE* (refer to Figure S9 and Tables S3 and S4 for details).

![Figure 3:](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2024/03/28/2024.03.27.24304781/F3.medium.gif)

[Figure 3:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F3)

Figure 3: 
Benchmark results for all insurance group experiments. Violin plots of the log of (**A**) *MAE* and (**B**) *RMSE* between projected duration and observed duration on target admissions for insurance groups using *OTTEHR, TCA, CA, GFK, RSD* and *daregram*. The blue bars denote the medians. The heights of the violin plots denote the variability, positively correlated with the standard deviations. Smaller *MAE* and *RMSE* values indicate better performance.

In addition to accuracy, we compared *OTTEHR* with competing methods in terms of average computation time per experiment. Table S5 shows that *OTTEHR* is the fastest when compared with *TCA, CA, GFK, deepJDOT, RSD* and *daregram*. Specifically, *OTTEHR*’s runtime is on par with *TCA, GFK, RSD* and *daregram*, 24.8 (20116.7/810.8) times faster than that of *CA*, and 42.7 (34642.6/810.8) times faster than that of *deepJDOT*.

### 5.4 *OTTEHR* reveals treatment disparities based on insurance plans

After confirming the empirical performance of *OTTEHR*, we applied it to quantify treatment disparities based on insurance plans and predict the potential impact of transitioning between different plans. We focused on Medicaid, a collaborative program that assists individuals with limited income and resources [Gruber, 2003], Medicare, which provides coverage to people aged 65 and older, as well as to younger individuals with certain disabilities and diseases [Finkelstein and McKnight, 2008], and Private insurance, which is offered by various companies and provides greater flexibility in healthcare services [Cutler and Gruber, 1997]. People often switch from Medicaid and Medicare to private insurance due to factors such as reductions in federal matching funds or when their preferred healthcare providers fall out of their Medicare network [Foundation, 2023, Kaiser Family Foundation, 2023, 2021]. Conversely, transitions to Medicaid and Medicare from private insurance can result from changes in age, income, or health status [Long et al., 2014, Baicker and Chandra, 2006].

Specifically, we considered the projected duration obtained from transfer learning to be significantly reduced if it is at least 300 hours (12.5 days) shorter than the originally observed duration. Figure 4 shows kernel density estimate (KDE) plots of projected duration versus observed duration in hours for admissions transitioning (**A**) from Medicaid to private insurance, (**B**) from private insurance to Medicaid, (**C**) from Medicare to private insurance, and (**D**) from private insurance to Medicare, with blue dots denoting admissions with significantly reduced durations. The proportions of such admissions differ markedly across transitions. Our findings indicate that 13.1% of admissions would result in significantly reduced durations when transitioning from Medicaid to private insurance, compared to only 9.5% when transitioning from private insurance to Medicaid, a difference of 3.6 percentage points. In comparison, 9.0% of admissions would result in significantly reduced durations when transitioning from Medicare to private insurance, compared to 10.5% when transitioning from private insurance to Medicare, a difference of 1.5 percentage points. The larger bidirectional difference between private insurance and Medicaid suggests greater disparities between private insurance and Medicaid than between private insurance and Medicare. Such results can be generalized to more insurance groups (e.g., Government), as shown in Figure S10, where percentages of admissions with significantly reduced durations are shown across all pairs of insurance groups.
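The 300-hour criterion above can be sketched as follows. This is a minimal illustration, not the code used for Figure 4: the function names and the toy durations are hypothetical, and only the threshold and the reported percentages come from the text.

```python
THRESHOLD_HOURS = 300  # cutoff stated in the text; 300 / 24 = 12.5 days


def is_significantly_reduced(observed, projected, threshold=THRESHOLD_HOURS):
    """True if the projected stay is at least `threshold` hours shorter."""
    return projected <= observed - threshold


def reduced_percentage(observed_hours, projected_hours, threshold=THRESHOLD_HOURS):
    """Percentage of admissions whose projected duration is significantly reduced."""
    flags = [
        is_significantly_reduced(o, p, threshold)
        for o, p in zip(observed_hours, projected_hours)
    ]
    return 100.0 * sum(flags) / len(flags)


# Hypothetical toy admissions in hours (not taken from MIMIC-III).
observed = [400.0, 120.0, 900.0, 350.0]
projected = [90.0, 110.0, 650.0, 40.0]
print(reduced_percentage(observed, projected))  # 2 of 4 admissions -> 50.0

# Bidirectional gaps in percentage points, using the percentages reported above.
medicaid_gap = round(13.1 - 9.5, 1)   # Medicaid <-> private
medicare_gap = round(10.5 - 9.0, 1)   # Medicare <-> private
print(medicaid_gap, medicare_gap)
```

Comparing the two gaps in percentage points, rather than as relative changes, matches how the differences are stated in the paragraph above.
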

[Figure 4:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F4) Admissions with significantly reduced durations of stay in hospital when transitioning between private insurance and Medicaid/Medicare. KDE plots of the log of projected duration in hours versus the log of observed duration in hours for admissions transitioning (**A**) from Medicaid to private insurance, (**B**) from private insurance to Medicaid, (**C**) from Medicare to private insurance, and (**D**) from private insurance to Medicare. The dotted black lines denote projected duration = observed duration − 300 hours. The blue dots denote admissions with significantly reduced duration of stay in hospital, where the projected duration is at least 300 hours (12.5 days) less than the observed duration. The annotated percentages are the proportions of significantly reduced durations.

## 6 Discussion

This paper presents *OTTEHR*, an OT-based unsupervised TL framework for EHRs. While biased models can lead to incorrect diagnoses, treatments, and healthcare decisions [Chen et al., 2023, Mittermaier et al., 2023], *OTTEHR* can potentially alleviate these biases by leveraging OT when comparing different population groups. More precisely, our study establishes a theoretical upper bound for the generalization error. We decomposed this bound into general terms (namely the source error and the labeling divergence) that are shared by any transfer learning method, and a specific transport term that we can use in practice to evaluate the suitability of our method on real data, as shown in our application to the MIMIC-III dataset. We also note that all these terms are computable (as we did in Figures 2 and S4 to S8) or can be estimated if we have limited access to target labels or some prior knowledge about the target domain’s labeling function. Overall, our benchmarking suggests that in the context of EHRs, aligning probability distributions between the source and target domains can be highly effective. Upon focusing on predicting the duration of hospital stay, we also detected significant differences underlying treatment disparities across insurance groups, suggesting our method’s potential for uncovering treatment biases among subgroups and improving patient care.

In conclusion, we can list several potential future directions. While we have shown that *OTTEHR* enables knowledge transfer between datasets, it would be interesting to further evaluate *OTTEHR* on other relevant regression and classification tasks and on other demographic factors. These tasks include predicting the time interval between consecutive visits [Poole et al., 2016] and mortality rates [Goodacre et al., 2006], using appropriate and potentially larger datasets [Pader et al., 2021, Sudlow et al., 2015]. Also, extending our method to semi-supervised transfer learning [Wei et al., 2019] and designing a unified model that simultaneously solves feature embedding and classification problems [Song et al., 2017] could improve predictive performance on target domains with limited labeled data. From a theoretical perspective, it would be beneficial to extend the upper bound theorem to other settings, such as multi-class classification, models that are not Lipschitz continuous, and other OT metrics that can be more suitable for transfer learning across different EHR systems [Séjourné et al., 2021]. Finally, there are several other potential areas for improving our current approach, including reducing its computational complexity to handle larger datasets, optimizing the embedding with more complex manifold learning techniques, and integrating heterogeneous information, such as laboratory results and doctors’ notes.

## Data Availability

All data produced are available online at [https://anonymous.4open.science/r/OTTEHR-C08B/](https://anonymous.4open.science/r/OTTEHR-C08B/).


## Data and Code Availability

This paper uses the MIMIC-III dataset [Johnson et al., 2016], which is available on the PhysioNet repository [Moody et al., 2001]. The anonymized code repository is available at this link.

## Institutional Review Board (IRB)

This research does not require IRB approval.

## A Proof of Theorem 1 - Upper Bound for Binary Classification and Regression

Let $\mu_T$ ($\mu_S$) be a discrete target (source) distribution defined on a domain $D_T$ ($D_S$) and with probability mass function $\phi_T$ ($\phi_S$). Let $h' = h^* \circ T$, where $h^*$ is Lipschitz continuous and $T$ is the barycentric projection (Equation (4)). The target error $\epsilon_T(h')$ defined by Equation (5) is bounded by ![Formula][34] where $K$ is the Lipschitz constant of $h^*$, $\pi^*$ is the OT plan and ![Graphic][35].

*Proof*. We first rewrite the target error as ![Formula][36] Since $K > 0$ is the Lipschitz constant of $h^*$, we have $\|h^* \circ T(x) - h^*(x)\| \le K \|T(x) - x\|$ for all $x$. We now separately analyze (\*) and (\*\*). By the triangle inequality, ![Formula][37] Let ![Graphic][38] is well defined since ![Graphic][39]. We then obtain ![Formula][40] We can thus further bound ![Graphic][41] as ![Formula][42] where ![Graphic][43]. Finally, ![Formula][44] which yields the upper bound for (\*) ![Formula][45] Considering (\*\*), we have by the triangle inequality, ![Formula][46] Plugging Equations (11) and (12) into Equation (6) yields ![Formula][47] which completes the proof.

We note that when $h^*$ is linear, Equation (7) can be rewritten in the following way: ![Formula][48] where ![Graphic][49].

In this case, the theorem can be rewritten as: ![Formula][50] where ![Graphic][51]. □
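The Lipschitz step used in the proof can be illustrated numerically. The sketch below makes several assumptions: Equation (4) is not reproduced in this appendix, so the barycentric projection is taken in its usual form (each target point is sent to the plan-weighted average of the source points, normalized by the column mass so that it stays well defined for an unbalanced plan); the model is a hypothetical linear map, whose Lipschitz constant under the Euclidean norm is the norm of its weight vector; and the points and plan are toy values rather than an actual OT solution.

```python
import math


def barycentric_projection(plan, source_points):
    """Send each target index j to the plan-weighted average of source points.

    plan[i][j] is the mass coupled between source point i and target point j.
    Dividing by the column mass keeps the map well defined when the plan is
    unbalanced, i.e. when column sums differ from the target marginal.
    """
    n_source, n_target = len(plan), len(plan[0])
    dim = len(source_points[0])
    projected = []
    for j in range(n_target):
        mass = sum(plan[i][j] for i in range(n_source))
        projected.append([
            sum(plan[i][j] * source_points[i][d] for i in range(n_source)) / mass
            for d in range(dim)
        ])
    return projected


def norm(vector):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in vector))


# Hypothetical toy embeddings and coupling (not an optimized OT plan).
source = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
target = [[1.0, 1.0], [3.0, 1.0]]
plan = [[0.2, 0.0], [0.1, 0.3], [0.2, 0.1]]

# Hypothetical linear model h(x) = w . x + b; its Lipschitz constant is ||w||.
weights, bias = [1.0, 2.0], 0.5


def h(x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


K = norm(weights)
projected = barycentric_projection(plan, source)

for x, tx in zip(target, projected):
    lhs = abs(h(tx) - h(x))                          # |h(T(x)) - h(x)|
    rhs = K * norm([a - b for a, b in zip(tx, x)])   # K * ||T(x) - x||
    print(f"{lhs:.3f} <= {rhs:.3f}")
```

For a linear model the inequality follows from the Cauchy-Schwarz inequality, so the check holds for any choice of points and plan with positive column mass.
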

## B Supplementary Figures

### B.1 Feature embedding differences

[Figure S1:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F5) Feature embedding for pairwise insurance groups using t-SNE. Insurance groups include “Self_Pay,” “Private,” “Government,” “Medicare,” and “Medicaid.” Most of the groups form their own clusters.

[Figure S2:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F6) Feature embedding for pairwise marital status groups using t-SNE. Marital status groups include “Separated,” “Divorced,” “Married,” “Widowed,” and “Single.” Most of the groups form their own clusters.

### B.2 Illustration of $\epsilon_T(h^*, f_S)$ in Theorem 1

[Figure S3:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F7) Illustration of $\epsilon_T(h^*, f_S)$. $D_S$ and $D_T$ are the source and target embedding spaces. The blue dots denote the source embeddings and source labels. $f_S$ is the ground-truth labeling function for source embedding features and source labels. $h^*$ is the source model trained on source embedding features and source labels. The gray area denotes $\epsilon_T(h^*, f_S)$.

### B.3 Bound analysis

[Figure S4:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F8) Bound analysis for all marital status group experiments. The relationship between target error and (**A**) transport term with a PCC of 0.63, and (**B**) labeling divergence with a PCC of 0.09, combining the results for all pairwise marital status group experiments.

[Figure S5:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F9) Bound analysis for pairwise insurance group experiments with respect to transport term. Target error versus transport term for pairwise insurance groups with an average PCC of 0.09.

[Figure S6:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F10) Bound analysis for pairwise insurance group experiments with respect to labeling divergence. Target error versus labeling divergence for pairwise insurance groups with an average PCC of 0.67.

[Figure S7:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F11) Bound analysis for pairwise marital status group experiments with respect to transport term. Target error versus transport term for pairwise marital status groups with an average PCC of 0.03.

[Figure S8:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F12) Bound analysis for pairwise marital status group experiments with respect to labeling divergence. Target error versus labeling divergence for pairwise marital status groups with an average PCC of 0.59.
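The PCC values reported in this subsection are assumed here to be Pearson correlation coefficients between the target error and one term of the bound. A minimal sketch of that computation is given below; the function name and the toy values are hypothetical and do not come from the experiments.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical values for one pairwise experiment (not taken from the paper).
transport_terms = [0.8, 1.1, 1.9, 2.4, 3.0]
target_errors = [1.0, 1.4, 1.7, 2.6, 2.9]
pcc = pearson_correlation(transport_terms, target_errors)
print(f"PCC = {pcc:.2f}")
```

A value near 1 would indicate that the target error grows almost linearly with the chosen term, while a value near 0 would indicate little linear relationship.
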

### B.4 Benchmark results for marital status group experiments

[Figure S9:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F13) Benchmark results for all marital status group experiments. Violin plots of the log of (**A**) *MAE* and (**B**) *RMSE* between projected duration and observed duration on target admissions for marital status groups using *OTTEHR, TCA, CA, GFK, RSD* and *daregram*. The blue bars denote the medians. The heights of the violin plots denote the variability, positively correlated with the standard deviations. Smaller *MAE* and *RMSE* values indicate better performance.
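For reference, the two error metrics plotted above can be computed as in the following sketch. The durations are hypothetical toy values, and the use of the natural logarithm is an assumption, since the base of the log is not stated in the caption.

```python
import math


def mae(observed, projected):
    """Mean absolute error between observed and projected durations."""
    return sum(abs(o - p) for o, p in zip(observed, projected)) / len(observed)


def rmse(observed, projected):
    """Root mean squared error between observed and projected durations."""
    squared = [(o - p) ** 2 for o, p in zip(observed, projected)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical durations in hours (not taken from MIMIC-III).
observed = [100.0, 200.0, 300.0]
projected = [110.0, 180.0, 330.0]

mae_value = mae(observed, projected)    # absolute errors 10, 20, 30 -> mean 20.0
rmse_value = rmse(observed, projected)  # sqrt((100 + 400 + 900) / 3)

# Assumption: natural log, as the base is not specified in the caption.
print(math.log(mae_value), math.log(rmse_value))
```

Because *RMSE* squares the errors before averaging, it is never smaller than *MAE* and is more sensitive to a few admissions with large errors.
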

### B.5 Admissions with significantly reduced durations in hospital for all pairwise experiments

[Figure S10:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/F14) Percentages of admissions having significantly reduced durations in hospital for all pairwise insurance group experiments. For example, when transitioning to self-pay plans, 15.3% of the admissions on Medicaid would result in significantly reduced durations in hospital.

## C Supplementary Tables

### C.1 Benchmark results for pairwise group experiments

[Table S1:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/T1) Benchmark results for pairwise insurance group experiments. Medians and standard deviations of the log of *MAE* between projected duration and observed duration on target admissions for different insurance groups using *OTTEHR, TCA, CA, GFK, deepJDOT, RSD* and *daregram*. The outperformance ratio of *OTTEHR* to *TCA*/*CA*/*GFK*/*RSD*/*daregram* is defined as the percentage decrease in the median *MAE* from *TCA*/*CA*/*GFK*/*RSD*/*daregram* to *OTTEHR*. The outperformance ratio of *OTTEHR* to *deepJDOT* is defined as the percentage decrease in the median of the log-transformed standard deviation of *MAE* from *deepJDOT* to *OTTEHR*.

[Table S2:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/T2) Benchmark results for pairwise insurance group experiments. Medians and standard deviations of the log of *RMSE* between projected duration and observed duration on target admissions for different insurance groups using *OTTEHR, TCA, CA, GFK, deepJDOT, RSD* and *daregram*. The outperformance ratio of *OTTEHR* to *TCA*/*CA*/*GFK*/*RSD*/*daregram* is defined as the percentage decrease in the median *RMSE* from *TCA*/*CA*/*GFK*/*RSD*/*daregram* to *OTTEHR*. The outperformance ratio of *OTTEHR* to *deepJDOT* is defined as the percentage decrease in the median of the log-transformed standard deviation of *RMSE* from *deepJDOT* to *OTTEHR*.

[Table S3:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/T3) Benchmark results for pairwise marital status group experiments. Medians and standard deviations of the log of *MAE* between projected duration and observed duration on target admissions for different marital status groups using *OTTEHR, TCA, CA, GFK, deepJDOT, RSD* and *daregram*. The outperformance ratio of *OTTEHR* to *TCA*/*CA*/*GFK*/*RSD*/*daregram* is defined as the percentage decrease in the median *MAE* from *TCA*/*CA*/*GFK*/*RSD*/*daregram* to *OTTEHR*. The outperformance ratio of *OTTEHR* to *deepJDOT* is defined as the percentage decrease in the median of the log-transformed standard deviation of *MAE* from *deepJDOT* to *OTTEHR*.

[Table S4:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/T4) Benchmark results for pairwise marital status group experiments. Medians and standard deviations of the log of *RMSE* between projected duration and observed duration on target admissions for different marital status groups using *OTTEHR, TCA, CA, GFK, deepJDOT, RSD* and *daregram*. The outperformance ratio of *OTTEHR* to *TCA*/*CA*/*GFK*/*RSD*/*daregram* is defined as the percentage decrease in the median *RMSE* from *TCA*/*CA*/*GFK*/*RSD*/*daregram* to *OTTEHR*. The outperformance ratio of *OTTEHR* to *deepJDOT* is defined as the percentage decrease in the median of the log-transformed standard deviation of *RMSE* from *deepJDOT* to *OTTEHR*.
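The outperformance ratio used in Tables S1 to S4 for the non-*deepJDOT* baselines is a percentage decrease in a median error. A minimal sketch under that definition is shown below; the function name and the error values are hypothetical, and the same function applies whether the errors are *MAE* or *RMSE* values.

```python
from statistics import median


def outperformance_ratio(baseline_errors, ottehr_errors):
    """Percentage decrease in the median error from a baseline to OTTEHR."""
    baseline_median = median(baseline_errors)
    ottehr_median = median(ottehr_errors)
    return 100.0 * (baseline_median - ottehr_median) / baseline_median


# Hypothetical per-experiment errors (not taken from the tables).
baseline_errors = [10.0, 12.0, 14.0]  # median 12.0
ottehr_errors = [6.0, 9.0, 12.0]      # median 9.0
print(outperformance_ratio(baseline_errors, ottehr_errors))  # 25.0
```

A positive ratio means that the median error under *OTTEHR* is lower than under the baseline, and a negative ratio means the opposite.
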

### C.2 Computation time

[Table S5:](http://medrxiv.org/content/early/2024/03/28/2024.03.27.24304781/T5) Average computational time in seconds per experiment for *OTTEHR, TCA, CA, GFK, deepJDOT, RSD* and *daregram*.

## Acknowledgments

We acknowledge the help from Zhenyuan Zhang and Greg d’Eon for discussing the proof for Theorem 1.

*   Received March 27, 2024.
*   Revision received March 27, 2024.
*   Accepted March 28, 2024.


*   © 2024, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## References

1.   Alistair EW Johnson,  Tom J Pollard,  Lu Shen,  Li-wei H Lehman,  Mengling Feng,  Mohammad Ghassemi,  Benjamin Moody,  Peter Szolovits,  Leo Anthony Celi, and  Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.
    
    
2.   George B Moody,  Roger G Mark, and  Ary L Goldberger. Physionet: A web-based resource for the study of physiologic signals. IEEE Engineering in Medicine and Biology Magazine, 20(3):70–75, 2001.
    

3.   Naomi Allen,  Cathie Sudlow,  Paul Downey,  Tim Peakman,  John Danesh,  Paul Elliott,  John Gallacher,  Jane Green,  Paul Matthews,  Jill Pell, et al. UK biobank: Current status and what it means for epidemiology. Health Policy and Technology, 1(3):123–126, 2012.
    
    
4.   Nir Menachemi and  Taleah H Collum. Benefits and drawbacks of electronic health record systems. Risk Management and Healthcare Policy, pages 47–55, 2011.
    
    
5.   Tyler Wagner,  FNU Shweta,  Karthik Murugadoss,  Samir Awasthi,  AJ Venkatakrishnan,  Sairam Bade,  Arjun Puranik,  Martin Kang,  Brian W Pickering,  John C O’Horo, et al. Augmented curation of clinical notes from a massive ehr system reveals symptoms of impending covid-19 diagnosis. Elife, 9:e58227, 2020.
    
    
6.   Kianoush Kashani and  Vitaly Herasevich. Utilities of electronic medical records to improve quality of care for acute kidney injury: past, present, future. Nephron, 131(2):92–96, 2015.
    

7.   Barnet Woolf et al. On estimating the relation between blood group and disease. Annals of Human Genetics, 19(4):251–253, 1955.
    

8.   Frederick J de Serres,  Ignacio Blanco, et al. Pi s and pi z alpha-1 antitrypsin deficiency worldwide. a review of existing genetic epidemiological data. Monaldi Archives for Chest Disease, 67(4), 2007.
    
    
9.   Sebastian Haas,  Sven Wohlgemuth,  Isao Echizen,  Noboru Sonehara, and  Günter Müller. Aspects of privacy for electronic health records. International Journal of Medical Informatics, 80(2):e26–e31, 2011.
    

10.  Yaroslav Ganin and  Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR, 2015.
    
    
11.  Luis Caicedo Torres,  Luiz Manella Pereira, and  M Hadi Amini. A survey on optimal transport for machine learning: Theory and applications. ArXiv:2106.01963, 2021.
    
    
12.  Vagisha Gupta,  Shelly Sachdeva, and  Subhash Bhalla. A novel deep similarity learning approach to electronic health records data. IEEE Access, 8:209278–209295, 2020.
    
    
13.  Ruiqing Ding,  Yu Zhou,  Jie Xu,  Yan Xie,  Qiqiang Liang,  He Ren,  Yixuan Wang,  Yanlin Chen,  Leye Wang, and  Man Huang. Cross-hospital sepsis early detection via semi-supervised optimal transport with self-paced ensemble. IEEE Journal of Biomedical and Health Informatics, 2023.
    
    
14.  Jie Wang,  Ronald Moore,  Yao Xie, and  Rishikesan Kamaleswaran. Improving sepsis prediction model generalization with optimal transport. In Machine Learning for Health, pages 474–488. PMLR, 2022a.
    
    
15.  Léo Gautheron. Domain adaptation using optimal transport: Application to prostate cancer mapping. Master’s thesis, Jean Monnet University, 2017.
    
    
16.  Peiyin Chen,  He Wang,  Xinlin Sun,  Haoyu Li,  Celso Grebogi, and  Zhongke Gao. Transfer learning with optimal transportation and frequency mixup for EEG-based motor imagery recognition. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 30:2866–2875, 2022.
    
    
17.  Zhenjie Liu,  Qiang Qiu,  Jun Li,  Lizhe Wang, and  Antonio Plaza. Geographic optimal transport for heterogeneous data: Fusing remote sensing and social media. IEEE Transactions on Geoscience and Remote Sensing, 59(8):6935–6945, 2020.
    
    
18.  Riccardo Miotto,  Li Li,  Brian A Kidd, and  Joel T Dudley. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6(1):1–10, 2016.
    
    
19.  Edward Choi,  Siddharth Biswal,  Bradley Malin,  Jon Duke,  Walter F Stewart, and  Jimeng Sun. Generating multi-label discrete patient records using generative adversarial networks. In Machine Learning for Healthcare conference, pages 286–305. PMLR, 2017.
    
    
20.  Zhengping Che,  Sanjay Purushotham,  Robinder Khemani, and  Yan Liu. Interpretable deep models for ICU outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, page 371. American Medical Informatics Association, 2016.
    
    
21.  Zhengping Che,  Sanjay Purushotham,  Kyunghyun Cho,  David Sontag, and  Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.
    
    
22.  Shai Ben-David,  John Blitzer,  Koby Crammer, and  Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006.
    
    
23.  Shai Ben-David and  Ruth Urner. Domain adaptation–can quantity compensate for quality? Annals of Mathematics and Artificial Intelligence, 70:185–202, 2014.
    
    
24.  Shai Ben-David,  John Blitzer,  Koby Crammer,  Alex Kulesza,  Fernando Pereira, and  Jennifer Wortman Vaughan. A theory of learning from different domains. Mach Learn, 79:151–175, 2010.
    

25.  Nicolas Courty,  Rémi Flamary,  Amaury Habrard, and  Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, volume 30, 2017.
    
    
26.  Farjad Malik,  Simon Wouters,  Ruben Cartuyvels,  Erfan Ghadery, and  Marie-Francine Moens. Two-phase training mitigates class imbalance for camera trap image classification with cnns. arXiv preprint arXiv:2112.14491, 2021.
    
    
27.  Yaoyiran Li,  Fangyu Liu,  Nigel Collier,  Anna Korhonen, and  Ivan Vulić. Improving word translation via two-stage contrastive learning. In  Smaranda Muresan,  Preslav Nakov, and  Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4353–4374, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.299. URL [https://aclanthology.org/2022.acl-long.299](https://aclanthology.org/2022.acl-long.299).

28.  Baochen Sun,  Jiashi Feng, and  Kate Saenko. Correlation alignment for unsupervised domain adaptation. Domain adaptation in computer vision applications, pages 153–171, 2017.
    
    
29.  Sinno Jialin Pan,  Ivor W Tsang,  James T Kwok, and  Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2010.
    

30.  He He and  Dongrui Wu. Transfer learning for brain–computer interfaces: A Euclidean space data alignment approach. IEEE Transactions on Biomedical Engineering, 67(2):399–410, 2019.
    
    
31.  Boqing Gong,  Yuan Shi,  Fei Sha, and  Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073. IEEE, 2012.
    
    
32.  Mingsheng Long,  Jianmin Wang,  Guiguang Ding,  Jiaguang Sun, and  Philip S Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2200–2207, 2013.
    
    
33.  Bharath Bhushan Damodaran,  Benjamin Kellenberger,  Rémi Flamary,  Devis Tuia, and  Nicolas Courty. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European conference on computer vision (ECCV), pages 447–463, 2018.
    
    
34.  Shengsheng Wang,  Bilin Wang,  Zhe Zhang,  Ali Asghar Heidari, and  Huiling Chen. Class-aware sample reweighting optimal transport for multi-source domain adaptation. Neurocomputing, 523:213–223, 2023.
    
    
35.  Bilin Wang,  Shengsheng Wang,  Zhe Zhang,  Xin Zhao, and  Zihao Fu. Decomposed-distance weighted optimal transport for unsupervised domain adaptation. Applied Intelligence, 52(12):14070–14084, 2022b.
    
    
36.  Mingsheng Long,  Yue Cao,  Jianmin Wang, and  Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105. PMLR, 2015.
    
    
37.  Mingsheng Long,  Han Zhu,  Jianmin Wang, and  Michael I Jordan. Deep transfer learning with joint adaptation networks. In International conference on machine learning, pages 2208–2217. PMLR, 2017.
    
    
38.  Fabio Maria Carlucci,  Lorenzo Porzi,  Barbara Caputo,  Elisa Ricci, and  Samuel Rota Bulo. Autodial: Automatic domain alignment layers. In Proceedings of the IEEE International Conference on Computer Vision, pages 5067–5075, 2017.
    
    
39.  Eric Tzeng,  Judy Hoffman,  Trevor Darrell, and  Kate Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.
    
    
40.  Yaroslav Ganin,  Evgeniya Ustinova,  Hana Ajakan,  Pascal Germain,  Hugo Larochelle,  François Laviolette,  Mario Marchand, and  Victor Lempitsky. Domain-adversarial training of neural networks. Journal of machine learning research, 17(59):1–35, 2016.
    
    
41.  Zelun Luo,  Yuliang Zou,  Judy Hoffman, and  Li F Fei-Fei. Label efficient learning of transferable representations across domains and tasks. Advances in Neural Information Processing Systems, 30, 2017.
    
    
42.  Mingsheng Long,  Zhangjie Cao,  Jianmin Wang, and  Michael I Jordan. Conditional adversarial domain adaptation. Advances in neural information processing systems, 31, 2018.
    
    
43.  Yuchen Zhang,  Tianle Liu,  Mingsheng Long, and  Michael Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning, pages 7404–7413. PMLR, 2019.
    
    
44.  Xingchao Peng,  Zijun Huang,  Ximeng Sun, and  Kate Saenko. Domain agnostic learning with disentangled representations. In International Conference on Machine Learning, pages 5102–5112. PMLR, 2019.
    
    
45.  Xinyang Chen,  Sinan Wang,  Jianmin Wang, and  Mingsheng Long. Representation subspace distance for domain adaptation regression. In International Conference on Machine Learning, pages 1749–1759, 2021.
    
    
46.  Ismail Nejjar,  Qin Wang, and  Olga Fink. Dare-gram: Unsupervised domain adaptation regression by aligning inverse gram matrices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11744–11754, 2023.
    
    
47.  Khiem Pham,  Khang Le,  Nhat Ho,  Tung Pham, and  Hung Bui. On unbalanced optimal transport: An analysis of sinkhorn algorithm. In International Conference on Machine Learning, pages 7673–7682. PMLR, 2020.
    
    
48.  Rémi Flamary,  Nicolas Courty,  Alexandre Gramfort, et al. POT: Python optimal transport. Journal of Machine Learning Research, 22(78):1–8, 2021.

49.  Laurens Van der Maaten and  Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
    
    
50.  Benjamin A Goldstein,  Ann Marie Navar, and  Rickey E Carter. Moving beyond regression techniques in cardiovascular risk prediction: Applying machine learning to address analytic challenges. European Heart Journal, 38(23):1805–1814, 2017.
    

51.  Andrew Clegg,  Chris Bates,  John Young,  Ronan Ryan,  Linda Nichols,  Elizabeth Ann Teale,  Mohammed A Mohammed,  John Parry, and  Tom Marshall. Development and validation of an electronic frailty index using routine primary care electronic health record data. Age and Ageing, 45(3):353–360, 2016.
    

52.  Hrayr Harutyunyan,  Hrant Khachatrian,  David C Kale,  Greg Ver Steeg, and  Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific Data, 6(1):96, 2019.
    
    
53.  Peter L Bartlett,  Dylan J Foster, and  Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017.
    
    
54.  Vinod Nair and  Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, 2010.
    
    
55.  Sergey Ioffe and  Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
    
    
56.  Xinyi Tong,  Xiangxiang Xu,  Shao-Lun Huang, and  Lizhong Zheng. A mathematical framework for quantifying transferability in multi-source transfer learning. Advances in Neural Information Processing Systems, 34:26103–26116, 2021.
    
    
57.  Ye Tian and  Yang Feng. Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association, 118(544):2684–2697, 2023.
    
    
58.  Guanyu Cai,  Lianghua He,  MengChu Zhou,  Hesham Alhumade, and  Die Hu. Learning smooth representation for unsupervised domain adaptation. IEEE Transactions on Neural Networks and Learning Systems, 34(8):4181–4195, 2021.
    
    
59.  Dana P Goldman and  James P Smith. Can patient self-management help explain the ses health gradient? Proceedings of the National Academy of Sciences, 99(16):10929–10934, 2002.
    

60.  Debra Umberson. Family status and health behaviors: Social control as a dimension of social integration. Journal of Health and Social Behavior, pages 306–319, 1987.
    
    
61.  Tianfeng Chai and Roland R Draxler. Root mean square error (RMSE) or mean absolute error (MAE). Geoscientific Model Development, 7(1):1525–1534, 2014.
    
    
62.  Jonathan Gruber. Medicaid. In Means-Tested Transfer Programs in the United States, pages 15–78. University of Chicago Press, 2003.
    
    
63.  Amy Finkelstein and Robin McKnight. What did Medicare do? The initial impact of Medicare on mortality and out of pocket medical spending. Journal of Public Economics, 92(7):1644–1668, 2008.
    

64.  David M Cutler and Jonathan Gruber. Medicaid and private insurance: evidence and implications. Health Affairs, 16(1):194–200, 1997.
    

65.  Kaiser Family Foundation. What happens after people lose Medicaid coverage? KFF (Kaiser Family Foundation), 2023. Accessed: 2024-02-15.
    
    
66.  Kaiser Family Foundation. 10 things to know about the unwinding of the Medicaid continuous enrollment requirement, 2023. Accessed: 2024-02-15.
    
    
67.  Kaiser Family Foundation. Half of all eligible Medicare beneficiaries are now enrolled in private Medicare Advantage plans, 2021. Accessed: 2024-02-15.
    
    
68.  Sharon K Long, Genevieve M Kenney, Stephen Zuckerman, Dana E Goin, Douglas Wissoker, Fredric Blavin, Linda J Blumberg, Lisa Clemans-Cope, John Holahan, and Katherine Hempstead. The health reform monitoring survey: Addressing data gaps to provide timely insights into the Affordable Care Act. Health Affairs, 33(1):161–167, 2014.
    

69.  Katherine Baicker and Amitabh Chandra. The labor market effects of rising health insurance premiums. Journal of Labor Economics, 24(3):609–634, 2006.
    

70.  Richard J Chen, Judy J Wang, Drew FK Williamson, Tiffany Y Chen, Jana Lipkova, Ming Y Lu, Sharifa Sahai, and Faisal Mahmood. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nature Biomedical Engineering, 7(6):719–742, 2023.
    
    
71.  Mirja Mittermaier, Marium M Raza, and Joseph C Kvedar. Bias in AI-based models for medical applications: challenges and mitigation strategies. NPJ Digital Medicine, 6(1):113, 2023.
    
    
72.  Sarah Poole, Shaun Grannis, and Nigam H Shah. Predicting emergency department visits. AMIA Summits on Translational Science Proceedings, 2016:438, 2016.
    
    
73.  S Goodacre, J Turner, and Jon Nicholl. Prediction of mortality among emergency medical admissions. Emergency Medicine Journal, 23(5):372–375, 2006.
    

74.  Joy Pader, Robert B Basmadjian, Dylan E O’Sullivan, Nicole E Mealey, Yibing Ruan, Christine Friedenreich, Rachel Murphy, Edwin Wang, May Lynn Quan, and Darren R Brenner. Examining the etiology of early-onset breast cancer in the Canadian Partnership for Tomorrow’s Health (CanPath). Cancer Causes & Control, 32(10):1117–1128, 2021.
    
    
75.  Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779, 2015.
    
    
76.  Wei Wei, Deyu Meng, Qian Zhao, Zongben Xu, and Ying Wu. Semi-supervised transfer learning for image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3877–3886, 2019.
    
    
77.  Tiecheng Song, Jianfei Cai, Tianqi Zhang, Chenqiang Gao, Fanman Meng, and Qingbo Wu. Semi-supervised manifold-embedded hashing with joint feature representation and classifier learning. Pattern Recognition, 68:99–110, 2017.
    
    
78.  Thibault Séjourné, François-Xavier Vialard, and Gabriel Peyré. The unbalanced Gromov Wasserstein distance: Conic formulation and relaxation. Advances in Neural Information Processing Systems, 34:8766–8779, 2021.

 [1]: /embed/graphic-1.gif
 [2]: /embed/inline-graphic-1.gif
 [3]: /embed/inline-graphic-2.gif
 [4]: /embed/graphic-2.gif
 [5]: /embed/inline-graphic-3.gif
 [6]: /embed/inline-graphic-4.gif
 [7]: /embed/inline-graphic-5.gif
 [8]: /embed/graphic-3.gif
 [9]: /embed/graphic-4.gif
 [10]: /embed/inline-graphic-6.gif
 [11]: /embed/inline-graphic-7.gif
 [12]: /embed/graphic-5.gif
 [13]: /embed/inline-graphic-8.gif
 [14]: /embed/graphic-6.gif
 [15]: /embed/graphic-7.gif
 [16]: /embed/inline-graphic-9.gif
 [17]: /embed/inline-graphic-10.gif
 [18]: /embed/graphic-9.gif
 [19]: /embed/graphic-10.gif
 [20]: /embed/inline-graphic-11.gif
 [21]: /embed/inline-graphic-12.gif
 [22]: /embed/graphic-11.gif
 [23]: /embed/inline-graphic-13.gif
 [24]: /embed/graphic-12.gif
 [25]: /embed/inline-graphic-14.gif
 [26]: /embed/inline-graphic-15.gif
 [27]: /embed/inline-graphic-16.gif
 [28]: /embed/inline-graphic-17.gif
 [29]: /embed/inline-graphic-18.gif
 [30]: /embed/inline-graphic-19.gif
 [31]: /embed/inline-graphic-20.gif
 [32]: /embed/inline-graphic-21.gif
 [33]: /embed/inline-graphic-22.gif
 [34]: /embed/graphic-16.gif
 [35]: /embed/inline-graphic-23.gif
 [36]: /embed/graphic-17.gif
 [37]: /embed/graphic-18.gif
 [38]: /embed/inline-graphic-24.gif
 [39]: /embed/inline-graphic-25.gif
 [40]: /embed/graphic-19.gif
 [41]: /embed/inline-graphic-26.gif
 [42]: /embed/graphic-20.gif
 [43]: /embed/inline-graphic-27.gif
 [44]: /embed/graphic-21.gif
 [45]: /embed/graphic-22.gif
 [46]: /embed/graphic-23.gif
 [47]: /embed/graphic-24.gif
 [48]: /embed/graphic-25.gif
 [49]: /embed/inline-graphic-28.gif
 [50]: /embed/graphic-26.gif
 [51]: /embed/inline-graphic-29.gif