Machine learning fairness analysis on clinical data of The EMory BrEast imaging Dataset (EMBED) =============================================================================================== * Vikas Ramachandra * Mohit Sethi ## Abstract This paper explores the use of machine learning (ML) for predictive modeling on clinical data from the EMory BrEast imaging Dataset (EMBED) [1] with a focus on ML model fairness analysis. The aim of this study is to develop and evaluate fair machine learning models that can accurately predict breast cancer risk. We trained and tested various machine-learning models. Our findings show that machine learning can be effective for predicting breast cancer risk or diagnosing breast cancer, and that fairness considerations are crucial in the development of such models. Overall, our study highlights the potential of machine learning for clinical applications while emphasizing the need for ethical and fair practices in this field. ## Introduction Breast cancer is one of the most common forms of cancer in women worldwide. Early detection and accurate diagnosis play a crucial role in the successful treatment of breast cancer. Advances in medical imaging technologies have led to the creation of large datasets of clinical images and associated data. Machine learning (ML) models can be used to analyze such datasets and develop predictive models for the early detection and diagnosis of breast cancer. The Emory Breast Imaging Dataset (EMBED) is a large and diverse clinical dataset that contains thousands of mammograms, clinical images, and associated data, including patient demographics, mammography findings, pathology results, and treatment outcomes. EMBED provides a valuable resource for researchers and clinicians to develop and test ML-based predictive models for breast cancer diagnosis and treatment. However, the use of ML models in clinical settings raises concerns about fairness and bias. Models that are trained on biased or unrepresentative data can produce unfair predictions that disproportionately harm certain groups of patients. Therefore, it is important to evaluate the fairness and bias of ML models before deploying them in clinical settings. In this paper, we describe the development and evaluation of ML-based predictive models on the EMBED dataset for breast cancer diagnosis and treatment. We also perform a comprehensive fairness analysis to ensure that the models are fair and unbiased across different demographic groups. We hope our findings will contribute to the development of more accurate and equitable breast cancer diagnostic tools, improving patient outcomes and reducing health disparities. ## Dataset The dataset used for the analysis is The EMory BrEast imaging Dataset (EMBED) ([https://arxiv.org/abs/2202.04073](https://arxiv.org/abs/2202.04073)) [1]. The dataset comprises of 3.5M screening and diagnostic Mammograms and clinical data about the mammograms. The dataset tries to have a balanced racial demographic sample to help decrease racial bias in AI models. Our analysis is focussed on the provided clinical data for the mammograms. The complete data has 81776 records and 21 clinical feature columns. The pathology ground truth outcome is labeled as “path severity” which we have used as the target variable for our classification model. Majority of the records have the value for target variable missing, so we have selected 4689 records for which the target variable is present. We selected the following features as our model input patient covariates: “tissueden”, “side”, “numfind”, “total\_L\_find”, “total\_R\_find”, “calcnumber”, “ETHNICITY\_DESC”, “age\_at\_study”; and “path\_severity as our target variable. The rest of the features had to be dropped due to the majority of the values being null. The target feature/ outcome variable has 6 different classes. There is an extreme class imbalance present shown below (Table 1). To overcome this, we have used Synthetic Minority Oversampling Technique (SMOTE) oversampling technique to handle the class imbalance. The augmented data has 2132 records for each class. The dataset is split into train (70%) and validation (30%) set. The performance metrics used are: precision, recall and f1-score. View this table: [Table 1.](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/T1) Table 1. ## Model & Results Instead of doing multi-class classification of the target variable, we have trained 6 binary classification models in order to help with the fairness analysis of the model later. After benchmarking several classification models, we selected Random Forest Classifier. [6] The performance metrics for each model are attached below (Fig.1, Fig.2, Fig.3, Fig.4, Fig.5, Fig.6). ![Fig. 1.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F1.medium.gif) [Fig. 1.](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F1) Fig. 1. ![Fig. 2.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F2.medium.gif) [Fig. 2.](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F2) Fig. 2. ![Fig. 3.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F3.medium.gif) [Fig. 3.](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F3) Fig. 3. ![Fig. 4.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F4.medium.gif) [Fig. 4.](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F4) Fig. 4. ![Fig. 5.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F5.medium.gif) [Fig. 5.](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F5) Fig. 5. ![Fig. 6.](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F6.medium.gif) [Fig. 6.](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F6) Fig. 6. The models performed well with an average accuracy of 93.03 %. ### Fairness Analysis We have used the following metrics to assess the fairness of the above trained ML models. Demographic parity: Demographic parity in machine learning refers to the fairness metric that aims to ensure that the model’s outcomes or predictions are similar across different demographic groups in terms of positive outcomes. It requires that the proportion of positive outcomes is similar across different groups, regardless of other factors. Formula: **(TP + FP)** [4] Proportional parity: Proportional parity in machine learning is a fairness metric that aims to ensure that the model’s outcomes or predictions are proportional across different demographic groups. It requires that the proportion of positive outcomes is similar across different groups, relative to the proportion of positive outcomes in the population as a whole. Formula: **(TP + FP) / (TP + FP + TN + FN)** [4] Equalized odds: Equalized odds in machine learning is a fairness metric that aims to ensure that the model’s outcomes or predictions are accurate and consistent across different demographic groups. It requires that the true positive rate and false positive rate are similar across different groups, ensuring that the model does not unfairly favor or discriminate against any particular group. Formula: **TP / (TP + FN)** [4] Predictive rate parity: Predictive rate parity in machine learning is a fairness metric that aims to ensure that the model’s outcomes or predictions are consistent across different demographic groups in terms of predictive accuracy. It requires that the positive predictive value is similar across different groups, ensuring that the model’s accuracy is not biased towards any particular group. Formula: **TP / (TP + FP)** [4] Accuracy parity: Accuracy parity in machine learning is a fairness metric that aims to ensure that the model’s outcomes or predictions are consistent across different demographic groups in terms of overall accuracy. It requires that the model’s accuracy is similar across different groups, ensuring that the model is fair and equitable for everyone. Formula: **(TP + TN) / (TP + FP + TN + FN)** [4] False negative rate parity: False negative rate parity in machine learning is a fairness metric that aims to ensure that the model’s outcomes or predictions are consistent across different demographic groups in terms of false negative rates. It requires that the model’s false negative rate is similar across different groups, ensuring that the model does not unfairly harm any particular group. Formula: **FN / (TP + FN)** [4] False positive rate parity: False positive rate parity in machine learning is a fairness metric that aims to ensure that the model’s outcomes or predictions are consistent across different demographic groups in terms of false positive rates. It requires that the model’s false positive rate is similar across different groups, ensuring that the model does not unfairly favor any particular group. Formula: **FP / (TN + FP)** [4] Negative predictive value parity: Negative predictive value parity in machine learning is a fairness metric that aims to ensure that the model’s outcomes or predictions are consistent across different demographic groups in terms of negative predictive values. It requires that the model’s negative predictive value is similar across different groups, ensuring that the model’s accuracy is not biased towards any particular group. Formula: **TN / (TN + FN)** [4] Specificity parity: Specificity parity in machine learning is a fairness metric that aims to ensure that the model’s outcomes or predictions are consistent across different demographic groups in terms of specificity values. It requires that the model’s specificity is similar across different groups, ensuring that the model does not unfairly discriminate against any particular group. Formula: **TN / (TN + FP)** [4] Matthew’s correlation coefficient comparison: Matthews correlation coefficient comparison in machine learning is a fairness metric that measures the correlation between the actual outcomes and the model’s predictions, taking into account true positive, true negative, false positive, and false negative rates. It is used to compare the performance of different models and ensure fairness across different demographic groups. Formula: **(TP×TN-FP×FN)/**v**((TP+FP)×(TP+FN)×(TN+FP)×(TN+FN))** [4] Our dataset contains samples from 3 racial demographics: African Americans, Caucasians, and Asians. To analyze the model’s bias, we segregated the validation set based on racial demographics and used fairness metrics such as Demographic Parity, Proportional parity, etc. The fairness metrics results for each model are attached below (Table 2, Table 3, Table 4). View this table: [Table 2.](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/T2) Table 2. View this table: [Table 3.](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/T3) Table 3. View this table: [Table 4.](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/T4) Table 4. To optimize our model to play fair for all racial demographics. We take help from an open-source project called fairlearn ([https://github.com/fairlearn/fairlearn](https://github.com/fairlearn/fairlearn)) [2]. We use the Threshold Optimizer algorithm from the package to equalize the odds between different groups of users of our model. The Threshold Optimizer is based on a paper called “Equality of Opportunity in Supervised Learning” ([https://arxiv.org/abs/1610.02413](https://arxiv.org/abs/1610.02413)) [3]. To compare the results of th original and fair model, we calculate the fairness metrics of both models, find out the confidence intervals and use histograms to visualize any disparity. To obtain confidence intervals, we first resample 70% of the data from the original dataset to be used as the training set and the remaining records to be used as the validation set. Then both the model and fair model are trained on the training dataset and fairness metrics are calculated. In total 12 models are trained 6 normal and 6 fair models, each model a binary classification for one of the six target variables. The above steps are iterated 1000 times. Then the collected data is used to plot histograms and calculate the confidence interval for each metric for each model. Confidence interval plots were generated for the above models. Below attached are the figures which show low overlap between the fair and regular models, which signifies there is a statistical difference between the performance of models. The plots show that the fair model might be les accurate but will not punish any particular demographic. ![Figure7](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F7.medium.gif) [Figure7](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F7) ![Figure8](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F8.medium.gif) [Figure8](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F8) ![Figure9](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F9.medium.gif) [Figure9](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F9) ![Figure10](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F10.medium.gif) [Figure10](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F10) ![Figure11](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F11.medium.gif) [Figure11](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F11) ![Figure12](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F12.medium.gif) [Figure12](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F12) ![Figure13](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F13.medium.gif) [Figure13](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F13) ![Figure14](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F14.medium.gif) [Figure14](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F14) View this table: [Table5](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/T5) ## Data Availability All data is available from the EMBED website upon request for non commercial/academic use ## Appendix 1: Multimodal Image Classification ### Introduction In addition to our machine learning-based predictive modeling on clinical data from the Emory Breast Imaging Dataset (EMBED) [1] and model fairness analysis, we also propose a multi-modal classification approach that combines both images and clinical features to predict the target variable of path severity. Breast cancer diagnosis and prognosis typically rely on mammography images and clinical data, such as patient age, family history, and biopsy results. However, these approaches are limited by their reliance on either images or clinical data alone. By combining both modalities, we can potentially improve the accuracy and reliability of breast cancer diagnosis and prognosis. By integrating both modalities, we hope to develop a more comprehensive and accurate breast cancer diagnostic and prognostic tool that can ultimately improve patient outcomes. This multi-modal approach can potentially reduce the need for invasive procedures and provide more personalized treatment recommendations for patients with breast cancer. #### Dataset The ‘path_severity’ target variable is only present for 4788 rows and there is a huge imbalance as mentioned above already. So we decided to only do binary classification for target labels 0 and 4.The value count of both classes is mentioned below View this table: [Table6](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/T6) We selected the following features as our input for the ANN model (clinical features): “tissueden”, “side”, “numfind”, “total\_L\_find”, “total\_R\_find”, “calcnumber”, “ETHNICITY\_DESC”, “age\_at\_study” and “path\_severity: as our target variable. The rest of the features had to be dropped due to the majority of the values being null. The corresponding mammogram images for each record were used as input for our CNN model. ### Model & Results ![Figure15](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F15.medium.gif) [Figure15](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F15) View this table: [Table7](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/T7) ![Figure16](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F16.medium.gif) [Figure16](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F16) #### Dataset The ‘path_severity’ target variable is only present for 4788 rows and there is a huge imbalance as mentioned above already. So we decided to only do binary classification for target labels 0 and 4. The value count of both classes is mentioned below View this table: [Table8](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/T8) This time we only used mammogram images to perform binary classification ### Model & Results We trained the model for 20 epochs. ![Figure17](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F17.medium.gif) [Figure17](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F17) View this table: [Table9](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/T9) ![Figure18](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/07/27/2023.07.23.23293043/F18.medium.gif) [Figure18](http://medrxiv.org/content/early/2023/07/27/2023.07.23.23293043/F18) Despite our best efforts, we encountered challenges in implementing our multi-modal classification approach for predicting path severity in breast cancer using both images and clinical features from the Emory Breast Imaging Dataset (EMBED) [1]. We encountered difficulties in training the model, with the loss function not reducing and poor accuracy results. Our attempts to use a combination of convolutional neural networks (CNNs) and Artificial Neural Networks (ANN / MLP) did not yield the expected results. While multi-modal approaches have shown promise in improving the performance of the model, the complexity and heterogeneity of the data in EMBED may have presented additional challenges in implementing a successful multi-modal approach. Despite these challenges, our work highlights the importance of continued research and development in multi-modal approaches to breast cancer diagnosis and prognosis. While our approach may not have yielded the desired results, our work provides valuable insights into the challenges and opportunities presented by multi-modal data analysis in the field of breast cancer research. In the future, we believe that further research and development in multi-modal approaches can lead to more accurate and reliable breast cancer diagnostic and prognostic tools that can ultimately improve patient outcomes. While our work may not have achieved the desired results, we remain optimistic about the potential of multi-modal approaches in the fight against breast cancer. ## Footnotes * † vikas.ramachandra{at}jacobs.ucsd.edu * ‡ m.sethi006{at}gmail.com * Received July 23, 2023. * Revision received July 23, 2023. * Accepted July 27, 2023. * © 2023, Posted by Cold Spring Harbor Laboratory The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission. ## References 1. 1.Emory Breast Imaging Dataset (EMBED). (2022). [https://arxiv.org/abs/2202.04073](https://arxiv.org/abs/2202.04073) 2. 2.Fairlearn GitHub Repository. (n.d.). Retrieved from [https://github.com/fairlearn/fairlearn](https://github.com/fairlearn/fairlearn) 3. 3.Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NIPS) (pp. 3315-3323). Retrieved from [https://arxiv.org/abs/1610.02413](https://arxiv.org/abs/1610.02413) 4. 4.Varga, T. V., & Kozodoi, N. fairness. Retrieved from [https://cran.r-project.org/web/packages/fairness/vignettes/fairness.html](https://cran.r-project.org/web/packages/fairness/vignettes/fairness.html) 5. 5.Kamiran, F., & Calders, T. (2012). Fairness-aware classifier with prejudice remover regularization. In Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD) (pp. 35-50). Retrieved from [https://cran.r-project.org/web/packages/fairness/vignettes/fairness.html](https://cran.r-project.org/web/packages/fairness/vignettes/fairness.html) 6. 6.Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. doi: 10.1023/A:1010933404324 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1023/A:1010933404324&link_type=DOI)