Abstract
Aims Diabetic retinopathy (DR) is the most common cause of vision loss in the working age. This research aimed to develop a machine learning model which can predict the development of referrable DR from fundus imagery of otherwise healthy eyes.
Methods Our researchers trained a machine learning algorithm on the EyePacs dataset, consisting of 156,363 fundus images. Referrable DR was defined as any level above mild on the International Clinical Diabetic Retinopathy scale.
Results The algorithm achieved 0.81 Area Under Receiver Operating Curve (AUC) when averaging scores from multiple images on the task of predicting development of referrable DR, and 0.76 AUC when using a single image.
Conclusion Our results suggest that risk of DR may be predicted from fundus photography alone. Prediction of personalized risk of DR may become key in treatment and contribute to patient compliance across the board, particularly when supported by further prospective research.
Synopsis The deep learning algorithm our researchers developed was able to predict the development of referrable diabetic retinopathy in diabetic patients with otherwise healthy eyes with 0.81 AUC.
Introduction
Diabetic retinopathy (DR) is a common retinal vascular complication of diabetes mellitus, which is characterized by retinal micro-aneurisms, hemorrhages, neovascularization, and edema in the retina.1 Diabetic retinopathy can advance to blindness, and is the leading cause of vision-loss at the working age. While over 80% of diabetics develop retinopathy of some degree after 20 years of the disease,2 more than 90% of the sight-threatening cases can be treated, if found early, in time to prevent loss of sight.3
Current public health guidelines for individuals with diabetes prescribe screening every 12-24 months for the presence of DR.4,5 Clinical studies have demonstrated that screening can lead to early detection and timely treatment which ultimately can prevent serious visual impairment and blindness.6,7 While retinal screening is essential for patients with diabetes, it requires a specialized eye exam which is often inaccessible for patients. A large percentages of individuals with diabetes forego screening retinal exams and present late in the course of the disease.8–11 Early intervention is the key to mitigation of DR risk factors and damage. As such, early detection is the most promising way of mitigating the damages of DR.
AI and machine learning have recently been successfully applied to the autonomous diagnosis of referrable (more than mild) DR. One FDA-approved AI system reported sensitivity of 87% and specificity of 90%.12 More recently, we reported results of a Pivotal FDA study with 93% sensitivity and 91% specificity for referrable DR on images obtained by a desktop device and 92% and 94% sensitivity and specificity, respectively, on images obtained by a portable camera.13 Additionally, we presented strong efficacy for DR detection using a portable camera on a separate dataset.14
Recent work has shown that otherwise “normal” fundus images can be informative and predictive when presented to a machine learning algorithm.15–17 AI algorithms can interpret subclinical information of the retinal anatomy and make predictions about diseases, even those unrelated to the eye - such as chronic kidney disease (CKD),15 diabetes,16 and cardiovascular risk factors.17 Furthermore, machine learning algorithms have been trained to predict gender information with high accuracy from mere fundus photography – something previously unattainable with the standard clinical exam.
While some work has been done on finding risk factors for DR, using patient data such as age, HbA1c levels, gender, duration of disease, and the like,18,19 clinicians are traditionally unable to predict the development of DR in patients. However, a previous article published findings of AUC 0.79 using a machine learning algorithm to predict DR development using fundus photography.20 These findings improved to 0.81 when combined with patient-specific information on risk factors. In this study we present a first-in-class machine learning algorithm which predicts the development of future DR from otherwise normal retinal anatomy. The current study improves on previous work by predicting the development of DR over a longer period of time, namely improving the prediction period from two to over three years, which may be clinically significant.
Materials and Methods
Dataset
We utilized a dataset compiled and provided by EyePACS (http://www.eyepacs.org), comprised of fundus retinal images and expert readings of said images. The data consisted of 156,363 images from 21,730 patients who visited the clinics at least twice between 2016 and 2021. 19.6% of visit pairs were ≤ 12 months apart, 55% were 12-24 months apart, 19.8% were 24-36 months apart, and 5.5% of visits were ≥ 36 months apart (Appendix A). Of the patients, 37% were male and 63% were female or other; mean age was 55 years old (Table 1). All images and data were de-identified according to the Health Insurance Portability and Accountability Act “Safe Harbor” before they were transferred to the researchers. Institutional Review Board exemption was obtained from the Sterling Independent Review Board.
Key characteristics of the dataset.
The dataset contained up to 6 images per patient visit: one macula centered image, one disk centered image, and one centered image, per eye. Each eye was graded individually by an expert ophthalmologist for the presence and severity of diabetic retinopathy (DR). DR severity (none, mild, moderate, severe, or proliferative) was graded according to the International Clinical Diabetic Retinopathy scale.21,22
The image categorization in the current research was simplified to three severity categories by combining categories 3 to 5 into “more-than-mild DR”, as only these levels usually necessitate referral to an ophthalmologist and/or medical and surgical management.23,24
In order to prepare the dataset for model training and validation, each image was labeled by the maximal DR rating the patient was diagnosed with in a time period following the visit. Towards this purpose each patient visit was rated twice on the DR scale, once for each of the patient’s two eyes. Pairs were then created consisting of all possible pairings of each patient’s visits in a given time period. Values of the pairings were calculated by measuring the difference in DR ratings, and then taking the maximum value. Each time point (visit) was then assigned the highest value from the pairings in which that timepoint was the first, and each image was labeled by that value. Negative differences were disregarded, as the regression cause was unknown: true disease regression, clinical intervention, or misdiagnosis. Models were created for each of the chosen time periods.
For instance, a given patient has visited a clinic n times, once a year: v1, v2, …, vn. The cutoff for the given time period is set at two years, resulting in the following n-1 data points: v1 compared to v2 and v3 (taking the maximal difference), v2 compared to v3 and v4 (taking the maximal difference), etc. Further models using this patient’s data are also created, set at different time periods (three years, four years, and so on).
There were two reasons for choosing the maximal difference as the label. First, the main clinical value is in predicting whether a patient will develop DR, not in which eye it will be. Second, correlation found between the maximal right and left eye differences was relatively high (0.5), which indicates that difference between the eyes may well be incidental.
Algorithm development
To evaluate the models’ performance, a random 10% of the patients were designated as the validation set and not used for the training of models. Of these, all images deemed fully gradable were used. Given that this same 10% were used as the validation set across tasks, this choice made the fair comparison of different models and timeframes easier.
In order to train the model, all the datapoints representing a progression were included, and a subset of the negative datapoints were included at a ratio of 2:1.
Models were trained on four different tasks:
- Progression amongst DR patients (mild to more-than-mild).
- Prediction of DR development (normal to any DR).
- Prediction of clinically significant DR (from non-referrable to referrable DR).
- General progression (progress): any change for the worse in the DR condition.
The hyper-parameters for the model training were chosen beforehand, and not changed, to prevent over-fitting.
Risk factor predictive value
For 80% of the patients in the dataset, HbA1c level was recorded by the clinic, and disease duration was recorded for 98% of patients. HbA1c level and disease duration were treated as risk level scores. AUC was then calculated in order to rate predictive value for each of the scores, which were then compared to the predictive value of the model on each task (table 3).
Results
Transitional Retinopathy Results
The calculated baseline transition odds between different DR levels are displayed in Table 2 (for a more detailed table see appendix B). This is observational data, as regression- and progression-related factors are unknown; regression may have been caused by clinical intervention, and progression may be understated due to patients who experienced vision loss and therefore did not return for subsequent visits.
Disease progression was calculated between any given visit and the visit immediately following. “Regression” is defined as the patient’s recorded DR level being lower on the second visit, “No change” is defined as recorded DR levels being the same between visits, and “Progression” is defined as DR levels being higher.
Prediction Results
The model’s performance in determining the risk of mild DR becoming more-than-mild DR is comparable to risk factor-based prediction (area under the receiver operating curve (AUC) 0.65 vs. 0.66, respectively). For the other tasks more images were available, (appendix J) and performance improved significantly. The results improved still by using multiple images per patient and averaging the resulting score. The model scored best on the task of prediction of clinically significant DR, with the aggregated score resulting in AUC 0.81 (CI 95% 0.77-0.84) (Table 3. Additional timeframes and ROC curves available in appendices G and H). As the HbA1c levels and disease duration was not available for all patients, model scores were also calculated for the subsets of patients who had those scores, resulting in effectively the same scores (appendix M). The Pearson correlation between our model’s score and HbA1c levels was 0.12 (CI 95% 0.06-0.21). Correlation with disease duration was 0.21 (CI 95% 0.22-0.30).
The models’ score, in AUC, in predicting DR within two years. In the parentheses 95% CI. See supplemental for additional timeframes.
To further analyze the model’s prediction value in that task, the empirical risk as a function of the model’s score was investigated (Figure 1). When the model was trained to predict the transition to more-than-mild DR, the top 5% patients highest scored by the model were at 54% risk of getting DR, while the baseline odds in the validation set were 10% - almost a 5-fold increase.
the odds of the patient from a representative sample being diagnosed with clinically significant DR in two years, as a function of the model’s score. Each dot represents 5% of the patients. Additional figures in appendix K
Left: the right eye at the first timepoint, rated healthy by a human expert. Right: the same eye one year later, with severe DR. The model score for both healthy eyes was in the top 2% of severity, implying more than fivefold the baseline risk of developing more than mild DR in the following two years. One year later, the patient was diagnosed with severe and proliferative DR in the right and left eyes respectively.
In order to further analyze the model’s performance as a function of time, sets of images were split into four groups based on time elapsed between first and second visits and imaging (Table 4). In order to analyze the relation between the model’s assigned scores and the severity of developed DR, the scores assigned by the model were averaged and compared across different subgroups; normal patients who were diagnosed with DR up to two years after initial images were taken were sorted into subgroups based on DR severity. The groups of mild, moderate, and severe DR were score averaged at 0.36, 0.43, and 0.46 respectively (Figure 4).
Comparison of model performance as a function of time. In the parentheses 95% CI.
Additionally, the model’s predictive effectiveness regarding the more rapid development of DR was analyzed. Assigned scores were averaged and compared across subgroups organized by time elapsed between initial imaging and diagnosis: one, two, three, and four or more years. It was hypothesized that the healthier the eye appeared the more time would elapse between initial imaging and diagnosis. As expected, the model’s score declined in congruence with time passed before diagnosis (see Figure 3).
When compared with results from Bora et al., our model shows an improvement in performance from 0.79 AUC to 0.82,20 as well as the ability to predict the development of DR across a longer timeframe, up to five years. When additional relevant patient metadata was utilized, the former article improved results to 0.81 AUC. Given the lack of metadata in the dataset utilized for this research, results for this model may improve accordingly if validated on other datasets.
The model’s mean score as a function of how many years after the visit DR was diagnosed. As expected, model average score decreases in accordance with time elapsed between first visit and diagnosis
The model’s score as a function of the final diagnosed severity of the DR. The model is more confident of the future occurrence of DR if it’s more severe.
Discussion
The aim of this research was to develop a method of predicting the chances of future development of DR, before detection methods can even be applied. Given the relatively high prevalence of DR,2 the difficulty in permanently reversing retinal damage,25 and current patient non-compliance,8–11 prevention and not only treatment of DR is crucial to health outcomes. As such, prediction of individual DR risk may become a key element. Currently, to the best of our knowledge, there are two methods of doing this: risk factor based, which is of limited predictive clinical utility,18,19 and that developed by Bora et al.20 which utilizes both deep learning and risk factors for optimal results.
The current research was conducted using convolutional neural networks, a standard state of the art computer vision algorithm. Such algorithms have been reliably incorporated in multiple medical fields, such as ophthalmology,26 radiology,27 endocrinology,28 and others.29 The algorithms presented in this work, which are easily implemented and display promising performance, may carry widespread implications related to better DR prediction. For instance, given the knowledge that some patients are at very low risk of developing DR, screening may be able to be feasibly reduced according to individual risk levels, reducing strain placed on both patients and medical staff. Furthermore, in high-risk cases which have not yet manifested, patients may be forewarned of impending risk, increasing chances of mitigation and prevention through diabetes management. Bora et al. previously demonstrated the ability to predict the development of DR within two years.20 The current research is able to predict development within over three, which may have practical and clinical implications. Due to clinical guidelines recommending routine patient checkups every one to two years, the increase in predictive time may decrease the number of screenings required for patients, affording longer periods of time between necessary checkups.
A lack of patient education regarding the risks of diabetes, and DR specifically, has been cited as a contributing factor in patient non-compliance.8 Furthermore, patients may not attend screenings due to belief that they do not require retinal examinations or treatment as their vision is too good, or their diabetes is too mild to be relevant. The ability to concretely discuss personal risk levels with the patient may do much to mitigate these beliefs, contributing to higher compliance. Improved patient compliance in terms of DR may also improve compliance in terms of general diabetes management, bettering patient outcomes across the board.
One limitation of the current research is that the transition odds are observatory, rather than experimental. As such, there is the possibility of an under-statement of risk, given that blinded people likely did not continue to return for checkups. Odds of regression may similarly be over-stated, as regression factors are unknown and regression may have been caused by surgical or medical intervention.
Recommendations for future research include studies on how to incorporate the model into usual diabetes standards of care, in order to better mitigate and prevent DR. As the model’s score was not strongly correlated to risk factor score, there may be added value in including metadata on levels of previously recognized risk factors among patients in order to improve predictive value, as demonstrated in previous studies.20 Additionally, there is great value in examining whether use of this algorithm does, in fact, improve patient compliance. This model may also contribute to future research of DR risk factors and prevention, as at-risk patients with previously unknown risk factors may become recognizable, allowing for a more holistic understanding of contributing influences, both biological and behavioral.
Data Availability
All data produced in the present study is proprietary. Anonymized patient IDs are available upon request to the authors.
Funding
Employees and board members of AEYE Health designed and carried out the study; managed, analyzed, and interpreted the data; prepared, reviewed, and approved the article; and were involved in the decision to submit the article.
Footnotes
Updated review and sources