Abstract
Background Artificial intelligence (AI) shows promise in ophthalmology, but its performance in diverse healthcare settings remains understudied. We evaluated retinIA, an AI-powered screening tool developed with Mexican data, against first-year ophthalmology residents in a tertiary care setting in Mexico City.
Methods We analyzed 435 adult patients undergoing their first ophthalmic evaluation. RetinIA and residents’ assessments were compared against expert annotations for retinal lesions, cup-to-disk ratio (CDR) measurements, and glaucoma suspect detection. We also evaluated a synergistic approach combining AI and resident assessments.
Results For glaucoma suspect detection, retinIA outperformed residents in accuracy (88.6% vs 82.9%, p = 0.016), sensitivity (63.0% vs 50.0%, p = 0.116), and specificity (94.5% vs 90.5%, p = 0.062). Meanwhile, the synergistic approach yielded a higher sensitivity (80.4%) than either ophthalmic residents alone or retinIA alone (p < 0.001). RetinIA’s CDR estimates showed lower mean absolute error (0.056 vs 0.105, p < 0.001) and higher correlation with expert measurements (r = 0.728 vs r = 0.538). In retinal lesion detection, retinIA demonstrated superior sensitivity (90.1% vs 63.0% for medium/high-risk lesions, p < 0.001) and specificity (95.8% vs 90.4%, p < 0.001). Furthermore, differences between retinIA and residents were statistically significant across all metrics. The synergistic approach achieved the highest sensitivity for retinal lesions (92.6% for medium/high-risk, 100% for high-risk) while maintaining good specificity (87.4%).
Conclusion RetinIA outperforms first-year residents in key ophthalmic assessments. The synergistic use of AI and resident assessments shows potential for optimizing diagnostic accuracy, highlighting the value of AI as a supportive tool in ophthalmic practice, especially for early-career clinicians.
1 Introduction
The need for ophthalmic screenings in Mexico has increased significantly due to the high prevalence of risk factors associated with ophthalmic diseases. Diabetes and hypertension, which affect approximately 15 million and 40 million Mexicans over 20 years old, respectively [1–3], are risk factors for several conditions. These include glaucoma, diabetic retinopathy (DR), diabetic macular edema (ME), hypertensive retinopathy, and cataracts [4–7]. Furthermore, the elderly population, approximately 18 million individuals [8], also faces higher risks of glaucoma, age-related macular degeneration (AMD) and cataracts [4, 7, 9].
Glaucoma has additional risk factors, including family history and increasing age [4], and primarily affects individuals over 40 years old [10]. This age group comprises over 44 million people in Mexico [11]. Additionally, it is estimated that over 1.5 million Mexicans have glaucoma, of whom 50% remain undiagnosed [12]. Moreover, glaucoma is the second leading cause of irreversible blindness in Mexico and worldwide [12].
While periodic ophthalmic evaluations are recommended for these at-risk populations, Mexico faces a significant shortage of ophthalmologists, with only 4,213 registered as of July 2024, of whom 31.5% are concentrated in Mexico City [13]. This scarcity makes comprehensive screening of all at-risk individuals unfeasible.
To address this challenge, retinIA [14], a Mexican software for ophthalmic screening, was developed. It detects retinal lesions from fundus images, evaluates cup-to-disk ratio (CDR) for potential glaucomatous optic disks, and identifies possible media opacities. While initially designed for mass screenings in primary care settings, retinIA may also prove valuable in ophthalmology hospitals, particularly during initial consultations often conducted by first-year residents who are still developing their diagnostic skills.
Notably, even experienced ophthalmologists exhibit significant variability in CDR estimation and DR severity grading [15–17]. Recent studies have shown promising results in using artificial intelligence (AI) to improve detection rates [18].
Moreover, AI tools may offer particular benefits for early-career clinicians [19,20]. By providing a second opinion or highlighting potential areas of concern, AI can serve as an educational aid, helping to accelerate the learning curve and potentially improve diagnostic accuracy during this formative stage of clinical practice. This potential for AI to support early-career clinicians adds another dimension to its value in ophthalmic practice.
Our study aims to evaluate retinIA’s performance within an ophthalmic hospital setting, comparing it to the performance of first-year residents in detecting retinal lesions and estimating CDR. We also explore how retinIA’s results could aid first-year residents in diagnosis and subspecialty referrals as part of the triage process. Additionally, we assess the agreement between retinIA and first-year residents in cataract detection to evaluate the software’s efficacy in identifying media opacities.
2 Related Work
2.1 Challenges in ophthalmological assessment
Accurately estimating the cup-to-disk ratio (CDR) to identify glaucoma suspects is challenging even for experienced ophthalmologists. Even when the optic cup and optic disk are manually annotated, CDR estimates vary. This has been shown with the RIGA dataset, which has annotations from six experts [21]. Agreement on vertical CDR lay between 29.4% and 46.1%, while the best accuracy was only 79.2% [15]. Furthermore, the Pearson correlation coefficient for CDR between experts lay between 0.77 and 0.88 [16].
Previous studies on retinal disease detection have provided insights into intergrader agreement. For diabetic retinopathy (DR) detection, Cohen’s kappa (κ) among primary graders has been reported at approximately 0.7 [22, 23], indicating moderate agreement according to McHugh’s interpretation [24].
When assessing DR severity grades, the intergrader agreement varies by expertise level. General ophthalmologists demonstrate an average κ of 0.616, while retinal experts achieve a higher κ of 0.814 [17]. This suggests that first-year ophthalmology residents would likely exhibit lower intergrader agreement for DR assessment.
Intergrader agreement tends to decrease when a broader spectrum of retinal pathologies is considered. Thapa et al. reported an average κ of 0.6 for allied medical personnel in detecting various retinal abnormalities [25]. Furthermore, the average agreement between mid-level ophthalmic personnel and ophthalmologists was found to be κ = 0.55 [26].
Ruamviboonsuk et al. examined agreement on ophthalmological referrals, revealing that retina specialists achieved a κ of 0.67 when compared to a consensus of three retina specialists. In contrast, general ophthalmologists only reached a κ of 0.24 in the same comparison [27]. This highlights the significant variability in agreement levels across different professional categories and tasks within ophthalmology.
2.2 AI in clinical practice and ophthalmology resident training
AI can serve as a valuable assistive tool for ophthalmologists in their clinical practice. Previous studies have demonstrated that AI can enhance ophthalmologists’ sensitivity in detecting diabetic retinopathy (DR) [18]. Moreover, AI-based cup-to-disk ratio (CDR) measurements for glaucoma assessment have been shown to surpass the accuracy of the average expert [16].
Thus, utilizing AI as an assistive tool can help standardize and improve key diagnostic measurements such as CDR. Importantly, a recent survey revealed that ophthalmologists are receptive to incorporating AI as clinical assistive tools, particularly for DR, glaucoma, and age-related macular degeneration (AMD) detection [28].
In addition, some researchers have explored how AI can enhance the learning process of ophthalmology residents for DR and pathological myopia (PM) detection [19, 20].
This alignment between AI capabilities and clinician attitudes suggests a promising future for AI integration in ophthalmology.
2.3 AI development and applications in ophthalmology
In recent years, there has been significant development of artificial intelligence (AI) models for the automatic detection of retinal diseases in fundus images [29–39]. Other AI models applied to fundus images analyse the optic disk for glaucoma suspect detection [40–46], while others analyse the retinal vasculature to detect hypertensive retinopathy [47–50].
AI models in ophthalmology typically employ convolutional neural networks for either classification or segmentation tasks. Classification models, such as AlexNet [51], VGG-16 [52], Inception V3 [53], and ResNet50 [54], are used to categorize entire images. Segmentation models like U-Net [55] identify specific pixels corresponding to particular features.
AI models have shown particularly robust results in detecting DR [29–31, 36–38] and AMD [32–35]. Other implementations include automatic detection of PM, macular holes, and retinal vein occlusions [36–39].
For glaucoma suspect detection, segmentation models have been employed to analyze the CDR [40, 41, 43]. Hybrid approaches that combine segmentation and classification techniques have also been used [41, 44–46].
Furthermore, segmentation models have been applied to retinal vasculature analysis, enabling the identification of hypertensive retinopathy markers such as arterial narrowing and arteriovenous nicking [47–50].
These diverse applications underscore the potential of AI in enhancing ophthalmological diagnostics across a wide spectrum of conditions.
2.4 AI Validation in Latin American Settings
Most AI research in ophthalmology has been done in Asia, North America and Europe, with China, the United States, and India being the countries with the most published research [56].
However, Latin American researchers have contributed to this field with several publications [18, 40, 57–68]. These studies predominantly focus on two main areas: glaucoma detection [40, 60–63, 67] and diabetic retinopathy screening [18, 58, 59, 64, 65, 68]. Additionally, there has been research into retinopathy of prematurity detection [57] and identification of toxoplasmosis lesions [66].
Moreover, AI systems and models have been validated in real-world settings in Latin America. Arenas-Cavalli et al. validated DART, a DR screening tool, in the Chilean health system [58]. Rogers et al. evaluated the Pegasus AI system for DR detection on a Mexican cohort [65], and González-Briceño et al. evaluated their models on primary care data from the Mexican Institute of Social Security [68].
These validations focus on DR screening, mainly in primary care settings. However, to our knowledge, AI systems that perform both retinal disease detection and glaucoma suspect detection have not been evaluated on tertiary-care population data in a Latin American context. This is important because performance metrics differ depending on whether data come from eye hospitals, community hospitals, or primary care [37]. Moreover, AI should be considered as an assistive tool in clinical practice because of the existing challenges in retinal disease detection and CDR estimation. Furthermore, the relevance of AI in the training of ophthalmology residents has been demonstrated.
2.5 Research Objectives and Significance
In summary, the existing literature highlights several key points: (1) there is significant variability in human assessment of ophthalmic conditions, particularly in CDR estimation and DR grading; (2) AI systems have shown promising results in various ophthalmological tasks, often matching or exceeding expert performance; (3) real-world validations of AI systems have demonstrated their potential for clinical integration; and (4) while Latin American researchers have contributed to the field, there remains a gap in comprehensive evaluations of AI systems that perform both retinal disease detection and glaucoma suspect detection in Latin American populations.
Our study addresses this gap by evaluating retinIA, a tool developed using Mexican data, in a tertiary care setting. Furthermore, we extend current knowledge by comparing AI performance not only to expert annotations but also to first-year ophthalmology residents, providing insights into how AI might support early-career clinicians. By assessing retinIA’s performance across multiple ophthalmic conditions simultaneously, we also contribute to understanding how integrated AI systems might function in real-world clinical scenarios.
3 Methods
To evaluate the performance of retinIA at an ophthalmic hospital, we performed ophthalmic screenings using retinIA on adult patients (over 18 years old) who underwent their first ophthalmic evaluation by a first-year ophthalmic resident. These screenings were carried out from February 12 to March 14, 2024, at Conde de Valenciana Centro, an ophthalmic institute in Mexico City.
This project was approved by the Fundación de Asistencia Privada Conde de Valenciana IAP Ethics in Research Committee (CEI-2023/12/01), the Fundación de Asistencia Privada Conde de Valenciana IAP Biosecurity Committee (CB-0053-2023), and the Fundación de Asistencia Privada Conde de Valenciana IAP Research Committee (CI-053-2023). Patients included in this study provided signed informed consent prior to their participation. This study was conducted according to the Declaration of Helsinki guidelines.
For each patient, we collected the hospital medical record ID, personal information, information related to risk factors, and ophthalmic symptoms. The variables included sex, age, self-reported diabetes and hypertension, blurry vision, increased sensitivity to light, and difficulty seeing at night, among others. The complete list of variables is presented in Appendix A.
This information is required by the retinIA platform for the part of the screening process in which logical rules are applied to obtain some of the platform’s outcomes, such as media opacity.
Retinal fundus images were also required for screening, in order for retinIA to evaluate presence of retinal disease and to calculate CDR. For each patient, retinal fundus images were taken with a Horus 45° autofocus portable fundus non-mydriatic camera from Jedmed.
3.1 Data
A total of 468 screenings were done on 464 patients that underwent their first ophthalmic evaluation by a first-year ophthalmic resident at Conde de Valenciana Centro.
Nine screenings were excluded from the analysis due to an error in registration that resulted in ages less than 18 years old. Furthermore, six studies were excluded because of missing images resulting from camera problems. In addition, 15 patients were excluded, due to having empty medical records. Therefore, screenings for 435 patients were considered for analysis.
The average age was 59.1 years (SD 15.7). Of the evaluated patients, 34.0% were men and 66.0% were women; 32.2% reported having been diagnosed with diabetes and 39.3% with hypertension.
3.2 Medical records
The medical records of the patients involved in this study were identified by the medical record ID. We extracted the following information: CDR by eye; the initial diagnosis defined by the first-year residents, including whether it was associated with glaucoma, retina, or cataract; and referral to specialists or additional tests corresponding to glaucoma subspecialty, retina subspecialty, or cataract surgery.
3.3 RetinIA screening tool
For this project, the screenings were performed with an ophthalmic screening tool named retinIA, previously developed using Mexican data [14]. It analyzes retinal images as well as patient information, such as risk factors and ophthalmic symptoms. Based on this analysis, it recommends whether the patient should schedule an appointment with an ophthalmologist, update their glasses, or repeat the study the following year.
Along with the main recommendation, the screening tool gives a possible prediagnosis, which may include: DR, ME, AMD, PM, other retinal lesions, media opacity, possible ophthalmic damage due to chronic conditions, and possible presence of other ophthalmic conditions. It also estimates the CDR to help detect glaucoma suspects.
This tool also has explainability features for the AI analysis performed on the retinal images, including a colormap if retinal anomalies were found. The explainability features also include a close-up of the optic disk with markings of the optic disk and optic cup heights and the CDR estimate.
3.4 Ground-truth annotation
A total of 1,013 fundus images were obtained during data collection, of which 918 belonged to the 435 patients that were considered for analysis.
To determine the groundtruth value for CDR analysis, three ophthalmologists annotated bounding boxes around the optic disk and the optic cup. This was performed on the 861 images in which the optic disk was present. Bounding box annotation was performed on the LinkedAI annotation platform [69].
The cup-to-disk ratio (CDR) was calculated using the height of the optic disk and the height of the optic cup, as determined by the bounding boxes. The groundtruth CDR was calculated as the average of the CDRs from the three experts. We determined the groundtruth for glaucoma suspect as follows: if the CDR was 0.6 or more, or if the CDR of one eye exceeded the CDR of the other eye by 0.2 or more, the patient was considered a glaucoma suspect. These criteria were based on Harizman et al.’s definition of the absence of glaucomatous optic neuropathy, considering the measurements associated with CDR [70].
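The groundtruth rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: the box layout `(x_min, y_min, x_max, y_max)`, the function names, and the handling of a missing eye are all assumptions made for the example.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple

# Assumed box layout (not specified in the paper): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_height(box: Box) -> float:
    """Vertical extent of a bounding box."""
    return box[3] - box[1]


def cdr_from_boxes(disk: Box, cup: Box) -> float:
    """Vertical cup-to-disk ratio from one annotator's disk and cup boxes."""
    return box_height(cup) / box_height(disk)


def groundtruth_cdr(expert_boxes: Sequence[Tuple[Box, Box]]) -> float:
    """Average of the per-expert CDRs; each item is a (disk_box, cup_box) pair."""
    return mean(cdr_from_boxes(disk, cup) for disk, cup in expert_boxes)


def is_glaucoma_suspect(cdr_right: Optional[float], cdr_left: Optional[float]) -> bool:
    """Suspect if any available CDR >= 0.6 or the inter-eye difference >= 0.2.

    Assumption: an eye without a gradable image is passed as None and is
    simply skipped; the asymmetry rule then needs both eyes.
    """
    available = [c for c in (cdr_right, cdr_left) if c is not None]
    if any(c >= 0.6 for c in available):
        return True
    return len(available) == 2 and abs(available[0] - available[1]) >= 0.2
```

For instance, a patient with CDRs of 0.30 and 0.55 would be flagged by the asymmetry criterion even though neither eye reaches 0.6.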
To determine the groundtruth for the presence of retinal lesions, all 918 images were annotated by a retina expert and an ophthalmologist. For images on which the two disagreed, the groundtruth was obtained by consensus between the retina expert and the ophthalmologist through negotiation.
Annotations were performed on the Televal platform [71], where experts selected different retinal findings and lesions that are associated with different ophthalmic diseases, including DR, ME, AMD, and PM, among others. A complete list of all annotated findings can be found in Appendix B and the logical rules to determine a possible prediagnosis and risk of visual loss are presented in Appendix C.
4 Results
For both glaucoma suspect detection and presence of retinal lesions, we calculated accuracy, specificity, sensitivity, positive predictive value (PPV), and F1-score. To assess the statistical significance of the performance differences between retinIA and residents, we used a bootstrap estimation method to calculate p-values. This method estimated the proportion of resamples of the data in which residents obtained better metrics than retinIA. Additionally, for CDR we evaluated the absolute and relative errors and calculated the Pearson correlation coefficient (r). Furthermore, we calculated the receiver operating characteristic (ROC) curve, considering glaucoma suspect detection as the outcome and the maximum estimated CDR per patient as the model. Finally, we present the percentage of agreement between the residents’ diagnosis of cataract and retinIA’s detection of media opacities.
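One way to read the bootstrap procedure described above is as a paired resampling of patients, where the p-value is the fraction of resamples in which the residents' metric exceeds retinIA's. The sketch below follows that reading; the number of resamples, the strict inequality, the seed, and all function names are assumptions, since the paper does not specify them.

```python
import random
from typing import Callable, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of cases where the prediction equals the groundtruth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_p_value(
    y_true: Sequence[int],
    pred_ai: Sequence[int],
    pred_residents: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float] = accuracy,
    n_resamples: int = 2000,  # assumed value, not reported in the paper
    seed: int = 0,
) -> float:
    """Proportion of paired resamples in which residents beat the AI on `metric`."""
    rng = random.Random(seed)
    n = len(y_true)
    resident_wins = 0
    for _ in range(n_resamples):
        # Resample patients with replacement, keeping the three labels paired.
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        ai = [pred_ai[i] for i in idx]
        res = [pred_residents[i] for i in idx]
        if metric(truth, res) > metric(truth, ai):
            resident_wins += 1
    return resident_wins / n_resamples
```

Metrics with a data-dependent denominator (sensitivity, PPV) would need a guard for resamples that contain no positive cases; accuracy is used here because it is always defined.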
4.1 Glaucoma suspect and cup disk ratio
The retinIA software analysed CDR for 61.6% of the patients; for the remaining 38.4%, the software considered that the patients’ images had insufficient quality for CDR evaluation. Meanwhile, groundtruth CDR was obtained for 78.6% of patients, for whom all three ophthalmologists considered that images had sufficient quality for grading.
On the other hand, ophthalmology residents, who evaluate patients at an in-person consultation, registered CDR for 95.4% of patients. Additionally, 97.7% of the patients diagnosed with glaucoma by the residents were referred to the glaucoma subspecialty or to further tests.
We compared CDR annotated by the residents with referral to further glaucoma assessment, considering the 0.6 threshold for CDR. Among the referred patients, 89.4% had a CDR ≥ 0.6. Meanwhile, 73.6% of patients with a CDR ≥ 0.6 were referred to a glaucoma subspecialist or for further tests.
The analysis to assess performance metrics was done on 245 patients (56.3%), corresponding to those who had a CDR evaluation by the ophthalmology residents, retinIA, and all three ophthalmologists.
For performance assessment of retinIA, we classified patients as glaucoma suspects if their CDR was 0.55 or higher in at least one eye. This threshold was set by maximizing the F1-score between the highest CDR per patient and the ground truth for glaucoma suspect. This threshold also results in a high-specificity operating point (94.5%).
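The threshold selection described above can be illustrated as a simple sweep over candidate cut-offs. This is a sketch under assumptions: the candidate set (every observed score), the tie-breaking rule (lowest threshold wins), and the function names are illustrative choices, not the authors' documented procedure.

```python
from typing import Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1-score when a case is called positive if its score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, F1) maximizing F1 over all observed score values.

    `scores` would be the maximum CDR per patient; ties go to the lowest
    threshold because candidates are scanned in ascending order.
    """
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))
    return best, f1_at_threshold(scores, labels, best)
```

Because the threshold is tuned on the same patients used for evaluation in this sketch, the resulting F1 would be optimistic relative to a held-out estimate.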
Metrics for glaucoma suspect detection are shown in Table 1. They include the evaluation of retinIA software and the ophthalmology residents’ diagnosis. Additionally, we present the metrics that would arise from a possible synergy between ophthalmology residents and retinIA, where a positive outcome for glaucoma is determined if either the residents of ophthalmology or retinIA indicated possible presence of glaucoma.
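The synergy rule described above is a logical OR of the two binary outcomes. The following sketch, with hypothetical function names and toy data, shows the rule and why it can only raise sensitivity and only lower specificity relative to either rater alone.

```python
from typing import Sequence


def combine_or(pred_a: Sequence[int], pred_b: Sequence[int]) -> list:
    """Positive if either rater (resident or AI) flagged the patient."""
    return [int(bool(a) or bool(b)) for a, b in zip(pred_a, pred_b)]


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """True positives divided by all groundtruth positives."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)


def specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """True negatives divided by all groundtruth negatives."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(1 - p for p in negatives) / len(negatives)


# Toy data (not from the study): 4 suspects followed by 4 non-suspects.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
residents = [1, 0, 0, 0, 0, 0, 0, 1]
ai = [0, 1, 1, 0, 1, 0, 0, 0]
synergy = combine_or(residents, ai)
```

In the toy data the combined sensitivity rises to 0.75 (from 0.25 and 0.50) while specificity drops to 0.50 (from 0.75 for each rater), which mirrors the direction of the trade-off reported for the synergistic approach.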
When comparing retinIA’s performance to that of the ophthalmology residents, retinIA consistently outperformed the residents across all metrics. RetinIA achieved an accuracy of 88.6% compared to 82.9%, a sensitivity of 63.0% compared to 50.0%, a specificity of 94.5% compared to 90.5%, a PPV of 72.5% compared to 54.8%, and an F1-score of 67.4% compared to 52.3%.
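As a consistency check, the reported F1-scores can be recovered from the reported sensitivity and PPV using the standard definition of F1 as their harmonic mean. The worked arithmetic below uses only the figures stated above.

```latex
\[
F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sens}}{\mathrm{PPV} + \mathrm{Sens}}
\]
\[
F_1^{\text{retinIA}} = \frac{2\,(0.725)(0.630)}{0.725 + 0.630} = \frac{0.9135}{1.355} \approx 0.674,
\qquad
F_1^{\text{residents}} = \frac{2\,(0.548)(0.500)}{0.548 + 0.500} = \frac{0.548}{1.048} \approx 0.523
\]
```

Both values agree with the reported 67.4% and 52.3%.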
Additionally, the highest sensitivity arises from the synergy between the residents’ diagnosis and retinIA’s detection, achieving a sensitivity of 80.4%. Moreover, this combination yields an F1-score of 67.3%, maintaining a similar balance between sensitivity and PPV as retinIA on its own.
Statistical analysis using bootstrap estimation was performed to assess the differences in performance between retinIA and residents. The analysis revealed statistically significant differences in accuracy (p = 0.016), positive predictive value (p = 0.02), and F1-score (p = 0.026). The differences in sensitivity (p = 0.116) and specificity (p = 0.062) were not statistically significant at the 0.05 level.
We also evaluated the significance of the increase in sensitivity obtained from the synergistic approach. This combined approach differed significantly from both retinIA’s performance (p < 0.001) and the residents’ performance (p < 0.001).
In Figure 1 we show the ROC curves for retinIA and residents of ophthalmology that would arise from considering the maximum CDR by patient as the model for both retinIA and the residents. The area under the ROC (AUC-ROC) for retinIA is 0.848, while the AUC-ROC for residents of ophthalmology is 0.801.
Receiver Operating Characteristic Curves for Glaucoma Detection Based on Maximum CDR by patient. Markings show the sensitivity and specificity of the residents’ diagnosis, three possible operating points for retinIA, and one possible synergy point between the residents of ophthalmology and retinIA, taken at the point where retinIA achieved a high specificity and the best F1-score. The three retinIA points correspond to the high-specificity, best-F1-score point; a point whose specificity matches that of the residents’ diagnosis; and a high-sensitivity point that corresponds to an 80% sensitivity for retinIA.
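For readers who want to reproduce an AUC-ROC of this kind, the area under the curve equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The sketch below computes it that way; it is a minimal pairwise implementation with a hypothetical function name, not the software used in the study.

```python
from typing import Sequence


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `scores` would be the maximum CDR per patient and `labels` the
    groundtruth glaucoma suspect status (1 = suspect, 0 = not suspect).
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a few hundred patients; a rank-based formulation would be preferable for large datasets.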
We marked specific values for sensitivity and specificity based on retinIA’s results (high specificity point and best F1-score), the residents’ diagnoses, and the synergistic point between ophthalmology residents and retinIA. Additionally, we include markings for retinIA’s performance at a specificity matched to that of the residents’ diagnoses, as well as a point corresponding to an 80% sensitivity for retinIA.
The matched-specificity point for retinIA would result in a sensitivity of 67.4% at a specificity of 90.5%, which matches that of the residents. Even at this point, retinIA’s sensitivity is higher than that of the residents of ophthalmology. This point would arise when a CDR of 0.527 or higher is considered as glaucoma suspect.
The high-sensitivity point, which corresponds to a sensitivity of 80.4%, was obtained when considering a CDR of 0.506 or higher as glaucoma suspect. This point also matches the sensitivity of the synergistic point between retinIA and ophthalmology residents. However, its specificity (80.4%) is lower than that attained at the synergistic point (86.4%).
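Operating points such as the matched-specificity point above can be located by scanning thresholds along the ROC curve. The sketch below picks the lowest CDR threshold whose specificity reaches a target, which is also the candidate with the highest sensitivity at that specificity. The function names and toy data are illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple


def sens_spec(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


def threshold_for_specificity(
    scores: Sequence[float], labels: Sequence[int], target_spec: float
) -> Optional[Tuple[float, float, float]]:
    """Lowest observed threshold with specificity >= target_spec.

    Specificity never decreases as the threshold rises, so the first hit in
    ascending order keeps sensitivity as high as possible. Returns
    (threshold, sensitivity, specificity), or None if no threshold qualifies.
    """
    for t in sorted(set(scores)):
        sens, spec = sens_spec(scores, labels, t)
        if spec >= target_spec:
            return t, sens, spec
    return None
```

Swapping the roles of sensitivity and specificity (scanning thresholds in descending order until a target sensitivity is reached) would give the analogous high-sensitivity point.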
We further analysed the CDR estimates by eye, by comparing them to the groundtruth CDR. This was evaluated on 362 eyes that were evaluated by all 3 ophthalmologists, by the residents of ophthalmology and by retinIA. The main results are shown in Table 2.
The mean absolute error of retinIA’s CDR estimate was 0.056 (SD: 0.042), and the relative error was 10.9% (SD: 7.9%). Meanwhile, the mean absolute error of the ophthalmology residents was 0.105 (SD: 0.074), and the relative error was 20.8% (SD: 14.4%). Moreover, the difference between retinIA and residents was significant (p < 0.001).
In Figure 2, we show the comparison between CDR groundtruth values and CDR estimates for both retinIA and ophthalmology residents. For both estimates, there is a significant relationship between estimation and groundtruth (p < 0.001). However, the Pearson correlation coefficient for retinIA was higher (r = 0.728) than for ophthalmology residents (r = 0.538).
Comparison of CDR estimates with groundtruth CDR. The left image shows the comparison with the residents of ophthalmology and the right image shows the comparison with retinIA.
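The per-eye error and correlation measures reported above can be computed as follows. This is a plain-Python sketch; it assumes the relative error is the absolute error divided by the groundtruth CDR, which the paper does not state explicitly, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean
from typing import Sequence


def mean_absolute_error(estimates: Sequence[float], truth: Sequence[float]) -> float:
    """Average of |estimate - groundtruth| over eyes."""
    return mean(abs(e - g) for e, g in zip(estimates, truth))


def mean_relative_error(estimates: Sequence[float], truth: Sequence[float]) -> float:
    """Average of |estimate - groundtruth| / groundtruth (assumed definition)."""
    return mean(abs(e - g) / g for e, g in zip(estimates, truth))


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

Note that a high correlation does not by itself imply a low error: an estimator that is consistently offset from the groundtruth can have r close to 1 and still a large mean absolute error, which is why both measures are reported.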
From this analysis, we can conclude that retinIA is a useful tool to help determine CDR values more accurately and guide residents in estimating CDR and identifying glaucoma suspects. Moreover, the best sensitivity for glaucoma suspects was obtained when considering both the residents’ diagnoses and retinIA’s results.
Nevertheless, residents’ diagnoses are valuable on their own, since only 61.6% of the patients had an estimated CDR by retinIA due to image quality constraints. Meanwhile, residents can evaluate a broader range of patients (95.4%) due to the advantages of performing in-person evaluations.
Additionally, to better understand possible reasons for insufficient image quality, we compared the prevalence of each initial diagnosis (assigned by residents) between the subset with groundtruth CDR (78.6%) and the subset without (21.4%). The initial diagnoses whose prevalence in the subset without groundtruth exceeded that in the subset with groundtruth by a relative difference of at least 25% were PM (6.7% vs 0.5%), retinal detachment (10.0% vs 1.2%), vitreous hemorrhage (7.5% vs 1.5%), uveitis (0.8% vs 0.2%), non-functional eye or prosthesis (5.8% vs 1.5%), AMD (1.7% vs 0.5%), DR (5.8% vs 4.4%), and cataracts (9.2% vs 7.2%). From these, we note that opacities arising from retinal detachment, vitreous hemorrhage, or cataracts may be related to decreased image quality for optic disk (OD) annotation. Moreover, high myopia (related to PM) may also affect image quality, resulting in blurry optic disk images. It is also clear that images cannot be acquired from prosthetic eyes, and therefore annotations cannot be performed.
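The 25% criterion can be made concrete with a short worked example. The formula below assumes the relative difference is taken with respect to the prevalence in the subset with groundtruth; the paper does not spell out the denominator, so this is one plausible reading that is consistent with the two smallest reported gaps.

```latex
\[
\Delta_{\mathrm{rel}} = \frac{p_{\text{without}} - p_{\text{with}}}{p_{\text{with}}}
\]
\[
\Delta_{\mathrm{rel}}^{\mathrm{DR}} = \frac{5.8 - 4.4}{4.4} \approx 0.32,
\qquad
\Delta_{\mathrm{rel}}^{\mathrm{cataract}} = \frac{9.2 - 7.2}{7.2} \approx 0.28
\]
```

Both values exceed 0.25, so DR and cataracts meet the threshold under this reading, as do the remaining listed diagnoses, whose gaps are larger.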
4.2 Retinal lesions
The analysis was performed on patients for whom at least one eye was evaluated by AI to determine the presence of retinal lesions. This subset consisted of 395 patients (90.8%).
If retinIA indicated the presence of possible DR, ME, AMD, PM, or other retinal lesions, it was considered positive for presence of retinal lesions.
Of the patients who were diagnosed with retinal diseases by the ophthalmology residents, only 78.8% were referred to either the retina subspecialty or further tests.
Table 3 presents different performance metrics. We distinguished three categories for sensitivity analysis. The first category includes the presence of all retinal lesions, including those associated with mild stages of retinal disease (excluding tessellated fundus).
The second category includes retinal lesions associated with medium or high risk of vision loss, such as large drusen, intraretinal hemorrhages, as well as more severe findings like vitreous hemorrhage or retinal detachment.
The third category includes lesions associated with a high risk of vision loss and encompasses vitreous hemorrhages, retinal detachment, neovascularization, macular holes, and others.
A complete list of what is included in each category is provided in Appendix C.
Sensitivity increases for both retinIA and ophthalmology residents as the risk of vision loss increases. Across all categories, retinIA’s sensitivity was higher than that of the residents, and the difference was statistically significant (p < 0.001). Sensitivity for all retinal lesions was 76.3% for retinIA compared to 51.9% for residents. For retinal lesions associated with a medium or high risk, sensitivity was 90.1% for retinIA and 63.0% for ophthalmology residents. Moreover, sensitivity for retinal lesions associated with high risk of vision loss was 100% for retinIA and 80.5% for ophthalmology residents. These results are shown in Figure 3.
Sensitivities by risk of visual loss and specificity for retinal diseases for ophthalmology residents, retinIA, and the synergy from both.
In addition, significant differences between retinIA and residents were obtained across all metrics. The differences in accuracy, sensitivity, PPV, and F1-score were all statistically significant with p < 0.001. The difference in specificity was also significant (p = 0.005).
Furthermore, retinIA’s PPV was 92.2%, indicating that its use can help increase sensitivity with minimal false positives. Moreover, considering a possible synergy between retinIA and ophthalmology residents —where presence of retinal lesions is considered positive when either the resident or retinIA detected them— increases sensitivity for all retinal lesions to 84.0%, for lesions associated with medium or high risk to 92.6%, and attains 100% sensitivity for high-risk lesions. This synergistic approach achieves a PPV of 81.4%, which is even better than the baseline PPV of the residents’ diagnoses.
4.3 Cataract
During the data collection process, information on cataract diagnosis and referral to cataract surgery was obtained. Furthermore, retinIA results include possible media opacity detection. Although there was no ground truth value for cataract, and currently retinIA does not have the capability to distinguish cataracts from other media opacities, we compared the residents’ diagnoses with the results from retinIA.
The percentage of agreement between the ophthalmology residents and retinIA was 86.0%. Figure 4 shows the confusion matrix that compares media opacity detection from retinIA and cataract diagnosis from ophthalmology residents.
Even though the percentage of agreement is high, the confusion matrix shows that there is little agreement on detection itself, which accounts for minimal agreement in terms of Cohen’s Kappa (κ = 0.237) [24]. This low agreement may be due to the presence of opacities or cataract-related symptoms that correspond to different diseases, as well as the presence of cataracts that still allow for fundus imaging with sufficient quality for analysis of the retina.
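The gap between a high percentage of agreement and a low κ arises when one class dominates, because most of the raw agreement is then expected by chance. The sketch below computes both measures for binary ratings; the counts in the usage note are illustrative only and are not the study's confusion matrix.

```python
from typing import Sequence


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of cases on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary (0/1) raters."""
    n = len(a)
    observed = percent_agreement(a, b)
    rate_a = sum(a) / n  # share of positives from rater A
    rate_b = sum(b) / n  # share of positives from rater B
    # Agreement expected if both raters labelled independently at these rates.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative counts (hypothetical): 2 both-positive, 7 + 7 discordant, 84 both-negative.
rater_a = [1] * 2 + [1] * 7 + [0] * 7 + [0] * 84
rater_b = [1] * 2 + [0] * 7 + [1] * 7 + [0] * 84
```

With these hypothetical counts the raters agree on 86% of cases, yet κ is only about 0.15, since nearly all of the agreement comes from jointly negative cases.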
5 Discussion
In this study, we compared the performance of retinIA, an AI tool that detects retinal diseases, estimates CDR, and determines possible media opacities, against the initial diagnosis of ophthalmology residents.
For retinal disease detection, CDR estimation, and glaucoma suspect detection, performance was assessed against groundtruth values obtained through fundus image annotation by experts.
RetinIA demonstrated superior performance compared to ophthalmology residents in CDR estimation, obtaining a lower average error (0.056 vs 0.105, p < 0.001) and a higher Pearson correlation coefficient (0.728 vs 0.538). This advantage of an AI solution could be expected, since CDR estimation is challenging even for experts [15, 16]. Consequently, the implementation of retinIA has the potential to standardize CDR measurements. However, we did not assess how the visual outputs from retinIA could help residents determine CDR measurements. Further research can evaluate the benefits of having a visual aid to determine CDR both in clinical practice and in learning processes.
Determining the presence of glaucoma relies heavily on OD assessment, including a large CDR, notable differences in CDR between the two eyes, whether the ISNT rule is fulfilled, and the presence of hemorrhages in the OD [70]. Even though there are other relevant signs of glaucoma, CDR has been shown to be a significant indicator of its presence [72]. Moreover, 89.4% of patients referred for further glaucoma-related tests or to a glaucoma specialist had CDR estimates ≥ 0.6. It is therefore straightforward that better CDR estimation results in better glaucoma suspect detection. In this work, we considered as glaucoma suspects only those patients whose CDR was ≥ 0.6 or whose difference in CDR between both eyes was larger than 0.2, following the CDR-related criteria used to determine the absence of glaucomatous optic neuropathy [70]. Under those criteria, sensitivity was higher for retinIA (63.0%), which only considers CDR, than for ophthalmology residents (50.0%), who perform an on-site evaluation and have more elements for diagnosis. While both sensitivities are relatively low when considered independently, and the difference between them is not statistically significant (p = 0.116), when patients detected by either AI or residents were considered together, sensitivity increased to 80.4% (p < 0.001) while maintaining a specificity of 86.4%. Furthermore, specificity may be higher if other elements, such as IOP or hemorrhages in the OD, were considered. This suggests a potentially powerful synergy between AI and ophthalmology residents. Thus, implementing retinIA as an assistive tool can have positive effects on patients' first consultation. Moreover, using AI offers a significant advantage in homogenizing this measurement, which is known to have high inter-observer variability.
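The glaucoma suspect criterion and the combined AI-plus-resident reading discussed above can be sketched as follows. This is an illustration under stated assumptions, not retinIA's implementation: the function names are hypothetical, the CDR ≥ 0.6 threshold is assumed to apply to either eye, the combined reading is assumed to flag a patient when either the AI or the resident does, and the labels are invented for the example.

```python
def is_glaucoma_suspect(cdr_right, cdr_left, cdr_threshold=0.6, asymmetry_threshold=0.2):
    """CDR-based suspect rule: large CDR in either eye (assumption) or large inter-eye difference."""
    large_cdr = max(cdr_right, cdr_left) >= cdr_threshold
    asymmetric = abs(cdr_right - cdr_left) > asymmetry_threshold
    return large_cdr or asymmetric


def sensitivity_specificity(predicted, truth):
    """Sensitivity and specificity of boolean predictions against boolean ground truth."""
    tp = sum(p and t for p, t in zip(predicted, truth))
    fn = sum((not p) and t for p, t in zip(predicted, truth))
    tn = sum((not p) and (not t) for p, t in zip(predicted, truth))
    fp = sum(p and (not t) for p, t in zip(predicted, truth))
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for ten patients (NOT the study's data).
truth    = [True, True, True, True, False, False, False, False, False, False]
ai       = [True, True, False, False, False, False, False, False, False, True]
resident = [False, True, True, False, False, False, False, False, True, False]

# Combined reading (assumption): a patient is flagged if either the AI or the resident flags them.
combined = [a or r for a, r in zip(ai, resident)]

ai_sens, ai_spec = sensitivity_specificity(ai, truth)
res_sens, res_spec = sensitivity_specificity(resident, truth)
comb_sens, comb_spec = sensitivity_specificity(combined, truth)
print(f"AI: {ai_sens:.2f}/{ai_spec:.2f}  resident: {res_sens:.2f}/{res_spec:.2f}  "
      f"combined: {comb_sens:.2f}/{comb_spec:.2f}")
```

Because the combined reading is a union of the two sets of flags, its sensitivity can never be lower than that of either reader alone, and its specificity can never be higher. This is the trade-off described in the paragraph above.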
For retinal disease detection, there is an important difference in sensitivity between retinIA and residents. Mild retinal lesions, such as microaneurysms and small drusen, may not require high sensitivity. However, retinal findings with medium or high risk of visual loss should be identified so that the necessary considerations for referral are addressed. Sensitivity for medium- or high-risk findings was 90.1% for retinIA compared to 63.0% for residents. Even though sensitivity increased for high-risk findings, residents only achieved 80.5% compared to 100% detection by AI. Moreover, specificity was higher for retinIA than for residents, resulting in an overall better evaluation from retinIA. Furthermore, differences in accuracy, sensitivity, PPV, and F1-score were all statistically significant with p < 0.001. The difference in specificity was also significant (p = 0.005), indicating a notable performance difference between retinIA and first-year residents in detecting retinal lesions. Although not all patients with retinal lesions require referral for further tests or to retina specialists, high sensitivity is important in order to make better referral decisions and give better recommendations to patients in terms of self-care and follow-up ophthalmic consultations.
RetinIA demonstrates significant potential in enhancing the detection of retinal lesions and guiding more accurate measurements of CDR. However, it is crucial to emphasize that the final diagnosis must always be based on the ophthalmologist’s comprehensive evaluation. Ophthalmologists perform a thorough assessment that encompasses examining the peripheral retina (which may not be visible in digital images), evaluating anterior segment structures, and detecting opacities that could compromise retinal image quality. These aspects of clinical examination are critical and cannot be fully replicated by current AI technologies.
It is important to note that not all diagnoses require immediate referral to a subspecialty. Some require additional studies, and some may not be referred at all, depending on the severity and nature of their condition. The decision-making process for referrals remains a critical aspect of the resident’s role. Furthermore, the role of residents in retinal disease detection should not be underestimated; notably, 35.9% of patients with images deemed insufficient for AI analysis were diagnosed with retinal disease by residents, highlighting the value of clinical experts.
Moreover, it must be considered that, due to image quality, AI only estimates CDR for 61.6% of patients and assesses the retina for 90.8% of patients. Poor image quality is a drawback, especially for the OD. This could be addressed by relaxing the quality criteria, which is feasible, since all three ophthalmologists considered that 78.6% of patients had images of sufficient quality for annotation. Another option to increase the proportion of patients evaluated is to use fundus cameras with higher resolution; however, this may also result in higher implementation costs. Additionally, images of patients with opacities or PM may not have sufficient quality due to physiological differences, which could be the underlying cause of the large prevalence differences (at least 25% relative difference) between the subsets with and without CDR ground truth.
Nevertheless, retinIA can serve as a valuable tool to enhance the detection of retinal lesions, homogenize CDR estimates, and guide the selection of additional examinations, referral to a subspecialty, and follow-up. Moreover, a synergistic approach between AI and residents showed enhanced sensitivity for both glaucoma suspect and retinal disease detection, with little effect on specificity. Therefore, its use should be considered in future clinical practice.
6 Conclusions
This study demonstrates that retinIA, an AI-powered ophthalmic screening tool, outperforms first-year ophthalmology residents in several key areas of ophthalmic assessment in a Mexican tertiary care setting. Specifically:
CDR Estimation: RetinIA exhibited superior accuracy in CDR estimation compared to residents, with lower mean absolute error (0.056 vs 0.105, p < 0.001) and higher correlation with expert measurements (r = 0.728 vs r = 0.538). This suggests that AI implementation could help standardize this critical measurement.
Glaucoma Suspect Detection: RetinIA showed higher accuracy (88.6% vs 82.9%, p = 0.016), sensitivity (63.0% vs 50.0%, p = 0.116), and specificity (94.5% vs 90.5%, p = 0.062) compared to residents. The differences in accuracy, positive predictive value (p = 0.02), and F1-score (p = 0.026) were statistically significant, indicating a robust advantage for retinIA in glaucoma suspect detection.
Retinal Lesion Detection: RetinIA demonstrated superior sensitivity across all risk categories, particularly for medium and high-risk lesions (90.1% vs 63.0%), while maintaining high specificity (95.8% vs 90.4%). The statistical significance of these differences (p < 0.001 for accuracy, sensitivity, PPV, and F1-score; p = 0.005 for specificity) strongly supports the superiority of retinIA in retinal disease detection.
These findings highlight the potential of AI as a valuable assistive tool in ophthalmic practice, particularly for early-career clinicians. The synergistic approach of combining AI and clinical assessment shows promise for optimizing diagnostic accuracy and patient care.
However, it is crucial to emphasize that AI should complement, not replace, comprehensive clinical evaluation by ophthalmologists. Challenges such as image quality limitations and the need for clinical judgment in referral decisions underscore the continued importance of human expertise in ophthalmic care.
Future research should focus on integrating AI tools like retinIA into clinical workflows, evaluating their impact on patient outcomes, and exploring their potential in ophthalmology resident training programs. As AI continues to evolve, its role in enhancing ophthalmic screening and supporting clinical decision-making is likely to expand, potentially improving access to high-quality eye care in diverse healthcare settings.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
Conflict of Interest Statement
Two of the authors, Dalia Camacho-García-Formentí and Alejandro Noriega, are employees of Prosperia Salud S.A. de C.V., the enterprise that developed retinIA. This relationship could be perceived as a potential conflict of interest. However, the study was conducted with the objective of maintaining scientific integrity and neutrality. All other authors declare no conflicts of interest.
Acknowledgments
We extend our gratitude to Diana González for her assistance in the project definition and coordination of the optometry team. We also thank Nayelli Cruz for her role in coordinating the optometry team. Our appreciation goes to Iris Pantoja, Brenda Camarillo, and Rodolfo Pineda for their efforts in performing the retinIA study. Additionally, we thank Vanessa Tirado for her work in annotating optic disk images.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].