Abstract
Objectives To evaluate the effectiveness of a novel strategy for using AI as a supporting reader for the detection of breast cancer in mammography-based double reading screening practice. Instead of replacing a human reader, here AI serves as the second reader only if it agrees with the recall/no-recall decision of the first human reader. Otherwise, a second human reader makes an assessment, enacting standard human double reading.
Design Retrospective large-scale, multi-site, multi-device, evaluation study.
Participants 280,594 cases from 180,542 female participants who were screened for breast cancer with digital mammography between 2009 and 2019 at seven screening sites in two countries (UK and Hungary).
Main outcome measures Primary outcome measures were cancer detection rate, recall rate, sensitivity, specificity, and positive predictive value. Secondary outcome was reduction in workload measured as arbitration rate and number of cases requiring second human reading.
Results The novel workflow was found to be superior or non-inferior on all screening metrics, almost halving arbitration and reducing the number of cases requiring second human reading by up to 87.50% compared to human double reading.
Conclusions AI as a supporting reader adds a safety net in case of AI discordance compared to alternative workflows where AI replaces the second human reader. In the simulation using large-scale historical data, the proposed workflow retains screening performance of the standard of care of human double reading while drastically reducing the workload. Further research should study the impact of the change in case mix for the second human reader as they would only assess cases where the AI and first human reader disagree.
Introduction
The implementation of AI as an independent reader in mammography-based double reading breast cancer screening has the potential to reduce workload while preserving and possibly improving accuracy for cancer detection as suggested in recent large-scale retrospective studies [1–6]. To further assess the effectiveness of AI in screening practice, current evidence needs to be complemented by investigating the impact of clinical deployment of AI [7]. Here it is important to evaluate different strategies for the integration of AI with the goal to optimise the interaction between AI and human readers, maximising the combined benefit while ensuring patient safety and minimising clinical and operational risks [8].
Various AI workflows in screening have been discussed in the literature, including AI replacing one or all human readers [3–5,9], AI for triaging and prioritisation (before human reading) [10–12], or AI as an extra reader for identifying cancers missed by human reading [10,13]. These different roles for AI in the screening pathway have profound implications on the amount of automation, associated risks, participants’ acceptance, regulatory approval, and the downstream effects of any human-AI interaction [14]. While AI used for triaging and decision-referral showed potential for drastically reducing workload [15,16], the fact that a large number of cases would not be assessed by any human reader poses clinical risks and may hinder its acceptance by screening participants [17]. A large majority of screening participants seem to agree that some level of human oversight is desired [18]. In this context, using AI as a standalone reader entirely replacing human readers seems unlikely to become a viable strategy for deployment in the foreseeable future [19].
AI serving as an independent second reader in a double reading setting appears to strike a good balance through its potential to reduce workload while preserving double reader accuracy, keeping the assurances of having at least one human expert reader to assess every individual case at all times. A recent systematic review, however, has remarked that current AI systems, when serving as an independent second reader, may increase arbitration rates which would have clinical and operational implications [14]. Here, we propose and evaluate a novel strategy for adopting AI as a supporting reader within a human double reading pathway. In this new workflow, the AI serves as the second reader if the AI prediction agrees with the first human reader’s opinion. Otherwise, the AI prediction is disregarded, and a second human reader is making an assessment, from which point the standard human double reading is enacted. This added safety net reduces the risks of using AI while retaining the potential to significantly reduce workload. Evaluating the effectiveness of this new workflow and comparing it to both the standard of care of human double reading and AI serving as an independent second reader is the objective of this simulation analysis.
Methods
Study design
The study is a retrospective evaluation using data from a recent large-scale clinical study [6]. The study data was used to simulate double reading performance using AI as an independent reader for the detection of breast cancer in full-field digital mammography (FFDM) images. A novel workflow of using AI as supporting reader (AI-SR) was simulated and compared to the historical human double reading (HDR) and AI serving as an independent second reader (AI-IR) (see Figure 1). AI-SR is a variation of the previously explored AI-IR workflow [3,6], specifically designed to avoid an increase in arbitration rates. All performance comparisons were determined on the same unenriched screening cohorts representative of data the AI would see in real-world deployments. The original study protocol detailing the original inclusion/exclusion criteria and target performance metrics was established prior to opening the original study, which is presented elsewhere [6]. In the present evaluation, cases with a history of breast cancer (1.7%) were included to assess performance on a wider range of patients.
Diagrams illustrating the three workflows compared in this study
Study population
The study population consisted of 280,594 cases from 180,542 female participants who were invited to breast cancer screening between 2009 and 2019. The sample was representative of the population to which the AI system would be applied to in real-world screening practice. De-identified cases were collected from seven sites in the United Kingdom and Hungary. The three UK centres included Leeds Teaching Hospital NHS Trust (LTHT), Nottingham University Hospitals NHS Trust (NUH), and United Lincolnshire Hospitals NHS Trust (ULH). All three centres participate in the UK NHS Breast Cancer Screening Programme and adhere to a three-year screening interval, with women between 50 and 70 years old invited to participate. A small cohort of women between 47 and 49 years, and 71 and 73 years old who were eligible for the UK age extension trial (Age X) were also included [20]. The Hungarian centre, MaMMa Klinika (MK), involved four sites, Budapest (KAP), Kecskemét (KKM), Szekszárd (SZE), Szolnok (SZO), and corresponding mobile screening units (BUS), which follow a two-year screening interval and invite women aged 45 to 65. Across all sites, women outside the regional screening programme age range, who chose to participate as per standard of care (opportunistic screening) were also included. Screening cases were acquired from the dominant mammography hardware vendor at each site: Hologic (at LTHT), GE Healthcare (NUH), Siemens (ULH), and IMS Giotto (MK).
Positive cases were pathology-proven malignancies confirmed by fine needle aspiration cytology (FNAC), core needle biopsy (CNB), vacuum-assisted core biopsy (VACB) and/or histology of the surgical specimen. All negatives had evidence of a three-year negative follow-up result. Further details on ground truthing, including subsample definitions are reported elsewhere [6].
AI system
The AI system employed in this study (Mia™ version 2.0.1, Kheiron Medical Technologies) was previously assessed in a large-scale retrospective study [6]. The AI system works with standard DICOM (Digital Imaging and Communications in Medicine) cases as inputs, analyses four images with two standard FFDM (craniocaudal and mediolateral oblique) views per breast. The output of the AI is a single binary recommendation per case of “recall” (for further assessment due to suspected malignancy) or “no recall” (until the next screening interval). The AI software version and its operating points were fixed prior to the study. None of the study data was used in any aspect of algorithm development.
Standard of care human double reading
At all screening sites, the second human reader had access, at their discretion, to the opinion of the first human reader. In cases of disagreement, an arbitration, performed by a single or group of radiologists, made the final decision. When the opinions of the first and second human reader agreed “no recall”, a definitive “no recall” decision was reached. When the opinions agreed “recall”, a “recall” decision was reached, or an arbitration performed by a single or group of radiologists made the definitive “recall” or “no recall” decision, depending on the site’s local practice.
Double reading with an AI system
Previous studies [3,6] considered fully replacing the second human reader with AI as an independent reader (AI-IR) which was simulated by combining the (historical) first human reader’s opinion with the AI’s prediction. This workflow has direct implications on the arbitration process, as in the case of disagreement, human arbitrators would need to consider a human reader’s opinion together with an AI prediction for making a final decision instead of relying on opinions from two human readers as available in the standard of care human double reading (HDR).
In the proposed workflow of using AI as a supporting reader (AI-SR), the interaction between AI and human assessment is limited. Here, the first human reader’s opinion is the final “recall” or “no-recall” decision if it agrees with the AI’s prediction. In case of disagreement, the AI’s prediction is disregarded, and a second human reader makes an assessment. Thus, only human reader opinions are considered whenever arbitration is necessary (which is the case when the two human readers disagree). Similar to the AI-IR workflow, the AI-SR is simulated using the historical first and human reader’s opinion in this evaluation. Figure 1 illustrates the different workflows of HDR, AI-IR, and AI-SR compared in this study.
Statistical analysis
Performance of historical HDR and the simulated use of AI was measured in terms of recall rate (RR), cancer detection rate (CDR), sensitivity, specificity, and positive predictive value (PPV). For these metrics, bootstrapping was used to calculate 95% confidence intervals. Non-inferiority was defined to rule out a relative difference of more than 10% in the direction of reduced performance with a 97.5% confidence and an alpha of 2.5%. Superiority was tested when noninferiority was passed and was also based on the same confidence intervals and alpha. Operational performance in terms of workload reduction was assessed as arbitration rate (the rate of disagreement between the first and second readers) and number of cases requiring second human reading.
Results
Study population characteristics
Table 1 presents characteristics of the study population. Of the 280,594 total cases, there were 2783 (0.99%) positives overall (historically detected), with 2397 (0.85%) screen-detected positives (in-line with screening expectations) and 386 (0.18%) interval cancers (ICs). From those, 293 (0.10%) were three-year ICs for the UK sample, and 93 (0.03%) were two-year ICs for the Hungarian sample. A breakdown per clinical site is provided in the supplementary material (see Table S1).
Population characteristics
Cancer detection performance
Table 2 presents the average cancer detection performance separately for the UK and Hungarian sample in terms of RR, CDR, sensitivity, specificity, and PPV for the historical HDR and the simulated use of AI. Note, the cancer detection performance of the AI system in the simulation is the same for the AI-IR and AI-SR workflows.
Cancer detection performance per country
For the UK sample, the RR is 3.84% (95% CI 3.75 to 3.92) and CDR is 8.93% (8.52 to 9.35) for HDR, compared with a RR of 3.82% (3.73 to 3.91) and CDR of 8.71% (8.31 to 9.12) for AI-IR/SR. The sensitivity is 86.23% (84.81 to 87.72) with a specificity of 96.98% (96.83 to 97.13) for HDR, compared with a sensitivity of 84.09% (82.55 to 85.69) and specificity of 97.05% (96.91 to 97.19) for AI-IR/SR. The PPV is 23.28% (22.39 to 24.19) for HDR and 22.82% (21.92 to 23.70) for AI-IR/SR. The use of AI is non-inferior on RR, CDR, sensitivity, and PPV, and superior on specificity compared to HDR.
For the Hungarian sample, the RR is 11.75% (11.54 to 11.96) and CDR is 7.92% (7.35 to 8.51) for HDR, compared with a RR of 10.35% (10.16 to 10.55) and CDR of 7.83% (7.25 to 8.41) for AI-IR/SR. The sensitivity is 88.73% (86.33 to 90.88) with specificity of 94.45% (94.08 to 94.80) for HDR, compared with a sensitivity of 87.69% (85.25 to 89.85) and specificity of 95.58% (95.25 to 95.92) for AI-IR/SR. The PPV is 6.74% (6.27 to 7.23) for HDR and 7.57% (7.03 to 8.10) for AI-IR/SR. The use of AI is non-inferior on CDR and sensitivity, and superior on RR, specificity, and PPV.
A breakdown of the results per clinical site is provided in the supplementary material (see Table S2).
Operational performance
For the standard of care HDR, all cases are read by a first and second human reader from which 9,655 (3.40%) cases were referred to arbitration due to disagreement between the human readers. For the simulated AI-IR workflow where the AI is fully replacing the second human reader there were 35,199 (12.50%) cases referred to arbitration due to disagreement between the first human reader and the AI prediction. For the proposed AI-SR workflow, these AI discordant cases would be referred to a second human reader, from which 5,056 (1.80%) cases were referred to arbitration due to disagreement between the human readers. When comparing HDR with AI-SR, the simulation shows a potential operational benefit for AI-SR with a reduction in arbitration rate of 47.63% and a reduction in the number of cases requiring second human reading of 87.50%. The results are visually presented in Figure 2.
Workload in terms of number of cases for arbitration and second human reading
Discussion
The proposed novel workflow of using AI as a supporting reader (AI-SR) retains cancer detection performance compared to the historical human double reading (HDR) in a simulation using a large-scale screening population. Compared to using AI as an independent reader (AI-IR) replacing the second human reader, the AI-SR workflow resulted in a significantly lower arbitration rate (12.50% vs 1.80%), which was also lower than the arbitration rate of HDR (3.40%). While AI-IR requires no second human reading among the cases eligible for AI processing, its increase in arbitration rate compared to HDR had been previously raised as a concern for the use of AI in screening [14]. Arbitration is a more time-consuming process than individual screening reads, and AI-IR comes with the implication of assessing a human reader’s opinion together with an AI prediction in the arbitration process. The AI prediction may be less interpretable and specific training for arbitrators may be required. The impact of this change in the arbitration process is unclear which has led to concerns regarding the clinical deployment of AI-IR [14]. Here, the proposed AI-SR workflow mitigates this concern as only human reader opinion’s would be considered during arbitration while the arbitration rate remains low and the number of cases requiring second human reading is still drastically reduced.
A main limitation of the study is that all results for the cancer detection and operational performance for the use of AI are based on a simulation using historical data. The simulation is exact in the case of AI-SR, while for AI-IR it is an approximation as the second human reader’s opinion was used when historical arbitration was not available, which was the case in 85.60% of arbitration cases. As we would expect that the second reader performance is worse than true arbitration, the approximation for AI-IR is likely to provide a lower bound of the real world performance. A further assumption is made that the historical second human reader behaviour is the same in HDR and AI-SR. However, as the second human reader in the AI-SR workflow would only assess cases where the first human reader and the AI disagree, there could be a change in the case mix as the AI discordant cases might be generally more difficult to assess which may impact the reader’s performance. Additional training may be required to adapt human readers to this change in the screening pathway.
A key strength of the study is the use of a large-scale, unenriched screening population with participants from two countries and multiple clinical sites including mobile units, and imaging data acquired on machines from four hardware vendors. This is important for the generalisability of the simulation results and their translation to screening practice.
Compared to AI-IR, one downside of the AI-SR workflow is that it does not provide the opportunity to identify more cancers that may have been missed by HDR. However, an important benefit is the added safety net for the clinical deployment by providing the assurance that the standard of care of human double reading is enacted whenever the first human reader and the AI prediction disagree. This may not only positively impact the participants’ acceptance for the integration of AI into screening practice, but may also help with gaining support for setting up prospective and randomised controlled trials involving AI systems. Prospective studies are an important next step to obtain further evidence about the benefit of AI in breast cancer screening. These studies will need to be carefully designed to ensure patient safety while minimising clinical and operational risks. Here, the proposed workflow of using AI as supporting reader could be a viable solution.
Data Availability
The data for the current study are not publicly available. Due to reasonable privacy and ethical concerns, the imaging data cannot be distributed to researchers without ethical approval and research agreements with the original data providers.
Declaration of interests
This work was funded by Kheiron Medical Technologies Ltd (‘Kheiron’). A.Y.N., B.G., C.O., G.F., J.N., E.K., S.K., P.D.K. are employees of Kheiron and hold stock options as part of the standard compensation package.
Role of the funding source
The study sponsor, Kheiron Medical Technologies Ltd, was involved in study design; in the collection, analysis, and interpretation of data; in the writing of the report; and in the decision to submit the paper for publication.
Supplementary Material
Population characteristics per clinical site
Cancer detection performance per clinical site