DICOM Imaging Router: An Open Deep Learning Framework for Classification of Body Parts from DICOM X-ray Scans

Hieu H. Pham; Dung V. Do; Ha Q. Nguyen

doi:10.1101/2021.08.13.21261945

Abstract

X-ray imaging in Digital Imaging and Communications in Medicine (DICOM) format is the most commonly used imaging modality in clinical practice, resulting in vast, non-normalized databases. This is an obstacle to deploying artificial intelligence (AI) solutions for analyzing medical images, which often require identifying the right body part before feeding the image into a specified AI model. This challenge raises the need for an automated and efficient approach to classifying body parts from X-ray scans. Unfortunately, to the best of our knowledge, there is no open tool or framework for this task to date. To fill this gap, we introduce a DICOM Imaging Router that deploys deep convolutional neural networks (CNNs) for categorizing unknown DICOM X-ray images into five anatomical groups: abdominal, adult chest, pediatric chest, spine, and others. To this end, a large-scale X-ray dataset consisting of 16,093 images has been collected and manually classified. We then trained a set of state-of-the-art deep CNNs using a training set of 11,263 images. These networks were then evaluated on an independent test set of 2,419 images and showed superior performance in classifying the body parts. Specifically, our best performing model (i.e., MobileNet-V1) achieved a recall of 0.982 (95% CI, 0.977–0.988), a precision of 0.985 (95% CI, 0.975–0.989) and an F1-score of 0.981 (95% CI, 0.976–0.987), whilst requiring less computation for inference (0.0295 seconds per image). Our external validation on 1,000 X-ray images shows the robustness of the proposed approach across hospitals. These results indicate that deep CNNs can accurately and effectively differentiate human body parts from X-ray scans, thereby providing potential benefits for a wide range of applications in clinical settings. The dataset, code, and trained deep learning models from this study will be made publicly available on our project website at https://vindr.ai/datasets/bodypartxr.

1. Introduction

X-ray is the most commonly performed procedure in clinical practice. More than 600 million X-ray examinations are conducted yearly [3] for evaluating various parts of the human body such as the lungs, heart, bowel, and bones. In recent decades, many automatic medical image analysis systems, particularly deep learning-based systems, have been studied and deployed to support radiologists in interpreting X-ray scans. To date, 100 AI software products for clinical radiology [15] have been introduced. These systems are often developed for analyzing specific anatomies (e.g., lung, abdomen, spine) and often require the body part contained in the input image to be identified first. Vast, non-normalized databases of X-ray images from hospitals raise the need for an automated approach to classifying body parts from X-ray scans. An automatic system for accurate classification of body parts from X-ray scans helps identify the right input for AI systems. It is also a useful tool for data management at hospitals or medical centers. Several body part recognition systems, which relied on carefully hand-crafted features, have been introduced [1, 7]. In particular, machine learning-based algorithms [1, 12] have been applied and have shown superior performance on this task. We observed two limitations of the existing approaches. First, these methods were developed and tested on ImageCLEF 2015, a fairly small dataset with 500 training images and 250 test images. This fact raises concerns [10] about the robustness of the predictive models in real clinical contexts. Second, an automatic body part recognition system acts as an image router and therefore requires a near-perfect level of performance (close to 100%) in recognizing the images. Meanwhile, the existing approaches reported an accuracy of about 80%–85%, which is not reliable enough for deployment in real-world clinical settings.
Hence, this work aims to develop a highly accurate deep learning-based system for grouping unknown X-ray images into five anatomical groups: abdominal X-ray, adult chest X-ray, pediatric chest X-ray, spine X-ray, and others. To this end, a large-scale X-ray dataset consisting of 16,093 images was collected and manually classified. We then trained a set of state-of-the-art deep CNNs using a training set of 11,263 images. These networks were then evaluated on an independent test set of 2,419 images and showed superior performance in classifying the body parts while requiring less computation for inference. To summarize, the main contributions of this work are twofold:

We introduce and release a large-scale dataset for the classification of body parts from X-ray scans. The dataset contains 16,093 X-ray images in DICOM format, each of which was manually annotated with one of five anatomical groups: abdominal X-ray, adult chest X-ray, pediatric chest X-ray, spine X-ray, and others. To the best of our knowledge, this is the largest X-ray dataset for the human body part classification task to date. It will be opened for public access at https://vindr.ai/datasets/bodypartxr.
We develop a robust DICOM Imaging Router that uses a state-of-the-art deep CNN model to classify X-ray images based on the body part present in the image. Our experimental results show superior performance on an independent test set while requiring less computation for inference. The proposed system has potential benefits for a wide range of applications in clinical settings. It was made publicly available at https://github.com/vinbigdata-medical/DICOM-Imaging-Router for the community as an open deep learning framework that can be easily reused and fine-tuned.

2. Methodology

2.1. DICOM Imaging Router: System overview

An overview of the DICOM Imaging Router is illustrated in Figure 1. It is a deep learning-based classifier that accepts an unknown X-ray as input and classifies it into one of five groups: abdominal X-ray, adult chest X-ray, pediatric chest X-ray, spine X-ray, and others. From a practical point of view, a reliable DICOM Imaging Router should meet two essential requirements: (1) a nearly 100% classification accuracy, and (2) a low inference time. To achieve these goals, we collect and annotate a large-scale X-ray dataset. We then train a set of state-of-the-art lightweight CNN models. Mathematically, this is a supervised multi-class classification task that assigns a class label to each input example. We are given a training dataset of N labeled examples of the form {(x⁽ⁱ⁾, y⁽ⁱ⁾)}, where x⁽ⁱ⁾ ∈ ℝⁿ is the i-th X-ray example and y⁽ⁱ⁾ ∈ {1, …, K} is the i-th class label. Here, K denotes the number of classes.

Figure 1.

We develop a deep learning-based classifier for automatic recognition of body parts from X-ray scans. Given an unknown X-ray as input, the system is able to classify the scan into one of five groups: adult chest X-ray, pediatric chest X-ray, spine X-ray, abdominal X-ray, and others. In a simple practical scenario, each classified image can then be passed to the corresponding AI model.
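The routing scenario in the caption can be illustrated with a minimal sketch. It assumes the classifier outputs one probability per group; the class ordering, the `route` function, and the downstream handlers are hypothetical names introduced only for this illustration and are not taken from the released code.

```python
from typing import Callable, Dict, Sequence

# The five groups named in the paper; the index order is an assumption.
CLASS_NAMES = ["abdominal", "adult_chest", "pediatric_chest", "spine", "others"]


def route(probabilities: Sequence[float],
          handlers: Dict[str, Callable[[str], str]],
          image_id: str) -> str:
    """Send an image to the handler registered for its most probable class."""
    if len(probabilities) != len(CLASS_NAMES):
        raise ValueError("expected one probability per class")
    best = max(range(len(CLASS_NAMES)), key=lambda k: probabilities[k])
    label = CLASS_NAMES[best]
    # Classes without a dedicated downstream model (e.g. "others") are not forwarded.
    handler = handlers.get(label)
    if handler is None:
        return f"{image_id}: no model registered for '{label}'"
    return handler(image_id)


# Hypothetical downstream AI models, represented here by plain functions.
handlers = {
    "adult_chest": lambda image_id: f"{image_id}: sent to adult chest model",
    "spine": lambda image_id: f"{image_id}: sent to spine model",
}

result_chest = route([0.01, 0.95, 0.02, 0.01, 0.01], handlers, "img-001")
result_other = route([0.05, 0.05, 0.05, 0.05, 0.80], handlers, "img-002")
```

In a real deployment one might also reject low-confidence predictions instead of always taking the arg-max; the paper does not describe such a threshold, so none is used here.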

In this task, we aim to build a learning model f_θ that classifies new, unseen examples accurately [2]. This can be done by training a deep CNN that learns a non-linear mapping from the input x⁽ⁱ⁾ ∈ ℝⁿ to the corresponding output f_θ(x⁽ⁱ⁾) ∈ ℝ^K. One common way to train the network is to minimize the softmax cross-entropy loss over all N training examples. Here the standard softmax function σ : ℝ^K → [0, 1]^K is defined by the formula

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

for i = 1, …, K and z = (z₁, …, z_K) ∈ ℝ^K.
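As a small numerical illustration of the loss described above, the following sketch computes the softmax and the mean cross-entropy in plain Python. The logits are made-up numbers, and the labels are 0-based (0, …, K−1) rather than 1, …, K as in the text; in practice a framework routine such as PyTorch's `torch.nn.CrossEntropyLoss` would be used instead.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Map K raw scores to K probabilities that sum to 1."""
    m = max(z)  # subtracting the max does not change the result but avoids overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean negative log-probability of the true class over N examples."""
    losses = [-math.log(softmax(z)[y]) for z, y in zip(logits, labels)]
    return sum(losses) / len(losses)


# Toy logits for N = 2 examples and K = 5 classes (illustrative values only).
logits = [[2.0, 0.5, 0.1, -1.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]]
labels = [0, 3]
probs = softmax(logits[0])
loss = cross_entropy(logits, labels)
```

The second example has uniform logits, so its contribution to the loss is log K, the value a classifier with no information would incur.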

2.2. Data collection and annotation

The dataset used in the study was collected from the Picture Archiving and Communication System (PACS) of several major hospitals. The ethical clearance of this study was approved by the IRB of each hospital before any research activities began. All patient-identifiable information in the data has been removed. The need for obtaining informed patient consent was waived because this study did not impact clinical care or workflow at the hospital. We recruited a group of human readers to participate in our labeling process. Specifically, all X-ray scans were manually reviewed and classified case-by-case into five groups: abdominal X-ray, adult chest X-ray, pediatric chest X-ray, spine X-ray, and others. In particular, each example was manually classified in two rounds by two different readers. In total, 16,093 images were collected and manually categorized. We used a stratified random sampling method to divide the dataset into training, validation, and test sets with respective ratios of 0.7/0.15/0.15. As a result, 11,263 images were used to train the deep learning algorithms, while 2,411 and 2,419 images were used as validation and test sets, respectively, for evaluating the algorithms. Each image was then stored in PNG format and rescaled to 512 × 512 pixels. Table 1 below summarizes the datasets used in this study.
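A stratified 0.7/0.15/0.15 split of the kind described above could be sketched as follows. This is an assumption about the procedure, not the authors' code: the function name, the seed, the rounding rule, and the toy class counts are all hypothetical, and per-class rounding means the resulting sizes only approximate the target ratios.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str],
                     ratios: Tuple[float, float, float] = (0.7, 0.15, 0.15),
                     seed: int = 0) -> Dict[str, List[int]]:
    """Split example indices into train/val/test while keeping class proportions."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    split: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        split["train"] += indices[:n_train]
        split["val"] += indices[n_train:n_train + n_val]
        split["test"] += indices[n_train + n_val:]  # the remainder goes to test
    return split


# Toy label list with made-up class counts (not the paper's distribution).
labels = ["adult_chest"] * 100 + ["spine"] * 60 + ["others"] * 40
split = stratified_split(labels)
sizes = {name: len(idx) for name, idx in split.items()}
```

Splitting within each class keeps rare groups represented in all three subsets, which a plain random split does not guarantee.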

Table 1.

Details of training, validation, and test data sets used in this study. To the best of our knowledge, this is the largest X-ray dataset for human body part classification tasks to date.

2.3. Deep learning algorithms

To classify body parts from X-ray images, we exploited state-of-the-art, lightweight CNNs that have achieved remarkable performance on many image classification tasks, including MobileNet-V1 [6], MobileNet-V2 [13], ResNet-18 [5], ResNet-34 [5], and EfficientNet-B0/B1/B2 [14]. We followed the original implementations [6, 13, 5, 14] with minor modifications. Specifically, we replaced the last fully connected layer of each architecture with a new layer of 5 neurons, corresponding to the number of body part groups. During the training stage, we rescaled all training images to 512 × 512 pixels. All models were trained using the cross-entropy loss function with the Adam optimizer [8]. The initial learning rate was set to 1 × 10⁻⁴ and was then scheduled with simulated warm restarts [9]. All networks were trained for 100 epochs using PyTorch (v1.7.0) on a machine with one RTX 2080 Ti GPU.
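The warm-restart schedule of [9] anneals the learning rate along a cosine curve and periodically resets it to its maximum. The sketch below evaluates that schedule per epoch in plain Python. Only the maximum rate of 1 × 10⁻⁴ and the 100 epochs come from the text; the minimum rate, the first cycle length, and the cycle multiplier are assumptions, since the paper does not report them. PyTorch offers a comparable scheduler as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`.

```python
import math


def sgdr_learning_rate(epoch: int,
                       lr_max: float = 1e-4,
                       lr_min: float = 0.0,
                       first_cycle: int = 10,
                       cycle_mult: int = 2) -> float:
    """Cosine-annealed learning rate with warm restarts, evaluated per epoch."""
    cycle_len = first_cycle
    t_cur = epoch
    # Walk forward through the cycles until the one containing `epoch` is found.
    while t_cur >= cycle_len:
        t_cur -= cycle_len
        cycle_len *= cycle_mult
    cosine = math.cos(math.pi * t_cur / cycle_len)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


# Learning rate for each of the 100 training epochs under the assumed settings.
schedule = [sgdr_learning_rate(e) for e in range(100)]
```

With these assumed settings the rate restarts at epochs 10, 30, and 70, and each cycle is twice as long as the previous one.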

3. Experiments and Results

3.1. Experimental setup and evaluation metrics

We evaluated the performance of the proposed models on an internal test set (N = 2,419) and an external test set (N = 1,000) using precision, recall, F1-score, and mean inference time per image (in seconds, on GPU). Using the final predictions provided by the models and the ground truth labels, we calculated the true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) as defined in Table 3.
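For a multi-class problem, these four counts are usually obtained per class by treating one class as positive and all others as negative. The sketch below shows that one-vs-rest counting on made-up labels; whether the authors counted in exactly this way is an assumption, and the function name is hypothetical.

```python
from typing import Dict, Sequence


def one_vs_rest_counts(y_true: Sequence[int], y_pred: Sequence[int],
                       positive: int) -> Dict[str, int]:
    """Count TP, TN, FP and FN when `positive` is treated as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


# Made-up labels for eight images and K = 3 classes, purely for illustration.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]
counts_class1 = one_vs_rest_counts(y_true, y_pred, positive=1)
```

Repeating the call for each class yields the per-class counts from which class-wise precision and recall follow.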

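The precision, recall, and F1-score were then computed by

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$

For each measure, we estimated the 95% bootstrap confidence interval with 10,000 iterations.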
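A bootstrap confidence interval of this kind can be sketched as follows. Two assumptions are made that the paper does not state: the F1-score is macro-averaged over classes, and the interval is taken from the percentiles of the resampled statistics. The data are made up, and the sketch uses 500 iterations rather than 10,000 to keep it fast.

```python
import random
from typing import Callable, List, Sequence, Tuple


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], num_classes: int) -> float:
    """Unweighted mean of per-class F1-scores (an empty class scores 0)."""
    scores: List[float] = []
    for k in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


def bootstrap_ci(y_true: Sequence[int], y_pred: Sequence[int],
                 metric: Callable[[Sequence[int], Sequence[int]], float],
                 iterations: int = 500, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval: resample test cases with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * iterations)]
    upper = stats[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper


# Made-up test set: 60 cases, 3 classes, 4 misclassified cases.
y_true = [i % 3 for i in range(60)]
y_pred = list(y_true)
for i in (4, 17, 33, 50):
    y_pred[i] = (y_true[i] + 1) % 3

point = macro_f1(y_true, y_pred, 3)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, lambda t, p: macro_f1(t, p, 3))
```

Resampling whole test cases keeps each prediction paired with its label, so the interval reflects variability in the test sample rather than in the model.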

3.2. Model performance on internal test set

Table 2 summarizes quantitative results for all the classification models. Deep CNNs showed excellent performance on the 2,419 images of the internal test set. Specifically, our best performing model (i.e., MobileNet-V1 [6], 3.2M parameters) achieved a recall of 0.982 (95% CI, 0.977–0.988), a precision of 0.981 (95% CI, 0.975–0.987) and an F1-score of 0.981 (95% CI, 0.976–0.987), whilst requiring less computation for inference (0.0295 seconds per image).

Table 2.

Classification performance of different network architectures on the test set. Inference time (in seconds) is measured on a machine with an RTX 2080 Ti GPU. Best results are in red.

Table 3.

Confusion matrix

3.3. Model performance on external test set

Domain shift across different hospital settings is the main obstacle to transferring deep learning models into clinical practice [11]. It can result in poor generalization and decreased accuracy [4]. To investigate the generalization ability of the proposed approach across multiple data sources, we performed an external validation test on 1,000 X-ray images collected from another patient cohort. The best-performing model, MobileNet-V1 [6], was used for this experiment. It achieved a recall of 0.9712, a precision of 0.9738, and an F1-score of 0.9725. This high classification accuracy shows the robustness of the system across different patient cohorts, scanner vendors, and imaging protocols without additional training cost.

4. Conclusions

This work developed and validated a deep learning-based DICOM Imaging Router to classify body parts from X-ray images. A benchmark dataset of 16,093 X-ray images of body parts has been introduced. Experiments demonstrated the effectiveness of the proposed method. The DICOM Imaging Router can be applied to many real-world tasks in radiology. For example, it can be integrated into a PACS to help radiologists find and classify X-ray images quickly and accurately for interpretation. The system can also serve as a pre-filter for other AI applications. Our trained models and the dataset used in this study will be opened for further development and deployment. For future work, we plan to conduct more experiments and evaluate the impact of the proposed framework in real-world clinical settings.

Data Availability

The dataset used in this study will be made publicly available from our project website at https://vindr.ai/.

References

[1] Moshe Aboud, Assaf B Spanier, and Leo Joskowicz. Automatic classification of body parts X-ray images. In CLEF (Working Notes), 2015.
[2] Mohamed Aly. Survey on multiclass classification methods. Neural Netw, 19:1–9, 2005.
[3] Cathrine Christiansen. X-ray contrast media: An overview. Toxicology, 209(2):185–187, 2005.
[4] Hao Guan and Mingxia Liu. Domain adaptation for medical image analysis: A survey. arXiv preprint arXiv:2102.09508, 2021.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[6] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[7] Vincent Jeanne, Devrim Unay, and Vincent Jacquet. Automatic detection of body parts in X-ray images. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 25–30, 2009.
[8] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[10] Luke Oakden-Rayner. Exploring large-scale public medical image datasets. Academic Radiology, 27(1):106–112, 2020.
[11] Eduardo HP Pooch, Pedro L Ballester, and Rodrigo C Barros. Can we trust deep learning models diagnosis? The impact of domain shift in chest radiograph classification. arXiv preprint arXiv:1909.01940, 2019.
[12] Sanad Saha, Asif Mahmud, Amin Ahsan Ali, and Md Ashraful Amin. Classifying digital X-ray images into different human body parts. In International Conference on Informatics, Electronics and Vision, pages 67–71, 2016.
[13] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[14] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114, 2019.
[15] Kicky G van Leeuwen, Steven Schalekamp, Matthieu JCM Rutten, Bram van Ginneken, and Maarten de Rooij. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. European Radiology, 31(6):3797–3804, 2021.