ABSTRACT
Objectives To determine the impact of segmentation techniques on radiomic features extracted from ultrahigh-field (UHF) MRI of the brain.
Materials and Methods Twenty-one 7T MRI scans of the brain, including a 3D magnetization-prepared two rapid acquisition gradient echo (MP2RAGE) T1-weighted sequence with an isotropic 0.63 mm³ voxel size, were analyzed. Radiomic features (histogram, texture, and shape; total n=101) from six brain regions -cerebral gray and white matter, basal ganglia, ventricles, cerebellum, and brainstem-were extracted from segmentation masks constructed with four different techniques: the iGT (reference standard), based on a custom pipeline that combined automatic segmentation tools and expert reader correction; the deep-learning algorithm Cerebrum-7T; the Freesurfer-v7 software suite; and the Nighres algorithm. Principal components (PCs) were calculated for histogram and texture features. To test the reproducibility of radiomic features, intraclass correlation coefficients (ICC) were used to compare Cerebrum-7, Freesurfer-v7, and Nighres to the iGT, respectively.
Results For histogram PCs, median ICCs for Cerebrum-7T, Freesurfer-v7, and Nighres were 0.99, 0.42, and 0.11 for the gray matter; 0.84, 0.25, and 0.43 for the basal ganglia; 0.89, 0.063, and 0.036 for the white matter; 0.84, 0.21, and 0.33 for the ventricles; 0.94, 0.64, and 0.93 for the cerebellum; and 0.78, 0.21, and 0.53 for the brainstem. For texture PCs, median ICCs for Cerebrum-7T, Freesurfer-v7, and Nighres were 0.95, 0.21, and 0.15 for the gray matter; 0.70, 0.36, and 0.023 for the basal ganglia; 0.91, 0.25, and 0.023 for the white matter; 0.80, 0.75, and 0.59 for the ventricles; 0.95, 0.43, and 0.86 for the cerebellum; and 0.72, 0.39, and 0.46 for the brainstem. For shape features, median ICCs for Cerebrum-7T, FreeSurfer-v7, and Nighres were 0.99, 0.91, and 0.36 for the gray matter; 0.89, 0.90, and 0.13 for the basal ganglia; 0.98, 0.91, and 0.027 for the white matter; 0.91, 0.91, and 0.36 for the ventricles; 0.80, 0.68, and 0.47 for the cerebellum; and 0.79, 0.17, and 0.15 for the brainstem.
Conclusions Radiomic features in UHF MRI of the brain show substantial variability depending on the segmentation algorithm. The deep learning algorithm Cerebrum-7T enabled the highest reproducibility. Dedicated software tools for UHF MRI may be needed to achieve more stable results.
INTRODUCTION
Ultrahigh-field (UHF) whole-body MRI scanners operating at 7T have recently been introduced into clinical practice following their initial CE certification. The higher field strength offers several advantages over standard 1.5 and 3T scanners, including increased spatial resolution and tissue contrast [1], which enables improved visualization of subcentimeter anatomic tissue properties. However, 7T MRI also has drawbacks, such as more pronounced artifacts at tissue-air interfaces (e.g., at the skull base and orbits) and secondary to motion, and substantial B0 and B1 field inhomogeneity [1].
Inhomogeneity and data size at 7T are also a problem for the application of segmentation tools that were developed using MR images obtained at lower field strengths. Such tools are now widely used in MRI research, for example to map functional MRI data to underlying brain structures. To address this problem in the area of neuroimaging, the deep-learning algorithm Cerebrum-7T has recently been proposed for fast and fully automatic segmentation of major brain structures on 7T images [2]. Cerebrum-7T showed good performance when compared to a partly automatic and partly manual, de-facto reference standard, and, in few cases, also to a purely manual segmentation. Cerebrum-7T also performed favorably compared to other commonly used segmentation tools [2].
The prior evaluation of Cerebrum-7T used three measures of segmentation similarity for evaluation of segmentation accuracy, with an emphasis on the Dice coefficient [2], but did not explore the impact of differences in segmentation masks on radiomic features at 7T that capture intra-volume properties such as signal heterogeneity. This topic is of interest because radiomic features –which are now widely used for the analysis of both focal brain lesions and diffuse abnormalities within the brain [3–9]– are sensitive to the type of segmentation method [10–14].
Based on the hypothesis that differences in segmentation masks would have a particularly strong effect on histogram-, texture-, and shape-based radiomic features at 7T, we reanalyzed the publicly accessible, held-out dataset that was used to test Cerebrum-7T. It was our aim to compare the reproducibility of radiomic features extracted from segmentation masks produced by Cerebrum-7T and two other segmentation tools, relative to partly manual reference standard segmentation.
MATERIALS AND METHODS
Subjects and design
We retrospectively analyzed 21 T1-weighted MRI scans of the brain from the Glasgow test dataset of the Cerebrum-7T study that are publicly available in the repository of the project (URL: https://rocknroll87q.github.io/cerebrum7t/). No Institutional Review Board or Ethics Committee approval was required as no patient identifiable information was accessed, and all data used for the analysis was sourced from a publicly available repository.
MRI protocol
A 3D magnetization-prepared two rapid acquisition gradient echo (MP2RAGE) T1-weighted MRI sequence of the brain was obtained on a 7 Tesla MRI scanner (Siemens 7T Terra Magnetom) equipped with a 32-channel head coil, as previously described by Svanera et al [2]. Acquisition parameters were: repetition/echo time (TR/TE), 4680 ms/2.07 ms; inversion times (TI), 840 and 2370 ms; flip angles 5° and 6°; voxel size 0.63 x 0.625 x 0.625 mm³ with a base resolution was 384; FOV 204 x 220 x 178.5 mm. MP2RAGE is a commonly used sequence at 7T, because it is free of proton density and T2 contrasts and reduces magnetic field inhomogeneity effects while providing a high contrast-to-noise ratio [15]. The de-identified images are available for research purposes from https://rocknroll87q.github.io/cerebrum7t/data.
Image analysis
We used the 3D Slicer software package (https://www.slicer.org/) version 5.4.0 [16] for all image analyses. Segmentations of six anatomic structures, as provided in the project repository –gray matter, basal ganglia, white matter, ventricles, cerebellum, and brainstem– which were obtained using the following four previously described segmentation techniques, were used:
iGT (“inaccurate ground truth”): A custom segmentation pipeline, used as the de-facto reference standard for the other three segmentation techniques. The iGT relied on two processing branches: one for white and gray matter segmentation, using a combination of the AFNI-3dSeg algorithm and a geometric and clustering technique [17, 18]; and one for specific segmentation of basal ganglia, ventricles, brainstem and cerebellum, using a combination of denoising and Freesurfer-v6 [19, 20]. Results of the two processing branches were then combined, and manual supervision and correction of the segmentations by an expert rater was performed as the final step, to ensure satisfactory results [2].
Cerebrum-7T: A fully automatic, deep-learning segmentation algorithm (deep encoder/decoder network with three layers) based on the original Cerebrum algorithm developed for 3T data [21]. Using the whole MRI volumes as network input, the convolutional neural network (CNN) was specifically trained on 7T MRI data (Glasgow training dataset), using the iGT segmentations as a reference standard [2]. The Cerebrum-7T code is available for download from https://github.com/rockNroll87q/cerebrum7t.
Freesurfer-v7: Version 7 of the Freesurfer image analysis suite, which is documented and freely available for download at http://surfer.nmr.mgh.harvard.edu and was utilized in a large number of neuroimaging studies, including several using 7T MRI [22–24]. Partly built on atlas information, this tool offers improvements for UHF image segmentation compared to prior versions [25].
Nighres: A Python-based toolbox building on the quantitative and high-resolution image-processing capabilities of the CBS High-Res Brain Processing Tools software suite, which was designed to specifically deal with submillimeter resolution MRI [26, 27]. Nighres is available for download from https://github.com/nighres/nighres.
Based on these segmentations, the PyRadiomics software module for 3D Slicer was used (available from https://github.com/AIM-Harvard/SlicerRadiomics) to extract a total of 101 three-dimensional radiomic features from each of the six brain regions, separately for each of the four segmentation algorithms: 18 based on the gray-level histogram; 74 texture features (including 24 from the co-occurrence matrix, 16 from the run-length matrix, 16 from the size-zone matrix, 13 from the gray-level dependence matrix, and 5 from the neighboring gray-tone difference matrix), and nine shape features (Elongation, Flatness, Maximum diameter, Major axis length, Minor axis length, Least axis length, Mesh volume, Sphericity, Surface area) [28]. The full list of features and equations is available at https://pyradiomics.readthedocs.io/en/latest/features.html. As preprocessing steps, intensity discretization using a fixed number of 64, and Z-score intensity normalization were performed. No spatial resampling was performed; the original isotropic voxel size of 0.63 mm³, which was identical for all included scans, was used.
Statistical analysis
Given the large number of radiomic features, many of which may be highly correlated, dimensionality and data redundancy reduction by principal component analysis (PCA, based on Eigenvalues >1, with a maximum of 25 iterations for convergence) was applied, independently to the 18 gray-level histogram features and to the 74 texture features, across all brain structures and algorithms. No PCA was performed for the nine shape features. To test the reproducibility of radiomic features of the three classes, intraclass correlation coefficients (ICC) using a two-way mixed-effects model for absolute agreement were used to compare Cerebrum-7, Freesurfer-v7, and Nighres to the iGT, respectively. ICCs were interpreted as previously recommended: <0.5, poor; 0.5-0.75, moderate; 0.76-0.9, good; and >0.90, excellent reproducibility [29]. Analyses were performed using SPSS 28.0 (IBM Corp., Chicago, IL, USA).
RESULTS
Feature extraction was successful for all segmentations (see Fig. 1). PCA identified five principal components (PC) for the gray-level histogram (originally 18 features), and seven PCs across all texture feature categories (originally 74 features).
Comparison of the four segmentation techniques. While segmentation maps for gray matter (green), basal ganglia (yellow), white matter (brown), ventricles (blue), cerebellum (purple), and brainstem (light blue) were overall quite similar for the iGT, Cerebrum-7T and Freesurfer-v7 – although the inferior portion of the brainstem was missed by Freesurfer-v7– Nighres segmentation maps clearly differ, most prominently for the white matter.
Histogram and texture features extracted from the Cerebrum-7T segmentations generally showed clearly better reproducibility than those extracted from Freesurfer-v7 and Nighres segmentations, relative to features extracted from iGT segmentations (see Tables 1 and 2, Fig. 2). Reproducibility of features from Cerebrum-7T segmentations was excellent with ICCs >0.9 for 14/30 histogram PCs, and for 18/42 texture PCs (Tables 1 and 2); whereas it was poor (ICC<0.5) for only 2/42 texture PCs, and for none of the histogram PCs. Conversely, few histogram and texture PCs from Freesurfer-v7 and Nighres segmentations showed excellent, and many PCs, poor reproducibility, relative the iGT. For histogram PCs, median ICCs for Cerebrum-7T, Freesurfer-v7, and Nighres were 0.99, 0.42, and 0.11 for the gray matter; 0.84, 0.25, and 0.43 for the basal ganglia; 0.89, 0.063, and 0.036 for the white matter; 0.84, 0.21, and 0.33 for the ventricles; 0.94, 0.64, and 0.93 for the cerebellum; and 0.78, 0.21, and 0.53 for the brainstem. For texture PCs, median ICCs for Cerebrum-7T, Freesurfer-v7, and Nighres were 0.95, 0.21, and 0.15 for the gray matter; 0.70, 0.36, and 0.023 for the basal ganglia; 0.91, 0.25, and 0.023 for the white matter; 0.80, 0.75, and 0.59 for the ventricles; 0.95, 0.43, and 0.86 for the cerebellum; and 0.72, 0.39, and 0.46 for the brainstem.
Exemplary scatter plots for two evaluated brain regions, based on the respective first histogram and texture principal components (PC). For the gray matter, Cerebrum-7T and iGT measurements show a considerable overlap, and a partial overlap with Freesurfer-v7 data, whereas the Nighres cluster is clearly separate. On the other hand, for the brainstem, there is less overlap between iGT measurements and those based on the other three segmentation techniques, including Cerebrum-7T.
Shape features extracted from the Cerebrum-7T segmentations again showed the overall best reproducibility relative to the iGT, but differences to the two other segmentation techniques were less pronounced, especially to Freesurfer-v7, which showed better reproducibility for several individual features (Table 3). For the feature Sphericity, reproducibility was poor for all three segmentation tools. Median shape feature ICCs for Cerebrum-7T, FreeSurfer-v7, and Nighres were 0.99, 0.91, and 0.36 for the gray matter; 0.89, 0.90, and 0.13 for the basal ganglia; 0.98, 0.91, and 0.027 for the white matter;; 0.91, 0.91, and 0.36 for the ventricles; 0.80, 0.68, and 0.47 for the cerebellum; and 0.79, 0.17, and 0.15 for the brainstem.
DISCUSSION
The results of our study demonstrate that the fully automatic Cerebrum-7T deep-learning segmentation algorithm, specifically developed for 7T data, achieved the overall highest reproducibility of radiomic features across the different brain regions, relative to the iGT. This was especially true for histogram and texture features, and most prominently visible in the cerebral gray and white matter as well as the cerebellum, where many PCs showed ICCs of >0.9 (see Tables 1 and 2). Contrary to that, both Freesurfer-v7 and Nighres performed poorly for the majority of evaluated PCs, and across all segmented brain regions. While this is not particularly surprising for Nighres, for which Svanera et al. reported low Dice coefficients especially for the gray and white matter, it is quite surprising for Freesurfer-v7, which had previously demonstrated good segmentation performance for most brain regions [2]. These results provide further proof that, within the radiomics workflow, segmentation is a critical step where minor differences in segmentation masks can have a profound effect on calculated feature values. This is also underscored by the fact that even Cerebrum-7T, although trained directly on iGT segmentations, showed only moderate-to-good histogram and texture feature reproducibility in the brainstem, and partly also in the ventricles (although the clinical value of signal characteristics obtained from the ventricles is probably limited).
Shape feature results, on the other hand, differed in several aspects from the above-described histogram and texture feature results. While Nighres again performed quite poorly for the majority of features, with only few exceptions, feature reproducibility differences between Cerebrum-7T and Freesurfer-v7 segmentations were less pronounced, and for some features and anatomic regions, minimal or nonexistent (see Table 3). Axis length measurements, which may also be used clinically, for example to assess cortical thickness, showed mostly good to excellent reproducibility across brain regions for both Cerebrum-7T and Freesurfer-v7. Only for the brainstem was Cerebrum-7T clearly superior to both Freesurfer-v7 (and also Nighres). Surprisingly, for the feature Sphericity, which is a descriptor of roundness of the segmented shape relative to a sphere, there was practically no correlation between the iGT and all three segmentation methods evaluated, the reason for which is unclear.
Svanera et al. originally used three metrics, which were previously recommended in the MICCAI MRBrainS18 challenge to benchmark segmentation algorithms, for comparison of Cerebrum-7T to the reference standard: the Dice coefficient to assess the overlap between segmentations; the 95th percentile of the Hausdorff Distance to assess the proximity between segmentation contours; and the Volumetric Similarity as a non-overlap-based metric to assess volumetric similarity [30]. While these well-established metrics capture the quality of the segmentation, they do not assess quantitative differences between segmentations in terms of voxel-based gray-level statistics and patterns contained therein. This information, which is captured by radiomic features, is currently evaluated in a wide array of diseases of the brain, with recent MRI applications including detection of cognitive impairment [8], distinguishing between active and chronic multiple myeloma lesions [9], EGFR and HER2 status prediction in adenocarcinoma brain metastases [5], corticospinal tract involvement in glioma [4], and prediction of local tumor control in patients with brain metastases following postoperative radiotherapy [6].
Previous studies have demonstrated that the results of radiomics-based classification are generally improved at higher spatial resolution [31, 32], which is one of the key strengths of UHF MRI over MRI at lower field strengths. However, voxel intensity inhomogeneity and artifacts at 7T may negatively impact feature values, and in particular, their reproducibility. Based on prior MRI research in different areas of the body, including the brain, which already showed that radiomic feature are sensitive to variations in segmentation [10–14], we hypothesized that differences in segmentation masks would have a particularly strong impact of radiomic features at 7T. Such major differences were indeed observed in our study, in particular for segmentation tools that were not optimized for the specific type of MRI dataset.
Our study has several limitations, including first the use of the iGT as a reference standard. The reason is that Cerebrum-7T was trained using the iGT –although on a different dataset that was not used in this study– and that therefore, Cerebrum-7T segmentations may more closely resemble those of the iGT than segmentations created by the two other methods, which were independent of the iGT. On the other hand, the iGT included a manual, human expert-level correction step at the end, and therefore, was the logical choice of reference standard out of the available segmentation techniques. Another limitation was our inability to directly compare the reproducibility data at 7T to a matching dataset obtained at 3T or 1.5T, because no such data was available in the repository. Finally, the analysis only included a single, T1-weighted MRI pulse sequence, whereas clinical as well as research protocols also include other sequences, such as fluid-attenuated inversion recovery (FLAIR) and/or T2-weighted sequences.
In conclusion, the results of our study demonstrate that the deep learning-based segmentation algorithm Cerebrum-7T, which was specifically trained on 7T data, provided a higher degree of radiomic feature reproducibility than other available segmentation tools. These findings support the notion that, for the processing of UHF MRI data, dedicated analysis tools such as segmentation algorithms may need to be developed, especially since 7T MR scanners have now entered the clinical stage. Our results also provide further evidence that histogram, texture, as well as shape features are highly sensitive to the choice of segmentation technique, and comparison between results of different studies must take this factor into account.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
https://search.kg.ebrains.eu/instances/Dataset/2b24466d-f1cd-4b66-afa8-d70a6755ebea
Footnotes
Conflicts of Interest and Source of Funding: M.E.M. received honoraria for lectures from Siemens, General Electric, and Bristol Myers Squibb. The other authors declare no potential conflicts of interest.