Estimating the model uncertainty of artificial intelligence (AI)-based breast cancer detection algorithms could help guide the reading strategy in breast cancer screening. For example, the recall decision could be made solely by AI when it exhibits high certainty, while low-certainty cases should be read by radiologists. This study aims to evaluate two metrics for predicting the model uncertainty of a lesion characterization network: 1) the variance of a set of outputs generated with stochastic layer depth, and 2) the entropy of the average output. To test these approaches, 367 mammography exams with cancer (333 screen-detected and 34 interval) and 367 cancer-negative exams from the Dutch Breast Cancer Screening Program were included. Using a commercial lesion detection algorithm operating at high sensitivity, 6,477 suspicious regions were included (14.1% labeled malignant). By varying the uncertainty threshold, a specified proportion of the predictions was classified as certain and the remainder as uncertain. Double reading by radiologists had a sensitivity of 90.9% (95% CI 89.0% – 92.7%) and a specificity of 93.8% (95% CI 93.2% – 96.2%) for all regions. At equal specificity, the network had a sensitivity of 92.1% (95% CI 89.9% – 94.0%) for all regions. The sensitivity of the network was higher for regions with low uncertainty for both approaches; for the top 50% most certain regions, the sensitivity was 96.9% (95% CI 94.7% – 98.4%) and 97.1% (95% CI 94.9% – 98.8%) at a specificity equal to that of the radiologists. In conclusion, the AI-based lesion classification uncertainty of breast regions can be estimated by applying stochastic layer depth during prediction.
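To make the two uncertainty metrics concrete, the following is a minimal, hypothetical sketch rather than the study's implementation: several stochastic forward passes are collected per region, and the variance of the outputs and the entropy of the averaged output are computed. The stand-in classifier, its dropout layer (used here only as a runnable substitute for stochastic layer depth), and the 64-dimensional region inputs are illustrative assumptions.

```python
# Sketch of test-time uncertainty estimation (not the study's code): run several
# stochastic forward passes and compute (1) the variance of the outputs and
# (2) the entropy of the averaged output. Dropout stands in for stochastic
# layer depth purely so the example runs; in the described approach the
# randomness comes from randomly dropping layers at prediction time.
import torch
import torch.nn as nn

def mc_uncertainty(model, regions, n_samples=20):
    model.train()  # keep the stochastic layers sampling during prediction
    with torch.no_grad():
        # (n_samples, batch) malignancy probabilities, one set per stochastic pass
        probs = torch.stack(
            [torch.sigmoid(model(regions)).squeeze(-1) for _ in range(n_samples)]
        )
    mean_prob = probs.mean(dim=0)   # averaged output per region
    variance = probs.var(dim=0)     # metric 1: variance of the outputs
    eps = 1e-7
    entropy = -(mean_prob * torch.log(mean_prob + eps)
                + (1 - mean_prob) * torch.log(1 - mean_prob + eps))  # metric 2
    return mean_prob, variance, entropy

# Hypothetical stand-in classifier for 64-dimensional region feature vectors.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2), nn.Linear(32, 1))
mean_prob, variance, entropy = mc_uncertainty(model, torch.randn(8, 64))
# Regions whose variance (or entropy) falls below a chosen threshold could be
# handled by AI alone; the rest would be forwarded to radiologists.
```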
We aim to investigate whether ordering mammograms based on texture features promotes visual adaptation, allowing observers to detect abnormalities in screening mammograms more correctly and/or more rapidly, thereby improving performance. A fully crossed, multi-reader multi-case evaluation with 150 screening mammograms (1:1, positive:negative) and 10 screening radiologists was performed to test three different orderings of the mammograms: random order, ordered by Volpara density (low to high), or ordered by a self-supervised learning (SSL) encoding. A level of suspicion (0–100) score and a recall decision were given per examination by each radiologist. The area under the receiver operating characteristic curve (AUC), sensitivity, and specificity were compared between ordering conditions using the open-access iMRMC software. Median reading times were compared with the Wilcoxon signed rank test. The radiologist-averaged AUC was higher when mammograms were interpreted from low to high density than when they were interpreted in random order (0.936 vs 0.924, P=0.013). At similar sensitivity (80.4% vs 79.9%, P=0.846), the radiologist-averaged specificity for the density order tended to be higher than for the random order (91.2% vs 87.3%, P=0.047), with a shorter median reading time (25.1 vs 29.3 seconds, P<0.001). For the SSL order, no significant difference from the random order was found in screening performance (AUC 0.914 vs 0.924, P=0.381) or reading time (both 29.3 seconds, P=0.221). In conclusion, this study suggests that ordering screening mammograms from low to high density enables radiologists to improve their screening performance. Studies within a screening setting are needed to confirm these findings.
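As a rough illustration of the ordering conditions and the reading-time comparison, the sketch below sorts a hypothetical worklist from low to high density score and applies the Wilcoxon signed rank test to paired per-exam reading times. All exam entries and numbers are fabricated placeholders; the actual study used Volpara density scores and the iMRMC software for the multi-reader ROC analysis.

```python
# Hypothetical worklist ordering and paired reading-time comparison
# (fabricated numbers for illustration only, not study data).
from scipy.stats import wilcoxon

exams = [
    {"exam_id": "E01", "density": 3.4,  "time_random_s": 31.0, "time_density_s": 24.5},
    {"exam_id": "E02", "density": 12.8, "time_random_s": 28.2, "time_density_s": 26.1},
    {"exam_id": "E03", "density": 7.1,  "time_random_s": 30.5, "time_density_s": 25.0},
    {"exam_id": "E04", "density": 5.6,  "time_random_s": 27.8, "time_density_s": 26.9},
    {"exam_id": "E05", "density": 9.9,  "time_random_s": 33.1, "time_density_s": 27.3},
]

# Density condition: present exams sorted from low to high density score.
density_order = [e["exam_id"] for e in sorted(exams, key=lambda e: e["density"])]
print("Reading order:", density_order)

# Paired comparison of reading times (random order vs density order).
stat, p_value = wilcoxon([e["time_random_s"] for e in exams],
                         [e["time_density_s"] for e in exams])
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.3f}")
```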
KEYWORDS: Image segmentation, Education and training, Muscles, Image processing, Breast, Mammography, Data modeling, Atomic force microscopy, Deep learning, Digital mammography
Purpose: We developed a segmentation method suited for both raw (for processing) and processed (for presentation) digital mammograms (DMs) that is designed to generalize across images acquired with systems from different vendors and across the two standard screening views. Approach: A U-Net was trained to segment mammograms into background, breast, and pectoral muscle. Eight different datasets were used, including two previously published public sets and six sets of DMs from as many different vendors, totaling 322 screen-film mammograms (SFMs) and 4,251 DMs (2,821 raw/processed pairs and 1,430 processed only) from 1,077 different women. Three experiments were done: first, training on all SFM and processed images; second, also including all raw images in training; and finally, testing vendor generalization by leaving one dataset out at a time. Results: The model trained on SFM and processed mammograms achieved good overall performance regardless of projection and vendor, with a mean (± std. dev.) Dice score of 0.96 ± 0.06 for all datasets combined. When raw images were included in training, the mean (± std. dev.) Dice score was 0.95 ± 0.05 for the raw images and 0.96 ± 0.04 for the processed images. Testing on a dataset with processed DMs from a vendor that was excluded from training resulted in a difference in mean Dice score ranging from −0.23 to +0.02 relative to that of the fully trained model. Conclusions: The proposed segmentation method yields accurate overall segmentation results for both raw and processed mammograms, independent of view and vendor. The code and model weights are made available.
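A per-class Dice score of the kind reported above can be computed as in the short sketch below; this is an assumed illustration, not the paper's evaluation code, and presumes integer label maps with classes 0 = background, 1 = breast, and 2 = pectoral muscle.

```python
# Per-class Dice score for multi-class mammogram segmentation (illustrative).
import numpy as np

def dice_per_class(pred, target, n_classes=3, eps=1e-7):
    """pred, target: integer label maps of identical shape."""
    scores = []
    for c in range(n_classes):
        p = (pred == c)
        t = (target == c)
        intersection = np.logical_and(p, t).sum()
        scores.append((2.0 * intersection + eps) / (p.sum() + t.sum() + eps))
    return scores

# Toy example with random 256x256 label maps.
rng = np.random.default_rng(0)
pred = rng.integers(0, 3, size=(256, 256))
target = rng.integers(0, 3, size=(256, 256))
print(dice_per_class(pred, target))  # [dice_background, dice_breast, dice_pectoral]
```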
Segmentation of digital mammograms (DMs) into background, breast, and pectoral muscle is an important pre-processing step for many medical imaging pipelines. Our aim is to propose a segmentation method suited for processed DMs that generalizes across cranio-caudal (CC) and medio-lateral oblique (MLO) projections and across models of different vendors. A dataset of 247 diagnostic DM exams was used, totaling 493 CC and 494 MLO processed images, of which 199 (40.4%) and 486 (98.4%), respectively, contained a pectoral muscle. The images were acquired with 10 different DM models from GE (73%) and Siemens (27%). The multi-class segmentation was done by a U-Net trained with a multi-class weighted focal loss. Several types of data augmentation were used during training to generalize across model types, including random look-up table, random elastic, and random gamma transformations. The Dice coefficients for the segmentations were (mean ± std. dev.) 0.995 ± 0.005, 0.980 ± 0.016, and 0.839 ± 0.243 for background, breast, and pectoral muscle, respectively. Background segmentation did not differ significantly between CC and MLO images. The pectoral muscle segmentation resulted in a higher Dice coefficient for MLO (0.932 ± 0.104) than for CC images (0.636 ± 0.323). The false positive rate of pectoral muscle segmentation was 1.5% in CC images without any pectoral muscle. The mean overall Dice coefficient ranged from 0.985 to 0.990 across the different system models. The developed method yielded accurate overall segmentation results, independent of view, and generalized well over mammograms acquired by systems of different vendors.
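The multi-class weighted focal loss mentioned above could be implemented along the lines of the following assumed PyTorch sketch; the class weights and focusing parameter gamma are placeholders, since the study's exact settings are not given here.

```python
# Assumed sketch of a multi-class weighted focal loss; weights and gamma are
# placeholders, not the study's values.
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, target, class_weights, gamma=2.0):
    """logits: (N, C, H, W) raw U-Net outputs; target: (N, H, W) class indices."""
    log_probs = F.log_softmax(logits, dim=1)
    # Per-pixel weighted cross-entropy, kept unreduced for the focal modulation.
    ce = F.nll_loss(log_probs, target, weight=class_weights, reduction="none")
    # Probability assigned to the true class of each pixel.
    pt = log_probs.exp().gather(1, target.unsqueeze(1)).squeeze(1)
    # Down-weight pixels that are already classified confidently.
    return ((1.0 - pt) ** gamma * ce).mean()

# Toy usage: 3 classes (background, breast, pectoral muscle), with the
# under-represented pectoral class weighted more heavily (placeholder weights).
weights = torch.tensor([0.2, 0.3, 0.5])
logits = torch.randn(2, 3, 64, 64)
target = torch.randint(0, 3, (2, 64, 64))
print(weighted_focal_loss(logits, target, weights))
```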