Numerous deep learning models can detect and classify imaging findings with performance that rivals human radiologists. However, according to a new study published in the Journal of the American College of Radiology, many of these AI models are far less impressive when applied to external data sets.
“This potential performance uncertainty raises the concern of model generalization and validation, which needs to be addressed before the models are rushed to real-world clinical practice,” wrote first author Xiaoqin Wang, MD, of the University of Kentucky in Lexington, and colleagues.
The authors explored the performance of six deep learning models for breast cancer classification: three that had been previously published by other researchers and three they designed themselves. Five of the AI models—including all three designed for this study—used transfer learning, which “pretrains models on the natural image domain and transfers the models to another imaging domain later.” The final model used instance-based learning, a “widely used deep learning method for the object detection with proven success in multiple image domains.”
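The pretrain-then-transfer pattern the authors describe can be illustrated with a deliberately tiny NumPy sketch: a small network is first fit on a “source” task (standing in for natural images such as ImageNet), then its feature-producing layers are frozen and only a new output head is fit on a “target” task (standing in for mammograms). All names, network sizes, and synthetic data here are illustrative assumptions, not the study's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny two-layer net: a "backbone" (W1) producing features, and a linear "head" (w2).
def train(X, y, W1, w2, lr=0.5, steps=500, freeze_backbone=False):
    for _ in range(steps):
        h = np.tanh(X @ W1)          # backbone features
        p = sigmoid(h @ w2)          # head prediction
        grad_out = p - y             # gradient of the logistic loss
        w2 -= lr * h.T @ grad_out / len(y)
        if not freeze_backbone:      # transfer learning keeps the backbone fixed
            grad_h = np.outer(grad_out, w2) * (1 - h**2)
            W1 -= lr * X.T @ grad_h / len(y)
    return W1, w2

# Step 1: "pretrain" on a synthetic source task (stands in for natural images).
X_src = rng.normal(size=(200, 4))
y_src = (X_src[:, 0] + X_src[:, 1] > 0).astype(float)
W1 = rng.normal(scale=0.5, size=(4, 8))
w2 = np.zeros(8)
W1, w2 = train(X_src, y_src, W1, w2)

# Step 2: transfer — freeze the pretrained backbone, fit only a new head
# on the target task (stands in for the mammography domain).
X_tgt = rng.normal(size=(100, 4))
y_tgt = (X_tgt[:, 0] - X_tgt[:, 1] > 0).astype(float)
w2_new = np.zeros(8)
_, w2_new = train(X_tgt, y_tgt, W1, w2_new, freeze_backbone=True)
```

In real imaging pipelines the frozen backbone would be a large pretrained convolutional network and only the final layers would be retrained on the medical data set; the design appeal is that far fewer labeled medical images are needed than training from scratch would require.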
The models were all trained on the Digital Database for Screening Mammography (DDSM) data set and then tested on three additional external data sets. Overall, the three previously published models achieved area under the receiver operating characteristic curve (auROC) scores ranging from 0.88 to 0.95 on images from the DDSM data set. The three models designed for this study achieved auROC scores from 0.71 to 0.79.
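For readers unfamiliar with the metric, auROC is the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case; 1.0 is perfect ranking and 0.5 is chance. A minimal from-scratch computation (the labels and scores below are made-up illustrations):

```python
def auroc(labels, scores):
    """auROC as the rank statistic: the fraction of positive/negative
    pairs in which the positive case scores higher (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one of four positive/negative pairs is ranked wrongly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

Scores in the 0.44–0.65 range reported on the external data sets therefore mean the models ranked cases barely better than, or even worse than, a coin flip.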
When applied to the three external data sets, however, all six AI models suffered, with auROC scores falling to between 0.44 and 0.65.
“Our results demonstrate that deep learning models trained on a limited data set do not perform well on data sets that have different data distributions in patient population, disease characteristics, and imaging systems,” the authors wrote. “This high variability in performance across mammography data sets and models indicates that the proclaimed high performance of deep learning models on one data set may not be readily transferred or generalized to external data sets or modern clinical data that have not been ‘seen’ by the models.”
Wang et al. concluded by pointing to the need for more consistency in how AI models intended for healthcare are trained, developed, and validated.
“Guidelines and regulations are needed to catch up with the AI advancement to ensure that models with claimed high performance on limited training data undergo further assessment and validation before being applied to real-world practice,” they wrote.