AI melanoma detectors that equaled or bettered dermatologists in clinical trials have stumbled on the way to the real world of patient care, according to a study published Jan. 21 in npj Digital Medicine.
The setback came during a proof-of-practice study at UC-San Francisco in which investigators put lab-proven algorithms through computational “stress tests” designed to ascertain the AI’s generalizability across varying patient demographics.
The tests used real-world dermatologic photos and convolutional neural networks (CNNs) developed with training, evaluation and validation protocols that were not tied to any particular site of care.
Senior author Maria Wei, MD, PhD, and colleagues found the automated models were thrown off by consecutive photos of the same lesion and flummoxed by changes as simple as rotating a photo.
These changes caused the models to return false-positive or false-negative diagnoses for as many as 22% of skin images.
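The rotation check the researchers describe can be expressed in a few lines of code. The sketch below is purely illustrative, assuming a hypothetical PyTorch image classifier (`model`) that returns class logits and a standard preprocessing pipeline; it is not the study's actual test harness.

```python
# Minimal sketch of a rotation "stress test." The model and preprocessing
# below are hypothetical stand-ins, not the code used in the study.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def prediction_flips(model: torch.nn.Module, image_path: str) -> bool:
    """Return True if rotating the photo changes the predicted class."""
    image = Image.open(image_path).convert("RGB")
    labels = []
    for angle in (0, 90, 180, 270):
        x = preprocess(image.rotate(angle)).unsqueeze(0)  # add batch dimension
        with torch.no_grad():
            labels.append(model(x).argmax(dim=1).item())
    # Inconsistent labels across rotations signal a robustness failure.
    return len(set(labels)) > 1
```

Running such a check across a test set gives a simple robustness count: the fraction of images whose diagnosis flips under transformations that should not matter clinically.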
From this, Wei and co-authors conclude that AI melanoma detectors that tied or bested experienced dermatologists in initial clinical trials “need further validation with computational stress tests to assess clinic readiness.”
“While CNN models are nearly ready to augment clinical diagnostics, the potential for harm can be minimized by evaluating their calibration and robustness to images repeatedly taken of the same lesion and images that have been rotated or otherwise transformed,” Wei and colleagues write. “Our findings support the reporting of model robustness and calibration as a prerequisite for clinical use, in addition to the more common conventions of reporting sensitivity, specificity and accuracy.”
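For readers unfamiliar with the term, “calibration” refers to how well a model's stated confidence matches how often it is actually right. One common way to quantify it is expected calibration error, sketched below; this is a generic illustration under assumed inputs (per-image confidences, predicted labels and true labels), not the metric or code reported in the study.

```python
# Illustrative expected-calibration-error (ECE) computation: the average gap
# between predicted confidence and observed accuracy, weighted by bin size.
# The binning scheme and variable names are assumptions for this sketch.
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            accuracy = (predictions[mask] == labels[mask]).mean()
            confidence = confidences[mask].mean()
            ece += mask.mean() * abs(accuracy - confidence)
    return ece
```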
The study is available in full for free.