Academic researchers in the U.K. have completed a systematic review of 62 representative studies on the use of AI to diagnose COVID-19 and predict its course from X-rays and CT scans. Their findings may strike some as a setback.
The 62 were selected for their high quality from a field of more than 2,200 such papers published and archived online in the first three quarters of 2020.
The reviewers found investigative deficiencies so widespread that, by the team’s lights, the entire body of research is rendered moot.
Not even one of the machine learning models described in the 62 is “of potential clinical use due to methodological flaws and/or underlying biases,” Michael Roberts of the University of Cambridge and colleagues write. “This is a major weakness, given the urgency with which validated COVID-19 models are needed.”
Most of the papers, 37, focused on deep learning, while 23 involved traditional machine learning and two considered both types of AI.
Among the flaws Roberts and colleagues describe in their review, published this month in Nature Machine Intelligence:
- Almost all papers had a high (45 of 62) or unclear (11 of 62) risk of bias for their participants. The reviewers deemed only six as having a low risk of bias.
- For 38 of the 62 papers, the reviewers could not judge biases in predictors because the predictors were based on either unknown or abstract features in the medical images.
- Just 10 papers had a low risk of bias in their authors’ analysis.
“The high risk of bias in most papers was principally due to a small sample size of patients with COVID-19 (leading to highly imbalanced datasets), use of only a single internal holdout set for validating their algorithm (rather than cross-validation or bootstrapping) and a lack of appropriate evaluation of the performance metrics (for example, no discussion of calibration/discrimination),” Roberts and co-authors write.
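The terms in that passage can be made concrete with a minimal Python sketch. It is an illustration only and is not code from the review: the function names, the toy labels and the toy scores are all assumptions made up for this example. It shows a discrimination measure (AUC), a calibration-sensitive measure (the Brier score), a stratified split that keeps the rare positive cases spread across folds for cross-validation, and a bootstrap interval that conveys how uncertain an estimate from a small, imbalanced sample can be.

```python
import random

def auc(labels, scores):
    """Discrimination: chance a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(labels, probs):
    """Mean squared gap between predicted probability and the 0/1 outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

def stratified_folds(labels, k, seed=0):
    """Assign every case index to one of k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def bootstrap_auc_interval(labels, scores, n_boot=1000, seed=0):
    """Percentile 95% interval for AUC, resampling cases with replacement.

    Assumes labels contain both classes; resamples that lose a class are redrawn.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        sample = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < n:
            stats.append(auc(ys, [scores[i] for i in sample]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

# Hypothetical toy data: 3 positive cases among 12, mimicking an imbalanced dataset.
toy_labels = [1, 1, 1] + [0] * 9
toy_scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.2, 0.15, 0.1, 0.1, 0.05]

print("AUC:", auc(toy_labels, toy_scores))
print("Brier score:", brier_score(toy_labels, toy_scores))
print("Folds:", stratified_folds(toy_labels, k=3))
print("Bootstrap 95% AUC interval:", bootstrap_auc_interval(toy_labels, toy_scores))
```

With only three positive cases, the bootstrap interval in a sketch like this tends to be wide, which is the practical consequence of the small sample sizes the reviewers criticize. A real evaluation would also refit the model within each fold instead of scoring fixed predictions.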
In the body of the review, Roberts et al. offer corresponding recommendations for most of the shortcomings they identify.
The recommendations fall into five primary areas: the data used for model development and common pitfalls; the evaluation of trained models; reproducibility; documentation in manuscripts; and the peer-review process.
Both the quality and the size of the datasets used to train AI models for interpreting COVID-focused X-rays and CT scans are critical, and both “can be continuously improved if researchers worldwide submit their data for public review,” the authors comment. “Because of the uncertain quality of many COVID-19 datasets, it is likely more beneficial to the research community to establish a database that has a systematic review of submitted data than it is to immediately release data of questionable quality as a public database.”
To that Roberts et al. add:
“The intricate link of any AI algorithm for detection, diagnosis or prognosis of COVID-19 infections to a clear clinical need is essential for successful translation. As such, complementary computational and clinical expertise, in conjunction with high-quality healthcare data, are required for the development of AI algorithms. Meaningful evaluation of an algorithm’s performance is most likely to occur in a prospective clinical setting. Like the need for collaborative development of AI algorithms, the complementary perspectives of experts in machine learning and academic medicine were critical in conducting this systematic review.”
The study is available in full for free.