Machine learning models can be trained to predict chronic diseases such as dementia using electronic medical record (EMR) data, according to a new study published in Artificial Intelligence in Medicine.
Approximately 5.7 million Americans have Alzheimer’s disease, the most common form of dementia, and treating the disease is expected to cost $1 trillion annually in the United States by the year 2050. Predicting which patients may have dementia later in life, the authors noted, could help screen individuals earlier than ever before and even delay the onset of the disease.
After considering numerous machine learning techniques for their research, including support vector machines (SVMs) and artificial neural networks (ANNs), the authors chose a random forest (RF) classifier. SVMs were not as interpretable as RF, and ANNs had lower accuracy.
“This choice was motivated by several factors that were derived from the literature and from our own preliminary investigation,” wrote lead author Zina Ben Miled, PhD, School of Engineering and Technology at Indiana University-Purdue University Indianapolis, and colleagues. “Namely, RF is interpretable, computationally efficient and can handle a high dimensional space of noisy, continuous and categorical features.”
The authors explored EMR data related to more than 2,000 dementia cases from 15 different facilities in Indiana. For the study’s controls, the team used data related to more than 11,000 patients from 25 different facilities in Indiana. The race and gender of the cases and controls were “similar,” reducing bias between the two groups.
The team extracted certain features from the “prescription (Rx),” “diagnosis (Dx)” and “medical notes (Nx)” sections of the EMR. Relevant ICD-9 and ICD-10 codes were also identified. Separate models using EMR data from one year and three years prior to the onset of dementia were developed for the Rx dataset, Dx dataset and Nx dataset. Those same two models were also developed for a combined “RDNx” dataset that merged the three datasets into one.
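The dataset layout described above can be pictured with a short sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the patient IDs, feature names, values and the `combine_sections` helper are all hypothetical, and the study does not describe how the three datasets were actually merged.

```python
# Hypothetical per-patient feature dictionaries, one per EMR section.
# All IDs, feature names and values are invented for illustration.
rx = {"p1": {"rx:antihyperlipidemic": 1}, "p2": {"rx:antihypertensive": 1}}
dx = {"p1": {"dx:hypertension": 1}, "p2": {}}
nx = {"p1": {"nx:memory": 2}, "p2": {"nx:fall": 1}}


def combine_sections(*sections):
    """Merge per-patient feature dicts from several EMR sections into one."""
    combined = {}
    for section in sections:
        for patient_id, features in section.items():
            # Feature names are prefixed by section, so keys do not collide.
            combined.setdefault(patient_id, {}).update(features)
    return combined


# Combined "RDNx" view: every patient carries Rx, Dx and Nx features together.
rdnx = combine_sections(rx, dx, nx)

# One model per (dataset, prediction window) pair: 4 datasets x 2 windows.
# In practice each window would use only records available that far ahead
# of onset; that filtering step is omitted here.
datasets = {"Rx": rx, "Dx": dx, "Nx": nx, "RDNx": rdnx}
windows_years = (1, 3)
model_grid = [(name, years) for name in datasets for years in windows_years]
```

Under these assumptions, `model_grid` holds eight entries, matching the one-year and three-year models for each of the four datasets.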
Overall, the models for one year prior to the onset of dementia had higher accuracy, sensitivity and specificity than the three-year models. Among the Rx, Dx and Nx datasets, models developed using the Nx dataset had the highest accuracy, sensitivity and specificity.
The combined RDNx dataset, however, had the highest accuracy of all.
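For readers unfamiliar with the three metrics, the sketch below shows how accuracy, sensitivity and specificity are conventionally computed from a classifier's predictions. The labels are made up for illustration and are not data from the study; the function name `classification_metrics` is likewise hypothetical, and the code assumes both cases and controls are present.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity for binary labels.

    Labels: 1 = dementia case, 0 = control. Assumes at least one of each.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    return {
        "accuracy": (tp + tn) / len(pairs),  # share of all predictions correct
        "sensitivity": tp / (tp + fn),       # share of true cases detected
        "specificity": tn / (tn + fp),       # share of controls correctly cleared
    }


# Invented example: 3 cases and 5 controls.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Because controls outnumber cases roughly five to one in a cohort like this one, accuracy alone can look strong even when many cases are missed, which is why sensitivity and specificity are reported alongside it.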
“This is an indication that, despite the fact that Nx models have a higher prediction accuracy, some of the Rx and Dx features (e.g., antihyperlipidemics) make a significant contribution to the overall accuracy of the combined model,” the authors wrote.
The team also observed that age was “consistently among the top features of all the models for both cases and controls” while race and gender did not appear as top features for any model. Race and gender, then, are “unlikely, according to these models, to be significant predictors of