Researchers have used unsupervised machine learning to predict disease-causing properties in more than 36 million genetic variants across more than 3,200 disease-related genes.
In the process, they have advanced the classification of more than 256,000 genetic variants whose effects, whether helpful, harmful or neither, had been unknown.
The work was conducted at Harvard Medical School and Oxford University. The resulting study is posted online in Nature.
“Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences,” write co-lead authors Jonathan Frazer and Mafalda Dias and their colleagues, explaining the motivation for the work.
“In principle, computational methods could support the large-scale interpretation of genetic variants,” they add. “However, state-of-the-art methods have relied on training machine learning models on known disease labels.”
For the current project, the team sought to overcome this limitation by modeling the distribution of sequence variation across organisms—and over vast swaths of time.
In so doing, they hypothesized, they would isolate fitness-maintaining features in protein sequences.
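The idea of scoring variants from sequence variation alone, with no disease labels, can be illustrated with a minimal sketch. This is not the study's model, which is considerably more sophisticated. It is a much simpler stand-in: a site-independent frequency model fitted to a toy alignment of related sequences. The alignment, the pseudocount and the function names are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fit_site_frequencies(alignment, pseudocount=1.0):
    """Estimate per-position amino acid probabilities from aligned sequences.

    No disease labels are used: the only input is the sequences themselves.
    """
    length = len(alignment[0])
    model = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
        model.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return model

def variant_score(model, wild_type, position, mutant):
    """Log-ratio of mutant to wild-type probability at one position.

    More negative means the substitution is rarer across the alignment,
    which this toy model treats as a hint that it may disrupt fitness.
    """
    probs = model[position]
    return math.log(probs[mutant]) - math.log(probs[wild_type[position]])

# Hypothetical toy alignment: position 0 is fully conserved, position 4 varies.
toy_alignment = ["MKVLA", "MKVLS", "MKILA", "MRVLA", "MKVLA"]
wild_type = "MKVLA"
model = fit_site_frequencies(toy_alignment)

conserved_score = variant_score(model, wild_type, 0, "W")  # change at a conserved site
variable_score = variant_score(model, wild_type, 4, "S")   # change at a variable site
print(conserved_score, variable_score)
```

In this sketch, a substitution at the conserved position scores lower than one at the variable position, which is the intuition behind reading evolutionary conservation as evidence about fitness.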
The authors call their model EVE, for evolutionary model of variant effect, and report that it proved more accurate than AI approaches trained on labeled data.
What’s more, its predictions are on par with, or better than, those from high-throughput experiments, which are increasingly used as evidence for variant classification.
The team states that its work with EVE suggests models of evolutionary information can “provide valuable independent evidence for variant interpretation that will be widely useful in research and clinical settings.”
Writing for Harvard’s news division, science writer Ekaterina Pesheva reports that EVE looked for patterns that evolution has preserved over time. To do so, it analyzed data from 140,000 species, including endangered and extinct organisms.
Co-senior author Debora Marks of Harvard warns that EVE is not a diagnostic test.
Instead, its “computational prowess” can combine with existing clinical options to help geneticists and physicians make diagnoses, predict disease progression and “even choose treatment based on the presence of certain disease-causing genetic mutations.”
To this, co-senior author Yarin Gal of Oxford adds:
“We’re not providing clinicians merely with a number but also giving them the degree of uncertainty that comes with it. This is something that the expert can take and use in the decision-making process. … Building trust between the tool and the expert is an important aspect of this work.”
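What it can mean to report a number together with its uncertainty is sketched below. The sketch assumes, purely for illustration, that variant scores fall into two overlapping Gaussian clusters, one benign and one pathogenic. Every parameter here (the means, spreads and prior) is hypothetical and not taken from the study. The uncertainty is the entropy of the resulting class probability, which is highest when the two classes are about equally likely.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def classify_with_uncertainty(score,
                              benign=(-1.0, 1.0),       # hypothetical (mean, std)
                              pathogenic=(-6.0, 2.0),   # hypothetical (mean, std)
                              prior_pathogenic=0.5):    # hypothetical prior
    """Return (probability of pathogenic, uncertainty in bits).

    Uncertainty is the binary entropy of the probability:
    0 when the call is certain, 1 when it is a coin flip.
    """
    weight_path = prior_pathogenic * gaussian_pdf(score, *pathogenic)
    weight_benign = (1 - prior_pathogenic) * gaussian_pdf(score, *benign)
    p = weight_path / (weight_path + weight_benign)
    if p <= 0.0 or p >= 1.0:
        return p, 0.0
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return p, entropy

# Three hypothetical scores: clearly benign, ambiguous, clearly pathogenic.
p_benign, u_benign = classify_with_uncertainty(-1.0)
p_mid, u_mid = classify_with_uncertainty(-3.0)
p_path, u_path = classify_with_uncertainty(-8.0)
for label, p, u in [("benign-like", p_benign, u_benign),
                    ("ambiguous", p_mid, u_mid),
                    ("pathogenic-like", p_path, u_path)]:
    print(f"{label}: P(pathogenic)={p:.3f}, uncertainty={u:.3f} bits")
```

A score between the two clusters yields a probability near one half and an uncertainty near its maximum, which is the kind of signal an expert could weigh before relying on the prediction.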
More from Marks:
“Our results turned out to be far better than we expected. It seems that by simply training a model to fit the distribution of sequences across evolution we extract information that enables us to make unexpectedly precise predictions about disease risk arising from a given genetic variant.”