Written by Charlotte Harrison, Science Writer
A paper in BMC Bioinformatics has tested the applicability of machine learning methods to genomic data.
Using genomic data to predict a phenotype is a key goal of medicine and biology for researchers and clinicians alike. Examples include predicting a patient’s disease susceptibility from their genetic information, or predicting antibiotic-resistant strains of bacteria from the genomes of pathogenic microbes.
Machine learning methods are used to make predictions in other fields, such as computer vision and astrophysics, but they have had limited success in biology and medicine, largely due to the complexity of genomic data.
In particular, the paper tested the robustness, generalization potential and prediction accuracy of widely used machine learning models on a variety of heterogeneous genomic datasets. The paper focused on neural networks – a machine learning method that finds underlying relationships in data using layered structures loosely inspired by the way neurons connect in the brain.
Are literature models robust?
The first question the authors addressed was whether the neural network models proposed in the literature are robust across heterogeneous datasets.
They concluded that recurrent neural networks are relatively robust across genomic datasets and are generally not affected by the size or type of the data.
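To make that class of model concrete, the sketch below shows a recurrent network reading one-hot-encoded DNA and emitting a phenotype prediction. It is a minimal illustration in PyTorch, not one of the paper’s architectures; the layer sizes, sequence length and binary-phenotype framing are all assumptions.

```python
# Minimal sketch (not from the paper): a recurrent network that predicts a
# binary phenotype from one-hot-encoded DNA sequences. Layer sizes, sequence
# length and the toy input are illustrative assumptions.
import torch
import torch.nn as nn

class PhenotypeRNN(nn.Module):
    def __init__(self, n_bases=4, hidden=64):
        super().__init__()
        # LSTM reads the sequence position by position: (batch, length, bases)
        self.rnn = nn.LSTM(input_size=n_bases, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # one logit: phenotype present/absent

    def forward(self, x):
        _, (h_n, _) = self.rnn(x)   # h_n: final hidden state per sequence
        return self.head(h_n[-1])   # (batch, 1) logits

# Toy input: 8 sequences of 100 bases, one-hot encoded (A, C, G, T)
x = torch.eye(4)[torch.randint(0, 4, (8, 100))]
logits = PhenotypeRNN()(x)
print(logits.shape)  # torch.Size([8, 1])
```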
Are literature models accurate?
Here, the authors found that accurate prediction requires balancing underfitting (a model too simple to capture the signal in the data) against overfitting (a model that memorizes the training data and fails to generalize), and that small changes in the architecture of the model can have unpredictable effects on this balance.
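As a toy demonstration of that balance – not the paper’s experiments – the sketch below trains the same family of network at three different widths on a synthetic dataset with many noisy features, a rough stand-in for genomic markers. The dataset and the chosen widths are illustrative assumptions.

```python
# Minimal sketch: how model capacity shifts the underfitting/overfitting
# balance. The synthetic dataset and layer widths are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 500 samples, 200 noisy features standing in for genomic markers
X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for width in (2, 32, 512):  # too small, moderate, likely to overfit
    model = MLPClassifier(hidden_layer_sizes=(width,), max_iter=2000,
                          random_state=0).fit(X_tr, y_tr)
    print(f"width={width:4d}  train={model.score(X_tr, y_tr):.2f}  "
          f"test={model.score(X_te, y_te):.2f}")
```

A wide gap between training and test accuracy at the largest width is the signature of overfitting; weak accuracy on both is the signature of underfitting.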
Can literature models be re-used?
The third question the authors asked was whether it is possible to take a neural network model from a publication and apply a similar model to a new dataset.
The results showed that, most of the time, the good performance of existing models does not carry over to new datasets. However, the authors did find some model characteristics that were generally robust across datasets and could serve as a potential baseline model – albeit one with modest performance.
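The sketch below is a rough illustration of why published performance may not transfer: a fixed “published” configuration is re-trained, unchanged, on a second dataset with different dimensions and noise. Both datasets are synthetic and all hyperparameters are assumptions, not values from the paper.

```python
# Minimal sketch (illustrative, not from the paper): an architecture tuned on
# one dataset is re-trained unchanged on a second, differently structured one.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# A frozen "published" configuration: architecture and hyperparameters fixed
published = dict(hidden_layer_sizes=(64, 64), alpha=1e-4, max_iter=2000,
                 random_state=0)

data_a = make_classification(n_samples=400, n_features=50, n_informative=20,
                             random_state=1)   # dataset the model was tuned on
data_b = make_classification(n_samples=400, n_features=500, n_informative=5,
                             flip_y=0.2, random_state=2)  # new, noisier data

for name, (X, y) in {"original": data_a, "new": data_b}.items():
    score = cross_val_score(MLPClassifier(**published), X, y, cv=5).mean()
    print(f"{name} dataset: mean CV accuracy = {score:.2f}")
```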
Overall, though, the authors advise biological and medical researchers against re-using published models on their own datasets.
A problem with reproducibility
During the study, the authors found multiple cases where they could not replicate the performance of a published model, because the data or code were unavailable, or because the code was corrupted, incomplete or poorly documented.
They conclude that, to move the field forward, researchers must provide reproducible scripts that are relatively easy to follow, so that machine learning can have an impact across different scientific disciplines.
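In that spirit, the sketch below shows some common reproducibility hygiene for a machine learning script – a general practice, not a recipe from the paper: fix the random seeds, record library versions, and save the exact configuration alongside the results.

```python
# Minimal sketch of reproducibility hygiene (general practice, not from the
# paper): fix seeds, record versions, and save the run configuration.
import json
import random
import numpy as np
import sklearn

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

config = {
    "seed": SEED,
    "model": "MLPClassifier",
    "hidden_layer_sizes": [64, 64],
    "numpy_version": np.__version__,
    "sklearn_version": sklearn.__version__,
}
with open("run_config.json", "w") as f:
    json.dump(config, f, indent=2)  # ship this file with the code and results
```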