As genomic sequencing becomes more routine, the sheer volume of data being produced is becoming a major problem. We must expand our ability to analyse and store big data. Most bioinformatic algorithms used to analyse genomics data (e.g. to identify patterns and make predictions on disease progression or treatment options) are based on machine learning methods and, more recently, deep learning.
In this review, published earlier this week on ScienceDirect, Lefteris Koumakis outlines the most prominent models and possible pitfalls, and discusses future directions. A summary can be found below.
Machine and deep learning
Machine learning has contributed greatly to the success of bioinformatics. However, current advances in the “-omics” space pose new challenges for the machine learning (ML) community. Deep learning (DL) is considered a more efficient and effective way to deal with big data, as it is better than ML at handling natural data in its raw form. DL has also been shown to produce models with higher accuracy that are efficient at discovering patterns in high-dimensional data, but DL places greater demands on training data than ML, and the training data can drastically affect the predictive power of the model.
Deep learning was first theorised in the 1980s, but only over the last decade has it been considered a state-of-the-art method for predictive modelling on big data sets. The term “deep” refers to the number of layers through which the data is transformed, with some deep learning networks having as many as two hundred layers! However, DL requires specialist hardware and massive parallelism to be effective. To overcome the demand for resources and hardware limitations, DL models use pipeline parallelism, which can scale up the training phase.
Genomics data analysis
Many machine learning approaches have been evaluated for extracting important information from genomics data, for example for patient stratification.
The literature shows that precision medicine approaches are using DL on genomics and clinical data for prognostic prediction. The -omics research area, in general, produces large volumes of data, supported by high-throughput platforms that measure the expression of thousands of genes or non-coding transcripts, quantitative gene expression profiles, or other genome alterations.
Despite being considered the state-of-the-art methodology for big data, DL models are still far from ready for use in precision medicine, as the proposed methodologies have not been validated in clinical practice.
Limitations of deep learning in genomics
Deep learning is still in its infancy for use in genomics. Five major limitations of deep learning models in genomics are:
- Model interpretation
One of the major limitations of DL is model interpretation. Because of the structure of DL models, it is difficult to understand the rationale behind a prediction and the patterns the model has learned, which matters if you want to extract a causal relationship between the data and the outcome.
- The curse of dimensionality
The “curse of dimensionality” is one of the most pronounced limitations of artificial intelligence in genomics. Genomic data sets usually have a large number of variables and a small number of samples, a well-recognised problem not only for DL but also for ML algorithms. Public genomics data lets you combine data sets from multiple sources, but collecting a representative cohort for training DL algorithms requires a lot of pre-processing and harmonisation.
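To make the many-variables, few-samples problem concrete, here is a minimal sketch in plain Python. It is not taken from the review: the cohort size, the labels and the noise "genes" are all hypothetical. Every feature is random noise, yet the best correlation with the labels can only grow as more features are searched, which is one way a model can latch onto patterns that are not real.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(features, labels, n_features):
    """Largest |correlation| with the labels among the first n_features."""
    return max(abs(pearson(col, labels)) for col in features[:n_features])


rng = random.Random(0)
n_samples = 10                      # hypothetical: a small cohort
labels = [0, 1] * (n_samples // 2)  # hypothetical healthy / disease labels

# Each "gene" is pure noise, so any correlation with the labels is spurious.
features = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(5000)]

for p in (5, 50, 500, 5000):
    print(p, round(best_spurious_correlation(features, labels, p), 2))
```

Because each larger feature set contains the smaller ones, the printed value never decreases as p grows, even though no feature carries any signal.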
- Imbalanced classes
Most ML and DL applications in genomics are classification problems, e.g. discriminating between disease and healthy samples. It is recognised that genomics trials and data gathered from various sources are usually inherently class imbalanced, and ML/DL models cannot be effective unless a sufficient number of instances per class is available for training. Transfer learning can help tackle the class imbalance problem. Transfer learning is a method that focuses on storing knowledge gained while solving one problem and applying it to a different, but related problem.
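As a rough illustration of why class imbalance matters (this does not show transfer learning itself), the sketch below scores a trivial classifier that always predicts the most common class. The 95/5 split between healthy and disease samples is a made-up assumption, as are the function names.

```python
from collections import Counter

# Hypothetical cohort: 95 healthy samples and 5 disease samples.
labels = ["healthy"] * 95 + ["disease"] * 5


def majority_baseline(y_true):
    """Predict the most common class for every sample."""
    majority_class, _ = Counter(y_true).most_common(1)[0]
    return [majority_class] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly positive samples that were predicted positive."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return hits / sum(t == positive for t in y_true)


predictions = majority_baseline(labels)
print("accuracy:", accuracy(labels, predictions))
print("disease recall:", recall(labels, predictions, "disease"))
```

The baseline looks accurate overall while never identifying a single disease sample, which is why per-class metrics are worth checking on imbalanced genomics data.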
- Heterogeneity of data
The data in most genomic applications are heterogeneous since we deal with subgroups of the population. Even on an individual level, genomic data includes the sequencing of genes or non-coding transcripts, quantitative gene expression profiles, gene variants, genome alterations and gene interactions. Integrating these different data types is a challenge because of the underlying interdependencies among the heterogeneous data.
The bioinformatics community is taking advantage of the plethora of data sources to provide many analysis tools, but in most cases combining them is causing researchers trouble.
- Parameter and hyper-parameter tuning
One of the most difficult steps for DL is tuning the model. Careful analysis of the initial results may help during tuning, as the right settings depend on the dataset and the research question.
The main hyper-parameters for every DL architecture are the learning rate, batch size, the momentum, and the weight decay.
The learning rate is a tuning parameter that determines the step size at each iteration while moving towards a minimum of a loss function.
Batch size is the number of training samples used in each iteration.
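Momentum accumulates past gradients to smooth the updates, which helps the optimiser follow a more direct path towards the minimum.
Weight decay is a process where, after each update, the weights are multiplied by a factor slightly less than 1, which keeps them from growing too large.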
These hyper-parameters can be tweaked during the training of the model and a wrong setting in any of these can result in under- or over-fitting.
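The four hyper-parameters can be seen working together in a toy training loop. This is a minimal sketch under stated assumptions, not the setup used in the review: it fits a single weight to hypothetical target values with mini-batch gradient descent, and the function name `sgd`, the data and the default values are all illustrative.

```python
def sgd(data, learning_rate=0.1, batch_size=4, momentum=0.9,
        weight_decay=0.0, epochs=125):
    """Fit one weight w to minimise the mean of (w - y)^2 over data."""
    w, velocity = 0.0, 0.0
    for _ in range(epochs):
        # Batch size: how many samples feed each update.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch.
            grad = sum(2 * (w - y) for y in batch) / len(batch)
            # Momentum: blend the new gradient into a running velocity.
            velocity = momentum * velocity + grad
            # Learning rate: scales the step taken along the velocity.
            w -= learning_rate * velocity
            # Weight decay: multiply the weight by a factor just below 1.
            w *= 1 - learning_rate * weight_decay
    return w


data = [2.0, 4.0] * 8  # hypothetical targets whose mean is 3.0

w_plain = sgd(data)
w_decay = sgd(data, weight_decay=0.1)
print("without weight decay:", round(w_plain, 3))
print("with weight decay:   ", round(w_decay, 3))
```

Without weight decay the weight settles at the mean of the targets; with weight decay it is pulled slightly towards zero, which is the regularising effect the factor is meant to have.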
Deep learning models have an advantage over other genomics algorithms in the pre-processing steps, which are otherwise usually manual, error-prone and time-consuming. However, in many cases, genomics data do not conform to the requirements posed by most DL architectures.
Regardless, the process of translating the knowledge acquired in genomics research into clinically useful tools has been slow. This is partly due to the requirements for validation and standardisation, which can be slow to fulfil because of the fragmentation of genomics research. The FDA is considering a regulatory framework for computational technologies that will allow modifications to be made from real-world learning and adaptation, while still ensuring the safety and efficacy of the software.
Likewise, while most DL models require a lot of data to be able to generalise findings and make predictions, most data sets for precision medicine are not representative of the overall population. Genomics data often comprises heterogeneous data types, and models that are capable of integrating those data sets are not easily interpreted.
Overall, deep learning can make more accurate predictions than other technologies and handle multimodal data effectively. However, more effort should be made to analyse and combine datasets to enhance the role of DL in genomics for prediction and prognosis.