Data integration is the process of combining different omics datasets, allowing researchers to stack the multiple layers of biological insight together to get the whole picture. Integration is at the core of the multi-omics approach. However, this stage is often cited as the most challenging.
Tools for multi-omics data integration
- trackViewer (published in Nature Methods) is a tool for the integration and visualisation of multi-omics data. Several genome browsers have been developed, but most of them do not have an easy programming interface that can be plugged into a pipeline. In addition, unlike other genome browsers, trackViewer can produce specialised plots such as lollipop plots (also known as needle plots) for methylation, mutation, and SNP data.
- motifStack (published in Nature Methods) can be considered a tool for downstream analysis of all multi-omics data. For example, motifs can be identified from DNA-seq, RNA-seq, ATAC-seq, and ChIP-seq data by performing a motif enrichment analysis. Once some motifs have been identified, the next step is to see how they are all connected.
- ChIPpeakAnno is for integrated analysis of ChIP-seq data and any experimental data that results in genomic ranges. ChIPpeakAnno was the first batch annotation tool for ChIP-seq data. It is one of the most downloaded Bioconductor packages, is highly cited, and has been used extensively. Despite having been released about 12 years ago, it has stood the test of time. With ChIPpeakAnno, it is possible to annotate peaks to genes and enhancers, and to perform pathway enrichment analysis, overlap analysis (e.g., of replicates, different transcription factors, or different omic profiles), and more.
- scATACpipe (single cell ATAC pipeline) is a workflow for analysing and visualising scATAC-seq data. While there are other tools and pipelines for analysis of scATAC-seq data, scATACpipe is unique in that it provides an end-to-end analysis of scATAC-seq data and is easy to use, scalable, reproducible, and comprehensive. scATACpipe can perform extensive quality assessment, pre-processing, dimension reduction, clustering, peak calling, differential accessibility inference, integration with scRNA-seq data, transcription factor activity and footprinting analysis, co-accessibility inference, and cell trajectory inference.
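To make the idea of annotating peaks to genes (one of the ChIPpeakAnno tasks listed above) more concrete, here is a minimal Python sketch. It is not ChIPpeakAnno's API or algorithm: the function `annotate_peaks`, the gene names, and all coordinates are hypothetical, and the sketch simply assigns each peak to the gene whose transcription start site (TSS) is closest to the peak midpoint on the same chromosome.

```python
def annotate_peaks(peaks, genes):
    """Assign each peak to the gene with the closest TSS on the same chromosome.

    peaks: list of (chromosome, start, end)
    genes: list of (gene_name, chromosome, tss_position)
    """
    annotations = []
    for chrom, start, end in peaks:
        midpoint = (start + end) // 2
        # Distance from the peak midpoint to every TSS on this chromosome.
        candidates = [
            (abs(tss - midpoint), name)
            for name, gene_chrom, tss in genes
            if gene_chrom == chrom
        ]
        if candidates:
            distance, name = min(candidates)
            annotations.append((chrom, start, end, name, distance))
        else:
            # No annotated gene on this chromosome.
            annotations.append((chrom, start, end, None, None))
    return annotations


# Hypothetical gene annotation and peak calls, for illustration only.
genes = [("GENE_A", "chr1", 1000), ("GENE_B", "chr1", 5000), ("GENE_C", "chr2", 2000)]
peaks = [("chr1", 1100, 1300), ("chr1", 4500, 4700), ("chr2", 100, 300), ("chr3", 10, 50)]

annotated = annotate_peaks(peaks, genes)
for row in annotated:
    print(row)
```

Real annotation tools also account for strand, gene bodies, enhancers, and distance cut-offs, which this sketch deliberately leaves out.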
Machine learning and artificial intelligence
Machine learning (ML) and artificial intelligence (AI) approaches are becoming increasingly popular. However, ML or AI should not be considered a magic bullet – as with any technique, each has its own limitations and challenges.
Moreover, many of these approaches are not even that novel. Because ML and AI have become buzzwords, the terms are often applied to relatively basic and old models such as Random Forest, whose origins date back to 1995. That being said, there is a lot of innovation and development in the ML/AI space, and having some background knowledge may help you identify what is truly fresh and ground-breaking.
ML and AI considerations
Data shift occurs when there is a mismatch between the data an AI or ML model was trained and tested on and the data it encounters in the “real world”. Essentially, training fails to produce a good ML model because the training and testing data do not match other datasets, so the model is not generalisable.
Lack of standardisation between datasets, data collected under very specific experimental conditions, data collected at different times, by different people, under different environments – all these factors can mean that our ML model may have data shift issues. Making sure that the data you trained your ML approach on resembles the data you later apply it to is vital to ensure that your results are reliable, accurate, robust and reproducible.
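One simple way to look for data shift is to compare the distribution of each feature between the training data and a new dataset. The sketch below is illustrative only: the feature names, values, helper functions, and the threshold of one standard deviation are all assumptions, and a real analysis would normally use more robust statistics and far more samples.

```python
import statistics


def standardised_shift(train_values, new_values):
    """Absolute difference in means, in units of the training standard deviation."""
    train_mean = statistics.mean(train_values)
    train_sd = statistics.stdev(train_values)
    return abs(statistics.mean(new_values) - train_mean) / train_sd


def flag_shifted_features(train, new, threshold=1.0):
    """Return the features whose mean moved by more than `threshold` SDs."""
    return [
        name
        for name in train
        if name in new and standardised_shift(train[name], new[name]) > threshold
    ]


# Hypothetical expression values for two features in two cohorts.
train_cohort = {
    "gene_a": [5.1, 4.9, 5.3, 5.0, 4.8],
    "gene_b": [2.0, 2.2, 1.9, 2.1, 2.0],
}
new_cohort = {
    "gene_a": [5.0, 5.2, 4.9, 5.1, 5.0],
    "gene_b": [3.4, 3.6, 3.5, 3.3, 3.7],
}

shifted = flag_shifted_features(train_cohort, new_cohort)
print(shifted)  # features that look different in the new cohort
```

A flagged feature does not prove that the model will fail, but it is a prompt to check for batch effects or protocol differences before trusting predictions on the new data.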
Even if a training process produces a model that performs well on the test data, that model can still be flawed. This problem is known as under-specification: the training process can produce many different models that all perform well on your test data but differ in small, seemingly unimportant ways.
These differences can be attributed to many things, such as the way training data is selected or represented, the number of training runs, and so on. These small, sometimes random differences appear arbitrary, especially because they don’t affect how a model performs on the testing data. But when the models are applied to other datasets, these differences can end up causing unanticipated problems.
To avoid under-specification issues, one option is to introduce an additional stage to the training and testing process, where you produce many ML models instead of just one. These different models can then be tested again on another set of test data, made to compete against each other, and whichever performs best can be selected.
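The extra selection stage described above can be sketched as follows. This is a toy illustration under stated assumptions: the one-feature classifier `train_stump`, the tiny datasets, and the idea of varying only the feature order are all hypothetical stand-ins for a real training procedure. Two candidate models score identically on the first test set, and only a second, independent test set tells them apart.

```python
from itertools import permutations


def accuracy(model, rows):
    """Fraction of (features, label) rows the model classifies correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def train_stump(rows, feature_order):
    """Return a one-feature threshold classifier.

    Features are tried in `feature_order`; the first one to reach the best
    training accuracy wins, so ties are settled by an arbitrary choice.
    """
    best_feature, best_score = None, -1.0
    for f in feature_order:
        score = accuracy(lambda x, f=f: int(x[f] > 0.5), rows)
        if score > best_score:
            best_feature, best_score = f, score
    return lambda x: int(x[best_feature] > 0.5)


# In the training and first test set, both features track the label perfectly.
train = [((1, 1), 1), ((0, 0), 0), ((1, 1), 1), ((0, 0), 0)]
test = [((1, 1), 1), ((0, 0), 0)]
# In the second test set, feature 1 no longer tracks the label.
second_test = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1), ((0, 0), 0)]

# Produce several candidate models that differ only in an arbitrary detail.
candidates = [train_stump(train, order) for order in permutations(range(2))]

test_scores = [accuracy(m, test) for m in candidates]
second_scores = [accuracy(m, second_test) for m in candidates]

# Keep the candidate that holds up best on the second test set.
best_model = candidates[second_scores.index(max(second_scores))]
print(test_scores, second_scores)
```

Because the second test set is now used for model selection, a further untouched dataset would be needed for an unbiased final estimate of performance.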
Overfitting vs underfitting
Overfitting is when a statistical or ML model fits its training data too closely and, as a result, cannot perform accurately when it is tested against unseen data. To detect overfitting, researchers can compare the model's error rates on the training data and the test data: if the error rate is low on the training data and high on the test data, overfitting is likely an issue.
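The train-versus-test comparison can be sketched in a few lines. The example below is an assumption-laden toy: it uses a 1-nearest-neighbour classifier (which memorises its training data), made-up one-dimensional data with two deliberately noisy labels, and an arbitrary gap threshold of 0.2.

```python
def nearest_neighbour_predict(train, x):
    """Predict the label of the training point closest to x (memorises the data)."""
    _, label = min(train, key=lambda point: abs(point[0] - x))
    return label


def error_rate(train, rows):
    """Fraction of rows misclassified by the nearest-neighbour model."""
    wrong = sum(nearest_neighbour_predict(train, x) != y for x, y in rows)
    return wrong / len(rows)


# Hypothetical data: the true label is 1 above 0.5, but the training points
# at 0.3 and 0.7 carry noisy (wrong) labels that the model memorises.
train = [(0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0), (0.6, 1), (0.7, 0), (0.8, 1), (0.9, 1)]
test = [(0.12, 0), (0.28, 0), (0.33, 0), (0.68, 1), (0.72, 1), (0.88, 1)]

train_error = error_rate(train, train)
test_error = error_rate(train, test)

# A large gap between test and training error is a warning sign of overfitting.
likely_overfit = (test_error - train_error) > 0.2
print(train_error, test_error, likely_overfit)
```

Here the model is perfect on the data it has memorised and poor on new points near the noisy labels, which is the pattern to look for.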
Underfitting is the opposite of overfitting. To avoid overfitting, one option is to spend less time training the model, which is known as “early stopping”; another is to reduce the complexity of the model. However, stopping too early or simplifying too much may cause the model to miss or exclude important features, leading to underfitting. This means the model, like an overfitted one, is unable to generalise to new “real world” data.
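A common way to decide when to stop is to monitor the error on a validation set and halt once it has stopped improving for a few epochs. The sketch below assumes a hypothetical `train_one_epoch` callback and a simulated validation-loss curve; the function name and the `patience` parameter are illustrative choices, not a specific library's API.

```python
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=3):
    """Train until the validation loss has not improved for `patience` epochs.

    Returns (best_epoch, best_loss, last_epoch_run).
    """
    best_loss = float("inf")
    best_epoch = 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_loss, epoch


# Simulated validation losses: they fall at first, then rise as the
# hypothetical model starts to overfit.
simulated_losses = [0.9, 0.7, 0.5, 0.45, 0.47, 0.5, 0.55, 0.6, 0.7]
curve = iter(simulated_losses)

best_epoch, best_loss, stopped_at = train_with_early_stopping(
    train_one_epoch=lambda: None,         # placeholder for a real training step
    validation_loss=lambda: next(curve),  # placeholder for a real evaluation
    max_epochs=len(simulated_losses),
    patience=3,
)
print(best_epoch, best_loss, stopped_at)
```

In practice the model parameters saved at `best_epoch` would be restored. A very small `patience` risks stopping before the model has learned enough, which is the underfitting risk described above.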
Data leakage is a major problem in ML when developing predictive models. The goal of a predictive ML model is to make accurate predictions on new unseen data. When the data a model is trained on includes information from the data it is later tested on, the model has effectively already seen the answers, and its predictions seem much better than they really are.
Data leakage is more of a problem with complex datasets, and most multi-omics data fits that description. One subtle form of data leakage to look out for is temporal leakage, which is when training data includes points from later in time than the test data. This is a problem because the model has essentially seen the future. Improper metadata annotation or batch effects can contribute to this kind of leakage. Ways to prevent data leakage include carrying out data preparation separately within each cross-validation fold and reserving a validation dataset for final checks of any developed models.
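Preparing data within cross-validation folds means that any statistic used for pre-processing (for example, the mean and standard deviation used for scaling) is computed from the training portion of each fold only. The sketch below illustrates this with made-up values; the helper functions `k_fold_indices` and `scale_within_fold` are hypothetical and written only for this example.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        stop = n_samples if fold == k - 1 else start + fold_size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx


def scale_within_fold(values, train_idx, test_idx):
    """Standardise using the mean and SD of the training portion only."""
    train_values = [values[i] for i in train_idx]
    mean = statistics.mean(train_values)
    sd = statistics.stdev(train_values)
    train_scaled = [(values[i] - mean) / sd for i in train_idx]
    # The held-out samples are transformed with the training statistics,
    # so nothing about them leaks into the pre-processing step.
    test_scaled = [(values[i] - mean) / sd for i in test_idx]
    return train_scaled, test_scaled


# Hypothetical measurements for a single feature.
values = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
folds = list(k_fold_indices(len(values), k=3))

for train_idx, test_idx in folds:
    train_scaled, test_scaled = scale_within_fold(values, train_idx, test_idx)
    print(test_idx, [round(v, 2) for v in test_scaled])
```

Scaling the whole dataset once before splitting would let the held-out samples influence the mean and standard deviation, which is a mild but real form of leakage. For time-ordered data, the folds would additionally need to respect time so that training samples always precede test samples.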
Black box models
Some ML and AI models are referred to as “black box models,” where users and researchers know the inputs and the outputs, but do not know how the model actually works. However, if we can’t interpret the model, how can we falsify, test, and reproduce the results?
Interpretable models, or explainable models, instead make clear how the model works. Often these models are also open-source, and all the code is made easily accessible and freely available.
Moreover, black box models are created directly from data by an algorithm. In the context of multi-omics, this can actually limit the utility of these models, as they do not incorporate domain knowledge. In many scientific disciplines, such as systems biology, biological information is present in the form of graphs and networks – and this information can be incorporated in network-based algorithms to make them more versatile and applicable in many research areas.