The advent of high-throughput technology and the increasing availability of data from large sets of multi-omics samples have led to the development of several tools and methods for data integration and interpretation.
Why integrate omics data?
Integrated approaches combine different omics data to better understand the relationships between molecules and to assess the flow of information from one omics level to another. In turn, this helps bridge the gap between genotype and phenotype, improving the predictive accuracy of disease traits and aiding the development of future treatments or prevention strategies.
Analysing multi-omics data alongside clinical information is particularly valuable. For example, the integration of proteomic, genomic and transcriptomic data has been used to prioritise driver genes in colon and rectal cancer. Moreover, the combination of metabolomics and transcriptomics data has helped reveal the molecular perturbations underlying prostate cancer. These examples show how combining omics datasets can aid the discovery of the underlying causes of serious health conditions.
Multi-omics data integration tools and methods
Multi-omics data integration tools can be categorised based on the biological question that they address, such as:
Disease subtyping
Identifying disease subtypes is crucial for understanding the underlying molecular patterns driving a condition. Several tools can be used to identify subtypes of a disease based on patients' omics profiles.
Prediction of biomarkers
Biomarkers are often strongly linked to biological pathways that provide a flow of information and reveal the underlying mechanisms of diseases. A number of tools support the interpretation of molecular features by combining multi-omics datasets.
Deriving insights into disease biology
Understanding complex disease biology is key to diagnosing and developing effective therapeutics for a disease. Tools can be used to harness multi-omics data to gain insights into the mechanistic details of disease biology.
A schematic diagram showing the integrative tools and methods for multi-omics data analysis, grouped according to their approach and colour-coded by application. Image credit: I. Subramanian et al., 2020
The methods used by these tools can be further classified by the approach that they employ, for example:
Bayesian multi-omics approaches
- Pathway recognition algorithm using data integration on genomic models (PARADIGM): Infers the activities of individual patients’ biological pathways from multi-omics data. It combines several omics measurements from a single patient sample to infer the activities of genes and their products, as well as pathway interactions. As PARADIGM uses Bayesian factor graphs, it can also fall in the ‘Network’ approach category.
- iCluster: Assigns a single cluster for samples based on multiple data types. This integrative clustering enables flexible modelling of the associations between different data types, whilst reducing the dimensionality of datasets.
- iClusterPlus: An enhanced version of iCluster that integrates genomic, epigenomic and transcriptomic profiling. It is capable of predicting the key genomic variables and capturing biological variation.
- LRAcluster: Uses a probabilistic model to classify omics data, which is size-matched and clustered. Essentially, the variance of the data and the number of clusters are used to better cluster disease subtypes.
- Patient-specific data fusion (PSDF): Copy number variations and gene expression data are integrated to categorise samples into groups, based on their similarity across the two datasets. Only the samples that show concordance between the datasets are fused.
- Bayesian consensus clustering (BCC): Source-specific features are modelled using data-driven consensus clustering to integrate multi-omics datasets. Although the clusters of individual data are separate, they are loosely connected to all of the data sources.
- Multiple dataset integration (MDI): Each data source is clustered, and the pairwise dependencies between the clusters are modelled. This enables the identification of groups of genes that cluster together across multi-omics datasets.
- Multi-omics factor analysis (MOFA): Integrates multi-omics data types measured on the same or partially overlapping samples by inferring a representation of the data as hidden factors shared across several omics layers.
Network multi-omics approaches
- Similarity network fusion (SNF): Integrates multi-omics datasets using a network fusion method. It creates an individual network for each data type, and then uses network fusion to combine them into a single similarity network based on message-passing theory. SNF can also be grouped under ‘Fusion’ and ‘Similarity-based’ approaches.
- Network-based integration of multi-omics data (NetICS): Predicts the effect of genetic aberrations, epigenetic changes, and gene expression in an interaction network using a network-diffusion model for each sample. It provides a framework for integration of multi-omics data for cancer gene prioritisation.
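To make the idea of network diffusion more concrete, the sketch below runs a random walk with restart on a toy four-gene interaction network. It is only an illustration of the general principle and not the NetICS or SNF implementation: the function name `diffuse`, the restart probability of 0.4 and the toy network are all assumptions made for this example.

```python
def diffuse(adjacency, seed_scores, restart=0.4, tol=1e-9, max_iter=1000):
    """Random walk with restart on an undirected network (illustrative only)."""
    n = len(adjacency)
    # Degree of each node, used to split a node's score evenly among its neighbours.
    degree = [sum(row[j] for row in adjacency) for j in range(n)]
    total = sum(seed_scores)
    start = [s / total for s in seed_scores]
    scores = start[:]
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            # Score flowing into node i from its neighbours.
            inflow = sum(
                adjacency[i][j] * scores[j] / degree[j]
                for j in range(n)
                if degree[j] > 0
            )
            # Mix the diffused score with the original seed signal.
            updated.append((1 - restart) * inflow + restart * start[i])
        converged = max(abs(a - b) for a, b in zip(updated, scores)) < tol
        scores = updated
        if converged:
            break
    return scores


# Hypothetical toy network: a chain of four genes, A - B - C - D.
genes = ["A", "B", "C", "D"]
network = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
seeds = [1, 0, 0, 0]  # assume gene A carries an aberration in this sample

scores = diffuse(network, seeds)
for gene, score in zip(genes, scores):
    print(f"{gene}: {score:.3f}")
```

In this toy case the diffused score is highest at the aberrant gene and decays with network distance, which is the intuition behind using diffusion to rank genes that sit close to observed alterations.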
Fusion multi-omics approach
- Pattern fusion analysis (PFA): Obtains the local sample patterns using principal component analysis (PCA) and aligns them to a common feature to create a global sample pattern across several data types. The contributions of each data type are measured to decrease the effects of bias, resulting in a comprehensive characterisation of samples across multi-omics datasets.
Similarity-based multi-omics approach
- PINSPlus: Identifies how often patients are grouped together in a single cluster when different types of omics data are used. The method is used for data integration and for clustering strongly connected patients into disease subtypes.
- Neighbourhood-based multi-omics clustering (NEMO): Initially assesses the similarity of patients for each of the omics datasets to form a similarity matrix. This is then integrated into a single matrix, which is subsequently clustered.
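The notion of "how often patients are grouped together" can be sketched with a simple co-clustering matrix. The example below assumes that each omics layer has already been clustered separately, and then counts the fraction of layers in which each pair of patients shares a cluster. It is a deliberately simplified illustration rather than the PINSPlus or NEMO algorithm, and the function names, threshold and cluster labels are hypothetical.

```python
def coclustering_frequency(labelings):
    """Fraction of omics layers in which each pair of patients shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


def strongly_connected_groups(freq, threshold=1.0):
    """Group patients linked by pairs that co-cluster in at least `threshold` of layers."""
    n = len(freq)
    group = [None] * n
    next_id = 0
    for i in range(n):
        if group[i] is not None:
            continue
        group[i] = next_id
        stack = [i]
        while stack:
            u = stack.pop()
            for v in range(n):
                if group[v] is None and freq[u][v] >= threshold:
                    group[v] = next_id
                    stack.append(v)
        next_id += 1
    return group


# Hypothetical per-layer cluster labels for six patients.
expression = [0, 0, 0, 1, 1, 1]
methylation = [0, 0, 1, 1, 1, 1]
mirna = [1, 1, 1, 0, 0, 0]  # label names are arbitrary within each layer

freq = coclustering_frequency([expression, methylation, mirna])
subtypes = strongly_connected_groups(freq)
print(subtypes)
```

Here the third patient is placed with different neighbours depending on the data type, so it is not merged into either of the two groups that are consistent across all three layers.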
Correlation-based multi-omics approach
- CNAmet: Uses weight calculation to link expression values to copy number and methylation, followed by score calculations to combine the weights to make one score per gene. Then, the statistical significance of this score is determined to help identify changes in gene expression.
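The three steps described for CNAmet (weights, a combined score per gene, and a significance test) can be sketched as follows. This is a minimal illustration under stated assumptions, not the CNAmet implementation: it assumes binary aberration calls per sample, uses a signal-to-noise statistic as the weight, simply sums the two weights, and estimates significance by permuting expression values. All function names and data values are hypothetical.

```python
import random
import statistics


def signal_to_noise(expression, calls):
    """Weight linking expression to a binary aberration call (1 = aberrant)."""
    hit = [e for e, c in zip(expression, calls) if c == 1]
    rest = [e for e, c in zip(expression, calls) if c == 0]
    if len(hit) < 2 or len(rest) < 2:
        return 0.0  # not enough samples in one group to estimate spread
    spread = statistics.stdev(hit) + statistics.stdev(rest)
    if spread == 0:
        return 0.0
    return (statistics.mean(hit) - statistics.mean(rest)) / spread


def gene_score(expression, copy_gain, hypomethylated):
    """Combine the copy number and methylation weights into one score per gene."""
    return signal_to_noise(expression, copy_gain) + signal_to_noise(
        expression, hypomethylated
    )


def permutation_p_value(expression, copy_gain, hypomethylated, n_perm=2000, seed=0):
    """Estimate how often a score this large arises when expression is shuffled."""
    rng = random.Random(seed)
    observed = gene_score(expression, copy_gain, hypomethylated)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if gene_score(shuffled, copy_gain, hypomethylated) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data for one gene across eight samples.
expression = [9.1, 8.7, 9.4, 8.9, 5.2, 4.8, 5.5, 5.0]
copy_gain = [1, 1, 1, 1, 0, 0, 0, 0]
hypomethylated = [1, 1, 1, 0, 0, 0, 0, 0]

score, p_value = permutation_p_value(expression, copy_gain, hypomethylated)
print(f"score={score:.2f}, p={p_value:.4f}")
```

A gene whose expression is high mainly in samples with a copy number gain and hypomethylation receives a large positive score, and the permutation step gives a rough sense of whether that association could have arisen by chance.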
Other multivariate multi-omics methods
- mixOmics: Allows integration of multi-omics datasets to cluster samples using supervised or unsupervised multivariate methods, with a focus on variable selection. Examples of some of the methods are PCA, partial least squares regression and canonical correlation analysis.
- moCluster: Uses multi-table multivariate analysis to identify patterns across multi-omics datasets. Variables are identified using PCA and are clustered.
- Multiple co-inertia analysis (MCIA): Captures the relationships between high-dimensional datasets, such as gene or protein expression. Covariance optimisation is used to transform the diverse features onto the same scale, and using graphical representations, similarities between the datasets or relevant features can be identified.
- Joint and individual variation explained (JIVE): Segregates multi-omics datasets into joint and individual effects to explain the variation across multiple datasets and within datasets. A permutation test is used to quantify the joint and individual patterns.
- Multiple factor analysis (MFA): Projects multi-omics datasets in a low-dimensional space to integrate numerical variables and categorical variables. It provides a balanced representation of an individual as well as common structures. PCA identifies individual patterns in each omics data, whilst global analysis reveals common structures.
- Regularised multiple kernel learning (rMKL): Projects samples in a low-dimensional space to cluster them. Higher weights are automatically assigned to data types with higher information content, and heterogeneous data from multiple sources are integrated to identify subtypes.
- Integrative nonnegative matrix factorisation (iNMF): Analyses high-dimensional multi-omics datasets by combining the homogeneous and heterogeneous patterns through a partitioned factorisation structure. The method is repeated several times to obtain the optimal minimum objective function.
- Feature selection multiple kernel learning (FSMKL): Captures the similarity between datasets as each is encoded into a base kernel per data type. A feature selection algorithm then finds the most relevant kernel and its importance within the multiple datasets.
- Sparse multi-block partial least squares (sMBPLS): Decomposes multi-omics data sets into multi-dimensional regulatory modules to help identify driving parameters responsible for the relationship between input variables and response variables.
- Thresholding singular value decomposition (T-SVD): Identifies the regulatory mechanisms between two omics datasets, particularly when the number of features is larger than the number of measured samples. A sparsity constraint is used with the assumption that each regulatory program only regulates a small number of response variables.
There is a wide spectrum of multi-omics integration tools, which use a variety of mathematical approaches. Additionally, there are several platforms to help visualise and interpret multi-omics data after it has been integrated. Some examples include cBioPortal, FireBrowse and OASIS, all of which help to explore the interactions between multi-omics layers in physiology and disease.
To determine which tools and approaches are right for a multi-omics dataset, researchers should benchmark numerous methods and compare them in detail using the same data. This should enable conclusions to be drawn about which approaches produce the best clusters and most reliably identify shared structure between the datasets.
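One simple way to compare clustering methods on the same data is to score each result against a set of reference labels using the adjusted Rand index. The sketch below assumes that such reference subtypes exist (for example from simulated data or an accepted clinical classification), which will not always be the case. The method names and labels are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings, corrected for chance (1.0 = identical)."""
    n = len(labels_a)
    # Pairs of samples placed together in both clusterings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings put everything in one cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical reference subtypes and the output of three integration methods.
reference = [0, 0, 0, 1, 1, 1]
results = {
    "method_a": [1, 1, 1, 0, 0, 0],  # same grouping, different label names
    "method_b": [0, 0, 1, 1, 1, 1],  # one sample misplaced
    "method_c": [0, 1, 0, 1, 0, 1],  # unrelated grouping
}

ari = {name: adjusted_rand_index(reference, labels) for name, labels in results.items()}
ranking = sorted(ari, key=ari.get, reverse=True)
for name in ranking:
    print(f"{name}: ARI = {ari[name]:.3f}")
```

Because the index ignores the arbitrary names given to clusters, it can be applied to the output of different tools without first aligning their labels. When no reference labels are available, internal measures or clinical outcome associations would be needed instead.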
For more information about the challenges facing data integration in multi-omics analysis, check out Multi-omics: An Integrative Approach to Biomedical Research. The report includes exclusive contributions from experts in the multi-omics field, including Tamás Korcsmáros, a data integration expert from the Earlham Institute and Quadram Institute.
This guide relies heavily on the publication: Multi-omics Data Integration, Interpretation, and Its Applications by I. Subramanian et al., 2020.
Image credit: BioIQ