A Guide to Multi-omics Integration Strategies

This feature draws on content from Chapter 2 of the Multi-omics Playbook, released in February 2024. For an in-depth look at several of the popular integration methods, along with much more, please download the playbook.

Figure 1. Integration of multi-omics across the major omics layers. Image Credit: Hossain, et al. [1]

Intro to integration

While each omic provides valuable data on its own, combining them yields insights that no single layer can offer. Integrating multi-omics data can reveal new cell subtypes, cell-cell interactions, and interactions between the different omic layers that lead to gene regulatory and phenotypic outcomes. Since each omic layer is causally tied to the next, multi-omics integration serves to disentangle these relationships and so capture cell phenotype properly (Figure 1).

We are now in the era of acquiring data from millions of cells. Integrating this large, complex, multimodal data has the potential to reveal much about biological mechanisms and pathways, but it represents a considerable challenge to researchers [2]. Fortunately, sophisticated computational tools and methodologies are at hand to address this issue.

But which ones should you use? The principal focus of this feature will be to outline the selection of methodologies that are currently available. It will cover the different types of integration and the options available for bulk, single-cell and spatial datasets.

Why is integration still a challenge?

Ultimately, integration of multi-omics data is a moving target for which a one-size-fits-all approach will not work. Drawing insights from any two omics requires a tailored strategy, since each omic has its own data scale and noise profile and, hence, its own preprocessing steps. How these omics correlate within the same sample and the same cell is not yet fully understood.

For example, convention tells us that actively transcribed genes should have more accessible chromatin, a correlation we can model but one that may not always hold. For other modality pairs, such as RNA-seq and protein data, the relationship is weaker still: the most abundant protein may not correspond to the most highly expressed gene. This disconnect makes integration very difficult. Moreover, sensitivity remains an issue: a gene detected at the RNA level may simply be missing from the protein dataset.

Furthermore, these omics are not captured with the same breadth, meaning there is inevitably missing data. scRNA-seq can profile thousands of genes, whilst current proteomic methods have a more limited spectrum, with perhaps only 100 proteins. Even if you can predict the level of each protein relatively accurately, the limited number of features in the protein dataset means the cross-modality cell-cell similarity is more difficult to measure. Specific tools are required for specific challenges, hence the variety of tools available.

Types of Integration

When trying to divide up the computational tools for integration meaningfully, one principal distinction between strategies is whether the tool is designed for multi-omics data that is matched (recorded from the same cell) or unmatched (recorded from different cells). While modern methods are frequently able to create the more desirable, matched multi-omics data, much of the previous work in this area involves integrating unmatched data. Furthermore, not only can unmatched data refer to different cells from the same sample, but it can also involve integrating different cells from different samples of the same tissue, taken at different times in different experiments (see Figure 2A).

Integration can also be seen as operating at the horizontal, vertical and diagonal levels [3]. Horizontal integration is the merging of the same omic across multiple datasets. Several tools exist for this purpose, and while it is technically a form of integration, it is not true multi-omics integration and so won’t be considered further.

Figure 2. Computational strategies for single-cell multi-omics integration. (A) Levels of input data – paired, partially paired and unpaired. (B) Techniques deployed in integration. Image Credit: Flynn, et al. [4]

Vertical integration merges data from different omics within the same set of samples, which is essentially equivalent to matched integration. The cell is leveraged as the anchor to bring these omics together.

Diagonal integration is the final, and most technically challenging, form of integration. Here, different omics from different cells/different studies are brought together. The anchor can no longer be the cell and instead must be some co-embedded space in which commonality between cells is found. This is essentially ‘unmatched’ integration.
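
The three levels can be summarised in a few lines of code. The following is a minimal sketch, assuming each dataset is described only by its modality and its set of cell barcodes; the `Dataset` class, the `integration_type` function and the barcodes are hypothetical names introduced for illustration, and treating any barcode overlap as "matched" is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    modality: str      # e.g. "RNA", "ATAC", "protein"
    cells: frozenset   # cell barcodes profiled in this dataset


def integration_type(a: Dataset, b: Dataset) -> str:
    """Classify the integration task for two datasets."""
    same_modality = a.modality == b.modality
    shared_cells = bool(a.cells & b.cells)  # simplification: any overlap counts
    if same_modality:
        # Same omic across datasets: horizontal (not true multi-omics integration).
        return "horizontal"
    if shared_cells:
        # Different omics, same cells: the cell is the anchor.
        return "vertical"
    # Different omics, different cells: an anchor must be learned.
    return "diagonal"


rna = Dataset("RNA", frozenset({"c1", "c2", "c3"}))
atac_same_cells = Dataset("ATAC", frozenset({"c1", "c2", "c3"}))
atac_other_cells = Dataset("ATAC", frozenset({"c7", "c8"}))

print(integration_type(rna, atac_same_cells))   # vertical
print(integration_type(rna, atac_other_cells))  # diagonal
```

Partially paired designs (Figure 2A) sit between the last two cases and are not distinguished by this sketch.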

Below you will find a table detailing many of the available multi-omics integration tools and whether they are for matched or unmatched integration. The following sections of this feature will address each of these in turn, before concluding with a look at transfer-learning and spatial integration strategies.

Table 1. Multi-omics integration tools separated by matched vs. unmatched integration capacity. This table was adapted from Baysoy, et al. [5]

Year | Name | Methodology | Integration capacity | Ref.
MATCHED INTEGRATION TOOLS (from same single cell)
2019 | SCHEMA | Metric learning-based method | Chromatin accessibility, mRNA, proteins, immunoprofiling, spatial coordinates | 6
2020 | Seurat v4 | Weighted nearest-neighbour | mRNA, spatial coordinates, protein, accessible chromatin, microRNA | 7
2021 | DCCA | Variational autoencoders | mRNA, chromatin accessibility | 8
2021 | DeepMAPS | Autoencoder-like neural networks | mRNA, chromatin accessibility, protein | 9
2019 | citeFUSE | Network-based method | mRNA, protein | 10
2020 | MOFA+ | Factor analysis | mRNA, DNA methylation, chromatin accessibility | 11
2020 | scMVAE | Variational autoencoder | mRNA, chromatin accessibility | 12
2020 | totalVI | Deep generative | mRNA, protein | 13
2020 | BREM-SC | Bayesian mixture model | mRNA, protein | 14
2022 | SCENIC+ | Unsupervised identification model | mRNA, chromatin accessibility | 15
2022 | FigR | Constrained optimal cell mapping | mRNA, chromatin accessibility | 16
2021 | MIRA | Probabilistic topic modelling | mRNA, chromatin accessibility | 17
2023 | CellOracle | Modelling gene regulatory networks | mRNA, CRISPR screening, chromatin accessibility | 18
2022 | MultiVelo | Probabilistic latent variable model | mRNA, chromatin accessibility | 19
UNMATCHED INTEGRATION TOOLS (from different single cells)
2019 | Spectrum | Weighted nearest-neighbour | microRNA, mRNA, protein | 20
2020 | BindSC | Canonical correlation | mRNA, chromatin accessibility | 21
2019 | MMD-MA | Manifold alignment | mRNA, chromatin accessibility, DNA methylation, imaging | 22
2019 | MuSiC | Unsupervised topic modelling | mRNA, CRISPR screening | 23
2019 | Seurat v3 | Canonical correlation analysis | mRNA, chromatin accessibility, protein, spatial | 24
2020 | UnionCom | Manifold alignment | mRNA, DNA methylation, chromatin accessibility | 25
2019 | CloneAlign | Statistical method | mRNA, DNA | 26
2021 | Pamona | Manifold alignment | mRNA, chromatin accessibility | 27
2022 | GLUE | Variational autoencoders | Chromatin accessibility, DNA methylation, mRNA | 28
2019 | LIGER | Integrative non-negative matrix factorization | mRNA, DNA methylation | 29
2022 | StabMap | Mosaic data integration | mRNA, chromatin accessibility | 30
2021 | Cobolt | Multimodal variational autoencoder | mRNA, chromatin accessibility | 31
2021 | MultiVI | Probabilistic modelling | mRNA, chromatin accessibility | 32
2022 | Seurat v5 | Bridge integration | mRNA, chromatin accessibility, DNA methylation, protein | 33

Matched (Vertical) Integration

Vertical integration methods rely on technologies that profile omics data from two or more distinct modalities from within a single cell. From this position, the cell itself can be used as an anchor by which to integrate varying modalities.
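
To make the idea of the cell as anchor concrete, here is a minimal sketch, assuming each modality is stored as a mapping from cell barcode to a list of feature values. The function name `pair_by_barcode`, the barcodes and the counts are all hypothetical toy values, not output from any real assay.

```python
def pair_by_barcode(rna, protein):
    """Keep only cells measured in both modalities, keyed by shared barcode."""
    shared = sorted(rna.keys() & protein.keys())
    return {
        barcode: {"rna": rna[barcode], "protein": protein[barcode]}
        for barcode in shared
    }


# Hypothetical toy data: barcode -> feature values for that cell.
rna_counts = {"AAAC": [5, 0, 2], "AAAG": [1, 3, 0], "AACT": [0, 0, 7]}
protein_counts = {"AAAC": [120, 4], "AACT": [8, 95], "AAGG": [40, 40]}

paired = pair_by_barcode(rna_counts, protein_counts)
print(list(paired))  # ['AAAC', 'AACT']
```

Cells seen in only one modality (here `AAAG` and `AAGG`) drop out, which is why matched methods depend on technologies that capture both layers from the same cell.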

Given that the majority of multi-omics technologies measure either RNA and protein concurrently or RNA and epigenomic information (mainly via ATAC-seq) concurrently, the majority of tools for vertical integration focus on these pairs of modalities, as can be seen in Table 1.

Of the various approaches, there are matrix factorization methods (e.g., MOFA+ [11]), neural network-based methods (e.g., scMVAE [12], DCCA [8], DeepMAPS [9]) and network-based methods (e.g., CiteFuse, Seurat v4) [34]. See Figures 2B and 3 for an overview of the techniques.

Figure 3. Single-cell multi-omics data integration methods. The easiest way to integrate multi-omics is to concatenate the original feature matrices of the various omics data, but the noise and the distinct meaning of the values in each matrix confuse the results of the integration. Machine learning methods extract features from the original matrices and then combine the features across modalities. Deep learning algorithms have also been applied based on various types of networks; e.g., linear, convolution and self-attention. Image Credit: Wang, et al. [34]
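
The contrast between naive concatenation and feature extraction can be sketched in a few lines. The example below is an illustration under stated assumptions, not the algorithm of MOFA+ or any other tool in Table 1: it assumes matched cells (rows aligned across modalities), z-scores each feature, down-weights each modality by the square root of its feature count so that a large RNA matrix does not swamp a small protein panel, and then uses a truncated SVD as a simple stand-in for factor extraction. The function names and the random toy matrices are hypothetical.

```python
import numpy as np


def zscore(X):
    """Centre each feature and scale it to unit variance (constant features stay 0)."""
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0
    return (X - X.mean(axis=0)) / sd


def joint_embedding(modalities, n_factors=2):
    """Embed matched cells using all modalities at once.

    modalities: list of (n_cells x n_features) arrays with rows in the same cell order.
    """
    blocks = []
    for X in modalities:
        Z = zscore(np.asarray(X, dtype=float))
        # Give each modality roughly equal total weight regardless of feature count.
        blocks.append(Z / np.sqrt(Z.shape[1]))
    concatenated = np.hstack(blocks)
    # Truncated SVD: cells are described by a few shared factors.
    U, S, _ = np.linalg.svd(concatenated, full_matrices=False)
    return U[:, :n_factors] * S[:n_factors]


rng = np.random.default_rng(0)
rna = rng.poisson(2.0, size=(6, 50))       # toy data: 6 cells x 50 genes
protein = rng.poisson(30.0, size=(6, 4))   # toy data: 6 cells x 4 proteins
embedding = joint_embedding([rna, protein], n_factors=2)
print(embedding.shape)  # (6, 2)
```

Skipping the per-modality scaling step reproduces the problem described in the caption: the modality with more features, or larger raw values, dominates the result.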

Unmatched (Diagonal) Integration

Unmatched integration highlights a different and more substantial challenge. The experiments are technically easier to perform because each cell can be treated optimally for the omic analysed. Yet, because the omics data from different modalities are drawn from distinct populations/cells, the cell or tissue cannot be used as an anchor. An anchor has to be derived by some other means.

A general solution to this problem is to project cells into a co-embedded space or non-linear manifold to find commonality between cells in the omics space. Due to the learning-based nature of this task, this space is populated by a number of machine learning and statistical methods designed to find the most appropriate anchor to align cells.
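
As a rough illustration of what finding an anchor in a co-embedded space can look like, the sketch below assumes both datasets have already been mapped onto the same feature set (for example, gene expression from RNA and gene-level activity scores derived from ATAC; that mapping is an assumption and is not shown). It standardises each dataset separately, learns a shared linear basis from the stacked data, and reports mutual nearest neighbours across datasets as candidate anchors. This is a simple linear stand-in, not the method used by any specific tool in Table 1, and all names are hypothetical.

```python
import numpy as np


def co_embed(X_a, X_b, n_dims=2):
    """Project two datasets with a shared feature set into one low-dimensional space."""
    def standardise(X):
        X = np.asarray(X, dtype=float)
        sd = X.std(axis=0)
        sd[sd == 0] = 1.0
        return (X - X.mean(axis=0)) / sd

    # Standardising per dataset crudely removes modality-specific scale.
    Z_a, Z_b = standardise(X_a), standardise(X_b)
    _, _, Vt = np.linalg.svd(np.vstack([Z_a, Z_b]), full_matrices=False)
    basis = Vt[:n_dims].T
    return Z_a @ basis, Z_b @ basis


def mutual_nearest_neighbours(E_a, E_b):
    """Return (i, j) pairs where cell i in A and cell j in B are each other's nearest."""
    E_a, E_b = np.asarray(E_a, dtype=float), np.asarray(E_b, dtype=float)
    dist = np.linalg.norm(E_a[:, None, :] - E_b[None, :, :], axis=2)
    nearest_b = dist.argmin(axis=1)  # for each cell in A, closest cell in B
    nearest_a = dist.argmin(axis=0)  # for each cell in B, closest cell in A
    return [(i, int(j)) for i, j in enumerate(nearest_b) if nearest_a[j] == i]


# Hand-made embeddings: cell 2 of B has no counterpart in A and stays unanchored.
anchors = mutual_nearest_neighbours([[0, 0], [10, 10]], [[9, 9], [1, 1], [50, 50]])
print(anchors)  # [(0, 1), (1, 0)]
```

Requiring the match to be mutual is what keeps cells without a counterpart in the other dataset from being forced into an anchor pair.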

A popular, recently introduced tool for unmatched integration is Graph-Linked Unified Embedding (GLUE [28]), which can achieve triple-omic integration. Using a graph variational autoencoder, GLUE can learn how to anchor features using prior biological knowledge, which it uses to link omic data.
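
One plausible form of such prior knowledge is genomic proximity: a chromatin accessibility peak that overlaps a gene body or its promoter is a candidate regulator of that gene. The sketch below builds feature-to-feature links of this kind. It is an illustration only and does not use GLUE's own API; the function name, the 2,000 bp promoter window, and the gene and peak coordinates are all hypothetical assumptions.

```python
def build_guidance_edges(genes, peaks, promoter_window=2000):
    """Link each peak to every gene whose body or promoter it overlaps.

    genes: name -> (chrom, start, end, strand), half-open coordinates
    peaks: name -> (chrom, start, end), half-open coordinates
    """
    edges = []
    for gene, (g_chrom, g_start, g_end, strand) in genes.items():
        # Extend the gene interval upstream of the transcription start site.
        if strand == "+":
            lo, hi = g_start - promoter_window, g_end
        else:
            lo, hi = g_start, g_end + promoter_window
        for peak, (p_chrom, p_start, p_end) in peaks.items():
            if p_chrom == g_chrom and p_start < hi and p_end > lo:
                edges.append((peak, gene))
    return edges


# Hypothetical coordinates for illustration only.
genes = {"GENE_A": ("chr1", 10_000, 20_000, "+")}
peaks = {
    "peak1": ("chr1", 8_500, 9_000),    # inside the promoter window
    "peak2": ("chr1", 30_000, 30_500),  # too far downstream
    "peak3": ("chr2", 8_500, 9_000),    # different chromosome
}
print(build_guidance_edges(genes, peaks))  # [('peak1', 'GENE_A')]
```

Links like these connect features across omic layers, which gives a model something to align on when no cell is shared between datasets.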

Mosaic Integration

Mosaic integration is an alternative to diagonal integration. This can be used when you have an experimental design where each experiment has various combinations of omics that create sufficient overlap.

For example, if one sample was assessed for transcriptomics and proteomics, another for transcriptomics and epigenomics and a third for proteomics and epigenomics, there is enough commonality between these samples to integrate the data.
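
Whether a design has "enough commonality" can be checked before any modelling. A minimal sketch, assuming each sample is described by the set of omics measured in it: treat samples as nodes, connect two samples when they share at least one omic, and require that every sample is reachable from every other. The function name is hypothetical, and connectivity is only a necessary condition under this sketch, not a guarantee that integration will work well.

```python
from collections import deque


def is_mosaic_connected(samples):
    """samples: name -> set of omics measured. True if overlaps link all samples."""
    names = list(samples)
    if not names:
        return False
    seen = {names[0]}
    queue = deque([names[0]])
    while queue:  # breadth-first search over the sample-overlap graph
        current = queue.popleft()
        for other in names:
            if other not in seen and samples[current] & samples[other]:
                seen.add(other)
                queue.append(other)
    return len(seen) == len(names)


design = {
    "sample_1": {"transcriptomics", "proteomics"},
    "sample_2": {"transcriptomics", "epigenomics"},
    "sample_3": {"proteomics", "epigenomics"},
}
print(is_mosaic_connected(design))  # True
print(is_mosaic_connected({"a": {"transcriptomics"}, "b": {"epigenomics"}}))  # False
```

The second call has no shared omic between the two samples, which is the situation where diagonal integration is needed instead.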

Tools such as Cobolt [31] and MultiVI [32] present modern methods to integrate mRNA and chromatin accessibility in a mosaic fashion. They create a single representation of the cells across datasets to be used in downstream analysis.

Another tool here is MultiMAP [35]. It is a graph-based method that assumes a uniform distribution of cells across a latent manifold structure to integrate datasets with unique and shared features.

Finally, for mosaic integration, two very recent tools, StabMap [30] and bridge integration [33], were highlighted in Nature Biotechnology [36].

Spatial Integration

With the increasing development of spatial multi-omics methods, new integration strategies are needed for this data. Principally, we are looking at vertical spatial integration as these spatial modalities naturally capture the omics within the confines of a cell or ‘spot’, which works as the anchor.

Existing tools, such as ArchR [37], have been successfully deployed for spatial integration. In one example, the RNA modality was used to indirectly map other modalities in space, specifically to integrate the spatial transcriptome and epigenome [38]. Another example is Cell2location [39], which was successfully used to integrate spatial RNA and ATAC data in the human heart using a shared nearest neighbours (SNN) strategy [40].
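
For readers unfamiliar with the SNN idea, the sketch below shows the general principle rather than the pipeline used in the cited heart study: two cells (or spots) are considered similar when their k-nearest-neighbour sets overlap, scored here as the Jaccard index. The function names and the one-dimensional toy coordinates are hypothetical.

```python
import numpy as np


def knn_sets(X, k):
    """For each row of X, the index set of its k nearest other rows."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    return [set(int(j) for j in row) for row in nearest]


def snn_similarity(X, k):
    """Jaccard overlap of k-nearest-neighbour sets for every pair of rows."""
    neighbours = knn_sets(X, k)
    n = len(neighbours)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbours[i] & neighbours[j])
            union = len(neighbours[i] | neighbours[j])
            S[i, j] = S[j, i] = shared / union
    return S


# Toy embedding with two well-separated groups of three cells each.
cells = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
S = snn_similarity(cells, k=2)
print(S[0, 1] > 0, S[0, 3] == 0)  # True True
```

Because the score depends on shared neighbourhoods rather than raw distances, it is less sensitive to differences in scale between the embeddings being compared.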

Given the popularity of GLUE for diagonal integration of single-cell data, Dr. Jinmiao Chen’s group has recently released SpatialGlue [41], a spatial version that allows the integration of omics on spatial sections.

Existing tools are also being modified for spatial analysis. For example, the developers of MOFA+ have recently released MEFISTO [42], which uses the same factor analysis approach with a new capability to handle both temporal and spatial components within the model.

The development of paired and unpaired spatial integration methods is a space to watch as more paired multi-omics methods are released.

References

1. Hossain, M.S., Joshi, T. & Stacey, G. System approaches to study root hairs as a single cell plant model: current status and future perspectives. Front Plant Sci 6, 363 (2015).

2. Miao, Z., Humphreys, B.D., McMahon, A.P. & Kim, J. Multi-omics integration in the age of million single-cell data. Nature Reviews Nephrology 17, 710-724 (2021).

3. Argelaguet, R., Cuomo, A.S.E., Stegle, O. & Marioni, J.C. Computational principles and challenges in single-cell data integration. Nature Biotechnology 39, 1202-1215 (2021).

4. Flynn, E., Almonte-Loya, A. & Fragiadakis, G.K. Single-Cell Multiomics. Annual Review of Biomedical Data Science 6, 313-337 (2023).

5. Baysoy, A., Bai, Z., Satija, R. & Fan, R. The technological landscape and applications of single-cell multi-omics. Nature Reviews Molecular Cell Biology, 1-19 (2023).

6. Singh, R., Hie, B.L., Narayan, A. & Berger, B. Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities. Genome Biology 22, 1-24 (2021).

7. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573-3587.e29 (2021).

8. Zuo, C., Dai, H. & Chen, L. Deep cross-omics cycle attention model for joint analysis of single-cell multi-omics data. Bioinformatics 37, 4091-4099 (2021).

9. Ma, A. et al. Single-cell biological network inference using a heterogeneous graph transformer. Nature Communications 14, 964 (2023).

10. Kim, H.J., Lin, Y., Geddes, T.A., Yang, J.Y.H. & Yang, P. CiteFuse enables multi-modal analysis of CITE-seq data. Bioinformatics 36, 4137-4143 (2020).

11. Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biology 21, 1-17 (2020).

12. Zuo, C. & Chen, L. Deep-joint-learning analysis model of single cell transcriptome and open chromatin accessibility data. Briefings in Bioinformatics 22, bbaa287 (2021).

13. Gayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nature Methods 18, 272-282 (2021).

14. Wang, X. et al. BREM-SC: a Bayesian random effects mixture model for joint clustering single cell multi-omics data. Nucleic Acids Research 48, 5814-5824 (2020).

15. Bravo González-Blas, C. et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nature Methods 20, 1355-1367 (2023).

16. Kartha, V.K. et al. Functional inference of gene regulation using single-cell multi-omics. Cell Genomics 2 (2022).

17. Lynch, A.W. et al. MIRA: joint regulatory modeling of multimodal expression and chromatin accessibility in single cells. Nature Methods 19, 1097-1108 (2022).

18. Kamimoto, K. et al. Dissecting cell identity via network inference and in silico gene perturbation. Nature 614, 742-751 (2023).

19. Li, C., Virgilio, M.C., Collins, K.L. & Welch, J.D. Multi-omic single-cell velocity models epigenome–transcriptome interactions and improves cell fate prediction. Nature Biotechnology 41, 387-398 (2023).

20. John, C.R., Watson, D., Barnes, M.R., Pitzalis, C. & Lewis, M.J. Spectrum: fast density-aware spectral clustering for single and multi-omic data. Bioinformatics 36, 1159-1166 (2020).

21. Dou, J. et al. Unbiased integration of single cell multi-omics data. bioRxiv, 2020.12.11.422014 (2020).

22. Liu, J., Huang, Y., Singh, R., Vert, J.P. & Noble, W.S. Jointly Embedding Multiple Single-Cell Omics Measurements. Algorithms Bioinform 143 (2019).

23. Duan, B. et al. Model-based understanding of single-cell CRISPR screening. Nature Communications 10, 2233 (2019).

24. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888-1902.e21 (2019).

25. Cao, K., Bai, X., Hong, Y. & Wan, L. Unsupervised topological alignment for single-cell multi-omics integration. Bioinformatics 36, i48-i56 (2020).

26. Campbell, K.R. et al. clonealign: statistical integration of independent single-cell RNA and DNA sequencing data from human cancers. Genome Biology 20, 1-12 (2019).

27. Cao, K., Hong, Y. & Wan, L. Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona. Bioinformatics 38, 211-219 (2021).

28. Cao, Z.-J. & Gao, G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nature Biotechnology 40, 1458-1466 (2022).

29. Welch, J.D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873-1887.e17 (2019).

30. Ghazanfar, S., Guibentif, C. & Marioni, J.C. Stabilized mosaic single-cell data integration using unshared features. Nature Biotechnology (2023).

31. Gong, B., Zhou, Y. & Purdom, E. Cobolt: integrative analysis of multimodal single-cell sequencing data. Genome Biology 22, 351 (2021).

32. Ashuach, T. et al. MultiVI: deep generative model for the integration of multimodal data. Nature Methods 20, 1222-1231 (2023).

33. Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nature Biotechnology (2023).

34. Wang, X., Wu, X., Hong, N. & Jin, W. Progress in single-cell multimodal sequencing and multi-omics data integration. Biophysical Reviews (2023).

35. Jain, M.S. et al. MultiMAP: dimensionality reduction and integration of multimodal data. Genome Biology 22, 1-26 (2021).

36. Lee, M.Y.Y. & Li, M. Integration of multi-modal single-cell data. Nature Biotechnology (2023).

37. Granja, J.M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nature Genetics 53, 403-411 (2021).

38. Foster, D.S. et al. Integrated spatial multiomics reveals fibroblast fate during tissue repair. Proceedings of the National Academy of Sciences 118, e2110025118 (2021).

39. Kleshchevnikov, V. et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nature Biotechnology 40, 661-671 (2022).

40. Kuppe, C. et al. Spatial multi-omic map of human myocardial infarction. Nature 608, 766-777 (2022).

41. Long, Y. et al. Integrated analysis of spatial multi-omics with SpatialGlue. bioRxiv, 2023.04.26.538404 (2023).

42. Velten, B. et al. Identifying temporal and spatial patterns of variation from multimodal data using MEFISTO. Nature Methods 19, 179-186 (2022).