Multi-omics is an emerging field. ‘Omics’ refers to the comprehensive collection of a set of biological molecules. Essentially, the objective of omics studies is to identify, characterize and quantify all the biological molecules involved in the structure, function and dynamics of a cell, tissue or organism.
Genomics was the first omics discipline to emerge, focusing on the exploration of entire genomes, as opposed to single genes or individual variants. Initially, genomics studies provided a useful framework for mapping out the underlying genetic basis of complex diseases. However, as technological advances make the analysis of biological molecules ever more cost-efficient and high-throughput, other omics fields are rapidly accelerating. The reality is that studying only a single layer of information from each cell can give a skewed picture.
In turn, researchers have started to combine multiple layers of information through ‘multi-omics’ techniques, untangling the heterogeneity of many biological mechanisms. Multi-omics is the simultaneous study of multiple molecular ‘omes’. These integrative investigations can provide detailed insights into the complex molecular mechanisms that underpin disease, and they are becoming increasingly accessible to researchers thanks to improved sequencing technologies, large publicly available datasets and novel analysis and visualization tools.
For exclusive expert perspectives and several multi-omics case studies, check out Multi-omics: An Integrative Approach to Biomedical Research. This report demonstrates the rationale for integrating multiple omics layers into investigations and highlights the success of recent research in the area. The most up-to-date applications of spatial transcriptomics and single-cell multi-omics are also described in-depth. Download it for free here:
| Section | Overview |
| --- | --- |
| What are the types of omics? | The different types of omics are described and the step-by-step process of a typical multi-omics experiment is explained. |
| Approaches to multi-omics studies | The genome first, phenotype first and environment first approaches to multi-omics analysis are defined. |
| Challenges in multi-omics data | The challenges that face multi-omics studies are described and the future of the field is briefly discussed. |
| Resources for multi-omics analysis | Useful multi-omics software and databases are listed and relevant webinars are recommended. |
The genome is the complete sequence of DNA in a cell or organism. This genetic material remains relatively constant over time, except for mutations and chromosomal rearrangements. Genome-wide association studies (GWAS) have been used to identify huge numbers of genetic variants linked to complex diseases in multiple human populations; a typical GWAS involves thousands of individuals being genotyped for over 1 million genetic markers. Genomic analysis can be conducted on a variety of experimental platforms, including single nucleotide polymorphism (SNP) chips and DNA sequencing technologies, which determine the sequence of nucleotides and detect insertions and deletions (indels) and copy number variation.
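At its core, a GWAS tests each variant for an allele-frequency difference between cases and controls. As a minimal illustration (not any specific GWAS pipeline), a 1-degree-of-freedom chi-square test on an invented 2×2 allele-count table might look like this:

```python
import math

def allele_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """1-degree-of-freedom chi-square test on a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            row_sum = sum(table[i])
            col_sum = table[0][j] + table[1][j]
            expected = row_sum * col_sum / total  # counts expected under no association
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: the alternative allele is more frequent in cases.
chi2, p = allele_chi_square(case_alt=300, case_ref=700, ctrl_alt=200, ctrl_ref=800)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

A real GWAS repeats a test like this across millions of markers, which is why stringent multiple-testing thresholds (commonly p < 5 × 10⁻⁸) are applied.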
The proteome is the complete set of proteins expressed by a cell, tissue or organism. It is inherently complex because proteins often undergo post-translational modifications, such as glycosylation, phosphorylation, acetylation and other alterations to their amino acids. These processes play key roles in intracellular signaling, control of enzyme activity, protein transport and overall cell structure maintenance. Moreover, proteins have different spatial configurations and often interact with other proteins or molecules. Overall, this can make proteomic studies challenging, but proteins can still be detected and quantified using mass spectrometry and protein microarrays. These high-throughput technologies reveal interactions between thousands of proteins in cells or body fluids by methods such as phage display and yeast two-hybrid assays. Approaches can also investigate global proteome interactions and quantify post-translational modifications. AlphaFold, an AI system developed by DeepMind, can predict a protein’s 3D structure from its amino acid sequence with high accuracy, allowing unprecedented, in-depth exploration of the proteome. AlphaFold predictions now cover 98.5% of human proteins!
The transcriptome is the complete set of RNA transcripts in a cell or tissue, including ribosomal RNA (rRNA), messenger RNA (mRNA), transfer RNA (tRNA), micro RNA (miRNA) and other non-coding RNA (ncRNA). The transcriptome is mostly measured using microarrays, which are based on oligonucleotide probes, and RNA sequencing (RNA-seq), a more recent approach that doesn’t require probes. Both examine which transcripts are present and how strongly they are expressed, and can identify novel splice sites and RNA editing sites.
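Because transcripts differ in length and sequencing libraries differ in depth, raw RNA-seq read counts are normalised before expression levels can be compared. One widely used metric is transcripts per million (TPM); a minimal sketch, with invented counts and lengths:

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert raw read counts to transcripts per million (TPM).

    counts: reads mapped to each transcript
    lengths_kb: transcript lengths in kilobases
    """
    # First normalise each count by transcript length...
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    # ...then rescale so the values for one sample sum to one million.
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

# Hypothetical counts for three transcripts of different lengths:
tpm = counts_to_tpm(counts=[100, 200, 300], lengths_kb=[1.0, 2.0, 3.0])
print([round(x, 1) for x in tpm])
```

Note that in this invented example the three transcripts end up with equal TPM, because the longer transcripts accumulated proportionally more reads.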
The epigenome is made up of reversible chemical modifications to the DNA, or to the histones that bind DNA, which can persist across generations and vary substantially among different cell types within an organism. These alterations produce changes in gene expression without altering the underlying base sequence, and can occur in a tissue-specific manner or in response to environmental factors and diseased states. Measuring epigenomic alterations with high-throughput technologies typically involves profiling DNA methylation or histone modifications, such as acetylation. Many epigenome-wide studies have reported that these modifications are highly important in many biological processes and the development of several diseases, such as cancer and cardiovascular conditions.
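For DNA methylation data, a common per-site summary is the beta value: the methylated fraction of total signal at a CpG site. The small offset below follows the convention used for Illumina methylation arrays; the probe intensities are invented:

```python
def beta_value(methylated, unmethylated, offset=100):
    """Methylation beta value: fraction of methylated signal at a CpG site.

    The offset stabilises the ratio when both intensities are low
    (the convention used for Illumina methylation arrays).
    """
    return methylated / (methylated + unmethylated + offset)

# Hypothetical probe intensities at two CpG sites:
print(round(beta_value(9000, 1000), 2))   # heavily methylated site
print(round(beta_value(500, 9500), 2))    # mostly unmethylated site
```

Beta values near 1 indicate full methylation and values near 0 indicate none, which makes epigenome-wide comparisons between tissues or disease states straightforward to summarise.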
The metabolome is the complete set of small-molecule metabolites found within a biological sample, such as carbohydrates, lipids, amino acids and nucleotides. It also includes signaling molecules, such as hormones, and other exogenous substances. The metabolome can vary within a single organism and differ drastically between individuals of the same species, due to several varying factors including diet, stress, physical activity and disease. Metabolomic studies can be conducted using mass spectrometry or nuclear magnetic resonance (NMR) spectroscopy. Recently, quantitative measures of metabolite levels have made it possible to discover novel genetic loci that regulate small molecules and their relative ratios.
The microbiome consists of all the microorganisms in a given community, such as bacteria, viruses and fungi, which colonize the skin, mucosal surfaces and the gut. The human microbiome is hugely complex – the gut contains around 1 trillion bacteria from 1,000 different species. Microbiota composition between individuals varies substantially, due to developmental differences, diet, age and environmental factors. It is being increasingly recognized that alterations in gut bacteria can contribute to a variety of disorders, including diabetes, cancer, cardiac conditions and autism. Microbiomics studies involve amplifying and then sequencing hypervariable regions of the bacterial 16S rRNA gene, followed by clustering the reads into operational taxonomic units (OTUs).
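That final clustering step can be sketched as a toy greedy algorithm: each read joins the first cluster whose seed it matches at or above an identity threshold (97% is the conventional cut-off for species-level OTUs). Real tools handle alignment and reads of varying length; the identity measure and ten-base ‘reads’ below are deliberately simplified:

```python
def identity(a, b):
    """Fraction of matching positions (assumes pre-aligned, equal-length reads)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_otu_cluster(reads, threshold=0.97):
    """Greedy clustering into operational taxonomic units (OTUs).

    Each read joins the first cluster whose seed it matches at >= threshold;
    otherwise it founds a new cluster and becomes its seed.
    """
    seeds, clusters = [], []
    for read in reads:
        for i, seed in enumerate(seeds):
            if identity(read, seed) >= threshold:
                clusters[i].append(read)
                break
        else:
            seeds.append(read)
            clusters.append([read])
    return clusters

# Toy reads: the first two are identical, the third is very different.
reads = ["ACGTACGTAC", "ACGTACGTAC", "TTTTAAAACC"]
print(len(greedy_otu_cluster(reads)))
```

Each resulting OTU is then treated as a proxy for a taxon when comparing microbiota composition between individuals.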
How to do multi-omics analysis
Compared to single omics studies, multi-omics research offers the opportunity to understand the complete flow of information underlying a disease. Integrating data across several omics layers can be done using a variety of approaches, depending on the study design. Essentially, if two omics elements share a common driver, or one perturbs the other, they will exhibit a correlation or association. These analyses can be done using specialized statistical approaches, whereby each element is investigated independently to assess whether it contributes to the disease.
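The simplest such association test is a correlation between one element from each layer, measured across the same samples. A minimal sketch, correlating an invented transcript with an invented metabolite:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length measurement vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical levels of one transcript and one metabolite in five samples:
transcript = [2.0, 4.0, 6.0, 8.0, 10.0]
metabolite = [1.1, 2.1, 2.9, 4.2, 5.0]
print(round(pearson(transcript, metabolite), 3))
```

A correlation close to 1 across many samples is consistent with the two elements sharing a common driver or one perturbing the other, though, as the Challenges section notes, it does not by itself establish causation.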
Steps of multi-omics discovery
- Step #1: Data quality control
- Step #2: Computational model development
- Step #3: Confirmation of computational model
- Step #4: Release of computational procedures
Step #1: Data quality control
Omics datasets are composed of up to millions of measurements, so data quality control, which takes place computationally, is crucial. Background levels of expression are removed and the reproducibility of measurements from run to run is assessed. Factors such as run date or machine operator can also have a large effect on omics measurements, and so are examined carefully.
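A toy version of this filtering might drop features at or below background and discard features whose run-to-run variation is too high. The thresholds, feature names and measurements below are invented for illustration:

```python
def quality_control(runs, background=1.0, max_cv=0.2):
    """Keep features that rise above background in every run and are
    reproducible from run to run (coefficient of variation <= max_cv).

    runs: list of {feature: measurement} dicts, one per replicate run.
    """
    kept = {}
    for feature in runs[0]:
        values = [run[feature] for run in runs]
        mean = sum(values) / len(values)
        if mean <= background:
            continue  # at or below background expression: drop
        sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        if sd / mean <= max_cv:  # reproducible across runs: keep
            kept[feature] = mean
    return kept

# Hypothetical replicate runs measuring three features:
runs = [
    {"geneA": 10.0, "geneB": 0.5, "geneC": 10.0},
    {"geneA": 11.0, "geneB": 0.4, "geneC": 2.0},
]
print(sorted(quality_control(runs)))  # geneB is background, geneC irreproducible
```

Batch covariates such as run date or operator are handled separately in practice, typically by including them in the statistical model rather than by simple filtering.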
Step #2: Computational model development
A potential omics-based test, associated with the phenotype of interest, is developed using the high-quality data. Examples of a phenotype of interest could be preclinical responsiveness to a novel therapy or a clinical outcome of a drug. Many different statistical tools can be used to perform computational model development, which is essentially a process to find the most effective computational model. Typically, a researcher will develop a model using two distinct datasets, referred to as a ‘training set’ and a ‘test set’, each composed of independent samples that have been collected by different investigators. Candidate models, built with various tuning parameters and normalization techniques, are first fit on the training set; their performance is then evaluated on the test set.
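This train/test discipline can be sketched with a deliberately simple ‘model’: a single decision threshold on one biomarker, tuned on the training set and then scored on the held-out test set. All marker values and labels below are invented:

```python
def fit_threshold(train):
    """Pick the decision threshold on a single marker that best separates
    the two classes in the training set (a stand-in for model development)."""
    best_t, best_acc = None, -1.0
    for t in sorted(value for value, _ in train):
        acc = sum((value >= t) == label for value, label in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def accuracy(threshold, data):
    """Fraction of samples the locked threshold classifies correctly."""
    return sum((value >= threshold) == label for value, label in data) / len(data)

# Hypothetical (marker level, diseased?) pairs from two independent cohorts:
train = [(1.0, False), (2.0, False), (3.0, True), (4.0, True)]
test = [(1.5, False), (3.5, True)]

t = fit_threshold(train)   # fit on the training set only...
print(accuracy(t, test))   # ...then evaluate on the held-out test set
```

The key point the sketch preserves is that the test set plays no role in choosing the tuning parameter, so its accuracy is an honest estimate of performance on new samples.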
Step #3: Confirmation of computational model
Once the best performing tuning parameters and normalization techniques have been identified, they are ‘locked down’. Then, an error estimation approach, such as cross-validation, is applied to the computational model using an independent set of samples that were not used in computational model development. The new clinical dataset must still be relevant to the intended use of the omics-based model being created. This is to avoid overfitting of the data.
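Fold-wise error estimation can be sketched as follows: a locked-down rule (here a hypothetical threshold from an earlier development phase) is scored across k disjoint folds of an invented independent validation set. Note that in full cross-validation the model would also be re-fit on each fold's training portion; this sketch only shows the fold-wise evaluation:

```python
def k_fold_accuracy(data, predict, k=5):
    """Average a model's accuracy over k disjoint folds of a sample set."""
    folds = [data[i::k] for i in range(k)]  # simple round-robin split
    accs = []
    for fold in folds:
        correct = sum(predict(value) == label for value, label in fold)
        accs.append(correct / len(fold))
    return sum(accs) / len(accs)

def locked_model(marker):
    """Hypothetical rule locked down during model development."""
    return marker >= 3.0

# Invented independent samples never used during development:
validation = [(1.0, False), (2.5, False), (3.2, True), (4.1, True), (2.9, False),
              (3.5, True), (1.8, False), (4.4, True), (2.2, False), (3.9, True)]
print(k_fold_accuracy(validation, locked_model, k=5))
```

Consistent accuracy across folds suggests the model generalises rather than having overfit the development data.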
Step #4: Release of computational procedures
After the locked-down computational methods have been shown to perform well on an independent dataset, the model is ready to proceed to the test validation phase, where its clinical usefulness is assessed. Data and metadata that were used to develop the computational model should be made available in an independently managed database, along with the relevant procedures. Essentially, all aspects of the analysis need to be transparently reported. This is important for the independent verification of results. Releasing information publicly in omics studies is particularly important due to the complex nature of the data, which makes replicating results extremely difficult. Rigorous verification of the results by the scientific community is necessary to ensure that omics-based tests are statistically valid.
Approaches to multi-omics studies
Multi-omics can be modelled as networks, whereby the information that flows between different layers is used to gain insight about the interactions between different omics. Graphs can be used to depict these interconnected networks. For studying diseases, co-expression networks can be constructed based on the differences in gene expression variation that occur separately in the control and affected individuals. Comparing the network architecture between control and diseased groups can enable the identification of gene expression nodes most correlated with disease.
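A co-expression network can be sketched by connecting genes whose expression profiles correlate strongly across samples; building one network from control samples and one from diseased samples then allows their architectures to be compared. A minimal version with invented expression values:

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(expression, threshold=0.9):
    """Connect two genes when their profiles across samples are strongly
    correlated (|r| >= threshold); the edge set defines the network graph."""
    edges = set()
    for g1, g2 in combinations(expression, 2):
        if abs(pearson(expression[g1], expression[g2])) >= threshold:
            edges.add(frozenset((g1, g2)))
    return edges

# Hypothetical expression of three genes across four control samples:
control = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 7.8],   # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0],   # unrelated profile
}
print(len(coexpression_edges(control)))
```

Running the same construction on a diseased cohort and comparing which edges appear or vanish highlights the gene expression nodes most correlated with disease, as described above.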
Furthermore, genetic information can then be integrated into the network modelling, highlighting key pathways that contribute to disease and identifying core drivers of biological processes. Multi-omics approaches can be categorized based on the initial focus of the investigation.
The genome first approach
Genetic variants provide powerful insight into complex diseases and a starting point for modelling interactions between other omics layers. The ‘genome first’ approach determines the mechanisms by which GWAS loci contribute to disease. GWAS results alone may be useful for risk prediction, but they cannot directly implicate a particular gene or pathway, or infer a therapeutic target. This is because identifying the variants that affect gene expression is complicated: a variety of contributing elements are responsible, often located at different places within or outside of the gene.
Nevertheless, omics applications can help to characterize specific GWAS loci. Once the causal variants or genes have been established, other omics layers can be used to identify the downstream interactions. Transcriptomics has become increasingly useful for pinpointing causal genes at GWAS loci, employing either expression arrays or RNA sequencing. Moreover, statistical methods based on expression quantitative trait loci (eQTL) at GWAS loci have been developed and have helped to make large datasets available for a number of human and animal tissue models. Proteomic techniques can be used to identify interacting pathways contributing to disease, and metabolomics can help bridge genotype to phenotype.
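The core of an eQTL test is a regression of a gene's expression level on genotype dosage (0, 1 or 2 copies of the alternative allele): a clearly non-zero slope suggests the variant regulates that gene. A minimal least-squares sketch with invented samples:

```python
def eqtl_slope(dosages, expression):
    """Least-squares slope of expression on genotype dosage (0/1/2 alt alleles).

    A non-zero slope suggests the variant influences the gene's expression.
    """
    n = len(dosages)
    mx = sum(dosages) / n
    my = sum(expression) / n
    num = sum((d - mx) * (e - my) for d, e in zip(dosages, expression))
    den = sum((d - mx) ** 2 for d in dosages)
    return num / den

# Hypothetical samples: expression rises with each copy of the alt allele.
dosages = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 7.1, 6.9, 9.0, 9.2]
print(round(eqtl_slope(dosages, expression), 2))
```

Real eQTL analyses additionally adjust for covariates such as ancestry and cell-type composition, and assess significance across many variant–gene pairs.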
The phenotype first approach
The ‘phenotype first’ approach investigates the pathways that contribute to a disease without focusing the investigation on a specific locus. This means that correlations between the disease and omics data are tested. Subsequently, the associations are fitted into a logical framework to provide insight into the role that the factors play in disease development.
For example, transcriptomic and epigenomic data have been used to show that genomic and environmental contributions of Alzheimer’s act through different cell types. This phenotype first multi-omics study provided evidence for the fact that the genetic predisposition to Alzheimer’s acts mostly through the dysregulation of immune functions, and that epigenetic changes in the neuronal cells are mostly environmentally driven.
The environment first approach
The ‘environment first’ approach examines the environment in terms of how it alters pathways or interacts with genetic variation. Multi-omics analysis is used to investigate the links to disease using an environmental factor as the variable, such as diet or lifestyle changes. This can be extremely difficult to do in humans, so animal models are often necessary.
These types of study designs can be used to understand the interactions between genetics and the environment. For example, the effects of a high-fat, high-sucrose diet were studied in around 100 different inbred strains of mice. Usually, the gut microbiome introduces additional complexity, but multi-omics studies can be used to reveal the impact of gut microbiota on host responses to dietary changes. In this case, the pathways and genes contributing to diet-induced obesity or diabetes could be identified.
A diagram depicting the different layers and types of omics data. Each circle represents an entire pool of molecules (an ‘ome’). Thin black arrows represent the potential interactions and correlations detected between different omes. Thicker arrows indicate the different potential starting points for building conceptual frameworks that combine multiple omics data. The genome first approach begins from a disease-associated locus, whereas the phenotype first approach can take any of the omics layers as the starting point. The environment first approach, which is not explicitly shown, examines environmental perturbations first. Image credit: Y. Hasin et al, 2017
Challenges in multi-omics data
Multi-omics studies rely on extremely large numbers of comparisons and tailored statistical analyses, requiring considerable investments of time, skilled personnel and money. Therefore, careful planning and strategic execution are crucial when conducting an omics study.
Nature of the disease
A simple disease can be caused by a single gene mutation, although the severity is often affected by modifier genes and environmental factors. Therefore, concentrated omics investigations at specific time points, focusing on the immediate molecular changes, can help to understand the mechanisms behind these diseases. More complex diseases, however, are usually not centered on one specific factor: combinations of a variety of factors can lead to highly similar diseased states.
Results from a single layer of omics data are usually associative and so may not be able to be interpreted as causative effects. Complex diseases that develop over time are also very likely to involve both environmental and genetic factors, meaning that several omics data at multiple time points, collected from many disease-relevant tissues, will need to be investigated to determine a clear factor that induces the disease. This can be challenging, and sometimes even impossible.
Insights gained about diseases from omics approaches are mostly comparative. This means that the alterations found between healthy and diseased groups are often assumed to be directly related to the condition. However, in reality, complex phenotypes are a result of many confounding factors, such as population structure, cell type composition and other unknown attributes. The ‘reductionist’ approach can be used to eliminate the differences within the human population by matching groups of patients as closely as possible. However, this method is limited because there are always unknown variable factors that cannot be included, such as environment or certain lifestyle choices. Also, this contrasts with the typical multi-omics study design, which consists of gathering huge amounts of data from large and varied samples, so that many sources of variability can be incorporated into statistical models.
Approaches that separate causal changes from correlative changes can identify the variation associated with a disease trait. As genomic changes are assumed to cause disease, once a correlation has been found, GWAS loci can be used to identify the precise variant. Apart from in the genomics layer, separating causality from correlation remains a tricky task.
Omics datasets are often large and complex. Insufficient attention to data analysis requirements, before and during data collection, is a potential weakness of multi-omics approaches. This is particularly true when datasets need specific tailoring of statistical methods. Therefore, it is important to set out data requirements, which incorporate the main goal of the study design, before collecting the data. Once large omics datasets have been collected, it is possible to reanalyze them with multiple approaches, repeatedly. This makes the development of statistical methods even more crucial for the omics field, because more information can be extracted from existing data types.
Furthermore, the use of publicly available data requires a standardized and easily communicable terminology regarding data collection and analysis. This enables effective integration across studies and facilitates new discoveries by shortening the time taken from data generation to publication, and subsequent translation into clinics. Unsurprisingly, multi-omics approaches are particularly vulnerable to technical problems, such as changes in data ID numbers and the lack of standard protocols.
Tracking health with multi-omics approaches has the potential to highlight disease prior to its development and indicate that lifestyle changes could be beneficial for prevention. Moreover, applications of omics technologies in clinical settings can be used in personalized medicine. However, such approaches require huge quantities of data and technical expertise. For example, large-scale multi-omics data generation, analysis methodology development and functional follow-ups are all necessary, especially if the procedures need to be adapted for specific diseases and repeated before being integrated. Realistically, such undertakings require coordinated efforts from multiple groups, each providing their own expertise and resources.
As omics approaches provide biological insights based on statistical inference, the power to detect associations or causative effects strongly depends on large datasets. This means that sample size is often the controlling variable for many multi-omics investigations, particularly in human studies, because humans are affected by several uncontrollable confounding factors, such as diet and lifestyle choices. This narrows the available sample size and limits the ability of omics approaches to produce meaningful results when studying human disease.
Animal models of disease
Animal models are studied to provide important omics-based insight into disease. Although human studies usually have greater translational potential, they suffer from several limitations that animal studies can address: animal models offer reproducibility, control of environmental factors, accessibility of relevant tissues and the ability to follow up on hypotheses. However, animal models do not always replicate the human biology of complex disease exactly, so it may be beneficial to compare omics data between human and animal models to validate the biological relevance of the findings.
Methods for omics data acquisition develop at a high pace, so the associated technology can quickly become outdated. Although up-to-date approaches would likely provide advantages, such as better coverage, it may also be beneficial to stick with established technologies for consistency. Moreover, commercial platforms are sometimes abruptly discontinued, often making it impossible to replicate experiments on alternative instruments. Therefore, deciding which technologies to use for multi-omics investigations can prove challenging.
For more in-depth information about the challenges facing multi-omics studies, check out Multi-omics: An Integrative Approach to Biomedical Research. The report demonstrates the rationale for integrating multiple omics layers into investigations and highlights the success of recent research in the area. Download it for free here:
Prospects of multi-omics data
Many emerging omics technologies are likely to influence the development of future research, such as the continued advances in bioinformatics and computational approaches. As the ways in which data is collected and manipulated continue to improve, the enhanced analysis of multi-omics data will persist and allow for the integration of a greater variety of data types in the coming years. In fact, the advent of omni-omics, the study of all molecular layers, could occur in the not-too-distant future.
Nevertheless, new technologies are likely to add to the challenge of extremely high data complexity and will make fitting appropriate computational models even more difficult. Following proper data guidelines and conducting large meta-analyses of sequencing datasets collected at multiple sites will certainly prove valuable when developing clinically useful omics-based computational models.
Resources for multi-omics analysis
- IMAS: Multi-omics data for evaluating alternative splicing.
- bioCancer: Visualization of multi-omics cancer data.
- omicade4: Analysis of multi-omics data sets.
- mixOmics: Multivariate methods for data integration.
- PaintOmics: Web resource for visualizing multi-omics data sets.
- iOmicsPASS: Multi-omics-based phenotype prediction.
- Multi-Omics Profiling Expression Database (MOPED): Integrates diverse animal models.
- LinkedOmics: Connects data from cancer datasets.
- Ecomics: Multi-omics database for E. coli data.
- ProteomicsDB: Multi-omics resource for life science research.
Don’t forget to follow Front Line Genomics for more information about how genomics is being used to benefit patients.
Image credit: Genetic Engineering and Biotechnology News