Mobile Menu

How-to: Infer Gene Regulatory Networks from Omics Data

This feature was put together using content from Chapter 5 of the Multi-omics Playbook, released in February 2024. For an in depth look at gene regulatory networks and epigenomic/transcriptomic multi-omics, along with much more, please download the playbook

Gene regulation can be seen as the control machinery inside cells. If DNA is the recipe book, then gene regulation is the chef. Which genes are active, which proteins are expressed and, ultimately, the phenotype of the individual, comes down to gene regulatory processes. This feature will introduce you to methods and talk you through how to model gene regulatory networks (GRNs) from bulk and single-cell omics data using the popular SCENIC1 tools

Intro to GRNs

Gene expression is tightly regulated by a complex interplay of interactions with other genes and signalling molecules. Specific proteins known as transcription factors (TFs) can regulate the expression of genes in these networks by binding to DNA regions, having repressive or positive effects on transcription rates.

GRNs are mathematical representations of how the gene regulators interact. These representations are typically in  graphical format, in which each gene is a node, and each node is connected to other genes via edges. If known transcription factors are connected to other genes, they will likely influence their expression.

These models can be used to understand cell fate by mapping the regulatory programmes that trigger cells to shift to another cell type or cell state. This can be used in a developmental research setting or in a disease setting, such as understanding how to shift a ‘diseased’ cell back to a healthy gene regulatory state.

GRNs from transcriptomics

The field of gene regulatory inference is around two decades old, and the technique has been performed in the micro-array, NGS, bulk and single-cell eras and, most recently, in the multi-omics era (Figure 1). Bulk and single-cell RNA sequencing data alone allows for the inference of gene regulation in principle, since the RNA expression of TFs can inform you of their functionality1.

Figure 1. The transition of GRN inference from bulk transcriptomics to single-cell matched multi-omics measurement. Image Credit: Kim, et al. 2

Some of the earliest computational tools to perform GRN modelling from transcriptomic data alone include GENIE33 and SCENIC1, produced by the Stein Aerts lab. The latter is largely regarded as the go-to tool for this purpose.

For an excellent ‘how-to’ guide for using SCENIC, please refer to this single-cell best practices guide or the ‘cheat-sheet’ style command line below, which is available (along with detailed instructions) from the Stein Aerts lab GitHub.

### Load data

loomPath <- system.file(package="SCENIC", "examples/mouseBrain_toy.loom")

library(SCopeLoomR)

loom <- open_loom(loomPath)

exprMat <- get_dgem(loom)

cellInfo <- get_cell_annotation(loom)

close_loom(loom)

### Initialize settings

library(SCENIC)

scenicOptions <- initializeScenic(org="mgi", dbDir="cisTarget_databases", nCores=10)

# scenicOptions@inputDatasetInfo$cellInfo <- "int/cellInfo.Rds"

saveRDS(scenicOptions, file="int/scenicOptions.Rds")

### Co-expression network

genesKept <- geneFiltering(exprMat, scenicOptions)

exprMat_filtered <- exprMat[genesKept, ]

runCorrelation(exprMat_filtered, scenicOptions)

exprMat_filtered_log <- log2(exprMat_filtered+1)

runGenie3(exprMat_filtered_log, scenicOptions)

### Build and score the GRN

exprMat_log <- log2(exprMat+1)

scenicOptions@settings$dbs <- scenicOptions@settings$dbs["10kb"] # Toy run settings

scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)

scenicOptions <- runSCENIC_2_createRegulons(scenicOptions, coexMethod=c("top5perTarget")) # Toy run settings

scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log)

# Optional: Binarize activity

# aucellApp <- plotTsne_AUCellApp(scenicOptions, exprMat_log)

# savedSelections <- shiny::runApp(aucellApp)

# newThresholds <- savedSelections$thresholds

# scenicOptions@fileNames$int["aucell_thresholds",1] <- "int/newThresholds.Rds"

# saveRDS(newThresholds, file=getIntName(scenicOptions, "aucell_thresholds"))

scenicOptions <- runSCENIC_4_aucell_binarize(scenicOptions)

tsneAUC(scenicOptions, aucType="AUC") # choose settings

# Export:

# saveRDS(cellInfo, file=getDatasetInfo(scenicOptions, "cellInfo")) # Temporary, to add to loom

export2loom(scenicOptions, exprMat)

# To save the current status, or any changes in settings, save the object again:

saveRDS(scenicOptions, file="int/scenicOptions.Rds")

### Exploring output

# Check files in folder 'output'

# Browse the output .loom file @ http://scope.aertslab.org

# output/Step2_MotifEnrichment_preview.html in detail/subset:

motifEnrichment_selfMotifs_wGenes <- loadInt(scenicOptions, "motifEnrichment_selfMotifs_wGenes")

tableSubset <- motifEnrichment_selfMotifs_wGenes[highlightedTFs=="Sox8"]

viewMotifs(tableSubset)

# output/Step2_regulonTargetsInfo.tsv in detail:

regulonTargetsInfo <- loadInt(scenicOptions, "regulonTargetsInfo")

tableSubset <- regulonTargetsInfo[TF=="Stat6" & highConfAnnot==TRUE]

viewMotifs(tableSubset)

# Cell-type specific regulators (RSS):

regulonAUC <- loadInt(scenicOptions, "aucell_regulonAUC")

rss <- calcRSS(AUC=getAUC(regulonAUC), cellAnnotation=cellInfo[colnames(regulonAUC), "CellType"], )

rssPlot <- plotRSS(rss)

plotly::ggplotly(rssPlot$plot)

GRNs from multi-omics

Generally, regulatory processes are too complex to reliably model with transcriptomic data alone and multi-omics data is preferred.

Transcriptomic and epigenomic data, specifically chromatin accessibility measurements through ATAC-seq4, ChIP-seq5 and CUT&Tag6, can provide more robust information about the accessibility of TF binding sites and add important information to the networks drawn from transcriptomics data. While ChIP-seq and CUT&Tag would be the preferred methods, profiling TF binding in this way is costly and limited to TFs with available antibodies. Instead, it is ATAC-seq (a measure of chromatic accessibility) that allows one to infer TF binding site availability and is most commonly used in GRN inference.

The essential process for modelling GRNs is shown in Figure 2. RNA-seq data alone is pre-processed and normalised before gene expression is predicted as a consequence of TF expression, which can be modelled using a variety of methods (e.g. regression, correlation, Boolean).

Multi-omics data needs to be integrated first (see our feature on this topic). From here, chromatin-accessibility data allows one to map the accessibility of specific genomic regions – cis-regulatory elements (CREs) – which TFs can bind to modulate gene expression. This combined information paints a much more accurate picture of how TFs can regulate genes and build a GRN.

Figure 2. Gene Regulatory Network Inference Methodology. Through single-cell RNA, DNA and epigenomic information, gene regulatory networks can be predicted by integrating the data and deploying various models to create the network. Image Credit: Hu, et al. 7

There are unique challenges that come from working with ATAC-seq and RNA-seq data simultaneously, and modelling this sparsity in the data is key to successful computational inferences. We will now take a look at some of the current popular computational tools to leverage transcriptomic and epigenomic data.

GRN multi-omics tools

There is a wide variety of GRN inference tools that utilise single-cell and bulk multi-omics data. Below, you will find a table of computational methods for GRN analysis using multi-omics data.

Table 1. List of multi-omics GRN inference computational methods. Methods are organised by the various differences in the methods process.Table is adapted from: Badia-i-Mompel, et al. 8
ToolsPossible inputsType of multimodal dataType of modellingType of interactionsStatistical frameworkRefs.
ANANSEGroups, contrastsUnpairedLinearWeightedFrequentist9
CellOracleGroups, trajectoriesUnpairedLinearSigned, weightedFrequentist or Bayesian10
DC3GroupsUnpairedLinearBinaryFrequentist11
DeepMAPSGroupsPaired or integratedLinearWeightedFrequentist12
DictysGroups, trajectoriesUnpaired/paired or integratedLinearSigned, weightedFrequentist13
DIRECT-NETGroupsPaired or integratedNon-linearBinaryFrequentist14
FigRGroupsPaired or integratedLinearSigned, weightedFrequentist15
GLUEGroupsPaired or integratedNon-linearWeightedFrequentist16
GRaNIEGroupsPaired or integratedLinearWeightedFrequentist17
Inferelator 3.0GroupsUnpairedLinear or non-linearWeightedFrequentist or Bayesian18
IReNATrajectoriesUnpairedLinearSigned, weightedFrequentist19
MAGICALGroups, contrastsUnpairedNon-linearWeightedBayesian20
MICAGroupsUnpairedNon-linearSigned, weightedFrequentist21
PandoGroupsPaired or integratedLinear or non-linearSigned, weightedFrequentist or Bayesian22
PECAGroupsPaired or integratedLinearWeightedBayesian23
Regulatory MotifsGroupsPaired or integratedLinearSignedFrequentist24
RENINGroupsPaired or integratedLinearSigned, weightedFrequentist25
scAIGroupsPaired or integratedLinearWeightedFrequentist26
sc-compRegGroups, contrastsUnpairedLinearBinaryFrequentist27
SCENIC+Groups, contrasts, trajectoriesPaired or integratedLinearSigned, weightedFrequentist28
scMEGATrajectoriesPaired or integratedLinearWeightedFrequentist29
scMTNIGroups, trajectoriesUnpairedLinear or non-linearWeightedBayesian30
SOMaticGroupsUnpairedLinearBinaryFrequentist31
SymphonyGroupsUnpairedLinearSigned, weightedBayesian32
TimeRegGroups, trajectoriesPaired or integratedLinearBinaryFrequentist33
TRIPODGroupsPaired or integratedNon-linearSigned, weightedFrequentist or Bayesian34

As you can see from Table 1, there are many tools for gene network inference. Next, we will cover three interesting and widely-used computational tools from this list, starting with GRaNIE17.

GRaNIE & GRaNPA & enhancer based networks

Enhancers are genomic locations the play an important role in cell-type-specific gene regulation. These enhancers are regulated by TFs and epigenetic mechanisms. GRaNIE17 is a tool to specifically build enhancer-based GRNs using multi-omics data. Accompanied by GRaNPA, which can assess the biological relevance of the generated GRNs, this is a unique tool suite for GRN inference.

Both tools were built by the lab of Judith Zaugg. For in depth vignettes and workflows of GRaNIE and GRaNPA, please visit the Zaugg lab website.

CellOracle

Our second tool introduces a computational perturbation algorithm to predict cellular changes or cell fate upon TF KO. CellOracle10 from the lab of Professor Sam Morris is less concerned with the gene networks than it is about predicting the shifts in GRNs and cell identity that occur when specific developmental regulators are altered. It is all about predicting changes using an in-silico perturbation approach.

Again, full tutorials for CellOracle can be found on the Morris Lab Github.

SCENIC+

Finally, following on from SCENIC, we have the new tool from the Aerts lab, SCENIC+28, which expands the original SCENIC methodology into the multi-omics space. It works with paired and unpaired single-cell multi-omics data and has the most comprehensive meta-database of TF binding DNA motifs for humans, mice and flies. Benchmarking within the SCENIC+ study showed its superior performance compared to other tools for predicting TF binding sites and target genes. Figure 3 highlights the SCENIC+ workflow.

Figure 3. SCENIC+ workflow. SCENIC+ infers GRNs using pycisTopic preprocessing followed by using the Motif collection database comprised of ~35,000 unique motifs. Image Credit: Bravo González-Blas, et al. 28

Below we present a basic SCENIC+ workflow following preprocessing of the scRNA-seq and scATAC-seq data generated via 10x Multiome. For the full pipeline that this is adapted from, alongside more vignettes and further in-depth support on using SCENIC+, please go to the specific SCENIC+ webpage.

SCENIC+ Workflow

First you will need to load:

  1. the AnnData object containing the scRNA-seq side of the analysis.
  2. the cisTopic object containing the scATAC-seq side of the analysis.
  3. the motif enrichment dictionary containing the motif enrichment results.
import dill

import scanpy as sc

import os

import warnings

warnings.filterwarnings("ignore")

import pandas

import pyranges

# Set stderr to null to avoid strange messages from ray

import sys

_stderr = sys.stderr

null = open(os.devnull,'wb')

work_dir = 'pbmc_tutorial'

tmp_dir = '/scratch/leuven/330/vsc33053/'

adata = sc.read_h5ad(os.path.join(work_dir, 'scRNA/adata.h5ad'))

cistopic_obj = dill.load(open(os.path.join(work_dir, 'scATAC/cistopic_obj.pkl'), 'rb'))

menr = dill.load(open(os.path.join(work_dir, 'motifs/menr.pkl'), 'rb'))

Next, create the SCENIC+ object. It will store both the gene expression and chromatin accessibility along with motif enrichment results and cell/region/gene metadata.

from scenicplus.scenicplus_class import create_SCENICPLUS_object

import numpy as np

scplus_obj = create_SCENICPLUS_object(

    GEX_anndata = adata.raw.to_adata(),

    cisTopic_obj = cistopic_obj,

    menr = menr,

    bc_transform_func = lambda x: f'{x}-10x_pbmc' #function to convert scATAC-seq barcodes to scRNA-seq ones

)

scplus_obj.X_EXP = np.array(scplus_obj.X_EXP.todense())

scplus_obj

Before running SCENIC+ it is important to check which BioMart host the gene names used in your analysis match best. BioMart will be used to find transcription starting sites of each gene. The names of genes (symbols) change quite often, so it is important to select the BioMart host with the largest overlap, otherwise a lot of genes can potentially be lost.

Below we show an example on how to select the optimal host.

ensembl_version_dict = {'105': 'http://www.ensembl.org',

                        '104': 'http://may2021.archive.ensembl.org/',

                        '103': 'http://feb2021.archive.ensembl.org/',

                        '102': 'http://nov2020.archive.ensembl.org/',

                        '101': 'http://aug2020.archive.ensembl.org/',

                        '100': 'http://apr2020.archive.ensembl.org/',

                        '99': 'http://jan2020.archive.ensembl.org/',

                        '98': 'http://sep2019.archive.ensembl.org/',

                        '97': 'http://jul2019.archive.ensembl.org/',

                        '96': 'http://apr2019.archive.ensembl.org/',

                        '95': 'http://jan2019.archive.ensembl.org/',

                        '94': 'http://oct2018.archive.ensembl.org/',

                        '93': 'http://jul2018.archive.ensembl.org/',

                        '92': 'http://apr2018.archive.ensembl.org/',

                        '91': 'http://dec2017.archive.ensembl.org/',

                        '90': 'http://aug2017.archive.ensembl.org/',

                        '89': 'http://may2017.archive.ensembl.org/',

                        '88': 'http://mar2017.archive.ensembl.org/',

                        '87': 'http://dec2016.archive.ensembl.org/',

                        '86': 'http://oct2016.archive.ensembl.org/',

                        '80': 'http://may2015.archive.ensembl.org/',

                        '77': 'http://oct2014.archive.ensembl.org/',

                        '75': 'http://feb2014.archive.ensembl.org/',

                        '54': 'http://may2009.archive.ensembl.org/'}

import pybiomart as pbm

def test_ensembl_host(scplus_obj, host, species):

    dataset = pbm.Dataset(name=species+'_gene_ensembl',  host=host)

    annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])

    annot.columns = ['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']

    annot['Chromosome'] = annot['Chromosome'].astype('str')

    filter = annot['Chromosome'].str.contains('CHR|GL|JH|MT')

    annot = annot[~filter]

    annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']

    gene_names_release = set(annot['Gene'].tolist())

    ov=len([x for x in scplus_obj.gene_names if x in gene_names_release])

    print('Genes recovered: ' + str(ov) + ' out of ' + str(len(scplus_obj.gene_names)))

    return ov

n_overlap = {}

for version in ensembl_version_dict.keys():

    print(f'host: {version}')

    try:

        n_overlap[version] =  test_ensembl_host(scplus_obj, ensembl_version_dict[version], 'hsapiens')

    except:

        print('Host not reachable')

v = sorted(n_overlap.items(), key=lambda item: item[1], reverse=True)[0][0]

print(f"version: {v} has the largest overlap, use {ensembl_version_dict[v]} as biomart host"

biomart_host = "http://sep2019.archive.ensembl.org/"

Before running, it is important to download a list of known human TFs from the human transcription factors database.

!wget -O pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt  http://humantfs.ccbr.utoronto.ca/download/v_1.01/TF_names_v_1.01.txt

We will also download the program bedToBigBed. This will be used to generate files that can be uploaded to the UCSC genome browser

!wget -O pbmc_tutorial/bedToBigBed http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed

!chmod +x pbmc_tutorial/bedToBigBed

Now you are ready to run the analysis.

from scenicplus.wrappers.run_scenicplus import run_scenicplus

try:

    run_scenicplus(

        scplus_obj = scplus_obj,

        variable = ['GEX_celltype'],

        species = 'hsapiens',

        assembly = 'hg38',

        tf_file = 'pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt',

        save_path = os.path.join(work_dir, 'scenicplus'),

        biomart_host = biomart_host,

        upstream = [1000, 150000],

        downstream = [1000, 150000],

        calculate_TF_eGRN_correlation = True,

        calculate_DEGs_DARs = True,

        export_to_loom_file = True,

        export_to_UCSC_file = True,

        path_bedToBigBed = 'pbmc_tutorial',

        n_cpu = 12,

        _temp_dir = os.path.join(tmp_dir, 'ray_spill'))

except Exception as e:

    #in case of failure, still save the object

    dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)

    raise(e)

After running the SCENIC+ analysis, the scplus_obj will be populated with the results of several analyses. Please follow this link to find out how to perform the downstream analysis.


References

1.           Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nature Methods 14, 1083-1086 (2017).

2.           Kim, D. et al. Gene regulatory network reconstruction: harnessing the power of single-cell multi-omic data. npj Systems Biology and Applications 9, 51 (2023).

3.           Huynh-Thu, V.A., Irrthum, A., Wehenkel, L. & Geurts, P. Inferring regulatory networks from expression data using tree-based methods. PloS one 5, e12776 (2010).

4.           Buenrostro, J.D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486-490 (2015).

5.           Rotem, A. et al. Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state. Nature biotechnology 33, 1165-1172 (2015).

6.           Bartosovic, M., Kabbe, M. & Castelo-Branco, G. Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues. Nature biotechnology 39, 825-835 (2021).

7.           Hu, X., Hu, Y., Wu, F., Leung, R.W.T. & Qin, J. Integration of single-cell multi-omics for gene regulatory network inference. Computational and Structural Biotechnology Journal 18, 1925-1938 (2020).

8.           Badia-i-Mompel, P. et al. Gene regulatory network inference in the era of single-cell multi-omics. Nature Reviews Genetics 24, 739-754 (2023).

9.           Xu, Q. et al. ANANSE: an enhancer network-based computational approach for predicting key transcription factors in cell fate determination. Nucleic acids research 49, 7966-7985 (2021).

10.         Kamimoto, K. et al. Dissecting cell identity via network inference and in silico gene perturbation. Nature 614, 742-751 (2023).

11.          Zeng, W. et al. DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data. Nature communications 10, 4613 (2019).

12.         Ma, A. et al. Single-cell biological network inference using a heterogeneous graph transformer. Nature Communications 14, 964 (2023).

13.         Wang, L. et al. Dictys: dynamic gene regulatory network dissects developmental continuum with single-cell multiomics. Nature Methods 20, 1368-1378 (2023).

14.         Zhang, L., Zhang, J. & Nie, Q. DIRECT-NET: An efficient method to discover cis-regulatory elements and construct regulatory networks from single-cell multiomics data. Science Advances 8, eabl7393 (2022).

15.         Kartha, V.K. et al. Functional inference of gene regulation using single-cell multi-omics. Cell Genom 2(2022).

16.         Cao, Z.-J. & Gao, G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nature Biotechnology 40, 1458-1466 (2022).

17.         Kamal, A. et al. GRaNIE and GRaNPA: inference and evaluation of enhancer-mediated gene regulatory networks. Molecular systems biology 19, e11627 (2023).

18.         Skok Gibbs, C. et al. High-performance single-cell gene regulatory network inference at scale: the Inferelator 3.0. Bioinformatics 38, 2519-2528 (2022).

19.         Jiang, J. et al. IReNA: integrated regulatory network analysis of single-cell transcriptomes and chromatin accessibility profiles. Iscience 25(2022).

20.         Chen, X. et al. Mapping disease regulatory circuits at cell-type resolution from single-cell multiomics data. Nature Computational Science 3, 644-657 (2023).

21.         Alanis-Lobato, G. et al. MICA: a multi-omics method to predict gene regulatory networks in early human embryos. Life Science Alliance 7(2024).

22.         Fleck, J.S. et al. Inferring and perturbing cell fate regulomes in human brain organoids. Nature, 1-8 (2022).

23.         Duren, Z., Chen, X., Jiang, R., Wang, Y. & Wong, W.H. Modeling gene regulation from paired expression and chromatin accessibility data. Proceedings of the National Academy of Sciences 114, E4914-E4923 (2017).

24.         Zenere, A., Rundquist, O., Gustafsson, M. & Altafini, C. Using high-throughput multi-omics data to investigate structural balance in elementary gene regulatory network motifs. Bioinformatics 38, 173-178 (2022).

25.         Ledru, N. et al. Predicting regulators of epithelial cell state through regularized regression analysis of single cell multiomic sequencing. bioRxiv, 2022.12.29.522232 (2022).

26.         Jin, S., Zhang, L. & Nie, Q. scAI: an unsupervised approach for the integrative analysis of parallel single-cell transcriptomic and epigenomic profiles. Genome biology 21, 1-19 (2020).

27.         Duren, Z. et al. Sc-compReg enables the comparison of gene regulatory networks between conditions using single-cell data. Nature Communications 12, 4763 (2021).

28.         Bravo González-Blas, C. et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nature Methods 20, 1355-1367 (2023).

29.         Li, Z., Nagai, J.S., Kuppe, C., Kramann, R. & Costa, I.G. scMEGA: single-cell multi-omic enhancer-based gene regulatory network inference. Bioinformatics Advances 3, vbad003 (2023).

30.         Zhang, S. et al. Inference of cell type-specific gene regulatory networks on cell lineages from single cell omic datasets. Nature Communications 14, 3064 (2023).

31.         Jansen, C. et al. Building gene regulatory networks from scATAC-seq and scRNA-seq using linked self organizing maps. PLoS computational biology 15, e1006555 (2019).

32.         Bachireddy, P. et al. Mapping the evolution of T cell states during response and resistance to adoptive cellular therapy. Cell reports 37(2021).

33.         Duren, Z., Chen, X., Xin, J., Wang, Y. & Wong, W.H. Time course regulatory analysis based on paired expression and chromatin accessibility data. Genome research 30, 622-634 (2020).

34.         Jiang, Y. et al. Nonparametric single-cell multiomic characterization of trio relationships between transcription factors, target genes, and cis-regulatory regions. Cell Systems 13, 737-751. e4 (2022).