Biologically-validated AI is how scientists are realising the full potential of single-cell RNA sequencing

The promise of single-cell gene expression data

Genomic data are an excellent source of novel disease biomarkers and targets. In fact, genetically validated targets are twice as likely to achieve FDA approval (King et al. 2019). The next-generation sequencing (NGS) technology that underlies these discoveries is now commonplace in many research labs. Validation of target expression by RNA sequencing (bulk-RNA-seq) is also common once a gene of interest is identified. However, in a mixed population of cells, biomarkers and targets of interest are expressed at varying levels – bulk-RNA-seq is only capable of reporting on the average gene expression. As a result, important differences in gene expression as a function of cell type or location may be missed by researchers.

One increasingly promising approach to understanding target expression in individual cells is single cell RNA sequencing (scRNA-seq).  This technology provides increased resolution into how alterations in individual cell types contribute to disease pathology.  As opposed to bulk gene expression methods, scRNA-seq identifies the molecular signatures of many individual cells – revealing tissue heterogeneity, enabling identification of subpopulations that were not previously detectable, and generating novel insights into complex disease biology.

With the launch of 10x Genomics applications coupled with widely available NGS technology, scRNA-seq has grown in popularity over the last five years.  It’s being used to not only characterize the individual expression profiles of the milieu of cells that make up normal tissue (Han et al., 2020), but also to identify how that balance and mixture of cells, when altered, can lead to disease states.

In the field of oncology, scRNA-seq has enabled better characterization of cell-based therapies, such as chimeric antigen receptor (CAR) T-cells. In a recent publication, Sheih and colleagues used scRNA-seq of peripheral blood mononuclear cells (PBMCs) to gain insights into clonal kinetics of CAR-T cells in patients treated with cell immunotherapy (2020).  These analyses also resolved what types of CAR-T cells have the best therapeutic potential in solid tumors by dissecting the cells of the tumor microenvironment.  

In the field of immunology, scRNA-seq has provided insight into immune cell responses to acute and chronic viral infections (Yao et al., 2019); as well as enabling investigation into how viruses like HSV-1 impact some host cells differently than others (Drayman et al., 2019).  Most recently, in cardiometabolic diseases, scRNA-seq analysis of aortic tissue has identified altered pathways that can lead to disease states including atherosclerosis and aneurysms (Chen et al., 2019; Li et al., 2020; Chen et al., 2020).  Through scRNA-seq experiments, atypical cell populations are identified and some of these populations are being implicated and subsequently validated as drivers of disease initiation and progression. By better elucidating the biological mechanisms underlying disease at the individual cell level, new targets for therapeutic development can be discovered.

Single-cell RNA-seq experiments produce vast quantities of data; however, unlocking accurate and valuable biological insights is difficult without the appropriate tools for analyzing these complex datasets.

Analytical approaches for single cell gene expression

Single cell RNA-seq datasets undergo initial processing steps before analysis. Since samples are always a heterogeneous mixture of individual cells, cell barcodes are used to associate each transcript with the cell it came from. Individual transcripts within a given cell are labeled with a unique molecular identifier (UMI) so that each captured molecule is counted once in expression analysis. To call cells, the UMIs associated with each barcode are counted, and genes and cells are then filtered based on those UMI counts. Any cell barcode without enough UMI counts is considered to be background and removed from further analysis.
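The barcode and UMI bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's pipeline: the function names, the toy reads, and the threshold of three UMIs are all assumptions chosen for the example.

```python
from collections import defaultdict

def count_umis_per_barcode(reads):
    """reads: iterable of (cell_barcode, umi, gene) tuples.

    Returns {barcode: {gene: number of distinct UMIs}}.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for barcode, umi, gene in reads:
        # A repeated (barcode, gene, UMI) triple is the same molecule read twice.
        seen[barcode][gene].add(umi)
    return {bc: {g: len(umis) for g, umis in genes.items()}
            for bc, genes in seen.items()}

def filter_cells(counts, min_umis):
    """Keep barcodes whose total UMI count reaches min_umis (hypothetical cutoff)."""
    return {bc: genes for bc, genes in counts.items()
            if sum(genes.values()) >= min_umis}

# Toy reads (made up for illustration).
reads = [
    ("AAAC", "u1", "Klf4"), ("AAAC", "u1", "Klf4"),   # duplicate read of one molecule
    ("AAAC", "u2", "Klf4"), ("AAAC", "u3", "Acta2"),
    ("TTTG", "u1", "Acta2"),                           # sparse barcode, treated as background
]
counts = count_umis_per_barcode(reads)
cells = filter_cells(counts, min_umis=3)
```

Here the barcode `AAAC` keeps three distinct molecules and passes the cutoff, while `TTTG` has a single UMI and is dropped as background. In practice the cutoff is usually derived from the distribution of UMI counts across barcodes rather than fixed by hand.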

Unlike typical bulk RNA-seq experiments, scRNA-seq is performed on limited quantities of input RNA, assesses thousands of cells per sample, and involves unsupervised cell clustering analysis to identify the populations present. This approach has the advantage that a smaller number of samples can be used, because the thousands of cells from a single sample confer statistical robustness. But limited input RNA means these datasets commonly have high levels of noise and technical dropouts, which occur when a zero value is obtained for a transcript that is actually present in the cell. This poses computational challenges in correctly identifying distinct cell clusters and ensuring the results accurately reflect true biological states rather than technical artifacts. These data processing and analysis concerns led to the development of specific statistical and computational methods for effective analysis. In addition, since the number of expected cell populations is generally unknown, generative AI models, such as variational autoencoders (VAEs), are now used for advanced unsupervised cell clustering.
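To make the dropout problem concrete, the toy simulation below shows how technical loss inflates the number of zeros beyond the true biological zeros. It is only a sketch under stated assumptions: the 40% dropout rate, the cell numbers, and the expression level are invented, and real dropout depends on expression level and capture efficiency.

```python
import random

def apply_dropout(true_counts, dropout_rate, rng):
    """Zero out each count with probability dropout_rate (technical loss)."""
    return [0 if rng.random() < dropout_rate else c for c in true_counts]

def zero_fraction(counts):
    """Fraction of cells reporting zero for this gene."""
    return sum(1 for c in counts if c == 0) / len(counts)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical gene: expressed in 1000 cells, truly silent in 250.
true_counts = [5] * 1000 + [0] * 250
observed = apply_dropout(true_counts, dropout_rate=0.4, rng=rng)

true_zero = zero_fraction(true_counts)    # biological zeros only
observed_zero = zero_fraction(observed)   # biological plus technical zeros
```

From the observed counts alone, a zero caused by dropout looks identical to a zero from a silent gene. That ambiguity is why zero-inflated models treat the two sources of zeros separately.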

Solving for high dimensionality and noise in scRNA-seq data

A major challenge in single cell sequencing data analysis is the separation of thousands of cells into meaningful clusters. Given the high dimensionality of these data, traditional machine learning approaches cannot robustly distinguish cell clusters without a priori estimation of cell cluster number designations. This presents the need for more unbiased methods that will learn from the data how similar or different the individual cells are.

This was the challenge faced by researchers at Genuity Science (formerly known as WuXi NextCODE) and Yale University when investigating the molecular signatures of cardiovascular disease driven by aberrant signaling in endothelial and smooth muscle cells (SMCs) that line and make up blood vessels. Two experimental models of disease were investigated: atherosclerosis and thoracic aortic aneurysm and dissection (TAAD) (Chen et al., 2019; Li et al., 2020). While bulk RNA-seq experiments implicated several signaling pathways involved in disease progression, the researchers sought to identify specific cell populations and molecular signatures contributing to disease. Following scRNA-seq of aortic endothelial cells and SMCs, they employed a VAE strategy to reduce dimensionality and control for technical signal loss in the scRNA-seq data, so that putative cell clusters could be identified in an unbiased manner (Risso et al., 2018; Chen et al., 2019; Li et al., 2020).

This generative AI strategy was then followed by clustering analysis to identify the number of distinct cell clusters present.   The results of these scRNA-seq experiments revealed novel subpopulations of endothelial and smooth muscle cells with unique molecular signatures.  In the atherosclerosis model, analysis revealed a specific endothelial cell population displaying loss of endothelial markers and gain of mesenchymal and inflammatory markers, providing insight into how atherosclerotic plaques are formed and maintained.  In the model of TAAD, a population of phenotypically altered SMCs with degradative features was identified, highlighting how the aortic wall may become susceptible to aneurysm.  These aberrant cell populations were then experimentally determined to be critical drivers of atherosclerosis and TAAD disease progression, thereby revealing new targets for therapeutic development and potential points for intervention. 
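The idea behind consensus clustering can be illustrated with a short sketch. This is not the authors' implementation: it assumes several clustering runs have already produced labels for the same cells, and it swaps in a simple majority rule in which two cells end up together if they share a cluster in more than half of the runs. The function name and toy labels are hypothetical.

```python
def consensus_clusters(labelings):
    """Group cells that share a cluster in a strict majority of runs.

    labelings: list of label lists, one per clustering run, all the same length.
    Returns one consensus label per cell, numbered by first appearance.
    """
    n_runs, n_cells = len(labelings), len(labelings[0])
    parent = list(range(n_cells))          # union-find forest over cells

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n_cells):
        for j in range(i + 1, n_cells):
            together = sum(1 for labels in labelings if labels[i] == labels[j])
            if 2 * together > n_runs:      # co-clustered in most runs
                parent[find(i)] = find(j)

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n_cells)]

# Three toy runs over five cells; label names are arbitrary within each run.
runs = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
consensus = consensus_clusters(runs)
```

Because cluster labels are arbitrary within each run, the sketch compares pairs of cells rather than label values. Merging through connected components is one of several possible consensus rules and can chain clusters together on noisy data, so it should be read as an illustration of the principle only.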

Generative deep learning pipeline for scRNA-seq data analysis performed by researchers at Genuity Science and Yale University.  Alignment and pre-processing, zero-inflated variational autoencoder framework, and consensus cell clustering were used in their AI approach.  Modified from Figure S15, Chen et al., Cell Stem Cell 2020.

Gaining valuable insights with increasing experimental complexity

ScRNA-seq is also useful to examine aberrant cell differentiation over time – a powerful approach that can describe how one cell type differentiates into another. However, the additional experimental time points needed for this analysis compound the difficulties in cell cluster identification which are already a concern for scRNA-seq. Researchers at Genuity Science and Yale University tackled this problem in a model of cardiovascular disease by tracking aortic SMCs during aneurysm development in a longitudinal scRNA-seq experiment (Chen et al., 2020).  By applying their novel unsupervised AI methods, they identified cell clusters present at each time point during aneurysm development.

Next, they needed to compare between time points to understand disease progression. Using probabilistic methods to model differentiation across time, they identified similarities in the single cell gene expression data to build directed putative cell differentiation trajectory networks. This analysis projected the cell clusters over the experimental time course, demarcating which cell clusters gave rise to new populations at one, two, and four months. Through this analysis, the authors discovered a specific SMC population expressing KLF4 that underwent reprogramming to numerous mesenchymal cell types driven by transcriptional changes. They were also able to demonstrate that this reprogramming occurred in a clonal manner, showing that one KLF4-expressing SMC cluster present at baseline gave rise to the reprogrammed cell clusters expressing mesenchymal lineage markers at later time points. After further in vivo experiments, the authors proposed that this parent SMC population is disease-prone and responsible for aneurysm development. The discovery of this cell population and its plasticity could have significant therapeutic implications, and it was possible because of the cutting-edge analysis methods applied to longitudinal single cell expression data.
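A rough feel for how directed links between time points can be built is given by the sketch below. It is deliberately simpler than the probabilistic modeling described above: each cluster at a later time point is linked to the most similar cluster at the previous time point, using cosine similarity between mean expression profiles. The cluster names and centroid values are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def trajectory_edges(timepoints):
    """timepoints: list of {cluster_name: centroid}, ordered by time.

    Returns directed (parent, child) edges linking each cluster to its
    most similar cluster at the preceding time point.
    """
    edges = []
    for earlier, later in zip(timepoints, timepoints[1:]):
        for child, child_vec in later.items():
            parent = max(earlier, key=lambda p: cosine(earlier[p], child_vec))
            edges.append((parent, child))
    return edges

# Hypothetical centroids over three genes at two time points.
baseline = {"t0_c1": [9.0, 1.0, 0.0], "t0_c2": [8.0, 0.0, 3.0]}
month_1 = {"t1_c1": [8.0, 1.0, 0.0], "t1_c2": [3.0, 0.0, 8.0]}
edges = trajectory_edges([baseline, month_1])
```

In this toy case `t1_c2` is linked back to `t0_c2`, the baseline cluster it most resembles. A nearest-centroid rule always assigns a parent, even when no plausible one exists, which is one reason probabilistic models with explicit uncertainty are preferred for real trajectory inference.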

According to the Genuity Science press release:

“This is single-cell science at its revolutionary best. Building on the Yale team’s precision model biology, our deep learning has pinpointed and then observed month by month how a group of cells drive this disease,” said Dr Tom Chittenden, Genuity Science Chief Data Science Officer and co-senior author on the paper. “This is an in vivo-validated starting point for drug discovery in a condition that kills tens of thousands of people every year. Importantly, it also represents a replicable approach we can apply to uncover the causal biology of just about any phenotype.”

On the horizon:  Single-cell sequencing and advanced analysis for target discovery in complex disease

Many other diseases are good candidates for the approach described above. Most complex diseases occur in tissues with multiple distinct cell populations present. As such, an approach whereby whole genome sequencing is used to identify and validate novel targets, complemented by scRNA-seq analysis to provide additional mechanistic information, is likely to be of value. Specifically, diseases of the liver such as non-alcoholic steatohepatitis, infectious diseases, and many neurological disorders all involve a complex mixture of cell types. Deep learning approaches to the analysis of scRNA-seq data will enable researchers to reveal novel disease mechanisms, biomarkers of disease progression, and therapeutic targets.


  1. King, E. et al. (2019). Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLoS Genet 15(12): e1008489.
  2. Han, X., Zhou, Z., Fei, L. et al. (2020). Construction of a human cell landscape at single-cell level. Nature.
  3. Sheih, A. et al. (2020). Clonal kinetics and single-cell transcriptional profiling of CAR-T cells in patients undergoing CD19 CAR-T immunotherapy. Nature Comm 11: 219.
  4. Yao, C. et al. (2019). Single-cell RNA-seq reveals TOX as a key regulator of CD8+ T cell persistence in chronic infection. Nat Immunol 20(7): 890-901.
  5. Drayman, N. et al. (2019). HSV-1 single-cell analysis reveals the activation of anti-viral and developmental programs in distinct sub-populations. eLife 8:e46339.
  6. Chen, P.-Y., Qin, L., Li, G., Wang, Z., Dahlman, J.E., Malagon-Lopez, J., Gujja, S., Cilfone, N.A., Kauffman, K.J., Sun, L., et al. (2019). Endothelial TGF-beta signaling drives vascular inflammation and atherosclerosis. Nature Metabolism 1, 912–926.
  7. Li, G., Wang, M., Caulk, A.W., Cilfone, N.A., Gujja, S., Qin, L., Chen, P.-Y., Chen, Z., Yousef, S., Jiao, Y., et al. (2020). Chronic mTOR activation induces a degradative smooth muscle cell phenotype. The Journal of Clinical Investigation 130(3):1233-1251.
  8. Chen, P.-Y., Qin, L., Li, G., Malagon-Lopez, J., Wang, Z., Bergaya, S., Gujja, S., Caulk, A.W., Murtada, S.-I., Zhang, X., et al. (2020). Smooth muscle cell reprogramming in aortic aneurysms. Cell Stem Cell 26(4): 542-557.e11. doi: 10.1016/j.stem.2020.02.013.
  9. Van den Berge, K., Perraudeau, F., Soneson, C. et al. (2018) Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biol 19, 24.
  10. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J. P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
