Single-cell sequencing technology has advanced thanks to falling sequencing costs, a boom in microfluidics, and combinatorial indexing strategies. A single experiment can now analyse thousands, or even millions, of cells, which creates unique data science problems.
Sequencing individual cells allows researchers to investigate cell-to-cell heterogeneity and evaluate hypotheses about differences between pre-defined sample groups at the single-cell level. Single-cell measurements of DNA and RNA, and more recently of epigenetic markers and protein levels, can help researchers understand cells at the finest resolution possible.
Single-cell data science exacerbates many of the data issues that arise in bulk sequencing, and it also comes with new challenges: the limited amount of material per cell leads to uncertainty about observations, amplification can add technical noise to the data, and the increase in resolution adds another dimension to the data matrices.
The paper, published in Genome Biology, sets out the data science challenges the authors believe to be the most relevant to advancing single-cell data science. A summary can be found below:
- Handling sparsity in single-cell RNA sequencing
Single-cell RNA sequencing (scRNA-seq) can help researchers gain full insight into the interplay of transcripts within single cells. scRNA-seq measurements suffer from large fractions of observed zeros, where a given gene in a cell has no unique molecular identifiers or reads mapping to it. The term “dropout” is often used to denote these observed zeros, but it usually conflates two different types of zero value: those caused by methodological noise, and those reflecting a true biological absence of expression. Beyond variation in the number of unexpressed genes, the degree of sparsity (observed zeros) can be attributed to technical limitations, sequencing depth and underlying expression levels.
Sparsity can hinder downstream analysis and is still a challenge to handle appropriately.
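The dependence of sparsity on sequencing depth and expression level can be illustrated with a small worked example. The sketch below assumes a plain Poisson sampling model, which is a simplification chosen here for illustration and not a model taken from the paper; the function name, the gene means and the `off_fraction` parameter are all hypothetical.

```python
import math


def expected_zero_fraction(mean_expression, size_factor, off_fraction=0.0):
    """Expected share of observed zeros under a simple Poisson sampling model.

    mean_expression: expected counts per expressed gene at size_factor 1
                     (hypothetical values, for illustration only)
    size_factor:     relative sequencing depth / capture efficiency of a cell
    off_fraction:    share of genes that are truly unexpressed (biological zeros)
    """
    # For a Poisson variable with mean m, P(count == 0) = exp(-m).
    sampling_zeros = [math.exp(-size_factor * mu) for mu in mean_expression]
    technical = sum(sampling_zeros) / len(sampling_zeros)
    # Biological zeros are always zero; expressed genes are zero only by sampling.
    return off_fraction + (1.0 - off_fraction) * technical


# Hypothetical expressed genes, from lowly to highly expressed.
genes = [0.1, 0.5, 1.0, 5.0, 20.0]
for depth in (0.1, 1.0, 10.0):
    share = expected_zero_fraction(genes, depth, off_fraction=0.3)
    print(f"size factor {depth}: expected zero fraction {share:.3f}")
```

Under these assumptions the zero fraction falls as depth rises but never drops below `off_fraction`, which is one way to see why an observed zero alone cannot tell technical and biological absence apart.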
- Defining flexible statistical frameworks for discovering complex differential patterns in gene expression
Beyond changes in average gene expression between cell types or bulk-collected libraries, scRNA-seq allows changes in expression to be unravelled at high granularity. Currently, the vast majority of differential expression detection methods assume that the groups of cells to be compared are known in advance. However, most pipelines rely on clustering or cell type assignment before downstream analysis, without propagating the uncertainty in these assignments or accounting for the double use of data.
With the expanding capacity of experimental techniques to generate multi-sample scRNA-seq datasets, further statistical frameworks will be required to identify more differential patterns across samples. This will be particularly necessary in clinical applications, where cells are collected from multiple patients.
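The "double use of data" problem can be made concrete with a toy simulation. The sketch below is an illustration under stated assumptions, not a method from the paper: a single gene with no real group structure is simulated, cells are "clustered" by splitting at the median of that same gene, and a Welch t-statistic is then computed between the resulting groups. All names (`welch_t`, `double_dipping_demo`) are hypothetical.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def double_dipping_demo(n_cells=200, seed=0):
    """Compare a test on data-derived groups with a test on random groups."""
    rng = random.Random(seed)
    # Null data: every cell comes from the same distribution.
    expr = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]

    # "Clustering" step that uses the same values that are tested afterwards.
    cutoff = statistics.median(expr)
    high = [x for x in expr if x > cutoff]
    low = [x for x in expr if x <= cutoff]
    t_clustered = welch_t(high, low)

    # Reference: groups assigned at random, independently of expression.
    shuffled = expr[:]
    rng.shuffle(shuffled)
    half = n_cells // 2
    t_random = welch_t(shuffled[:half], shuffled[half:])
    return t_clustered, t_random


t_clustered, t_random = double_dipping_demo()
print(f"t after clustering on the same data: {t_clustered:.1f}")
print(f"t with random group labels:          {t_random:.1f}")
```

Even though no true difference exists, the statistic for the data-derived groups is very large, because the groups were defined to differ on the tested gene. Real pipelines cluster on many genes at once, so the effect is less extreme, but the same logic applies.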
- Mapping single cells to a reference atlas
Much secondary analysis requires the cells to be classified into cell types. The lack of appropriate, available references has meant that only reference-free approaches were feasible, with unsupervised clustering as the predominant option, which requires manual cluster annotation. This is a time-consuming process and limits the reproducibility of the results.
- Generalising trajectory inference
Several biological processes can be described as continuous dynamic changes in cell type, e.g. immune response or cancer growth. The path a cell can follow through this continuous space is called a trajectory, and several models have been proposed to describe cell state dynamics from transcriptomic data. Modelling of other omics measurements, or integrating multiple types of data, is still in its infancy. The study of complex trajectories integrating different data types could lead to a more systematic understanding of cell fate processes.
- Finding patterns in spatially resolved measurements
Single-cell spatial transcriptomics technologies retain the spatial co-ordinates of the cells and transcripts within a tissue. These data raise the question of how spatial information can be leveraged to find patterns, infer cell types or functions and classify the cells in a given tissue.
Single cell genomics
Within every organism, the genome can be altered. In cancer, these genetic alterations can occur during disease progression, resulting in tumour cell populations that are highly heterogeneous. Tumour heterogeneity can help doctors to predict patient response to therapy or survival, and understanding its dynamics could play a key role in improving diagnosis and therapeutic choices.
Single-cell DNA sequencing (scDNA-seq) requires whole-genome amplification (WGA) of the DNA from single cells, and this amplification process introduces errors and biases that challenge variant calling.
- Dealing with errors and missing data in the identification of variation from scDNA sequencing data
The major disturbing factor in scDNA-seq data is the WGA process, which introduces amplification errors (false-positive alternative alleles) and amplification bias: insufficient amplification, or its complete failure, which leads to imbalanced proportions or a lack of variant alleles.
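To see how failed amplification can distort what is observed at a single site, consider a truly heterozygous position. The sketch below assumes that each of the two alleles fails to amplify independently with the same probability; this is a deliberately simple model chosen for illustration and is not taken from the paper. The function name and the `ado_rate` parameter are hypothetical.

```python
def het_site_outcomes(ado_rate):
    """Outcome probabilities at a truly heterozygous site.

    ado_rate: assumed probability that one allele fails to amplify,
              applied independently to the reference and variant allele.
    """
    keep = 1.0 - ado_rate
    return {
        "both_alleles": keep * keep,      # genotype observed correctly
        "ref_only": keep * ado_rate,      # variant allele lost: a false negative
        "alt_only": ado_rate * keep,      # reference lost: looks homozygous variant
        "no_signal": ado_rate * ado_rate, # both lost: missing data at this site
    }


for rate in (0.05, 0.2, 0.4):
    outcomes = het_site_outcomes(rate)
    summary = ", ".join(f"{name}={prob:.3f}" for name, prob in outcomes.items())
    print(f"dropout rate {rate}: {summary}")
```

Under these assumptions, a per-allele failure rate of 0.2 already means that the true heterozygous genotype is seen in only 64% of cells, which is why variant callers for single cells need to model missing alleles explicitly.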
Single cell phylogenomics
Single-cell variation profiles can be used in computational models of somatic evolution, such as in cancer. Phylogenetic methods are generally used to reconstruct the evolutionary history of species, but they can also be applied to populations of cells, such as cancer cells.
- Scaling phylogenetic models to many cells and many sites
Phylogenetic models of tumour evolution face challenges of computational tractability mainly induced by the increasing number of cells sequenced in cancer studies and the increasing number of sites that can be queried per genome.
- Integrating multiple types of variation into phylogenetic models
Downstream analyses, such as characterising intratumoral heterogeneity and inferring its evolutionary history, suffer from unreliable variant detection in single cells. However, the better the quality of the variant calls becomes, the more important it is to model all types of available signal in mathematical models of tumour evolution, which should increase the resolution and reliability of the resulting trees.
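While adding data from more cells will improve the phylogeny, it also enlarges the space of possible trees, which grows rapidly with the number of taxa, so that exhaustively evaluating every tree is not feasible for multiple sequence alignments (MSAs) containing more than approx. 20 single cells.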
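One way to see why adding cells quickly becomes a computational problem is to count the possible tree shapes. The sketch below uses the standard result that n labelled taxa admit (2n - 5)!! distinct unrooted binary trees; reading the limit of about 20 cells as a limit on exhaustive tree search is an assumption made here, and the function name is hypothetical.

```python
def unrooted_binary_trees(n_taxa):
    """Number of distinct unrooted binary tree topologies for n labelled taxa.

    Uses the double factorial (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5).
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted binary tree")
    count = 1
    # Multiply the odd numbers from 3 up to and including 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(f"{n} cells: {unrooted_binary_trees(n):.3e} possible trees")
```

With 20 cells the count already exceeds 10^20, so practical methods have to rely on heuristic or sampling-based searches of tree space instead of scoring every topology.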
- Inferring population genetic parameters of tumour heterogeneity by model integration
Tumour heterogeneity is the result of the evolutionary journey of tumour cell populations and microenvironmental factors. These impose different selective pressures on the tumour cells, which drive the formation of subclones and influence disease progression, patient outcome, and treatment response. This leaves many unanswered questions, such as whether metastatic seeding occurs early and multiple times in parallel, or whether it occurs late and from a far-developed subclone in the primary tumour. It is also unknown whether a single cell can seed a metastasis, or whether the migration of multiple cells is required. Here, sc-seq can provide invaluable resolution.
Technology allows us to measure the arrangement and relationships of tumour cells in space, with cell location amounting to a second measurement requiring data integration within a cell. In vivo imaging techniques and automated analysis of whole-slide immunohistochemistry images promise to provide information that helps develop mutational profiles from scDNA-seq.
However, classical mathematical models assume well-mixed populations, ignoring spatial structure and evolutionary microenvironments. Likewise, it is already common to sequence tumour material from different time points, which lends itself to temporal analyses of subclonal genotypes, but eventually population genetic models should be integrated with approaches from phylogenetics to leverage the relationships between cells. For comprehensive integration, key parameters will need to be quantified with higher resolution.
With the increased resolution of scDNA-seq, it will be a challenge to integrate this with the spatial location of single cells obtained from other measurements. It will also be a challenge to detect positive or diversifying selection with greater resolution, building on approaches from the bulk context. Another issue with the detection of positive or diversifying selection is the high false-positive rate, and although there are computationally intense solutions for decreasing it, they might not scale to single-cell cancer data sets.
- Integration of single-cell data across samples, experiments and types of measurement
To comprehensively analyse complex biological processes, different types of measurements from multiple experiments need to be obtained and integrated. This integration requires flexible but rigorous statistical and computational frameworks. Experimental technologies that allow multiple measurement types in the same cell are on the rise, but this type of data comes with the challenge of accounting for dependencies among the measurement types.
- Validating and benchmarking analysis tools for single-cell measurements
The need for datasets and methods to support systematic benchmarking and evaluation of analysis tools is growing. To be useful, algorithms and pipelines should be able to pass quality control tests, including the ability to produce the expected results and to remain robust to high levels of sequencing noise. Current simulation tools mostly concentrate on differential expression analysis, and comprehensive methods for other aspects are not yet developed.
Likewise, tools are needed for validating simulated sc-seq datasets by comparison with real data. It would be beneficial to bring together a community-supported benchmarking platform to allow ongoing comparison of methods as new approaches are proposed.