Single-cell foundation models: The next big thing?

Single-cell sequencing has already become a staple methodology for medical research. Much of this advancement has happened within, and is directly relevant to, cancer biology. In our recent webinar series, nine expert speakers showcased examples of best-in-class single-cell sequencing data to improve the understanding of cancer characteristics such as cell heterogeneity, cellular networks and tumour microenvironment dynamics. In this feature, we summarise the exciting talk given by Jordan Krull (Post-Doctoral Research Scholar, Pelotonia Institute for Immuno-Oncology, Ohio State University) on ‘Unlocking the Potential of Foundation Models in Single-Cell RNA-Seq to Enhance Insights into Tumor Immunity’. To hear Jordan’s talk, along with the rest of the series, click here.

The challenges and benefits of scRNA-seq

Single-cell RNA-seq (scRNA-seq) has become a staple tool in the study of tumour microenvironments and immune responses. When combined with other advanced tools, more critical questions regarding tumour dynamics can be answered, including understanding which cells are present, what they’re doing, and how they are interacting with one another. Online data repositories for scRNA-seq data have been integral to the methodology and have contributed to our understanding of cancer.

However, there are challenges associated with using these advanced tools, specifically in areas like gene regulatory network analysis. A requirement for large datasets means that many labs do not have the necessary resources, and integration requires exceptional computational power and infrastructure, leading to a need for better solutions.

What is a foundation model?

A potential modern solution to the above problem is the use of foundation models. Jordan defines a foundation model as a ‘large-scale machine learning model, which is pre-trained on a large enough dataset that it is generalizable to a small-scale fine-tuning for various specific tasks.’

There are debates around what can be classified as a foundation model, rather than just a deep-learning model. Two of the most important distinctions, according to Jordan, are that the model must have a massive training corpus drawn from a diverse dataset, and be generalizable to various sources of data, even those it wasn’t trained on. The aim is for the model to capture as broad a context for gene expression as possible.

Four single-cell foundation models (scFMs) have appeared in peer-reviewed journals. In scFMs, every cell is treated as a sentence prompt: a start sequence followed by the cell’s genes, each represented by two numerical values denoting its identity and its position. Identity in this context means ‘where is this gene in the compendium of those being evaluated by the model’, in other words, where is it in the ‘dictionary’. Because the dictionary is shared, identity can be connected across cells. Most models also use rank value encoding for positional encoding, which is almost like asking ‘where does this word occur in a sentence?’ Applied to gene expression values, it ranks the genes in a cell from highest expressed to lowest expressed, allowing us to see how expression varies across cells. Variations on this concept between scFMs are typically only in how normalisation is carried out before encoding.
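The encoding described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of any published scFM: the gene vocabulary, the start token id, the function name `rank_value_encode` and the optional `scale` argument (standing in for whatever normalisation a given model applies before ranking) are all assumptions made for the example, and the counts are toy values.

```python
# Hypothetical toy vocabulary: gene symbol -> integer id (the "dictionary").
# Id 0 is reserved here for an assumed start-of-cell token.
START_TOKEN_ID = 0
GENE_VOCAB = {"CD3E": 1, "CD8A": 2, "GZMB": 3, "MS4A1": 4, "LYZ": 5}


def rank_value_encode(expression, gene_names, vocab=GENE_VOCAB, scale=None):
    """Turn one cell's expression vector into a ranked token sequence.

    expression: counts, one entry per gene in gene_names.
    scale: optional per-gene normalisation factors (an assumption; real
           models differ in how they normalise before ranking).
    Returns a list of (gene_id, position) pairs, led by the start token.
    """
    if scale is None:
        scale = [1.0] * len(expression)
    values = [count / factor for count, factor in zip(expression, scale)]

    # Keep only expressed genes, then sort from highest to lowest value.
    # sorted() is stable, so tied genes keep their input order.
    expressed = [i for i, value in enumerate(values) if value > 0]
    order = sorted(expressed, key=lambda i: -values[i])

    tokens = [(START_TOKEN_ID, 0)]
    for position, gene_index in enumerate(order, start=1):
        tokens.append((vocab[gene_names[gene_index]], position))
    return tokens


genes = ["CD3E", "CD8A", "GZMB", "MS4A1", "LYZ"]
t_cell = [12, 30, 7, 0, 1]   # toy counts for a T-cell-like profile
b_cell = [0, 0, 0, 25, 3]    # toy counts for a B-cell-like profile

t_tokens = rank_value_encode(t_cell, genes)
b_tokens = rank_value_encode(b_cell, genes)
print(t_tokens)
print(b_tokens)
```

In this toy example LYZ keeps the same identity (id 5) in both cells but sits at a different rank in each, which is the sense in which identity is shared across cells while position reflects each cell’s own expression profile.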

The screenshot below details the typical structure of an scFM.

Figure 1: The structure of a single-cell foundation model. Screenshot taken directly from Jordan Krull, Single-Cell for Cancer Biology ONLINE Webinar, [18.04.24].

A realistic solution?

Objective outputs from these scFMs are boundless, according to Jordan. They can be used to understand cell and gene clustering, text annotation, batch integration and more. In addition, they can be used for single-cell perturbation.

However, there are major limitations to the use of scFMs. Notably, they are relatively inaccessible for most users and, as they are often built by non-biologists, their construction and results are often not biologically intuitive. They are often hosted on unfamiliar repositories and written in programming languages that many biologists do not learn. In addition, it is currently rare to find novel predictions that have been properly validated.

Jordan: “Short story – there are currently no use cases in tumor-immune scRNA-seq using FMs.”

Possible expansions

Although there are no current use cases, these models could in future be used for supervised and unsupervised cell clustering, deriving accurate gene regulatory networks from small datasets, identifying master regulators, and predicting responses to in silico treatment and deletion.

Jordan describes the example of scGPT, a model published just this year. The researchers involved in the project applied the model to human immune data in both zero-shot and fine-tuned settings, and found cell-specific HLA and cluster of differentiation (CD) gene sub-networks with very little training. To extend this case study, the model could be used to define gene regulatory networks specific to tumour-infiltrating lymphocytes in immune checkpoint blockade-resistant tumours. Jordan’s lab is currently working on this.
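To make the idea of a gene sub-network more concrete, one common way to derive a network from a model’s learned gene embeddings is to link genes whose embedding vectors are highly similar. The sketch below uses cosine similarity with a fixed threshold. It is an illustration only: whether it matches the procedure used in the study Jordan describes is an assumption, the embedding values are made-up toy numbers rather than model output, and the names `GENE_EMBEDDINGS`, `similarity_network` and the 0.9 threshold are hypothetical.

```python
import math

# Hypothetical toy gene embeddings (made-up numbers, not model output).
GENE_EMBEDDINGS = {
    "HLA-DRA": [0.9, 0.1, 0.0],
    "HLA-DRB1": [0.8, 0.2, 0.1],
    "CD3E": [0.1, 0.9, 0.2],
    "CD8A": [0.0, 0.8, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_network(embeddings, threshold=0.9):
    """Return gene pairs whose embeddings are at least `threshold` similar."""
    genes = sorted(embeddings)
    edges = []
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            score = cosine_similarity(embeddings[gene_a], embeddings[gene_b])
            if score >= threshold:
                edges.append((gene_a, gene_b))
    return edges


edges = similarity_network(GENE_EMBEDDINGS)
print(edges)
```

With these toy values the two HLA genes are linked to each other and the two CD genes are linked to each other, with no edges between the groups. In practice the threshold and the choice of similarity measure would need tuning and biological validation.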

Ultimately, tumour-immune single-cell datasets exist in high enough abundance to start considering the use of foundation models, but more work is required to ensure the success of this method. The models are already capable of answering many important questions, but parameter optimisation and model tweaking will be crucial in addressing the remaining challenges.

Jordan: “A user interface is definitely needed. User interfaces are how we tap into the rest of the biology network that wants to use these models, because right now they are very difficult to use.”

Q&A Highlights

Q: Are there any single-cell/spatial applications of foundational models that are really exciting you right now?

A: Yes. Like I mentioned, I think single-cell perturbation is a really exciting area, mostly for the sheer fact that we can narrow a large search down to far more high-potential candidates. And the fact that two studies have shown they can validate some of their top candidates is promising for these models.

Q: What are the next steps to determine if the current models are “foundation” enough to be an FM?

A: We need to push the models to do predictions that make sense. I’m an immunologist by training, so I’m usually evaluating whether these models are giving you an answer that makes sense. So, please contact your local biologist to make sure that the results are not just an overly intuitive response! But from there, ChatGPT already suggests that if you want legal advice then you have to feed it a lot of legal information – the model is already trained.

So, I think there will probably be a base model and you’ll do some retraining based on your disease, and we’ll have sub-models for different disease states. Whether that’s just for tumour immunology, or there’ll be a brain tumour or colorectal-specific model, we’ll see. But the next steps are to put these models through their paces.