A New Deep Learning Tool to Translate Mass Spectra into Peptide Sequences

High-throughput proteomics techniques such as mass spectrometry have the capacity to dramatically change the field as we know it. With the amount of data increasing, new computational approaches are being developed to analyse data in new ways and produce novel discoveries.

Wout Bittremieux, Assistant Research Professor at the University of Antwerp, is developing machine learning algorithms to translate mass spectra into peptide sequences. Wout shared the details of his new machine learning tool, Casanovo [1,2], at a recent Front Line Genomics webinar.

In this article, we give an overview of peptide mapping in immunopeptidome analysis, explain how Casanovo sequences novel peptides, and see how the tool performs against other de novo approaches. To hear Wout’s presentation in full, as well as other talks on computational proteomics and big data analysis in protein discovery and characterisation, please check out the on-demand webinar here.

Deciphering the immunopeptidome

It has been well established that cellular division is an imperfect process. When cells divide, genomic mutations can be introduced that may eventually result in cancer development. These cellular changes are recognised by immune cells, which interact with small pieces of proteins (immunopeptides) presented on the surface of cancerous cells. Immune cell recognition can result in the destruction of the cancerous cell, thus preventing disease development.

However, cancer employs several tactics to escape this fate. Cancerous cells may replicate faster than they can be detected, or otherwise evade the immune system, which leads to cancer development. One possible treatment is to identify the unique immunopeptides present on cancer cells (called neo-epitopes) and use this information in immunotherapy. Modifying the patient’s immune cells to target the neo-epitopes can increase the recognition and destruction of cancer cells (see Figure 1).

Figure 1: Basic overview of immunotherapy. This process relies on the recognition of neo-epitopes on cancer cells by immune cells. This figure is a screenshot from Wout’s presentation at the Front Line Genomics webinar.

The journey from mass spectra to peptide

So, how do we find out what immunopeptides are present on cancerous cells? In immunopeptidomics, HLA molecules that bind to immunopeptides are extracted from the sample and the peptides are analysed using mass spectrometry. Notably, the proteins are processed inside the cell (instead of being cleaved by a protease), so there is a large variety of immunopeptides to analyse. Each experiment can produce thousands of mass spectra.

The next step is to determine which peptide each spectrum corresponds to. To work out the amino acid sequence, several key pieces of data are needed. When a peptide is analysed by mass spectrometry, its backbone bonds are cleaved to create fragment ions. It is generally known how peptides fragment into b- and y-ions in the mass spectrometer, and at which mass-to-charge values (m/z) these fragment ions would occur. Using this data, the amino acid sequence of the peptide can be determined (see Figure 2).

Figure 2: Peptide sequence prediction from a mass spectrum. Peaks corresponding to b- (blue) and y-ions (red) of the associated peptide are shown, whereas black peaks correspond to unexpected fragmentation events or noise. This figure is a screenshot from Wout’s presentation at the Front Line Genomics webinar.
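
To make the idea of expected fragment m/z values concrete, here is a minimal Python sketch that computes the b- and y-ion ladders for a peptide. It is an illustration only, not part of Casanovo: the function name `fragment_mz` is hypothetical, the mass table covers just the residues in the example peptide, and modifications, neutral losses and other ion types are ignored.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example only.
RESIDUE_MASS = {
    "P": 97.05276, "E": 129.04259, "T": 101.04768,
    "I": 113.08406, "D": 115.02694,
}
WATER = 18.010565   # y-ions include one water in addition to their residues
PROTON = 1.007276   # mass of each charge-carrying proton

def fragment_mz(peptide, charge=1):
    """Return (b_ions, y_ions) as m/z lists ordered b1..b(n-1) and y1..y(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_neutral = sum(masses[:i])            # N-terminal prefix of length i
        y_neutral = sum(masses[-i:]) + WATER   # C-terminal suffix of length i
        b_ions.append((b_neutral + charge * PROTON) / charge)
        y_ions.append((y_neutral + charge * PROTON) / charge)
    return b_ions, y_ions

b_ions, y_ions = fragment_mz("PEPTIDE")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:9.4f}   y{i} = {y:9.4f}")
```

Each b-ion and its complementary y-ion together account for the whole peptide, which is why the two ladders can be read against each other in a spectrum.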

The most common form of spectrum annotation is sequence database searching. Using a sequence database of theoretical spectra simulated from candidate peptides, you can find a match between your experimental spectrum and a theoretical spectrum. This information can then be used to assign peptide labels to the spectra. However, depending on the experiment, this is not always possible.
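
As a rough sketch of the database search idea, the Python below scores each candidate peptide by how many of its theoretical fragment peaks appear in the observed spectrum, then keeps the best match. All names here are hypothetical and the shared-peak count is a deliberately simple stand-in: real search engines use more elaborate scoring, filter candidates by precursor mass, and estimate error rates.

```python
from bisect import bisect_left

# Monoisotopic residue masses (Da) for the residues used in this toy database.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "L": 113.08406, "K": 128.09496, "E": 129.04259}
WATER, PROTON = 18.010565, 1.007276

def theoretical_peaks(peptide):
    """Sorted singly charged b- and y-ion m/z values for one candidate peptide."""
    m = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(m[:i]) + PROTON for i in range(1, len(m))]
    y = [sum(m[i:]) + WATER + PROTON for i in range(1, len(m))]
    return sorted(b + y)

def shared_peak_count(observed, theoretical, tol=0.02):
    """Count theoretical peaks that have an observed peak within tol m/z units."""
    observed = sorted(observed)
    hits = 0
    for mz in theoretical:
        j = bisect_left(observed, mz - tol)   # first observed peak >= mz - tol
        if j < len(observed) and observed[j] <= mz + tol:
            hits += 1
    return hits

def search(observed, database, tol=0.02):
    """Return (best_peptide, score) among the candidate peptides."""
    scores = {p: shared_peak_count(observed, theoretical_peaks(p), tol)
              for p in database}
    best = max(scores, key=scores.get)
    return best, scores[best]

database = ["GASLK", "GLSAK", "KEAGS"]            # toy candidate list
observed = theoretical_peaks("GASLK") + [200.0]   # simulated spectrum plus one noise peak
print(search(observed, database))
```

The sketch also shows the limitation discussed next: a peptide that is missing from `database` can never be returned, however good the spectrum is.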

“Sometimes we don’t have a sequence database because we’re analysing data from non-model organisms, or there’s unexpected modifications, etc. In this case, sequence database searching is not an option, because this can only identify peptides that we know to look for”, Wout explains.

Luckily, there’s an alternative: de novo peptide sequencing. Here, the peptide-spectrum match is determined by deriving the peptide from the spectrum directly, for example, by checking whether the distances between the peaks correspond to specific amino acids. As we don’t know what peptides we’re looking for, this is a significantly more challenging task than sequence database searching.
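
The peak-distance idea can be sketched in a few lines of Python. The example below takes a clean, simulated b-ion ladder and looks up which amino acid residue masses fit each gap between neighbouring peaks. The function names are hypothetical, and real spectra are far messier (missing peaks, noise, mixed ion types), which is exactly what makes de novo sequencing hard.

```python
from itertools import accumulate

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276

def residues_for_gap(gap, tol=0.01):
    """Amino acids whose residue mass is within tol Da of a peak-to-peak gap."""
    return [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - gap) <= tol]

def read_ladder(peaks, tol=0.01):
    """For each pair of neighbouring peaks, list the residues that fit the gap."""
    peaks = sorted(peaks)
    return [residues_for_gap(hi - lo, tol) for lo, hi in zip(peaks, peaks[1:])]

# Simulated, noise-free singly charged b-ion ladder (b1..b6) for PEPTIDE.
ladder = [m + PROTON for m in accumulate(RESIDUE_MASS[aa] for aa in "PEPTIDE")][:-1]
for options in read_ladder(ladder):
    print(options)
```

Note that leucine and isoleucine have identical masses, so a gap of that size is ambiguous from mass alone, and the first residue cannot be read from gaps between b-ions at all.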

As a result, de novo is generally a less popular strategy than sequence database searching. However, there have been several notable de novo approaches developed over the last decade, with newer tools using machine learning approaches that make this process significantly easier.

Introducing Casanovo

Wout’s team have used the latest advancements in deep learning – including large language models – to carry out de novo peptide sequencing. By using models that are very similar to those used in ChatGPT, Wout aimed to approach sequencing like a translation task, essentially translating from the “language” of mass spectra into the “language” of amino acids.

To this end, Wout and colleagues developed Casanovo [2], a deep learning tool that uses a transformer neural network architecture (see Figure 3). A transformer encoder receives the mass spectrum peaks as input and learns a contextualised latent representation of the spectrum. This representation is passed to a transformer decoder, which iteratively predicts the next amino acid in the sequence. Finally, a beam search decoding strategy uses these predictions to find the peptide with the highest score.

Figure 3: Architecture of the Casanovo algorithm. This figure is a screenshot from Wout’s presentation at the Front Line Genomics webinar.
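
The beam search step can be illustrated with a small, self-contained Python sketch. This is not Casanovo’s implementation: the `toy_decoder` function and its probabilities are made up to stand in for the trained transformer decoder, and the `STOP` token is a hypothetical end-of-peptide marker. The point is only to show how keeping several partial peptides alive can find a better-scoring sequence than always taking the single most likely next amino acid.

```python
import math

STOP = "$"  # hypothetical end-of-peptide token

def beam_search(next_log_probs, beam_width=2, max_len=20):
    """Return (log_score, peptide) for the best finished hypothesis found.

    next_log_probs(prefix) must return {token: log_probability}; in a real
    system this role would be played by the trained decoder.
    """
    beams = [(0.0, "")]   # (cumulative log-probability, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token, logp in next_log_probs(prefix).items():
                if token == STOP:
                    finished.append((score + logp, prefix))
                else:
                    candidates.append((score + logp, prefix + token))
        if not candidates:
            break
        # Keep only the beam_width highest-scoring partial peptides.
        beams = sorted(candidates, reverse=True)[:beam_width]
    return max(finished) if finished else max(beams)

def toy_decoder(prefix):
    """Made-up next-token probabilities, purely for illustration."""
    table = {
        "":  {"A": 0.60, "G": 0.40},
        "A": {"K": 0.55, "R": 0.45},
        "G": {"K": 0.90, "R": 0.10},
    }
    probs = table.get(prefix, {STOP: 1.0})   # any two-residue prefix stops
    return {token: math.log(p) for token, p in probs.items()}

for width in (1, 2):
    score, peptide = beam_search(toy_decoder, beam_width=width)
    print(width, peptide, round(math.exp(score), 2))
```

With a beam width of 1 the search is greedy and commits to the locally best first residue, whereas a wider beam can recover a peptide whose overall probability is higher.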

Benchmarking Casanovo

The predictive capabilities of Casanovo were then compared to other popular de novo tools, including Novor [3] and DeepNovo [4]. When Casanovo was trained on a similar dataset to these tools (containing approximately 1 million spectra), it had a higher average peptide precision than both DeepNovo and Novor (0.84 compared to 0.76 and 0.64, respectively). In addition, when Casanovo was trained on a larger dataset (30 million spectra), it significantly outperformed the other tools (average precision = 0.95).
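
The article does not spell out how peptide precision is calculated. One common reading, sketched below in Python under that assumption, is the fraction of predicted peptides that match the reference peptide in full. The function name and the example peptides are hypothetical, and treating I and L as interchangeable is a simplification made here because the two residues have the same mass.

```python
def peptide_precision(predicted, truth):
    """Fraction of predicted peptides that match the reference peptide exactly.

    Spectra without a prediction (None) are left out of the denominator.
    I and L are treated as identical because they cannot be told apart by mass.
    """
    pairs = [(p, t) for p, t in zip(predicted, truth) if p is not None]
    if not pairs:
        return 0.0

    def same(a, b):
        return a.replace("I", "L") == b.replace("I", "L")

    return sum(same(p, t) for p, t in pairs) / len(pairs)

# Made-up predictions and reference peptides, for illustration only.
predicted = ["PEPTLDE", "GASLK", None, "KEAGS"]
truth     = ["PEPTIDE", "GASLK", "ACDEF", "KEAGT"]
print(peptide_precision(predicted, truth))
```

Under this definition a single wrong residue makes the whole peptide count as incorrect, so it is a stricter measure than per-amino-acid accuracy.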

Next, Casanovo’s peptide prediction ability was tested on a library of immunopeptides originating from breast cancer cells. To evaluate Casanovo’s sequence predictions, the results were checked to determine if they were present in both database search results (Tide) and the human proteome. Overall, Casanovo detected 65% more unique peptides that match the human proteome when compared to Tide (see Figure 4).

Figure 4: Performance comparison of Casanovo against database searching (Tide). This figure is a screenshot from Wout’s presentation at the Front Line Genomics webinar.
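
Checking predictions against a proteome can be pictured as a substring lookup. The Python sketch below counts the unique predicted peptides that occur somewhere in a set of protein sequences. It is a simplified illustration, not the procedure used in the study: the function names are hypothetical, the protein sequences are made-up strings rather than real human proteins, and I and L are merged because they cannot be distinguished by mass.

```python
def matches_proteome(peptide, proteins):
    """True if the peptide occurs in any protein sequence (I and L merged)."""
    normalised = peptide.replace("I", "L")
    return any(normalised in protein.replace("I", "L") for protein in proteins)

def unique_proteome_matches(peptides, proteins):
    """Set of unique (I/L-normalised) peptides that occur in the proteome."""
    unique = {p.replace("I", "L") for p in peptides}
    return {p for p in unique if matches_proteome(p, proteins)}

# Made-up sequences standing in for a proteome, for illustration only.
proteins = ["MKTAYIAKQRQISFVK", "GASLKPEPTIDEW"]
predictions = ["PEPTLDE", "GASLK", "GASLK", "WWWWW"]
print(len(unique_proteome_matches(predictions, proteins)))
```

For a full proteome, a linear scan like this would be slow, and an index over the protein sequences would be the more practical choice.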

Lastly, Casanovo was tested in an independent, external study analysing the performance of several de novo tools [5]. In this study, de novo tools were applied to the identification and assembly of antibodies. Impressively, Casanovo performed better than every other de novo algorithm tested, again illustrating the power of transformer-based deep learning tools for de novo peptide sequencing.

“Immunopeptide analysis is a pretty challenging task, but it’s also very relevant for a lot of clinical applications”, Wout concludes. “Our Casanovo tool uses a transformer neural network that was trained on millions of peptides to achieve state-of-the-art performance in de novo peptide sequencing.”

Casanovo is open-source and freely available on GitHub, so give it a try! More information about Casanovo and Wout’s other publications can be found on his website.

Webinar Q&A (Direct quotes)

Q: “Do specific amino acid modifications affect peptide mapping? If so, how do you overcome that in your technique?”

A: “Yes, this is a very relevant question. As Casanovo is a machine learning tool, it can only identify post-translational modifications (PTMs) that it has seen during training. At the moment, there’s a limited set of PTMs that we support – including oxidation, deamidation, acetylation, carbonylation and loss of ammonia. Those are the PTMs that Casanovo can recognise at the moment. We are in the process of compiling a larger dataset for training, with a larger variety of PTMs, to expand that set in the near future.”


1. Yilmaz, M., Fondrie, W., Bittremieux, W., Oh, S. & Noble, W. S. De novo mass spectrometry peptide sequencing with a transformer model. Proceedings of the 39th International Conference on Machine Learning, PMLR 162, 25514–25522 (17–23 Jul 2022).

2. Yilmaz, M. et al. Sequence-to-sequence translation from mass spectra to peptides with a transformer model. bioRxiv 2023.01.03.522621 (2023) doi:10.1101/2023.01.03.522621.

3. Ma, B. Novor: real-time peptide de novo sequencing software. J. Am. Soc. Mass Spectrom. 26, 1885–1894 (2015).

4. Tran, N. H., Zhang, X., Xin, L., Shan, B. & Li, M. De novo peptide sequencing by deep learning. Proc. Natl. Acad. Sci. U. S. A. 114, 8247–8252 (2017).

5. Beslic, D., Tscheuschner, G., Renard, B. Y., Weller, M. G. & Muth, T. Comprehensive evaluation of peptide de novo sequencing tools for monoclonal antibody assembly. Brief. Bioinform. 24, (2023).