10/02/2023 - Journal Club

Casanovo, a transformer model for de novo mass spectrometry peptide sequencing

by Zahra Elhamraoui and Mostafa Kalhor

In the last Journal Club, we presented a paper by Yilmaz et al., «De novo mass spectrometry peptide sequencing with a transformer model» [1], which introduces a deep learning model for de novo peptide sequencing.

What? You do not know exactly what de novo peptide sequencing is? Let me explain. Imagine that you do not have enough prior knowledge about your sample to build a protein database for it. Database search is then not an option, so we try to identify peptide sequences directly from the experimental spectra. Peptides produced by protease cleavage break in a regular way inside the mass spectrometer. The principle is to find this fragmentation pattern in the spectrum and to infer the amino acids, together with any post-translational modifications they carry, from the mass differences between the peaks.
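The mass-difference idea can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not how Casanovo or any other tool mentioned here works: the peak list is made up and noise-free, the function name `read_ladder` and the 0.02 Da tolerance are hypothetical, and only a handful of amino acids are included (leucine and isoleucine share the same mass, so only L is listed).

```python
# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "E": 129.04259, "F": 147.06841,
}

def read_ladder(peaks, tol=0.02):
    """Map each gap between consecutive peaks to the closest residue.

    Returns '?' for a gap that matches no residue within `tol` Da.
    """
    peaks = sorted(peaks)
    residues = []
    for lo, hi in zip(peaks, peaks[1:]):
        gap = hi - lo
        best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - gap))
        residues.append(best if abs(RESIDUE_MASS[best] - gap) <= tol else "?")
    return "".join(residues)

# Hypothetical ladder of singly charged prefix (b) ions for a peptide
# that starts with P followed by E, P, T.
PROTON = 1.00728
b_ions = [PROTON + RESIDUE_MASS["P"]]
for aa in "EPT":
    b_ions.append(b_ions[-1] + RESIDUE_MASS[aa])

print(read_ladder(b_ions))  # prints EPT
```

The first residue cannot be read from a gap because there is no peak before it, which is one of many reasons real de novo sequencing is harder than this sketch suggests: real spectra also contain noise peaks, missing fragments, and several ion types.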

Figure 1: Casanovo performs de novo peptide sequencing. Source: Yilmaz et al., 2022 [1]

Early de novo methods used heuristic search or dynamic programming to score peptide sequences. More recently, deep learning models such as DeepNovo, SMSNet, and PointNovo have been developed to predict peptide sequences from MS2 spectra. However, these models rely on complex post-processing steps. In addition, they are built on recurrent neural networks, which are slow to train and struggle with long-range dependencies.
To address these drawbacks, the authors proposed Casanovo, a transformer-based model for de novo peptide sequencing. Casanovo uses the self-attention mechanism to translate a variable-length sequence of observed spectrum peaks into a variable-length sequence of amino acids, analogous to neural machine translation in natural language processing. Casanovo consists of a transformer encoder and decoder, where the encoder takes d-dimensional spectrum peak embeddings as input and outputs d-dimensional latent representation vectors.
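To make the self-attention step more concrete, here is a minimal sketch in plain Python. It is not Casanovo's implementation: a real transformer layer uses learned query, key, and value projections, several attention heads, and feed-forward sublayers, all of which are omitted here by assuming Q = K = V = X. The embedding values are made up. The point is only that the same function handles spectra with different numbers of peaks and returns one d-dimensional vector per peak.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (an assumption).

    X is a list of n vectors of dimension d; the result has the same shape.
    """
    d = len(X[0])
    out = []
    for q in X:
        # Similarity of this peak to every peak, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Each output vector is a weighted average of all peak vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Hypothetical 4-dimensional peak embeddings for two spectra of different length.
spectrum_a = [[0.1, 0.3, -0.2, 0.5], [0.4, -0.1, 0.2, 0.0], [0.0, 0.2, 0.1, -0.3]]
spectrum_b = spectrum_a + [[0.3, 0.3, 0.3, 0.3], [-0.2, 0.1, 0.0, 0.4]]

for spectrum in (spectrum_a, spectrum_b):
    latent = self_attention(spectrum)
    print(len(latent), "peaks ->", len(latent), "vectors of dimension", len(latent[0]))
```

Because every peak attends to every other peak in a single step, no information has to be carried through a long chain of recurrent states, which is the property that distinguishes this approach from recurrent networks.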
Casanovo was trained on 30 million labeled spectra covering seven types of variable modifications (methionine oxidation, asparagine deamidation, glutamine deamidation, N-terminal acetylation, N-terminal carbamylation, N-terminal NH3 loss, and the combination of N-terminal carbamylation and NH3 loss).
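Each of these modifications shifts the mass of the residue or terminus it sits on, which is why a de novo model has to know about them. The short sketch below uses standard monoisotopic mass shifts; the dictionary layout and the function name `modified_residue_mass` are assumptions made for illustration and say nothing about how Casanovo represents modifications internally.

```python
# Monoisotopic mass shifts (Da) of the modifications listed above.
MOD_SHIFT = {
    "oxidation": 15.994915,       # on methionine
    "deamidation": 0.984016,      # on asparagine or glutamine
    "acetylation": 42.010565,     # N-terminal
    "carbamylation": 43.005814,   # N-terminal
    "NH3 loss": -17.026549,       # N-terminal
}

# Monoisotopic residue masses (Da) for the residues used below.
RESIDUE_MASS = {"M": 131.04049, "N": 114.04293, "D": 115.02694, "F": 147.06841}

def modified_residue_mass(aa, mod):
    """Mass of residue `aa` carrying modification `mod`."""
    return RESIDUE_MASS[aa] + MOD_SHIFT[mod]

oxidized_m = modified_residue_mass("M", "oxidation")
deamidated_n = modified_residue_mass("N", "deamidation")

# Oxidized methionine lies close to phenylalanine in mass,
# and deamidated asparagine is almost indistinguishable from aspartic acid.
print(f"M+oxidation vs F: {abs(oxidized_m - RESIDUE_MASS['F']):.4f} Da apart")
print(f"N+deamidation vs D: {abs(deamidated_n - RESIDUE_MASS['D']):.4f} Da apart")
```

Near-collisions like these illustrate why adding modifications enlarges the search space and makes the sequencing problem harder.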

Figure 2: Casanovo architecture. Source: Yilmaz et al., 2022 [1]

To evaluate the performance of Casanovo and compare it with other de novo peptide sequencing models, the authors used the nine-species benchmark data set. This data set combines about 1.5 million mass spectra from nine different experiments, each using the same instrument to analyze peptides from a different species.

Casanovo leverages the transformer architecture to provide a unified solution that translates mass spectra directly into peptide sequences, without discretizing the spectrum m/z axis and without complex post-processing.

We interviewed Melih Yilmaz and asked him the following two questions about the paper:

What do you think is the most challenging issue in building a better deep learning model for de novo sequencing?

Thanks for reaching out and for your interest in Casanovo! A challenge that we tried to overcome with our new preprint was that the original version of Casanovo was trained on MS data from peptides digested with trypsin enzyme which didn’t perform as well for samples that were digested using a different enzyme. To mitigate this, we fine-tuned the existing Casanovo model on a non-enzymatic data set which significantly improves performance on non-tryptic data.

Do you have any plans to improve your model by including more modifications in your training data?

We don’t have short-term plans to increase the number of post-translational modifications in the current version model. However, it would be straightforward to fine-tune the current model with an extended training set containing the new modifications.


[1] Yilmaz et al., bioRxiv (2022), doi.org/10.1101/2022.02.07.479481
