10/01/2023 - Journal Club

Ad Hoc Learning of Fragmentation

by Zoltan Udvardy and Arslan Siraj

In our last Journal Club blog post, we presented ProteomicsML, a web platform with tutorials for machine learning in the field of proteomics. Today we stay with machine learning, but this time we present a new approach and model from Tom Altenburg et al., based on their article “Ad hoc learning of peptide fragmentation from mass spectra enables an interpretable detection of phosphorylated and cross-linked peptides”.

But what is ad hoc learning, and what does the name stand for? “Ad hoc” means ‘for this specific purpose’. With this name, the authors convey that they developed a model that learns from fragmentation for a specific purpose, in this case phosphorylation detection.

Current deep learning applications in the proteomics field use peptide sequence information, often together with masses of amino acids, ion types, losses, or combinations of these. In contrast, the model presented in this article abstracts, directly from fragmentation spectra, the patterns that are important for recognizing phosphorylated peptides. The key is that the model recognizes these essential patterns without being explicitly told about them.

To understand what the model is learning, we have to introduce the concept of interpretability. Interpretability is the degree to which a human can understand the cause of a decision made by the model, or the degree to which a human can consistently predict the model’s result. The authors used two methods to interpret what their model has learned. SHAP values (SHapley Additive exPlanations) were used to show that the model’s decisions are based on peaks that belong to actual fragment ions rather than on noise peaks. In addition, PathExplain was used to compute pairwise interactions per spectrum and to match those with relevant delta masses.
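To make the idea behind SHAP values more concrete, here is a minimal sketch that computes exact Shapley values for a tiny toy scoring function. Everything in it is an assumption made for illustration: `toy_score`, the peak names, and the score weights are hypothetical and have nothing to do with the actual AHLF model, and the brute-force averaging over all orderings is only feasible for a handful of features (the SHAP library relies on approximations instead).

```python
from itertools import permutations


def toy_score(present):
    """Hypothetical detector score for a set of peak names (not AHLF)."""
    score = 0.0
    if "y5" in present:
        score += 1.0
    if "y5-H3PO4" in present:
        score += 1.0
    # Interaction: a fragment together with its phospho loss is extra evidence.
    if "y5" in present and "y5-H3PO4" in present:
        score += 2.0
    # The "noise" peak never changes the score.
    return score


def shapley_values(features, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(present)
            present.add(f)
            totals[f] += value_fn(present) - before
    return {f: total / len(orderings) for f, total in totals.items()}


peaks = ["y5", "y5-H3PO4", "noise"]
phi = shapley_values(peaks, toy_score)
for name in peaks:
    print(name, phi[name])
```

In this toy setting the noise peak receives an attribution of zero, while the two fragment peaks share the credit for their joint contribution equally. This is the kind of behavior the authors checked for when they compared SHAP values of fragment-ion peaks and noise peaks.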

The authors propose a two-vector representation, holding intensity and mass-over-charge (m/z) remainder information, to encode spectra directly in their deep learning base model AHLF (ad hoc learning of fragmentation). The rationale behind this encoding is to promote the learning of associations between any peaks while respecting their location within a spectrum. Therefore, convolutional layers were used, because they preserve the location of a feature: the outputs of a convolution are equivariant with respect to their inputs, so shifting a peak in the input shifts the corresponding feature in the output by the same amount. Subsequently, the higher layers of the deep neural network can make use of both the presence and the location of peaks. To be exact, the authors use convolutions with gaps, commonly called dilated convolutions. Finally, due to the parameter-sharing properties of convolutions, the total number of trainable weights is low compared to a fully connected network. Overall, the model architecture allows AHLF to use the entire two-vector spectrum as-is.
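The two-vector idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the m/z range, the bin width, and the rule of keeping the most intense peak per bin are all hypothetical choices made here, not the settings reported in the paper.

```python
def encode_two_vector(peaks, mz_min=100.0, mz_max=110.0, bin_width=1.0):
    """Encode (m/z, intensity) peaks as two aligned vectors.

    Assumptions for illustration: fixed-width bins, and only the most
    intense peak is kept when several peaks fall into the same bin.
    """
    n_bins = int(round((mz_max - mz_min) / bin_width))
    intensity = [0.0] * n_bins
    remainder = [0.0] * n_bins
    for mz, inten in peaks:
        if not (mz_min <= mz < mz_max):
            continue  # peak lies outside the encoded range
        idx = min(int((mz - mz_min) // bin_width), n_bins - 1)
        if inten > intensity[idx]:
            intensity[idx] = inten
            # Position of the peak inside its bin, scaled to [0, 1).
            remainder[idx] = (mz - mz_min - idx * bin_width) / bin_width
    return intensity, remainder


def decode_mz(idx, remainder_value, mz_min=100.0, bin_width=1.0):
    """Recover the m/z of the peak stored in bin idx."""
    return mz_min + (idx + remainder_value) * bin_width


toy_peaks = [(101.25, 50.0), (101.75, 80.0), (104.5, 20.0), (250.0, 99.0)]
intensity, remainder = encode_two_vector(toy_peaks)
print(intensity)
print(remainder)
print(decode_mz(1, remainder[1]))
```

The intensity vector says where a peak sits at bin resolution, and the remainder vector restores the exact m/z within that bin. A coarse, fixed-length input can therefore still carry accurate mass information, which is what a convolutional network needs in order to work on the spectrum as-is.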

Figure 1. Illustration of how long-range associations can be learned by AHLF via dilated convolutions.
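As a rough companion to the figure, the following sketch implements a one-dimensional dilated convolution in plain Python. The kernel, the dilation rates, and the toy signal are assumptions chosen for illustration and do not reflect the actual AHLF configuration.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D convolution (cross-correlation form) with gaps between taps."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        for start in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Number of input positions seen by one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Toy "spectrum": two peaks that are 8 bins apart.
signal = [0.0] * 16
signal[2] = 1.0
signal[10] = 1.0

dense = dilated_conv1d(signal, [1.0, 1.0], dilation=1)
gapped = dilated_conv1d(signal, [1.0, 1.0], dilation=8)
print(max(dense))                                # 1.0: neighbors only
print(max(gapped), gapped.index(max(gapped)))    # 2.0 2: both peaks combined

# Equivariance: shifting the input by one bin shifts the response by one bin.
shifted = [0.0] + signal[:-1]
gapped_shifted = dilated_conv1d(shifted, [1.0, 1.0], dilation=8)
print(gapped_shifted.index(max(gapped_shifted)))  # 3

print(receptive_field(2, [1, 2, 4, 8]))           # 16
```

A two-tap kernel with a dilation of 8 responds to the pair of distant peaks, whereas the same kernel without gaps only sees direct neighbors. Stacking layers with growing dilation widens the receptive field quickly while the number of weights per layer stays the same.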

To demonstrate the ability of AHLFp, the variant of AHLF trained to detect spectra of phosphorylated peptides, the authors evaluated its performance on 19.2 million labeled spectra from 112 individual PRIDE repositories. In addition, they demonstrated the broad scope of their approach by applying AHLF to a distinct task, namely the detection of cross-linked peptides (AHLFx). To put this detection capability into practice, the model was also used to rescore peptide matches with Percolator, and the results were compared to PhoStar, a random forest model with carefully engineered phospho-detecting features.

Curious about the results? Check out the original article here!

As you might have seen in the blog post about ProteomicsML, we are now contacting the authors of publications covered in our Journal Club to conduct short interviews and gain further insights into these scientists’ work. Here we would like to thank Tom Altenburg and Bernhard Y. Renard for taking up the challenge and answering our questions! Below you can read our interview with the authors:

Blog post team: How does this research fit with your research interests and your institute’s research aims?

Tom Altenburg and Bernhard Y. Renard: In our group, we develop statistical and computational methods for high throughput techniques, including next-generation sequencing and MS-based proteomics.

Blog post team: What sparked the idea of using Ad hoc learning from peptide fragmentation?

Tom Altenburg and Bernhard Y. Renard: The data situation in MS-based proteomics is highly favorable for approaches like AHLF. There are two reasons for that: i) the amount of publicly available proteomics data is now enormous and constantly growing, and ii) there are methods that label data in an automated fashion (i.e. peptide and protein identification) in MS-based proteomics. Specifically, proteomics search engines can identify and annotate peptides from MS data automatically. On the one hand, this makes training deep learning methods like AHLF feasible. On the other hand, current, predominantly algorithmic approaches can be further improved by integrating ML-based methods like AHLF.

Blog post team: One main novelty of your work is that the model is learning from fragment spectra without any sequence information and expert knowledge. What other fields of LC-MS could use a similar approach?

Tom Altenburg and Bernhard Y. Renard: Related approaches are emerging in lipidomics and metabolomics. For example, a group at TUM is currently working on detecting lipid species using deep learning in a way similar to AHLF. Furthermore, there are learning-based scoring functions that extend or outperform algorithmic scoring schemes in the case of metabolomics. However, from our perspective, having a large pool of training data is key, and this may not be feasible in some other fields.

Blog post team: You show that pairwise interactions of respective delta masses coincide with expert knowledge about phosphopeptide fragmentation in a significant number of cases. Do you try to analyze cases you see in your pairwise analysis that are not explained with expert knowledge? Do you hope to draw/gain new knowledge from what AHLFp learned?

Tom Altenburg and Bernhard Y. Renard: To be honest, it was a bit of a surprise that for a large fraction of the most prominent pairwise interactions, the respective delta masses are relevant in the context of phosphoproteomics and we could explain them. However, we only considered losses and combinations thereof. One could take this a step further and match specific elemental compositions (i.e. compositions of C, H, N, O, and P). A comparison of compositions with or without phosphorus may give good additional insights. This, in turn, can be extended to other elements or signatures of other modifications to gain new knowledge and is thus an interesting future perspective.

Blog post team: One of us (Arslan) works with protein-nucleic acid crosslink data analysis, and has some questions regarding the applicability of AHLF with this type of data:

As mentioned in the manuscript, AHLFx improves the results for protein-protein crosslinks. For protein-nucleic acid crosslinks, we can find spectra of a peptide with nucleic acid, where the crosslinker binds to the nucleic acid (mass adducts). Upon fragmentation, the MS/MS spectra are more challenging: for example, we can find a, b, and y ions, precursor ions, marker ions, and sometimes a shifted ion series that is not very clean. Do you have any suggestions on how we could adapt AHLF in protein-nucleic acid crosslinking protocols?

Tom Altenburg and Bernhard Y. Renard: The identification rate can be improved by using AHLF as an additional feature for rescoring or to pre-filter spectra before running a dedicated downstream analysis. Specifically, spectra that contain fragments from a cross-linker and a peptide may be treated differently from spectra that contain fragments from all three: a cross-linker, a peptide, and a nucleic acid strand. For example, if a spectrum does not contain fragments from a nucleic acid strand, it may be searched by a classical search engine or by existing cross-linking search engines, such as xisearch. This at least gives an idea of which peptides may be cross-linked, so that a dedicated search only needs to be performed for those peptides and for the filtered spectra (e.g. those predicted by AHLFx) that contain protein-nucleic acid cross-links.

Blog post team: The AHLF model is trained without crosslinking spectra, and as your results show, for protein-protein crosslinking it is fine to use transfer learning. Do you think the transfer learning approach could work for protein-nucleic acid crosslinking?

Tom Altenburg and Bernhard Y. Renard: The elemental composition of nucleotides differs from amino acids. Therefore, it might be a good entry point. At least, I would expect patterns (shifts or groups of peaks) that belong to the peptide, others that point to the DNA, and others that belong to the cross-linker. However, the data is probably still very limited in this area. The prediction performance (e.g., AHLFx) may vary with the instrument type and thus might be an additional constraint. In any case, transfer learning (initializing with a pre-trained AHLFp or AHLFx) is probably a good starting point for protein-nucleic acid crosslinking data.

Blog post team: How does the deep learning model behave if we add traditional information as predefined features (as done in Prosit)? It might affect the sensitivity of peptide identification.

Tom Altenburg and Bernhard Y. Renard: If a certain expected feature can be learned by the model, i.e. the architecture has no inherent bottleneck regarding that type of feature, and if there is enough data for training, then it should not be necessary to include predefined features. However, if either of these two requirements is not met, adding features could help, e.g. to compensate for the lack of data. Alternatively, transfer learning goes in a similar direction. The model is trained on a domain with lots of data (learning ubiquitous and general features such as losses and delta masses, etc.). At that point, does it make a difference whether these features were predefined or rather pre-learned? Imagine we define a specific neutral loss, fixed conceptually and fixed numerically. The model must accept it and make it work, for better or worse. In contrast, if the model learned some neutral loss but its value is slightly off (w.r.t. the new domain), then it can adjust the neutral loss by adjusting the respective weight during transfer learning. Another intriguing example is the combinatorial complexity of internal fragments. Internal fragments occur when a fragment undergoes a second (or further) fragmentation event. Inevitably, the number of combinations of possible fragments (fragments outside the typical fragment ladder assumption) explodes. The basic building blocks (i.e. masses of amino acids and neutral losses) may be relatively easy to predefine, but precalculating all possible internal fragments would be rather expensive. Luckily, this is where deep learning comes in very handy: if the complexity (i.e. the combinatorics of internal fragments) follows some kind of hierarchy (i.e. fragments are subsets of each other), the hierarchical structure of a deep learning model might have a chance to pick this up and help us in this situation, provided that there was enough training data.
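The growth in the number of internal fragments can be illustrated with a short counting sketch. It is a simplification under stated assumptions: only backbone positions are counted, an internal fragment is taken to be any contiguous stretch that contains neither terminus, and neutral losses, charge states, and modifications (each of which would multiply the numbers further) are ignored. The function names and the example peptide are hypothetical.

```python
def terminal_fragments(peptide):
    """Prefix and suffix sequences of the classic fragment ladder."""
    n = len(peptide)
    prefixes = [peptide[:i] for i in range(1, n)]
    suffixes = [peptide[i:] for i in range(1, n)]
    return prefixes + suffixes


def internal_fragments(peptide):
    """Contiguous stretches that contain neither terminus of the peptide."""
    n = len(peptide)
    return [peptide[i:j] for i in range(1, n - 1) for j in range(i + 1, n)]


peptide = "PEPTIDE"
ladder = terminal_fragments(peptide)
internal = internal_fragments(peptide)
print(len(ladder), len(internal))  # 12 15

# The ladder grows linearly with peptide length, internal fragments quadratically.
for length in (7, 15, 30):
    toy = "A" * length
    print(length, len(terminal_fragments(toy)), len(internal_fragments(toy)))
```

Even in this stripped-down count, a 30-residue peptide has 58 ladder fragments but 406 internal stretches, and every internal fragment is a sub-stretch of a longer one. That nesting is the kind of hierarchy the answer above refers to.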

Blog post team: What are your plans with the model, if you have any?

Tom Altenburg and Bernhard Y. Renard: There are many interesting future directions, and this is a fast-moving field. For example, we had the idea to extend AHLF to further improve phosphosite localization and therefore integrate the SHAP values in a way that helps us to pinpoint localization. In our paper, we could show that the FLR is not inflated by using AHLFx. However, one could take this a step further, integrate the SHAP values, and build a dedicated localization tool based on this idea.

Again, we would like to thank Tom Altenburg and Bernhard Y. Renard for elaborating on their paper! Finally, thanks for reading our post, and stay tuned for further content!
