Mass spectrometry-based proteomics imputation using self-supervised deep learning

Hello and welcome back to another issue of the PROTrEIN Journal Club! This time we cover two important topics in proteomics: missing values and imputation. We will try to shed light on some of the challenges around them with the aid of the article titled “Mass spectrometry-based proteomics imputation using self supervised deep learning” by Henry Webel et al.¹ Also, after a hiatus of a few weeks, we are back with some machine learning too.

The search for biomarkers and the identification of new drug targets are important use cases of mass spectrometry (MS)-based label-free proteomics. However, the downstream analysis of acquired data is heavily affected by missing values. Missing values can have many causes, but the main contributing factors fall into two categories: biological factors, such as proteins being absent from the sample or present at an abundance below the instrument’s detection limit, and analytical factors, for instance poor ionisation efficiency, poor peptide-spectrum matches, or the stochasticity of precursor selection for fragmentation.²

To address the problem of missing values, different imputation methods have been developed. These methods can impute quantification values at various levels, such as the precursor, aggregated peptide, and protein group levels. One common approach is median imputation per feature across samples; another is interpolation of missing features from close replicates. A more sophisticated approach imputes data at the protein group level using random draws from a down-shifted normal (RSN) distribution. The assumption here is that values are missing because the protein is absent from the sample or its abundance is below the detection limit. However, this assumption can create biases and skew the downstream analysis. The authors of the paper present three machine learning models to predict missing quantification values.
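To make the two heuristics concrete, here is a minimal sketch in plain Python. It is not the implementation used in the paper or in any proteomics tool: the toy matrix, the function names, and the `shift` and `scale` parameters are all assumptions chosen for illustration, and drawing the RSN values per sample rather than globally is also an assumption. Rows are samples, columns are features, and `None` marks a missing value.

```python
import random
import statistics


def median_impute(matrix):
    """Replace missing entries (None) with the per-feature median across samples.

    Assumes every feature has at least one observed value.
    """
    n_features = len(matrix[0])
    medians = []
    for j in range(n_features):
        observed = [row[j] for row in matrix if row[j] is not None]
        medians.append(statistics.median(observed))
    return [[medians[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


def shifted_normal_impute(matrix, shift=1.8, scale=0.3, seed=0):
    """Replace missing entries with random draws from a down-shifted normal.

    For each sample, the normal distribution is centred `shift` standard
    deviations below the sample mean and narrowed to `scale` standard
    deviations (both values are illustrative assumptions).
    Assumes every sample has at least two observed values.
    """
    rng = random.Random(seed)
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mu = statistics.mean(observed)
        sd = statistics.stdev(observed)
        imputed.append([rng.gauss(mu - shift * sd, scale * sd) if v is None else v
                        for v in row])
    return imputed


# Toy log-intensity matrix: 3 samples x 3 features (hypothetical values).
data = [
    [25.0, 30.0, None],
    [26.0, None, 28.0],
    [27.0, 31.0, 29.0],
]

median_filled = median_impute(data)      # gaps become 28.5 and 30.5
rsn_filled = shifted_normal_impute(data)  # gaps become low-intensity draws
```

The contrast is the point: median imputation places a missing value in the middle of the observed distribution, while RSN deliberately places it at the low end, which only makes sense if the value is missing because of low abundance.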

These alternative deep learning (DL) models use three different strategies to impute missing values in proteomics data sets: collaborative filtering (CF), a denoising autoencoder (DAE), and a variational autoencoder (VAE). The training objectives, complexity, and therefore capabilities of the models differ, which led the authors to evaluate their performance against each other. The CF and DAE objectives focus only on reconstruction, whereas the VAE adds a constraint on the latent representation. Furthermore, the first two modeling approaches use a mean-squared error (MSE) reconstruction loss, whereas the VAE uses a probabilistic loss to assess the reconstruction error.
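The difference between the objectives can be sketched with a few small functions. The post only says that the VAE uses "a probabilistic loss" and "a constraint on the latent representation"; the Gaussian likelihood and the KL divergence to a standard normal prior used here are the textbook VAE formulation and are an assumption about the paper, not a description of its code. All function names are hypothetical.

```python
import math


def masked_mse(y_true, y_pred):
    """Mean-squared error over observed targets only (None targets are skipped)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t is not None]
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


def gaussian_nll(y_true, mean, log_var):
    """Mean negative log-likelihood of observed targets under N(mean, exp(log_var)).

    Unlike MSE, the model also predicts a variance per value, so it can
    express how uncertain each reconstruction is.
    """
    terms = [
        0.5 * (math.log(2 * math.pi) + lv + (t - m) ** 2 / math.exp(lv))
        for t, m, lv in zip(y_true, mean, log_var)
        if t is not None
    ]
    return sum(terms) / len(terms)


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m ** 2 - 1.0 - lv for m, lv in zip(mu, log_var))


# One sample with a missing second feature (hypothetical numbers).
target = [1.0, None, 3.0]
reconstruction = [1.5, 2.0, 2.0]

cf_or_dae_loss = masked_mse(target, reconstruction)

# VAE-style loss: reconstruction likelihood plus the latent constraint.
vae_loss = (gaussian_nll(target, reconstruction, [0.0, 0.0, 0.0])
            + kl_to_standard_normal(mu=[1.0], log_var=[0.0]))
```

In both cases the loss is computed only on observed values, which is what makes the training self-supervised: the models learn to reconstruct values that were measured and are then asked to fill in the ones that were not.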

These models were applied to large (N≈450) and small (N≈50) MS-based proteomics data sets of HeLa cell line tryptic lysates acquired over two years during continuous quality control in two different labs, at the Novo Nordisk Foundation Center for Protein Research (NNF CPR) and the Max Planck Institute of Biochemistry. The effectiveness of the models was assessed in comparison to two heuristic-based methods: median imputation and interpolation of missing features. The results show that the self-supervised models, i.e. CF, DAE, and VAE, outperform the heuristic-based approaches, reaching roughly half the mean absolute error (MAE) of median imputation. By identifying 23.6% more significantly differentially abundant protein groups, the VAE model in particular is shown to be useful for disease prediction.

The two autoencoder architectures represented a sample in a low-dimensional space using all of the data. The CF model, in contrast, needed to learn a latent embedding space for both the samples and the features. While the DL methods and median imputation can impute all missing values, interpolation cannot replace a value that is missing in all replicates. The study also finds that the models’ performance depends on how frequently a protein group is observed, with better performance for groups observed in more than 80% of the samples. The models’ overall performance is poorest for protein-level data, better for aggregated peptides, and best for precursors.
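The limitation of replicate-based interpolation is easy to see in a small sketch. The post does not describe how the interpolation was implemented, so the version here (fill a gap with the mean of the same feature in the other replicates of the same group) is only an assumed stand-in, and the function name, group labels, and values are hypothetical.

```python
import statistics


def replicate_impute(matrix, groups):
    """Fill a missing value (None) with the mean of that feature in the replicate group.

    `groups[i]` is the replicate-group label of sample i. A value that is
    missing in every replicate of a group stays missing.
    """
    members_by_group = {}
    for idx, label in enumerate(groups):
        members_by_group.setdefault(label, []).append(idx)

    result = [list(row) for row in matrix]
    for members in members_by_group.values():
        for j in range(len(matrix[0])):
            observed = [matrix[i][j] for i in members if matrix[i][j] is not None]
            if not observed:
                continue  # missing in all replicates: nothing to interpolate from
            fill = statistics.mean(observed)
            for i in members:
                if matrix[i][j] is None:
                    result[i][j] = fill
    return result


samples = [
    [25.0, None],  # group "A"
    [27.0, None],  # group "A": feature 1 is missing in all "A" replicates
    [None, 30.0],  # group "B"
    [24.0, 32.0],  # group "B"
]
groups = ["A", "A", "B", "B"]

filled = replicate_impute(samples, groups)
# filled[2][0] is taken from the other "B" replicate;
# filled[0][1] and filled[1][1] remain None.
```

Median imputation and the DL models do not have this gap, because they draw on all samples rather than only on the replicates of one group.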

In the development datasets, the three DL approaches perform about the same, while all of them outperform interpolation and reach around half the MAE of median imputation. Put differently, median imputation and interpolation performed about 1.8–2.4 times worse than the self-supervised models.
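A comparison like this rests on simulated missing values: some observed intensities are hidden, each method imputes them, and the MAE is computed against the hidden truth. The sketch below shows the idea for median imputation only. It assumes values are hidden completely at random, and the matrix, the hold-out fraction, and all function names are illustrative assumptions rather than the authors' evaluation code.

```python
import random
import statistics


def hold_out(matrix, fraction=0.25, seed=0):
    """Hide a random fraction of observed entries; return the masked copy and hidden positions."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v is not None]
    hidden = rng.sample(observed, max(1, int(len(observed) * fraction)))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def median_impute(matrix):
    """Per-feature median imputation; assumes each feature keeps one observed value."""
    medians = [statistics.median([row[j] for row in matrix if row[j] is not None])
               for j in range(len(matrix[0]))]
    return [[medians[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


def mae(truth, imputed, positions):
    """Mean absolute error between true and imputed values at the hidden positions."""
    return sum(abs(truth[i][j] - imputed[i][j]) for i, j in positions) / len(positions)


# Fully observed toy matrix: 4 samples x 3 features (hypothetical values).
truth = [
    [20.0, 25.0, 30.0],
    [21.0, 26.0, 31.0],
    [22.0, 27.0, 32.0],
    [23.0, 28.0, 33.0],
]

masked, hidden = hold_out(truth)  # hides 3 of the 12 values
score = mae(truth, median_impute(masked), hidden)
```

Running the same hidden positions through a second method and dividing the two MAE values gives a ratio of the kind quoted above.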

[Figure: Performance of imputation methods at the level of protein groups, aggregated peptides, and precursors for MaxQuant outputs.]

The authors tested the impact of their imputation techniques on a real-world dataset of 455 blood plasma proteomics samples from a cohort of patients with alcohol-related liver disease (ALD) and healthy controls. The study³ from which the data originated was looking for biomarkers of ALD that could enable MS-based liver disease testing. One of the key pathological features of ALD is fibrosis, so proteins related to fibrosis were monitored. In addition, that study trained machine learning models to predict fibrosis and inflammation from the MS plasma protein groups. Because the referenced article³ used the RSN imputation approach, the authors compared their methods to the results of that publication. From the PIMMS methods, the authors selected the variational autoencoder model for imputation and found 23.6% more differentially abundant proteins. They then investigated whether these differentially regulated proteins can be associated with disease using the DISEASES database and found that 20 of them had an association entry for fibrosis. With the newly found proteins, the authors retrained the predictive model from the original study for liver condition development. The retrained model performed as well as or slightly better than the original model, leading them to conclude that these proteins can have predictive power.

Just like in the previous editions of the PROTrEIN Journal Club, we sent our questions to the authors to gain more insight into their work. Please read our short interview below:

Blog post team: How do the self-supervised deep learning models that you chose contribute to the imputation of missing values in label-free quantification (LFQ) proteomics data? What advantages do they offer compared to other models?

Henry Webel: The models are in the category of machine learning models. In comparison to let’s say a random forest, you will additionally get embeddings of the features and samples, either separately (collaborative filtering) or joined (Autoencoder based architectures). Clustering in the embedding – also called latent – space could be used to compare it to e.g. hierarchical clustering of the original data.

Blog post team: The paper suggests assessing if machine learning models can be trained on lower-level data. Could you elaborate on this suggestion and discuss the potential benefits and challenges associated with training machine learning models using lower-level data in the context of MS-based proteomics?

Henry Webel: In mass spectrometry-based bottom-up proteomics the unit of measurement are ions of peptides. The aggregation to protein groups therefore normally implies an implicit imputation using a neutral element as described by Lazar et al.⁴ Therefore, the elements of interest should be rather measured peptides, but currently the dominant approach is to use aggregated protein groups.

Blog post team: It is mentioned that the performance of the self-supervised models was better than that of the heuristic approaches, which included median, interpolation, and shifted normal distribution imputation. What was the main reason for that, and could it be affected by features of the data or by the size of the data? Do you think supervised models could be used instead of self-supervised models?

Henry Webel: The performance comparison always depends on the design of the comparison. We therefore extended the down-stream comparison in a revised version of the article. Additionally, we added supervised models such as random forests published as R packages. The main idea is to allow users to compare how well the imputation approaches perform on missing completely at random data (MCAR).

Blog post team: What are the future outlooks for PIMMS? Are you planning further developments?

Henry Webel: We added now many R methods to the comparison. Otherwise it would be great to extend the comparison by other methods of creating the validation and test data splits to get an overview of the methods used in the field.

References:

1. Webel, H. et al. Mass spectrometry-based proteomics imputation using self supervised deep learning. bioRxiv 2023.01.12.523792.

2. Jin, L. et al. A comparative study of evaluating missing value imputation methods in label-free proteomics. Sci. Rep. 11, 1760 (2021).

3. Niu, L. et al. Noninvasive proteomic biomarkers for alcohol-related liver disease. Nat. Med. 28, 1277–1287 (2022).

4. Lazar, C., Gatto, L., Ferro, M., Bruley, C. & Burger, T. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies. J. Proteome Res. 15, 1116–1125 (2016).

About the author

Zoltan Udvardy and Alireza Nameni
