Modeling Lower-Order Statistics to Enable Decoy-Free FDR Estimation in Proteomics

In our previous Journal Club blog post, we discussed «nanopore profiling», a method for identifying proteins within the field of proteomics. Today we turn to how identifications can be validated statistically without the traditional target-decoy method, using an improved version of the decoy-free approach. This post is based on Dominik et al.’s article «Modeling lower-order statistics to enable decoy-free FDR estimation in proteomics», which presents a new method for validating identifications in proteomics.

There are two methodologies for calculating the false discovery rate (FDR) in proteomics: decoy-based and decoy-free. In decoy-based approaches, spectra are compared to the sequences of naturally occurring (target) proteins and to decoys generated in silico from the peptide sequences of the target database. The conventional target-decoy FDR calculation uses decoy peptide-spectrum matches (PSMs) to capture the characteristics of incorrect target PSMs. Depending on the decoy generation method, this increases the computational cost and decreases the likelihood that correct PSMs will be considered. In addition, decoy-based methods may under- or overestimate the FDR if the scoring function used to search the target-decoy database is biased, or if there are too few decoys in the region where the models of the correct and incorrect PSMs overlap. Decoy-free statistical validation tools, which employ only target PSMs, constitute another category of peptide identification validation methods. While the majority of decoy-free tools model the score distribution of the highest-scoring PSMs, some also exploit the PSMs with lower scores. Current decoy-free methods typically either lack a solid theoretical foundation and rely heavily on the empirical characteristics of the data, or rely on theoretical assumptions that are not always well justified.
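As a point of reference, the conventional target-decoy estimate described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of any particular tool: the function name, the `(score, is_decoy)` data layout, and the example scores are all assumptions made for this post, and the simple decoys-over-targets ratio is only one of several estimators in use.

```python
def target_decoy_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    `psms` is an iterable of (score, is_decoy) pairs; higher scores are better.
    The number of decoy hits above the threshold stands in for the number of
    incorrect target hits (assumed estimator: decoys / targets).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


# Hypothetical search results: (score, is_decoy)
psms = [(9.1, False), (8.7, False), (8.2, True), (7.9, False),
        (7.5, False), (6.8, True), (6.1, False), (5.4, True)]

fdr_at_7 = target_decoy_fdr(psms, 7.0)  # 1 decoy and 4 targets score >= 7.0
fdr_at_5 = target_decoy_fdr(psms, 5.0)  # 3 decoys and 5 targets score >= 5.0
```

The estimate depends entirely on how many decoys fall above the threshold, which is why sparse decoys in the overlap region make it unstable.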

In this article, a new approach to decoy-free FDR estimation is proposed: a semiempirical framework based on lower-scoring target PSMs. It combines a theoretical relationship between the parameters of the distributions of lower-order statistics of the log-transformed e-value (TEV) score with an empirical optimization that fits a single parameter to real data. In theory, the parameters μ and β are shared across the TEV distributions of different orders. However, empirical estimation of these parameters reveals slight deviations from the theoretical values. To address this, the article proposes two semiempirical optimization methods for estimating adjusted μ and β values for the Top Null Model (TNM): a linear regression-based approach and a mean β-based approach. Both are applied to each data set, and the best TNM variant is selected based on the Bayesian information criterion (BIC).
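To make the BIC-based selection step concrete, here is a small sketch of how two candidate (μ, β) pairs could be compared. It assumes that the null model is a Gumbel (maximum) distribution over TEV scores; the score values and both candidate parameter pairs are invented for illustration, and the function names are ours, not the article's.

```python
import math


def gumbel_log_likelihood(scores, mu, beta):
    """Log-likelihood of `scores` under a Gumbel (maximum) distribution."""
    total = 0.0
    for x in scores:
        z = (x - mu) / beta
        total += -math.log(beta) - z - math.exp(-z)
    return total


def bic(scores, mu, beta, n_params=2):
    """Bayesian information criterion, k*ln(n) - 2*ln(L); lower is better."""
    log_l = gumbel_log_likelihood(scores, mu, beta)
    return n_params * math.log(len(scores)) - 2.0 * log_l


def select_by_bic(scores, candidates):
    """Return the name of the candidate (mu, beta) pair with the lowest BIC."""
    return min(candidates, key=lambda name: bic(scores, *candidates[name]))


# Hypothetical TEV scores and hypothetical parameter estimates from the two
# optimization variants; none of these numbers come from the article.
scores = [0.10, 0.12, 0.15, 0.11, 0.18, 0.13, 0.14, 0.16, 0.12, 0.20]
candidates = {
    "linear_regression": (0.125, 0.03),
    "mean_beta": (0.30, 0.03),
}
best = select_by_bic(scores, candidates)
```

Because both candidates here have the same number of parameters, the BIC comparison reduces to a likelihood comparison; the penalty term only matters when the competing models differ in complexity.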

The performance of different Top Null Models (TNMs) estimated using lower-order models generated by the Tide and Comet search engines was evaluated. The evaluation focused on FDR estimation and compared the TNM approaches against Couté’s method, the Gumbel TEV model, and the common decoy distribution (CDD) method. The comparison considered metrics such as false discovery proportion (FDP) and the number of correctly identified spectra at different FDR thresholds. FDR control was performed using the Benjamini-Hochberg (BH) procedure. A ground truth data set was created by searching files against a target-decoy database and generating incorrect PSMs. The TNMs were generated based on the top-scoring PSMs, and the optimal μ and β estimates were determined using the proposed semiempirical estimation methods. FDR control was then applied using the BH procedure, and the results were evaluated using the ground truth labels. The process was repeated on bootstrapped samples, and mean FDP and correct identification values with confidence intervals were calculated.
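The evaluation loop described above, BH-based FDR control followed by an FDP check against ground truth labels, can be sketched as follows. We assume here that each top-scoring PSM is turned into a p-value through the survival function of a Gumbel null model; that conversion, the function names, and the example p-values and labels are illustrative assumptions rather than the article's code.

```python
import math


def gumbel_p_value(score, mu, beta):
    """P(X >= score) under a Gumbel (maximum) null model with parameters mu, beta."""
    z = (score - mu) / beta
    return -math.expm1(-math.exp(-z))  # 1 - exp(-exp(-z)), computed stably


def benjamini_hochberg(p_values, alpha):
    """Return a list of booleans marking the hypotheses rejected at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its BH threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    accepted = [False] * m
    for idx in order[:cutoff_rank]:
        accepted[idx] = True
    return accepted


def false_discovery_proportion(accepted, is_correct):
    """Fraction of accepted PSMs that the ground truth labels as incorrect."""
    n_accepted = sum(accepted)
    if n_accepted == 0:
        return 0.0
    n_false = sum(1 for a, c in zip(accepted, is_correct) if a and not c)
    return n_false / n_accepted


# Hypothetical p-values and ground truth labels for eight PSMs.
p_values = [0.001, 0.004, 0.012, 0.018, 0.30, 0.45, 0.60, 0.80]
is_correct = [True, True, False, True, False, False, True, False]

accepted = benjamini_hochberg(p_values, alpha=0.05)
fdp = false_discovery_proportion(accepted, is_correct)
```

Repeating the last two calls on resampled PSM lists would give the bootstrapped mean FDP and confidence intervals mentioned above.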

To test the quality of lower-order models and top null models, data sets of natural peptides from five different species (H. sapiens, M. musculus, A. thaliana, S. cerevisiae, and E. coli) were taken from project repositories in the PRIDE archive. Synthetic human peptide data sets from the ProteomeTools project were used in the validation study. Only MS2 spectra with charge states of 2+, 3+, and 4+ were taken into consideration for the evaluation. The lower-order models were found to fit the empirical data well, with some discrepancies for lower order indices. The maximum likelihood estimation (MLE) method performed better than the method of moments (MM) for parameter estimation in the lower-order models. The proposed models accurately fit the empirical distributions and were not significantly affected by the size difference between the analyzed data sets. The study also compared the performance of the lower-order models with decoy-based models and Couté’s method, and found that the lower-order models estimated FDRs better, particularly for Tide results. However, for Comet results, Couté’s method and decoy-based models overestimated FDRs, while the common decoy distribution method provided second-best estimates. The differences in performance between Tide and Comet can be attributed to the less rigorous e-values produced by Comet. The proposed approach, combining theoretical foundations with empirical optimization, showed resistance to issues associated with statistical scoring in shotgun proteomics.
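For readers unfamiliar with the two estimation methods compared above, the sketch below fits a plain Gumbel (maximum) distribution with both the method of moments and maximum likelihood. It is only an illustration of the two techniques on a simple top-score model: the article fits distributions of lower-order statistics, which are more involved, and the simulated data, parameter values, and function names here are assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_mm(xs):
    """Method of moments: match the sample mean and variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def fit_gumbel_mle(xs, iterations=200):
    """Maximum likelihood: solve the score equation for beta by bisection."""
    n = len(xs)
    mean = sum(xs) / n
    x_min = min(xs)

    def score_equation(beta):
        # beta - mean(x) + weighted mean of x with weights exp(-x / beta);
        # shifting by x_min keeps the exponentials in (0, 1].
        weights = [math.exp(-(x - x_min) / beta) for x in xs]
        weighted_mean = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        return beta - mean + weighted_mean

    hi = max(xs) - x_min  # score equation is non-negative here
    lo = hi * 1e-6        # and negative here for non-degenerate data
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score_equation(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    mean_exp = sum(math.exp(-(x - x_min) / beta) for x in xs) / n
    mu = x_min - beta * math.log(mean_exp)
    return mu, beta


def sample_gumbel(rng, mu, beta):
    """Draw one Gumbel (maximum) variate by inverting the CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return mu - beta * math.log(-math.log(u))


# Simulated scores with assumed true parameters mu = 0.10, beta = 0.02.
rng = random.Random(0)
data = [sample_gumbel(rng, 0.10, 0.02) for _ in range(5000)]

mu_mm, beta_mm = fit_gumbel_mm(data)
mu_mle, beta_mle = fit_gumbel_mle(data)
```

On well-behaved simulated data both estimators recover the parameters closely; differences of the kind reported in the article are more likely to show up on real score distributions, where outliers and truncation affect the sample moments.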

The proposed approach eliminates the need for decoy sequences and offers improved accuracy compared to alternative methods. While further evaluation and tuning may be necessary for different identification tools, this work highlights the untapped potential of lower-scoring PSMs in enhancing statistical validation methods in proteomics research.

Finally, thanks for reading our post, and stay tuned for further content!

About the author

Louise Buur and Arslan Siraj
