Global ETD Search

Computational methods to estimate error rates for peptide identifications in mass spectrometry-based proteomics / Beräkningsmetoder för att uppskatta felfrekvensen hos peptididentifikationer inom masspektrometri-baserad proteomik

In the field of proteomics, tandem mass spectrometry (MS/MS) is the core technology, promising to identify the peptide components of complex mixtures on a large scale. The current bottleneck is reducing the error rate of peptide identifications and assigning accurate statistical estimates to them. In this work, we introduce techniques for identifying chimeric spectra, in which two or more precursor ions with similar mass and retention time are co-fragmented and sequenced by the MS/MS instrument. The hypothesis is that these co-fragmented ions are one cause of the high error rate of peptide identifications. We show that chimeric spectra are highly correlated with the ranking scores and can reduce the number of positive identifications. Additionally, we address the problem of assigning a posterior error probability (PEP) to the individual peptide-spectrum matches (PSMs) obtained from search engines. This problem is computationally more difficult than estimating an error rate associated with a large collection of PSMs, such as the false discovery rate (FDR). Existing methods rely on parametric or semiparametric models of the underlying score distribution. We provide a kernel logistic regression procedure that makes no explicit assumptions about the score distribution. With an appropriate positive definite Gaussian kernel, the resulting PEP estimates are shown to be robust: the PEP-derived q-values correspond closely to the FDR-derived q-values. Furthermore, setting a threshold on PEP-derived q-values accepts at least 200 more significant PSMs than setting the same threshold on FDR-derived q-values. Finally, we show that this nonparametric kernel logistic regression method, which is well established in the statistics literature, can produce accurate PEP estimates for different types of PSM score functions and data.
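The comparison between PEP-derived and FDR-derived q-values mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the function names are hypothetical, and it assumes two common conventions, namely that a PEP-derived q-value is the mean PEP of all PSMs ranked at or above a given PSM, and that the FDR at a score threshold is estimated as the ratio of decoy to target PSMs scoring at or above that threshold (higher score meaning a better match).

```python
from typing import List, Sequence


def pep_to_qvalues(peps: Sequence[float]) -> List[float]:
    """Assumed convention: q-value = running mean of PEPs sorted ascending."""
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    qvals = [0.0] * len(peps)
    running_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        running_sum += peps[idx]
        # Expected fraction of incorrect PSMs among the `rank` best ones.
        qvals[idx] = running_sum / rank
    return qvals


def target_decoy_qvalues(
    target_scores: Sequence[float], decoy_scores: Sequence[float]
) -> List[float]:
    """Assumed convention: FDR(t) = #decoys >= t / #targets >= t."""
    order = sorted(
        range(len(target_scores)),
        key=lambda i: target_scores[i],
        reverse=True,
    )
    fdrs = []
    for idx in order:
        t = target_scores[idx]
        n_decoys = sum(1 for d in decoy_scores if d >= t)
        n_targets = sum(1 for s in target_scores if s >= t)  # always >= 1
        fdrs.append(min(1.0, n_decoys / n_targets))
    # The q-value is the smallest FDR at which the PSM is still accepted,
    # so take a running minimum from the worst-scoring PSM upwards.
    qvals = [0.0] * len(target_scores)
    best = 1.0
    for pos in range(len(order) - 1, -1, -1):
        best = min(best, fdrs[pos])
        qvals[order[pos]] = best
    return qvals
```

Under these assumptions, counting the PSMs whose q-value falls below a chosen threshold (for example 0.01) with each function gives the kind of comparison the abstract describes.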

http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-124027

Computational Mathematics

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-124027
Date	January 2013
Creators	Liang, Xiao
Publisher	KTH, Numerisk analys, NA
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	Swedish
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	TRITA-MAT-E ; 2013:32
