Statistical and machine learning methods to analyze large-scale mass spectrometry data

As in many other fields, biology is faced with enormous amounts of data that contain valuable information yet to be extracted. The field of proteomics, the study of proteins, has the luxury of large repositories of data from tandem mass spectrometry experiments, readily accessible to anyone who is interested. At the same time, there is still a lot to discover about proteins as the main actors in cell processes and cell signaling. In this thesis, we explore several ways to extract more information from the available data using methods from statistics and machine learning. In particular, we introduce MaRaCluster, a new method for clustering mass spectra in large-scale datasets. This method uses statistical methods to assess the similarity between mass spectra, followed by the conservative complete-linkage clustering algorithm. The combination of these two resulted in up to 40% more peptide identifications on its consensus spectra compared to the state-of-the-art method. Second, we attempt to clarify and promote protein-level false discovery rates (FDRs). Frequently, studies fail to report protein-level FDRs even though proteins are the actual entities of interest. We provided a framework in which to discuss protein-level FDRs in a systematic manner, to open up the discussion and take away potential hesitance. We also benchmarked several scalable protein inference methods and included the best one in the Percolator package. Furthermore, we added functionality to the Percolator package to accommodate the analysis of studies in which many runs are aggregated. This reduced the run time for a recent study of a draft human proteome from almost a full day to just 10 minutes on a commodity computer, resulting in a list of proteins together with their corresponding protein-level FDRs. / QC 20160412
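
The abstract describes complete-linkage clustering as "conservative". A minimal sketch of why, assuming a precomputed pairwise distance matrix: two clusters are merged only if every pair of members across them is within the threshold. This is not MaRaCluster's implementation; the function name `complete_linkage`, the threshold, and the toy distances are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def complete_linkage(dist, threshold):
    """Agglomerative complete-linkage clustering on a symmetric distance matrix.

    dist: list of lists, dist[i][j] is the distance between items i and j.
    threshold: merging stops once the closest pair of clusters is farther
               apart than this value.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: cluster distance is the LARGEST pairwise distance.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest complete-linkage distance.
        (x, y), best = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Toy distances (made up): item 2 is close to item 0 but far from item 1.
dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.6, 0.8],
    [0.2, 0.6, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
clusters = complete_linkage(dist, threshold=0.5)
print(clusters)  # [[0, 1], [2], [3]]
```

In this toy case item 2 stays on its own even though it is close to item 0, because its distance to item 1 exceeds the threshold. Single linkage would have merged it, which is the sense in which complete linkage is the more conservative choice.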

mass spectrometry - LC-MS/MS

statistical analysis

data processing and analysis

protein inference

large-scale studies

simulation

Bioinformatics and Systems Biology

Bioinformatik och systembiologi

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-185149
Date	January 2016
Creators	The, Matthew
Publisher	KTH, Genteknologi, Stockholm
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Licentiate thesis, comprehensive summary, info:eu-repo/semantics/masterThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	TRITA-BIO-Report, 1654-2312 ; 2016:3
