The labyrinth of protein classification: a pipeline for selection and classification of biological data

Recent progress in fundamental biological sciences and medicine has considerably increased the quantity of data that can be studied and processed. The main limitation now is not retrieving data, but rather extracting useful biological insights from the large datasets accumulated. More recent advances have provided detailed high-density data regarding metabolism (metabolomics) and protein expression (proteomics). Clearly, no single analytic method can provide a comprehensive understanding. Rather, the ability to link available data together in a coherent manner is required to obtain a complete view. The improving application of Machine Learning (ML) techniques provides the means to make continuous progress in processing complex datasets. A brief discussion is offered on the advantages of ML, the state of the art in Deep Learning (DL) for protein predictions, and the importance of ML in biological data processing. Noise stemming from incorrect classification or arbitrary or ambiguous labelling of data may arise when ML techniques are applied to large datasets. Furthermore, the stochasticity of biological systems needs to be considered to evaluate the outputs correctly. Here we show the potential of a workflow to answer biological questions while taking into consideration a perturbation of the biological data. To control the applicability of models and maximize predictivity, in silico filtering schemes can usefully be applied as an “Ockham’s razor” before using any ML technique. After reviewing different DL approaches for protein prediction purposes, this work shows that a computational approach in the filtering steps is a valuable tool for protein classification when biological features are not fully annotated or reviewed. The in silico approach has identified putative proline transporters in fungi and plants as well as carotenoid biosynthetic gene products in the plant family Brassicaceae. The proposed method is suitable for extracting features for classification and then maximizing the use of a DL approach.

http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-201239

computational biology

bioinformatics

mitochondrial metabolism

carotenoid biosynthesis

Biological Sciences

Bioinformatics (Computational Biology)

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:su-201239
Date	January 2022
Creators	Pelosi, Benedetta
Publisher	Stockholms universitet, Institutionen för molekylär biovetenskap, Wenner-Grens institut
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Licentiate thesis, monograph, info:eu-repo/semantics/masterThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
