Data Science techniques for predicting plant genes involved in secondary metabolites production

Masters of Science / Plant genome analysis is currently experiencing a boost due to the reduced costs associated with the
development of next-generation sequencing technologies. Knowledge of a plant's genetic background can be
applied to guide targeted plant selection and breeding, and to facilitate natural product discovery and
biological engineering. In medicinal plants, secondary metabolites are of particular interest because they
often represent the main active ingredients associated with health-promoting qualities.
Plant polyphenols are a highly diverse family of aromatic secondary metabolites that act as antimicrobial
agents, UV protectants, and insect or herbivore repellents. Most genome mining tools developed
to analyse genetic material have seldom addressed secondary metabolite genes and biosynthesis
pathways. Little research has examined the key enzyme factors that can predict a
class of secondary metabolite genes from polyketide synthases.
The objectives of this study were twofold. Primarily, it aimed to identify the biological properties of
secondary metabolite genes and to select a specific gene, naringenin-chalcone synthase or
chalcone synthase (CHS). The study hypothesized that data science approaches to mining biological data,
particularly secondary metabolite genes, would reveal some aspects of
secondary metabolites (SM).
Secondarily, the aim was to propose a proof of concept for classifying or predicting plant genes involved
in polyphenol biosynthesis using data science techniques, and to apply these techniques in computational
analysis through machine learning algorithms and mathematical and statistical approaches.
Three specific challenges experienced while analysing secondary metabolite datasets were: 1) class
imbalance, which refers to a lack of proportionality among protein sequence classes; 2) high
dimensionality, which refers to the very large feature space that arises when analysing bioinformatics
datasets; and 3) variation in protein sequence length, which refers to the fact that protein
sequences in a dataset differ in length.
Given these inherent issues, developing accurate classification and statistical models is
challenging. Effective SM plant gene mining therefore requires dedicated data science
techniques that can collect, prepare and analyse SM genes.
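
Two of the challenges named in the abstract, unequal protein sequence lengths and class imbalance, can be illustrated with a minimal sketch. This is not the method used in the thesis, which the abstract does not describe. The sketch assumes amino-acid composition as the fixed-length representation and the common "balanced" weighting heuristic, n_samples / (n_classes * class_count). The function names, toy sequences and labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

# The 20 standard amino-acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Map a protein sequence of any length to a 20-dimensional frequency vector."""
    counts = Counter(seq.upper())
    # Only standard residues are counted; other symbols are ignored.
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


# Made-up toy data: sequences of different lengths, one minority class.
sequences = ["MVSVEEIRKA", "MAT", "MKKLLPTAAAGLLLLA", "MGS"]
labels = ["CHS", "other", "other", "other"]

features = [composition(s) for s in sequences]
weights = balanced_class_weights(labels)

for seq, vec in zip(sequences, features):
    print(len(seq), len(vec))
print(weights)
```

Every sequence, whatever its length, becomes a vector of the same size, and the rarer class receives the larger weight. Richer representations such as dipeptide or k-mer counts follow the same pattern but enlarge the feature space, which is where the high-dimensionality challenge arises.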

http://hdl.handle.net/11394/7039

Identifier	oai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:uwc/oai:etd.uwc.ac.za:11394/7039
Date	January 2018
Creators	Muteba, Ben Ilunga
Contributors	Christoffels, Alan
Publisher	University of the Western Cape
Source Sets	South African National ETD Portal
Language	English
Detected Language	English
Rights	University of the Western Cape
