  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
31

Data Science techniques for predicting plant genes involved in secondary metabolites production

Muteba, Ben Ilunga January 2018 (has links)
Masters of Science / Plant genome analysis is currently experiencing a boost due to the reduced costs associated with next generation sequencing technologies. Knowledge of genetic background can be applied to guide targeted plant selection and breeding, and to facilitate natural product discovery and biological engineering. In medicinal plants, secondary metabolites are of particular interest because they often represent the main active ingredients associated with health-promoting qualities. Plant polyphenols are a highly diverse family of aromatic secondary metabolites that act as antimicrobial agents, UV protectants, and insect or herbivore repellents. Most genome mining tools developed to understand genetic material have seldom addressed secondary metabolite genes and biosynthesis pathways, and little research has been conducted on key enzyme factors that can predict a class of secondary metabolite genes from polyketide synthases. The objectives of this study were twofold. Primarily, it aimed to identify the biological properties of secondary metabolite genes and to select a specific gene, naringenin-chalcone synthase or chalcone synthase (CHS). The study hypothesized that data science approaches to mining biological data, particularly secondary metabolite (SM) genes, would reveal aspects of SM biosynthesis. Secondarily, it aimed to propose a proof of concept for classifying or predicting plant genes involved in polyphenol biosynthesis using data science techniques, applying them in computational analysis through machine learning algorithms and mathematical and statistical approaches.
Three specific challenges experienced while analysing secondary metabolite datasets were: 1) class imbalance, the lack of proportionality among protein sequence classes; 2) high dimensionality, the very large feature space that arises when analysing bioinformatics datasets; and 3) variation in sequence length, since protein sequences differ in length while most models expect fixed-length inputs. Given these inherent issues, developing precise classification and statistical models is a challenge. The prerequisite for effective SM plant gene mining is therefore dedicated data science techniques that can collect, prepare and analyse SM genes.
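The third challenge, unequal sequence lengths, is commonly handled by mapping each protein sequence to a fixed-length vector of k-mer counts; a minimal sketch (the function name and the choice of k are illustrative, not from the thesis):

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_features(sequence, k=2):
    """Map a variable-length protein sequence to a fixed-length vector
    of k-mer counts, so sequences of different lengths become
    comparable feature vectors (20**k dimensions for k-mers)."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[m] for m in vocab]
```

Note that this trades the length problem for the high-dimensionality problem (400 features already at k=2), which is exactly why the two challenges tend to appear together.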
32

Integrated feature, neighbourhood, and model optimization for personalised modelling and knowledge discovery

Liang, Wen January 2009 (has links)
“Machine learning is the process of discovering and interpreting meaningful information, such as new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques” (Larose, 2005). In other words, machine learning is a process of using different analysis techniques to uncover previously unknown, potentially meaningful information and to discover strong patterns and relationships in a large dataset. Professor Kasabov (2007b) classified computational models into three categories (global, local, and personalised), which are widely used for data analysis and decision support in general, and in medicine and bioinformatics in particular. Most recently, the concept of personalised modelling has been applied to various disciplines such as personalised medicine and personalised drug design for known diseases (e.g. cancer, diabetes, and brain disease), as well as to other modelling problems in ecology, business, finance, crime prevention, and so on. The philosophy behind the personalised modelling approach is that every person is different, and will therefore benefit from a personalised model and treatment. However, personalised modelling is not without issues, such as defining the correct number of neighbours or an appropriate number of features. The principal goal of this research is therefore to study and address these issues and to create a novel framework and system for personalised modelling. The framework allows users to select and optimise the most important features and nearest neighbours for a new input sample in relation to a given problem, based on a weighted variable distance measure, in order to obtain more precise prognostic accuracy and personalised knowledge than global and local modelling approaches.
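The neighbourhood step of such a personalised model can be sketched as ranking stored samples by a feature-weighted Euclidean distance to the new input (a minimal illustration; the function name and weighting scheme are assumptions, not the thesis's actual implementation):

```python
import math

def weighted_neighbours(x_new, samples, weights, k=3):
    """Return the indices of the k stored samples nearest to x_new
    under a feature-weighted Euclidean distance. The per-feature
    weights let important variables dominate the neighbourhood."""
    def dist(a, b):
        return math.sqrt(sum(w * (ai - bi) ** 2
                             for w, ai, bi in zip(weights, a, b)))
    ranked = sorted(range(len(samples)),
                    key=lambda i: dist(x_new, samples[i]))
    return ranked[:k]
```

In a full personalised-modelling system, both the weights and k would themselves be optimised per input sample rather than fixed as here.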
33

Identification of Driving Styles in Buses

Karginova, Nadezda January 2010 (has links)
It is important to detect faults in bus components at an early stage. Because driving style affects the breakdown of different components in the bus, identifying the driving style is important to minimize the number of failures in buses. The driver's driving style was identified from input data containing examples of driving runs of each class. K-nearest neighbor and neural network algorithms were used, and different models were tested. It was shown that the results depend on the selected driving runs; a hypothesis was suggested that examples from different driving runs have different parameters which affect the classification results. The best results were achieved using a subset of variables chosen with the help of a forward feature selection procedure. The proportion of correct classifications is about 89-90% for the k-nearest neighbor algorithm and 88-93% for the neural networks. Feature selection significantly improved the results of the k-nearest neighbor algorithm, and of the neural networks when the training and testing sets were selected from different driving runs. On the other hand, feature selection did not affect the neural network results when the training and testing sets were selected from the same driving runs. Another way to improve the results is smoothing: computing the average class over a number of consecutive examples reduced the error.
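The smoothing idea can be illustrated by replacing each prediction with the majority class inside a sliding window of consecutive predictions (a sketch under assumptions: the thesis averages the class over consecutive examples, and the window size here is arbitrary):

```python
from collections import Counter

def smooth_predictions(labels, window=3):
    """Smooth a sequence of per-example class predictions by taking
    the majority class within a centred sliding window, which removes
    isolated misclassifications in otherwise stable runs."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        out.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return out
```

This exploits the fact that a driving style changes slowly relative to the sampling rate, so neighbouring examples usually share the true class.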
34

Sharing visual features for multiclass and multiview object detection

Torralba, Antonio, Murphy, Kevin P., Freeman, William T. 14 April 2004 (has links)
We consider the problem of detecting a large number of different classes of objects in cluttered scenes. Traditional approaches require applying a battery of different classifiers to the image at multiple locations and scales. This can be slow and can require a lot of training data, since each classifier requires the computation of many different image features. In particular, for independently trained detectors, the (run-time) computational complexity and the (training-time) sample complexity scale linearly with the number of classes to be detected. It seems unlikely that such an approach will scale up to allow recognition of hundreds or thousands of objects. We present a multi-class boosting procedure (joint boosting) that reduces the computational and sample complexity by finding common features that can be shared across the classes (and/or views). The detectors for each class are trained jointly, rather than independently. For a given performance level, the total number of features required, and therefore the computational cost, is observed to scale approximately logarithmically with the number of classes. The jointly selected features are generic, edge-like features typical of many natural structures, rather than specific object parts. These generic features generalize better and considerably reduce the computational cost of multi-class object detection.
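The core idea, choosing one weak learner that serves several one-vs-all classifiers at once, can be caricatured with a single decision stump selected by summed error across classes (a toy sketch under simplifying assumptions, not the paper's joint boosting algorithm, which also searches over subsets of classes and fits weighted regression stumps):

```python
def pick_shared_stump(X, Y, thresholds):
    """Pick one (feature, threshold) stump shared by all one-vs-all
    classifiers: the split whose summed error over every class is
    smallest. X is a list of feature vectors; Y[c][i] in {+1, -1}
    is the one-vs-all label of sample i for class c."""
    n_classes = len(Y)
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in thresholds:
            err = 0.0
            for c in range(n_classes):
                # stump votes +1 when feature f exceeds t
                e = sum(1 for i, x in enumerate(X)
                        if (1 if x[f] > t else -1) != Y[c][i]) / len(X)
                err += min(e, 1 - e)  # each class may flip polarity
            if err < best_err:
                best, best_err = (f, t), err
    return best, best_err
```

Because one stump's evaluation cost is amortised over all classes that share it, the per-class cost drops as the number of classes grows, which is the mechanism behind the logarithmic scaling observed in the paper.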
36

Semidefinite Embedding for the Dimensionality Reduction of DNA Microarray Data

Kharal, Rosina January 2006 (has links)
Harnessing the power of DNA microarray technology requires analysis methods that accurately interpret microarray data. The current literature abounds with algorithms for investigating microarray data, but there is a need for an efficient approach that combines different analysis techniques and provides a viable solution to the dimensionality reduction of microarray data. Reducing the high dimensionality of microarray data is one way to better understand the information contained within it. We propose a novel approach for dimensionality reduction of microarray data that effectively combines different techniques in the study of DNA microarrays. Our method, KAS (kernel alignment with semidefinite embedding), aids the visualization of microarray data in two dimensions and shows improvement over existing dimensionality reduction methods such as PCA, LLE and Isomap.
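One ingredient of such a method, the empirical alignment between two kernel (Gram) matrices, is simple to state: the Frobenius inner product of the matrices, normalised. A minimal sketch (how KAS actually combines alignment with semidefinite embedding is not shown here):

```python
import math

def kernel_alignment(K1, K2):
    """Empirical alignment between two Gram matrices:
    <K1, K2>_F / sqrt(<K1, K1>_F * <K2, K2>_F).
    Equal to 1 when the kernels agree up to a positive scale."""
    def frob(A, B):  # Frobenius inner product
        return sum(a * b for ra, rb in zip(A, B)
                   for a, b in zip(ra, rb))
    return frob(K1, K2) / math.sqrt(frob(K1, K1) * frob(K2, K2))
```

Alignment with a target kernel built from class labels gives a label-aware score for comparing candidate embeddings of the data.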
38

Feature Selection for Gene Expression Data Based on Hilbert-Schmidt Independence Criterion

Zarkoob, Hadi 21 May 2010 (has links)
DNA microarrays are capable of measuring the expression levels of thousands of genes, even the whole genome, in a single experiment. They have therefore been widely used to extend studies of cancerous tissues to the genomic level. One of the main goals in DNA microarray experiments is to identify a set of relevant genes such that the desired outputs of the experiment depend mostly on this set, to the exclusion of the remaining genes. This is motivated by the fact that a biological process in the cell typically involves only a subset of genes, not the whole genome. The task of selecting a subset of relevant genes is called feature (gene) selection. Herein, we propose a feature selection algorithm for gene expression data. It is based on the Hilbert-Schmidt independence criterion (HSIC), and partly motivated by Rank-One Downdate (R1D) and the Singular Value Decomposition (SVD). The algorithm is computationally very fast, scalable to large datasets, and applicable to response variables of arbitrary type (categorical and continuous). Experimental results of the proposed technique are presented on synthetic and well-known microarray datasets. We then discuss the capability of HSIC to provide a general framework that encapsulates many widely used techniques for dimensionality reduction, clustering and metric learning. We use this framework to explain two metric learning algorithms, namely Fisher discriminant analysis (FDA) and closed-form metric learning (CFML), and as a result we are able to propose a new metric learning method. The proposed technique uses concepts from normalized-cut spectral clustering and is associated with an underlying convex optimization problem.
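The criterion itself has a compact biased empirical estimator, HSIC(K, L) = trace(K H L H) / (n-1)^2, where K and L are Gram matrices over the two variables and H = I - (1/n)11^T is the centring matrix. A pure-Python illustration (not the thesis's scalable algorithm):

```python
def hsic(K, L):
    """Biased empirical HSIC estimate trace(K H L H) / (n-1)^2.
    Near zero when the variables behind K and L are independent;
    larger values indicate dependence."""
    n = len(K)
    H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
         for i in range(n)]
    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
    M = matmul(matmul(K, H), matmul(L, H))  # K H L H
    return sum(M[i][i] for i in range(n)) / (n - 1) ** 2
```

A gene-selection filter can score each gene by the HSIC between its expression kernel and a label kernel, then keep the top-scoring genes.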
39

GAGS : A Novel Microarray Gene Selection Algorithm for Gene Expression Classification

Wu, Kuo-yi 30 July 2010 (has links)
In this thesis, we propose a novel microarray gene selection algorithm consisting of five processes for solving the gene expression classification problem. A normalization process is first used to remove the differences among the scales of different genes. Second, an efficient gene ranking process is proposed to filter out unrelated genes. Then, a genetic algorithm is adopted to find informative gene subsets for each class. These per-class informative gene subsets are used to classify the testing dataset separately, and finally the separate classification results are fused into one final result. In the first experiment, 4 microarray datasets are used to verify the performance of the proposed algorithm, using leave-one-out cross-validation (LOOCV) resampling. Compared with twenty-one existing methods, the proposed algorithm obtains three wins on the four datasets, and the accuracies on three datasets all reach 100%. In the second experiment, 9 microarray datasets are used, with a 50%/50% split resampling method; our algorithm obtains eight wins among the nine datasets against all competing methods.
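The gene ranking step is not specified in the abstract; a common filter-style stand-in is the signal-to-noise score |mean0 - mean1| / (std0 + std1) per gene between two classes (an illustrative assumption, not the thesis's actual ranking criterion):

```python
import statistics as st

def rank_genes(expr, labels):
    """Rank genes (columns of expr) by a signal-to-noise score
    between class 0 and class 1: |mean0 - mean1| / (std0 + std1).
    Returns gene indices, highest-scoring first."""
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, y in zip(expr, labels) if y == 0]
        b = [row[g] for row, y in zip(expr, labels) if y == 1]
        denom = st.pstdev(a) + st.pstdev(b) or 1e-12  # guard zero spread
        scores.append(abs(st.mean(a) - st.mean(b)) / denom)
    return sorted(range(n_genes), key=lambda g: -scores[g])
```

Only the top-ranked genes would then be passed on to the genetic-algorithm search, shrinking its search space.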
40

Automatic Attribute Clustering and Feature Selection Based on Genetic Algorithms

Wang, Po-Cheng 21 August 2009 (has links)
Feature selection is an important pre-processing step in data mining and machine learning. A good set of features can not only improve the accuracy of classification but also reduce the time needed to derive rules. It is especially valuable when the number of attributes in the training data is very large. This thesis thus proposes three GA-based clustering methods for attribute clustering and feature selection. In the first method, each feasible clustering result is encoded into a chromosome of positive integers, with one gene per attribute; the value of a gene represents the cluster to which the attribute belongs. The fitness of each individual is evaluated using both the average accuracy of attribute substitutions within clusters and the cluster balance. The second method extends the first to improve time performance: a new fitness function based on both accuracy and attribute dependency is proposed, which reduces the time spent scanning the database. The third approach uses a different encoding for chromosomes and achieves faster convergence and better results than the second. Finally, experimental comparison with the k-means clustering approach and with all combinations of attributes shows that the proposed approaches achieve a good trade-off between accuracy and time complexity. Moreover, after feature selection, rules derived from only the selected features may be hard to use if some values of the selected features cannot be obtained in the current environment. This problem is easily solved in the proposed approaches: attributes with missing values can be replaced by other attributes in the same clusters, so the approaches provide flexible alternatives for feature selection.
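The integer encoding of the first method can be sketched directly: gene i of the chromosome holds the cluster id of attribute i, and a simple point-mutation operator perturbs assignments (the mutation operator and rate are illustrative assumptions, not taken from the thesis):

```python
import random

def decode_clusters(chromosome):
    """Decode a chromosome of positive integers (gene i holds the
    cluster id of attribute i) into {cluster_id: [attributes]}."""
    clusters = {}
    for attr, cid in enumerate(chromosome):
        clusters.setdefault(cid, []).append(attr)
    return clusters

def mutate(chromosome, n_clusters, rate=0.1, seed=0):
    """Point mutation: reassign each attribute to a random cluster
    in 1..n_clusters with probability `rate`."""
    rng = random.Random(seed)
    return [rng.randrange(1, n_clusters + 1) if rng.random() < rate else g
            for g in chromosome]
```

A fitness function would then score each decoded clustering, e.g. by attribute-substitution accuracy and cluster balance as described above.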
