Global ETD Search

101	Machine Learning and Graph Theory Approaches for Classification and Prediction of Protein Structure Altun, Gulsah 22 April 2008 (has links) Recently, many methods have been proposed for the classification and prediction problems in bioinformatics. One of these problems is the protein structure prediction. Machine learning approaches and new algorithms have been proposed to solve this problem. Among the machine learning approaches, Support Vector Machines (SVM) have attracted a lot of attention due to their high prediction accuracy. Since protein data consists of sequence and structural information, another most widely used approach for modeling this structured data is to use graphs. In computer science, graph theory has been widely studied; however it has only been recently applied to bioinformatics. In this work, we introduced new algorithms based on statistical methods, graph theory concepts and machine learning for the protein structure prediction problem. A new statistical method based on z-scores has been introduced for seed selection in proteins. A new method based on finding common cliques in protein data for feature selection is also introduced, which reduces noise in the data. We also introduced new binary classifiers for the prediction of structural transitions in proteins. These new binary classifiers achieve much higher accuracy results than the current traditional binary classifiers. protein structure prediction feature selection support vector machines graph theory machine learning algorithm Computer Sciences
102	Application of Bayesian Hierarchical Models in Genetic Data Analysis Zhang, Lin 14 March 2013 (has links) Genetic data analysis has been capturing a lot of attentions for understanding the mechanism of the development and progressing of diseases like cancers, and is crucial in discovering genetic markers and treatment targets in medical research. This dissertation focuses on several important issues in genetic data analysis, graphical network modeling, feature selection, and covariance estimation. First, we develop a gene network modeling method for discrete gene expression data, produced by technologies such as serial analysis of gene expression and RNA sequencing experiment, which generate counts of mRNA transcripts in cell samples. We propose a generalized linear model to fit the discrete gene expression data and assume that the log ratios of the mean expression levels follow a Gaussian distribution. We derive the gene network structures by selecting covariance matrices of the Gaussian distribution with a hyper-inverse Wishart prior. We incorporate prior network models based on Gene Ontology information, which avails existing biological information on the genes of interest. Next, we consider a variable selection problem, where the variables have natural grouping structures, with application to analysis of chromosomal copy number data. The chromosomal copy number data are produced by molecular inversion probes experiments which measure probe-specific copy number changes. We propose a novel Bayesian variable selection method, the hierarchical structured variable se- lection (HSVS) method, which accounts for the natural gene and probe-within-gene architecture to identify important genes and probes associated with clinically relevant outcomes. We propose the HSVS model for grouped variable selection, where simultaneous selection of both groups and within-group variables is of interest. The HSVS model utilizes a discrete mixture prior distribution for group selection and group-specific Bayesian lasso hierarchies for variable selection within groups. We further provide methods for accounting for serial correlations within groups that incorporate Bayesian fused lasso methods for within-group selection. Finally, we propose a Bayesian method of estimating high-dimensional covariance matrices that can be decomposed into a low rank and sparse component. This covariance structure has a wide range of applications including factor analytical model and random effects model. We model the covariance matrices with the decomposition structure by representing the covariance model in the form of a factor analytic model where the number of latent factors is unknown. We introduce binary indicators for estimating the rank of the low rank component combined with a Bayesian graphical lasso method for estimating the sparse component. We further extend our method to a graphical factor analytic model where the graphical model of the residuals is of interest. We achieve sparse estimation of the inverse covariance of the residuals in the graphical factor model by employing a hyper-inverse Wishart prior method for a decomposable graph and a Bayesian graphical lasso method for an unrestricted graph. covariance estimation feature selection graphical network modeling genetic data analysis Bayesian hierarchical model
103	A statistical framework for the analysis of neural control of movement with aging and other clinical applications Johnson, Ashley Nzinga 08 March 2012 (has links) The majority of daily living tasks necessitate the use of bimanual movements or concurrent cognitive processing, which are often more difficult for elderly adults. With the number of Americans age 65 and older expected to double in the next 25 years, in-depth research and sophisticated technologies are necessary to understand the mechanisms involved in normal neuromuscular aging. The objective of the research is to understand the effects of aging on biological signals for motor control and to develop a methodology to classify aging and stroke populations. The methodological approach investigated the influence on correlated activity (coherence) between electroencephalogram (EEG) and electromyogram (EMG) signals into senior age. In support of classifying aging and stroke populations, the methodology selected optimal features from the time, frequency, and information theory domains. Additionally, the use of cepstral analysis was modified toward this application to analyze EEG and EMG signals. The inclusion and optimization of cepstral features significantly improved classification accuracy. Additionally, classification of young and elderly adults using Gaussian Mixture Models with Minimum Classification Error improved overall accuracy values. Contributions from the dissertation include demonstration of the change in correlated activity between EMG and EEG with fine motor simple and complex dual tasks; a quantitative feature library for characterizing the neural control of movement with aging under three task conditions; and a methodology for the selection and classification of features to characterize the neural control of movement. Additionally, the dissertation provides functional insight for the association of features with tasks, aging, and clinical conditions. The results of the work are significant because classification of the neural control of movement with aging is not well established. From these contributions, future potential contributions are: a methodology for physiologists to analyze and interpret data; and a computational tool to provide early detection of neuromuscular disorders. Electroencephalogram Electromyogram Classification Coherence Feature selection Aging Motor neurons Efferent pathways Neuromuscular transmission Cerebrovascular disease—Patients
104	Feature Ranking for Text Classifiers Makrehchi, Masoud January 2007 (has links) Feature selection based on feature ranking has received much attention by researchers in the field of text classification. The major reasons are their scalability, ease of use, and fast computation. %, However, compared to the search-based feature selection methods such as wrappers and filters, they suffer from poor performance. This is linked to their major deficiencies, including: (i) feature ranking is problem-dependent; (ii) they ignore term dependencies, including redundancies and correlation; and (iii) they usually fail in unbalanced data. While using feature ranking methods for dimensionality reduction, we should be aware of these drawbacks, which arise from the function of feature ranking methods. In this thesis, a set of solutions is proposed to handle the drawbacks of feature ranking and boost their performance. First, an evaluation framework called feature meta-ranking is proposed to evaluate ranking measures. The framework is based on a newly proposed Differential Filter Level Performance (DFLP) measure. It was proved that, in ideal cases, the performance of text classifier is a monotonic, non-decreasing function of the number of features. Then we theoretically and empirically validate the effectiveness of DFLP as a meta-ranking measure to evaluate and compare feature ranking methods. The meta-ranking framework is also examined by a stopword extraction problem. We use the framework to select appropriate feature ranking measure for building domain-specific stoplists. The proposed framework is evaluated by SVM and Rocchio text classifiers on six benchmark data. The meta-ranking method suggests that in searching for a proper feature ranking measure, the backward feature ranking is as important as the forward one. Second, we show that the destructive effect of term redundancy gets worse as we decrease the feature ranking threshold. It implies that for aggressive feature selection, an effective redundancy reduction should be performed as well as feature ranking. An algorithm based on extracting term dependency links using an information theoretic inclusion index is proposed to detect and handle term dependencies. The dependency links are visualized by a tree structure called a term dependency tree. By grouping the nodes of the tree into two categories, including hub and link nodes, a heuristic algorithm is proposed to handle the term dependencies by merging or removing the link nodes. The proposed method of redundancy reduction is evaluated by SVM and Rocchio classifiers for four benchmark data sets. According to the results, redundancy reduction is more effective on weak classifiers since they are more sensitive to term redundancies. It also suggests that in those feature ranking methods which compact the information in a small number of features, aggressive feature selection is not recommended. Finally, to deal with class imbalance in feature level using ranking methods, a local feature ranking scheme called reverse discrimination approach is proposed. The proposed method is applied to a highly unbalanced social network discovery problem. In this case study, the problem of learning a social network is translated into a text classification problem using newly proposed actor and relationship modeling. Since social networks are usually sparse structures, the corresponding text classifiers become highly unbalanced. Experimental assessment of the reverse discrimination approach validates the effectiveness of the local feature ranking method to improve the classifier performance when dealing with unbalanced data. The application itself suggests a new approach to learn social structures from textual data. Feature Ranking Feature Selection Information Theory Text Classifier Social Network Link Mining Electrical and Computer Engineering
105	Kernel Methods in Computer-Aided Constructive Drug Design Wong, William Wai Lun 04 May 2009 (has links) A drug is typically a small molecule that interacts with the binding site of some target protein. Drug design involves the optimization of this interaction so that the drug effectively binds with the target protein while not binding with other proteins (an event that could produce dangerous side effects). Computational drug design involves the geometric modeling of drug molecules, with the goal of generating similar molecules that will be more effective drug candidates. It is necessary that algorithms incorporate strategies to measure molecular similarity by comparing molecular descriptors that may involve dozens to hundreds of attributes. We use kernel-based methods to define these measures of similarity. Kernels are general functions that can be used to formulate similarity comparisons. The overall goal of this thesis is to develop effective and efficient computational methods that are reliant on transparent mathematical descriptors of molecules with applications to affinity prediction, detection of multiple binding modes, and generation of new drug leads. While in this thesis we derive computational strategies for the discovery of new drug leads, our approach differs from the traditional ligandbased approach. We have developed novel procedures to calculate inverse mappings and subsequently recover the structure of a potential drug lead. The contributions of this thesis are the following: 1. We propose a vector space model molecular descriptor (VSMMD) based on a vector space model that is suitable for kernel studies in QSAR modeling. Our experiments have provided convincing comparative empirical evidence that our descriptor formulation in conjunction with kernel based regression algorithms can provide sufficient discrimination to predict various biological activities of a molecule with reasonable accuracy. 2. We present a new component selection algorithm KACS (Kernel Alignment Component Selection) based on kernel alignment for a QSAR study. Kernel alignment has been developed as a measure of similarity between two kernel functions. In our algorithm, we refine kernel alignment as an evaluation tool, using recursive component elimination to eventually select the most important components for classification. We have demonstrated empirically and proven theoretically that our algorithm works well for finding the most important components in different QSAR data sets. 3. We extend the VSMMD in conjunction with a kernel based clustering algorithm to the prediction of multiple binding modes, a challenging area of research that has been previously studied by means of time consuming docking simulations. The results reported in this study provide strong empirical evidence that our strategy has enough resolving power to distinguish multiple binding modes through the use of a standard k-means algorithm. 4. We develop a set of reverse engineering strategies for QSAR modeling based on our VSMMD. These strategies include: (a) The use of a kernel feature space algorithm to design or modify descriptor image points in a feature space. (b) The deployment of a pre-image algorithm to map the newly defined descriptor image points in the feature space back to the input space of the descriptors. (c) The design of a probabilistic strategy to convert new descriptors to meaningful chemical graph templates. The most important aspect of these contributions is the presentation of strategies that actually generate the structure of a new drug candidate. While the training set is still used to generate a new image point in the feature space, the reverse engineering strategies just described allows us to develop a new drug candidate that is independent of issues related to probability distribution constraints placed on test set molecules. drug design kernel methods QSAR inverse-QSAR feature selection Computer Science
106	Feature Ranking for Text Classifiers Makrehchi, Masoud January 2007 (has links) Feature selection based on feature ranking has received much attention by researchers in the field of text classification. The major reasons are their scalability, ease of use, and fast computation. %, However, compared to the search-based feature selection methods such as wrappers and filters, they suffer from poor performance. This is linked to their major deficiencies, including: (i) feature ranking is problem-dependent; (ii) they ignore term dependencies, including redundancies and correlation; and (iii) they usually fail in unbalanced data. While using feature ranking methods for dimensionality reduction, we should be aware of these drawbacks, which arise from the function of feature ranking methods. In this thesis, a set of solutions is proposed to handle the drawbacks of feature ranking and boost their performance. First, an evaluation framework called feature meta-ranking is proposed to evaluate ranking measures. The framework is based on a newly proposed Differential Filter Level Performance (DFLP) measure. It was proved that, in ideal cases, the performance of text classifier is a monotonic, non-decreasing function of the number of features. Then we theoretically and empirically validate the effectiveness of DFLP as a meta-ranking measure to evaluate and compare feature ranking methods. The meta-ranking framework is also examined by a stopword extraction problem. We use the framework to select appropriate feature ranking measure for building domain-specific stoplists. The proposed framework is evaluated by SVM and Rocchio text classifiers on six benchmark data. The meta-ranking method suggests that in searching for a proper feature ranking measure, the backward feature ranking is as important as the forward one. Second, we show that the destructive effect of term redundancy gets worse as we decrease the feature ranking threshold. It implies that for aggressive feature selection, an effective redundancy reduction should be performed as well as feature ranking. An algorithm based on extracting term dependency links using an information theoretic inclusion index is proposed to detect and handle term dependencies. The dependency links are visualized by a tree structure called a term dependency tree. By grouping the nodes of the tree into two categories, including hub and link nodes, a heuristic algorithm is proposed to handle the term dependencies by merging or removing the link nodes. The proposed method of redundancy reduction is evaluated by SVM and Rocchio classifiers for four benchmark data sets. According to the results, redundancy reduction is more effective on weak classifiers since they are more sensitive to term redundancies. It also suggests that in those feature ranking methods which compact the information in a small number of features, aggressive feature selection is not recommended. Finally, to deal with class imbalance in feature level using ranking methods, a local feature ranking scheme called reverse discrimination approach is proposed. The proposed method is applied to a highly unbalanced social network discovery problem. In this case study, the problem of learning a social network is translated into a text classification problem using newly proposed actor and relationship modeling. Since social networks are usually sparse structures, the corresponding text classifiers become highly unbalanced. Experimental assessment of the reverse discrimination approach validates the effectiveness of the local feature ranking method to improve the classifier performance when dealing with unbalanced data. The application itself suggests a new approach to learn social structures from textual data. Feature Ranking Feature Selection Information Theory Text Classifier Social Network Link Mining Electrical and Computer Engineering
107	Kernel Methods in Computer-Aided Constructive Drug Design Wong, William Wai Lun 04 May 2009 (has links) A drug is typically a small molecule that interacts with the binding site of some target protein. Drug design involves the optimization of this interaction so that the drug effectively binds with the target protein while not binding with other proteins (an event that could produce dangerous side effects). Computational drug design involves the geometric modeling of drug molecules, with the goal of generating similar molecules that will be more effective drug candidates. It is necessary that algorithms incorporate strategies to measure molecular similarity by comparing molecular descriptors that may involve dozens to hundreds of attributes. We use kernel-based methods to define these measures of similarity. Kernels are general functions that can be used to formulate similarity comparisons. The overall goal of this thesis is to develop effective and efficient computational methods that are reliant on transparent mathematical descriptors of molecules with applications to affinity prediction, detection of multiple binding modes, and generation of new drug leads. While in this thesis we derive computational strategies for the discovery of new drug leads, our approach differs from the traditional ligandbased approach. We have developed novel procedures to calculate inverse mappings and subsequently recover the structure of a potential drug lead. The contributions of this thesis are the following: 1. We propose a vector space model molecular descriptor (VSMMD) based on a vector space model that is suitable for kernel studies in QSAR modeling. Our experiments have provided convincing comparative empirical evidence that our descriptor formulation in conjunction with kernel based regression algorithms can provide sufficient discrimination to predict various biological activities of a molecule with reasonable accuracy. 2. We present a new component selection algorithm KACS (Kernel Alignment Component Selection) based on kernel alignment for a QSAR study. Kernel alignment has been developed as a measure of similarity between two kernel functions. In our algorithm, we refine kernel alignment as an evaluation tool, using recursive component elimination to eventually select the most important components for classification. We have demonstrated empirically and proven theoretically that our algorithm works well for finding the most important components in different QSAR data sets. 3. We extend the VSMMD in conjunction with a kernel based clustering algorithm to the prediction of multiple binding modes, a challenging area of research that has been previously studied by means of time consuming docking simulations. The results reported in this study provide strong empirical evidence that our strategy has enough resolving power to distinguish multiple binding modes through the use of a standard k-means algorithm. 4. We develop a set of reverse engineering strategies for QSAR modeling based on our VSMMD. These strategies include: (a) The use of a kernel feature space algorithm to design or modify descriptor image points in a feature space. (b) The deployment of a pre-image algorithm to map the newly defined descriptor image points in the feature space back to the input space of the descriptors. (c) The design of a probabilistic strategy to convert new descriptors to meaningful chemical graph templates. The most important aspect of these contributions is the presentation of strategies that actually generate the structure of a new drug candidate. While the training set is still used to generate a new image point in the feature space, the reverse engineering strategies just described allows us to develop a new drug candidate that is independent of issues related to probability distribution constraints placed on test set molecules. drug design kernel methods QSAR inverse-QSAR feature selection Computer Science
108	Qualitative Performance Analysis for Large-Scale Scientific Workflows Buneci, Emma 30 May 2008 (has links) <p>Today, large-scale scientific applications are both data driven and distributed. To support the scale and inherent distribution of these applications, significant heterogeneous and geographically distributed resources are required over long periods of time to ensure adequate performance. Furthermore, the behavior of these applications depends on a large number of factors related to the application, the system software, the underlying hardware, and other running applications, as well as potential interactions among these factors.</p> <p>Most Grid application users are primarily concerned with obtaining the result of the application as fast as possible, without worrying about the details involved in monitoring and understanding factors affecting application performance. In this work, we aim to provide the application users with a simple and intuitive performance evaluation mechanism during the execution time of their long-running Grid applications or workflows. Our performance evaluation mechanism provides a qualitative and periodic assessment of the application's behavior by informing the user whether the application's performance is expected or unexpected. Furthermore, it can help improve overall application performance by informing and guiding fault-tolerance services when the application exhibits persistent unexpected performance behaviors.</p> <p>This thesis addresses the hypotheses that in order to qualitatively assess application behavioral states in long-running scientific Grid applications: (1) it is necessary to extract temporal information in performance time series data, and that (2) it is sufficient to extract variance and pattern as specific examples of temporal information. Evidence supporting these hypotheses can lead to the ability to qualitatively assess the overall behavior of the application and, if needed, to offer a most likely diagnostic of the underlying problem.</p> <p>To test the stated hypotheses, we develop and evaluate a general <em> qualitative performance analysis</em> framework that incorporates (a) techniques from time series analysis and machine learning to extract and learn from data, structural and temporal features associated with application performance in order to reach a qualitative interpretation of the application's behavior, and (b) mechanisms and policies to reason over time and across the distributed resource space about the behavior of the application. </p> <p>Experiments with two scientific applications from meteorology and astronomy comparing signatures generated from instantaneous values of performance data versus those generated from temporal characteristics support the former hypothesis that temporal information is necessary to extract from performance time series data to be able to accurately interpret the behavior of these applications. Furthermore, temporal signatures incorporating variance and pattern information generated for these applications reveal signatures that have distinct characteristics during well-performing versus poor-performing executions. This leads to the framework's accurate classification of instances of similar behaviors, which represents supporting evidence for the latter hypothesis. The proposed framework's ability to generate a qualitative assessment of performance behavior for scientific applications using temporal information present in performance time series data represents a step towards simplifying and improving the quality of service for Grid applications.</p> / Dissertation Computer Science Computer Science feature selection time series analysis performance analysis signatures scientific applications
109	Feature Selection for Value Function Approximation Taylor, Gavin January 2011 (has links) <p>The field of reinforcement learning concerns the question of automated action selection given past experiences. As an agent moves through the state space, it must recognize which state choices are best in terms of allowing it to reach its goal. This is quantified with value functions, which evaluate a state and return the sum of rewards the agent can expect to receive from that state. Given a good value function, the agent can choose the actions which maximize this sum of rewards. Value functions are often chosen from a linear space defined by a set of features; this method offers a concise structure, low computational effort, and resistance to overfitting. However, because the number of features is small, this method depends heavily on these few features being expressive and useful, making the selection of these features a core problem. This document discusses this selection.</p><p>Aside from a review of the field, contributions include a new understanding of the role approximate models play in value function approximation, leading to new methods for analyzing feature sets in an intuitive way, both using the linear and the related kernelized approximation architectures. Additionally, we present a new method for automatically choosing features during value function approximation which has a bounded approximation error and produces superior policies, even in extremely noisy domains.</p> / Dissertation Artificial Intelligence Computer Science Feature Selection Reinforcement Learning Value Function Approximation
110	Sparse Modeling in Classification, Compression and Detection Chen, Jihong 12 July 2004 (has links) The principal focus of this thesis is the exploration of sparse structures in a variety of statistical modelling problems. While more comprehensive models can be useful to solve a larger number of problems, its calculation may be ill-posed in most practical instances because of the sparsity of informative features in the data. If this sparse structure can be exploited, the models can often be solved very efficiently. The thesis is composed of four projects. Firstly, feature sparsity is incorporated to improve the performance of support vector machines when there are a lot of noise features present. The second project is about an empirical study on how to construct an optimal cascade structure. The third project involves the design of a progressive, rate-distortionoptimized shape coder by combining zero-tree algorithm with beamlet structure. Finally, the longest run statistics is applied for the detection of a filamentary structure in twodimensional rectangular region. The fundamental ideas of the above projects are common — extract an efficient summary from a large amount of data. The main contributions of this work are to develop and implement novel techniques for the efficient solutions of several dicult problems that arise in statistical signal/image processing. Longest run Detection Coding Beamlet Cascade Feature selection Support vector machine Sparsity

Search results