101 |
Machine Learning and Graph Theory Approaches for Classification and Prediction of Protein Structure. Altun, Gulsah. 22 April 2008.
Recently, many methods have been proposed for classification and prediction problems in bioinformatics; protein structure prediction is one of them. Machine learning approaches and new algorithms have been proposed to solve this problem. Among the machine learning approaches, Support Vector Machines (SVM) have attracted much attention due to their high prediction accuracy. Since protein data consist of both sequence and structural information, another widely used approach for modeling such structured data is graphs. Graph theory has been widely studied in computer science; however, it has only recently been applied to bioinformatics. In this work, we introduce new algorithms based on statistical methods, graph theory concepts, and machine learning for the protein structure prediction problem. A new statistical method based on z-scores is introduced for seed selection in proteins. A new method based on finding common cliques in protein data is also introduced for feature selection, which reduces noise in the data. We also introduce new binary classifiers for the prediction of structural transitions in proteins; these classifiers achieve much higher accuracy than traditional binary classifiers.
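As an illustration of the z-score idea mentioned above, here is a minimal sketch assuming a generic per-position scoring function and an arbitrary threshold; it is not the thesis's exact seed-selection procedure.

```python
import numpy as np

def zscore_seed_selection(scores, threshold=2.0):
    """Select seed positions whose scores deviate strongly from the mean.

    scores: 1-D array of per-position scores (e.g., conservation values).
    Returns indices of positions whose z-score exceeds the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean()) / scores.std()
    return np.flatnonzero(z > threshold)

# Positions with unusually high scores become seeds.
rng = np.random.default_rng(0)
scores = rng.normal(size=200)
scores[50:55] += 4.0                      # a strongly conserved stretch
print(zscore_seed_selection(scores))      # indices near 50..54
```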
|
102 |
Application of Bayesian Hierarchical Models in Genetic Data Analysis. Zhang, Lin. 14 March 2013.
Genetic data analysis has attracted considerable attention as a way to understand the mechanisms behind the development and progression of diseases such as cancer, and it is crucial for discovering genetic markers and treatment targets in medical research. This dissertation focuses on several important issues in genetic data analysis: graphical network modeling, feature selection, and covariance estimation.

First, we develop a gene network modeling method for discrete gene expression data, produced by technologies such as serial analysis of gene expression and RNA sequencing experiments, which generate counts of mRNA transcripts in cell samples. We propose a generalized linear model to fit the discrete gene expression data and assume that the log ratios of the mean expression levels follow a Gaussian distribution. We derive the gene network structures by selecting covariance matrices of the Gaussian distribution with a hyper-inverse Wishart prior. We incorporate prior network models based on Gene Ontology information, which makes use of existing biological knowledge about the genes of interest.

Next, we consider a variable selection problem where the variables have natural grouping structures, with application to the analysis of chromosomal copy number data. Chromosomal copy number data are produced by molecular inversion probe experiments, which measure probe-specific copy number changes. We propose a novel Bayesian variable selection method, the hierarchical structured variable selection (HSVS) method, which accounts for the natural gene and probe-within-gene architecture to identify important genes and probes associated with clinically relevant outcomes. The HSVS model performs grouped variable selection, where simultaneous selection of both groups and within-group variables is of interest. It uses a discrete mixture prior distribution for group selection and group-specific Bayesian lasso hierarchies for variable selection within groups. We further provide methods that account for serial correlations within groups by incorporating Bayesian fused lasso methods for within-group selection.

Finally, we propose a Bayesian method for estimating high-dimensional covariance matrices that can be decomposed into a low-rank and a sparse component. This covariance structure has a wide range of applications, including factor analytic models and random effects models. We model covariance matrices with this decomposition structure by representing the covariance model as a factor analytic model in which the number of latent factors is unknown. We introduce binary indicators for estimating the rank of the low-rank component, combined with a Bayesian graphical lasso method for estimating the sparse component. We further extend our method to a graphical factor analytic model in which the graphical model of the residuals is of interest. We achieve sparse estimation of the inverse covariance of the residuals in the graphical factor model by employing a hyper-inverse Wishart prior for a decomposable graph and a Bayesian graphical lasso method for an unrestricted graph.
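The generalized-linear-model layer described in the first project can be sketched as follows: a Poisson regression with a log link on simulated transcript counts. The Bayesian network components (hyper-inverse Wishart priors, Gene Ontology-informed priors) are beyond a short example, and the choice of a Poisson family here, like all names and data below, is an illustrative assumption.

```python
import numpy as np
import statsmodels.api as sm

# Simulated mRNA transcript counts for one gene across samples, with a
# log-linear dependence on a small design matrix (e.g., sample covariates).
rng = np.random.default_rng(1)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))   # intercept + two covariates
beta_true = np.array([1.0, 0.5, -0.3])
y = rng.poisson(np.exp(X @ beta_true))         # count response

# Poisson GLM with log link: E[y] = exp(X @ beta)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)                              # estimates close to beta_true
```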
|
103 |
A statistical framework for the analysis of neural control of movement with aging and other clinical applications. Johnson, Ashley Nzinga. 08 March 2012.
The majority of daily living tasks require bimanual movements or concurrent cognitive processing, which are often more difficult for elderly adults. With the number of Americans age 65 and older expected to double in the next 25 years, in-depth research and sophisticated technologies are needed to understand the mechanisms involved in normal neuromuscular aging. The objective of this research is to understand the effects of aging on the biological signals underlying motor control and to develop a methodology for classifying aging and stroke populations. The methodological approach investigated how correlated activity (coherence) between electroencephalogram (EEG) and electromyogram (EMG) signals changes into senior age. To support classification of aging and stroke populations, the methodology selected optimal features from the time, frequency, and information theory domains. In addition, cepstral analysis was adapted to this application for analyzing EEG and EMG signals; including and optimizing cepstral features significantly improved classification accuracy. Classifying young and elderly adults using Gaussian Mixture Models with Minimum Classification Error training further improved overall accuracy.
Contributions of the dissertation include a demonstration of the change in correlated activity between EMG and EEG during simple and complex fine-motor dual tasks; a quantitative feature library for characterizing the neural control of movement with aging under three task conditions; and a methodology for selecting and classifying features that characterize the neural control of movement. The dissertation also provides functional insight into the association of features with tasks, aging, and clinical conditions. The results are significant because classification of the neural control of movement with aging is not well established. Building on these contributions, potential future outcomes include a methodology for physiologists to analyze and interpret data, and a computational tool for early detection of neuromuscular disorders.
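EEG-EMG coherence of the kind analyzed above can be estimated with standard spectral tools; the sketch below uses scipy's Welch-based coherence on simulated signals, with a sampling rate and frequency band chosen purely for illustration rather than taken from the study.

```python
import numpy as np
from scipy.signal import coherence

fs = 1000.0                                # assumed sampling rate in Hz
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(2)
common = np.sin(2 * np.pi * 20 * t)        # shared 20 Hz oscillatory drive
eeg = common + rng.normal(scale=1.0, size=t.size)
emg = common + rng.normal(scale=1.0, size=t.size)

f, Cxy = coherence(eeg, emg, fs=fs, nperseg=1024)
beta = (f >= 15) & (f <= 30)               # beta band, common in motor studies
print(f"mean beta-band coherence: {Cxy[beta].mean():.2f}")
```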
|
104 |
Feature Ranking for Text Classifiers. Makrehchi, Masoud. January 2007.
Feature selection based on feature ranking has received much attention from researchers in the field of text classification. The major reasons are scalability, ease of use, and fast computation. However, compared to search-based feature selection methods such as wrappers and filters, ranking methods suffer from poor performance. This is linked to their major deficiencies: (i) feature ranking is problem-dependent; (ii) they ignore term dependencies, including redundancy and correlation; and (iii) they usually fail on unbalanced data.
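For readers unfamiliar with filter-style feature ranking, here is a minimal sketch using scikit-learn's chi-squared score as the ranking measure on a toy corpus; it is one of many possible measures, not necessarily those studied in the thesis.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = ["the spacecraft entered orbit", "the team scored a late goal",
        "orbital mechanics and rocket engines", "the striker scored twice"]
labels = np.array([0, 1, 0, 1])            # 0 = space, 1 = sports

vec = CountVectorizer().fit(docs)
counts = vec.transform(docs)
scores, _ = chi2(counts, labels)           # one relevance score per term
terms = np.array(vec.get_feature_names_out())
print(terms[np.argsort(scores)[::-1]][:5]) # best-ranked terms first
```

Keeping only the top-k ranked terms is the aggressive feature selection discussed below; note that the ranking never examines term-term interactions, which is deficiency (ii).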
When using feature ranking methods for dimensionality reduction, we should be aware of these drawbacks, which arise from how feature ranking methods work. In this thesis, a set of solutions is proposed to handle the drawbacks of feature ranking and boost its performance. First, an evaluation framework called feature meta-ranking is proposed to evaluate ranking measures. The framework is based on a newly proposed Differential Filter Level Performance (DFLP) measure. It was proved that, in ideal cases, the performance of a text classifier is a monotonic, non-decreasing function of the number of features. We then theoretically and empirically validate the effectiveness of DFLP as a meta-ranking measure for evaluating and comparing feature ranking methods. The meta-ranking framework is also examined on a stopword extraction problem: we use the framework to select an appropriate feature ranking measure for building domain-specific stoplists. The proposed framework is evaluated with SVM and Rocchio text classifiers on six benchmark data sets. The meta-ranking method suggests that in searching for a proper feature ranking measure, backward feature ranking is as important as forward ranking.
Second, we show that the destructive effect of term redundancy gets worse as the feature ranking threshold decreases, implying that aggressive feature selection requires effective redundancy reduction in addition to feature ranking. An algorithm based on extracting term dependency links using an information-theoretic inclusion index is proposed to detect and handle term dependencies. The dependency links are visualized by a tree structure called a term dependency tree. By grouping the nodes of the tree into two categories, hub nodes and link nodes, a heuristic algorithm is proposed to handle the term dependencies by merging or removing the link nodes. The proposed redundancy reduction method is evaluated with SVM and Rocchio classifiers on four benchmark data sets. According to the results, redundancy reduction is more effective on weak classifiers, since they are more sensitive to term redundancy. The results also suggest that aggressive feature selection is not recommended with ranking measures that compact the information into a small number of features.
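The exact inclusion index is defined in the thesis; one standard information-theoretic form, mutual information normalized by one term's entropy, is sketched here purely as an assumption for illustration.

```python
import numpy as np

def inclusion_index(x, y):
    """Sketch of an inclusion index between two binary term-occurrence
    vectors: I(X;Y) / H(X). Values near 1 suggest term y 'includes' x.
    The thesis's exact formula may differ.
    """
    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    joint = np.zeros((2, 2))
    for a, b in zip(x, y):
        joint[a, b] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mi = entropy(px) + entropy(py) - entropy(joint.ravel())
    hx = entropy(px)
    return mi / hx if hx > 0 else 0.0

# Term x rarely appears without term y, so y largely 'includes' x.
x = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y = np.array([1, 1, 0, 1, 1, 0, 1, 0])
print(round(inclusion_index(x, y), 3))
```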
Finally, to deal with class imbalance at the feature level using ranking methods, a local feature ranking scheme called the reverse discrimination approach is proposed. The proposed method is applied to a highly unbalanced social network discovery problem. In this case study, the problem of learning a social network is translated into a text classification problem using newly proposed actor and relationship models. Since social networks are usually sparse structures, the corresponding text classifiers become highly unbalanced. Experimental assessment of the reverse discrimination approach validates the effectiveness of the local feature ranking method in improving classifier performance on unbalanced data. The application itself suggests a new approach to learning social structures from textual data.
|
105 |
Kernel Methods in Computer-Aided Constructive Drug Design. Wong, William Wai Lun. 04 May 2009.
A drug is typically a small molecule that interacts with the binding site of some
target protein. Drug design involves the optimization of this interaction so that the
drug effectively binds with the target protein while not binding with other proteins
(an event that could produce dangerous side effects). Computational drug design
involves the geometric modeling of drug molecules, with the goal of generating
similar molecules that will be more effective drug candidates. It is necessary that
algorithms incorporate strategies to measure molecular similarity by comparing
molecular descriptors that may involve dozens to hundreds of attributes. We use
kernel-based methods to define these measures of similarity. Kernels are general
functions that can be used to formulate similarity comparisons.
The overall goal of this thesis is to develop effective and efficient computational
methods that are reliant on transparent mathematical descriptors of molecules with
applications to affinity prediction, detection of multiple binding modes, and generation
of new drug leads. While in this thesis we derive computational strategies for
the discovery of new drug leads, our approach differs from the traditional ligand-based
approach. We have developed novel procedures to calculate inverse mappings
and subsequently recover the structure of a potential drug lead. The contributions
of this thesis are the following:
1. We propose a vector space model molecular descriptor (VSMMD) based on
a vector space model that is suitable for kernel studies in QSAR modeling.
Our experiments have provided convincing comparative empirical evidence
that our descriptor formulation in conjunction with kernel-based regression
algorithms can provide sufficient discrimination to predict various biological
activities of a molecule with reasonable accuracy.
2. We present a new component selection algorithm KACS (Kernel Alignment
Component Selection) based on kernel alignment for a QSAR study. Kernel
alignment has been developed as a measure of similarity between two kernel
functions. In our algorithm, we refine kernel alignment as an evaluation tool,
using recursive component elimination to eventually select the most important
components for classification. We have demonstrated empirically and proven
theoretically that our algorithm works well for finding the most important
components in different QSAR data sets.
3. We extend the VSMMD in conjunction with a kernel-based clustering algorithm to the prediction of multiple binding modes, a challenging area of research that has previously been studied by means of time-consuming docking
simulations. The results reported in this study provide strong empirical
evidence that our strategy has enough resolving power to distinguish multiple
binding modes through the use of a standard k-means algorithm.
4. We develop a set of reverse engineering strategies for QSAR modeling based
on our VSMMD. These strategies include:
(a) The use of a kernel feature space algorithm to design or modify descriptor
image points in a feature space.
(b) The deployment of a pre-image algorithm to map the newly defined
descriptor image points in the feature space back to the input space of
the descriptors.
(c) The design of a probabilistic strategy to convert new descriptors to meaningful
chemical graph templates.
The most important aspect of these contributions is the presentation of strategies that actually generate the structure of a new drug candidate. While the training set is still used to generate a new image point in the feature space, the reverse engineering strategies just described allow us to develop a new drug candidate that is independent of issues related to probability distribution constraints placed on test set molecules.
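Kernel alignment, used in contribution 2 above as the evaluation tool inside KACS, has a standard closed form; a minimal sketch on synthetic data follows. The recursive component-elimination loop itself is not reproduced.

```python
import numpy as np

def kernel_alignment(K1, K2):
    """Empirical alignment of two Gram matrices:
    <K1, K2>_F / (||K1||_F * ||K2||_F) -- a cosine similarity on matrices.
    """
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

# Align a linear-kernel Gram matrix with the ideal target kernel y y^T.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = np.sign(X[:, 0])                       # labels driven by one component
K = X @ X.T
print(round(kernel_alignment(K, np.outer(y, y)), 3))
```

Components whose removal raises this alignment contribute little to classification, which is the intuition behind recursive component elimination.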
|
108 |
Qualitative Performance Analysis for Large-Scale Scientific Workflows. Buneci, Emma. 30 May 2008.
Today, large-scale scientific applications are both data driven and distributed. To support the scale and inherent distribution of these applications, significant heterogeneous and geographically distributed resources are required over long periods of time to ensure adequate performance. Furthermore, the behavior of these applications depends on a large number of factors related to the application, the system software, the underlying hardware, and other running applications, as well as potential interactions among these factors.

Most Grid application users are primarily concerned with obtaining the result of the application as fast as possible, without worrying about the details involved in monitoring and understanding the factors affecting application performance. In this work, we aim to provide application users with a simple and intuitive performance evaluation mechanism during the execution of their long-running Grid applications or workflows. Our mechanism provides a qualitative and periodic assessment of the application's behavior by informing the user whether the application's performance is expected or unexpected. Furthermore, it can help improve overall application performance by informing and guiding fault-tolerance services when the application exhibits persistent unexpected performance behaviors.

This thesis addresses the hypotheses that, in order to qualitatively assess application behavioral states in long-running scientific Grid applications, (1) it is necessary to extract temporal information from performance time series data, and (2) it is sufficient to extract variance and pattern as specific examples of temporal information. Evidence supporting these hypotheses can lead to the ability to qualitatively assess the overall behavior of the application and, if needed, to offer a most likely diagnosis of the underlying problem.

To test the stated hypotheses, we develop and evaluate a general qualitative performance analysis framework that incorporates (a) techniques from time series analysis and machine learning to extract, from data, structural and temporal features associated with application performance, in order to reach a qualitative interpretation of the application's behavior, and (b) mechanisms and policies to reason over time and across the distributed resource space about the behavior of the application.

Experiments with two scientific applications from meteorology and astronomy, comparing signatures generated from instantaneous values of performance data with those generated from temporal characteristics, support the first hypothesis: temporal information must be extracted from performance time series data to accurately interpret the behavior of these applications. Furthermore, temporal signatures incorporating variance and pattern information reveal distinct characteristics during well-performing versus poorly performing executions. This leads to the framework's accurate classification of instances of similar behaviors, which supports the second hypothesis. The proposed framework's ability to generate a qualitative assessment of performance behavior for scientific applications using temporal information present in performance time series data represents a step toward simplifying and improving the quality of service for Grid applications.
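A coarse version of the temporal signatures described above — per-window variance plus a simple trend pattern — can be sketched as follows; the window size, threshold, and pattern labels are illustrative assumptions, not the framework's actual design.

```python
import numpy as np

def temporal_signature(series, window=50):
    """Summarize a performance time series as a list of
    (variance, pattern) pairs, one per non-overlapping window."""
    sig = []
    for i in range(0, len(series) - window + 1, window):
        w = series[i:i + window]
        slope = np.polyfit(np.arange(window), w, 1)[0]
        pattern = "flat" if abs(slope) < 0.01 else ("rising" if slope > 0 else "falling")
        sig.append((float(np.var(w)), pattern))
    return sig

# A steady run versus one whose throughput drifts downward.
rng = np.random.default_rng(4)
steady = 100 + rng.normal(scale=0.5, size=200)
degrading = steady - 0.05 * np.arange(200)
print(temporal_signature(steady)[:2])      # mostly 'flat' windows
print(temporal_signature(degrading)[:2])   # 'falling' windows
```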
|
109 |
Feature Selection for Value Function Approximation. Taylor, Gavin. January 2011.
The field of reinforcement learning concerns the question of automated action selection given past experiences. As an agent moves through the state space, it must recognize which state choices are best in terms of allowing it to reach its goal. This is quantified with value functions, which evaluate a state and return the sum of rewards the agent can expect to receive from that state. Given a good value function, the agent can choose the actions which maximize this sum of rewards. Value functions are often chosen from a linear space defined by a set of features; this method offers a concise structure, low computational effort, and resistance to overfitting. However, because the number of features is small, the method depends heavily on those few features being expressive and useful, making the selection of these features a core problem. This document discusses this selection.

Aside from a review of the field, contributions include a new understanding of the role approximate models play in value function approximation, leading to new methods for analyzing feature sets in an intuitive way, using both the linear and the related kernelized approximation architectures. Additionally, we present a new method for automatically choosing features during value function approximation which has a bounded approximation error and produces superior policies, even in extremely noisy domains.
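Linear value function approximation, the setting of this thesis, can be sketched by fitting feature weights against sampled returns; the thesis's feature-selection machinery and approximation-error bounds are not reproduced here, and the data are synthetic.

```python
import numpy as np

# V(s) ~ phi(s) . w: fit w by least squares against sampled returns.
rng = np.random.default_rng(5)
n_states, n_features = 200, 4
phi = rng.normal(size=(n_states, n_features))  # features of visited states
w_true = np.array([1.0, -2.0, 0.5, 0.0])       # the 4th feature is irrelevant
returns = phi @ w_true + rng.normal(scale=0.1, size=n_states)

w, *_ = np.linalg.lstsq(phi, returns, rcond=None)
print(np.round(w, 2))   # a near-zero weight flags a feature to drop
```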
|
110 |
Sparse Modeling in Classification, Compression and Detection. Chen, Jihong. 12 July 2004.
The principal focus of this thesis is the exploration of sparse structures in a variety of statistical modelling problems. While more comprehensive models can be used to solve a larger range of problems, their calculation may be ill-posed in most practical instances because informative features in the data are sparse. If this sparse structure can be exploited, the models can often be solved very efficiently.

The thesis is composed of four projects. First, feature sparsity is incorporated to improve the performance of support vector machines when many noise features are present. The second project is an empirical study of how to construct an optimal cascade structure. The third project involves the design of a progressive, rate-distortion optimized shape coder that combines a zero-tree algorithm with a beamlet structure. Finally, the longest-run statistic is applied to detect a filamentary structure in a two-dimensional rectangular region.

The fundamental idea is common to all of these projects: extract an efficient summary from a large amount of data. The main contributions of this work are to develop and implement novel techniques for the efficient solution of several difficult problems that arise in statistical signal and image processing.
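The longest-run statistic from the fourth project can be sketched on a binary grid; the threshold, scan direction, and embedded filament below are illustrative assumptions, not the thesis's experimental setup.

```python
import numpy as np

def longest_run(binary_row):
    """Length of the longest run of 1s in a binary sequence."""
    best = cur = 0
    for v in binary_row:
        cur = cur + 1 if v else 0
        best = max(best, cur)
    return best

# A horizontal filament produces a run far longer than noise would.
rng = np.random.default_rng(6)
grid = rng.random((64, 64)) < 0.3          # background exceedances
grid[32, 10:40] = True                     # embedded filament
print(max(longest_run(row) for row in grid))
```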
|