11

Pairwise rational kernels applied to metabolic network predictions

Roche Lima, Abiel 06 April 2015 (has links)
Metabolic networks are represented by sets of metabolic pathways. Metabolic pathways are series of chemical reactions in which the product of one reaction serves as the input to another. Many pathways remain incompletely characterized, and in some of them not all enzyme components have been identified. One of the major challenges of computational biology is to obtain better models of metabolic pathways. Existing models depend on gene annotation, so errors accumulate when pathways are predicted from incorrectly annotated genes. Pairwise kernel frameworks have been used in supervised learning approaches, e.g., pairwise Support Vector Machines (SVMs), to predict relationships between pairs of entities. Rational kernels are based on transducers for manipulating sequence data, computing similarity measures between sequences or automata. Rational kernels take advantage of the compact representations and fast algorithms of weighted finite-state transducers. They have been used effectively in problems that handle large amounts of sequence information, such as protein essentiality, natural language processing, and machine translation. We propose a new framework, Pairwise Rational Kernels (PRKs), to manipulate pairs of sequence data as pairwise combinations of rational kernels. We develop experiments using SVMs with PRKs applied to metabolic pathway prediction in order to validate our methods. As a result, we obtain faster execution times with PRKs than with other kernels, while maintaining accurate predictions. Because raw sequence data can be used, the predictor model avoids the errors introduced by incorrect gene annotations. We also obtain a new type of Pairwise Rational Kernel based on automaton and transducer operations; in this case, we define new operations over two pairs of automata to obtain new rational kernels. We also develop experiments to validate these new PRKs for predicting metabolic networks. As a result, we obtain the best execution times when comparing them with other kernels and with the previous PRKs.
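As a rough illustration of the pairwise-kernel idea underlying PRKs (a common symmetrized tensor-product combination over a toy k-mer base kernel, not the thesis's transducer-based construction), a pairwise kernel can be built from any base sequence kernel by combining the four cross-comparisons of two pairs:

    from collections import Counter

    def spectrum_kernel(s, t, k=3):
        # Toy k-mer spectrum kernel, standing in for a rational kernel.
        cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
        ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
        return sum(cs[m] * ct[m] for m in cs)

    def pairwise_kernel(base_k, pair1, pair2):
        # Symmetrized tensor-product combination of a base sequence kernel
        # over two pairs of sequences (e.g., candidate enzyme pairs).
        (a, b), (c, d) = pair1, pair2
        return base_k(a, c) * base_k(b, d) + base_k(a, d) * base_k(b, c)

    # Example: similarity between two hypothetical enzyme pairs
    p1 = ("MKTAYIAKQR", "MADEEKLPPG")
    p2 = ("MKTAYIGKQR", "MADEVKLPPG")
    print(pairwise_kernel(spectrum_kernel, p1, p2))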
12

Scalable Embeddings for Kernel Clustering on MapReduce

Elgohary, Ahmed 14 February 2014 (has links)
There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly used data clustering method, having gained popularity for its effectiveness on various data sets and its ease of implementation on different computing architectures. It assumes, however, that data are available in an attribute-value format, and that each data instance can be represented as a vector in a feature space where the algorithm can be applied. These assumptions are impractical for real data, and they hinder the use of complex data structures in real-world clustering applications. Kernel k-means is an effective clustering method that extends the k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is, however, computationally very complex, as it requires the complete kernel matrix to be calculated and stored. Further, the kernelized nature of the kernel k-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. This thesis defines a family of kernel-based low-dimensional embeddings that allows kernel k-means to be scaled on MapReduce via an efficient and unified parallelization strategy. Three practical low-dimensional embedding methods that adhere to our definition of the embedding family are then proposed. Combining the proposed parallelization strategy with any of the three embedding methods constitutes a complete, scalable, and efficient MapReduce algorithm for kernel k-means. The efficiency and scalability of the presented algorithms are demonstrated analytically and empirically.
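As a rough sketch of the general approach of clustering via a kernel-based low-dimensional embedding (using a Nyström-style approximation for illustration; the thesis's specific embedding methods and MapReduce parallelization are not reproduced here):

    import numpy as np

    def rbf(X, Y, gamma=0.5):
        # Gaussian (RBF) kernel matrix between row-sets X and Y.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def nystrom_embedding(X, n_landmarks=50, gamma=0.5, seed=0):
        # Low-dimensional feature map whose inner products approximate the
        # kernel, so ordinary k-means on the embedding approximates kernel k-means.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=min(n_landmarks, len(X)), replace=False)
        L = X[idx]
        W = rbf(L, L, gamma)                  # landmark-landmark kernel
        C = rbf(X, L, gamma)                  # data-landmark kernel
        U, s, _ = np.linalg.svd(W)
        return C @ U @ np.diag(1.0 / np.sqrt(s + 1e-12))

    # Usage: embed, then run any scalable k-means implementation on the result.
    X = np.random.default_rng(1).normal(size=(500, 10))
    Z = nystrom_embedding(X)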
13

Kernel Methods for Regression

Rossmann, Tom Lennart January 2023 (has links)
Kernel methods are a well-studied approach for addressing regression problems by implicitly mapping input variables into possibly infinite-dimensional feature spaces, particularly in cases where standard linear regression fails to capture non-linear relationships in data. Because kernel methods operate on pairwise similarities between training samples rather than on explicit feature vectors, their cost grows with the number of training samples, whereas standard linear regression scales with the number of features; the choice between the two can therefore be seen as a tradeoff between constraints on the number of features and the number of training samples. Our results show that the Gaussian kernel consistently achieves the lowest mean squared error for the largest considered training size. At the same time, standard ridge regression exhibits a higher mean squared error but a lower fit time. We prove algebraically that the solutions of standard ridge regression and kernel ridge regression are mathematically equivalent.
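For reference, the equivalence alluded to above is a standard identity for ridge regression with a linear kernel (stated here as background, not as the thesis's own derivation):

\[
\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y = X^\top (X X^\top + \lambda I)^{-1} y,
\]

so with the linear kernel matrix $K = X X^\top$ and dual coefficients $\alpha = (K + \lambda I)^{-1} y$, the primal and kernel predictors coincide:

\[
\hat{f}(x) = x^\top \hat{w} = \sum_{i=1}^{n} \alpha_i \, x^\top x_i = \sum_{i=1}^{n} \alpha_i \, k(x, x_i).
\]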
14

Convergence of Kernel Methods for Modeling and Estimation of Dynamical Systems

Guo, Jia 14 January 2021 (has links)
As data-driven modeling becomes more prevalent for representing uncertain dynamical systems, concerns also arise regarding the reliability of these methods. Recent developments in approximation theory provide a new perspective for studying these problems. This dissertation analyzes the convergence of two kernel-based, data-driven modeling methods: the reproducing kernel Hilbert space (RKHS) embedding method and the empirical-analytical Lagrangian (EAL) model. RKHS embedding is a non-parametric extension of the classical adaptive estimation method that embeds the uncertain function in an RKHS, an infinite-dimensional function space. As a result, the original uncertain system of ordinary differential equations is understood as a component of a distributed parameter system. As in the classical approach to adaptive estimation, a novel definition of persistent excitation (PE) is introduced, which is proven to guarantee pointwise convergence of the function estimate over the PE domain. The finite-dimensional approximation of the RKHS embedding method is based on approximant spaces that consist of kernel basis functions centered at samples in the state space. This dissertation shows that an explicit rate of convergence of the RKHS embedding method can be derived by choosing specific types of native spaces. In particular, when the RKHS is continuously embedded in a Sobolev space, the approximation error is proven to decrease at a rate determined by the fill distance of the samples in the PE domain. This dissertation initially studies scalar-valued RKHSs, and subsequently the RKHS embedding method is extended to the estimation of vector-valued uncertain functions. As in the scalar-valued case, the formulation of vector-valued RKHS embedding is proven to be well-posed. The notion of partial PE is also generalized, and it is shown that the rate of convergence derived for the scalar-valued approximation still holds for certain separable operator-valued kernels. The second part of this dissertation studies the EAL modeling method, a hybrid mechanical model for Lagrangian systems with uncertain holonomic constraints. For the singularly perturbed form of the system, the kernel method is applied to approximate a penalty potential that is introduced to approximately enforce the constraints. In this dissertation, the accuracy confidence function is introduced to characterize the constraint violation of an approximate trajectory. We prove that the confidence function can be decomposed into a term representing the bias and another representing the variance. Numerical simulations are conducted to examine the factors that affect the error, including the spectral filtering, the number of samples, and the accumulation of integration error. / Doctor of Philosophy / As data-driven modeling becomes more prevalent for representing uncertain dynamical systems, concerns also arise regarding the reliability of these methods. This dissertation employs recent developments in approximation theory to provide rigorous error analysis for two kernel-based approaches to modeling dynamical systems. The reproducing kernel Hilbert space (RKHS) embedding method is a non-parametric extension of classical adaptive estimation for identifying uncertain functions in nonlinear systems.
By embedding the uncertain function in a properly selected RKHS, the nonlinear state equation in Euclidean space is transformed into a linear evolution in an infinite-dimensional RKHS, where the function estimation error can be characterized directly and precisely. Pointwise convergence of the function estimate is proven over the domain that is persistently excited (PE), and a finite-dimensional approximation can be constructed within an arbitrarily small error bound. The empirical-analytical Lagrangian (EAL) model is developed to approximate the trajectory of Lagrangian systems with an uncertain configuration manifold. Employing the kernel method, a penalty potential is constructed from the observation data to "push" the trajectory towards the actual configuration manifold. A probabilistic error bound is derived for the distance of the approximated trajectory from the actual manifold. The error bound is proven to contain a bias term and a variance term, both of which are determined by the parameters of the kernel method.
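As a generic illustration of the finite-dimensional approximation described above (a standard kernel expansion over sample centers, not the dissertation's exact construction), the function estimate takes the form

\[
\hat{f}(x) \;=\; \sum_{i=1}^{N} \hat{\alpha}_i \, k(x, x_i),
\]

where the centers $x_i$ are samples in the state space, the coefficients $\hat{\alpha}_i$ evolve under the adaptive law, and, when the RKHS embeds continuously into a Sobolev space, the approximation error decays with the fill distance $h_\Omega = \sup_{x \in \Omega} \min_i \|x - x_i\|$ of the centers over the PE domain $\Omega$.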
15

First passage times dynamics in Markov Models with applications to HMM : induction, sequence classification and graph mining

Callut, Jérôme 12 October 2007 (has links)
Sequential data are encountered in many contexts of everyday life and in numerous scientific applications. They can, for instance, be SMS messages typed on mobile phones, web pages reached by following hyperlinks, system logs, or DNA samples, to name a few. Generating such data defines a sequential process. This thesis is concerned with the modeling of sequential processes from observed data. Sequential processes are here modeled using probabilistic models, namely discrete-time Markov chains (MCs), Hidden Markov Models (HMMs), and Partially Observable Markov Models (POMMs). Such models can answer questions such as (i) Which event will occur a couple of steps later? (ii) How many times will a particular event occur? and (iii) When does an event occur for the first time given the current situation? The last question is of particular interest in this thesis and is mathematically formalized under the notion of First Passage Times (FPT) dynamics of a process. The FPT dynamics is used here to solve three problems related to machine learning and data mining: (i) HMM/POMM induction, (ii) supervised sequence classification, and (iii) relevant subgraph mining. Firstly, we propose a novel algorithm, called POMMStruct, for learning the structure and the parameters of POMMs to best fit the empirical FPT dynamics observed in the samples. Experimental results illustrate the benefit of POMMStruct in the modeling of sequential processes with complex temporal dynamics when compared to classical induction approaches. Our second contribution is concerned with the classification of sequences. We propose to model the FPT in sequences with discrete phase-type (PH) distributions using a novel algorithm called PHit. These distributions are used to devise a new string kernel and a probabilistic classifier. Experimental results on biological data show that our methods provide state-of-the-art classification results. Finally, we address the problem of mining subgraphs that are relevant for modeling the relationships between selected nodes of interest in large graphs such as biological networks. A new relevance criterion based on particular random walks called K-walks is proposed, as well as efficient algorithms to compute this criterion. Experiments on the KEGG metabolic network and on randomly generated graphs are presented.
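For context on the first-passage-time notion that drives this work, a minimal sketch (standard Markov-chain arithmetic, not the thesis's algorithms) of computing expected first passage times to a target state from a transition matrix:

    import numpy as np

    def mean_first_passage_times(P, target):
        # Expected number of steps to first reach `target` from every state,
        # solving (I - Q) m = 1, where Q is P with the target row/column removed.
        n = P.shape[0]
        others = [i for i in range(n) if i != target]
        Q = P[np.ix_(others, others)]
        m = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
        result = np.zeros(n)
        result[others] = m          # first passage time from the target itself is 0
        return result

    # Toy 3-state chain
    P = np.array([[0.5, 0.4, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.3, 0.6]])
    print(mean_first_passage_times(P, target=2))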
16

Gaussian Process Kernels for Cross-Spectrum Analysis in Electrophysiological Time Series

Ulrich, Kyle Richard January 2016 (has links)
Multi-output Gaussian processes provide a convenient framework for multi-task problems. An illustrative and motivating example of a multi-task problem is multi-region electrophysiological time-series data, where experimentalists are interested in both power and phase coherence between channels. Recently, the spectral mixture (SM) kernel was proposed to model the spectral density of a single task in a Gaussian process framework. This work develops a novel covariance kernel for multiple outputs, called the cross-spectral mixture (CSM) kernel. This new, flexible kernel represents both the power and phase relationship between multiple observation channels. The expressive capabilities of the CSM kernel are demonstrated through implementation of 1) a Bayesian hidden Markov model, where the emission distribution is a multi-output Gaussian process with a CSM covariance kernel, and 2) a Gaussian process factor analysis model, where factor scores represent the utilization of cross-spectral neural circuits. Results are presented for measured multi-region electrophysiological data. / Dissertation
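For reference, the single-output spectral mixture (SM) kernel that the CSM kernel generalizes is commonly written (following Wilson and Adams, 2013; the CSM extension itself is not reproduced here) as

\[
k_{\mathrm{SM}}(\tau) \;=\; \sum_{q=1}^{Q} w_q \, \exp\!\left(-2\pi^2 \tau^2 v_q\right) \cos\!\left(2\pi \tau \mu_q\right),
\]

where each component $q$ has weight $w_q$, spectral mean $\mu_q$ (a peak frequency), and spectral variance $v_q$. Modeling cross-spectra between channels additionally requires phase (delay) relationships, which is what the CSM construction adds.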
17

Statistical Learning and Model Criticism for Networks and Point Processes

Jiasen Yang (7027331) 16 August 2019 (has links)
Networks and point processes provide flexible tools for representing and modeling complex dependencies in data arising from various social and physical domains. Graphs, or networks, encode relational dependencies between entities, while point processes characterize temporal or spatial interactions among events.

In the first part of this dissertation, we consider dynamic network data (such as communication networks) in which links connecting pairs of nodes appear continuously over time. We propose latent space point process models to capture two different aspects of the data: (i) communication occurs at a higher rate between individuals with similar latent attributes (i.e., homophily); and (ii) individuals tend to reciprocate communications from others, but in a varied manner. Our framework marries ideas from point process models, including Poisson and Hawkes processes, with ideas from latent space models of static networks. We evaluate our models on several real-world datasets and show that a dual latent space model, which accounts for heterogeneity in both homophily and reciprocity, significantly improves performance in various link prediction and network embedding tasks.

In the second part of this dissertation, we develop nonparametric goodness-of-fit tests for discrete distributions and point processes that contain intractable normalization constants, providing the first generally applicable and computationally feasible approaches under those circumstances. Specifically, we propose and characterize Stein operators for discrete distributions, and construct a general Stein operator for point processes using the Papangelou conditional intensity function. Based on the proposed Stein operators, we establish kernelized Stein discrepancy measures for discrete distributions and point processes, which enable us to develop nonparametric goodness-of-fit tests for un-normalized density/intensity functions. We apply the kernelized Stein discrepancy tests to discrete distributions (including network models) as well as temporal and spatial point processes. Our experiments demonstrate that the proposed tests typically outperform two-sample tests based on the maximum mean discrepancy, which, unlike our goodness-of-fit tests, assume the availability of exact samples from the null model.
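For background on the point-process building blocks mentioned above, a minimal sketch (a textbook exponential-kernel Hawkes intensity, not the dissertation's latent space models) of a self-exciting intensity that captures reciprocity-like bursts of events:

    import numpy as np

    def hawkes_intensity(t, event_times, mu=0.2, alpha=0.8, beta=1.5):
        # lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i)):
        # a baseline rate plus an excitation term that decays after each event.
        past = np.asarray([s for s in event_times if s < t])
        return mu + alpha * np.exp(-beta * (t - past)).sum()

    events = [1.0, 1.2, 3.5]
    print(hawkes_intensity(4.0, events))   # elevated shortly after a burst of events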
18

Structure in machine learning : graphical models and Monte Carlo methods

Rowland, Mark January 2018 (has links)
This thesis is concerned with two main areas: approximate inference in discrete graphical models, and random embeddings for dimensionality reduction and approximate inference in kernel methods. Approximate inference is a fundamental problem in machine learning and statistics, with strong connections to other domains such as theoretical computer science. At the same time, there has often been a gap between the success of many algorithms in this area in practice, and what can be explained by theory; thus, an important research effort is to bridge this gap. Random embeddings for dimensionality reduction and approximate inference have led to great improvements in scalability of a wide variety of methods in machine learning. In recent years, there has been much work on how the stochasticity introduced by these approaches can be better controlled, and what further computational improvements can be made.

In the first part of this thesis, we study approximate inference algorithms for discrete graphical models. Firstly, we consider linear programming methods for approximate MAP inference, and develop our understanding of conditions for exactness of these approximations. Such guarantees of exactness are typically based on either structural restrictions on the underlying graph corresponding to the model (such as low treewidth), or restrictions on the types of potential functions that may be present in the model (such as log-supermodularity). We contribute two new classes of exactness guarantees: the first of these takes the form of particular hybrid restrictions on a combination of graph structure and potential types, whilst the second is given by excluding particular substructures from the underlying graph, via graph minor theory. We also study a particular family of transformation methods of graphical models, uprooting and rerooting, and their effect on approximate MAP and marginal inference methods. We prove new theoretical results on the behaviour of particular approximate inference methods under these transformations, in particular showing that the triplet relaxation of the marginal polytope is unique in being universally rooted. We also introduce a heuristic which quickly picks a rerooting, and demonstrate benefits empirically on models over several graph topologies.

In the second part of this thesis, we study Monte Carlo methods for both linear dimensionality reduction and approximate inference in kernel machines. We prove the statistical benefit of coupling Monte Carlo samples to be almost-surely orthogonal in a variety of contexts, and study fast approximate methods of inducing this coupling. A surprising result is that these approximate methods can simultaneously offer improved statistical benefits, time complexity, and space complexity over i.i.d. Monte Carlo samples. We evaluate our methods on a variety of datasets, directly studying their effects on approximate kernel evaluation, as well as on downstream tasks such as Gaussian process regression.
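As a rough illustration of the coupled (orthogonal) Monte Carlo features discussed in the second part (standard random Fourier features with an orthogonality coupling, assuming a Gaussian kernel; the thesis's specific constructions and analyses are not reproduced here):

    import numpy as np

    def orf_frequencies(D, d, sigma=1.0, rng=None):
        # Orthogonal random features frequency matrix W (D x d) for the Gaussian
        # kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
        rng = rng or np.random.default_rng(0)
        blocks, remaining = [], D
        while remaining > 0:
            G = rng.normal(size=(d, d))
            Q, _ = np.linalg.qr(G)                          # coupled orthogonal directions
            norms = np.linalg.norm(rng.normal(size=(d, d)), axis=1)
            blocks.append((Q * norms[:, None])[:remaining])  # restore Gaussian row norms
            remaining -= d
        return np.vstack(blocks) / sigma

    def rff_map(X, W, rng=None):
        # Random Fourier feature map z(x) with z(x)^T z(y) approximating k(x, y).
        rng = rng or np.random.default_rng(1)
        b = rng.uniform(0, 2 * np.pi, size=W.shape[0])
        return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

    X = np.random.default_rng(2).normal(size=(5, 8))
    W = orf_frequencies(D=64, d=8)
    Z = rff_map(X, W)
    print(Z @ Z.T)   # approximates the Gaussian kernel matrix of X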
19

Kernel Methods in Computer-Aided Constructive Drug Design

Wong, William Wai Lun 04 May 2009 (has links)
A drug is typically a small molecule that interacts with the binding site of some target protein. Drug design involves the optimization of this interaction so that the drug effectively binds with the target protein while not binding with other proteins (an event that could produce dangerous side effects). Computational drug design involves the geometric modeling of drug molecules, with the goal of generating similar molecules that will be more effective drug candidates. It is necessary that algorithms incorporate strategies to measure molecular similarity by comparing molecular descriptors that may involve dozens to hundreds of attributes. We use kernel-based methods to define these measures of similarity. Kernels are general functions that can be used to formulate similarity comparisons. The overall goal of this thesis is to develop effective and efficient computational methods that rely on transparent mathematical descriptors of molecules, with applications to affinity prediction, detection of multiple binding modes, and generation of new drug leads. While in this thesis we derive computational strategies for the discovery of new drug leads, our approach differs from the traditional ligand-based approach. We have developed novel procedures to calculate inverse mappings and subsequently recover the structure of a potential drug lead. The contributions of this thesis are the following:

1. We propose a vector space model molecular descriptor (VSMMD) based on a vector space model that is suitable for kernel studies in QSAR modeling. Our experiments provide convincing comparative empirical evidence that our descriptor formulation, in conjunction with kernel-based regression algorithms, can provide sufficient discrimination to predict various biological activities of a molecule with reasonable accuracy.

2. We present a new component selection algorithm, KACS (Kernel Alignment Component Selection), based on kernel alignment for QSAR studies. Kernel alignment has been developed as a measure of similarity between two kernel functions. In our algorithm, we refine kernel alignment as an evaluation tool, using recursive component elimination to eventually select the most important components for classification. We demonstrate empirically and prove theoretically that our algorithm works well for finding the most important components in different QSAR data sets.

3. We extend the VSMMD, in conjunction with a kernel-based clustering algorithm, to the prediction of multiple binding modes, a challenging area of research that has previously been studied by means of time-consuming docking simulations. The results reported in this study provide strong empirical evidence that our strategy has enough resolving power to distinguish multiple binding modes through the use of a standard k-means algorithm.

4. We develop a set of reverse engineering strategies for QSAR modeling based on our VSMMD. These strategies include: (a) the use of a kernel feature space algorithm to design or modify descriptor image points in a feature space; (b) the deployment of a pre-image algorithm to map the newly defined descriptor image points in the feature space back to the input space of the descriptors; and (c) the design of a probabilistic strategy to convert new descriptors to meaningful chemical graph templates.

The most important aspect of these contributions is the presentation of strategies that actually generate the structure of a new drug candidate. While the training set is still used to generate a new image point in the feature space, the reverse engineering strategies just described allow us to develop a new drug candidate that is independent of issues related to probability distribution constraints placed on test set molecules.
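For reference, the kernel alignment measure on which KACS builds is a short computation (the standard empirical alignment of Cristianini et al.; a sketch rather than the thesis's selection algorithm):

    import numpy as np

    def kernel_alignment(K1, K2):
        # Empirical alignment between two kernel matrices: the cosine of the
        # angle between them under the Frobenius inner product.
        return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

    # Alignment of a kernel with the ideal target kernel y y^T measures how well
    # it matches the labels; a KACS-style procedure would drop components that
    # reduce this score.
    y = np.array([1., 1., -1., -1.])
    K_ideal = np.outer(y, y)
    X = np.random.default_rng(0).normal(size=(4, 6))
    K = X @ X.T
    print(kernel_alignment(K, K_ideal))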
