Global ETD Search

1	Novel stochastic and entropy-based Expectation-Maximisation algorithm for transcription factor binding site motif discovery Kilpatrick, Alastair Morris January 2015 (has links) The discovery of transcription factor binding site (TFBS) motifs remains an important and challenging problem in computational biology. This thesis presents MITSU, a novel algorithm for TFBS motif discovery which exploits stochastic methods as a means of both overcoming optimality limitations in current algorithms and as a framework for incorporating relevant prior knowledge in order to improve results. The current state of the TFBS motif discovery field is surveyed, with a focus on probabilistic algorithms that typically take the promoter regions of coregulated genes as input. A case is made for an approach based on the stochastic Expectation-Maximisation (sEM) algorithm; its position amongst existing probabilistic algorithms for motif discovery is shown. The algorithm developed in this thesis is unique amongst existing motif discovery algorithms in that it combines the sEM algorithm with a derived data set which leads to an improved approximation to the likelihood function. This likelihood function is unconstrained with regard to the distribution of motif occurrences within the input dataset. MITSU also incorporates a novel heuristic to automatically determine TFBS motif width. This heuristic, known as MCOIN, is shown to outperform current methods for determining motif width. MITSU is implemented in Java and an executable is available for download. MITSU is evaluated quantitatively using realistic synthetic data and several collections of previously characterised prokaryotic TFBS motifs. The evaluation demonstrates that MITSU improves on a deterministic EM-based motif discovery algorithm and an alternative sEM-based algorithm, in terms of previously established metrics. The ability of the sEM algorithm to escape stable fixed points of the EM algorithm, which trap deterministic motif discovery algorithms and the ability of MITSU to discover multiple motif occurrences within a single input sequence are also demonstrated. MITSU is validated using previously characterised Alphaproteobacterial motifs, before being applied to motif discovery in uncharacterised Alphaproteobacterial data. A number of novel results from this analysis are presented and motivate two extensions of MITSU: a strategy for the discovery of multiple different motifs within a single dataset and a higher order Markov background model. The effects of incorporating these extensions within MITSU are evaluated quantitatively using previously characterised prokaryotic TFBS motifs and demonstrated using Alphaproteobacterial motifs. Finally, an information-theoretic measure of motif palindromicity is presented and its advantages over existing approaches for discovering palindromic motifs discussed. 572.8
2	Network Exceptions Modelling Using Hidden Markov Model : A Case Study of Ericsson’s DroppedCall Data Li, Shikun January 2014 (has links) In telecommunication, the series of mobile network exceptions is a processwhich exhibits surges and bursts. The bursty part is usually caused by systemmalfunction. Additionally, the mobile network exceptions are often timedependent. A model that successfully captures these aspects will make troubleshootingmuch easier for system engineers. The Hidden Markov Model(HMM) is a good candidate as it provides a mechanism to capture both thetime dependency and the random occurrence of bursts. This thesis focuses onan application of the HMM to mobile network exceptions, with a case study ofEricsson’s Dropped Call data. For estimation purposes, two methods of maximumlikelihood estimation for HMM, namely, EM algorithm and stochasticEM algorithm, are used. Hidden Markov Model Mobile Network Stochastic EM algorithm
3	Estimation of wood fibre length distributions from censored mixture data Svensson, Ingrid January 2007 (has links) <p>The motivating forestry background for this thesis is the need for fast, non-destructive, and cost-efficient methods to estimate fibre length distributions in standing trees in order to evaluate the effect of silvicultural methods and breeding programs on fibre length. The usage of increment cores is a commonly used non-destructive sampling method in forestry. An increment core is a cylindrical wood sample taken with a special borer, and the methods proposed in this thesis are especially developed for data from increment cores. Nevertheless the methods can be used for data from other sampling frames as well, for example for sticks with the shape of an elongated rectangular box.</p><p>This thesis proposes methods to estimate fibre length distributions based on censored mixture data from wood samples. Due to sampling procedures, wood samples contain cut (censored) and uncut observations. Moreover the samples consist not only of the fibres of interest but of other cells (fines) as well. When the cell lengths are determined by an automatic optical fibre-analyser, there is no practical possibility to distinguish between cut and uncut cells or between fines and fibres. Thus the resulting data come from a censored version of a mixture of the fine and fibre length distributions in the tree. The methods proposed in this thesis can handle this lack of information.</p><p>Two parametric methods are proposed to estimate the fine and fibre length distributions in a tree. The first method is based on grouped data. The probabilities that the length of a cell from the sample falls into different length classes are derived, the censoring caused by the sampling frame taken into account. These probabilities are functions of the unknown parameters, and ML estimates are found from the corresponding multinomial model.</p><p>The second method is a stochastic version of the EM algorithm based on the individual length measurements. The method is developed for the case where the distributions of the true lengths of the cells at least partially appearing in the sample belong to exponential families. The cell length distribution in the sample and the conditional distribution of the true length of a cell at least partially appearing in the sample given the length in the sample are derived. Both these distributions are necessary in order to use the stochastic EM algorithm. Consistency and asymptotic normality of the stochastic EM estimates is proved.</p><p>The methods are applied to real data from increment cores taken from Scots pine trees (Pinus sylvestris L.) in Northern Sweden and further evaluated through simulation studies. Both methods work well for sample sizes commonly obtained in practice.</p> censoring fibre length distribution identifiability increment core length bias mixture stochastic EM algorithm Mathematical statistics Matematisk statistik
4	Estimation of wood fibre length distributions from censored mixture data Svensson, Ingrid January 2007 (has links) The motivating forestry background for this thesis is the need for fast, non-destructive, and cost-efficient methods to estimate fibre length distributions in standing trees in order to evaluate the effect of silvicultural methods and breeding programs on fibre length. The usage of increment cores is a commonly used non-destructive sampling method in forestry. An increment core is a cylindrical wood sample taken with a special borer, and the methods proposed in this thesis are especially developed for data from increment cores. Nevertheless the methods can be used for data from other sampling frames as well, for example for sticks with the shape of an elongated rectangular box. This thesis proposes methods to estimate fibre length distributions based on censored mixture data from wood samples. Due to sampling procedures, wood samples contain cut (censored) and uncut observations. Moreover the samples consist not only of the fibres of interest but of other cells (fines) as well. When the cell lengths are determined by an automatic optical fibre-analyser, there is no practical possibility to distinguish between cut and uncut cells or between fines and fibres. Thus the resulting data come from a censored version of a mixture of the fine and fibre length distributions in the tree. The methods proposed in this thesis can handle this lack of information. Two parametric methods are proposed to estimate the fine and fibre length distributions in a tree. The first method is based on grouped data. The probabilities that the length of a cell from the sample falls into different length classes are derived, the censoring caused by the sampling frame taken into account. These probabilities are functions of the unknown parameters, and ML estimates are found from the corresponding multinomial model. The second method is a stochastic version of the EM algorithm based on the individual length measurements. The method is developed for the case where the distributions of the true lengths of the cells at least partially appearing in the sample belong to exponential families. The cell length distribution in the sample and the conditional distribution of the true length of a cell at least partially appearing in the sample given the length in the sample are derived. Both these distributions are necessary in order to use the stochastic EM algorithm. Consistency and asymptotic normality of the stochastic EM estimates is proved. The methods are applied to real data from increment cores taken from Scots pine trees (Pinus sylvestris L.) in Northern Sweden and further evaluated through simulation studies. Both methods work well for sample sizes commonly obtained in practice. censoring fibre length distribution identifiability increment core length bias mixture stochastic EM algorithm Mathematical statistics Matematisk statistik
5	Design and analysis of response selective samples in observational studies Grünewald, Maria January 2011 (has links) Outcome dependent sampling may increase efficiency in observational studies. It is however not always obvious how to sample efficiently, and how to analyze the resulting data without introducing bias. This thesis describes a general framework for efficiency calculations in multistage sampling, with focus on what is sometimes referred to as ascertainment sampling. A method for correcting for the sampling scheme in analysis of ascertainment samples is also presented. Simulation based methods are used to overcome computational issues in both efficiency calculations and analysis of data. / At the time of doctoral defense, the following paper was unpublished and had a status as follows: Paper 1: Submitted. ascertainment missing data outcome dependent sampling response selective samples sequential design stochastic EM algorithm Mathematical statistics Matematisk statistik
6	Methods and algorithms to learn spatio-temporal changes from longitudinal manifold-valued observations / Méthodes et algorithmes pour l’apprentissage de modèles d'évolution spatio-temporels à partir de données longitudinales sur une variété Schiratti, Jean-Baptiste 23 January 2017 (has links) Dans ce manuscrit, nous présentons un modèle à effets mixtes, présenté dans un cadre Bayésien, permettant d'estimer la progression temporelle d'un phénomène biologique à partir d'observations répétées, à valeurs dans une variété Riemannienne, et obtenues pour un individu ou groupe d'individus. La progression est modélisée par des trajectoires continues dans l'espace des observations, que l'on suppose être une variété Riemannienne. La trajectoire moyenne est définie par les effets mixtes du modèle. Pour définir les trajectoires de progression individuelles, nous avons introduit la notion de variation parallèle d'une courbe sur une variété Riemannienne. Pour chaque individu, une trajectoire individuelle est construite en considérant une variation parallèle de la trajectoire moyenne et en reparamétrisant en temps cette parallèle. Les transformations spatio-temporelles sujet-spécifiques, que sont la variation parallèle et la reparamétrisation temporelle sont définnies par les effets aléatoires du modèle et permettent de quantifier les changements de direction et vitesse à laquelle les trajectoires sont parcourues. Le cadre de la géométrie Riemannienne permet d'utiliser ce modèle générique avec n'importe quel type de données définies par des contraintes lisses. Une version stochastique de l'algorithme EM, le Monte Carlo Markov Chains Stochastic Approximation EM (MCMC-SAEM), est utilisé pour estimer les paramètres du modèle au sens du maximum a posteriori. L'utilisation du MCMC-SAEM avec un schéma numérique permettant de calculer le transport parallèle est discutée dans ce manuscrit. De plus, le modèle et le MCMC-SAEM sont validés sur des données synthétiques, ainsi qu'en grande dimension. Enfin, nous des résultats obtenus sur différents jeux de données liés à la santé. / We propose a generic Bayesian mixed-effects model to estimate the temporal progression of a biological phenomenon from manifold-valued observations obtained at multiple time points for an individual or group of individuals. The progression is modeled by continuous trajectories in the space of measurements, which is assumed to be a Riemannian manifold. The group-average trajectory is defined by the fixed effects of the model. To define the individual trajectories, we introduced the notion of « parallel variations » of a curve on a Riemannian manifold. For each individual, the individual trajectory is constructed by considering a parallel variation of the average trajectory and reparametrizing this parallel in time. The subject specific spatiotemporal transformations, namely parallel variation and time reparametrization, are defined by the individual random effects and allow to quantify the changes in direction and pace at which the trajectories are followed. The framework of Riemannian geometry allows the model to be used with any kind of measurements with smooth constraints. A stochastic version of the Expectation-Maximization algorithm, the Monte Carlo Markov Chains Stochastic Approximation EM algorithm (MCMC-SAEM), is used to produce produce maximum a posteriori estimates of the parameters. The use of the MCMC-SAEM together with a numerical scheme for the approximation of parallel transport is discussed. In addition to this, the method is validated on synthetic data and in high-dimensional settings. We also provide experimental results obtained on health data. Géométrie Riemannienne Algorithme EM stochastique Données longitudinales Modélisation statistique Riemannian geometry Stochastic EM algorithm Longitudinal data Statistical modeling

1

Page generated in 0.0723 seconds