Global ETD Search

11	A new normalized EM algorithm for clustering gene expression data Nguyen, Phuong Minh, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW January 2008 Microarray data clustering represents a basic exploratory tool to find groups of genes exhibiting similar expression patterns or to detect relevant classes of molecular subtypes. Among a wide range of clustering approaches proposed and applied in the gene expression community to analyze microarray data, mixture model-based clustering has received much attention owing to its sound statistical framework and its flexibility in data modeling. However, clustering algorithms following the model-based framework suffer from two serious drawbacks. The first drawback is that the performance of these algorithms critically depends on the starting values for their iterative clustering procedures. The second is that they are not capable of working directly with very high dimensional data sets in the sample clustering problem, where the dimension of the data is in the hundreds or thousands. The thesis focuses on these two challenges and includes the following contributions: First, the thesis introduces the statistical model of the proposed normalized Expectation Maximization (EM) algorithm, followed by an analysis of its clustering performance on a number of real microarray data sets. The normalized EM is stable even with random initializations for its EM iterative procedure. The stability of the normalized EM is demonstrated through its performance comparison with other related clustering algorithms. Furthermore, the normalized EM is the first mixture model-based clustering approach to be capable of working directly with very high dimensional microarray data sets in the sample clustering problem, where the number of genes is much larger than the number of samples. 
This advantage of the normalized EM is illustrated through a comparison with the unnormalized EM (the conventional EM algorithm for Gaussian mixture model-based clustering). In addition, for experimental microarray data sets in which class labels of the data points are available, an interesting property of the convergence speed of the normalized EM with respect to the radius of the hypersphere in its corresponding statistical model is uncovered. Second, to support the performance comparison of different clusterings, a new internal index is derived using fundamental concepts from information theory. This index allows the comparison of clustering approaches in which the closeness between data points is evaluated by their cosine similarity. The method for deriving this internal index can be used to design other new indexes for comparing clustering approaches that employ a common similarity measure. Clustering. Expectation Maximization (EM) algorithm. Microarray data.
12	Evolutionary Algorithms for Model-Based Clustering Kampo, Regina S. January 2021 Cluster analysis is used to detect underlying group structure in data. Model-based clustering is the process of performing cluster analysis by fitting finite mixture models. However, parameter estimation in mixture model-based approaches to clustering is notoriously difficult. To this end, this thesis focuses on the development of evolutionary computation as an alternative technique for parameter estimation in mixture models. An evolutionary algorithm is proposed and illustrated on the well-established Gaussian mixture model with missing values. Next, the family of Gaussian parsimonious clustering models is considered, and an evolutionary algorithm is developed to estimate the parameters. Finally, an evolutionary algorithm is developed for latent Gaussian mixture models to facilitate the flexible clustering of high-dimensional data. For all models and families of models considered in this thesis, the proposed algorithms used for model-fitting and parameter estimation are presented, and their clustering performance is illustrated using real and simulated data sets. This thesis concludes with a discussion and suggestions for future work. / Dissertation / Doctor of Philosophy (PhD) Evolutionary Algorithm Model-based Clustering EM Algorithm
13	Computation of Weights for Probabilistic Record Linkage Using the EM Algorithm Bauman, G. John 29 June 2006 Record linkage is the process of combining information about a single individual from two or more records. Probabilistic record linkage gives weights to each field that is compared. The decision of whether the records should be linked is then determined by the sum of the weights, or “Score”, over all fields compared. Using methods similar to the simple versus simple most powerful test, an optimal record linkage decision rule can be established to minimize the number of unlinked records when the probabilities of false positive and false negative errors are specified. The weights needed for probabilistic record linkage necessitate linking a “training” subset of records for the computations. This is not practical in many settings, as hand matching requires a considerable time investment. In 1989, Matthew A. Jaro demonstrated how the Expectation-Maximization, or EM, algorithm could be used to compute the needed weights when fields have Binomial matching possibilities. This project applies this method of using the EM algorithm to calculate weights for head-of-household records from the 1910 and 1920 Censuses for Ascension Parish of Louisiana and Church and County Records from Perquimans County, North Carolina. This project also extends Jaro's EM algorithm to a Multinomial framework. The performance of the EM algorithm for calculating weights will be assessed by comparing the computed weights to weights computed by clerical matching. Simulations will also be conducted to investigate the sensitivity of the algorithm to the total number of record pairs, the number of fields with missing entries, the starting values of estimated probabilities, and the convergence epsilon value. record linkage EM algorithm Statistics and Probability
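The weight computation described in the record above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes each field yields a binary agree/disagree outcome, that fields are conditionally independent given match status, and that weights are base-2 log ratios of the estimated agreement probabilities. All names (`simulate_pairs`, `em_linkage_weights`), the starting values, and the simulated data are hypothetical.

```python
import math
import random


def simulate_pairs(n, p, m, u, seed=0):
    """Draw n binary agreement vectors; a pair is a true match with probability p."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        pairs.append(tuple(int(rng.random() < q) for q in probs))
    return pairs


def em_linkage_weights(pairs, n_iter=200, tol=1e-9):
    """Estimate match proportion p and per-field agreement probabilities
    m (among matches) and u (among non-matches) without labelled pairs."""
    n_fields = len(pairs[0])
    p, m, u = 0.1, [0.9] * n_fields, [0.1] * n_fields  # assumed starting values
    eps = 1e-6  # keeps probabilities away from 0 and 1
    history = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g, loglik = [], 0.0
        for gamma in pairs:
            like_m, like_u = p, 1.0 - p
            for k, agrees in enumerate(gamma):
                like_m *= m[k] if agrees else 1.0 - m[k]
                like_u *= u[k] if agrees else 1.0 - u[k]
            g.append(like_m / (like_m + like_u))
            loglik += math.log(like_m + like_u)
        history.append(loglik)
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
        # M-step: weighted proportions of agreement in each latent class.
        total_m = sum(g)
        total_u = len(pairs) - total_m
        p = total_m / len(pairs)
        for k in range(n_fields):
            agree_m = sum(gi for gi, gamma in zip(g, pairs) if gamma[k])
            agree_u = sum(1.0 - gi for gi, gamma in zip(g, pairs) if gamma[k])
            m[k] = min(max(agree_m / total_m, eps), 1.0 - eps)
            u[k] = min(max(agree_u / total_u, eps), 1.0 - eps)
    # Agreement and disagreement weight for each field.
    weights = [
        (math.log2(m[k] / u[k]), math.log2((1.0 - m[k]) / (1.0 - u[k])))
        for k in range(n_fields)
    ]
    return p, m, u, weights, history


if __name__ == "__main__":
    demo = simulate_pairs(5000, 0.1, [0.9, 0.85, 0.95], [0.1, 0.2, 0.05])
    p_hat, m_hat, u_hat, w, _ = em_linkage_weights(demo)
    print(p_hat, m_hat, u_hat, w)
```

A pair's score would then be the sum of the agreement weight for each field that agrees and the disagreement weight for each field that does not. The starting values matter in practice, which is one of the sensitivities the abstract says the simulations investigate.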
14	Estimating the Proportion of True Null Hypotheses in Multiple Testing Problems Oyeniran, Oluyemi 18 July 2016 No description available. Statistics Multiple Comparisons Mixture Model EM Algorithm
15	Parameter estimation of queueing system using mixture model and the EM algorithm Li, Hang 02 December 2016 Parameter estimation is a long-standing topic in queueing systems and has attracted considerable attention from both academia and industry. In this thesis, we design a parameter estimation framework for a tandem queueing system that collects end-to-end measurement data and utilizes the finite mixture model for maximum likelihood (ML) estimation. The likelihood equations produced by ML are then solved by the iterative expectation-maximization (EM) algorithm, a powerful algorithm for parameter estimation in scenarios involving complicated distributions. We carry out a set of experiments with different parameter settings to test the performance of the proposed framework. Experimental results show that our method performs well for tandem queueing systems in which the constituent nodes' service times follow distributions from the exponential family. Under this framework, both the Newton-Raphson (NR) algorithm and the EM algorithm could be applied. The EM algorithm, however, is recommended due to its ease of implementation and lower computational overhead. / Graduate / hangli@uvic.ca EM algorithm Queueing Theory Mixture Model Tandem Queueing System
16	Novel stochastic and entropy-based Expectation-Maximisation algorithm for transcription factor binding site motif discovery Kilpatrick, Alastair Morris January 2015 The discovery of transcription factor binding site (TFBS) motifs remains an important and challenging problem in computational biology. This thesis presents MITSU, a novel algorithm for TFBS motif discovery which exploits stochastic methods both as a means of overcoming optimality limitations in current algorithms and as a framework for incorporating relevant prior knowledge in order to improve results. The current state of the TFBS motif discovery field is surveyed, with a focus on probabilistic algorithms that typically take the promoter regions of coregulated genes as input. A case is made for an approach based on the stochastic Expectation-Maximisation (sEM) algorithm; its position amongst existing probabilistic algorithms for motif discovery is shown. The algorithm developed in this thesis is unique amongst existing motif discovery algorithms in that it combines the sEM algorithm with a derived data set which leads to an improved approximation to the likelihood function. This likelihood function is unconstrained with regard to the distribution of motif occurrences within the input dataset. MITSU also incorporates a novel heuristic to automatically determine TFBS motif width. This heuristic, known as MCOIN, is shown to outperform current methods for determining motif width. MITSU is implemented in Java and an executable is available for download. MITSU is evaluated quantitatively using realistic synthetic data and several collections of previously characterised prokaryotic TFBS motifs. The evaluation demonstrates that MITSU improves on a deterministic EM-based motif discovery algorithm and an alternative sEM-based algorithm, in terms of previously established metrics. 
The ability of the sEM algorithm to escape stable fixed points of the EM algorithm, which trap deterministic motif discovery algorithms, and the ability of MITSU to discover multiple motif occurrences within a single input sequence are also demonstrated. MITSU is validated using previously characterised Alphaproteobacterial motifs, before being applied to motif discovery in uncharacterised Alphaproteobacterial data. A number of novel results from this analysis are presented and motivate two extensions of MITSU: a strategy for the discovery of multiple different motifs within a single dataset and a higher order Markov background model. The effects of incorporating these extensions within MITSU are evaluated quantitatively using previously characterised prokaryotic TFBS motifs and demonstrated using Alphaproteobacterial motifs. Finally, an information-theoretic measure of motif palindromicity is presented and its advantages over existing approaches for discovering palindromic motifs are discussed. 572.8
17	Semiparametric mixture models Xiang, Sijia January 1900 Doctor of Philosophy / Department of Statistics / Weixin Yao / This dissertation consists of three parts that are related to semiparametric mixture models. In Part I, we construct the minimum profile Hellinger distance (MPHD) estimator for a class of semiparametric mixture models where one component has known distribution with possibly unknown parameters while the other component density and the mixing proportion are unknown. Such semiparametric mixture models have often been used in biology and the sequential clustering algorithm. In Part II, we propose a new class of semiparametric mixture of regression models, where the mixing proportions and variances are constants, but the component regression functions are smooth functions of a covariate. A one-step backfitting estimate and two EM-type algorithms are proposed to achieve the optimal convergence rate for both the global parameters and nonparametric regression functions. We derive the asymptotic property of the proposed estimates and show that both proposed EM-type algorithms preserve the asymptotic ascent property. In Part III, we apply the idea of the single-index model to the mixture of regression models and propose three new classes of models: the mixture of single-index models (MSIM), the mixture of regression models with varying single-index proportions (MRSIP), and the mixture of regression models with varying single-index proportions and variances (MRSIPV). Backfitting estimates and the corresponding algorithms are proposed for the new models to achieve the optimal convergence rate for both the parameters and the nonparametric functions. We show that the nonparametric functions can be estimated as if the parameters were known and the parameters can be estimated with the same rate of convergence, $n^{-1/2}$, that is achieved in a parametric model. 
Semiparametric mixture models Kernel regression EM algorithm Statistics (0463)
18	On Convergence Properties of the EM Algorithm for Gaussian Mixtures Jordan, Michael, Xu, Lei 21 April 1995 We build up the mathematical connection between the "Expectation-Maximization" (EM) algorithm and gradient-based approaches for maximum likelihood learning of finite Gaussian mixtures. We show that the EM step in parameter space is obtained from the gradient via a projection matrix $P$, and we provide an explicit expression for the matrix. We then analyze the convergence of EM in terms of special properties of $P$ and provide new results analyzing the effect that $P$ has on the likelihood surface. Based on these mathematical results, we present a comparative discussion of the advantages and disadvantages of EM and other algorithms for the learning of Gaussian mixture models. learning neural networks EM algorithm clustering mixture models statistics
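The claim in the record above, that an EM step equals the gradient of the log-likelihood multiplied by a matrix $P$, can be checked numerically in a small case. The sketch below is an illustration under stated assumptions, not the report's general result: it uses a one-dimensional mixture, where rearranging the standard EM updates gives the scalar $P = \sigma_j^2 / \sum_t h_j(t)$ for each mean and $P = (\mathrm{diag}(\alpha) - \alpha\alpha^T)/N$ for the mixing proportions. Function names and the toy data are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, alphas, mus, variances):
    return sum(
        math.log(sum(a * normal_pdf(x, mu, v) for a, mu, v in zip(alphas, mus, variances)))
        for x in xs
    )


def responsibilities(xs, alphas, mus, variances):
    """h[t][j]: posterior probability that component j generated xs[t]."""
    h = []
    for x in xs:
        joint = [a * normal_pdf(x, mu, v) for a, mu, v in zip(alphas, mus, variances)]
        total = sum(joint)
        h.append([w / total for w in joint])
    return h


def em_step(xs, alphas, mus, variances):
    """One standard EM update for a 1-D Gaussian mixture."""
    h = responsibilities(xs, alphas, mus, variances)
    n, k = len(xs), len(alphas)
    nj = [sum(h[t][j] for t in range(n)) for j in range(k)]
    new_alphas = [nj[j] / n for j in range(k)]
    new_mus = [sum(h[t][j] * xs[t] for t in range(n)) / nj[j] for j in range(k)]
    new_vars = [
        sum(h[t][j] * (xs[t] - new_mus[j]) ** 2 for t in range(n)) / nj[j]
        for j in range(k)
    ]
    return new_alphas, new_mus, new_vars


def projected_gradient_step(xs, alphas, mus, variances):
    """The same alpha and mu updates written as theta + P * gradient."""
    h = responsibilities(xs, alphas, mus, variances)
    n, k = len(xs), len(alphas)
    nj = [sum(h[t][j] for t in range(n)) for j in range(k)]
    # Gradient of the log-likelihood with respect to each mean.
    grad_mu = [
        sum(h[t][j] * (xs[t] - mus[j]) for t in range(n)) / variances[j]
        for j in range(k)
    ]
    new_mus = [mus[j] + (variances[j] / nj[j]) * grad_mu[j] for j in range(k)]
    # Gradient with respect to each mixing proportion, then P_A times gradient.
    grad_alpha = [nj[j] / alphas[j] for j in range(k)]
    inner = sum(alphas[i] * grad_alpha[i] for i in range(k))
    new_alphas = [
        alphas[j] + (alphas[j] * grad_alpha[j] - alphas[j] * inner) / n
        for j in range(k)
    ]
    return new_alphas, new_mus


if __name__ == "__main__":
    data = [-2.1, -1.9, -2.4, -1.5, 0.2, 1.8, 2.2, 2.0, 2.6, 1.7]
    start = ([0.4, 0.6], [-1.0, 1.0], [1.0, 2.0])
    print(em_step(data, *start)[:2])
    print(projected_gradient_step(data, *start))
```

The two functions should agree to floating-point precision on any data and starting point, since one is an algebraic rearrangement of the other; the variance update is left in its usual EM form for brevity.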
19	Essays in Dynamic Macroeconometrics Bañbura, Marta 26 June 2009 The thesis contains four essays covering topics in the field of macroeconomic forecasting. The first two chapters consider factor models in the context of real-time forecasting with many indicators. Using a large number of predictors offers an opportunity to exploit a rich information set and is also considered to be a more robust approach in the presence of instabilities. On the other hand, it poses a challenge of how to extract the relevant information in a parsimonious way. Recent research shows that factor models provide an answer to this problem. The fundamental assumption underlying those models is that most of the co-movement of the variables in a given dataset can be summarized by only a few latent variables, the factors. This assumption seems to be warranted in the case of macroeconomic and financial data. Important theoretical foundations for large factor models were laid by Forni, Hallin, Lippi and Reichlin (2000) and Stock and Watson (2002). Since then, different versions of factor models have been applied for forecasting, structural analysis or construction of economic activity indicators. Recently, Giannone, Reichlin and Small (2008) have used a factor model to produce projections of the U.S. GDP in the presence of a real-time data flow. They propose a framework that can cope with large datasets characterised by staggered and nonsynchronous data releases (sometimes referred to as “ragged edge”). This is relevant as, in practice, important indicators like GDP are released with a substantial delay and, in the meantime, more timely variables can be used to assess the current state of the economy. 
The first chapter of the thesis entitled “A look into the factor model black box: publication lags and the role of hard and soft data in forecasting GDP” is based on joint work with Gerhard Rünstler and applies the framework of Giannone, Reichlin and Small (2008) to the case of the euro area. In particular, we are interested in the role of “soft” and “hard” data in the GDP forecast and how it is related to their timeliness. The soft data include surveys and financial indicators and reflect market expectations. They are usually promptly available. In contrast, the hard indicators on real activity measure directly certain components of GDP (e.g. industrial production) and are published with a significant delay. We propose several measures in order to assess the role of individual series or groups of series in the forecast while taking into account their respective publication lags. We find that surveys and financial data contain important information beyond the monthly real activity measures for the GDP forecasts, once their timeliness is properly accounted for. The second chapter entitled “Maximum likelihood estimation of large factor model on datasets with arbitrary pattern of missing data” is based on joint work with Michele Modugno. It proposes a methodology for the estimation of factor models on large cross-sections with a general pattern of missing data. In contrast to Giannone, Reichlin and Small (2008), we can handle datasets that are not only characterised by a “ragged edge”, but can include e.g. mixed frequency or short history indicators. The latter is particularly relevant for the euro area or other young economies, for which many series have been compiled only recently. We adopt the maximum likelihood approach which, apart from the flexibility with regard to the pattern of missing data, is also more efficient and allows imposing restrictions on the parameters. Applied for small factor models by e.g. 
Geweke (1977), Sargent and Sims (1977) or Watson and Engle (1983), it has been shown by Doz, Giannone and Reichlin (2006) to be consistent, robust and computationally feasible also in the case of large cross-sections. To circumvent the computational complexity of a direct likelihood maximisation in the case of large cross-section, Doz, Giannone and Reichlin (2006) propose to use the iterative Expectation-Maximisation (EM) algorithm (used for the small model by Watson and Engle, 1983). Our contribution is to modify the EM steps to the case of missing data and to show how to augment the model, in order to account for the serial correlation of the idiosyncratic component. In addition, we derive the link between the unexpected part of a data release and the forecast revision and illustrate how this can be used to understand the sources of the latter in the case of simultaneous releases. We use this methodology for short-term forecasting and backdating of the euro area GDP on the basis of a large panel of monthly and quarterly data. In particular, we are able to examine the effect of quarterly variables and short history monthly series like the Purchasing Managers' surveys on the forecast. The third chapter is entitled “Large Bayesian VARs” and is based on joint work with Domenico Giannone and Lucrezia Reichlin. It proposes an alternative approach to factor models for dealing with the curse of dimensionality, namely Bayesian shrinkage. We study Vector Autoregressions (VARs) which have the advantage over factor models in that they allow structural analysis in a natural way. We consider systems including more than 100 variables. This is the first application in the literature to estimate a VAR of this size. Apart from the forecast considerations, as argued above, the size of the information set can be also relevant for the structural analysis, see e.g. 
Bernanke, Boivin and Eliasz (2005), Giannone and Reichlin (2006) or Christiano, Eichenbaum and Evans (1999) for a discussion. In addition, many problems may require the study of the dynamics of many variables: many countries, sectors or regions. While we use standard priors as proposed by Litterman (1986), an important novelty of the work is that we set the overall tightness of the prior in relation to the model size. In this we follow the recommendation by De Mol, Giannone and Reichlin (2008) who study the case of Bayesian regressions. They show that with increasing size of the model one should shrink more to avoid overfitting, but when data are collinear one is still able to extract the relevant sample information. We apply this principle in the case of VARs. We compare the large model with smaller systems in terms of forecasting performance and structural analysis of the effect of a monetary policy shock. The results show that a standard Bayesian VAR model is an appropriate tool for large panels of data once the degree of shrinkage is set in relation to the model size. The fourth chapter entitled “Forecasting euro area inflation with wavelets: extracting information from real activity and money at different scales” proposes a framework for exploiting relationships between variables at different frequency bands in the context of forecasting. This work is motivated by the on-going debate on whether money provides a reliable signal for future price developments. The empirical evidence on the leading role of money for inflation in an out-of-sample forecast framework is not very strong, see e.g. Lenza (2006) or Fisher, Lenza, Pill and Reichlin (2008). At the same time, e.g. Gerlach (2003) or Assenmacher-Wesche and Gerlach (2007, 2008) argue that money and output could affect prices at different frequencies; however, their analysis is performed in-sample. 
This chapter investigates empirically which frequency bands, and for which variables, are the most relevant for the out-of-sample forecast of inflation when the information from prices, money and real activity is considered. To extract different frequency components from a series, a wavelet transform is applied. It provides a simple and intuitive framework for band-pass filtering and allows a decomposition of series into different frequency bands. Its application in the multivariate out-of-sample forecast is novel in the literature. The results indicate that, indeed, different scales of money, prices and GDP can be relevant for the inflation forecast. Wavelets Large cross-section Factor model EM algorithm Bayesian VAR
20	A comparably robust approach to estimate the left-censored data of trace elements in Swedish groundwater Li, Cong January 2012 The groundwater data in this thesis, taken from the database of Sveriges Geologiska Undersökning, characterize the chemical and quantitative status of groundwater in Sweden. Values below certain thresholds are usually recorded only as quantification limits. Accordingly, this thesis aims to handle such left-censored data. The thesis uses the EM algorithm to obtain maximum likelihood estimates. Estimates of the distributions of the censored trace-element data are then presented. Related simulations show that the estimation is acceptable. groundwater left-censored data the EM algorithm maximum likelihood estimation
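The approach in the record above, EM for maximum likelihood with values reported only as "below the quantification limit", can be sketched as follows. The abstract does not name a distribution, so the normal model here (for example, for log-concentrations) is an assumption, as are the function name `em_left_censored_normal` and the simulated data. The E-step replaces each censored value with the first and second moments of a normal distribution truncated above at its limit.

```python
import math
import random


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def em_left_censored_normal(observed, limits, n_iter=500, tol=1e-10):
    """Estimate (mu, sigma) of a normal distribution.

    observed: values measured above the quantification limit.
    limits:   one quantification limit per censored record (true value < limit).
    """
    observed, limits = list(observed), list(limits)
    n = len(observed) + len(limits)
    # Start from the naive substitution of each censored value by its limit.
    start = observed + limits
    mu = sum(start) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in start) / n)
    sum_x = sum(observed)
    sum_x2 = sum(x * x for x in observed)
    for _ in range(n_iter):
        s1, s2 = sum_x, sum_x2
        # E-step: expected x and x^2 for each censored record given x < limit.
        for lim in limits:
            beta = (lim - mu) / sigma
            ratio = std_normal_pdf(beta) / max(std_normal_cdf(beta), 1e-300)
            mean_c = mu - sigma * ratio
            var_c = sigma * sigma * (1.0 - beta * ratio - ratio * ratio)
            s1 += mean_c
            s2 += var_c + mean_c * mean_c
        # M-step: complete-data maximum likelihood estimates.
        new_mu = s1 / n
        new_sigma = math.sqrt(max(s2 / n - new_mu * new_mu, 1e-12))
        converged = abs(new_mu - mu) < tol and abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu, sigma


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(1.0, 0.5) for _ in range(2000)]
    limit = 0.8
    seen = [x for x in data if x >= limit]
    censored_limits = [limit for x in data if x < limit]
    print(em_left_censored_normal(seen, censored_limits))
```

Because the expected value of a censored record always lies below its limit, the EM estimate of the mean is lower than the one obtained by substituting the limit itself, which is the bias this kind of method is meant to avoid.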
