81 |
Extending Growth Mixture Models and Handling Missing Values via Mixtures of Non-Elliptical Distributions / Wei, Yuhong / January 2017
Growth mixture models (GMMs) are used to model intra-individual change and inter-individual differences in change and to detect underlying group structure in longitudinal studies. Regularly, these models are fitted under the assumption of normality, an assumption that is frequently invalid. To this end, this thesis focuses on the development of novel non-elliptical growth mixture models to better fit real data. Two non-elliptical growth mixture models, via the multivariate skew-t distribution and the generalized hyperbolic distribution, are developed and applied to simulated and real data. Furthermore, these two non-elliptical growth mixture models are extended to accommodate missing values, which are near-ubiquitous in real data.
Recently, finite mixtures of non-elliptical distributions have flourished and facilitated the flexible clustering of data featuring longer tails and asymmetry. However, in practice, real data often have missing values, and so work in this direction is also pursued. A novel approach, via mixtures of generalized hyperbolic distributions and mixtures of multivariate skew-t distributions, is presented to handle missing values in the mixture model-based clustering context. To increase parsimony, families of mixture models have been developed by imposing constraints on the component scale matrices whenever missing data occur. Next, a mixture of generalized hyperbolic factor analyzers model is also proposed to cluster high-dimensional data with different patterns of missing values. Two missingness indicator matrices are also introduced to ease the computational burden. The algorithms used for parameter estimation are presented, and the performance of the methods is illustrated on simulated and real data. / Thesis / Doctor of Philosophy (PhD)
|
82 |
Inference for Generalized Multivariate Analysis of Variance (GMANOVA) Models and High-dimensional Extensions / Jana, Sayantee / 11 1900
A Growth Curve Model (GCM) is a multivariate linear model used for analyzing longitudinal data with short to moderate time series. It is a special case of Generalized Multivariate Analysis of Variance (GMANOVA) models. Analysis using the GCM involves comparison of mean growths among different groups. The classical GCM, however, has some limitations: it relies on distributional assumptions, it assumes an identical degree of polynomial for all groups, and it requires a sample size larger than the number of time points. In this thesis, we relax some of the assumptions of the traditional GCM and develop appropriate inferential tools for its analysis, with the aim of reducing bias, improving precision, increasing power and overcoming the limitations of high dimensionality.
Existing methods for estimating the parameters of the GCM assume that the underlying distribution for the error terms is multivariate normal. In practical problems, however, we often come across skewed data, and hence estimation techniques developed under the normality assumption may not be optimal. Simulation studies conducted in this thesis show, in fact, that existing methods are sensitive to skewness in the data: the estimators have increased bias and mean square error (MSE) when the normality assumption is violated. Methods appropriate for skewed distributions are, therefore, required. In this thesis, we relax the distributional assumption of the GCM and provide estimators for the mean and covariance matrices of the GCM under the multivariate skew normal (MSN) distribution. An estimator for the additional skewness parameter of the MSN distribution is also provided. The estimators are derived using the expectation maximization (EM) algorithm, and extensive simulations are performed to examine their performance. Comparisons show that our estimators perform better than existing estimators when the underlying distribution is multivariate skew normal. An illustration using a real data set is also provided, in which triglyceride levels from the Framingham Heart Study are modelled over time.
The GCM assumes an equal degree of polynomial for each group. Therefore, when group means follow different shapes of polynomials, the GCM fails to accommodate this difference in one model. We consider an extension of the GCM in which mean responses from different groups can have different shapes, represented by polynomials of different degrees. Such a model is referred to as the Extended Growth Curve Model (EGCM). We extend our work on the GCM to the EGCM and develop estimators for the mean and covariance matrices under MSN errors. We adopted the Restricted Expectation Maximization (REM) algorithm, which is based on the multivariate Newton-Raphson (NR) method and Lagrangian optimization. However, the multivariate NR method, and hence the existing REM algorithm, is applicable to vector parameters, whereas the parameters of interest in this study are matrices. We therefore extended the NR approach to matrix parameters, which in turn allowed us to extend the REM algorithm to matrix parameters. The performance of the proposed estimators was examined using extensive simulations, and a motivating real data example was provided to illustrate their application.
Finally, this thesis deals with high-dimensional applications of the GCM. Existing methods for a GCM are developed under the assumption of ‘small p large n’ (n >> p) and are not appropriate for analyzing high-dimensional longitudinal data, due to singularity of the sample covariance matrix. In a previous work, we used the Moore-Penrose generalized inverse to overcome this challenge. However, the method has some limitations around near singularity, when p ~ n. In this thesis, a Bayesian framework was used to derive a test of the linear hypothesis on the mean parameter of the GCM, which is applicable in high-dimensional situations. Extensive simulations are performed to investigate the performance of the test statistic and establish optimality characteristics. Results show that this test performs well under different conditions, including the near singularity zone. Sensitivity of the test to mis-specification of the parameters of the prior distribution is also examined empirically. A numerical example is provided to illustrate the usefulness of the proposed method in practical situations. / Thesis / Doctor of Philosophy (PhD)
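The singularity problem mentioned above can be illustrated with a minimal NumPy sketch. It is not taken from the thesis: the data are simulated, the sizes n and p and the tolerance 1e-10 are arbitrary assumptions, and it only shows that the sample covariance matrix loses rank when there are fewer subjects than time points, and that a Moore-Penrose generalized inverse still exists in that case.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                      # assumed sizes: fewer subjects (n) than time points (p)
X = rng.normal(size=(n, p))      # simulated data, rows are subjects

S = np.cov(X, rowvar=False)      # p x p sample covariance matrix
# After centring, S has rank at most n - 1, so it cannot be inverted when n <= p.
rank = np.linalg.matrix_rank(S, tol=1e-10)

# The Moore-Penrose generalized inverse is defined even though S is singular.
S_pinv = np.linalg.pinv(S, rcond=1e-10)
# One of its defining properties: S @ S_pinv @ S equals S.
reconstruction_ok = np.allclose(S @ S_pinv @ S, S)
print(rank, reconstruction_ok)
```

The explicit tolerances are a design choice for the sketch, so that eigenvalues that are zero up to rounding error are treated as exactly zero.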
|
83 |
Normal Mixture Models for Gene Cluster Identification in Two Dimensional Microarray Data / Harvey, Eric Scott / 01 January 2003
This dissertation focuses on methodology specific to microarray data analyses that organize the data in preliminary steps and proposes a cluster analysis method which improves the interpretability of the cluster results. Cluster analysis of microarray data allows samples with similar gene expression values to be discovered and may serve as a useful diagnostic tool. Since microarray data is inherently noisy, data preprocessing steps including smoothing and filtering are discussed. Comparing the results of different clustering methods is complicated by the arbitrariness of the cluster labels. Methods for re-labeling clusters to assess the agreement between the results of different clustering techniques are proposed. Microarray data involve large numbers of observations and generally present as arrays of light intensity values reflecting the degree of activity of the genes. These measurements are often two dimensional in nature since each is associated with an individual sample (cell line) and gene. The usual hierarchical clustering techniques do not easily adapt to this type of problem. These techniques allow only one dimension of the data to be clustered at a time and lose information due to the collapsing of the data in the opposite dimension. A novel clustering technique based on normal mixture distribution models is developed. This method clusters observations that arise from the same normal distribution and allows the data to be simultaneously clustered in two dimensions. The model is fitted using the Expectation/Maximization (EM) algorithm. For every cluster, the posterior probability that an observation belongs to that cluster is calculated. These probabilities allow the analyst to control the cluster assignments, including the use of overlapping clusters. A user friendly program, 2-DCluster, was written to support these methods. This program was written for Microsoft Windows 2000 and XP systems and supports one and two dimensional clustering. 
The program and sample applications are available at http://etd.vcu.edu. An electronic copy of this dissertation is available at the same address.
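As a rough illustration of fitting a normal mixture with the EM algorithm and reading off posterior cluster probabilities, here is a minimal one-dimensional sketch. It is not the 2-DCluster program and does not implement the two-dimensional method of the dissertation; the function name, the quantile-based starting values, the fixed number of iterations and the simulated data are all assumptions made for the example.

```python
import numpy as np

def em_normal_mixture(x, k=2, n_iter=200):
    """EM for a univariate k-component normal mixture (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    # Crude starting values: spread the means over the data quantiles.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior probability that each observation belongs to each cluster.
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
        dens /= np.sqrt(2.0 * np.pi * variances)
        joint = weights * dens
        post = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted updates of mixing proportions, means and variances.
        nk = post.sum(axis=0)
        weights = nk / x.size
        means = (post * x[:, None]).sum(axis=0) / nk
        variances = (post * (x[:, None] - means) ** 2).sum(axis=0) / nk
    # post comes from the last E-step, one row per observation.
    return weights, means, variances, post

# Simulated data from two well-separated normal components.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(6.0, 1.0, 200)])
weights, means, variances, post = em_normal_mixture(x)

# Overlapping clusters: assign an observation to every cluster whose
# posterior probability exceeds an analyst-chosen threshold (assumed 0.2 here).
membership = post > 0.2
```

Thresholding the posterior probabilities, rather than taking only the largest one, is one simple way an analyst could allow overlapping clusters.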
|
84 |
Statistical inference for rankings in the presence of panel segmentation / Xie, Lin / January 1900
Doctor of Philosophy / Department of Statistics / Paul Nelson / Panels of judges are often used to estimate consumer preferences for m items such as food products. Judges can either evaluate each item on several ordinal scales and indirectly produce an overall ranking, or directly report a ranking of the items. A complete ranking orders all the items from best to worst. A partial ranking, as we use the term, only reports rankings of the best q out of m items. Direct ranking, the subject of this report, does not require the widespread but questionable practice of treating ordinal measurements as though they were on ratio or interval scales. Here, we develop and study segmentation models in which the panel may consist of relatively homogeneous subgroups, the segments. Judges within a subgroup will tend to agree among themselves and differ from judges in the other subgroups. We develop and study the statistical analysis of mixture models where it is not known to which segment a judge belongs or, in some cases, how many segments there are. Viewing segment membership indicator variables as latent data, an E-M algorithm was used to find the maximum likelihood estimators of the parameters specifying a mixture of Mallows' (1957) distance models for complete and partial rankings. A simulation study was conducted to evaluate the behavior of the E-M algorithm in terms of such issues as the fraction of data sets for which the algorithm fails to converge and the sensitivity of the convergence rate to the initial values, and to evaluate the performance of the maximum likelihood estimators in terms of bias and mean square error, where applicable.
A Bayesian approach was developed and credible set estimators were constructed. Simulation was used to evaluate the performance of these credible sets as confidence sets.
A method for predicting segment membership from covariates measured on a judge was derived using a logistic model applied to a mixture of Mallows probability distance models. The effects of covariates on segment membership were assessed.
Likelihood sets for parameters specifying mixtures of Mallows distance models were constructed and explored.
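To make the idea of a mixture of Mallows distance models concrete, the following sketch computes ranking probabilities and the posterior segment membership of a single judge for complete rankings. It assumes the Kendall distance (the abstract does not say which distance is used), normalizes by brute-force enumeration (practical only for small m), and uses invented items, central rankings, dispersion values and segment weights.

```python
import itertools
import math

def kendall_distance(a, b):
    """Number of item pairs that rankings a and b order differently."""
    pos = {item: i for i, item in enumerate(b)}
    seq = [pos[item] for item in a]
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )

def mallows_pmf(center, theta):
    """Probability of every complete ranking: proportional to exp(-theta * distance)."""
    perms = list(itertools.permutations(center))
    w = [math.exp(-theta * kendall_distance(p, center)) for p in perms]
    total = sum(w)  # normalizing constant by enumeration, feasible for small m only
    return {p: wi / total for p, wi in zip(perms, w)}

# Two hypothetical segments with opposite central rankings (best item first).
center_a = ("A", "B", "C", "D")
center_b = ("D", "C", "B", "A")
pmfs = [mallows_pmf(center_a, 1.0), mallows_pmf(center_b, 1.0)]
weights = [0.6, 0.4]  # assumed segment proportions

def segment_posterior(ranking):
    """Posterior probability of each segment given one judge's ranking."""
    joint = [w * pmf[ranking] for w, pmf in zip(weights, pmfs)]
    total = sum(joint)
    return [j / total for j in joint]

post = segment_posterior(("A", "B", "D", "C"))
```

Posterior probabilities of this kind are what an E-step would compute when segment membership is treated as latent data.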
|
85 |
Software for Estimation of Human Transcriptome Isoform Expression Using RNA-Seq Data / Johnson, Kristen / 18 May 2012
The goal of this thesis research was to develop software to be used with RNA-Seq data for transcriptome quantification that was capable of handling multireads and quantifying isoforms on a more global level. Current software available for these purposes uses various forms of parameter alteration in order to work with multireads. Many still analyze isoforms per gene or per researcher determined clusters as well. By doing so, the effects of multireads are diminished or possibly wrongly represented. To address this issue, two programs, GWIE and ChromIE, were developed based on a simple iterative EM-like algorithm with no parameter manipulation. These programs are used to produce accurate isoform expression levels.
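The abstract does not describe the algorithm inside GWIE or ChromIE, so the following is only a generic sketch of how an iterative EM-style scheme can share multireads among isoforms. The function name, the data layout and the example reads are hypothetical, and isoform length is ignored for simplicity.

```python
def allocate_multireads(reads, n_isoforms, n_iter=100):
    """Estimate relative isoform abundances from read compatibility sets.

    reads: list of tuples, each holding the isoform indices a read maps to.
    """
    abundance = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        for compatible in reads:
            # E-like step: split the read among its isoforms in proportion
            # to the current abundance estimates.
            total = sum(abundance[i] for i in compatible)
            for i in compatible:
                counts[i] += abundance[i] / total
        # M-like step: abundances are the normalized fractional read counts.
        abundance = [c / len(reads) for c in counts]
    return abundance

# Hypothetical data: 6 reads unique to isoform 0, 2 unique to isoform 1,
# and 4 multireads compatible with both.
reads = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
abundance = allocate_multireads(reads, n_isoforms=2)
```

In this toy example the multireads end up shared in the same 3:1 ratio as the uniquely mapped reads, which is the fixed point of the iteration.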
|
86 |
Estimation of Regression Coefficients under a Truncated Covariate with Missing Values / Reinhammar, Ragna / January 2019
By means of a Monte Carlo study, this paper investigates the relative performance of Listwise Deletion, the EM-algorithm and the default algorithm in the MICE-package for R (PMM) in estimating regression coefficients under a left truncated covariate with missing values. The intention is to investigate whether the three frequently used missing data techniques are robust against left truncation when missing values are MCAR or MAR. The results suggest that no technique is superior overall in all combinations of factors studied. The EM-algorithm is unaffected by left truncation under MCAR but negatively affected by strong left truncation under MAR. Compared to the default MICE-algorithm, the performance of EM is more stable across distributions and combinations of sample size and missing rate. The default MICE-algorithm is improved by left truncation but is sensitive to missingness pattern and missing rate. Compared to Listwise Deletion, the EM-algorithm is less robust against left truncation when missing values are MAR. However, the decline in performance of the EM-algorithm is not large enough for the algorithm to be completely outperformed by Listwise Deletion, especially not when the missing rate is moderate. Listwise Deletion might be robust against left truncation but is inefficient.
|
87 |
Estimação de modelos afins por partes em espaço de estados [Estimation of piecewise affine models in state space] / Rui, Rafael / January 2016
This thesis focuses on the state estimation and parameter identification problems for piecewise affine models. Piecewise affine models are obtained when the state domain or the input domain of the system is partitioned into regions and, for each region, a linear or affine submodel is used to describe the system dynamics. We propose a recursive state estimation algorithm and a parameter identification algorithm for a class of piecewise affine models. We propose a Bayesian state estimator that uses the Kalman filter in each submodel. In this estimator, the cumulative distribution function is used to compute the posterior distribution of the state as well as the probability of each submodel. The proposed identification method, in turn, uses the Expectation Maximization (EM) algorithm to identify the model parameters. We use the cumulative distribution function to compute the probability of each submodel based on the system measurements. Subsequently, we use the Kalman smoother to estimate the state and compute a surrogate function for the likelihood function. This function is then used to estimate the model parameters. The proposed estimator was used to estimate the state of the nonlinear model for vibrations caused by clearances. Numerical simulations were performed, in which we compared the proposed method to the extended Kalman filter and the particle filter. The identification algorithm was used to identify the model parameters of the JAS 39 Gripen aircraft as well as of the nonlinear model for vibrations caused by clearances.
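For readers unfamiliar with the building block used inside each submodel, here is a minimal scalar Kalman filter step for a single affine submodel. It is a textbook sketch, not the estimator proposed in the thesis: the function name, the scalar state, the model x' = a*x + b + w with measurement y = c*x' + v, and all numerical values are assumptions, and the switching between regions is not shown.

```python
def kalman_step(mean, var, y, a, b, c, q, r):
    """One predict/update cycle for x' = a*x + b + w, y = c*x' + v.

    q and r are the variances of the process noise w and measurement noise v.
    """
    # Predict: push the current estimate through the affine dynamics.
    mean_pred = a * mean + b
    var_pred = a * a * var + q
    # Update: correct the prediction with the measurement y.
    innovation = y - c * mean_pred
    s = c * c * var_pred + r          # innovation variance
    gain = var_pred * c / s           # Kalman gain
    mean_new = mean_pred + gain * innovation
    var_new = (1.0 - gain * c) * var_pred
    return mean_new, var_new

# Filter a short, made-up measurement sequence.
mean, var = 0.0, 1.0
for y in [1.1, 0.9, 1.0]:
    mean, var = kalman_step(mean, var, y, a=1.0, b=0.0, c=1.0, q=0.01, r=0.25)
```

In a piecewise affine setting, one such filter would run per region, each with its own (a, b, c) values.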
|
88 |
Estimation of wood fibre length distributions from censored mixture data / Svensson, Ingrid / January 2007
The motivating forestry background for this thesis is the need for fast, non-destructive, and cost-efficient methods to estimate fibre length distributions in standing trees in order to evaluate the effect of silvicultural methods and breeding programs on fibre length. Taking increment cores is a commonly used non-destructive sampling method in forestry. An increment core is a cylindrical wood sample taken with a special borer, and the methods proposed in this thesis are especially developed for data from increment cores. Nevertheless, the methods can be used for data from other sampling frames as well, for example for sticks with the shape of an elongated rectangular box.
This thesis proposes methods to estimate fibre length distributions based on censored mixture data from wood samples. Due to sampling procedures, wood samples contain cut (censored) and uncut observations. Moreover, the samples consist not only of the fibres of interest but of other cells (fines) as well. When the cell lengths are determined by an automatic optical fibre-analyser, there is no practical possibility to distinguish between cut and uncut cells or between fines and fibres. Thus the resulting data come from a censored version of a mixture of the fine and fibre length distributions in the tree. The methods proposed in this thesis can handle this lack of information.
Two parametric methods are proposed to estimate the fine and fibre length distributions in a tree. The first method is based on grouped data. The probabilities that the length of a cell from the sample falls into different length classes are derived, taking into account the censoring caused by the sampling frame. These probabilities are functions of the unknown parameters, and ML estimates are found from the corresponding multinomial model.
The second method is a stochastic version of the EM algorithm based on the individual length measurements. The method is developed for the case where the distributions of the true lengths of the cells at least partially appearing in the sample belong to exponential families. The cell length distribution in the sample and the conditional distribution of the true length of a cell at least partially appearing in the sample given the length in the sample are derived. Both these distributions are necessary in order to use the stochastic EM algorithm. Consistency and asymptotic normality of the stochastic EM estimates are proved.
The methods are applied to real data from increment cores taken from Scots pine trees (Pinus sylvestris L.) in Northern Sweden and further evaluated through simulation studies. Both methods work well for sample sizes commonly obtained in practice.
|
89 |
Code-aided synchronization for digital burst communications / Herzet, Cédric / 21 April 2006
This thesis deals with the synchronization of digital communication systems. Synchronization (from the Greek syn (together) and chronos (time)) denotes the task of making two systems run at the same time. In communication systems, the synchronization of the transmitter and the receiver requires accurately estimating a number of parameters such as the carrier frequency and phase offsets, the timing epoch...
In the early days of digital communications, synchronizers used to operate in either data-aided (DA) or non-data-aided (NDA) modes. However, with the recent advent of powerful coding techniques, these conventional synchronization modes have been shown to be unable to properly synchronize state-of-the-art receivers.
In this context, we investigate in this thesis a new family of synchronizers referred to as code-aided (CA) synchronizers. The idea behind CA synchronization is to exploit the structure of the code used to protect the data in order to improve the estimation quality achieved by the synchronizers. In the first part of the thesis, we address the issue of turbo synchronization, i.e., the iterative synchronization of continuous parameters. In particular, we derive several mathematical frameworks enabling a systematic derivation of turbo synchronizers and a deeper understanding of their behavior. In the second part, we focus on the so-called CA hypothesis testing problem. More particularly, we derive optimal solutions to this problem and propose efficient implementations of the proposed algorithms. Finally, in the last part of this thesis, we derive theoretical lower bounds on the performance of turbo synchronizers.
|