11

Investigation of Multiple Imputation Methods for Categorical Variables

Miranda, Samantha 01 May 2020 (has links)
We compare different multiple imputation methods for categorical variables using the MICE package in R. We take a complete data set, introduce different levels of missingness, and evaluate the imputation methods at each level. Logistic regression imputation and linear discriminant analysis (LDA) are used for binary variables. Multinomial logit imputation and LDA are used for nominal variables, while ordered logit imputation and LDA are used for ordinal variables. After imputation, the regression coefficients, percent deviation index (PDI) values, and relative frequency tables were found for each imputed data set at each level of missingness and compared to the corresponding complete data set. It was found that logistic regression outperformed LDA for binary variables, and that LDA outperformed both multinomial logit imputation and ordered logit imputation for nominal and ordinal variables. Simulations were run to confirm the validity of the results.
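The workflow described in this abstract can be sketched with the mice package itself. The snippet below is a minimal illustration on invented data with made-up variable names (bin, nom, ord), not the thesis's actual data or settings; it uses mice's built-in methods "logreg" (logistic regression), "polyreg" (multinomial logit), and "polr" (ordered logit), and any of these could be swapped for "lda" to reproduce the comparison.

```r
library(mice)

# toy data frame with a binary, a nominal, and an ordinal factor (invented names)
set.seed(1)
dat <- data.frame(
  bin = factor(sample(c("yes", "no"), 200, replace = TRUE)),
  nom = factor(sample(letters[1:4], 200, replace = TRUE)),
  ord = factor(sample(c("low", "mid", "high"), 200, replace = TRUE),
               levels = c("low", "mid", "high"), ordered = TRUE)
)

# knock out roughly 20% of the values completely at random
dat_mis <- as.data.frame(lapply(dat, function(x) {
  x[runif(length(x)) < 0.2] <- NA
  x
}))

# one imputation method per column: logistic regression for the binary
# variable, multinomial logit for the nominal one, ordered logit for the
# ordinal one; replacing any entry with "lda" gives the LDA comparison
meth <- c(bin = "logreg", nom = "polyreg", ord = "polr")
imp  <- mice(dat_mis, m = 5, method = meth, printFlag = FALSE)

# compare relative frequencies in a completed data set with the original
prop.table(table(complete(imp, 1)$nom))
prop.table(table(dat$nom))
```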
12

A Study on How Data Quality Influences Machine Learning Predictability and Interpretability for Tabular Data

Ahsan, Humra 05 May 2022 (has links)
No description available.
13

Evaluating Person-Oriented Methods for Mediation

January 2019 (has links)
abstract: Statistical inference from mediation analysis applies to populations; however, researchers and clinicians may be interested in making inferences about individual clients or small, localized groups of people. Person-oriented approaches focus on the differences between people, or latent groups of people, to ask how individuals differ across variables, and can help researchers avoid ecological fallacies when making inferences about individuals. Traditional variable-oriented mediation assumes the population undergoes a homogeneous reaction to the mediating process. However, mediation is also described as an intra-individual process in which each person passes from a predictor, through a mediator, to an outcome (Collins, Graham, & Flaherty, 1998). Configural frequency mediation (CFM) is a person-oriented analysis of contingency tables that has not been well studied or implemented since its introduction in the literature (von Eye, Mair, & Mun, 2010; von Eye, Mun, & Mair, 2009). The purpose of this study is to describe CFM and investigate its statistical properties while comparing it to traditional and causal inference mediation methods. The results of this study show that the joint-significance mediation test results in better Type I error rates but limits the person-oriented interpretations of CFM. Although the estimators for logistic regression and causal mediation are different, both perform well in terms of Type I error and power, though the causal estimator had higher bias than expected, which is discussed in the limitations section. / Dissertation/Thesis / Masters Thesis Psychology 2019
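For readers unfamiliar with the joint-significance test mentioned above, the sketch below shows its basic logic for a binary mediator and outcome on simulated data with illustrative variable names; it is not the study's actual analysis or its CFM procedure.

```r
# simulated data: binary predictor x, mediator m, and outcome y, generated so
# that the x -> m -> y chain is present
set.seed(2)
n <- 500
x <- rbinom(n, 1, 0.5)
m <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))   # a-path: x affects m
y <- rbinom(n, 1, plogis(-0.2 + 0.7 * m))   # b-path: m affects y

fit_a <- glm(m ~ x, family = binomial)      # predictor -> mediator
fit_b <- glm(y ~ x + m, family = binomial)  # mediator -> outcome, adjusting for x

p_a <- summary(fit_a)$coefficients["x", "Pr(>|z|)"]
p_b <- summary(fit_b)$coefficients["m", "Pr(>|z|)"]

# the joint-significance test declares mediation only if both paths are significant
c(p_a = p_a, p_b = p_b, mediated = (p_a < 0.05) && (p_b < 0.05))
```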
14

Dimensionality Reduction with Non-Gaussian Mixtures

Tang, Yang 11 1900 (has links)
Broadly speaking, cluster analysis is the organization of a data set into meaningful groups, and mixture model-based clustering has recently received wide interest in statistics. Historically, the Gaussian mixture model has dominated the model-based clustering literature. When model-based clustering is performed on a large number of observed variables, it is well known that Gaussian mixture models can represent an over-parameterized solution. To this end, this thesis focuses on the development of novel non-Gaussian mixture models for high-dimensional continuous and categorical data. We developed a mixture of joint generalized hyperbolic models (JGHM), which exhibits different marginal amounts of tail weight. Moreover, it takes into account cluster-specific subspaces and therefore limits the number of parameters to estimate. This is a novel approach, applicable to high-, and potentially very high-, dimensional spaces with arbitrary correlation between dimensions. Three different mixture models are developed using forms of the mixture of latent trait models to realize model-based clustering of high-dimensional binary data. A family of mixture of latent trait models with common slope parameters is developed to reduce the number of parameters to be estimated. This approach facilitates a low-dimensional visual representation of the clusters. We further developed penalized latent trait models to handle ultra-high-dimensional binary data; these models also perform automatic variable selection. For all models and families of models developed in this thesis, the algorithms used for model fitting and parameter estimation are presented. Real and simulated data sets are used to assess the clustering ability of the models. / Thesis / Doctor of Philosophy (PhD)
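As a rough illustration of the over-parameterization problem that motivates this work (a back-of-the-envelope calculation, not taken from the thesis), one can count the free parameters of an unrestricted Gaussian mixture and see how quickly they grow with dimension:

```r
# free parameters of a G-component Gaussian mixture with unrestricted
# covariance matrices: G mean vectors (p each), G covariances (p(p+1)/2 each),
# and G - 1 mixing proportions
n_par_gmm <- function(G, p) G * (p + p * (p + 1) / 2) + (G - 1)

n_par_gmm(G = 3, p = 5)    #    62 parameters
n_par_gmm(G = 3, p = 50)   #  3977 parameters
n_par_gmm(G = 3, p = 500)  # 377252 parameters -- hence the need for parsimony
```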
15

A stochastic process model for transient trace data

Mathur, Anup 05 October 2007 (has links)
Creation of sufficiently accurate workload models of computer systems is a key step in evaluating and tuning these systems. Workload models for an observable system can be built from traces collected by observing the system. This dissertation presents a novel technique to construct non-executable, artificial workload models fitting transient trace data. The trace can be a categorical or numerical time series. The trace is considered a sample realization of a non-stationary stochastic process, {X_t}, such that the random variables X_t follow different probability distributions. To estimate the parameters for the model, a Rate Evolution Graph (REG) is built from the trace data. The REG is a two-dimensional Cartesian graph which plots the number of occurrences of each unique state in the trace on the ordinate and time on the abscissa. The REG contains one path for all instances of each unique state in the trace. The derivative of a REG path at time t is used as an estimate of the probability of occurrence of the corresponding state at t. We use piecewise linear regression to fit straight line segments to each REG path. The slopes of the line segments that fit a REG path estimate the time-dependent probability of occurrence of the corresponding state. The estimates of the occurrence probabilities of all unique states in the trace are used to construct a time-dependent joint probability mass function. The joint probability mass function is the representation of the Piecewise Independent Stochastic Process model for the trace. Two methods that help compact the model, while retaining accuracy, are also discussed. / Ph. D.
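The REG construction can be sketched in a few lines of R. The toy trace and the hand-picked windows below are illustrative assumptions; the dissertation fits line segments by piecewise linear regression over the whole path rather than over fixed windows.

```r
set.seed(3)
# toy categorical trace in which state "A" becomes steadily more likely over time
trace <- ifelse(runif(300) < seq(0.2, 0.8, length.out = 300), "A", "B")

# REG path for each unique state: cumulative number of occurrences up to time t
reg_path <- sapply(unique(trace), function(s) cumsum(trace == s))

# slope of a path over a window of the trace, via ordinary least squares;
# the slope approximates the state's occurrence probability in that window
slope <- function(path, idx) unname(coef(lm(path[idx] ~ idx))[2])

c(early = slope(reg_path[, "A"], 1:100),    # roughly 0.3
  late  = slope(reg_path[, "A"], 201:300))  # roughly 0.7
```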
16

Implementing a Class of Permutation Tests: The coin Package

Zeileis, Achim, Wiel, Mark A. van de, Hornik, Kurt, Hothorn, Torsten 11 1900 (has links) (PDF)
The R package coin implements a unified approach to permutation tests providing a huge class of independence tests for nominal, ordered, numeric, and censored data as well as multivariate data at mixed scales. Based on a rich and flexible conceptual framework that embeds different permutation test procedures into a common theory, a computational framework is established in coin that likewise embeds the corresponding R functionality in a common S4 class structure with associated generic functions. As a consequence, the computational tools in coin inherit the flexibility of the underlying theory and conditional inference functions for important special cases can be set up easily. Conditional versions of classical tests - such as tests for location and scale problems in two or more samples, independence in two- or three-way contingency tables, or association problems for censored, ordered categorical or multivariate data - can easily be implemented as special cases using this computational toolbox by choosing appropriate transformations of the observations. The paper gives a detailed exposition of both the internal structure of the package and the provided user interfaces along with examples on how to extend the implemented functionality. (authors' abstract)
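A minimal usage sketch of the package's central interface is shown below, on toy data invented for illustration; see the paper itself for the full range of transformations, null distributions, and test procedures.

```r
library(coin)
set.seed(4)
# invented example: an ordered categorical response observed in two groups
dat <- data.frame(
  group   = factor(rep(c("ctrl", "trt"), each = 50)),
  outcome = factor(sample(c("low", "mid", "high"), 100, replace = TRUE),
                   levels = c("low", "mid", "high"), ordered = TRUE)
)

# the same generic interface handles nominal, ordered, numeric, and censored
# responses; the default asymptotic null distribution is used here, and
# resampling or exact null distributions can be requested instead
independence_test(outcome ~ group, data = dat)
```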
17

Implementing a Class of Permutation Tests: The coin Package

Hothorn, Torsten, Hornik, Kurt, van de Wiel, Mark A., Zeileis, Achim January 2007 (has links) (PDF)
The R package coin implements a unified approach to permutation tests providing a huge class of independence tests for nominal, ordered, numeric, and censored data as well as multivariate data at mixed scales. Based on a rich and flexible conceptual framework that embeds different permutation test procedures into a common theory, a computational framework is established in coin that likewise embeds the corresponding R functionality in a common S4 class structure with associated generic functions. As a consequence, the computational tools in coin inherit the flexibility of the underlying theory and conditional inference functions for important special cases can be set up easily. Conditional versions of classical tests - such as tests for location and scale problems in two or more samples, independence in two- or three-way contingency tables, or association problems for censored, ordered categorical or multivariate data - can easily be implemented as special cases using this computational toolbox by choosing appropriate transformations of the observations. The paper gives a detailed exposition of both the internal structure of the package and the provided user interfaces. / Series: Research Report Series / Department of Statistics and Mathematics
18

Statistical evaluation of percutaneous, ureteroscopic, and robotic surgeries for ureteropelvic junction obstruction

Masarovičová, Martina January 2008 (has links)
The aim of this diploma thesis is the statistical processing of a sample of patients who have been hospitalized and treated for ureteropelvic junction obstruction at the urological department of ÚNV Prague over the last 20 years, in order to determine the optimal treatment method. Evaluating the surgical techniques from both a surgical and an economic point of view creates a comprehensive picture of the advantages and disadvantages connected with applying a particular method and enables all participating parties to decide in cases of doubt. Statistical analysis is a suitable instrument for finding such answers, although it also leaves room for discussion.
19

Likelihood-based inference for antedependence (Markov) models for categorical longitudinal data

Xie, Yunlong 01 July 2011 (has links)
Antedependence (AD) of order p, also known as the Markov property of order p, is a property of index-ordered random variables in which each variable, given at least p immediately preceding variables, is independent of all further preceding variables. Zimmerman and Nunez-Anton (2010) present statistical methodology for fitting and performing inference for AD models for continuous (primarily normal) longitudinal data. But analogous AD-model methodology for categorical longitudinal data has not yet been well developed. In this thesis, we derive maximum likelihood estimators of transition probabilities under antedependence of any order, and we use these estimators to develop likelihood-based methods for determining the order of antedependence of categorical longitudinal data. Specifically, we develop a penalized likelihood method for determining variable-order antedependence structure, and we derive the likelihood ratio test, score test, Wald test and an adaptation of Fisher's exact test for pth-order antedependence against the unstructured (saturated) multinomial model. Simulation studies show that the score (Pearson's Chi-square) test performs better than all the other methods for complete and monotone missing data, while the likelihood ratio test is applicable for data with arbitrary missing pattern. But since the likelihood ratio test is oversensitive under the null hypothesis, we modify it by equating the expectation of the test statistic to its degrees of freedom so that it has actual size closer to nominal size. Additionally, we modify the likelihood ratio tests for use in testing for pth-order antedependence against qth-order antedependence, where q > p, and for testing nested variable-order antedependence models. We extend the methods to deal with data having a monotone or arbitrary missing pattern. For antedependence models of constant order p, we develop methods for testing transition probability stationarity and strict stationarity and for maximum likelihood estimation of parametric generalized linear models that are transition probability stationary AD(p) models. The methods are illustrated using three data sets.
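To make the testing idea concrete, the sketch below works through one special case on simulated data: for a binary variable measured at three occasions, AD(1) against the saturated model amounts to testing whether the first and third measurements are conditionally independent given the second, which can be done with a stratified Pearson chi-square (score-type) statistic. This is only an illustration of that special case, not the thesis's general likelihood-based machinery.

```r
set.seed(5)
n  <- 400
y1 <- rbinom(n, 1, 0.5)
y2 <- rbinom(n, 1, plogis(-0.5 + 1.2 * y1))  # depends on the first occasion
y3 <- rbinom(n, 1, plogis(-0.5 + 1.2 * y2))  # depends only on the second: AD(1) holds

# accumulate Pearson chi-square statistics for independence of y1 and y3
# within each level of y2 (conditional independence given the middle occasion)
x2 <- 0
df <- 0
for (k in unique(y2)) {
  tab <- table(y1[y2 == k], y3[y2 == k])
  tst <- chisq.test(tab, correct = FALSE)
  x2  <- x2 + unname(tst$statistic)
  df  <- df + unname(tst$parameter)
}
c(statistic = x2, df = df, p.value = pchisq(x2, df, lower.tail = FALSE))
```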
20

On the Measurement of Model Fit for Sparse Categorical Data

Kraus, Katrin January 2012 (has links)
This thesis consists of four papers that deal with several aspects of the measurement of model fit for categorical data. In all papers, special attention is paid to situations with sparse data. The first paper concerns the computational burden of calculating Pearson's goodness-of-fit statistic in situations where many response patterns have observed frequencies equal to zero. A simple solution is presented that allows the total value of Pearson's goodness-of-fit statistic to be computed when the expected frequencies of response patterns with observed frequencies of zero are unknown. In the second paper, a new fit statistic is presented that is a modification of Pearson's statistic but is not adversely affected by response patterns with very small expected frequencies. It is shown that the new statistic is asymptotically equivalent to Pearson's goodness-of-fit statistic and hence asymptotically chi-square distributed. In the third paper, comprehensive simulation studies are conducted that compare seven asymptotically equivalent fit statistics, including the new one. The situations considered cover both multinomial sampling and factor analysis. Goodness-of-fit tests are conducted by means of both the asymptotic and the bootstrap approach, both under the null hypothesis and when there is a certain degree of misfit in the data. Results indicate that recommendations on the use of a fit statistic can depend on the investigated situation and on the purpose of the model test. Power varies substantially across the fit statistics and across the causes of model misfit. Findings further indicate that the new statistic proposed in this thesis shows rather stable results and, compared to the other fit statistics, no disadvantageous characteristics are found. Finally, in the fourth paper, the potential need to assess goodness-of-fit by two-sided model testing is addressed. A simulation study is conducted that investigates differences between the one-sided and the two-sided approach to model testing. Situations are identified for which two-sided model testing has advantages over the one-sided approach.
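One plausible reading of the first paper's "simple solution" is the algebraic identity X^2 = sum(O^2/E) - N, under which response patterns with observed frequency zero contribute nothing to the sum, so their individual expected frequencies are never needed. Attributing exactly this identity to the paper is an assumption; the identity itself is standard algebra, and the snippet below simply checks it numerically on simulated sparse counts.

```r
set.seed(6)
p_model <- prop.table(runif(64))                # model probabilities for 64 patterns
N       <- 100                                  # small sample => many zero counts
obs     <- as.vector(rmultinom(1, N, p_model))  # sparse observed frequencies
E       <- N * p_model                          # expected frequencies

x2_full   <- sum((obs - E)^2 / E)               # needs every expected frequency
nz        <- obs > 0
x2_sparse <- sum(obs[nz]^2 / E[nz]) - N         # needs E only where O > 0
all.equal(x2_full, x2_sparse)                   # TRUE
```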
