Global ETD Search

11	A-Optimal Subsampling For Big Data General Estimating Equations Cheung, Chung Ching 08 1900 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / A significant hurdle in analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with a large sample size is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for the linear model. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and of the estimator with resampling. We also give the asymptotic representation of the bias of both estimators. We show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method called A-optimal, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on A-optimal subsampling methods. We conduct extensive simulations on large-sample, high-dimensional data to evaluate the performance of our proposed methods, using MSE as the criterion. High-dimensional data are further investigated, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE, because the bias is not negligible. We apply our proposed subsampling method to a real data set, gas sensor data with more than four million data points. In both simulations and real data analysis, our A-optimal method outperforms the traditional uniform subsampling method. Subsampling Big Data A-optimality General Estimating Equations High Dimensional Statistics
12	A-OPTIMAL SUBSAMPLING FOR BIG DATA GENERAL ESTIMATING EQUATIONS Chung Ching Cheung (7027808) 13 August 2019 (has links) A significant hurdle in analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with a large sample size is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for the linear model. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and of the estimator with resampling. We also give the asymptotic representation of the bias of both estimators. We show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method called A-optimal, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on A-optimal subsampling methods. We conduct extensive simulations on large-sample, high-dimensional data to evaluate the performance of our proposed methods, using MSE as the criterion. High-dimensional data are further investigated, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE, because the bias is not negligible. We apply our proposed subsampling method to a real data set, gas sensor data with more than four million data points. In both simulations and real data analysis, our A-optimal method outperforms the traditional uniform subsampling method. Statistics subsampling general estimating equations a-optimality big data High Dimensional Data
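The subsampling idea in the two records above can be illustrated with a small sketch. It is not the dissertation's derivation: it assumes a one-parameter linear estimating equation, sum of x_i (y_i - b x_i) = 0, and uses sampling probabilities proportional to the absolute estimating-function score |x_i (y_i - b0 x_i)| at a pilot estimate b0, mixed with a uniform component for stability. The function names, the mixing weight `mix`, and the simulated data are all hypothetical choices made for illustration.

```python
import random


def weighted_subsample_slope(x, y, probs, r, rng):
    """Solve the inverse-probability-weighted estimating equation
    sum_{i in S} x_i * (y_i - b * x_i) / pi_i = 0 for b, where S is a
    subsample of size r drawn with replacement using probabilities pi_i."""
    idx = rng.choices(range(len(x)), weights=probs, k=r)
    num = sum(x[i] * y[i] / probs[i] for i in idx)
    den = sum(x[i] * x[i] / probs[i] for i in idx)
    return num / den


def influence_probs(x, y, pilot_slope, mix=0.1):
    """Sampling probabilities proportional to |x_i * residual_i| at a pilot
    estimate, mixed with a uniform component (assumption: mix=0.1) so that
    no probability is zero."""
    scores = [abs(xi * (yi - pilot_slope * xi)) for xi, yi in zip(x, y)]
    total = sum(scores)
    n = len(x)
    return [(1 - mix) * s / total + mix / n for s in scores]


# Simulated data (hypothetical): y = 2 x + noise.
rng = random.Random(0)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]

# Step 1: pilot estimate from a small uniform subsample.
uniform = [1.0 / n] * n
pilot = weighted_subsample_slope(x, y, uniform, 500, rng)

# Step 2: nonuniform subsample guided by the pilot estimate.
probs = influence_probs(x, y, pilot)
estimate = weighted_subsample_slope(x, y, probs, 2000, rng)
print("pilot:", pilot, "nonuniform subsample estimate:", estimate)
```

Dividing each sampled term by its probability keeps the weighted estimating equation unbiased for the full-data equation, which is why the same solver serves both the uniform pilot step and the nonuniform second step.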
13	Learning Curves in Emergency Ultrasonography Brady, Kaitlyn 29 December 2012 (has links) "This project utilized generalized estimating equations and general linear modeling to model learning curves for sonographer performance in emergency ultrasonography. Performance was measured in two ways: image quality (interpretable vs. possible hindrance in interpretation) and agreement of findings between the sonographer and an expert reviewing sonographer. Records from 109 sonographers were split into two data sets-- training (n=50) and testing (n=59)--to conduct exploratory analysis and fit the final models for analysis, respectively. We determined that the number of scans of a particular exam type required for a sonographer to obtain quality images on that exam type with a predicted probability of 0.9 is highly dependent upon the person conducting the review, the indication of the scan (educational or medical), and the outcome of the scan (whether there is a pathology positive finding). Constructing family-wise 95% confidence intervals for each exam type demonstrated a large amount of variation for the number of scans required both between exam types and within exam types. It was determined that a sonographer's experience with a particular exam type is not a significant predictor of future agreement on that exam type and thus no estimates were made based on the agreement learning curves. In addition, we concluded based on a type III analysis that when already considering exam type related experience, the consideration of experience on other exam types does not significantly impact the learning curve for quality. However, the learning curve for agreement is significantly impacted by the additional consideration of experience on other exam types." inverse interval estimation logistic regression general linear models generalized estimating equations
14	Modelos Birnbaum-Saunders usando equações de estimação / Birnbaum-Saunders models using estimating equations Tsuyuguchi, Aline Barbosa 12 May 2017 (has links) The aim of this thesis is to propose an alternative approach to analyzing correlated Birnbaum-Saunders (BS) data based on estimating equations. From the optimal class of estimating functions proposed by Crowder (1987), we derive an optimal class for the analysis of correlated data in which the marginal distributions are assumed to be either log-BS or log-BS-t. 
We derive an iterative process for parameter estimation, as well as diagnostic procedures such as residual analysis, Cook's distance, and local influence under three different perturbation schemes: case weights, response variable perturbation, and single-covariate perturbation. Simulation studies are performed for each proposed model to assess the empirical properties of the estimators of the location, shape, and correlation parameters. The proposed methodology is discussed in two applications: the first uses a data set on public capital productivity in the 48 contiguous US states from 1970 to 1986, and the second refers to a study conducted at the School of Physical Education and Sport of the University of São Paulo (USP) during 2016, in which 70 runners were evaluated in treadmill runs in three distinct periods. Birnbaum-Saunders distribution Correlated data Estimating equations
15	Equações de estimação generalizadas com resposta binomial negativa: modelando dados correlacionados de contagem com sobredispersão / Generalized estimating equations with negative binomial responses: modeling correlated count data with overdispersion Oesselmann, Clarissa Cardoso 12 December 2016 (has links) An assumption that is common in the analysis of regression models is that of independent responses. However, when working with longitudinal or grouped data, this assumption may not make sense. 
There are several methods for solving this problem, but perhaps the best known, in the non-Gaussian context, is the one based on Generalized Estimating Equations (GEE), which has similarities with Generalized Linear Models (GLM). These similarities involve classifying the model around distributions of the exponential family and specifying a variance function. The only difference is that a working correlation matrix, which parameterizes the correlation structure within the experimental units, is also inserted into this function. The main objective of this dissertation is to study how these models behave in a specific situation, namely count data with overdispersion. When we work with GLM, this kind of problem is solved by fitting a model with a negative binomial (NB) response, and the idea is the same for the GEE methodology. This dissertation aims to review the GEE methodology in general and for the specific case in which the marginal responses follow negative binomial distributions. In addition, we show how this methodology is applied in practice, with three different examples of correlated data with count responses. Count Data Generalized Estimating Equations Negative Binomial Overdispersion
16	Modelos Birnbaum-Saunders usando equações de estimação / Birnbaum-Saunders models using estimating equations Aline Barbosa Tsuyuguchi 12 May 2017 (has links) The aim of this thesis is to propose an alternative approach to analyzing correlated Birnbaum-Saunders (BS) data based on estimating equations. From the optimal class of estimating functions proposed by Crowder (1987), we derive an optimal class for the analysis of correlated data in which the marginal distributions are assumed to be either log-BS or log-BS-t. 
We derive an iterative process for parameter estimation, as well as diagnostic procedures such as residual analysis, Cook's distance, and local influence under three different perturbation schemes: case weights, response variable perturbation, and single-covariate perturbation. Simulation studies are performed for each proposed model to assess the empirical properties of the estimators of the location, shape, and correlation parameters. The proposed methodology is discussed in two applications: the first uses a data set on public capital productivity in the 48 contiguous US states from 1970 to 1986, and the second refers to a study conducted at the School of Physical Education and Sport of the University of São Paulo (USP) during 2016, in which 70 runners were evaluated in treadmill runs in three distinct periods. Birnbaum-Saunders distribution Correlated data Estimating equations
17	An examination of individual and social network factors that influence needle sharing behaviour among Winnipeg injection drug users Sulaiman, Patricia C. 14 December 2005 (has links) The sharing of needles among injection drug users (IDUs) is a common route of Human Immunodeficiency Virus and Hepatitis C Virus transmission. Through the increased utilization of social network analysis, researchers have been able to examine how the interpersonal relationships of IDUs affect injection risk behaviour. This study involves a secondary analysis of data from a cross-sectional study of 156 IDUs from Winnipeg, Manitoba titled “Social Network Analysis of Injection Drug Users”. Multiple logistic regression analysis was used to assess the individual and the social network characteristics associated with needle sharing among the IDUs. Generalized Estimating Equations analysis was used to determine the injecting dyad characteristics which influence needle sharing behaviour between the IDUs and their injection drug using network members. The results revealed five key thematic findings that were significantly associated with needle sharing: (1) types of drug use, (2) socio-demographic status, (3) injecting in semi-public locations, (4) intimacy, and (5) social influence. The findings from this study suggest that comprehensive prevention approaches that target individuals and their network relationships may be necessary for sustainable reductions in needle sharing among IDUs. / February 2006 injection drug use social network analysis needle sharing Winnipeg, Manitoba generalized estimating equations secondary data analysis
18	Statistical Evaluation of Continuous-Scale Diagnostic Tests with Missing Data Wang, Binhuan 12 June 2012 (has links) The receiver operating characteristic (ROC) curve methodology is the statistical methodology for assessing the accuracy of diagnostic tests or biomarkers. Currently, the most widely used statistical methods for inference on ROC curves are complete-data-based parametric, semiparametric, or nonparametric methods. However, these methods cannot be used in diagnostic applications with missing data. In practice, missing diagnostic data occur commonly for various reasons, such as medical tests being too expensive, too time-consuming, or too invasive. This dissertation aims to develop new nonparametric statistical methods for evaluating the accuracy of diagnostic tests or biomarkers in the presence of missing data. Specifically, novel nonparametric statistical methods will be developed with different types of missing data for (i) inference on the area under the ROC curve (AUC, a summary index of the diagnostic accuracy of the test) and (ii) joint inference on the sensitivity and the specificity of a continuous-scale diagnostic test. In this dissertation, we will provide a general framework that combines empirical likelihood and general estimating equations with nuisance parameters for joint inference on sensitivity and specificity with missing diagnostic data. The proposed methods will have sound theoretical properties. The theoretical development is challenging because the proposed profile log-empirical likelihood ratio statistics are not standard sums of independent random variables. The new methods combine the power of likelihood-based approaches and the jackknife method in ROC studies. Therefore, they are expected to be more robust, more accurate, and less computationally intensive than existing methods in the evaluation of competing diagnostic tests. 
AUC Bootstrap Diagnostic tests Empirical likelihood Estimating equations Imputation Jackknife Missing data ROC curve Verification bias
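For orientation, the quantities named in the record above (AUC, sensitivity, specificity) can be computed for complete data as in the sketch below. This is only the standard complete-data empirical version, not the missing-data methods the dissertation develops. The function names, the convention that larger test values indicate disease, and the sample values are assumptions made for illustration.

```python
def empirical_auc(diseased, healthy):
    """Empirical AUC as the Mann-Whitney proportion: the fraction of
    (diseased, healthy) pairs in which the diseased subject has the larger
    test value, counting ties as one half."""
    wins = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased) * len(healthy))


def sensitivity_specificity(diseased, healthy, cutoff):
    """Sensitivity and specificity when a test value above `cutoff` is
    called positive (assumption: larger values indicate disease)."""
    sens = sum(1 for d in diseased if d > cutoff) / len(diseased)
    spec = sum(1 for h in healthy if h <= cutoff) / len(healthy)
    return sens, spec


# Hypothetical test values for illustration only.
diseased_values = [2.1, 3.4, 1.9, 4.0]
healthy_values = [1.0, 2.0, 1.5, 0.7]
print(empirical_auc(diseased_values, healthy_values))
print(sensitivity_specificity(diseased_values, healthy_values, cutoff=1.8))
```

The pairwise form makes the interpretation of the AUC explicit: it estimates the probability that a randomly chosen diseased subject scores higher than a randomly chosen healthy one.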
19	Analysis of Correlated Data with Measurement Error in Responses or Covariates Chen, Zhijian January 2010 (has links) Correlated data frequently arise from epidemiological studies, especially familial and longitudinal studies. Longitudinal design has been used by researchers to investigate the changes of certain characteristics over time at the individual level as well as how potential factors influence the changes. Familial studies are often designed to investigate the dependence of health conditions among family members. Various models have been developed for this type of multivariate data, and a wide variety of estimation techniques have been proposed. However, data collected from observational studies are often far from perfect, as measurement error may arise from different sources such as defective measuring systems, diagnostic tests without gold references, and self-reports. Under such scenarios only rough surrogate variables are measured. Measurement error in covariates in various regression models has been discussed extensively in the literature. It is well known that naive approaches ignoring covariate error often lead to inconsistent estimators for model parameters. In this thesis, we develop inferential procedures for analyzing correlated data with response measurement error. We consider three scenarios: (i) likelihood-based inferences for generalized linear mixed models when the continuous response is subject to nonlinear measurement errors; (ii) estimating equations methods for binary responses with misclassifications; and (iii) estimating equations methods for ordinal responses when the response variable and categorical/ordinal covariates are subject to misclassifications. The first problem arises when the continuous response variable is difficult to measure. When the true response is defined as the long-term average of measurements, a single measurement is considered as an error-contaminated surrogate. 
We focus on generalized linear mixed models with nonlinear response error and study the induced bias in naive estimates. We propose likelihood-based methods that can yield consistent and efficient estimators for both fixed-effects and variance parameters. Results of simulation studies and analysis of a data set from the Framingham Heart Study are presented. Marginal models have been widely used for correlated binary, categorical, and ordinal data. The regression parameters characterize the marginal mean of a single outcome, without conditioning on other outcomes or unobserved random effects. The generalized estimating equations (GEE) approach, introduced by Liang and Zeger (1986), models only the first two moments of the responses, with associations treated as nuisance characteristics. For some clustered studies, especially familial studies, however, the association structure may be of scientific interest. For binary data, Prentice (1988) proposed additional estimating equations that allow one to model pairwise correlations. We consider marginal models for correlated binary data with misclassified responses. We develop “corrected” estimating equations approaches that can yield consistent estimators for both mean and association parameters. The idea is related to the approach of Nakamura (1990), which was originally developed for correcting bias induced by additive covariate measurement error under generalized linear models. Our approaches can also handle correlated misclassifications rather than a simple misclassification process as considered by Neuhaus (2002) for clustered binary data under generalized linear mixed models. We extend our methods and further develop marginal approaches for analysis of longitudinal ordinal data with misclassification in both responses and categorical covariates. Simulation studies show that our proposed methods perform very well under a variety of scenarios. Results from application of the proposed methods to real data are presented. 
Measurement error can be coupled with many other features in the data, e.g., complex survey designs, that can complicate inferential procedures. We explore combining survey weights and misclassification in ordinal covariates in logistic regression analyses. We propose an approach that incorporates survey weights into estimating equations to yield design-based unbiased estimators. In the final part of the thesis we outline some directions for future work, such as transition models and semiparametric models for longitudinal data with both incomplete observations and measurement error. Missing data is another common feature in applications. Developing novel statistical techniques for dealing with both missing data and measurement error can be beneficial. Estimating equations Generalized mixed models Longitudinal data Measurement error Odds ratio Statistics (Biostatistics)

Search results