121 |
Some Recent Advances in Non- and Semiparametric Bayesian Modeling with Copulas, Mixtures, and Latent Variables. Murray, Jared. January 2013.
This thesis develops flexible non- and semiparametric Bayesian models for mixed continuous, ordered and unordered categorical data. These methods have a range of possible applications; the applications considered in this thesis are drawn primarily from the social sciences, where multivariate, heterogeneous datasets with complex dependence and missing observations are the norm.

The first contribution is an extension of the Gaussian factor model to Gaussian copula factor models, which accommodate continuous and ordinal data with unspecified marginal distributions. I describe how this model is the most natural extension of the Gaussian factor model, preserving its essential dependence structure and the interpretability of factor loadings and the latent variables. I adopt an approximate likelihood for posterior inference and prove that, if the Gaussian copula model is true, the approximate posterior distribution of the copula correlation matrix asymptotically converges to the correct parameter under nearly any marginal distributions. I demonstrate with simulations that this method is both robust and efficient, and illustrate its use in an application from political science.

The second contribution is a novel nonparametric hierarchical mixture model for continuous, ordered and unordered categorical data. The model includes a hierarchical prior used to couple the component indices of two separate models, which are also linked by local multivariate regressions. This structure effectively overcomes the limitations of existing mixture models for mixed data, namely their overly strong local independence assumptions. In the proposed model, local independence is replaced by local conditional independence, so that the induced model is able to more readily adapt to structure in the data. I demonstrate the utility of this model as a default engine for multiple imputation of mixed data in a large repeated-sampling study using data from the Survey of Income and Program Participation. I show that it improves substantially on its most popular competitor, multiple imputation by chained equations (MICE), while enjoying certain theoretical properties that MICE lacks.

The third contribution is a latent variable model for density regression. Most existing density regression models are quite flexible but somewhat cumbersome to specify and fit, particularly when the regressors are a combination of continuous and categorical variables. The majority of these methods rely on extensions of infinite discrete mixture models to incorporate covariate dependence in the mixture weights, atoms or both. I take a fundamentally different approach, introducing a continuous latent variable which depends on covariates through a parametric regression. In turn, the observed response depends on the latent variable through an unknown function. I demonstrate that a spline prior for the unknown function is quite effective relative to Dirichlet process mixture models in density estimation settings (i.e., without covariates), even though these Dirichlet process mixtures have better asymptotic theoretical properties. The spline formulation enjoys a number of computational advantages over more flexible priors on functions. Finally, I demonstrate the utility of this model in regression applications using a dataset on U.S. wages from the Census Bureau, where I estimate the return to schooling as a smooth function of the quantile index. / Dissertation
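To make the generative structure of the first contribution concrete, the following sketch (in Python, with illustrative parameter values and marginal distributions that are not taken from the thesis) simulates mixed data from a Gaussian copula factor model: latent normal scores follow a factor model with correlation matrix Lambda Lambda' + Psi, and each observed margin is obtained by pushing the corresponding uniform score through an arbitrary marginal distribution, so the dependence is governed entirely by the copula correlation while the margins stay unspecified.

# Sketch: simulate mixed continuous/ordinal data from a Gaussian copula factor model.
# All parameter values and marginal choices below are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, k = 1000, 4, 1                                  # observations, variables, factors

Lambda = np.array([[0.9], [0.8], [0.7], [0.6]])       # factor loadings (p x k)
Psi = np.diag(1.0 - (Lambda ** 2).sum(axis=1))        # uniquenesses so diag(C) = 1
C = Lambda @ Lambda.T + Psi                           # copula correlation matrix

# Latent Gaussian scores z ~ N(0, C); u = Phi(z) are the copula (uniform) scores.
z = rng.multivariate_normal(np.zeros(p), C, size=n)
u = stats.norm.cdf(z)

# Push the uniform scores through arbitrary margins: two continuous, two ordinal.
x_cont1 = stats.expon(scale=2.0).ppf(u[:, 0])         # skewed continuous margin
x_cont2 = stats.t(df=3).ppf(u[:, 1])                  # heavy-tailed continuous margin
x_ord1 = np.digitize(u[:, 2], [0.3, 0.6, 0.9])        # 4-level ordinal margin
x_ord2 = (u[:, 3] > 0.5).astype(int)                  # binary margin

data = np.column_stack([x_cont1, x_cont2, x_ord1, x_ord2])
print(data[:5])

Because the margins enter only through their quantile functions, the dependence among the observed variables is carried entirely by C, which is the quantity the approximate likelihood mentioned in the abstract is used to estimate.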
|
122 |
Comparative approaches to handling missing data, with particular focus on multiple imputation for both cross-sectional and longitudinal models. Hassan, Ali Satty Ali. January 2012.
Much data-based research is characterized by the unavoidable problem of incompleteness as a result of missing or erroneous values. This thesis discusses some of the various strategies and basic issues in statistical data analysis for addressing the missing data problem, and deals with both missing covariates and missing outcomes. We restrict our attention to methodologies that address a specific missing data pattern, namely monotone missingness.
The thesis is divided into two parts. The first part places particular emphasis on the so-called missing at random (MAR) assumption, with the bulk of attention devoted to multiple imputation techniques. The main aim of this part is to investigate various modelling techniques through application studies, to identify the most appropriate techniques, and to gain insight into their suitability for the analysis of incomplete data. The thesis first deals with the problem of missing covariate values when estimating regression parameters under a monotone missing-covariate pattern. The study is devoted to a comparison of different imputation techniques, namely Markov chain Monte Carlo (MCMC), regression, propensity score (PS) and last observation carried forward (LOCF). The results from the application study indicated which methods deal best with missing covariates when the missing data pattern is monotone: of the methods explored, the MCMC and regression imputation methods were preferable to the PS and LOCF methods for estimating regression parameters under monotone missingness. The study is also concerned with a comparative analysis of techniques applied to incomplete Gaussian longitudinal outcome (response) data subject to random dropout. Three different methods are assessed and investigated, namely multiple imputation (MI), inverse probability weighting (IPW) and direct likelihood analysis. The findings in general favoured MI over IPW in the case of continuous outcomes, even when the MAR mechanism holds. The findings further suggest that MI and direct likelihood lead to accurate and equivalent results, as both techniques arrive at the same substantive conclusions. The study also compares and contrasts several statistical methods for analyzing incomplete non-Gaussian longitudinal outcomes when the underlying study is subject to ignorable dropout. The methods considered include weighted generalized estimating equations (WGEE), multiple imputation after generalized estimating equations (MI-GEE) and the generalized linear mixed model (GLMM). The study found that the MI-GEE method was considerably more robust, performing better than the other methods for both small and large sample sizes, regardless of the dropout rate.
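For illustration only, the Python sketch below contrasts two of the simpler strategies named above, LOCF and single regression imputation, for a covariate with a monotone missing pattern; the data, variable names and models are invented, and the MCMC and propensity-score approaches, as well as proper multiple imputation with between-imputation variability, are not shown.

# Sketch: LOCF vs. regression imputation for a covariate with monotone missingness.
# Data, variable names, and models are illustrative assumptions only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n)})                     # fully observed covariate
df["x2"] = 0.8 * df["x1"] + rng.normal(scale=0.5, size=n)         # partially observed covariate
df["y"] = 1.0 + 0.5 * df["x1"] + 0.7 * df["x2"] + rng.normal(size=n)

# Impose a monotone pattern: x2 is missing for the last 30% of (sorted) subjects.
df = df.sort_values("x1").reset_index(drop=True)
df.loc[int(0.7 * n):, "x2"] = np.nan

# LOCF: carry the last observed x2 value forward down the sorted data.
df["x2_locf"] = df["x2"].ffill()

# Regression imputation: predict x2 from x1 using the complete cases.
obs = df["x2"].notna()
reg = LinearRegression().fit(df.loc[obs, ["x1"]], df.loc[obs, "x2"])
df["x2_reg"] = df["x2"].where(obs, reg.predict(df[["x1"]]))

# Compare the regression of y on (x1, x2) under each single imputation.
for col in ["x2_locf", "x2_reg"]:
    fit = LinearRegression().fit(df[["x1", col]], df["y"])
    print(col, fit.coef_, fit.intercept_)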
The primary interest of the second part of the thesis lies in non-ignorable dropout (MNAR) modelling frameworks that rely on sensitivity analysis for modelling incomplete Gaussian longitudinal data. The aim of this part is to deal with non-random dropout by explicitly modelling the dropout process, incorporating this additional sub-model into the model for the measurement data, and assessing the sensitivity of the results to the modelling assumptions. The study focuses on the analysis of repeated Gaussian measures subject to potentially non-random dropout, in order to study the influence the dropout process might have on inference. We consider the construction of a particular type of selection model, namely the Diggle-Kenward model, as a tool for assessing the sensitivity of a selection model to its modelling assumptions. The major conclusions drawn were that there was evidence in favour of a MAR process rather than an MCAR process in the context of the assumed model, and that further insight into the data was needed through comparison of various sensitivity analysis frameworks. Lastly, two families of models were compared and contrasted to investigate the potential influence that dropout might exert on inference about the dependent measurement data, and to deal with incomplete sequences. The models were based on the selection and pattern-mixture frameworks used in sensitivity analysis to jointly model the distribution of the dropout process and the longitudinal measurement process. The results of the sensitivity analysis were in agreement and hence led to similar parameter estimates. Additional confidence in the findings was gained as both models led to similar results for significant effects such as marginal treatment effects. / Thesis (M.Sc.)-University of KwaZulu-Natal, Pietermaritzburg, 2012.
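For reference, the dropout component of a Diggle-Kenward selection model is commonly written as a logistic regression of the dropout indicator on the previous and the current (possibly unobserved) measurement; in the sketch below the notation is chosen here, not taken from the thesis.

Y_i \sim N(X_i\beta,\; V_i), \qquad
\operatorname{logit} P(D_i = j \mid D_i \ge j,\; h_{ij},\; y_{ij})
  = \psi_0 + \psi_1\, y_{i,j-1} + \psi_2\, y_{ij},
\qquad h_{ij} = (y_{i1}, \dots, y_{i,j-1}).

Here \psi_2 = 0 corresponds to MAR, \psi_1 = \psi_2 = 0 to MCAR, and \psi_2 \ne 0 to MNAR, which is what makes fitting this model under varying assumptions a vehicle for sensitivity analysis.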
|
123 |
Assessing the Impact of Genotype Imputation on Meta-analysis of Genetic Association Studies. Omondi, Emmanuel. 28 July 2014.
In this thesis, we study how a meta-analysis of genetic association studies is influenced by the degree of genotype imputation uncertainty in the studies combined and by the size of the meta-analysis. We consider the fixed effect meta-analysis model to evaluate the accuracy and efficiency of imputation-based meta-analysis results under different levels of imputation accuracy. We also examine the impact of genotype imputation on the between-study heterogeneity and the type I error in the random effects meta-analysis model. Simulation results reaffirm that meta-analysis boosts the power of detecting genetic associations compared to individual study results. However, the power deteriorates with increasing uncertainty in the imputed genotypes. Genotype imputation affects a random effects meta-analysis in a non-obvious way, as estimation of between-study heterogeneity and interpretation of association results depend heavily on the number of studies combined. We propose an adjusted fixed effect meta-analysis approach for adding imputation-based studies to a meta-analysis of existing directly typed (genotyped) studies in a controlled way to improve precision and reliability. The proposed method should help in designing an effective meta-analysis study.
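A minimal Python sketch of the inverse-variance fixed-effect combination underlying such a meta-analysis is given below; the per-study estimates, standard errors and imputation-quality scores are invented placeholders, and the quality-based down-weighting shown is only one simple way to control the contribution of imputation-based studies, not necessarily the adjustment proposed in the thesis.

# Sketch: inverse-variance fixed-effect meta-analysis of per-study SNP effects,
# optionally down-weighting imputed studies by an imputation-quality score r2.
# All numbers are invented placeholders for illustration.
import numpy as np

beta = np.array([0.12, 0.08, 0.15, 0.10])   # per-study effect estimates (e.g. log odds ratios)
se   = np.array([0.05, 0.04, 0.07, 0.06])   # per-study standard errors
r2   = np.array([1.00, 1.00, 0.70, 0.45])   # 1.0 = directly typed, <1 = imputation quality

def fixed_effect(beta, se, quality=None):
    """Inverse-variance weighted fixed-effect estimate; `quality` rescales the weights."""
    w = 1.0 / se ** 2
    if quality is not None:
        w = w * quality                      # shrink the weight of poorly imputed studies
    est = np.sum(w * beta) / np.sum(w)
    est_se = np.sqrt(1.0 / np.sum(w))
    return est, est_se

print("unadjusted: ", fixed_effect(beta, se))
print("r2-adjusted:", fixed_effect(beta, se, quality=r2))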
|
124 |
Survival analysis for breast cancer. Liu, Yongcai. 21 September 2010.
This research carries out a survival analysis for patients with breast cancer. The influence of clinical and pathologic features, as well as molecular markers, on survival time is investigated. Special attention focuses on whether the molecular markers can provide additional information to help predict clinical outcome and guide therapies for breast cancer patients. Three outcomes, breast cancer specific survival (BCSS), local relapse survival (LRS) and distant relapse survival (DRS), are examined using two datasets: a large dataset with missing values in the markers (n=1575) and a small (complete) dataset consisting of patient records without any missing values (n=910). Results show that some molecular markers, such as YB1, could join ER, PR and HER2 in being integrated into clinical practice for breast cancer. Further clinical research is needed to establish the importance of CK56. The 10-year survival probability at the mean of all the covariates (clinical variables and markers) is 77%, 91%, and 72% for BCSS, LRS, and DRS respectively. Because a large proportion of values in the dataset are missing, a sophisticated multiple imputation method is needed to estimate the missing values so that an unbiased and more reliable analysis can be achieved. In this study, three multiple imputation (MI) methods, data augmentation (DA), multivariate imputation by chained equations (MICE) and AREG, are employed and compared. Results show that AREG is the preferred MI approach. The reliability of the MI results is demonstrated using various techniques. This work will hopefully shed light on the choice of appropriate MI methods in similar research situations.
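Whichever engine (DA, MICE or AREG) produces the completed datasets, the per-imputation estimates are typically combined with Rubin's rules; the short Python sketch below shows that pooling step using made-up coefficient estimates and variances.

# Sketch: pooling a coefficient across M imputed datasets with Rubin's rules.
# The per-imputation estimates and variances below are made-up placeholders.
import numpy as np
from scipy import stats

est = np.array([0.52, 0.47, 0.55, 0.50, 0.49])        # estimate from each imputed dataset
var = np.array([0.010, 0.012, 0.011, 0.009, 0.010])   # squared SE from each dataset
m = len(est)

q_bar = est.mean()                     # pooled point estimate
u_bar = var.mean()                     # within-imputation variance
b = est.var(ddof=1)                    # between-imputation variance
t = u_bar + (1 + 1 / m) * b            # total variance
se = np.sqrt(t)

# Degrees of freedom and 95% CI (Rubin's classical formula).
r = (1 + 1 / m) * b / u_bar
df = (m - 1) * (1 + 1 / r) ** 2
ci = q_bar + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se
print(q_bar, se, ci)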
|
125 |
Multiple imputation for marginal and mixed models in longitudinal data with informative missingness. Deng, Wei. January 2005.
Thesis (Ph. D.)--Ohio State University, 2005. / Title from first page of PDF file. Document formatted into pages; contains xiii, 108 p.; also includes graphics. Includes bibliographical references (p. 104-108). Available online via OhioLINK's ETD Center
|
126 |
A Monte Carlo study: the impact of missing data in cross-classification random effects models. Alemdar, Meltem. January 2008.
Thesis (Ph. D.)--Georgia State University, 2008. / Title from title page (Digital Archive@GSU, viewed July 20, 2010) Carolyn F. Furlow, committee chair; Philo A. Hutcheson, Phillip E. Gagne, Sheryl A. Gowen, committee members. Includes bibliographical references (p. 96-100).
|
127 |
Bayesian estimation of factor analysis models with incomplete data. Merkle, Edgar C. January 2005.
Thesis (Ph. D.)--Ohio State University, 2005. / Title from first page of PDF file. Document formatted into pages; contains xi, 106 p.; also includes graphics. Includes bibliographical references (p. 103-106). Available online via OhioLINK's ETD Center
|
128 |
Effects of Missing Values on Neural Network Survival Time Prediction. Raoufi-Danner, Torrin. January 2018.
Data sets with missing values are a pervasive problem within medical research. Building lifetime prediction models based solely upon complete-case data can bias the results, so imputation is preferred over listwise deletion. In this thesis, artificial neural networks (ANNs) are used as a prediction model on simulated data with which to compare various imputation approaches. The construction and optimization of ANNs is discussed in detail, and some guidelines are presented for activation functions, the number of hidden layers and other tunable parameters. For the simulated data, binary lifetime prediction at five years was examined. The ANNs here performed best with tanh activation, binary cross-entropy loss with softmax output and three hidden layers of between 15 and 25 nodes. The imputation methods examined are random, mean, missing forest, multivariate imputation by chained equations (MICE), pooled MICE with an imputed target and pooled MICE with a non-imputed target. Random and mean imputation performed poorly compared to the others and were used as a baseline comparison case. The other algorithms all performed well up to 50% missingness. There were no statistical differences between these methods below 30% missingness; however, missing forest had the best performance above this level. It is therefore the recommendation of this thesis that the missing forest algorithm be used to impute missing data when constructing ANNs to predict breast cancer patient survival at the five-year mark.
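A hedged sketch of a network matching the description above, written here with TensorFlow/Keras (the framework, layer sizes and training settings are assumptions for illustration, not taken from the thesis): three tanh hidden layers of 15-25 nodes and a two-unit softmax output trained with cross-entropy, which for two classes is equivalent to the binary cross-entropy described.

# Sketch of an ANN for binary five-year survival prediction: three tanh hidden
# layers (25, 20, 15 nodes) and a softmax output with cross-entropy loss.
# Architecture details and the toy data are illustrative assumptions.
import numpy as np
import tensorflow as tf

def build_model(n_features: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(25, activation="tanh"),
        tf.keras.layers.Dense(20, activation="tanh"),
        tf.keras.layers.Dense(15, activation="tanh"),
        tf.keras.layers.Dense(2, activation="softmax"),   # P(dead), P(alive) at 5 years
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Toy usage on random data; in practice X would be the (imputed) covariates.
X = np.random.default_rng(0).normal(size=(500, 10)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("int64")               # placeholder binary label
model = build_model(n_features=10)
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))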
|
129 |
Comparação das águas dos rios Jaguari e Atibaia na região de lançamento de efluente de indústria petroquímica / Comparison of the water from the Jaguari and Atibaia rivers in the region of wastewater release by a petrochemical industry. Oliveira, Eduardo Schneider Bueno de [UNESP]. 03 February 2016.
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / Human action on nature has been a constant throughout history, but its negative effects are increasingly apparent. Examining these effects, their implications, and what can be done to avoid larger problems is of great importance for keeping our planet in good condition and, consequently, for human quality of life. This study analyzes the water quality of the Jaguari and Atibaia rivers, between which a petrochemical industry discharges its effluent, as well as the quality of the water after its use by the industry and before its return to the river. This makes it possible to assess the quality of the industry's wastewater treatment and to analyze possible effects on water quality after the effluent is discharged into the river. To this end, based on data on the physical, chemical and microbiological characteristics of the water, appropriate statistical techniques are used to carry out the required analysis. Because the observations are dependent on one another, methods that accommodate such dependence must be used, such as the nonparametric block bootstrap (Künsch, 1989; Politis & Romano, 1994). Multiple imputation is also carried out, using the distribution-free multiple imputation technique (Bergamo, 2007; Bergamo et al., 2008), since several months of the study have missing data.
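As an illustration of the dependence-preserving resampling mentioned above, the Python sketch below implements a simple moving block bootstrap in the spirit of Künsch (1989) on a toy autocorrelated series; the block length and statistic are arbitrary choices, and this is not the distribution-free imputation procedure of Bergamo et al.

# Sketch: moving block bootstrap for the mean of a dependent (monthly) series.
# The block length and the toy AR(1) series are illustrative choices only.
import numpy as np

rng = np.random.default_rng(42)

# Toy dependent series: AR(1) with coefficient 0.6, 120 "monthly" observations.
n = 120
e = rng.normal(size=n)
x = np.empty(n)
x[0] = e[0]
for t in range(1, n):
    x[t] = 0.6 * x[t - 1] + e[t]

def moving_block_bootstrap(series, block_len, n_boot, stat=np.mean, rng=rng):
    """Resample overlapping blocks of length `block_len` and recompute `stat`."""
    n = len(series)
    blocks = np.array([series[i:i + block_len] for i in range(n - block_len + 1)])
    n_blocks = int(np.ceil(n / block_len))
    out = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(blocks), size=n_blocks)
        resample = np.concatenate(blocks[idx])[:n]     # trim to the original length
        out[b] = stat(resample)
    return out

boot_means = moving_block_bootstrap(x, block_len=12, n_boot=2000)
print("bootstrap SE of the mean:", boot_means.std(ddof=1))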
|
130 |
Three-Level Multiple Imputation: A Fully Conditional Specification Approach. January 2015.
abstract: Currently, there is a clear gap in the missing data literature for three-level models.
To date, the literature has only focused on the theoretical and algorithmic work
required to implement three-level imputation using the joint model (JM) method of
imputation, leaving relatively no work done on the fully conditional specification (FCS)
method. Moreover, the literature lacks any methodological evaluation of three-level
imputation. Thus, this thesis serves two purposes: (1) to develop an algorithm in
order to implement FCS in the context of a three-level model and (2) to evaluate
both imputation methods. The simulation investigated a random intercept model
under both 20% and 40% missing data rates. The findings of this thesis suggest
that the estimates for both JM and FCS were largely unbiased, gave good coverage,
and produced similar results. The sole exception for both methods was the slope for
the level-3 variable, which was modestly biased. The bias exhibited by the methods
could be due to the small number of clusters used. This finding suggests that future
research ought to investigate and establish clear recommendations for the number of
clusters required by these imputation methods. To conclude, this thesis serves as a
preliminary start in tackling a much larger issue and gap in the current missing data
literature. / Dissertation/Thesis / Masters Thesis Psychology 2015
|