31 |
Secondary Analysis of Case-Control Studies in Genomic Contexts -- Wei, Jiawei, August 2010
This dissertation consists of five independent projects. In each project, a novel
statistical method was developed to address a practical problem encountered in genomic
contexts. For example, we considered testing for constant nonparametric effects
in a general semiparametric regression model in genetic epidemiology; analyzed the
relationship between covariates in the secondary analysis of case-control data; performed
model selection in joint modeling of paired functional data; and assessed the
prediction ability of genes in gene expression data generated by the CodeLink System
from GE.
In the first project in Chapter II we considered the problem of testing for constant
nonparametric effects in a general semiparametric regression model when there is the
potential for interaction between the parametrically and nonparametrically modeled
variables. We derived a generalized likelihood ratio test for this hypothesis, showed
how to implement it, and gave evidence that it can improve statistical power when
compared to standard partially linear models.
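A toy sketch of the flavor of such a test (not the dissertation's actual procedure, and omitting the parametric-nonparametric interaction): fit a partially linear model with a spline basis for the nonparametric effect, fit a null model with a constant effect, and compare the fits with a likelihood-ratio statistic, treating the difference in basis dimension as the degrees of freedom. All data and basis choices below are synthetic assumptions.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)            # parametrically modeled covariate
z = rng.uniform(0, 1, size=n)     # nonparametrically modeled covariate
y = 1.5 * x + np.sin(2 * np.pi * z) + rng.normal(scale=0.5, size=n)

def spline_basis(z, knots):
    """Truncated-power linear spline basis (a crude nonparametric device)."""
    cols = [np.ones_like(z), z]
    cols += [np.clip(z - k, 0, None) for k in knots]
    return np.column_stack(cols)

knots = np.linspace(0.1, 0.9, 8)

# Full model: y = x*beta + f(z); null model: y = x*beta + constant
X_full = np.column_stack([x[:, None], spline_basis(z, knots)])
X_null = np.column_stack([x, np.ones(n)])

def rss(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

lr_stat = n * np.log(rss(X_null) / rss(X_full))
df = X_full.shape[1] - X_null.shape[1]
p_value = chi2.sf(lr_stat, df)
```

The chi-square reference here is only a simplification; the generalized likelihood ratio theory the dissertation builds on requires a carefully calibrated (rescaled) null distribution.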
The second project in Chapter III addressed the issue of score testing for the
independence of X and Y in the secondary analysis of case-control data. The semiparametric
efficient approaches can be used to construct semiparametric score tests, but
they suffer from a lack of robustness to the assumed model for Y given X. We showed
how to adjust the semiparametric score test to make its level/Type I error correct even if the assumed model for Y given X is incorrect, and thus the test is robust.
The third project in Chapter IV took up the issue of estimation of a regression
function when Y given X follows a homoscedastic regression model. We showed how
to estimate the regression parameters in a rare disease case even if the assumed model
for Y given X is incorrect, and thus the estimates are model-robust.
In the fourth project in Chapter V we developed novel AIC and BIC-type methods
for estimating the smoothing parameters in a joint model of paired, hierarchical
sparse functional data, and showed in our numerical work that they are many times
faster than 10-fold cross-validation while giving results that are remarkably close to the cross-validated estimates.
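A minimal sketch of the underlying idea, choosing a smoothing parameter by an AIC-type criterion instead of cross-validation, with a penalized spline smoother whose effective degrees of freedom are the trace of the hat matrix (synthetic scalar data, not the paired hierarchical functional data of the dissertation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
t = np.sort(rng.uniform(0, 1, n))
y = np.sin(3 * np.pi * t) + rng.normal(scale=0.3, size=n)

# Penalized truncated-power spline smoother: y_hat = S(lam) @ y
knots = np.linspace(0.05, 0.95, 20)
B = np.column_stack([np.ones(n), t] + [np.clip(t - k, 0, None) for k in knots])
D = np.diag([0.0, 0.0] + [1.0] * len(knots))   # penalize only spline coefficients

def aic(lam):
    H = B @ np.linalg.solve(B.T @ B + lam * D, B.T)  # smoother ("hat") matrix
    resid = y - H @ y
    edf = np.trace(H)                                # effective degrees of freedom
    return n * np.log(np.mean(resid ** 2)) + 2 * edf

grid = 10.0 ** np.arange(-6, 3)
best_lam = min(grid, key=aic)
```

Each AIC evaluation costs a single linear solve, versus ten refits per candidate for 10-fold cross-validation; that is the source of the speedup the abstract describes.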
In the fifth project in Chapter VI we introduced a practical permutation test
that uses cross-validated genetic predictors to determine if the list of genes in question
has “good” prediction ability. It avoids overfitting by using cross-validation to
derive the genetic predictor and determines if the count of genes that give “good”
prediction could have been obtained by chance. This test was then used to explore
gene expression of colonic tissue and exfoliated colonocytes in the fecal stream to
discover similarities between the two.
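A sketch of a permutation test of this kind, using synthetic data and an ordinary cross-validated classifier as a stand-in for the genetic predictor: the observed cross-validated accuracy is compared against the accuracies obtained after repeatedly permuting the class labels, so overfitting is avoided by cross-validation and chance performance is quantified by the permutation null.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, p = 80, 30
X = rng.normal(size=(n, p))                       # toy "expression" matrix
y = (X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=n) > 0).astype(int)

def cv_accuracy(X, y):
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()

observed = cv_accuracy(X, y)

# Permutation null: shuffle the labels and recompute the CV accuracy
B = 50
null = np.array([cv_accuracy(X, rng.permutation(y)) for _ in range(B)])
p_value = (1 + np.sum(null >= observed)) / (B + 1)
```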
|
32 |
Predicting the migration of CO₂ plume in saline aquifers using probabilistic history matching approaches -- Bhowmik, Sayantan, 20 August 2012
During the operation of a geological carbon storage project, verifying that the CO₂ plume remains within the permitted zone is of particular interest both to regulators and to operators. However, the cost of many monitoring technologies, such as time-lapse seismic, limits their application. For adequate predictions of plume migration, proper representation of heterogeneous permeability fields is imperative. Previous work has shown that injection data (pressures, rates) from wells might provide a means of characterizing complex permeability fields in saline aquifers. Thus, given that injection data are readily available and inexpensive, they might provide an inexpensive alternative for monitoring; combined with a flow model like the one developed in this work, these data could even be used for predicting plume migration. These predictions of plume migration pathways can then be compared to field observations like time-lapse seismic or satellite measurements of surface-deformation, to ensure the containment of the injected CO₂ within the storage area. In this work, two novel methods for creating heterogeneous permeability fields constrained by injection data are demonstrated. The first method is an implementation of a probabilistic history matching algorithm to create models of the aquifer for predicting the movement of the CO₂ plume. The geologic property of interest, for example hydraulic conductivity, is updated conditioned to geological information and injection pressures. The resultant aquifer model which is geologically consistent can be used to reliably predict the movement of the CO₂ plume in the subsurface. The second method is a model selection algorithm that refines an initial suite of subsurface models representing the prior uncertainty to create a posterior set of subsurface models that reflect injection performance consistent with that observed. Such posterior models can be used to represent uncertainty in the future migration of the CO₂ plume. 
The applicability of both methods is demonstrated using a field data set from central Algeria.
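The second method can be caricatured in a few lines: weight a prior ensemble of subsurface models by the likelihood of the observed injection data under a forward model, then resample in proportion to those weights to obtain a posterior ensemble. Everything below (the three-parameter models, the forward model, the noise level) is a hypothetical stand-in, not the field case from central Algeria.

```python
import numpy as np

rng = np.random.default_rng(3)
n_models, n_obs = 200, 10

# Prior ensemble: each row is one candidate aquifer description (a toy
# stand-in for a heterogeneous permeability field)
prior = rng.lognormal(mean=0.0, sigma=0.5, size=(n_models, 3))

def forward(m, t):
    """Hypothetical forward model: predicted injection pressure over time."""
    return m[0] + m[1] * t + m[2] * np.sqrt(t)

t = np.linspace(1, 10, n_obs)
true_model = np.array([1.0, 0.3, 0.8])
observed = forward(true_model, t) + rng.normal(scale=0.05, size=n_obs)

# Gaussian likelihood of the observed injection data under each model
sigma = 0.05
misfit = np.array([np.sum((forward(m, t) - observed) ** 2) for m in prior])
w = np.exp(-0.5 * (misfit - misfit.min()) / sigma**2)  # stabilized exponent
w /= w.sum()

# Posterior ensemble: resample the prior in proportion to data fit
idx = rng.choice(n_models, size=n_models, p=w)
posterior = prior[idx]
```

The posterior ensemble then represents the remaining uncertainty in future plume migration, as in the abstract.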
|
33 |
Particle tracking proxies for prediction of CO₂ plume migration within a model selection framework -- Bhowmik, Sayantan, 24 June 2014
Geologic sequestration of CO₂ in deep saline aquifers has been studied extensively over the past two decades as a viable method of reducing anthropogenic carbon emissions. The monitoring and prediction of the movement of injected CO₂ is important for assessing containment of the gas within the storage volume, and for taking corrective measures if required. Given the uncertainty in the geologic architecture of storage aquifers, it is reasonable to depict our prior knowledge of the project area using a vast suite of aquifer models. Simulating such a large number of models using traditional numerical flow simulators to evaluate uncertainty is computationally expensive. A novel stochastic workflow for characterizing the plume migration, based on a model selection algorithm developed by Mantilla in 2011, has been implemented. The approach includes four main steps: (1) assessing the connectivity/dynamic characteristics of a large prior ensemble of models using proxies; (2) clustering the models using principal component analysis or multidimensional scaling coupled with k-means clustering; (3) selecting models by applying Bayes' rule on the reduced model space; and (4) expanding the model set using an ensemble pattern-based matching scheme. In this dissertation, two proxies have been developed based on particle tracking in order to assess the flow connectivity of models in the initial set. The proxies serve as fast approximations of finite-difference flow simulation models, and are meant to provide rapid estimates of the connectivity of the aquifer models. Modifications have also been implemented within the model selection workflow to accommodate the particular problem of application to a carbon sequestration project. The applicability of the proxies is tested both on synthetic models and real field case studies. It is demonstrated that the first proxy captures areal migration to a reasonable extent, while failing to adequately capture vertical buoyancy-driven flow of CO₂.
This limitation is addressed by the second proxy, whose applicability is demonstrated not only for capturing horizontal migration but also for buoyancy-driven flow. Both proxies are evaluated as standalone approximations of numerical simulation and within the larger model selection framework.
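Steps (2) and (3) of the workflow can be sketched on synthetic proxy responses: reduce the proxy output with PCA, cluster with k-means, and (in a crudely simplified version of the Bayes-rule step) retain the cluster closest to the observed response. The "connectivity signatures" below are invented stand-in data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Proxy responses for a prior ensemble of aquifer models: each row is a
# fast-proxy "connectivity signature" (hypothetical stand-in data)
n_models, n_features = 300, 12
centers = rng.normal(scale=3.0, size=(3, n_features))
labels_true = rng.integers(0, 3, n_models)
responses = centers[labels_true] + rng.normal(size=(n_models, n_features))

# Step (2): dimension reduction followed by k-means clustering
pca = PCA(n_components=2).fit(responses)
scores = pca.transform(responses)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)

# Step (3), heavily simplified: treat the cluster nearest to the observed
# response as carrying all posterior probability, and retain its members
observed = centers[0] + rng.normal(size=n_features)
obs_score = pca.transform(observed[None, :])
selected = km.predict(obs_score)[0]
posterior_members = np.where(km.labels_ == selected)[0]
```

In the actual workflow the posterior probabilities over clusters come from Bayes' rule rather than a hard nearest-cluster assignment, and a pattern-based expansion step follows.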
|
34 |
Mixed-effect modeling of codon usage -- Feng, Shujuan, 22 February 2011
Logistic mixed effects models are used to determine whether optimal codons associate with three specific properties of the expressed protein: solvent accessibility, aggregation propensity, and evolutionary conservation. Both the random components and the fixed structures in the models are decided by following certain selection procedures. Further models are developed by considering different factor combinations using the same selection procedure. The results show that evolutionary conservation is the most important factor for predicting optimal codon usage for most amino acids; aggregation propensity is also an important factor, and solvent accessibility is the least important factor for most amino acids. The results of this analysis are consistent with the previous literature, provide a more straightforward way to study the research question, and offer more insight into the underlying relationships.
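A heavily simplified, fixed-effects-only stand-in for such an analysis (ignoring the random components of the mixed model, and using simulated rather than biological data): fit a logistic regression of optimal-codon usage on the three properties and compare coefficient magnitudes. The predictor names and effect sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 2000

# Hypothetical site-level predictors for one amino acid
conservation = rng.uniform(0, 1, n)     # evolutionary conservation
aggregation = rng.uniform(0, 1, n)      # aggregation propensity
accessibility = rng.uniform(0, 1, n)    # solvent accessibility

# Simulate optimal-codon usage with conservation as the dominant effect
logit = -1.0 + 3.0 * conservation + 1.0 * aggregation + 0.2 * accessibility
p = 1 / (1 + np.exp(-logit))
optimal = rng.binomial(1, p)

X = np.column_stack([conservation, aggregation, accessibility])
model = LogisticRegression(max_iter=1000).fit(X, optimal)
coefs = dict(zip(["conservation", "aggregation", "accessibility"], model.coef_[0]))
```

The fitted coefficients recover the same ordering of importance the abstract reports: conservation first, aggregation second, accessibility last.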
|
35 |
Selection of Simplified Models and Parameter Estimation Using Limited Data -- Wu, Shaohua, 23 December 2009
Due to difficulties associated with formulating complex models and obtaining reliable estimates of unknown model parameters, modellers often use simplified models (SMs) that are structurally imperfect and that contain a smaller number of parameters. The objectives of this research are: 1) to develop practical and easy-to-use strategies to help modellers select the best SM from a set of candidate models, and 2) to assist modellers in deciding which parameters in complex models should be estimated, and which should be fixed at initial values. The aim is to select models and parameters so that the best possible predictions can be obtained using the available data and the modeller’s engineering and scientific knowledge.
This research summarizes the extensive qualitative and quantitative results in the statistics literature regarding the use of SMs. Mean-squared error (MSE) is used to judge the quality of model predictions obtained from different candidate models, and a confidence-interval approach is developed to assess the uncertainty about whether an SM or the corresponding extended model will give better predictions. Nine commonly applied model-selection criteria (MSC) are reviewed and analyzed for their propensities to prefer SMs. It is shown that there exist preferential orderings for many MSC that are independent of the model structure and the particular data set. A new MSE-based MSC is developed using univariate linear statistical models. The effectiveness of this criterion for selecting dynamic nonlinear multivariate models is demonstrated both theoretically and empirically. The proposed criterion is then applied to determine the optimal number of parameters to estimate in complex models, based on ranked parameter lists obtained from estimability analysis. This approach makes use of the modeller’s prior knowledge about the precision of initial parameter values and is less computationally expensive than comparable methods in the literature. / Thesis (Ph.D., Chemical Engineering) -- Queen's University, 2009-12-23
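The core MSE trade-off motivating this work can be illustrated with a small simulation (synthetic, not one of the thesis case studies): with limited data, a simplified model that omits several small true effects can predict better than the full model, because its lower variance outweighs its bias.

```python
import numpy as np

rng = np.random.default_rng(6)

def fit_predict(Xtr, ytr, Xte):
    """Least-squares fit on training data, prediction on test data."""
    beta, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return Xte @ beta

# Extended model has 10 covariates; only the first truly matters, and the
# remaining effects are small relative to the noise
p, n_train, n_test = 10, 15, 5000
beta_true = np.array([2.0] + [0.05] * (p - 1))

Xtr = rng.normal(size=(n_train, p))
Xte = rng.normal(size=(n_test, p))
ytr = Xtr @ beta_true + rng.normal(size=n_train)
yte_mean = Xte @ beta_true                      # noiseless test response

mse_extended = np.mean((fit_predict(Xtr, ytr, Xte) - yte_mean) ** 2)
mse_simple = np.mean((fit_predict(Xtr[:, :1], ytr, Xte[:, :1]) - yte_mean) ** 2)
```

Here the structurally imperfect one-covariate SM is biased but far less variable, which is exactly the situation the confidence-interval approach in the thesis is designed to diagnose.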
|
36 |
Model choice and variable selection in mixed & semiparametric models -- Säfken, Benjamin, 27 March 2015
No description available.
|
37 |
Local Log-Linear Models for Capture-Recapture -- Kurtz, Zachary Todd, 01 January 2014
Capture-recapture (CRC) models use two or more samples, or lists, to estimate the size of a population. In the canonical example, a researcher captures, marks, and releases several samples of fish in a lake. When the fish that are captured more than once are few compared to the total number that are captured, one suspects that the lake contains many more uncaptured fish. This basic intuition motivates CRC models in fields as diverse as epidemiology, entomology, and computer science. We use simulations to study the performance of conventional log-linear models for CRC. Specifically we evaluate model selection criteria, model averaging, an asymptotic variance formula, and several small-sample data adjustments. Next, we argue that interpretable models are essential for credible inference, since sets of models that fit the data equally well can imply vastly different estimates of the population size. A secondary analysis of data on survivors of the World Trade Center attacks illustrates this issue. Our main chapter develops local log-linear models. Heterogeneous populations tend to bias conventional log-linear models. Post-stratification can reduce the effects of heterogeneity by using covariates, such as the age or size of each observed unit, to partition the data into relatively homogeneous post-strata. One can fit a model to each post-stratum and aggregate the resulting estimates across post-strata. We extend post-stratification to its logical extreme by selecting a local log-linear model for each observed point in the covariate space, while smoothing to achieve stability. Local log-linear models serve a dual purpose. Besides estimating the population size, they estimate the rate of missingness as a function of covariates. Simulations demonstrate the superiority of local log-linear models for estimating local rates of missingness for special cases in which the generating model varies over the covariate space. 
We apply the method to estimate bird species richness in continental North America and to estimate the prevalence of multiple sclerosis in a region of France.
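The basic two-list case can be sketched in a few lines: under the independence log-linear model, the unobserved cell is estimated from the three observed cells, which reproduces the classical Lincoln-Petersen estimator. The capture probabilities and population size below are simulated, not from the datasets analyzed here.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 5000                      # true population size (the estimand)

# Two independent capture occasions
p1, p2 = 0.3, 0.25
c1 = rng.random(N) < p1
c2 = rng.random(N) < p2

n11 = np.sum(c1 & c2)         # caught on both lists
n10 = np.sum(c1 & ~c2)        # first list only
n01 = np.sum(~c1 & c2)        # second list only

# Under the independence log-linear model, the unobserved cell count is
# estimated as n10 * n01 / n11, i.e. the Lincoln-Petersen estimator
n00_hat = n10 * n01 / n11
N_hat = n11 + n10 + n01 + n00_hat
```

Heterogeneous capture probabilities break the independence assumption and bias this estimator, which is the failure mode the local log-linear models of the thesis address by fitting within relatively homogeneous neighborhoods of the covariate space.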
|
38 |
Bayesian Analysis of Switching ARCH Models -- Kaufmann, Sylvia; Frühwirth-Schnatter, Sylvia, January 2000
We consider a time series model with autoregressive conditional heteroskedasticity that is subject to changes in regime. The regimes evolve according to a multistate latent Markov switching process with unknown transition probabilities, and it is the constant in the variance process of the innovations that is subject to regime shifts. The joint estimation of the latent process and all model parameters is performed within a Bayesian framework using the method of Markov Chain Monte Carlo simulation. We perform model selection with respect to the number of states and the number of autoregressive parameters in the variance process using Bayes factors and model likelihoods. To this end, the model likelihood is estimated by combining the candidate's formula with importance sampling. The usefulness of the sampler is demonstrated by applying it to the dataset previously used by Hamilton and Susmel, who investigated models with switching autoregressive conditional heteroskedasticity using maximum likelihood methods. The paper concludes with some issues related to maximum likelihood methods, to classical model selection, and to potential straightforward extensions of the model presented here. / Series: Forschungsberichte / Institut für Statistik
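A minimal simulation of the data-generating process considered here: a two-state Markov chain switches the constant in an ARCH(1) variance equation. The parameter values are illustrative, and no estimation (the Bayesian sampler of the paper) is attempted.

```python
import numpy as np

rng = np.random.default_rng(8)
T = 2000

# Two-state Markov chain governing the regime of the variance constant
P = np.array([[0.98, 0.02],   # row-stochastic transition probabilities
              [0.05, 0.95]])
omega = np.array([0.2, 2.0])  # regime-dependent constant in the ARCH variance
alpha = 0.3                   # ARCH(1) coefficient

s = np.zeros(T, dtype=int)    # latent regime path
eps = np.zeros(T)             # innovations
h = np.zeros(T)               # conditional variances
h[0] = omega[0]
eps[0] = np.sqrt(h[0]) * rng.normal()
for t in range(1, T):
    s[t] = rng.choice(2, p=P[s[t - 1]])
    h[t] = omega[s[t]] + alpha * eps[t - 1] ** 2   # switching ARCH variance
    eps[t] = np.sqrt(h[t]) * rng.normal()
```

Simulated paths like this make the identification problem vivid: volatility clusters arise from both the ARCH recursion and the regime shifts, which is why joint Bayesian estimation of the latent path and parameters is attractive.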
|
39 |
Three Essays on Shrinkage Estimation and Model Selection of Linear and Nonlinear Time Series Models -- January 2018
The primary objective in time series analysis is forecasting. Raw data often exhibit nonstationary behavior: trends, seasonal cycles, and heteroskedasticity. After the data are transformed to a weakly stationary process, autoregressive moving average (ARMA) models may capture the remaining temporal dynamics to improve forecasting. Estimation of ARMA models can be performed by regressing current values on previous realizations and proxy innovations. The classic paradigm fails when the dynamics are nonlinear; in this case, parametric regime-switching specifications model changes in level, ARMA dynamics, and volatility using a finite number of latent states. If the states can be identified using past endogenous or exogenous information, a threshold autoregressive (TAR) or logistic smooth transition autoregressive (LSTAR) model may simplify complex nonlinear associations to conditional weakly stationary processes. For ARMA, TAR, and STAR models, order parameters quantify the extent to which past information is associated with the future. Unfortunately, even if model orders are known a priori, the possibility of over-fitting can lead to sub-optimal forecasting performance. By intentionally overestimating these orders, a linear representation of the full model is exploited and Bayesian regularization can be used to achieve sparsity. Global-local shrinkage priors for AR, MA, and exogenous coefficients are adopted to pull posterior means toward 0 without over-shrinking relevant effects. This dissertation introduces, evaluates, and compares Bayesian techniques that automatically perform model selection and coefficient estimation of ARMA, TAR, and STAR models. Multiple Monte Carlo experiments illustrate the accuracy of these methods in finding the "true" data generating process. Practical applications demonstrate their efficacy in forecasting. / Doctoral Dissertation, Statistics, 2018
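The over-specify-then-shrink idea can be illustrated with a simple AR example, substituting ridge-type (global) shrinkage for the global-local priors of the dissertation: fit a deliberately over-ordered AR model and let the prior pull the superfluous lag coefficients toward zero. The process and orders below are illustrative.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(9)
T = 600

# True process is AR(2); we intentionally over-specify the order
phi = np.array([0.5, -0.3])
y = np.zeros(T)
for t in range(2, T):
    y[t] = phi[0] * y[t - 1] + phi[1] * y[t - 2] + rng.normal(scale=0.5)

p_max = 12                                       # deliberately too large
X = np.column_stack([y[p_max - k - 1:T - k - 1] for k in range(p_max)])
target = y[p_max:]

fit = BayesianRidge(fit_intercept=False).fit(X, target)
coefs = fit.coef_                                # shrunk AR coefficients
```

Global-local priors such as the horseshoe improve on this ridge sketch by shrinking irrelevant lags more aggressively while leaving large coefficients nearly untouched, which is the "without over-shrinking relevant effects" property the abstract emphasizes.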
|
40 |
Probabilistic Forecast of Wind Power Generation by Stochastic Differential Equation Models -- Elkantassi, Soumaya
Reliable forecasting of wind power generation is crucial for cost-optimal control of electricity generation relative to demand. Here, we propose and analyze stochastic wind power forecast models described by parametrized stochastic differential equations, which introduce appropriate fluctuations in numerical forecast outputs. We use an approximate maximum likelihood method to infer the model parameters, taking into account the time-correlated sets of data. Furthermore, we study the validity and sensitivity of the parameters for each model. We apply our models to Uruguayan wind power production, as determined by historical data and corresponding numerical forecasts for the period of 1 March to 31 May 2016.
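A toy version of such a model is an Ornstein-Uhlenbeck process for the forecast error: simulate it by Euler-Maruyama and recover the parameters through the exact AR(1) form of its transition density, a simple stand-in for the approximate maximum likelihood method used here (all parameter values are illustrative, not fitted to Uruguayan data):

```python
import numpy as np

rng = np.random.default_rng(10)

# Ornstein-Uhlenbeck toy model: dX_t = theta * (mu - X_t) dt + sigma dW_t
theta, mu, sigma, dt, n = 1.5, 0.0, 0.4, 0.01, 20000

x = np.zeros(n)
for t in range(1, n):   # Euler-Maruyama simulation
    x[t] = x[t - 1] + theta * (mu - x[t - 1]) * dt \
           + sigma * np.sqrt(dt) * rng.normal()

# The exact discrete-time transition is AR(1): X_{t+1} = a X_t + b + noise,
# with a = exp(-theta * dt); estimate (a, b) by least squares, then invert
A = np.column_stack([x[:-1], np.ones(n - 1)])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, x[1:], rcond=None)
theta_hat = -np.log(a_hat) / dt
mu_hat = b_hat / (1 - a_hat)
resid = x[1:] - A @ np.array([a_hat, b_hat])
sigma_hat = resid.std() * np.sqrt(2 * theta_hat / (1 - a_hat ** 2))
```

Because the AR(1) regression uses the time-correlated sequence directly, it respects the dependence structure that the abstract's approximate maximum likelihood method is designed to handle.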
|