1 |
Some limit behaviors for the LS estimators in errors-in-variables regression model. Chen, Shu. January 1900.
Master of Science / Department of Statistics / Weixing Song / There has been continuing interest among statisticians in regression models in which the independent variables are measured with error, and there is considerable literature on the subject. In this report, we discuss the errors-in-variables regression model y_i = β_0 + β_1 x_i + β_2 z_i + ε_i, X_i = x_i + u_i, Z_i = z_i + v_i, with i.i.d. errors (ε_i, u_i, v_i) for i = 1, 2, ..., n, and derive the least squares estimators of the parameters of interest. Both weak and strong consistency of the least squares estimators β̂_0, β̂_1, and β̂_2 of the unknown parameters β_0, β_1, and β_2 are obtained. Moreover, under regularity conditions, the asymptotic normality of the estimators is established.
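The minimal simulation sketch below (not taken from the report) sets up the model above with illustrative parameter values and error variances and computes the ordinary least squares fit of y on the observed surrogates (1, X, Z); the report's exact estimator, assumptions, and asymptotic analysis are of course richer than this.

```python
# Minimal simulation sketch of the errors-in-variables model described above.
# The parameter values, error variances, and the exact form of the LS estimator
# are illustrative assumptions, not taken from the report.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
beta0, beta1, beta2 = 1.0, 2.0, -0.5

x = rng.normal(0, 1, n)          # latent covariates
z = rng.normal(0, 1, n)
eps = rng.normal(0, 0.5, n)      # regression error
u = rng.normal(0, 0.3, n)        # measurement errors
v = rng.normal(0, 0.3, n)

y = beta0 + beta1 * x + beta2 * z + eps
X, Z = x + u, z + v              # observed surrogates

# Least squares fit of y on the observed (1, X, Z)
D = np.column_stack([np.ones(n), X, Z])
beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
print("LS estimates:", beta_hat)
```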
|
2 |
Essays in Cluster Sampling and Causal Inference. Makela, Susanna. January 2018.
This thesis consists of three papers in applied statistics, specifically in cluster sampling, causal inference, and measurement error. The first paper studies the problem of estimating the finite population mean from a two-stage sample with unequal selection probabilities in a Bayesian framework. Cluster sampling is common in survey practice, and the corresponding inference has been predominantly design-based. We develop a Bayesian framework for cluster sampling and account for the design effect in the outcome modeling. In a two-stage cluster sampling design, clusters are first selected with probability proportional to cluster size, and units are then randomly sampled within selected clusters. Methodological challenges arise when the sizes of nonsampled clusters are unknown. We propose both nonparametric and parametric Bayesian approaches for predicting the cluster sizes, and we carry out inference for the unknown cluster sizes simultaneously with inference for the survey outcome. We implement this method in Stan and use simulation studies to compare the frequentist properties of the integrated Bayesian approach with those of classical methods. We then apply our proposed method to the Fragile Families and Child Wellbeing study as an illustration of complex survey inference.
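As a rough illustration of the design described in this first paper, the sketch below simulates a two-stage sample: clusters drawn with probability proportional to size (with replacement, for simplicity) and a simple random sample of units within each selected cluster, followed by a basic design-based estimate of the population mean. Cluster sizes and the outcome model are hypothetical, and the paper's Bayesian modeling of the unknown cluster sizes in Stan is not reproduced.

```python
# Sketch of the two-stage design described above: clusters drawn with probability
# proportional to size (here with replacement, for simplicity), then a simple
# random sample of units within each selected cluster. Sizes and outcome model
# are hypothetical; the paper's Bayesian estimation step is not shown.
import numpy as np

rng = np.random.default_rng(1)
J, j_sample, m = 200, 20, 15           # clusters, sampled clusters, units per cluster

sizes = rng.integers(50, 500, size=J)  # cluster sizes (known only for sampled clusters in practice)
alpha = rng.normal(0, 1, J) + 0.002 * sizes   # cluster effects correlated with size

# Stage 1: PPS (with replacement) selection of clusters
p = sizes / sizes.sum()
sampled = rng.choice(J, size=j_sample, replace=True, p=p)

# Stage 2: simple random sample of m units in each sampled cluster
data = []
for j in sampled:
    y = alpha[j] + rng.normal(0, 1, size=m)   # unit-level outcomes
    data.append((j, sizes[j], y))

# Design-based (Hansen-Hurwitz type) estimate of the finite-population mean:
# under PPS-with-replacement, the unweighted mean of sampled cluster means is unbiased
est = np.mean([y.mean() for _, _, y in data])
print("estimated population mean:", est)
```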
The second paper focuses on the problem of weak instrumental variables, motivated by estimating the causal effect of incarceration on recidivism. An instrument is weak when it is only weakly predictive of the treatment of interest. Given the well-known pitfalls of weak instrumental variables, we propose a method for strengthening a weak instrument. We use a matching strategy that pairs observations so that they are close on observed covariates but far apart on the instrument. This strategy strengthens the instrument, at the cost of a reduced sample size. To help guide the applied researcher in selecting a match, we propose simulating the power of a sensitivity analysis and the design sensitivity, and using graphical methods to examine the results. We also demonstrate the use of recently developed methods for identifying effect modification, that is, an interaction between a pretreatment covariate and the treatment. Larger and less variable treatment effects are less sensitive to unobserved bias, so identifying when effect modification is present, and which covariates may be its source, is important. We carry out our study in the context of the causal effect of incarceration on recidivism via a natural experiment in the state of Pennsylvania, a motivating example that illustrates each component of our analysis.
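A simplified sketch of the matching idea follows: pair units that are similar on observed covariates but far apart on the instrument, here by splitting at the instrument's median and solving an assignment problem whose cost trades covariate distance against instrument separation. The covariates, the trade-off weight, and the median split are illustrative choices, not the paper's design, and in practice only a subset of well-separated pairs would be retained (the reduced-sample-size tradeoff noted above).

```python
# Simplified sketch of strengthening a weak instrument by matching: pair units
# that are close on covariates but far apart on the instrument. The penalty
# weight, covariates, and median split are illustrative, not the paper's design.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))            # observed covariates
Z = X[:, 0] * 0.2 + rng.normal(size=n) # a (weak-ish) instrument

hi = np.where(Z >= np.median(Z))[0]    # candidate "encouraged" units
lo = np.where(Z < np.median(Z))[0]     # candidate "discouraged" units

cov_dist = cdist(X[hi], X[lo])                   # close on covariates -> small
inst_gap = np.abs(Z[hi][:, None] - Z[lo][None])  # far on instrument  -> large
cost = cov_dist - 2.0 * inst_gap                 # trade-off weight is arbitrary here

rows, cols = linear_sum_assignment(cost)         # optimal pairing
pairs = list(zip(hi[rows], lo[cols]))
gaps = [abs(Z[i] - Z[j]) for i, j in pairs]
print("mean instrument gap within pairs:", np.mean(gaps))
```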
The third paper considers the issue of measurement error in the context of survey sampling and hierarchical models. Researchers are often interested in studying the relationship between community-level variables and individual outcomes. This approach often requires estimating the neighborhood-level variable of interest from the sampled households, which induces measurement error in the neighborhood-level covariate because not all households are sampled. In other cases, neighborhood-level variables are not observed directly, and only a noisy proxy is available. In both cases, the observed variables contain measurement error. Measurement error is known to attenuate the coefficient of the mismeasured variable, but it can also affect other coefficients in the model, and ignoring it can lead to misleading inference. We propose a Bayesian hierarchical model that integrates an explicit model for the measurement error process with a model for the outcome of interest, covering both sampling-induced and classical measurement error. Advances in Bayesian computation, specifically the development of the Stan probabilistic programming language, make the implementation of such models straightforward.
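The small simulation below illustrates the sampling-induced measurement error described in this third paper: the neighborhood-level covariate is estimated from a handful of sampled households, and the naive regression slope attenuates relative to using the true neighborhood means. All numbers are illustrative, and the Bayesian hierarchical correction itself is not shown.

```python
# Sketch of sampling-induced measurement error in a neighborhood-level covariate:
# the covariate is the neighborhood mean of a household variable, but only a few
# households per neighborhood are sampled. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(3)
J, n_hh, n_sampled = 500, 200, 5
gamma = 1.5                                     # true neighborhood-level effect

mu = rng.normal(0, 1, J)                        # true neighborhood means
households = mu[:, None] + rng.normal(0, 2, size=(J, n_hh))
y = 0.5 + gamma * mu + rng.normal(0, 1, J)      # neighborhood-level outcome

mu_hat = households[:, :n_sampled].mean(axis=1) # noisy estimate from sampled households

def slope(x, y):
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

print("using true means:      ", slope(mu, y))
print("using sampled estimate:", slope(mu_hat, y))   # attenuated toward zero
```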
|
3 |
Bias correction of bounded location errors in binary data. Walker, Nelson B. January 1900.
Master of Science / Department of Statistics / Trevor Hefley / Binary regression models for spatial data are commonly used in disciplines such as epidemiology and ecology. Many spatially referenced binary data sets suffer from location error, which occurs when the recorded location of an observation differs from its true location. When location error occurs, values of the covariates associated with the true spatial locations of the observations cannot be obtained. We show how a change of support (COS) can be applied to regression models for binary data to provide bias-corrected coefficient estimates when the true values of the covariates are unavailable but the unknown locations of the observations are contained within non-overlapping polygons of any geometry. The COS accommodates spatial and non-spatial covariates and preserves the convenient interpretation of methods such as logistic and probit regression. Using a simulation experiment, we compare binary regression models with a COS to naive approaches that ignore location error. We illustrate the flexibility of the COS by modeling individual-level disease risk in a population using a binary data set in which the locations of the observations are unknown but contained within administrative units. Our simulation experiment and data illustration show that conventional regression models for binary data that ignore location error are unreliable, and that the COS can be used to eliminate bias while preserving model choice.
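The toy sketch below illustrates the change-of-support idea in a simplified one-dimensional setting: each observation's true location is unknown but known to lie in a "polygon" (here, a block of grid cells), so its success probability is modeled as the average of the point-level logistic probability over that block, and the likelihood is maximized directly. The grid, block structure, and data-generating model are illustrative assumptions, not the paper's construction.

```python
# Toy sketch of a change of support for binary data: each observation's true
# location is unknown but contained in a known polygon (here, an interval of
# grid cells), so its success probability is modeled as the average of the
# point-level logistic probability over the polygon. Setup is illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(4)
G = 1000                                   # fine grid of point-level covariate values
xs = np.cumsum(rng.normal(0, 0.1, G))      # a smooth spatial covariate surface
beta_true = np.array([-0.5, 1.2])

n = 800
true_loc = rng.integers(0, G, n)
y = rng.binomial(1, expit(beta_true[0] + beta_true[1] * xs[true_loc]))

# Observed: only a polygon (block of 25 grid cells) containing each location
block = 25
poly_start = (true_loc // block) * block
polys = [xs[s:s + block] for s in poly_start]

def neg_loglik(beta):
    # P(y=1 | polygon) = average of the point-level probability over the polygon
    p = np.array([expit(beta[0] + beta[1] * xp).mean() for xp in polys])
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

fit = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print("COS estimate of (beta0, beta1):", fit.x)
```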
|
4 |
Goodness-of-fit tests in measurement error models with replications. Jia, Weijia. January 1900.
Doctor of Philosophy / Department of Statistics / Weixing Song / In this dissertation, goodness-of-fit tests are proposed for checking the adequacy of parametric distributional forms of the regression error density functions and the error-prone predictor density function in measurement error models, when replications of the surrogates of the latent variables are available.
In the first project, we propose goodness-of-fit tests for the density function of the regression error in the errors-in-variables model. Instead of assuming that the distribution of the measurement error is known, as is done in much of the relevant literature, we assume that replications of the surrogates of the latent variables are available. The test statistic is based upon a weighted integrated squared distance between a nonparametric estimate and a semi-parametric estimate of the density functions of certain residuals. Under the null hypothesis, the test statistic is shown to be asymptotically normal. Consistency and local power results for the proposed test under fixed and local alternatives are also established. Finite sample performance of the proposed test is evaluated via simulation studies. A real data example is also included to demonstrate the application of the proposed test.
In the second project, we propose a class of goodness-of-fit tests for checking the parametric distributional forms of the error-prone random variables in the classic additive measurement error models. We also assume that replications of the surrogates of the error-prone variables are available. The test statistic is based upon a weighted integrated squared distance between a non-parametric estimator and a semi-parametric estimator of the density functions of the averaged surrogate data. Under the null hypothesis, the minimum distance estimator of the distribution parameters and the test statistics are shown to be asymptotically normal. Consistency and local power of the proposed tests under fixed alternatives and local alternatives are also established. Finite sample performance of the proposed tests is evaluated via simulation studies.
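Both projects share the same general form of test statistic; the schematic below computes a weighted integrated squared distance between a kernel density estimate and a fitted parametric density on a grid. The "residuals" are simulated directly and the weight function is arbitrary, so this omits the surrogate-replication and deconvolution machinery, the bandwidth choices, and the null calibration developed in the dissertation.

```python
# Schematic of the weighted integrated squared distance used by both tests:
# T = integral of w(x) * [ f_nonparametric(x) - f_semiparametric(x) ]^2 dx.
# Here the residuals are simulated directly and the semi-parametric fit is a
# plain normal MLE, so the surrogate/replication machinery is omitted.
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(5)
resid = rng.normal(0, 1, 400)             # stand-in for the residuals / averaged surrogates

grid = np.linspace(-4, 4, 801)
f_np = gaussian_kde(resid)(grid)          # nonparametric density estimate
f_sp = norm.pdf(grid, resid.mean(), resid.std(ddof=1))  # fitted null density
w = norm.pdf(grid)                        # an (arbitrary) integrable weight

T = np.sum(w * (f_np - f_sp) ** 2) * (grid[1] - grid[0])   # Riemann-sum integral
print("weighted integrated squared distance:", T)
```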
|
5 |
Robust mixture linear EIV regression models by t-distribution. Liu, Yantong. January 1900.
Master of Science / Department of Statistics / Weixing Song / A robust estimation procedure for mixture errors-in-variables linear regression models is proposed in this report, assuming that the error terms follow a t-distribution. The estimation procedure is implemented by an EM algorithm that exploits the fact that the t-distribution can be written as a scale mixture of a normal distribution with a gamma-distributed mixing variable. Finite sample performance of the proposed algorithm is evaluated by extensive simulation studies. A comparison is also made with the MLE procedure under the normality assumption.
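As a sketch of the scale-mixture idea behind the EM algorithm, the code below fits a single (non-mixture, fully observed) linear regression with t-distributed errors and fixed degrees of freedom: the E-step yields per-observation weights from the latent gamma scale, and the M-step is weighted least squares. The report's mixture errors-in-variables version adds component memberships and measurement error, which are not shown here.

```python
# Sketch of the scale-mixture EM idea for a single linear regression with
# t-distributed errors (fixed degrees of freedom nu): the latent gamma scale
# gives per-observation weights (E-step), and the M-step is weighted least
# squares. The report's full mixture / errors-in-variables version is richer.
import numpy as np

rng = np.random.default_rng(6)
n, nu = 500, 4
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.standard_t(nu, size=n)   # heavy-tailed errors
X = np.column_stack([np.ones(n), x])

beta = np.zeros(2)
sigma2 = 1.0
for _ in range(100):
    r = y - X @ beta
    w = (nu + 1) / (nu + r**2 / sigma2)          # E-step: expected latent scale given data
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # M-step: weighted LS
    sigma2 = np.sum(w * (y - X @ beta) ** 2) / n

print("robust estimates (intercept, slope):", beta, "sigma2:", sigma2)
```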
|
6 |
Regression calibration and maximum likelihood inference for measurement error models. Monleon-Moscardo, Vicente J. 08 December 2005.
Graduation date: 2006 / Regression calibration inference seeks to estimate regression models with measurement error in explanatory variables by replacing the mismeasured variable by its conditional expectation, given a surrogate variable, in an estimation procedure that would have been used if the true variable were available. This study examines the effect of the uncertainty in the estimation of the required conditional expectation on inference about regression parameters, when the true explanatory variable and its surrogate are observed in a calibration dataset and related through a normal linear model. The exact sampling distribution of the regression calibration estimator is derived for normal linear regression when independent calibration data are available. The sampling distribution is skewed and its moments are not defined, but its median is the parameter of interest. It is shown that, when all random variables are normally distributed, the regression calibration estimator is equivalent to maximum likelihood provided that a natural estimate of variance is non-negative. A check for this equivalence is useful in practice for judging the suitability of regression calibration. Results about relative efficiency are provided for both external and internal calibration data. In some cases maximum likelihood is substantially more efficient than regression calibration. In general, though, a more important concern when the necessary conditional expectation is uncertain is that inferences based on approximate normality and estimated standard errors may be misleading. Bootstrap and likelihood-ratio inferences are preferable.
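A minimal sketch of the regression calibration step with external calibration data follows: estimate E[X | W] from the calibration sample and substitute the predicted values into the outcome regression in place of the unobserved X. Variances, sample sizes, and the linear calibration model are illustrative; the exact sampling-distribution and efficiency results of the thesis are not reproduced.

```python
# Minimal sketch of regression calibration with external calibration data:
# estimate E[X | W] from the calibration sample, then plug the predicted values
# into the outcome regression in place of the unobserved X. Values are illustrative.
import numpy as np

rng = np.random.default_rng(7)

def linfit(x, y):
    """Return (intercept, slope) of an ordinary least squares line."""
    A = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Calibration data: both the true X and its surrogate W are observed
n_cal = 200
x_cal = rng.normal(0, 1, n_cal)
w_cal = x_cal + rng.normal(0, 0.5, n_cal)
a0, a1 = linfit(w_cal, x_cal)            # calibration model E[X | W] = a0 + a1 W

# Main data: only the surrogate W and the outcome Y are observed
n = 1000
x = rng.normal(0, 1, n)
w = x + rng.normal(0, 0.5, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)

naive = linfit(w, y)                     # ignores measurement error (attenuated)
rc = linfit(a0 + a1 * w, y)              # regression calibration estimate
print("naive slope:", naive[1], " regression calibration slope:", rc[1])
```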
|
7 |
Modeling and Control of Bilinear Systems: Application to the Activated Sludge Process. Ekman, Mats. January 2005.
This thesis concerns modeling and control of bilinear systems (BLS). BLS are linear in the state for a fixed control and linear in the control for a fixed state, but not jointly linear in both. In the first part of the thesis, a background to BLS and their applications to modeling and control is given. The second part, which is also the principal theme of this thesis, is dedicated to theoretical aspects of identification, modeling, and control of mainly BLS, but also of linear systems. In the last part of the thesis, applications of bilinear and linear modeling and control to the activated sludge process (ASP) are given.
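For concreteness, a standard discrete-time bilinear state-space form is written out below in generic notation (not necessarily the thesis's): the model is linear in the state for a fixed input and linear in the input for a fixed state, but the product terms make it nonlinear jointly.

```latex
% A generic discrete-time bilinear state-space model (illustrative notation):
% linear in x_k for fixed u_k and linear in u_k for fixed x_k, but not jointly
% linear because of the product terms u_{k,j} N_j x_k.
\[
  x_{k+1} \;=\; A\,x_k \;+\; B\,u_k \;+\; \sum_{j=1}^{m} u_{k,j}\, N_j\, x_k,
  \qquad
  y_k \;=\; C\,x_k .
\]
```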
|
8 |
A Bayesian approach to energy monitoring optimization. Carstens, Herman. January 2017.
This thesis develops methods for reducing energy Measurement and Verification (M&V) costs through
the use of Bayesian statistics. M&V quantifies the savings of energy efficiency and demand side
projects by comparing the energy use in a given period to what that use would have been, had no
interventions taken place. The case of a large-scale lighting retrofit study, where incandescent lamps
are replaced by Compact Fluorescent Lamps (CFLs), is considered. These projects often need to be
monitored over a number of years with a predetermined level of statistical rigour, making M&V very
expensive.
M&V lighting retrofit projects have two interrelated uncertainty components that need to be addressed,
and which form the basis of this thesis. The first is the uncertainty in the annual energy use of the
average lamp, and the second the persistence of the savings over multiple years, determined by the
number of lamps that are still functioning in a given year. For longitudinal projects, the results from
these two aspects need to be obtained for multiple years.
This thesis addresses these problems by using the Bayesian statistical paradigm. Bayesian statistics is
still relatively unknown in M&V, and presents an opportunity for increasing the efficiency of statistical
analyses, especially for such projects.
After a thorough literature review, especially of measurement uncertainty in M&V, and an introduction
to Bayesian statistics for M&V, three methods are developed. These methods address the three types
of uncertainty in M&V: measurement, sampling, and modelling. The first method is a low-cost energy
meter calibration technique. The second method is a Dynamic Linear Model (DLM) with Bayesian
Forecasting for determining the size of the metering sample that needs to be taken in a given year.
The third method is a Dynamic Generalised Linear Model (DGLM) for determining the size of the
population survival survey sample.
It is often required by law that M&V energy meters be calibrated periodically by accredited laboratories.
This can be expensive and inconvenient, especially if the facility needs to be shut down for meter
installation or removal. Some jurisdictions also require meters to be calibrated in-situ, that is, in their operating
environments. However, it is shown that metering uncertainty has a relatively small impact on
overall M&V uncertainty in the presence of sampling, and therefore the costs of such laboratory
calibration may outweigh the benefits. The proposed technique uses another commercial-grade meter
(which also measures with error) to achieve this calibration in-situ. This is done by accounting for the
mismeasurement effect through a mathematical technique called Simulation Extrapolation (SIMEX).
The SIMEX result is refined using Bayesian statistics, and achieves acceptably low error rates and
accurate parameter estimates.
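The sketch below illustrates the SIMEX step in its textbook form: add extra simulated measurement error at increasing multiples λ of the error variance, track how the naive estimate degrades, and extrapolate the fitted trend back to λ = -1. Here the measurement error variance is assumed known, which the thesis's in-situ two-meter setting avoids, and the Bayesian refinement is not shown.

```python
# Compact SIMEX sketch for a slope attenuated by measurement error: inflate the
# measurement error by factors lam, refit the naive estimator, fit a quadratic
# in lam, and extrapolate to lam = -1 (no measurement error). The error variance
# is assumed known here, unlike in the thesis's in-situ setting.
import numpy as np

rng = np.random.default_rng(8)
n, sigma_u = 2000, 0.6
x = rng.normal(0, 1, n)
w = x + rng.normal(0, sigma_u, n)          # mismeasured covariate
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)

def slope(xv, yv):
    return np.cov(xv, yv, bias=True)[0, 1] / np.var(xv)

lams = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
B = 200                                    # simulation replicates per lambda
est = []
for lam in lams:
    reps = [slope(w + rng.normal(0, np.sqrt(lam) * sigma_u, n), y) for _ in range(B)]
    est.append(np.mean(reps))

coef = np.polyfit(lams, est, deg=2)        # quadratic extrapolant in lambda
simex = np.polyval(coef, -1.0)
print("naive slope:", est[0], " SIMEX-corrected slope:", simex, " true: 2.0")
```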
The second technique uses a DLM with Bayesian forecasting to quantify the uncertainty in metering
only a sample of the total population of lighting circuits. A Genetic Algorithm (GA) is then applied
to determine an efficient sampling plan. Bayesian statistics is especially useful in this case because
it allows the results from previous years to inform the planning of future samples. It also allows for
exact uncertainty quantification, whereas current confidence interval techniques do not always do so.
Results show a cost reduction of up to 66%, but this depends on the costing scheme used. The study
then explores the robustness of the efficient sampling plans to forecast error, and finds a 50% chance
of undersampling for such plans, due to the standard M&V sampling formula which lacks statistical
power.
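For reference, a commonly used form of the standard M&V sampling formula mentioned above is sketched below: it targets a relative precision at a given confidence level (for example the 90/10 criterion) with a finite population correction, and has no notion of statistical power, which is the shortcoming identified here. The specific numbers are illustrative.

```python
# A commonly used form of the standard M&V sample-size formula referred to above:
# n0 = (z * cv / precision)^2 with a finite population correction. It targets a
# point precision only and does not account for statistical power.
import math

def mv_sample_size(cv, precision, z=1.645, population=10000):
    n0 = (z * cv / precision) ** 2
    return math.ceil(n0 / (1 + n0 / population))   # finite population correction

print(mv_sample_size(cv=0.5, precision=0.1))       # e.g. the 90/10 criterion
```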
The third technique uses a DGLM in the same way as the DLM, except for population survival
survey samples and persistence studies, not metering samples. Convolving the binomial survey result
distributions inside a GA is problematic, and instead of Monte Carlo simulation, a relatively new
technique called Mellin Transform Moment Calculation is applied to the problem. The technique is
then expanded to model stratified sampling designs for heterogeneous populations. Results show a
cost reduction of 17-40%, although this depends on the costing scheme used.
Finally the DLM and DGLM are combined into an efficient overall M&V plan where metering and
survey costs are traded off over multiple years, while still adhering to statistical precision constraints.
This is done for simple random sampling and stratified designs. Monitoring costs are reduced by
26-40% for the costing scheme assumed.
The results demonstrate the power and flexibility of Bayesian statistics for M&V applications, both in
terms of exact uncertainty quantification, and by increasing the efficiency of the study and reducing
monitoring costs. / Thesis (PhD)--University of Pretoria, 2017. / National Research Foundation / Department of Science and Technology / National Hub for the Postgraduate
Programme in Energy Efficiency and Demand Side Management / Electrical, Electronic and Computer Engineering / PhD / Unrestricted
|
9 |
Optimization under Uncertainty with Applications in Data-driven Stochastic Simulation and Rare-event Estimation. Zhang, Xinyu. January 2022.
For many real-world problems, optimization can be formulated only with partial information or subject to uncertainty, for reasons such as data measurement error, model misspecification, or dependence of the formulation on a non-stationary future. One must therefore make decisions without knowing the problem's full picture. This dissertation adopts the robust optimization framework, a worst-case perspective that characterizes uncertainty through feasible regions and optimizes over the worst possible scenarios. Two applications of this worst-case perspective are discussed: stochastic estimation and rare-event simulation.
Chapters 2 and 3 discuss a min-max framework to enhance existing estimators for simulation problems that involve a bias-variance tradeoff. Biased stochastic estimators, such as finite differences for noisy gradient estimation, often contain parameters that need to be properly chosen to balance impacts from the bias and the variance. While the optimal order of these parameters in terms of the simulation budget can be readily established, the precise best values depend on model characteristics that are typically unknown in advance. We introduce a framework to construct new classes of estimators, based on judicious combinations of simulation runs on sequences of tuning parameter values, such that the estimators consistently outperform a given tuning parameter choice in the conventional approach, regardless of the unknown model characteristics. We argue the outperformance via what we call the asymptotic minimax risk ratio, obtained by minimizing the worst-case asymptotic ratio between the mean square errors of our estimators and the conventional one, where the worst case is over any possible values of the model unknowns. In particular, when the minimax ratio is less than 1, the calibrated estimator is guaranteed to perform better asymptotically. We identify this minimax ratio for general classes of weighted estimators and the regimes where this ratio is less than 1. Moreover, we show that the best weighting scheme is characterized by a sum of two components with distinct decay rates. We explain how this arises from bias-variance balancing that combats the adversarial selection of the model constants, which can be analyzed via a tractable reformulation of a non-convex optimization problem.
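A toy illustration of the underlying bias-variance tradeoff, using noisy forward finite differences for a one-dimensional derivative, is sketched below; the "combined" estimator shown simply averages across a ladder of step sizes and is not the dissertation's minimax-optimal weighting scheme.

```python
# Illustration of the bias-variance tradeoff behind the framework above, using
# noisy forward finite differences for a derivative. The "combined" estimator
# here just averages estimates across a ladder of step sizes; the dissertation's
# minimax-optimal weighting scheme is not reproduced.
import numpy as np

rng = np.random.default_rng(9)
f = np.sin                      # smooth function observed with noise
x0, true_grad, sigma = 0.3, np.cos(0.3), 0.1

def fd_estimate(h, n_rep):
    # forward difference with fresh noise in each function evaluation
    noise = rng.normal(0, sigma, size=(n_rep, 2))
    return ((f(x0 + h) + noise[:, 0]) - (f(x0) + noise[:, 1])) / h

hs = np.array([0.02, 0.05, 0.1, 0.2, 0.4])
n_rep = 2000
per_h = np.array([fd_estimate(h, n_rep) for h in hs])        # shape (len(hs), n_rep)

mse_per_h = ((per_h - true_grad) ** 2).mean(axis=1)          # small h: variance; large h: bias
combined = per_h.mean(axis=0)                                # naive equal-weight combination
print("MSE at each h:", np.round(mse_per_h, 5))
print("MSE of equal-weight combination:", np.mean((combined - true_grad) ** 2))
```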
Chapters 4 and 5 discuss extreme event estimation using a distributionally robust optimization framework. Conventional methods for extreme event estimation rely on well-chosen parametric models asymptotically justified by extreme value theory (EVT). These methods, while powerful and theoretically grounded, can encounter difficult bias-variance tradeoffs that are exacerbated when the data size is small, deteriorating the reliability of the tail estimation. These chapters study a framework based on the rapidly growing literature on distributionally robust optimization. This approach can be viewed as a nonparametric alternative to conventional EVT: it imposes a general shape belief on the tail instead of a parametric assumption and uses worst-case optimization to handle the resulting nonparametric uncertainty. We explain how this approach bypasses the bias-variance tradeoff in EVT. On the other hand, we face a conservativeness-variance tradeoff, which we describe how to tackle. We also demonstrate computational tools for the involved optimization problems and compare our performance with conventional EVT across a range of numerical examples.
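As a toy illustration of the worst-case viewpoint only: if nothing but a mean and a variance are assumed, the worst-case tail probability over all such distributions has the closed-form Cantelli bound sketched below. The chapters impose richer tail-shape beliefs and solve the resulting optimization numerically, which this closed form does not capture.

```python
# Toy example of "worst case over a distribution class": if only the mean and
# variance are assumed, sup_P P(X > t) over all such P is the Cantelli bound.
# The chapters use richer tail-shape constraints and numerical optimization;
# this closed form is only meant to illustrate the worst-case viewpoint.
import numpy as np
from scipy import stats

mu, var, t = 0.0, 1.0, 3.0
cantelli = var / (var + (t - mu) ** 2)            # worst-case P(X > t), for t > mu
print("worst-case tail bound:", cantelli)          # 0.1
print("normal tail for comparison:", stats.norm(mu, np.sqrt(var)).sf(t))
```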
|
10 |
Flexible models of time-varying exposures. Wang, Chenkun. 05 1900.
Indiana University-Purdue University Indianapolis (IUPUI) / With the availability of electronic medical records, medication dispensing data offers
an unprecedented opportunity for researchers to explore complex relationships among long-term medication use, disease progression, and potential side effects in large patient populations. However, these data also pose challenges to existing statistical models because both medication exposure status and its intensity vary over time. This dissertation focused on flexible models to investigate the association between time-varying exposures and different types of outcomes. First, a penalized functional regression model was developed to estimate the effect of time-varying exposures on multivariate longitudinal outcomes. Second, for survival outcomes, a regression spline based model was proposed in the Cox proportional hazards (PH) framework to compare disease risk among different types of time-varying exposures. Finally, a penalized spline based Cox PH model with functional interaction terms was developed to estimate the interaction effect between multiple medication classes. Data from a primary care patient cohort are used to illustrate the proposed approaches in determining the association between antidepressant use and various outcomes. / NIH grants, R01 AG019181 and P30 AG10133.
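The sketch below gives a rough scalar-on-function version of the first, penalized functional regression idea: the time-varying exposure enters through an integral of the exposure trajectory against a coefficient function β(t), which is expanded in a simple bump basis and estimated with a second-difference roughness penalty. The basis, penalty weight, and data-generating model are illustrative, and the multivariate-outcome and Cox extensions are not shown.

```python
# Rough sketch of a penalized functional (scalar-on-function) regression for a
# time-varying exposure: the coefficient function beta(t) is expanded in a
# simple bump basis and estimated with a second-difference roughness penalty.
# Basis, penalty weight, and data-generating model are illustrative.
import numpy as np

rng = np.random.default_rng(10)
n, m, K, lam = 300, 50, 12, 10.0
t = np.linspace(0, 1, m)
dt = t[1] - t[0]

# Time-varying exposure trajectories and a true effect concentrated early in time
Xf = np.cumsum(rng.normal(0, 0.3, size=(n, m)), axis=1)
beta_true = np.exp(-((t - 0.2) / 0.15) ** 2)
y = 1.0 + dt * Xf @ beta_true + rng.normal(0, 0.5, n)

# Gaussian bump basis for beta(t), and a second-difference penalty on its coefficients
centers = np.linspace(0, 1, K)
B = np.exp(-((t[:, None] - centers[None, :]) / 0.1) ** 2)     # m x K basis matrix
D2 = np.diff(np.eye(K), n=2, axis=0)                          # (K-2) x K difference operator
P = D2.T @ D2

Z = dt * Xf @ B                                               # n x K "integrated" design
D = np.column_stack([np.ones(n), Z])
Pen = np.zeros((K + 1, K + 1))
Pen[1:, 1:] = P                                               # do not penalize the intercept
coef = np.linalg.solve(D.T @ D + lam * Pen, D.T @ y)

beta_hat = B @ coef[1:]
print("fitted beta(t) near t=0.2 and t=0.8:",
      beta_hat[t.searchsorted(0.2)], beta_hat[t.searchsorted(0.8)])
```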
|