451. Estimating Structural Models with Bayesian Econometrics. Sacher, Szymon Konrad, January 2023.
With the ever-increasing availability of large, high-dimensional datasets, there is a growing need for econometric methods that can handle such data. The last decade has seen the development of many such methods in computer science, but their applications to economic models have been limited. In this thesis, I investigate whether modern tools in (exact and approximate) Bayesian inference can be useful in economics. In the three chapters, my coauthors and I develop and estimate a variety of models applied to problems in organizational economics, health, and labor.

In chapter one, joint with Andrew Olenski, we estimate a mortality-based Bayesian model of nursing home quality, accounting for selection. We then conduct three exercises. First, we examine the correlates of quality and find that public report cards are nearly uncorrelated with it. Second, we show that higher-quality nursing homes fared better during the pandemic: a one standard deviation increase in quality corresponds to 2.5% fewer Covid-19 cases. Finally, we show that a 10% increase in the Medicaid reimbursement rate raises quality, leading to a 1.85 percentage point increase in 90-day survival. Such a reform would be cost-effective under conservative estimates of the quality-adjusted statistical value of life.
In chapter two, joint with Laura Battaglia and Stephen Hansen, we demonstrate the effectiveness of Hamiltonian Monte Carlo (HMC) for analyzing high-dimensional data in a computationally efficient and methodologically sound manner. We propose a new model, the Supervised Topic Model with Covariates, and show that modeling this type of data carefully can have significant implications for the conclusions drawn, compared with a simpler but methodologically problematic two-step approach. We demonstrate these results through a simulation study and by revisiting the study of executive time use by Bandiera, Prat, Hansen, and Sadun (2020). The approach can accommodate thousands of parameters and does not require custom algorithms specific to each model, making it more accessible to applied researchers.
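To give a concrete sense of the workflow this chapter advocates, the sketch below fits a small hierarchical regression with HMC (NUTS) in NumPyro. It is only a generic illustration of estimating a Bayesian model through a probabilistic programming framework with a reusable sampler; it is not the Supervised Topic Model with Covariates itself, and all variable names, priors, and data are placeholders.

    import jax.numpy as jnp
    import jax.random as random
    import numpyro
    import numpyro.distributions as dist
    from numpyro.infer import MCMC, NUTS

    def model(x, group, n_groups, y=None):
        # Group intercepts shrunk toward a common mean, plus global slopes.
        mu_a = numpyro.sample("mu_a", dist.Normal(0.0, 1.0))
        sigma_a = numpyro.sample("sigma_a", dist.HalfNormal(1.0))
        with numpyro.plate("groups", n_groups):
            a = numpyro.sample("a", dist.Normal(mu_a, sigma_a))
        with numpyro.plate("coefs", x.shape[1]):
            beta = numpyro.sample("beta", dist.Normal(0.0, 1.0))
        sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
        with numpyro.plate("obs", x.shape[0]):
            numpyro.sample("y", dist.Normal(a[group] + x @ beta, sigma), obs=y)

    # Simulated data; the same generic NUTS engine then handles the sampling,
    # with no model-specific algorithm written by the user.
    key = random.PRNGKey(0)
    x = random.normal(key, (200, 3))
    group = jnp.repeat(jnp.arange(10), 20)
    y = 0.5 + x @ jnp.array([1.0, -0.5, 0.2]) + 0.3 * random.normal(key, (200,))
    mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
    mcmc.run(random.PRNGKey(1), x, group, 10, y=y)
    mcmc.print_summary()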
In chapter three, I propose a new way to estimate a two-way fixed effects model, such as that of Abowd, Kramarz, and Margolis (1999) (AKM), that relaxes the stringent assumptions concerning the matching process. Through simulations, I demonstrate that this model performs well, and I provide an application to matched employer-employee data from Brazil. The results indicate that disregarding selection may substantially bias estimates of location fixed effects and can thus help explain recent findings on the relevance of locations in US labor markets.
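For reference, the standard AKM specification that the chapter builds on decomposes log earnings into additively separable worker and firm (or location) fixed effects; the exact covariates used in the chapter are not stated in the abstract, so the x_{it} term below is generic:

    \log w_{it} = \alpha_i + \psi_{J(i,t)} + x_{it}'\beta + \varepsilon_{it}

Here \alpha_i is a worker fixed effect, \psi_{J(i,t)} is the fixed effect of the firm or location employing worker i at time t, and the usual estimator requires the error \varepsilon_{it} to be unrelated to how workers sort across firms (exogenous mobility), which is the kind of assumption on the matching process that the chapter relaxes.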
Together, the three chapters demonstrate the usefulness of modern Bayesian methods for estimating models that would otherwise be infeasible, while remaining accessible to applied researchers. They also highlight the importance of carefully modeling the data of interest rather than relying on ad hoc solutions, which is shown to significantly affect the conclusions drawn across a variety of problems.
452. Algorithm Design and Localization Analysis in Sequential and Statistical Learning. Xu, Yunbei, January 2023.
Learning theory is a dynamic and rapidly evolving field that aims to provide mathematical foundations for designing and understanding the behavior of algorithms and procedures that can learn from data automatically. At the heart of this field lies the interplay between algorithm design and statistical complexity analysis, with sharp statistical complexity characterizations often requiring localization analysis.
This dissertation aims to advance the fields of machine learning and decision making by contributing to two key directions: principled algorithm design and localized statistical complexity. Our research develops novel algorithmic techniques and analytical frameworks to build more effective and robust learning systems. Specifically, we focus on studying uniform convergence and localization in statistical learning theory, developing efficient algorithms using the optimism principle for contextual bandits, and creating Bayesian design principles for bandit and reinforcement learning problems.
453. Bridging Text Mining and Bayesian Networks. Raghuram, Sandeep Mudabail, 9 March 2011.
Indiana University-Purdue University Indianapolis (IUPUI)

After the initial network is constructed using expert knowledge of the domain, Bayesian networks need to be updated as and when new data are observed. Literature mining is a very important source of such new data. In this work, we explore what kind of data needs to be extracted with a view to updating Bayesian networks, which existing technologies can be useful in achieving some of these goals, and what research is required to meet the remaining requirements.

This thesis specifically deals with utilizing causal associations and experimental results that can be obtained from literature mining. However, these associations and numerical results cannot be directly integrated with the Bayesian network. The source of the literature and the perceived quality of the research need to be factored into the process of integration, just as a human reading the literature would do. This thesis presents a general methodology for updating a Bayesian network with the mined data. The methodology consists of solutions to some of the issues surrounding the task of integrating the causal associations with the Bayesian network, and it demonstrates the idea with a semi-automated software system.
454. An Automated System for Generating Situation-Specific Decision Support in Clinical Order Entry from Local Empirical Data. Klann, Jeffrey G., 19 October 2011.
Indiana University-Purdue University Indianapolis (IUPUI)

Clinical decision support is one of the few aspects of health information technology that has demonstrated decreased costs and increased quality in healthcare delivery, yet it is extremely expensive and time-consuming to create, maintain, and localize. Consequently, a majority of health care systems do not utilize it, and even when it is available it is frequently incorrect. It is therefore important to look beyond traditional guideline-based decision support to more readily available resources in order to bring this technology into widespread use. This study proposes that the wisdom of physicians within a practice is a rich, untapped knowledge source that can be harnessed for this purpose. I hypothesize and demonstrate that this wisdom is reflected in order-entry data well enough to partially reconstruct the knowledge behind treatment decisions. Automated reconstruction of such knowledge is used to produce dynamic, situation-specific treatment suggestions, in a similar vein to Amazon.com shopping recommendations. This approach is appealing because it is local (so it reflects local standards), it fits into workflow more readily than the traditional local-wisdom approach (viz. the curbside consult), and it is free (the data are already being captured).
This work develops several new machine-learning algorithms and novel applications of existing algorithms, focusing on an approach called Bayesian network structure learning. I develop: an approach to produce dynamic, rank-ordered, situation-specific treatment menus from treatment data; statistical machinery to evaluate their accuracy using retrospective simulation; a novel algorithm that is an order of magnitude faster than existing algorithms; a principled approach to choosing smaller, near-optimal, domain-specific subsystems; and a new method to discover temporal relationships in the data. The result is a comprehensive approach for extracting knowledge from order-entry data to produce situation-specific treatment menus, which is applied to order-entry data at Wishard Hospital in Indianapolis. Retrospective simulations find that, in a large variety of clinical situations, a short menu will contain the clinicians' desired next actions. A prospective survey additionally finds that such menus aid physicians in writing order sets, in both completeness and speed. This study demonstrates that clinical knowledge can be successfully extracted from treatment data for decision support.
455. Use of A Priori Knowledge on Dynamic Bayesian Models in Time-Course Expression Data Prediction. Kilaru, Gokhul Krishna, 20 March 2012.
Indiana University-Purdue University Indianapolis (IUPUI)

Bayesian networks, one of the most widely used techniques for understanding or predicting the future from current or previous data, have gained credence over the last decade for their ability to simulate large gene expression datasets and to track and predict the reasons for changes in biological systems. In this work, we present a dynamic Bayesian model incorporating gene annotation scores, such as the gene characterization index (GCI) and the GeneCards inferred functionality score (GIFtS), to understand and assess the prediction performance of the model when prior knowledge is incorporated. Time-course breast cancer data, including expression data for genes in breast cell lines treated with doxorubicin, are considered for this study. Bayes Server software was used for the simulations in a dynamic Bayesian environment with 8 and 19 genes on 12 different data combinations for each category of gene set, to predict and understand future time-course expression profiles when annotation scores are incorporated into the model. The 8-gene set predicted the next time course with r > 0.95, and the 19-gene set yielded r > 0.8 in 92% of the simulation experiments. These results show that incorporating prior knowledge into a dynamic Bayesian model for simulating time-course expression data can improve prediction performance when sufficient a priori parameters are provided.
456. Uncertainty Quantification for Micro-Scale Simulations of Flow in Plant Canopies. Giacomini, Beatrice, January 2023.
Recent decades have seen a remarkable increase in the fidelity of computational fluid dynamics (CFD) models for simulating exchange processes between plant canopies and the atmosphere. However, no matter how accurate the selected CFD solver is, model results are affected by an irreducible level of uncertainty that originates from the inability to measure exactly either vegetation properties (leaf orientation, foliage density, plant reconfiguration) or flow features (incoming wind direction, solar radiation, stratification effects).
Motivated by this consideration, this PhD thesis proposes a Bayesian uncertainty quantification (UQ) framework for evaluating uncertainty in model parameters and its impact on model results, in the context of CFD for idealized and realistic plant canopy flow. Two problems are considered. First, for the one-dimensional flow within and above the Duke Forest near Durham, NC, a one-dimensional Reynolds-averaged Navier-Stokes model is employed. In-situ measurements of turbulence statistics are used to inform the UQ framework in order to evaluate uncertainty in plant geometry and its impact on turbulence statistics and aerodynamic coefficients.
The second problem has a more realistic setup, with three-dimensional simulations aiming to replicate the flow over a walnut block in Dixon, CA. Due to the substantial computational cost of large-eddy simulation (LES), a surrogate model is used for the flow simulations. The surrogate is built on a small number of LES runs over the realistic plant canopy, with plant area density derived from LiDAR measurements. Here, the goal is to investigate uncertainty in the incoming wind direction and its potential repercussions on turbulence statistics. Synthetic data are used to inform the framework.
In both cases, uncertainty in model parameters is characterized via a Markov chain Monte Carlo procedure (the inverse problem) and propagated to model results through Monte Carlo sampling (the forward problem). In the validation phase, profiles of turbulence statistics with associated uncertainty are compared with the measurements used to inform the framework. By providing an enriched solution for the simulation of flow over idealized and realistic plant canopies, this PhD thesis highlights the potential of UQ to enhance the prediction of micro-scale exchange processes between vegetation and the atmosphere.
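The inverse/forward structure described above can be illustrated with a deliberately simplified sketch: random-walk Metropolis characterizes the posterior of a single parameter from noisy observations of a toy forward model, and the posterior draws are then pushed back through the model to obtain predictions with uncertainty. The forward model, prior, noise level, and data below are stand-ins, not the RANS or LES models of the thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    def forward(theta, z):
        # Stand-in forward model, e.g. a mean profile as a function of height z.
        return theta * np.log1p(z)

    # Synthetic "measurements" generated from a known parameter value.
    z_obs = np.linspace(0.1, 10.0, 15)
    theta_true, noise_sd = 2.0, 0.2
    y_obs = forward(theta_true, z_obs) + rng.normal(0.0, noise_sd, z_obs.size)

    def log_post(theta):
        # Gaussian likelihood plus a weakly informative N(0, 10^2) prior.
        resid = y_obs - forward(theta, z_obs)
        return -0.5 * np.sum((resid / noise_sd) ** 2) - 0.5 * (theta / 10.0) ** 2

    # Inverse problem: random-walk Metropolis sampling of the posterior.
    draws, theta, lp = [], 1.0, log_post(1.0)
    for _ in range(5000):
        proposal = theta + 0.1 * rng.normal()
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        draws.append(theta)
    posterior = np.array(draws[1000:])           # discard burn-in

    # Forward problem: propagate posterior draws through the model.
    z_new = np.linspace(0.1, 10.0, 50)
    predictions = np.array([forward(t, z_new) for t in posterior])
    band_lo, band_hi = np.percentile(predictions, [2.5, 97.5], axis=0)
    print("theta posterior mean/sd:", posterior.mean(), posterior.std())
    print("95% predictive band at top of profile:", band_lo[-1], band_hi[-1])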
457. General Bayesian Calibration Framework for Model Contamination and Measurement Error. Wang, Siquan, January 2023.
Many applied statistical analyses face the potential problems of model contamination and measurement error. The form and degree of contamination, as well as the measurement error, are usually unknown and sample-specific, which poses additional challenges for researchers. In this thesis, we propose several Bayesian inference models to address these issues, with application to a special type of data used for allergen concentration measurement, called serial dilution data, which is self-calibrated.
In our first chapter, we address the problem of model contamination by using a multilevel model to simultaneously flag problematic observations and estimate unknown concentrations in serial dilution data, a problem where the current approach can lead to noisy estimates and difficulty in estimating very low or high concentrations.
In our second chapter, we propose a Bayesian joint contamination model for analyzing multiple measurement units at the same time while adjusting for differences between experiments through the idea of global calibration; the model accounts for uncertainty in both predictors and response variables in Bayesian regression. We obtain efficiency gains by analyzing multiple experiments together while maintaining robustness through the use of hierarchical models.
In our third chapter, we develop a Bayesian two-step inference model to account for the propagation of measurement uncertainty in regression analysis when a joint inference model is infeasible. We aim to increase the reliability of model inference while giving users flexibility by not restricting the type of inference model used in the first step. For each of the proposed methods, we also demonstrate how to integrate multiple model building blocks through the idea of a Bayesian workflow.
In extensive simulation studies, we show that our proposed methods outperform other commonly used approaches. For the data applications, we apply the proposed new methods to the New York City Neighborhood Asthma and Allergy Study (NYC NAAS) data to estimate indoor allergen concentrations more accurately as well as reveal the underlying associations between dust mite allergen concentrations and the exhaled nitric oxide (NO) measurement for asthmatic children. The methods and tools developed here have a wide range of applications and can be used to improve lab analyses, which are crucial for quantifying exposures to assess disease risk and evaluating interventions.
458. Exploring Confidence Intervals in the Case of Binomial and Hypergeometric Distributions. Mojica, Irene, 1 January 2011.
The objective of this thesis is to examine one of the most fundamental and yet important methodologies used in statistical practice, interval estimation of the probability of success in a binomial distribution. The textbook confidence interval for this problem is known as the Wald interval, as it comes from the Wald large-sample test for the binomial case. It is generally acknowledged that the actual coverage probability of the standard interval is poor for values of p near 0 or 1. Moreover, it has recently been documented that the coverage properties of the standard interval can be erratic even when p is not near the boundaries. For this reason, one would like to study a variety of methods for constructing confidence intervals for the unknown probability p in the binomial case. The present thesis accomplishes this task by presenting several such methods.

It is well known that the hypergeometric distribution is related to the binomial distribution. In particular, if the size of the population, N, is large and the number of items of interest, k, is such that k/N tends to p as N grows, then the hypergeometric distribution can be approximated by the binomial distribution. In this case, one can therefore use the confidence intervals constructed for p under the binomial distribution as a basis for constructing confidence intervals for the unknown value k = pN. The goal of this thesis is to study this approximation and to point out several confidence intervals designed specifically for the hypergeometric distribution. In particular, it considers several confidence intervals based on estimation of a binomial proportion, as well as Bayesian credible sets based on various priors.
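For concreteness, the Wald interval referred to above is p_hat +/- z_{alpha/2} * sqrt(p_hat(1 - p_hat)/n) with p_hat = X/n. The short check below, a standard illustration rather than an excerpt from the thesis, computes its exact coverage probability by summing binomial probabilities, showing how far coverage can fall below the nominal 95% level when p is near 0.

    import numpy as np
    from scipy.stats import binom, norm

    def wald_coverage(p, n, level=0.95):
        # Exact coverage of the Wald interval: sum the binomial probabilities of
        # all counts x whose interval p_hat +/- z*sqrt(p_hat(1-p_hat)/n) covers p.
        z = norm.ppf(0.5 + level / 2.0)
        x = np.arange(n + 1)
        p_hat = x / n
        half = z * np.sqrt(p_hat * (1.0 - p_hat) / n)
        covered = (p_hat - half <= p) & (p <= p_hat + half)
        return binom.pmf(x, n, p)[covered].sum()

    for p in (0.5, 0.1, 0.02):
        # Coverage drops well below the nominal 0.95 as p approaches 0.
        print(p, round(wald_coverage(p, n=50), 3))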
459. Computational Psychometrics for Item-based Computerized Adaptive Learning. Chen, Yi, January 2023.
With advances in computer technology and expanded access to educational data, psychometrics faces new opportunities and challenges for enhancing pattern discovery and decision-making in testing and learning. In this dissertation, I introduce three computational psychometrics studies that solve technical problems in item-based computerized adaptive learning (CAL) systems related to dynamic measurement, diagnosis, and recommendation, all based on Bayesian item response theory (IRT).
In the first study, I introduce a new knowledge tracing (KT) model, dynamic IRT (DIRT), which iteratively updates the posterior distribution of latent ability using moment-matching approximation and captures the uncertainty of ability change during the learning process. For dynamic measurement, DIRT has advantages in interpretation, flexibility, computational cost, and implementability. In the second study, a new measurement model, multilevel and multidimensional item response theory with a Q matrix (MMIRT-Q), is proposed to provide fine-grained diagnostic feedback, and I introduce sequential Monte Carlo (SMC) for online estimation of latent abilities.
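As a toy illustration of what online SMC estimation of ability looks like, the sketch below runs a bootstrap particle filter for a single latent ability under a simple Rasch (1PL) response model with a small drift between items. It only conveys the sequential-updating idea; the item difficulties, drift, and response model are placeholders, and this is not the DIRT or MMIRT-Q model proposed in the dissertation.

    import numpy as np

    rng = np.random.default_rng(0)
    n_particles = 2000
    ability = rng.normal(0.0, 1.0, n_particles)      # prior draws of latent ability

    def rasch_prob(theta, b):
        # Probability of a correct response to an item with difficulty b.
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    # A stream of (item difficulty, observed response) pairs.
    responses = [(-0.5, 1), (0.0, 1), (0.8, 0), (0.3, 1), (1.2, 0)]

    for b, y in responses:
        # Ability is allowed to drift slightly between items (learning/forgetting).
        ability = ability + rng.normal(0.0, 0.1, n_particles)
        # Reweight particles by the likelihood of the observed response ...
        p = rasch_prob(ability, b)
        weights = p if y == 1 else 1.0 - p
        weights = weights / weights.sum()
        # ... and resample to avoid weight degeneracy.
        ability = ability[rng.choice(n_particles, size=n_particles, p=weights)]
        print(f"item b={b:+.1f}, y={y}: "
              f"ability mean={ability.mean():.2f}, sd={ability.std():.2f}")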
In the third study, I propose the maximum expected ratio of posterior variance reduction (MERPV) criterion for testing purposes and the maximum expected improvement in posterior mean (MEIPM) criterion for learning purposes, under the unified framework of IRT. With these computational psychometrics solutions, we can improve students' learning and testing experience through accurate psychometric measurement, timely diagnostic feedback, and efficient item selection.
460. Correcting for Measurement Error and Misclassification Using General Location Models. Kwizera, Muhire Honorine, January 2023.
Measurement error is common in epidemiologic studies and can lead to biased statistical inference. It is well known, for example, that regression analyses involving measurement error in predictors often produce biased model coefficient estimates. The work in this dissertation adds to the vast existing literature on measurement error by proposing a missing-data treatment of measurement error through general location models.
The focus is on the case in which information about the measurement error model is obtained not from a subsample of the main study data but from separate, external information, namely external calibration. Methods for handling measurement error in the external calibration setting are needed, given the increasing availability of external data sources and the popularity of data integration in epidemiologic studies. General location models are well suited to the joint analysis of continuous and discrete variables: they offer direct relationships with the linear and logistic regression models and can be readily implemented using frequentist and Bayesian approaches. We use general location models to correct for measurement error and misclassification in the context of three practical problems.
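In its basic (Olkin-Tate) form, the general location model referred to here couples a multinomial distribution over the cells defined by the discrete variables with a multivariate normal distribution for the continuous variables within each cell; the dissertation's exact specification may differ:

    W \sim \text{Multinomial}(\pi_1, \ldots, \pi_C), \qquad Z \mid W = c \sim N(\mu_c, \Sigma)

With a common covariance \Sigma across cells, each continuous variable given the others follows a linear regression, and the cell probabilities given Z take a multinomial logistic form, which is the sense in which the model relates directly to linear and logistic regression.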
The first problem concerns measurement error in a continuous variable from a dataset containing both continuous and categorical variables. In the second problem, measurement error in the continuous variable is further complicated by the limit of detection (LOD) of the measurement instrument, so that some measurements of the error-prone continuous variable are unobserved because they fall below the LOD. The third problem deals with misclassification in a binary treatment variable. We implement the proposed methods using Bayesian approaches for the first two problems and the expectation-maximization (EM) algorithm for the third.
For the first problem, we propose a Bayesian approach, based on the general location model, to correct measurement error in a continuous variable in a dataset with both continuous and categorical variables. We consider the external calibration setting in which, in addition to the main study data of interest, calibration data are available that provide information on the measurement error but not on the error-free variables.
The proposed method uses observed data from both the calibration and main study samples and incorporates the relationships among all variables in the measurement error adjustment, unlike existing methods that use only the calibration data for model estimation. We assume strong nondifferential measurement error (sNDME), that is, the measurement error is independent of all the error-free variables given the true value of the error-prone variable. The sNDME assumption allows us to identify the model parameters. We show through simulations that the proposed method yields reduced bias, smaller mean squared error, and interval coverage closer to the nominal level than existing methods in regression settings. This improvement is more pronounced with larger measurement error, higher correlation between covariates, and stronger covariate effects. We apply the new method to the New York City Neighborhood Asthma and Allergy Study to examine the association between indoor allergen concentrations and asthma morbidity among urban asthmatic children.
The simultaneous occurrence of measurement error and values below the LOD is common, particularly for environmental exposures such as the indoor allergen concentrations mentioned in the first problem. Statistical analyses that do not address the two problems simultaneously can lead to wrong scientific conclusions. To address this second problem, we extend the Bayesian general location models for measurement error adjustment to handle both measurement error and values below the LOD in a continuous environmental exposure, in a regression setting with mixed continuous and discrete variables. We treat values below the LOD as censored. Simulations show that our method yields smaller bias and root mean squared error, and that its posterior credible intervals have coverage closer to the nominal level than alternative methods, even when the proportion of data below the LOD is moderate. We revisit data from the New York City Neighborhood Asthma and Allergy Study and quantify the effect of indoor allergen concentrations on childhood asthma when over 50% of the measured concentrations are below the LOD.
Finally, we consider the third problem of comparing group means when treatment groups are misclassified. Our motivation comes from the Frequent User Services Engagement (FUSE) study. Researchers wish to compare quantitative health and social outcome measures for frequent jail-and-shelter users who were assigned housing and those who were not, and misclassification occurs as a result of noncompliance. The recommended intent-to-treat analysis, which is based on initial group assignment, is known to underestimate group mean differences. We use the general location model to estimate differences in group means after adjusting for misclassification in the binary grouping variable. Information on the misclassification is available through the sensitivity and specificity. We assume nondifferential misclassification, so that the misclassification does not depend on the outcome. We use the expectation-maximization algorithm to obtain estimates of the general location model parameters and the difference in group means. Simulations show the reduction in bias of the estimated group mean difference.