41 |
Extremal martingales with applications and a Bayesian approach to model selection / Dümbgen, Moritz / January 2015
No description available.
|
42 |
Bayesian criterion-based model selection in structural equation models. / CUHK electronic theses & dissertations collection / January 2010
Structural equation models (SEMs) are commonly used in behavioral, educational, medical, and social sciences. Many software packages, such as EQS, LISREL, MPlus, and WinBUGS, can be used for the analysis of SEMs, and many methods have been developed to analyze them. One popular method is the Bayesian approach. An important issue in the Bayesian analysis of SEMs is model selection. In the literature, the Bayes factor and the deviance information criterion (DIC) are commonly used statistics for Bayesian model selection. However, as commented in Chen et al. (2004), the Bayes factor relies on posterior model probabilities, for which proper prior distributions are needed. Specifying prior distributions for all models under consideration is usually a challenging task, in particular when the model space is large. In addition, it is well known that the Bayes factor and posterior model probability are generally sensitive to the choice of the prior distributions of the parameters. Furthermore, the computational burden of the Bayes factor is heavy. Alternatively, criterion-based methods are attractive in the sense that they do not require proper prior distributions in general, and the computation is quite simple. One of the commonly used criterion-based methods is DIC, which, however, assumes the posterior mean to be a good estimator. For some models, such as mixture SEMs, WinBUGS does not provide the DIC values. Moreover, if the difference in DIC values is small, reporting only the model with the smallest DIC value may be misleading. In this thesis, motivated by the above limitations of the Bayes factor and DIC, a Bayesian model selection criterion called the Lv measure is considered. It is a combination of the posterior predictive variance and bias, and can be viewed as a Bayesian goodness-of-fit statistic.
The calibration distribution of the Lv measure, defined as the prior predictive distribution of the difference between the Lv measures of the candidate model and the criterion-minimizing model, is discussed to aid a detailed understanding of the Lv measure. The computation of the Lv measure is quite simple, and the performance is satisfactory. Thus, it is an attractive model selection statistic. In this thesis, the application of the Lv measure to various kinds of SEMs will be studied, and some illustrative examples will be presented to evaluate the performance of the Lv measure for model selection of SEMs. To compare different model selection methods, the Bayes factor and DIC will also be computed. Moreover, different prior inputs and sample sizes are considered to check the impact of the prior information and sample size on the performance of the Lv measure. In this thesis, when the performances of two models are similar, the simpler one is selected. / Li, Yunxian. / Adviser: Song Xinyuan. / Source: Dissertation Abstracts International, Volume: 72-04, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 116-122). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
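As background to the DIC comparison mentioned in this abstract, the following is a minimal sketch of the standard DIC computation from posterior draws, DIC = D(θ̄) + 2·p_D with p_D = mean(D) − D(θ̄) and D = −2·log-likelihood. It is not taken from the thesis: the normal model, the data, and the posterior draws are hypothetical placeholders, and the Lv measure itself is not reproduced here because the abstract does not give its formula.

```python
import math


def normal_log_lik(data, mu, sigma=1.0):
    """Log-likelihood of i.i.d. normal data (hypothetical toy model)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)
        for y in data
    )


def dic(log_lik_draws, log_lik_at_posterior_mean):
    """Return (DIC, p_D) from log-likelihoods evaluated at posterior draws
    and at the posterior mean of the parameters."""
    mean_deviance = -2.0 * sum(log_lik_draws) / len(log_lik_draws)
    deviance_at_mean = -2.0 * log_lik_at_posterior_mean
    p_d = mean_deviance - deviance_at_mean  # effective number of parameters
    return deviance_at_mean + 2.0 * p_d, p_d


# Hypothetical data and posterior draws of mu, for illustration only.
data = [0.2, -0.4, 1.1, 0.7]
mu_draws = [0.3, 0.5, 0.1, 0.7]
mu_bar = sum(mu_draws) / len(mu_draws)
dic_value, p_d = dic(
    [normal_log_lik(data, m) for m in mu_draws],
    normal_log_lik(data, mu_bar),
)
```

The deviance is evaluated at the posterior mean, which is why DIC can mislead when the posterior mean is a poor summary, as the abstract notes for mixture models.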
|
43 |
Bayesian statistical analysis for nonrecursive nonlinear structural equation models. / CUHK electronic theses & dissertations collection / January 2007
Keywords: Bayesian analysis, Finite mixture, Gibbs sampler, Langevin-Hasting sampler, MH sampler, Model comparison, Nonrecursive nonlinear structural equation model, Path sampling. / Structural equation models (SEMs) have been applied extensively in management, marketing, behavioral, and social sciences, among other fields, to study relationships among manifest and latent variables. Motivated by the more complex data structures that have appeared in various fields, more complicated models have recently been developed. In the development of SEMs, an assumption is usually made about the regression coefficients of the underlying latent variables on themselves; more specifically, it is generally assumed that the structural equation model is recursive. However, in practice, nonrecursive SEMs are not uncommon. Thus, this fundamental assumption is not always appropriate. / The main objective of this thesis is to relax this assumption by developing efficient procedures for some complex nonrecursive nonlinear SEMs (NNSEMs). The work in the thesis is based on Bayesian statistical analysis of NNSEMs. The first chapter introduces background knowledge about NNSEMs. In chapter 2, Bayesian estimates of NNSEMs are given, and statistical analysis topics such as standard errors and model comparison are discussed. In chapter 3, we develop an efficient hybrid MCMC algorithm to obtain Bayesian estimates for NNSEMs with mixed continuous and ordered categorical data, and discuss related statistical analysis topics. In chapter 4, finite mixture NNSEMs are analyzed with the Bayesian approach. The newly developed methodologies are all illustrated with simulation studies and real examples. Finally, conclusions and discussion are given in chapter 5. / Li, Yong. / "July 2007." / Adviser: Sik-yum Lee. / Source: Dissertation Abstracts International, Volume: 69-01, Section: B, page: 0398. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2007. / Includes bibliographical references (p. 99-111). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307.
|
44 |
Toward a Robust and Universal Crowd Labeling Framework / Khattak, Faiza Khan / January 2017
The advent of fast and economical computers with large electronic storage has led to a large volume of data, most of which is unlabeled. While computers provide expeditious, accurate and low-cost computation, they still lag behind in many tasks that require human intelligence, such as labeling medical images, videos or text. Consequently, current research focuses on combining computer accuracy and human intelligence to complete labeling tasks. In most cases, labeling needs to be done by domain experts; however, because of the variability in expertise, experience, and intelligence of human beings, experts can be scarce.
As an alternative to using domain experts, help is sought from non-experts, also known as the Crowd, to complete tasks that cannot be readily automated. Since crowd labelers are non-experts, multiple labels per instance are acquired for quality purposes. The final label is obtained by combining these multiple labels. It is very common that the ground truth, instance difficulty, and labeler ability are all unknown. Therefore, the aggregation task becomes a “chicken and egg” problem to start with.
Despite the fact that much research using machine learning and statistical techniques has been conducted in this area (e.g., [Dekel and Shamir, 2009; Hovy et al., 2013a; Liu et al., 2012; Donmez and Carbonell, 2008]), many questions remain unresolved. These include: (a) What are the best ways to evaluate labelers? (b) It is common to use expert-labeled instances (ground truth) to evaluate labeler ability (e.g., [Le et al., 2010; Khattak and Salleb-Aouissi, 2011; Khattak and Salleb-Aouissi, 2012; Khattak and Salleb-Aouissi, 2013]). The question is, what should be the cardinality of the set of expert-labeled instances to have an accurate evaluation? (c) Which factors other than labeler expertise (e.g., difficulty of instance, prevalence of class, bias of a labeler toward a particular class) can affect the labeling accuracy? (d) Is there any optimal way to combine multiple labels to get the best labeling accuracy? (e) Should the labels provided by oppositional/malicious labelers be discarded and blocked? Or is there a way to use the “information” provided by oppositional/malicious labelers? (f) How can labelers and instances be evaluated if the ground truth is not known with certitude?
In this thesis, we investigate these questions. We present methods that rely on a few expert-labeled instances (usually 0.1%-10% of the dataset) to evaluate various parameters using a frequentist and a Bayesian approach. The estimated parameters are then used for label aggregation to produce one final label per instance.
In the first part of this thesis, we propose a method called Expert Label Injected Crowd Estimation (ELICE) and extend it to different versions and variants. ELICE is based on a frequentist approach for estimating the underlying parameters. The first version of ELICE estimates the parameters, i.e., labeler expertise and data instance difficulty, using the accuracy of crowd labelers on expert-labeled instances [Khattak and Salleb-Aouissi, 2011; Khattak and Salleb-Aouissi, 2012]. The multiple labels for each instance are combined using weighted majority voting. The weights are the scores of labeler reliability on any given instance, which are obtained by passing the parameters through the logistic function.
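The weighted majority vote described above can be sketched as follows. This is an illustration only, not the ELICE formula: the abstract does not say how expertise and difficulty are combined before the logistic function is applied, so the difference `expertise - difficulty` used here is an assumption, and the function and variable names are hypothetical. Labels are taken to be binary, coded as +1 and -1.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_label(crowd_labels, expertise, difficulty):
    """Weighted majority vote for one instance.

    crowd_labels: dict mapping labeler id -> label in {+1, -1}
    expertise:    dict mapping labeler id -> estimated expertise (real number)
    difficulty:   estimated difficulty of this instance (real number)
    The weight sigmoid(expertise - difficulty) is an assumed combination.
    """
    score = sum(
        sigmoid(expertise[labeler] - difficulty) * label
        for labeler, label in crowd_labels.items()
    )
    return 1 if score >= 0 else -1


# One reliable labeler outvotes two unreliable ones.
labels = {"a": 1, "b": -1, "c": -1}
skills = {"a": 3.0, "b": -1.0, "c": -1.0}
final_label = aggregate_label(labels, skills, difficulty=0.0)
```

With these hypothetical numbers a plain majority vote would return -1, while the weighted vote follows the labeler whose estimated expertise is high.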
In the second version of ELICE [Khattak and Salleb-Aouissi, 2013], we introduce entropy as a way to estimate the uncertainty of labeling. This provides an advantage of differentiating between good, random and oppositional/malicious labelers. The aggregation of labels for ELICE version 2 flips the label (for binary classification) provided by the oppositional/malicious labeler thus utilizing the information that is generally discarded by other labeling methodologies.
Both versions of ELICE have a cluster-based variant in which rather than making a random choice of instances from the whole dataset, clusters of data are first formed using any clustering approach e.g., K-means. Then an equal number of instances from each cluster are chosen randomly to get expert-labels. This is done to ensure equal representation of each class in the test dataset.
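The cluster-based selection of instances for expert labeling can be sketched as below. The sketch assumes cluster assignments have already been computed by some clustering method (the abstract mentions K-means as one option); the function name and the toy assignments are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_cluster(cluster_of, k, seed=0):
    """Pick up to k instance indices uniformly at random from each cluster.

    cluster_of: list where cluster_of[i] is the cluster id of instance i
    Returns the sorted indices of instances to send to the expert.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_of):
        members[cluster].append(index)
    chosen = []
    for cluster in sorted(members):
        pool = members[cluster]
        chosen.extend(rng.sample(pool, min(k, len(pool))))
    return sorted(chosen)


# Hypothetical cluster assignments for nine instances in three clusters.
assignments = [0, 0, 0, 1, 1, 1, 1, 2, 2]
expert_set = sample_per_cluster(assignments, k=2)
```

Sampling the same number from every cluster keeps small clusters from being missed, which a uniform draw over the whole dataset cannot guarantee.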
Besides taking advantage of expert-labeled instances, the third version of ELICE [Khattak and Salleb-Aouissi, 2016] incorporates pairwise/circular comparison of labelers to labelers and instances to instances. The idea here is to improve accuracy by using the crowd labels, which, unlike expert labels, are available for the whole dataset and may provide a more comprehensive view of labeler ability and instance difficulty. This is especially helpful when the domain experts do not agree on one label and the ground truth is not known for certain. Therefore, incorporating more information beyond expert labels can provide better results.
We test the performance of ELICE on simulated labels as well as real labels obtained from Amazon Mechanical Turk. Results show that ELICE is effective compared to state-of-the-art methods. All versions and variants of ELICE are capable of delaying phase transition. The main contribution of ELICE is that it makes use of all the information available from the crowd and the experts. We also present a theoretical framework to estimate the number of expert-labeled instances needed to achieve a certain labeling accuracy. Experiments are presented to demonstrate the utility of the theoretical bound.
In the second part of this thesis, we present Crowd Labeling Using Bayesian Statistics (CLUBS) [Khattak and Salleb-Aouissi, 2015; Khattak et al., 2016b; Khattak et al., 2016a], a new approach for crowd labeling to estimate labeler and instance parameters along with label aggregation. Our approach is inspired by Item Response Theory (IRT). We introduce new parameters and refine the existing IRT parameters to fit the crowd labeling scenario. The main challenge is that unlike IRT, in the crowd labeling case, the ground truth is not known and has to be estimated based on the parameters. To overcome this challenge, we acquire expert-labels for a small fraction of instances in the dataset. Our model estimates the parameters based on the expert-labeled instances. The estimated parameters are used for weighted aggregation of crowd labels for the rest of the dataset. Experiments conducted on synthetic data and real datasets with heterogeneous quality crowd-labels show that our methods perform better than many state-of-the-art crowd labeling methods.
We also conduct significance tests between our methods and other state-of-the-art methods to check the significance of the accuracy of these methods. The results show the superiority of our method in most cases. Moreover, we present experiments to demonstrate the impact of the accuracy of final aggregated labels when used as training data. The results essentially emphasize the need for high accuracy of the aggregated labels.
In the last part of the thesis, we present past and contemporary research related to crowd labeling. We conclude with the future of crowd labeling and further research directions. To summarize, in this thesis, we have investigated different methods for estimating crowd labeling parameters and using them for label aggregation. We hope that our contribution will be useful to the crowd labeling community.
|
45 |
Detecting short adjacent repeats in multiple sequences: a Bayesian approach. / January 2010
Li, Qiwei. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (p. 75-85). / Abstracts in English and Chinese. / Abstract --- p.i / Acknowledgement --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Repetitive DNA Sequence --- p.3 / Chapter 1.1.1 --- Definition and Categorization of Repetitive DNA Sequence --- p.3 / Chapter 1.1.2 --- Definition and Categorization of Tandem Repeats --- p.4 / Chapter 1.1.3 --- Definition and Categorization of Interspersed Repeats --- p.6 / Chapter 1.2 --- Research Significance --- p.7 / Chapter 1.3 --- Contributions --- p.9 / Chapter 1.4 --- Thesis Organization --- p.11 / Chapter 2 --- Literature Review and Overview of Our Method --- p.13 / Chapter 2.1 --- Existing Methods --- p.14 / Chapter 2.2 --- Overview of Our Method --- p.17 / Chapter 3 --- Theoretical Background --- p.22 / Chapter 3.1 --- Multinomial Distributions --- p.23 / Chapter 3.2 --- Dirichlet Distribution --- p.23 / Chapter 3.3 --- Metropolis-Hastings Sampling --- p.25 / Chapter 3.4 --- Gibbs Sampling --- p.26 / Chapter 4 --- Problem Description --- p.28 / Chapter 4.1 --- Generative Model --- p.29 / Chapter 4.1.1 --- Input Data R --- p.31 / Chapter 4.1.2 --- Parameters A (Repeat Segment Starting Positions) --- p.32 / Chapter 4.1.3 --- Parameters S (Repeat Segment Structures) --- p.33 / Chapter 4.1.4 --- Parameters θ (Motif Matrix) --- p.35 / Chapter 4.1.5 --- Parameters Φ (Background Distribution) --- p.36 / Chapter 4.1.6 --- An Example of the Model Schematic Diagram --- p.37 / Chapter 4.2 --- Parameter Structure --- p.38 / Chapter 4.3 --- Posterior Distribution --- p.40 / Chapter 4.3.1 --- The Full Posterior Distribution --- p.41 / Chapter 4.3.2 --- The Collapsed Posterior Distribution --- p.42 / Chapter 4.4 --- Conclusion --- p.43 / Chapter 5 --- Methodology --- p.45 / Chapter 5.1 --- Schematic Procedure --- p.46 / Chapter 5.1.1 --- The Basic Schematic Procedure --- p.46 / Chapter 5.1.2 --- The Improved Schematic Procedure --- p.47 / Chapter 5.2 --- Initialization --- p.49 / Chapter 5.3 --- Predictive Update Step for θn and Φn --- p.50 / Chapter 5.4 --- Gibbs Sampling Step for an --- p.50 / Chapter 5.5 --- Metropolis-Hastings Sampling Step for sn --- p.51 / Chapter 5.5.1 --- Rear Indel Move --- p.53 / Chapter 5.5.2 --- Partial Shift Move --- p.56 / Chapter 5.5.3 --- Front Indel Move --- p.56 / Chapter 5.6 --- Phase Shifts --- p.57 / Chapter 5.7 --- Conclusion --- p.58 / Chapter 6 --- Results and Discussion --- p.60 / Chapter 6.1 --- Settings --- p.61 / Chapter 6.2 --- Experiment on Synthetic Data --- p.63 / Chapter 6.3 --- Experiment on Real Data --- p.69 / Chapter 7 --- Conclusion and Future Work --- p.72 / Chapter 7.1 --- Conclusion --- p.72 / Chapter 7.2 --- Future Work --- p.74 / Bibliography --- p.75
|
46 |
Bayesian inference of point-source waves based on a set of independent noisy detectors / CUHK electronic theses & dissertations collection / January 2015
Waves are everywhere. Biological waves, such as gastric slow waves, and electromagnetic waves, such as TV signals and radio waves, are typical examples that we encounter in everyday life. Many waves are emitted from a point source, and their wavefront can be approximated by a line if the point source is far away. When an experimenter records a propagating wave, the data are subject to noise contamination, posing great difficulty in wave analysis. In this thesis, we consider the situation where at most one wave propagates in a two-dimensional space at any particular time and the detector recordings are noisy. We introduce two parametric generative models for wave propagation and one parametric model for noise generation, and develop a multistage procedure that identifies the number of waves in a given data set, followed by inference, under the Bayesian paradigm, on important variables, including the location of the point source, the velocity of the wave, and indicator variables of spikes. The procedure is illustrated with two real-life examples. The first is a study of the effect of potassium ion channels using cultured heart cells. The other concerns the propagation characteristics of the Tohoku Tsunami in 2011. / 波是無處不在的。生物波如胃慢波,以及電磁波如電視信號和無線電波,都是我們在日常生活中常遇到的波的典型例子。許多波都是點源,而當波從一個遠的點源發射, 其波陣面會近似一條直線。當實驗者記錄波數據時,數據很大機會受到雜訊污染,增加了分析波數據的難度。本文考慮在一個二維空間內,任何特定的時間中,最多只有一個波在傳播,而波數據受到雜訊污染。我們提出了兩個參數模型模擬波的產生和傳播,以及一個參數模型模擬雜訊的產生。我們並建立了一個多階段程序先識別數據中波的數量,然後根據貝葉斯理論,將尖峰訊號分類成波尖峰訊號或雜訊尖峰訊號,以及對波尖峰訊號的重要參數,包括點源的位置和波的速度進行估算。本文提出的方法將應用於兩組真實數據上。第一組是關於細胞鉀離子通道如何影響心肌培養細胞研究,而另一組則分析2011年日本東北海嘯的傳播特性。 / Lau, Yuk Fai. / Thesis M.Phil. Chinese University of Hong Kong 2015. / Includes bibliographical references (leaves 71-74). / Abstracts also in Chinese. / Title from PDF title page (viewed on 18, October, 2016). / Detailed summary in vernacular field only.
|
47 |
Properties of the maximum likelihood and Bayesian estimators of availability / Kuo, Way / January 2011
Typescript (photocopy). / Digitized by Kansas Correctional Industries
|
48 |
Misclassification of the dependent variable in binary choice models / Gu, Yuanyuan, Economics, Australian School of Business, UNSW / January 2006
Survey data are often subject to a number of measurement errors. The measurement error associated with a multinomial variable is called a misclassification error. In this dissertation we study such errors when the outcome is binary. It is known that ignoring such misclassification errors may affect the parameter estimates; see, for example, Hausman, Abrevaya and Scott-Morton (1998). However, previous studies showed that robust estimation of the parameters is achievable if we take misclassification into account. There have been many attempts to do so in the literature, and the major problem in implementing them is to avoid poor or fragile identifiability of the misclassification probabilities. Generally we restrict these parameters by imposing prior information on them. Such prior constraints on the parameters are simple to impose within a Bayesian framework. Hence we consider a Bayesian logistic regression model that takes into account the misclassification of the dependent variable. A very convenient way to implement such a Bayesian analysis is to estimate the hierarchical model using the WinBUGS software package developed by the MRC biostatistics group, Institute of Public Health, at Cambridge University. WinBUGS allows us to estimate the posterior distributions of all the parameters using relatively little programming, and once the program is written it is trivial to change the link function, for example from logit to probit. If we wish to have more control over the sampling scheme or to deal with more complex models, then we propose a data augmentation approach using the Metropolis-Hastings algorithm within a Gibbs sampling framework. The sampling scheme can be made more efficient by using a one-step Newton-Raphson algorithm to form the Metropolis-Hastings proposal.
Results from empirically analyzing real data and from the simulation studies suggest that if suitable priors are specified for the misclassification parameters and the regression parameters, then logistic regression allowing for misclassification results in better estimators than the estimators that do not take misclassification into account.
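To make the misclassification setup concrete, here is a minimal sketch of the observed-outcome probability and log-likelihood for a logistic regression with a misclassified binary response, in the form used by Hausman, Abrevaya and Scott-Morton (1998): P(observed y = 1 | x) = α0 + (1 − α0 − α1)·F(x'β). The abstract does not state its likelihood, so treating it as this form is an assumption; the names `alpha0` (probability a true 0 is recorded as 1) and `alpha1` (probability a true 1 is recorded as 0) are illustrative, and no priors or sampler are included.

```python
import math


def observed_prob(x, beta, alpha0, alpha1):
    """P(observed y = 1 | x) when the true outcome follows a logit model
    and is misrecorded with probabilities alpha0 (0 -> 1) and alpha1 (1 -> 0)."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    p_true = 1.0 / (1.0 + math.exp(-eta))
    return alpha0 + (1.0 - alpha0 - alpha1) * p_true


def log_likelihood(X, y, beta, alpha0, alpha1):
    """Log-likelihood of the observed (possibly misclassified) outcomes."""
    total = 0.0
    for x, yi in zip(X, y):
        p = observed_prob(x, beta, alpha0, alpha1)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


# Hypothetical example: intercept-only model with beta = 0.
p_obs = observed_prob([1.0], [0.0], alpha0=0.1, alpha1=0.2)
```

Setting both misclassification probabilities to zero recovers the ordinary logistic likelihood, which is the model the abstract compares against.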
|
49 |
Bayesian estimation of decomposable Gaussian graphical models / Armstrong, Helen, School of Mathematics, UNSW / January 2005
This thesis explains to statisticians what graphical models are and how to use them for statistical inference; in particular, how to use decomposable graphical models for efficient inference in covariance selection and multivariate regression problems. The first aim of the thesis is to show that decomposable graphical models are worth using within a Bayesian framework. The second aim is to make the techniques of graphical models fully accessible to statisticians. To achieve these aims the thesis makes a number of statistical contributions. First, it proposes a new prior for decomposable graphs and a simulation methodology for estimating this prior. Second, it proposes a number of Markov chain Monte Carlo sampling schemes based on graphical techniques. The thesis also presents some new graphical results, and some existing results are reproved to make them more readily understood. Appendix 8.1 contains all the programs written to carry out the inference discussed in the thesis, together with both a summary of the theory on which they are based and a line by line description of how each routine works.
|
50 |
Assessment tool for nuclear material acquisition pathways / Ford, David Grant / 15 May 2009
An assessment methodology has been developed at Texas A&M University for predicting weapons-usable material acquisition by a terrorist organization or rogue state based on an acquisition network simulation. The network has been designed to include all of the materials, facilities, and expertise (each of which is represented by a unique node) that must be obtained to acquire Special Nuclear Material (SNM). Using various historical cases and open source expert opinion, the resources required to successfully reach the goal of every node within the network were determined. A visual representation of the network was created within Microsoft Visio, and Visual Basic for Applications (VBA) is used to analyze the network. This tool can be used to predict the most likely pathway(s) that a predefined organization would take in attempting to acquire SNM. The methodology uses the resources available to the organization, along with any of the nodes to which the organization may already have access, to determine which path the organization is most likely to attempt.
Using this resource-based decision model, various sample simulations were run to exercise the program. The results of these simulations were in accordance with what was expected for the resources allocated to the organization being modeled. The program was shown to be capable of taking many complex resource considerations into account and modeling them accurately.
|