Global ETD Search

31	Toward a Robust and Universal Crowd Labeling Framework Khattak, Faiza Khan January 2017 (has links) The advent of fast and economical computers with large electronic storage has led to a large volume of data, most of which is unlabeled. While computers provide expeditious, accurate and low-cost computation, they still lag behind in many tasks that require human intelligence such as labeling medical images, videos or text. Consequently, current research focuses on a combination of computer accuracy and human intelligence to complete labeling tasks. In most cases labeling needs to be done by domain experts; however, because of the variability in expertise, experience, and intelligence of human beings, experts can be scarce. As an alternative to using domain experts, help is sought from non-experts, also known as Crowd, to complete tasks that cannot be readily automated. Since crowd labelers are non-expert, multiple labels per instance are acquired for quality purposes. The final label is obtained by combining these multiple labels. It is very common that the ground truth, instance difficulty, and the labeler ability are unknown entities. Therefore, the aggregation task becomes a “chicken and egg” problem to start with. Despite the fact that much research using machine learning and statistical techniques has been conducted in this area (e.g., [Dekel and Shamir, 2009; Hovy et al., 2013a; Liu et al., 2012; Donmez and Carbonell, 2008]), many questions remain unresolved. These include: (a) What are the best ways to evaluate labelers? (b) It is common to use expert-labeled instances (ground truth) to evaluate labeler ability (e.g., [Le et al., 2010; Khattak and Salleb-Aouissi, 2011; Khattak and Salleb-Aouissi, 2012; Khattak and Salleb-Aouissi, 2013]). The question is, what should be the cardinality of the set of expert-labeled instances to have an accurate evaluation? 
(c) Which factors other than labeler expertise (e.g., difficulty of instance, prevalence of class, bias of a labeler toward a particular class) can affect the labeling accuracy? (d) Is there any optimal way to combine multiple labels to get the best labeling accuracy? (e) Should the labels provided by oppositional/malicious labelers be discarded and blocked? Or is there a way to use the “information” provided by oppositional/malicious labelers? (f) How can labelers and instances be evaluated if the ground truth is not known with certitude? In this thesis, we investigate these questions. We present methods that rely on few expert-labeled instances (usually 0.1%-10% of the dataset) to evaluate various parameters using a frequentist and a Bayesian approach. The estimated parameters are then used for label aggregation to produce one final label per instance. In the first part of this thesis, we propose a method called Expert Label Injected Crowd Estimation (ELICE) and extend it to different versions and variants. ELICE is based on a frequentist approach for estimating the underlying parameters. The first version of ELICE estimates the parameters, i.e., labeler expertise and data instance difficulty, using the accuracy of crowd labelers on expert-labeled instances [Khattak and Salleb-Aouissi, 2011; Khattak and Salleb-Aouissi, 2012]. The multiple labels for each instance are combined using weighted majority voting. These weights are the scores of labeler reliability on any given instance, which are obtained by inputting the parameters in the logistic function. In the second version of ELICE [Khattak and Salleb-Aouissi, 2013], we introduce entropy as a way to estimate the uncertainty of labeling. This provides an advantage of differentiating between good, random and oppositional/malicious labelers. 
The aggregation of labels for ELICE version 2 flips the label (for binary classification) provided by the oppositional/malicious labeler, thus utilizing the information that is generally discarded by other labeling methodologies. Both versions of ELICE have a cluster-based variant in which, rather than making a random choice of instances from the whole dataset, clusters of data are first formed using any clustering approach, e.g., K-means. Then an equal number of instances from each cluster are chosen randomly to get expert-labels. This is done to ensure equal representation of each class in the test dataset. Besides taking advantage of expert-labeled instances, the third version of ELICE [Khattak and Salleb-Aouissi, 2016] incorporates pairwise/circular comparison of labelers to labelers and instances to instances. The idea here is to improve accuracy by using the crowd labels, which, unlike expert-labels, are available for the whole dataset and may provide a more comprehensive view of the labeler ability and instance difficulty. This is especially helpful for the case when the domain experts do not agree on one label and ground truth is not known for certain. Therefore, incorporating more information beyond expert labels can provide better results. We test the performance of ELICE on simulated labels as well as real labels obtained from Amazon Mechanical Turk. Results show that ELICE is effective as compared to state-of-the-art methods. All versions and variants of ELICE are capable of delaying phase transition. The main contribution of ELICE is that it makes use of all possible information available from crowd and experts. Next, we also present a theoretical framework to estimate the number of expert-labeled instances needed to achieve certain labeling accuracy. Experiments are presented to demonstrate the utility of the theoretical bound. 
In the second part of this thesis, we present Crowd Labeling Using Bayesian Statistics (CLUBS) [Khattak and Salleb-Aouissi, 2015; Khattak et al., 2016b; Khattak et al., 2016a], a new approach for crowd labeling to estimate labeler and instance parameters along with label aggregation. Our approach is inspired by Item Response Theory (IRT). We introduce new parameters and refine the existing IRT parameters to fit the crowd labeling scenario. The main challenge is that unlike IRT, in the crowd labeling case, the ground truth is not known and has to be estimated based on the parameters. To overcome this challenge, we acquire expert-labels for a small fraction of instances in the dataset. Our model estimates the parameters based on the expert-labeled instances. The estimated parameters are used for weighted aggregation of crowd labels for the rest of the dataset. Experiments conducted on synthetic data and real datasets with heterogeneous quality crowd-labels show that our methods perform better than many state-of-the-art crowd labeling methods. We also conduct significance tests between our methods and other state-of-the-art methods to check the significance of the accuracy of these methods. The results show the superiority of our method in most cases. Moreover, we present experiments to demonstrate the impact of the accuracy of final aggregated labels when used as training data. The results essentially emphasize the need for high accuracy of the aggregated labels. In the last part of the thesis, we present past and contemporary research related to crowd labeling. We conclude with the future of crowd labeling and further research directions. To summarize, in this thesis, we have investigated different methods for estimating crowd labeling parameters and using them for label aggregation. We hope that our contribution will be useful to the crowd labeling community. Computer science Labels Bayesian statistical decision theory
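The weighted majority voting described for the first version of ELICE in entry 31 can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption: binary labels are coded as +1/-1, labeler expertise is scored as (agreements minus disagreements) divided by the number of expert-labeled instances, and the weight is the logistic function of that score alone (instance difficulty is left out for brevity). The function names are hypothetical and are not taken from the thesis.

```python
import math

def logistic(x):
    """Standard logistic function, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def estimate_expertise(crowd_labels, expert_labels):
    """Score each labeler on the expert-labeled instances only.

    crowd_labels: list of rows, one per instance, each holding one +1/-1
                  label per labeler.
    expert_labels: dict mapping instance index -> expert label (+1/-1).
    Assumed score: (agreements - disagreements) / number of expert labels.
    """
    n_labelers = len(crowd_labels[0])
    scores = []
    for j in range(n_labelers):
        net = sum(1 if crowd_labels[i][j] == y else -1
                  for i, y in expert_labels.items())
        scores.append(net / len(expert_labels))
    return scores

def aggregate(crowd_labels, expertise):
    """Weighted majority vote with logistic(expertise) as labeler weight."""
    weights = [logistic(a) for a in expertise]
    result = []
    for row in crowd_labels:
        total = sum(w * label for w, label in zip(weights, row))
        result.append(1 if total >= 0 else -1)
    return result

# Toy data: labeler 0 is reliable, labelers 1 and 2 mostly disagree with it.
crowd = [
    [1, -1, -1],
    [-1, 1, 1],
    [1, -1, -1],
    [-1, 1, -1],
]
expert = {0: 1, 1: -1}  # expert labels for the first two instances only

expertise = estimate_expertise(crowd, expert)
final_labels = aggregate(crowd, expertise)
print(expertise, final_labels)
```

In this toy case an unweighted majority vote would follow the two unreliable labelers on the first three instances, whereas the weighted vote follows the labeler who agreed with the experts.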
32	Detecting short adjacent repeats in multiple sequences: a Bayesian approach. January 2010 (has links) Li, Qiwei. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (p. 75-85). / Abstracts in English and Chinese. / Abstract --- p.i / Acknowledgement --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Repetitive DNA Sequence --- p.3 / Chapter 1.1.1 --- Definition and Categorization of Repetitive DNA Sequence --- p.3 / Chapter 1.1.2 --- Definition and Categorization of Tandem Repeats --- p.4 / Chapter 1.1.3 --- Definition and Categorization of Interspersed Repeats --- p.6 / Chapter 1.2 --- Research Significance --- p.7 / Chapter 1.3 --- Contributions --- p.9 / Chapter 1.4 --- Thesis Organization --- p.11 / Chapter 2 --- Literature Review and Overview of Our Method --- p.13 / Chapter 2.1 --- Existing Methods --- p.14 / Chapter 2.2 --- Overview of Our Method --- p.17 / Chapter 3 --- Theoretical Background --- p.22 / Chapter 3.1 --- Multinomial Distributions --- p.23 / Chapter 3.2 --- Dirichlet Distribution --- p.23 / Chapter 3.3 --- Metropolis-Hastings Sampling --- p.25 / Chapter 3.4 --- Gibbs Sampling --- p.26 / Chapter 4 --- Problem Description --- p.28 / Chapter 4.1 --- Generative Model --- p.29 / Chapter 4.1.1 --- Input Data R --- p.31 / Chapter 4.1.2 --- Parameters A (Repeat Segment Starting Positions) --- p.32 / Chapter 4.1.3 --- Parameters S (Repeat Segment Structures) --- p.33 / Chapter 4.1.4 --- Parameters θ (Motif Matrix) --- p.35 / Chapter 4.1.5 --- Parameters Φ (Background Distribution) --- p.36 / Chapter 4.1.6 --- An Example of the Model Schematic Diagram --- p.37 / Chapter 4.2 --- Parameter Structure --- p.38 / Chapter 4.3 --- Posterior Distribution --- p.40 / Chapter 4.3.1 --- The Full Posterior Distribution --- p.41 / Chapter 4.3.2 --- The Collapsed Posterior Distribution --- p.42 / Chapter 4.4 --- Conclusion --- p.43 / Chapter 5 --- Methodology --- p.45 / Chapter 5.1 --- Schematic Procedure --- p.46 / Chapter 5.1.1 --- The Basic Schematic Procedure --- p.46 / Chapter 5.1.2 --- The Improved Schematic Procedure --- p.47 / Chapter 5.2 --- Initialization --- p.49 / Chapter 5.3 --- Predictive Update Step for θn and Φn --- p.50 / Chapter 5.4 --- Gibbs Sampling Step for an --- p.50 / Chapter 5.5 --- Metropolis-Hastings Sampling Step for sn --- p.51 / Chapter 5.5.1 --- Rear Indel Move --- p.53 / Chapter 5.5.2 --- Partial Shift Move --- p.56 / Chapter 5.5.3 --- Front Indel Move --- p.56 / Chapter 5.6 --- Phase Shifts --- p.57 / Chapter 5.7 --- Conclusion --- p.58 / Chapter 6 --- Results and Discussion --- p.60 / Chapter 6.1 --- Settings --- p.61 / Chapter 6.2 --- Experiment on Synthetic Data --- p.63 / Chapter 6.3 --- Experiment on Real Data --- p.69 / Chapter 7 --- Conclusion and Future Work --- p.72 / Chapter 7.1 --- Conclusion --- p.72 / Chapter 7.2 --- Future Work --- p.74 / Bibliography --- p.75 Sequences (Mathematics) Bayesian statistical decision theory
33	Bayesian inference of point-source waves based on a set of independent noisy detectors / CUHK electronic theses & dissertations collection January 2015 (has links) Waves are everywhere. Biological waves, such as gastric slow waves, and electromagnetic waves, such as TV signals and radio waves, are typical examples that we encounter in everyday life. Many waves are emitted from a point source, whose wavefront can be approximated by a line if the point source is far away. When an experimenter records a propagating wave, the data is subject to noise contamination, posing great difficulty in wave analysis. In this thesis, we consider the situation where at most one wave propagates in a two-dimensional space at any particular time and the detector recordings are noisy. We introduce two parametric generative models for wave propagation and one parametric model for noise generation, and develop a multistage procedure which identifies the number of waves in a given data set, followed by an inference on important variables, including the location of the point source, the velocity of the wave and indicator variables of spikes under the Bayesian paradigm. The procedure is illustrated with two real-life examples. The first one is a study on the effect of potassium ion channels using cultured heart cells. The other is on the propagation characteristics of the Tohoku Tsunami in 2011. / Waves are everywhere. Biological waves such as gastric slow waves, and electromagnetic waves such as television signals and radio waves, are typical examples of the waves we often encounter in daily life. Many waves come from a point source, and when a wave is emitted from a distant point source, its wavefront approximates a straight line. When an experimenter records wave data, the data are very likely to be contaminated by noise, which makes the wave data harder to analyze. This thesis considers the case where, in a two-dimensional space, at most one wave propagates at any particular time and the wave data are contaminated by noise. We propose two parametric models for the generation and propagation of waves, and one parametric model for the generation of noise. We also establish a multistage procedure that first identifies the number of waves in the data and then, following Bayesian theory, classifies spikes as wave spikes or noise spikes and estimates the important parameters of the wave spikes, including the location of the point source and the velocity of the wave. The proposed method is applied to two sets of real data. The first concerns how cellular potassium ion channels affect cultured cardiac muscle cells, and the other analyzes the propagation characteristics of the 2011 tsunami in northeastern Japan. / Lau, Yuk Fai. / Thesis M.Phil. Chinese University of Hong Kong 2015. / Includes bibliographical references (leaves 71-74). / Abstracts also in Chinese. / Title from PDF title page (viewed on 18, October, 2016). 
/ Detailed summary in vernacular field only. Bayesian statistical decision theory QA279.5 .L386 2015
34	Properties of the maximum likelihood and Bayesian estimators of availability Kuo, Way January 2011 (has links) Typescript (photocopy). / Digitized by Kansas Correctional Industries Probabilities Bayesian statistical decision theory Statistics
35	Misclassification of the dependent variable in binary choice models Gu, Yuanyuan, Economics, Australian School of Business, UNSW January 2006 (has links) Survey data are often subject to a number of measurement errors. The measurement error associated with a multinomial variable is called a misclassification error. In this dissertation we study such errors when the outcome is binary. It is known that ignoring such misclassification errors may affect the parameter estimates; see, for example, Hausman, Abrevaya and Scott-Morton (1998). However, previous studies showed that robust estimation of the parameters is achievable if we take misclassification into account. There are many attempts to do so in the literature, and the major problem in implementing them is to avoid poor or fragile identifiability of the misclassification probabilities. Generally we restrict these parameters by imposing prior information on them. Such prior constraints on the parameters are simple to impose within a Bayesian framework. Hence we consider a Bayesian logistic regression model that takes into account the misclassification of the dependent variable. A very convenient way to implement such a Bayesian analysis is to estimate the hierarchical model using the WinBUGS software package developed by the MRC biostatistics group, Institute of Public Health, at Cambridge University. WinBUGS allows us to estimate the posterior distributions of all the parameters using relatively little programming, and once the program is written it is trivial to change the link function, for example from logit to probit. If we wish to have more control over the sampling scheme or to deal with more complex models, then we propose a data augmentation approach using the Metropolis-Hastings algorithm within a Gibbs sampling framework. The sampling scheme can be made more efficient by using a one-step Newton-Raphson algorithm to form the Metropolis-Hastings proposal. 
Results from empirically analyzing real data and from the simulation studies suggest that if suitable priors are specified for the misclassification parameters and the regression parameters, then logistic regression allowing for misclassification results in better estimators than the estimators that do not take misclassification into account. Error analysis (Mathematics) Bayesian statistical decision theory
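To make the idea in entry 35 concrete, here is a minimal sketch of a logistic regression likelihood that allows for misclassification of a binary outcome. It is an illustration under stated assumptions, not the dissertation's implementation: the symbols `alpha0` (probability that a true 0 is recorded as 1) and `alpha1` (probability that a true 1 is recorded as 0) and all function names are hypothetical, and no priors or sampling scheme are included.

```python
import math

def true_prob(x, beta):
    """Logit probability that the true (unobserved) outcome is 1."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def observed_prob(x, beta, alpha0, alpha1):
    """Probability that the recorded outcome is 1.

    alpha0: assumed P(recorded 1 | true 0)
    alpha1: assumed P(recorded 0 | true 1)
    """
    p = true_prob(x, beta)
    return alpha0 * (1.0 - p) + (1.0 - alpha1) * p

def log_likelihood(X, y_obs, beta, alpha0, alpha1):
    """Log-likelihood of the recorded 0/1 outcomes under the model above."""
    total = 0.0
    for x, y in zip(X, y_obs):
        q = observed_prob(x, beta, alpha0, alpha1)
        total += math.log(q) if y == 1 else math.log(1.0 - q)
    return total

# With x = [0.0] the true probability is 0.5 regardless of beta.
example = observed_prob([0.0], [1.0], alpha0=0.1, alpha1=0.2)
print(example)
```

Setting both misclassification probabilities to zero recovers ordinary logistic regression, which is one way to see why the extra parameters need prior information: the data alone can struggle to separate them from the regression coefficients.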
36	Bayesian estimation of decomposable Gaussian graphical models Armstrong, Helen, School of Mathematics, UNSW January 2005 (has links) This thesis explains to statisticians what graphical models are and how to use them for statistical inference; in particular, how to use decomposable graphical models for efficient inference in covariance selection and multivariate regression problems. The first aim of the thesis is to show that decomposable graphical models are worth using within a Bayesian framework. The second aim is to make the techniques of graphical models fully accessible to statisticians. To achieve these aims the thesis makes a number of statistical contributions. First, it proposes a new prior for decomposable graphs and a simulation methodology for estimating this prior. Second, it proposes a number of Markov chain Monte Carlo sampling schemes based on graphical techniques. The thesis also presents some new graphical results, and some existing results are reproved to make them more readily understood. Appendix 8.1 contains all the programs written to carry out the inference discussed in the thesis, together with both a summary of the theory on which they are based and a line by line description of how each routine works.
37	Bayesian analysis for avian nest survival models / Tra, Yolande Vololonirina, January 2000 (has links) Thesis (Ph. D.)--University of Missouri-Columbia, 2000. / Typescript. Vita. Includes bibliographical references (leaves 85-88). Also available on the Internet.
38	On a subjective modelling of VaR: a Bayesian approach / Siu, Wai-shing. January 2001 (has links) Thesis (M. Phil.)--University of Hong Kong, 2001. / Includes bibliographical references (leaves 74-80).
39	Bayesian analysis for avian nest survival models Tra, Yolande Vololonirina, January 2000 (has links) Thesis (Ph. D.)--University of Missouri-Columbia, 2000. / Typescript. Vita. Includes bibliographical references (leaves 85-88). Also available on the Internet.
40	Bayesian inference for models with monotone densities and hazard rates / Ho, Man Wai. January 2002 (has links) Thesis (Ph. D.)--Hong Kong University of Science and Technology, 2002. / Includes bibliographical references (leaves 110-114). Also available in electronic version. Access restricted to campus users.
