Global ETD Search

151	Hypothesis testing and feature selection in semi-supervised data Sechidis, Konstantinos January 2015 A characteristic of most real-world problems is that collecting unlabelled examples is easier and cheaper than collecting labelled ones. As a result, learning from partially labelled data is a crucial and demanding area of machine learning, and extending techniques from fully to partially supervised scenarios is a challenging problem. Our work focuses on two types of partially labelled data that can occur in binary problems: semi-supervised data, where the labelled set contains both positive and negative examples, and positive-unlabelled data, a more restricted version of partial supervision where the labelled set consists of only positive examples. In both settings, it is very important to explore a large number of features in order to derive useful and interpretable information about our classification task, and select a subset of features that contains most of the useful information. In this thesis, we address three fundamental and tightly coupled questions concerning feature selection in partially labelled data; all three relate to the highly controversial issue of when additional unlabelled data improves performance in partially labelled learning environments and when it does not. The first question is what are the properties of statistical hypothesis testing in such data? Second, given the widespread criticism of significance testing, what can we do in terms of effect size estimation, that is, quantification of how strong the dependency between feature X and the partially observed label Y is? Finally, in the context of feature selection, how well can features be ranked by estimated measures, when the population values are unknown? The answers to these questions provide a comprehensive picture of feature selection in partially labelled data. 
Interesting applications include estimation of mutual information quantities, structure learning in Bayesian networks, and investigation of how human-provided prior knowledge can overcome the restrictions of partial labelling. One direct contribution of our work is to enable valid statistical hypothesis testing and estimation in positive-unlabelled data. Focusing on a generalised likelihood ratio test and on estimating mutual information, we provide five key contributions. (1) We prove that assuming all unlabelled examples are negative cases is sufficient for independence testing, but not for power analysis activities. (2) We suggest a new methodology that compensates for this and enables power analysis, allowing sample size determination for observing an effect with a desired power by incorporating the user’s prior knowledge of the prevalence of positive examples. (3) We show a new capability, supervision determination, which can determine a priori the number of labelled examples the user must collect before being able to observe a desired statistical effect. (4) We derive an estimator of the mutual information in positive-unlabelled data, and its asymptotic distribution. (5) Finally, we show how to rank features with and without prior knowledge. We also derive extensions of these results to semi-supervised data. In another extension, we investigate how we can use our results for Markov blanket discovery in partially labelled data. While there are many different algorithms for deriving the Markov blanket of fully supervised nodes, the partially labelled problem is far more challenging, and there is a lack of principled approaches in the literature. Our work constitutes a generalization of the conditional tests of independence for partially labelled binary target variables, which can handle the two main partially labelled scenarios: positive-unlabelled and semi-supervised. 
The result is a significantly deeper understanding of how to control false negative errors in Markov blanket discovery procedures and how unlabelled data can help. Finally, we present how our results can be used for information theoretic feature selection in partially labelled data. Our work naturally extends feature selection criteria suggested for fully supervised data to partially labelled scenarios. These criteria can capture both the relevancy and redundancy of the features and can be used for semi-supervised and positive-unlabelled data. 519.5
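Contribution (1) of entry 151 can be illustrated with a small sketch. It assumes a single binary feature and a 2x2 table of feature value against a surrogate label that is 1 for labelled positives and 0 for unlabelled examples (treated as negatives). The function name `g_test_2x2` and the toy counts are hypothetical and not taken from the thesis; the G statistic and its chi-square reference distribution with one degree of freedom are the standard ones.

```python
import math

def g_test_2x2(table):
    """G-test of independence for a 2x2 contingency table.

    table[i][j] = number of examples with feature value i and surrogate
    label j, where j = 1 means "labelled positive" and j = 0 means
    "unlabelled" (treated as negative; an assumption of this sketch).
    Returns (G, p_value, mutual_information_in_nats).
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [table[0][j] + table[1][j] for j in range(2)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = row_tot[i] * col_tot[j] / n
            if observed > 0:  # 0 * log(0) is taken as 0
                g += 2.0 * observed * math.log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(g / 2.0))
    # The plug-in mutual information estimate satisfies G = 2 * n * I.
    return g, p_value, g / (2.0 * n)

# Hypothetical counts: rows are feature value 0/1,
# columns are unlabelled / labelled-positive.
g, p, mi = g_test_2x2([[60, 10], [40, 40]])
print(f"G = {g:.3f}, p = {p:.4g}, I(X;S) = {mi:.4f} nats")
```

The sketch covers only the independence test; the power analysis and prior-knowledge correction described in contributions (2) and (3) are not attempted here.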
152	Statistické zpracování dat z reálného výrobního procesu / Statistical analysis of real manufacturing process data Kučerová, Barbora January 2012 The topic of this diploma thesis is statistical control of a manufacturing process. The aim was to analyse data from a real technological process of a revolver-type injection moulding press. The analysis was carried out using statistical hypothesis testing, analysis of variance, the general linear model, and process capability analysis. The data analysis was performed in the statistical software Minitab 16.
153	Employing mHealth Applications for the Self-Assessment of Selected Eye Functions and Prediction of Chronic Major Eye Diseases among the Aging Population Abdualiyeva, Gulnara 24 May 2019 In the epoch of advanced mHealth (mobile health) use in ophthalmology, there is a scientific call for regulating the validity and reliability of eye-related apps. For a positive health outcome that works towards enhancing mobile-application guided diagnosis in joint decision-making between eye specialists and individuals, the aging population should be provided with a reliable and valid tool for assessment of their eye status outside the physician office. This interdisciplinary study aims to determine, through hypothesis testing, the validity and reliability of a limited set of five mHealth apps (mHAs) and, through binary logistic regression, the ability of the investigated apps to exclude the four major eye diseases in this demographic population. The study showed that 189 aging adults (45-86 years old) who completed the mHAs’ tests were able to produce reliable results of selected eye function tests through four out of five mHAs measuring visual acuity, contrast sensitivity, red desaturation, visual field and Amsler grid in comparison with a “gold standard”, the comprehensive eye examination. Also, some of the participants were surveyed to assess the Quality of Experience on mobile apps. Understanding the current reliability of existing eye-related mHAs will lead to the creation of an ideal mobile application self-assessment protocol that predicts the timely need for clinical assessment and treatment of age-related macular degeneration, diabetic retinopathy, glaucoma and cataract. 
Detecting the level of eye function impairment with mHAs is cost-effective and can contribute to research methodology in eye disease prediction by expanding the system of clear criteria created specifically for mobile applications, and can provide significant value in preventive ophthalmology. mHealth applications Reliability Validity Age-related eye diseases Hypothesis testing Binary logistic regression
154	Sdílení investičních nápadu: Rola štěstí a dovednosti / Sharing investment ideas: Role of luck and skill Turlík, Tomáš January 2021 In the environment of a large group of analysts who are willing to share their investment ideas publicly, it is a challenging task to find the ones who have great skill and whose recommendations generate abnormal returns. We explore one such famous group, Value Investors Club, consisting of 1223 analysts between the years 2000 and 2019. We separate the analysts into multiple groups, each representing their inherent abilities. The commonly used method of single hypothesis testing cannot be used as we test many analysts at once, and multiple hypothesis testing methods need to be employed. Using these methods, we are able to detect the subgroup of analysts who have abnormal returns from the Fama-French 4-factor portfolio. However, different methods lead to different groups of analysts deemed to be skilled. An overall portfolio consisting of all analysts generates large abnormal returns, which diminish as the holding period increases. Furthermore, analyses from analysts estimated to be skilled are used to form portfolios. We find that there are methods that have significantly larger abnormal returns compared to the overall portfolio; however, the methods are not consistent at producing such portfolios. Keywords: multiple hypothesis testing, luck and skill, investment ideas Title...
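Entry 154 relies on multiple hypothesis testing, but the abstract does not say which procedures were used. As one common example, assumed here for illustration only, the Benjamini-Hochberg step-up procedure controls the false discovery rate when many analysts are tested at once. The function name and the p-values below are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Step-up rule: sort the p-values, find the largest rank k with
    p_(k) <= k / m * fdr, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical per-analyst p-values for a "no abnormal return" null.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_vals, fdr=0.05))
```

Other procedures (for example family-wise error control) can flag a different set of analysts from the same p-values, which is consistent with the abstract's remark that different methods lead to different groups deemed skilled.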
155	On Analysis of Sufficient Dimension Reduction Models An, Panduan 04 June 2019 No description available. Mathematics Statistics Sufficient dimension reduction central subspace central mean subspace monotonicity variable selection hypothesis testing nonparametric
156	Two-Sample Testing of High-Dimensional Covariance Matrices Sun, Nan, 0000-0003-0278-5254 January 2021 Testing the equality between two high-dimensional covariance matrices is challenging. As the most efficient way to measure evidential discrepancies in observed data, the likelihood ratio test is expected to be powerful when the null hypothesis is violated. However, when the data dimensionality becomes large and potentially exceeds the sample size by a substantial margin, likelihood ratio based approaches face practical and theoretical challenges. To solve this problem, this study proposes a method by which we first randomly project the original high-dimensional data into a lower-dimensional space, and then apply the corrected likelihood ratio tests developed with random matrix theory. We show that testing with a single random projection is consistent under the null hypothesis. Through evaluating the power function, which is challenging in this context, we provide evidence that the test with a single random projection based on a random projection matrix with reasonable column sizes is more powerful when the two covariance matrices are unequal but the component-wise discrepancy could be small, that is, a weak and dense signal setting. To utilize the information in the data more efficiently, we propose combined tests from multiple random projections from the class of meta-analyses. We establish the foundation of the combined tests from our theoretical analysis that the p-values from multiple random projections are asymptotically independent in the high-dimensional covariance matrices testing problem. Then, we show that combined tests from multiple random projections are consistent under the null hypothesis. In addition, our theory presents the merit of certain meta-analysis approaches over testing with a single random projection. 
Numerical evaluation of the power function of the combined tests from multiple random projections is also provided, based on numerical evaluation of the power function of testing with a single random projection. Extensive simulations and two real genetic data analyses confirm the merits and potential applications of our test. / Statistics Statistics Corrected likelihood ratio test Covariance matrix Hypothesis testing Meta analysis Random matrix theory Random projections
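The combined tests in entry 156 pool p-values from several random projections, which the abstract states are asymptotically independent. The abstract does not name the specific meta-analysis rules, so Fisher's method is used below as one well-known member of that class; this choice is an assumption. The corrected likelihood ratio test itself is not implemented, and the p-values are hypothetical inputs.

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function of a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # term equals half**i / i!
        total += term
    return math.exp(-half) * total

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, assuming the k p-values
    are independent."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * k)

# Hypothetical p-values, one per random projection.
stat, p_combined = fisher_combine([0.08, 0.11, 0.06, 0.15, 0.09])
print(f"Fisher statistic = {stat:.3f}, combined p = {p_combined:.4g}")
```

In this toy input no single projection is significant at the 0.05 level, yet the combined p-value falls below it, which is the kind of gain from pooling weak evidence that the abstract attributes to the combined tests.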
157	Spatial Regularization for Analysis of Text and Epidemiological Data MAITI, ANIRUDDHA, 0000-0002-1142-6344 January 2022 Use of spatial data has become an important aspect of data analysis. Use of location information can provide useful insight into the dataset. Advancement of sensor technologies and improved data connectivity have made possible the generation of large amounts of passively generated user location data. Apart from passively generated data from users, explicit effort has been made by commercial vendors to curate large amounts of location related data such as residential histories from a variety of sources such as credit records, litigation data, driving license records etc. Such spatial data, when linked with other datasets, can provide useful insights. In this dissertation, we show that spatial information of data enables us to derive useful insights in domains of text analysis and epidemiology. We investigated primarily two types of data having spatial information: text data with location information and disease related data having residential address information. We show that in the case of text data, spatial information helps us find spatially informative topics. In the case of epidemiological data, we show residential information can be used to identify high risk spatial regions. There are instances where a primary analysis is not sufficient to establish a statistically robust conclusion. For instance, in domains such as epidemiology, a finding is not considered to be relevant unless some statistical significance is established. We proposed techniques for significance tests which can be applied to text analysis, topic modelling, and disease mapping tasks in order to establish significance of the findings. / Computer and Information Science Computer science Hypothesis testing Microblog data Residential history data Spatial epidemiology Spatial text analysis Topic modelling
158	Distributed Inference for Degenerate U-Statistics with Application to One and Two Sample Test Atta-Asiamah, Ernest January 2020 In many hypothesis testing problems such as one-sample and two-sample test problems, the test statistics are degenerate U-statistics. One of the challenges in practice is the computation of U-statistics for a large sample size. Besides, for degenerate U-statistics, the limiting distribution is a mixture of weighted chi-squares, involving the eigenvalues of the kernel of the U-statistics. As a result, it’s not straightforward to construct the rejection region based on this asymptotic distribution. In this research, we aim to reduce the computational complexity of degenerate U-statistics and propose an easy-to-calibrate test statistic by using the divide-and-conquer method. Specifically, we randomly partition the full n data points into k_n disjoint groups of equal size, compute the U-statistic on each group, and combine them by averaging to get a statistic T_n. We prove that the statistic T_n has the standard normal distribution as its limiting distribution. In this way, the running time is reduced from O(n^m) to O(n^m/k_n^m), where m is the order of the one-sample U-statistic. Besides, for a given significance level, it’s easy to construct the rejection region. We apply our method to the goodness-of-fit test and the two-sample test. The simulation and real data analysis show that the proposed test can achieve high power and fast running time for both one- and two-sample tests. degenerate and non degenerate divide-and-conquer goodness-of-fit test hypothesis testing maximum mean discrepancy U-statistics
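The divide-and-conquer construction in entry 158 can be sketched for an order-2 U-statistic. This is a minimal illustration under stated assumptions: the kernel h(x, y) = x * y, the function names, the group count and the simulated data are all hypothetical, and the studentisation step (dividing the average by its estimated standard error across groups) stands in for the thesis's exact scaling, which the abstract does not give.

```python
import itertools
import math
import random
import statistics

def u_statistic(sample, kernel):
    """Order-2 U-statistic: average of kernel(x, y) over all unordered pairs."""
    pairs = list(itertools.combinations(sample, 2))
    return sum(kernel(x, y) for x, y in pairs) / len(pairs)

def group_u_statistics(data, k, kernel, seed=0):
    """Randomly split data into k disjoint groups of equal size and
    compute one U-statistic per group."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // k  # leftover points are dropped in this sketch
    return [u_statistic(shuffled[g * size:(g + 1) * size], kernel)
            for g in range(k)]

def divide_and_conquer_statistic(data, k, kernel, seed=0):
    """Average the group U-statistics and studentise by their spread.

    The studentisation is an assumption of this sketch, not necessarily
    the normalisation used in the thesis.
    """
    values = group_u_statistics(data, k, kernel, seed)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(k))

# The kernel h(x, y) = x * y is degenerate when the data have mean zero.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
t_n = divide_and_conquer_statistic(data, k=40, kernel=lambda x, y: x * y)
print(f"T_n = {t_n:.3f}")
```

Each group here holds 50 points, so the pairwise kernel is evaluated within small groups rather than over all pairs of the 2000 points, and the groups could be processed in parallel.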
159	An efficient framework for hypothesis testing using Topological Data Analysis Pathirana, Hasani Indunil 05 May 2023 No description available. Statistics Topological data analysis Persistent homology Persistence diagram Betti function Hypothesis testing
160	Sensitivity to Distributional Assumptions in Estimation of the ODP Thresholding Function Bunn, Wendy Jill 06 July 2007 Recent technological advances in fields like medicine and genomics have produced high-dimensional data sets and a challenge to correctly interpret experimental results. The Optimal Discovery Procedure (ODP) (Storey 2005) builds on the framework of Neyman-Pearson hypothesis testing to optimally test thousands of hypotheses simultaneously. The method relies on the assumption of normally distributed data; however, many applications of this method will violate this assumption. This thesis investigates the sensitivity of this method to detection of significant but nonnormal data. Overall, estimation of the ODP with the method described in this thesis is satisfactory, except when the nonnormal alternative distribution has high variance and expectation only one standard deviation away from the null distribution. estimation gene expression multiple hypothesis testing multiple comparisons nonnormal optimal discovery procedure statistics Statistics and Probability
