
Higher Criticism Testing for Signal Detection in Rare and Weak Models

examples - we need models for selecting a small subset of useful features from high-dimensional data, where the useful features are both rare and weak; this is crucial for, e.g., supervised classification of sparse high-dimensional data. A preceding step is signal detection, that is, detecting the presence of useful features. This problem is related to testing a very large number of hypotheses, where the proportion of false null hypotheses is assumed to be very small. However, reliable signal detection is only possible in certain areas of the two-dimensional sparsity-strength parameter space, the phase space. In this report, we focus on two families of distributions, N and χ². In the former case, features are assumed to be independent and normally distributed. In the latter, in search of a more sophisticated model, we suppose that features are dependent in blocks, whose empirical separation strength asymptotically follows the non-central χ²_ν-distribution. Our search for informative features explores Tukey's higher criticism (HC), a second-level significance testing procedure that compares the fraction of observed significances to the expected fraction under the global null. Throughout the phase space we investigate the estimated error rate, Err = (#Falsely rejected H0 + #Falsely rejected H1)/#Simulations, where H0: absence of informative signals, and H1: presence of informative signals, in both the N-case and the χ²_ν-case, for ν = 2, 10, 30. In particular, we find, using a feature vector of approximately the same size as in genomic applications, that the analytically derived detection boundary is too optimistic, in the sense that close to it signal detection still fails, and we need to move far from the boundary into the success region to ensure reliable detection. We demonstrate that Err grows fast and irregularly as we approach the detection boundary from the success region.
In the χ²_ν-case, ν > 2, no analytical detection boundary has been derived, but we show that the empirical success region there is smaller than in the N-case, especially as ν increases.
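
The HC test and the Err estimate described above can be illustrated with a minimal sketch for the N-case. It is not the thesis code. The following are assumptions introduced here: the parameterization ε = n^(-β) for the fraction of signals and μ = sqrt(2 r log n) for their strength (the one commonly used for the sparsity-strength phase space), one-sided p-values, a critical value calibrated by simulating the global null, and the reading of Err as the sum of the two estimated error rates. The function names and default values are hypothetical.

```python
import math
import random


def hc_statistic(p_values, alpha0=0.5):
    """Higher criticism: largest standardized excess of small p-values."""
    n = len(p_values)
    p_sorted = sorted(p_values)
    best = -math.inf
    for i in range(1, int(alpha0 * n) + 1):
        # Clip to avoid division by zero for extreme p-values.
        p = min(max(p_sorted[i - 1], 1e-300), 1.0 - 1e-16)
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


def simulate_hc(n, eps, mu, rng):
    """Draw n independent features and return the HC statistic.

    Each feature is N(mu, 1) with probability eps, otherwise N(0, 1).
    """
    p_values = []
    for _ in range(n):
        mean = mu if rng.random() < eps else 0.0
        x = rng.gauss(mean, 1.0)
        # One-sided p-value P(N(0,1) > x).
        p_values.append(0.5 * math.erfc(x / math.sqrt(2.0)))
    return hc_statistic(p_values)


def estimate_err(n, beta, r, n_sim=100, level=0.05, seed=0):
    """Estimate Err at the phase-space point (beta, r).

    Assumption: n_sim runs under each hypothesis, and
    Err = (false rejections of H0 + false rejections of H1) / n_sim.
    """
    rng = random.Random(seed)
    eps = n ** (-beta)                       # assumed sparsity parameterization
    mu = math.sqrt(2.0 * r * math.log(n))    # assumed strength parameterization

    # Calibrate the critical value on separate runs under the global null.
    calib = sorted(simulate_hc(n, 0.0, 0.0, rng) for _ in range(n_sim))
    idx = min(n_sim - 1, math.ceil((1.0 - level) * n_sim) - 1)
    threshold = calib[idx]

    false_rej_h0 = sum(simulate_hc(n, 0.0, 0.0, rng) > threshold
                       for _ in range(n_sim))
    false_rej_h1 = sum(simulate_hc(n, eps, mu, rng) <= threshold
                       for _ in range(n_sim))
    return (false_rej_h0 + false_rej_h1) / n_sim


if __name__ == "__main__":
    # Small illustrative run; n is far below genomic-scale feature counts.
    print(estimate_err(n=500, beta=0.6, r=0.5, n_sim=50))
```

Evaluating `estimate_err` on a grid of (β, r) values gives an empirical picture of the phase space. Under the reading of Err used here, values near 0 indicate reliable detection and values near 1 indicate that the test does no better than chance.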

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-103284
Date: January 2012
Creators: Blomberg, Niclas
Publisher: KTH, Matematisk statistik
Source Sets: DiVA Archive at Uppsala University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
Relation: Trita-MAT, 1401-2286 ; 25
