Spelling suggestions: "subject:"permutation testing"" "subject:"ermutation testing""
1 |
Permutation Tests for ClassificationMukherjee, Sayan, Golland, Polina, Panchenko, Dmitry 28 August 2003 (has links)
We introduce and explore an approach to estimating statistical significance of classification accuracy, which is particularly useful in scientific applications of machine learning where high dimensionality of the data and the small number of training examples render most standard convergence bounds too loose to yield a meaningful guarantee of the generalization ability of the classifier. Instead, we estimate statistical significance of the observed classification accuracy, or the likelihood of observing such accuracy by chance due to spurious correlations of the high-dimensional data patterns with the class labels in the given training set. We adopt permutation testing, a non-parametric technique previously developed in classical statistics for hypothesis testing in the generative setting (i.e., comparing two probability distributions). We demonstrate the method on real examples from neuroimaging studies and DNA microarray analysis and suggest a theoretical analysis of the procedure that relates the asymptotic behavior of the test to the existing convergence bounds.
|
2 |
Permutation Tests for ClassificationMukherjee, Sayan, Golland, Polina, Panchenko, Dmitry 28 August 2003 (has links)
We introduce and explore an approach to estimating statisticalsignificance of classification accuracy, which is particularly usefulin scientific applications of machine learning where highdimensionality of the data and the small number of training examplesrender most standard convergence bounds too loose to yield ameaningful guarantee of the generalization ability of theclassifier. Instead, we estimate statistical significance of theobserved classification accuracy, or the likelihood of observing suchaccuracy by chance due to spurious correlations of thehigh-dimensional data patterns with the class labels in the giventraining set. We adopt permutation testing, a non-parametric techniquepreviously developed in classical statistics for hypothesis testing inthe generative setting (i.e., comparing two probabilitydistributions). We demonstrate the method on real examples fromneuroimaging studies and DNA microarray analysis and suggest atheoretical analysis of the procedure that relates the asymptoticbehavior of the test to the existing convergence bounds.
|
3 |
Differentiation between "Bomb" and Ordinary U.S. East Coast Cyclogenesis using Principal Component Analysis and K-means Cluster AnalysisThomas, Evan Edward 12 May 2012 (has links)
The purpose of this research is to identify whether synoptic patterns and variables were statistically significantly different between East Coast United States track bomb and ordinary cyclogenesis. The differentiation of East Coast track bomb and ordinary cyclogenesis was completed through the utility of the principal component analysis, a K-means cluster analysis, a subjective composite analysis, and permutation tests. The principal component analysis determined that there were three leading modes of variability within the bomb and ordinary composites. The K-means cluster analysis was used to cluster these leading patterns of variability into three distinct clusters for the bomb and ordinary cyclones. The subjective composite analysis, created by averaging all the variables from each cyclone in each cluster, identified several synoptic variables and patterns to be objectively compared through permutation tests. The permutation tests revealed that synoptic variables and patterns associated with bomb cyclogenesis statistically significantly differ from ordinary cyclogenesis.
|
4 |
Characterizing the Statistical Properties and Global Distribution of Dansgaard-Oeschger EventsThomas, Andrea Michelle 04 March 2009 (has links) (PDF)
Ice core records from Greenland have shown times of rapid warming during the most recent glacial period, called Dansgaard-Oeschger (D-O) events. D-O events are important to our understanding of both past climate systems and modern climate volatility. In this paper, we present new approaches for statistically evaluating the existence of cyclicity in D-O events and the possible lagged correlation between the Greenland and Antarctica temperature records. Specifically, we consider permutation testing and bootstrapping methodologies for assessing the cyclicity of D-O events and the correlation between the Greenland and Antarctica records. We find that there is not enough evidence to conclude that D-O events are cyclical; however, the Antarctica record leads the Greenland record by 545 years with a statistically significant correlation of 0.455.
|
5 |
Finding a Targeted Subgroup with Efficacy for BinaryResponse with Application for Drug DevelopmentKil, Siyoen January 2013 (has links)
No description available.
|
6 |
Novel chemometric proposals for advanced multivariate data analysis, processing and interpretationVitale, Raffaele 03 November 2017 (has links)
The present Ph.D. thesis, primarily conceived to support and reinforce the relation between academic and industrial worlds, was developed in collaboration with Shell Global Solutions (Amsterdam, The Netherlands) in the endeavour of applying and possibly extending well-established latent variable-based approaches (i.e. Principal Component Analysis - PCA - Partial Least Squares regression - PLS - or Partial Least Squares Discriminant Analysis - PLSDA) for complex problem solving not only in the fields of manufacturing troubleshooting and optimisation, but also in the wider environment of multivariate data analysis. To this end, novel efficient algorithmic solutions are proposed throughout all chapters to address very disparate tasks, from calibration transfer in spectroscopy to real-time modelling of streaming flows of data. The manuscript is divided into the following six parts, focused on various topics of interest:
Part I - Preface, where an overview of this research work, its main aims and justification is given together with a brief introduction on PCA, PLS and PLSDA;
Part II - On kernel-based extensions of PCA, PLS and PLSDA, where the potential of kernel techniques, possibly coupled to specific variants of the recently rediscovered pseudo-sample projection, formulated by the English statistician John C. Gower, is explored and their performance compared to that of more classical methodologies in four different applications scenarios: segmentation of Red-Green-Blue (RGB) images, discrimination of on-/off-specification batch runs, monitoring of batch processes and analysis of mixture designs of experiments;
Part III - On the selection of the number of factors in PCA by permutation testing, where an extensive guideline on how to accomplish the selection of PCA components by permutation testing is provided through the comprehensive illustration of an original algorithmic procedure implemented for such a purpose;
Part IV - On modelling common and distinctive sources of variability in multi-set data analysis, where several practical aspects of two-block common and distinctive component analysis (carried out by methods like Simultaneous Component Analysis - SCA - DIStinctive and COmmon Simultaneous Component Analysis - DISCO-SCA - Adapted Generalised Singular Value Decomposition - Adapted GSVD - ECO-POWER, Canonical Correlation Analysis - CCA - and 2-block Orthogonal Projections to Latent Structures - O2PLS) are discussed, a new computational strategy for determining the number of common factors underlying two data matrices sharing the same row- or column-dimension is described, and two innovative approaches for calibration transfer between near-infrared spectrometers are presented;
Part V - On the on-the-fly processing and modelling of continuous high-dimensional data streams, where a novel software system for rational handling of multi-channel measurements recorded in real time, the On-The-Fly Processing (OTFP) tool, is designed;
Part VI - Epilogue, where final conclusions are drawn, future perspectives are delineated, and annexes are included. / La presente tesis doctoral, concebida principalmente para apoyar y reforzar la relación entre la academia y la industria, se desarrolló en colaboración con Shell Global Solutions (Amsterdam, Países Bajos) en el esfuerzo de aplicar y posiblemente extender los enfoques ya consolidados basados en variables latentes (es decir, Análisis de Componentes Principales - PCA - Regresión en Mínimos Cuadrados Parciales - PLS - o PLS discriminante - PLSDA) para la resolución de problemas complejos no sólo en los campos de mejora y optimización de procesos, sino también en el entorno más amplio del análisis de datos multivariados. Con este fin, en todos los capítulos proponemos nuevas soluciones algorítmicas eficientes para abordar tareas dispares, desde la transferencia de calibración en espectroscopia hasta el modelado en tiempo real de flujos de datos.
El manuscrito se divide en las seis partes siguientes, centradas en diversos temas de interés:
Parte I - Prefacio, donde presentamos un resumen de este trabajo de investigación, damos sus principales objetivos y justificaciones junto con una breve introducción sobre PCA, PLS y PLSDA;
Parte II - Sobre las extensiones basadas en kernels de PCA, PLS y PLSDA, donde presentamos el potencial de las técnicas de kernel, eventualmente acopladas a variantes específicas de la recién redescubierta proyección de pseudo-muestras, formulada por el estadista inglés John C. Gower, y comparamos su rendimiento respecto a metodologías más clásicas en cuatro aplicaciones a escenarios diferentes: segmentación de imágenes Rojo-Verde-Azul (RGB), discriminación y monitorización de procesos por lotes y análisis de diseños de experimentos de mezclas;
Parte III - Sobre la selección del número de factores en el PCA por pruebas de permutación, donde aportamos una guía extensa sobre cómo conseguir la selección de componentes de PCA mediante pruebas de permutación y una ilustración completa de un procedimiento algorítmico original implementado para tal fin;
Parte IV - Sobre la modelización de fuentes de variabilidad común y distintiva en el análisis de datos multi-conjunto, donde discutimos varios aspectos prácticos del análisis de componentes comunes y distintivos de dos bloques de datos (realizado por métodos como el Análisis Simultáneo de Componentes - SCA - Análisis Simultáneo de Componentes Distintivos y Comunes - DISCO-SCA - Descomposición Adaptada Generalizada de Valores Singulares - Adapted GSVD - ECO-POWER, Análisis de Correlaciones Canónicas - CCA - y Proyecciones Ortogonales de 2 conjuntos a Estructuras Latentes - O2PLS). Presentamos a su vez una nueva estrategia computacional para determinar el número de factores comunes subyacentes a dos matrices de datos que comparten la misma dimensión de fila o columna y dos planteamientos novedosos para la transferencia de calibración entre espectrómetros de infrarrojo cercano;
Parte V - Sobre el procesamiento y la modelización en tiempo real de flujos de datos de alta dimensión, donde diseñamos la herramienta de Procesamiento en Tiempo Real (OTFP), un nuevo sistema de manejo racional de mediciones multi-canal registradas en tiempo real;
Parte VI - Epílogo, donde presentamos las conclusiones finales, delimitamos las perspectivas futuras, e incluimos los anexos. / La present tesi doctoral, concebuda principalment per a recolzar i reforçar la relació entre l'acadèmia i la indústria, es va desenvolupar en col·laboració amb Shell Global Solutions (Amsterdam, Països Baixos) amb l'esforç d'aplicar i possiblement estendre els enfocaments ja consolidats basats en variables latents (és a dir, Anàlisi de Components Principals - PCA - Regressió en Mínims Quadrats Parcials - PLS - o PLS discriminant - PLSDA) per a la resolució de problemes complexos no solament en els camps de la millora i optimització de processos, sinó també en l'entorn més ampli de l'anàlisi de dades multivariades. A aquest efecte, en tots els capítols proposem noves solucions algorítmiques eficients per a abordar tasques dispars, des de la transferència de calibratge en espectroscopia fins al modelatge en temps real de fluxos de dades.
El manuscrit es divideix en les sis parts següents, centrades en diversos temes d'interès:
Part I - Prefaci, on presentem un resum d'aquest treball de recerca, es donen els seus principals objectius i justificacions juntament amb una breu introducció sobre PCA, PLS i PLSDA;
Part II - Sobre les extensions basades en kernels de PCA, PLS i PLSDA, on presentem el potencial de les tècniques de kernel, eventualment acoblades a variants específiques de la recentment redescoberta projecció de pseudo-mostres, formulada per l'estadista anglés John C. Gower, i comparem el seu rendiment respecte a metodologies més clàssiques en quatre aplicacions a escenaris diferents: segmentació d'imatges Roig-Verd-Blau (RGB), discriminació i monitorització de processos per lots i anàlisi de dissenys d'experiments de mescles;
Part III - Sobre la selecció del nombre de factors en el PCA per proves de permutació, on aportem una guia extensa sobre com aconseguir la selecció de components de PCA a través de proves de permutació i una il·lustració completa d'un procediment algorítmic original implementat per a la finalitat esmentada;
Part IV - Sobre la modelització de fonts de variabilitat comuna i distintiva en l'anàlisi de dades multi-conjunt, on discutim diversos aspectes pràctics de l'anàlisis de components comuns i distintius de dos blocs de dades (realitzat per mètodes com l'Anàlisi Simultània de Components - SCA - Anàlisi Simultània de Components Distintius i Comuns - DISCO-SCA - Descomposició Adaptada Generalitzada en Valors Singulars - Adapted GSVD - ECO-POWER, Anàlisi de Correlacions Canòniques - CCA - i Projeccions Ortogonals de 2 blocs a Estructures Latents - O2PLS). Presentem al mateix temps una nova estratègia computacional per a determinar el nombre de factors comuns subjacents a dues matrius de dades que comparteixen la mateixa dimensió de fila o columna, i dos plantejaments nous per a la transferència de calibratge entre espectròmetres d'infraroig proper;
Part V - Sobre el processament i la modelització en temps real de fluxos de dades d'alta dimensió, on dissenyem l'eina de Processament en Temps Real (OTFP), un nou sistema de tractament racional de mesures multi-canal registrades en temps real;
Part VI - Epíleg, on presentem les conclusions finals, delimitem les perspectives futures, i incloem annexos. / Vitale, R. (2017). Novel chemometric proposals for advanced multivariate data analysis, processing and interpretation [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/90442
|
7 |
Modélisation statistique de tenseurs d'ordre supérieur en imagerie par résonance magnétique de diffusion / Statistical modelling of high order tensors in diffusion weighted magnetic resonance imagingGkamas, Theodosios 29 September 2015 (has links)
L'IRMd est un moyen non invasif permettant d'étudier in vivo la structure des fibres nerveuses du cerveau. Dans cette thèse, nous modélisons des données IRMd à l'aide de tenseurs d'ordre 4 (T4). Les problèmes de comparaison de groupes ou d'individu avec un groupe normal sont abordés, et résolus à l'aide d'analyses statistiques sur les T4s. Les approches utilisent des réductions non linéaires de dimension, et bénéficient des métriques non euclidiennes pour les T4s. Les statistiques sont calculées dans l'espace réduit, et permettent de quantifier la dissimilarité entre le groupe (ou l'individu) d'intérêt et le groupe de référence. Les approches proposées sont appliquées à la neuromyélite optique et aux patients atteints de locked in syndrome. Les conclusions tirées sont cohérentes avec les connaissances médicales actuelles. / DW-MRI is a non-invasive way to study in vivo the structure of nerve fibers in the brain. In this thesis, fourth order tensors (T4) were used to model DW-MRI data. In addition, the problems of group comparison or individual against a normal group were discussed and solved using statistical analysis on T4s. The approaches use nonlinear dimensional reductions, assisted by non-Euclidean metrics for T4s. The statistics are calculated in the reduced space and allow us to quantify the dissimilarity between the group (or the individual) of interest and the reference group. The proposed approaches are applied to neuromyelitis optica and patients with locked in syndrome. The derived conclusions are consistent with the current medical knowledge.
|
Page generated in 0.0967 seconds