Global ETD Search

1	Feed forward neural network entities Hadjiprocopis, Andreas January 2000 No description available. 006.3
2	Bootstrapping in a high dimensional but very low sample size problem Song, Juhee 16 August 2006 High Dimension, Low Sample Size (HDLSS) problems have received much attention recently in many areas of science. Analysis of microarray experiments is one such area. Numerous studies are ongoing to investigate the behavior of genes by measuring the abundance of mRNA (messenger ribonucleic acid), that is, gene expression. The HDLSS data investigated in this dissertation consist of a large number of data sets, each of which has only a few observations. We assume a statistical model in which measurements from the same subject have the same expected value and variance. All subjects have the same distribution up to location and scale. Information from all subjects is shared in estimating this common distribution. Our interest is in testing the hypothesis that the mean of measurements from a given subject is 0. Commonly used tests of this hypothesis, the t-test, sign test and traditional bootstrapping, do not necessarily provide reliable results since there are only a few observations for each data set. We motivate a mixture model having C clusters and 3C parameters to overcome the small sample size problem. Standardized data are pooled after assigning each data set to one of the mixture components. To get reasonable initial parameter estimates when density estimation methods are applied, we apply clustering methods including agglomerative and K-means clustering. The Bayes Information Criterion (BIC) and a new criterion, WMCV (Weighted Mean of within Cluster Variance estimates), are used to choose an optimal number of clusters. Density estimation methods including a maximum likelihood unimodal density estimator and kernel density estimation are used to estimate the unknown density. Once the density is estimated, a bootstrapping algorithm that selects samples from the estimated density is used to approximate the distribution of test statistics. 
The t-statistic and an empirical likelihood ratio statistic are used, since their distributions are completely determined by the distribution common to all subjects. A method to control the false discovery rate is used to perform simultaneous tests on all small data sets. Simulated data sets and a set of cDNA (complementary deoxyribonucleic acid) microarray experiment data are analyzed by the proposed methods. Bootstrap Density Estimation Clustering High dimensional Data
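The pooling-then-bootstrap idea in this abstract can be sketched roughly as follows. This is a simplified illustration, not the dissertation's method: it assumes a single mixture component (C = 1), resamples from the empirical distribution of the pooled standardized values instead of an estimated density, and uses hypothetical function names (pooled_bootstrap_null, p_value) and made-up toy data.

```python
import math
import random


def t_stat(sample):
    """One-sample t-statistic for the hypothesis that the mean is 0."""
    n = len(sample)
    m = sum(sample) / n
    var = sum((x - m) ** 2 for x in sample) / (n - 1)
    return m / math.sqrt(var / n)


def pooled_bootstrap_null(datasets, n_obs, n_boot=2000, seed=0):
    """Approximate the null distribution of |t| for samples of size n_obs.

    Assumption: all data sets share one distribution up to location and
    scale, so each is standardized and the results are pooled.
    """
    rng = random.Random(seed)
    pooled = []
    for d in datasets:
        m = sum(d) / len(d)
        sd = math.sqrt(sum((x - m) ** 2 for x in d) / (len(d) - 1))
        if sd == 0:
            continue  # a constant data set carries no shape information
        pooled.extend((x - m) / sd for x in d)
    null = []
    for _ in range(n_boot):
        draw = [rng.choice(pooled) for _ in range(n_obs)]
        if max(draw) == min(draw):
            continue  # skip degenerate resamples with zero variance
        null.append(abs(t_stat(draw)))
    return null


def p_value(sample, null):
    """Two-sided bootstrap p-value with the usual +1 correction."""
    observed = abs(t_stat(sample))
    return (sum(1 for t in null if t >= observed) + 1) / (len(null) + 1)


# Toy data: 200 small data sets with 4 observations each (illustrative only).
toy_rng = random.Random(42)
toy_datasets = [[toy_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
null_dist = pooled_bootstrap_null(toy_datasets, n_obs=4)
p_shifted = p_value([10.1, 9.8, 10.3, 9.9], null_dist)
p_centered = p_value([-1.0, 0.5, 0.2, 0.4], null_dist)
```

In this sketch, a data set whose mean is far from 0 relative to its spread gets a small p-value, while a data set centered near 0 does not.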
3	A Bidirectional Pipeline for Semantic Interaction in Visual Analytics Binford, Adam Quarles 21 September 2016 Semantic interaction in visual data analytics allows users to indirectly adjust model parameters by directly manipulating the output of the models. This is accomplished using an underlying bidirectional pipeline that first uses statistical models to visualize the raw data. When a user interacts with the visualization, the interaction is automatically interpreted as updates to the model parameters, giving the user immediate feedback on each interaction. These interpreted interactions eliminate the need for a deep understanding of the underlying statistical models. However, the development of such tools is necessarily complex due to their interactive nature. Furthermore, each tool defines its own unique pipeline to suit its needs, which makes it difficult to experiment with different types of data, models, interaction techniques, and visual encodings. To address this issue, we present a flexible multi-model bidirectional pipeline for prototyping visual analytics tools that rely on semantic interaction. The pipeline has plug-and-play functionality, enabling quick alterations to the type of data being visualized, how models transform the data, and interaction methods. In so doing, the pipeline enforces a separation between the data pipeline and the visualization, preventing the two from becoming codependent. To show the flexibility of the pipeline, we demonstrate a new visual analytics tool and several distinct variations, each of which was quickly and easily implemented with slight changes to the pipeline or client. / Master of Science Visualization High-dimensional data Interaction design
4	Independence Screening in High-Dimensional Data Wauters, John January 2016 High-dimensional data, data in which the number of dimensions exceeds the number of observations, are increasingly common in statistics. The term "ultra-high dimensional" is defined by Fan and Lv (2008) as describing the situation where log(p) is of order O(n^a) for some a in the interval (0, ½). Such data arise in many contexts, such as gene expression data, proteomic data, imaging data, tomography, and finance, as well as others. High-dimensional data present a challenge to traditional statistical techniques. In traditional statistical settings, models have a small number of features, chosen based on an assumption of what features may be relevant to the response of interest. In the high-dimensional setting, many of the techniques of traditional feature selection become computationally intractable or do not yield unique solutions. Current research in modeling high-dimensional data is heavily focused on methods that screen the features before modeling; that is, methods that eliminate noise features as a pre-modeling dimension reduction. Typically, noise features are identified by exploiting properties of independent random variables, hence the term "independence screening." There are methods for modeling high-dimensional data without feature screening first (e.g. LASSO or SCAD), but simulation studies show screen-first methods perform better as dimensionality increases. Many proposals for independence screening exist, but in my literature review certain themes recurred: A) The assumption of sparsity: that all the useful information in the data is actually contained in a small fraction of the features (the "active" features), the rest being essentially random noise (the "inactive" features). 
B) In many newer methods, initial dimension reduction by feature screening reduces the problem from the high-dimensional case to a classical case; feature selection then proceeds by a classical method. C) In the initial screening, removal of features independent of the response is highly desirable, as such features literally give no information about the response. D) For the initial screening, some statistic is applied pairwise to each feature in combination with the response; the specific statistic chosen so that in the case that the two random variables are independent, a specific known value is expected for the statistic. E) Features are ranked by the absolute difference between the calculated statistic and the expected value of that statistic in the independent case, i.e. features that are most different from the independent case are most preferred. F) Proof is typically offered that, asymptotically, the method retains the true active features with probability approaching one. G) Where possible, an iterative version of the process is explored, as iterative versions do much better at identifying features that are active in their interactions, but not active individually. feature screening high-dimensional data independence screening modeling dimension reduction
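Themes D and E above can be illustrated with a small marginal-screening sketch. It is only one possible instance: the Pearson correlation is chosen here as the pairwise statistic (its expected value is 0 when a feature and the response are independent), the function name screen_features and the toy data are hypothetical, and no particular method from the thesis is implied.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_features(columns, response, keep):
    """Rank features by |statistic - expected value under independence|.

    columns: list of feature columns, each a list of n observations.
    Returns the indices of the `keep` highest-ranked features.
    """
    expected_if_independent = 0.0  # correlation of independent variables
    scores = [
        abs(pearson(col, response) - expected_if_independent)
        for col in columns
    ]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]


# Toy "large p, small n" data: 200 features, 50 observations,
# of which only features 5 and 17 are active (illustrative only).
rng = random.Random(0)
n_obs, n_features = 50, 200
columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_features)]
response = [
    3 * columns[5][i] + 3 * columns[17][i] + rng.gauss(0, 0.1)
    for i in range(n_obs)
]
selected = screen_features(columns, response, keep=10)
```

After screening, the 10 retained features would be passed to a classical selection method, as described in theme B.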
5	Randomization test and correlation effects in high dimensional data Wang, Xiaofei January 1900 Master of Science / Department of Statistics / Gary Gadbury / High-dimensional data (HDD) have been encountered in many fields and are characterized by a “large p, small n” paradigm that arises in genomic, lipidomic, and proteomic studies. This report used a simulation study that employed basic block diagonal covariance matrices to generate correlated HDD. Quantities of interest in such data include, among others, the number of ‘significant’ discoveries. This number can be highly variable when data are correlated. This project compared randomization tests with the usual t-tests for testing significant effects across two treatment conditions. Of interest was whether the variance of the number of discoveries is better controlled in a randomization setting than with a t-test. The results showed that the randomization tests produced results similar to those of t-tests. Randomization test Correlation effect High dimensional data Statistics (0463)
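For readers unfamiliar with randomization tests, the following minimal sketch shows a two-sample version for a single feature. It assumes the absolute difference in group means as the test statistic and a Monte Carlo sample of relabelings; the function name randomization_test and the toy measurements are hypothetical and are not taken from the report.

```python
import random


def mean(values):
    return sum(values) / len(values)


def randomization_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided randomization test for a difference in group means.

    Treatment labels are reshuffled n_perm times; the p-value is the share
    of reshuffles whose absolute mean difference is at least as large as
    the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Toy measurements for one feature under two treatment conditions.
treated = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
control = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]
p_different = randomization_test(treated, control)
p_identical = randomization_test([1, 2, 3, 4], [1, 2, 3, 4])
```

In a high-dimensional study this test would be repeated for every feature, and the number of features with small p-values would be the count of 'significant' discoveries.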
6	Penalised regression for high-dimensional data : an empirical investigation and improvements via ensemble learning Wang, Fan January 2019 In a wide range of applications, datasets are generated for which the number of variables p exceeds the sample size n. Penalised likelihood methods are widely used to tackle regression problems in these high-dimensional settings. In this thesis, we carry out an extensive empirical comparison of the performance of popular penalised regression methods in high-dimensional settings and propose new methodology that uses ensemble learning to enhance the performance of these methods. The relative efficacy of different penalised regression methods in finite-sample settings remains incompletely understood. Through a large-scale simulation study, consisting of more than 1,800 data-generating scenarios, we systematically consider the influence of various factors (for example, sample size and sparsity) on method performance. We focus on three related goals (prediction, variable selection and variable ranking) and consider six widely used methods. The results are supported by a semi-synthetic data example. Our empirical results complement existing theory and provide a resource to compare performance across a range of settings and metrics. We then propose a new ensemble learning approach for improving the performance of penalised regression methods, called STructural RANDomised Selection (STRANDS). The approach, which builds and improves upon the Random Lasso method, consists of two steps. In both steps, we reduce dimensionality by repeated subsampling of variables. We apply a penalised regression method to each subsampled dataset and average the results. In the first step, subsampling is informed by variable correlation structure, and in the second step, by variable importance measures from the first step. STRANDS can be used with any sparse penalised regression approach as the "base learner". 
In simulations, we show that STRANDS typically improves upon its base learner, and demonstrate that taking account of the correlation structure in the first step can help to improve the efficiency with which the model space may be explored. We propose another ensemble learning method to improve the prediction performance of Ridge Regression in sparse settings. Specifically, we combine Bayesian Ridge Regression with a probabilistic forward selection procedure, where inclusion of a variable at each stage is probabilistically determined by a Bayes factor. We compare the prediction performance of the proposed method to penalised regression methods using simulated data.
7	Statistical Dependence in Imputed High-Dimensional Data for a Colorectal Cancer Study Suyundikov, Anvar 01 May 2015 The main purpose of this dissertation was to examine the statistical dependence of imputed microRNA (miRNA) data in a colorectal cancer study. The dissertation addressed three related statistical issues that were raised by this study. The first statistical issue was motivated by the fact that miRNA expression was measured in paired tumor-normal samples of hundreds of patients, but data for many normal samples were missing due to lack of tissue availability. We compared the precision and power performance of several imputation methods, and drew attention to the statistical dependence induced by K-Nearest Neighbors (KNN) imputation. The second statistical issue was raised by the necessity to address the bimodality of distributions of miRNA data along with the imputation-induced dependency among subjects. We proposed and compared the performance of three nonparametric methods to identify the differentially expressed miRNAs in the paired tumor-normal data while accounting for the imputation-induced dependence. The third statistical issue was related to the development of a normalization method for miRNA data that would reduce not only technical variation but also the variation caused by the characteristics of subjects, while maintaining the true biological differences between arrays. Statistical Dependence High-Dimensional Data Colorectal Cancer Study Mathematics
8	A clustering scheme for large high-dimensional document datasets Chen, Jing-wen 09 August 2007 Document clustering methods are attracting more and more attention. Because of the high dimensionality and the large amount of data, clustering methods usually need a lot of computation time. We propose a scheme that makes the clustering algorithm much faster than the original. We partition the whole dataset into several parts. First, we use one of these parts for clustering. Then, according to the labels obtained from clustering, we reduce the number of features by a certain ratio. We then add another part of the data, convert these data to the lower dimension, and cluster again. This is repeated until all partitions are used. According to the experimental results, this scheme may run twice as fast as the original clustering method. Dimension reduction high-dimensional data clustering text mining Document clustering
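The partition, cluster, reduce, and repeat loop described in this abstract might look roughly like the sketch below. Several parts are assumptions made only for illustration: plain k-means with farthest-first initialization as the clustering method, the spread of per-cluster feature means as the criterion for which features to keep (the abstract does not say how labels drive the reduction), and the names kmeans, feature_scores, and incremental_cluster.

```python
import random


def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-first initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def feature_scores(points, labels, k):
    """Assumed criterion: spread of the per-cluster means of each feature."""
    scores = []
    for j in range(len(points[0])):
        means = []
        for c in range(k):
            vals = [p[j] for p, lab in zip(points, labels) if lab == c]
            if vals:
                means.append(sum(vals) / len(vals))
        centre = sum(means) / len(means)
        scores.append(sum((m - centre) ** 2 for m in means))
    return scores


def incremental_cluster(parts, k, keep_ratio=0.5):
    """Cluster part by part, shrinking the feature set after each round."""
    kept = list(range(len(parts[0][0])))  # indices of surviving features
    seen, labels = [], []
    for idx, part in enumerate(parts):
        seen.extend(part)
        projected = [[row[j] for j in kept] for row in seen]
        labels = kmeans(projected, k)
        if idx < len(parts) - 1:  # no reduction after the final round
            scores = feature_scores(projected, labels, k)
            n_keep = max(1, int(len(kept) * keep_ratio))
            best = sorted(range(len(kept)), key=lambda j: scores[j],
                          reverse=True)[:n_keep]
            kept = [kept[j] for j in sorted(best)]
    return labels, kept


# Toy data: two clusters that differ only in features 0 and 1.
rng = random.Random(1)
data = []
for i in range(60):
    centre = 0.0 if i % 2 == 0 else 10.0
    data.append([centre + rng.gauss(0, 1), centre + rng.gauss(0, 1)]
                + [rng.gauss(0, 1) for _ in range(6)])
parts = [data[0:20], data[20:40], data[40:60]]
labels, kept = incremental_cluster(parts, k=2, keep_ratio=0.5)
```

Each round clusters all data seen so far in the reduced feature space, so later rounds work with fewer dimensions, which is where the speed-up in such a scheme would come from.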
9	Statistical Methods to Enhance Clinical Prediction with High-Dimensional Data and Ordinal Response Leha, Andreas 25 March 2015 Technological progress now makes it possible to examine the molecular configuration of individual cells or entire tissue samples. Such high-dimensional omics data from molecular biology, produced in large quantities, can be generated at ever lower cost and are therefore used increasingly often in clinical research questions. Personalized diagnosis and the prediction of treatment success on the basis of such high-throughput data are a modern application of machine learning techniques. In practice, clinical parameters such as health status or the side effects of a therapy are often recorded on an ordinal scale (for example good, normal, bad). It is common to treat classification problems with an ordinally scaled endpoint as general multi-class problems and thus to ignore the information contained in the ordering of the classes. However, neglecting this information can lead to reduced classification performance or even produce an unfavorable unordered classification. Classical approaches that model an ordinally scaled endpoint directly, such as a cumulative link model, typically cannot be applied to high-dimensional data. In this thesis we present hierarchical twoing (hi2), an algorithm for classifying high-dimensional data into ordinally scaled categories. hi2 uses the power of the very well understood binary classification to classify into ordinal categories as well. An open-source implementation of hi2 is available online. 
In a comparison study on the classification of real as well as simulated data with an ordinal endpoint, established methods designed specifically for ordered categories do not generally produce better results than state-of-the-art non-ordinal classifiers. An algorithm's ability to handle high-dimensional data dominates classification performance. We show that our algorithm hi2 consistently achieves good results and in many cases outperforms the other methods. 510 Predictive Modelling Classification Ordinal High Dimensional Data Informatik (PPN619939052)
10	Visualizing large-scale and high-dimensional time series data Yeqiang, Lin January 2017 Time series are one of the main research objects in the field of data mining. Visualization is an important mechanism for presenting processed time series for further analysis by users. In recent years researchers have designed a number of sophisticated visualization techniques for time series. However, most of these techniques focus on a static format, trying to encode the maximal amount of information in one image or plot. We propose the pixel video technique, a visualization technique that displays data in video format. With the pixel video technique, a hierarchical dimension cluster tree is first constructed to generate the similarity order of the dimensions; each frame image is then generated according to pixel-oriented techniques, and the data are displayed in the form of a video. visualization time series high-dimensional data Computer Systems
