61
Stabilité de la sélection de variables sur des données haute dimension : une application à l'expression génique / Feature selection stability on high dimensional data: an application to gene expression data. Dernoncourt, David, 15 October 2014 (has links)
High-throughput technologies allow us to measure very large numbers of variables per patient: DNA sequence, gene expression, lipid profile… Knowledge discovery can be performed on such data using, for instance, classification methods. However, these data contain a very large number of variables, measured, in the best cases, on a few hundred patients. This makes feature selection a necessary first step to reduce the risk of overfitting, reduce computation time, and improve model interpretability. When the number of observations is low, feature selection tends to be unstable: it is common to observe that two selections obtained from two different datasets dealing with the same problem barely overlap. Yet a stable selection seems essential if we want to be confident that the selected variables are genuinely relevant for knowledge discovery. In this work, we first sought to determine which factors most influence feature selection stability. We then proposed a feature selection method, specific to microarray data, that uses functional annotations from Gene Ontology to assist standard feature selection methods by enriching the data with a priori knowledge. We then worked on two aspects of ensemble methods: the choice of the aggregation method, and hybrid ensemble methods. In the final chapter, we apply the methods studied in the thesis to a dataset from our lab concerning the prediction of weight regain after a diet, from microarray data, in obese patients.
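As a rough illustration of the stability problem this abstract describes (not the thesis's own protocol), the sketch below estimates selection stability by repeating a simple univariate filter on bootstrap resamples and averaging the pairwise Jaccard overlap of the selected feature subsets; the filter, the number of selected features k, and the number of runs are arbitrary example choices.

```python
# Minimal sketch (not the thesis's protocol): quantify feature selection
# stability by repeating a univariate selection on bootstrap samples and
# measuring the average pairwise Jaccard overlap of the selected subsets.
import numpy as np
from itertools import combinations
from sklearn.feature_selection import SelectKBest, f_classif

def selection_stability(X, y, k=50, n_runs=20, seed=0):
    """X: (n_samples x n_features) array, y: class labels. Returns mean Jaccard index."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subsets = []
    for _ in range(n_runs):
        idx = rng.choice(n, size=n, replace=True)          # bootstrap resample
        sel = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
        subsets.append(set(np.flatnonzero(sel.get_support())))
    # average pairwise Jaccard index: 1.0 = identical selections, ~0 = no overlap
    pairs = list(combinations(subsets, 2))
    return np.mean([len(a & b) / len(a | b) for a, b in pairs])
```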
62
High dimensional data clustering; A comparative study on gene expressions: Experiment on clustering algorithms on RNA-sequence from tumors with evaluation on internal validation. Henriksson, William, January 2019 (has links)
In cancer research, class discovery is the first step in investigating a new dataset: finding which hidden groups exist among objects with similar attributes. However, gene expression datasets, whether from RNA microarrays or RNA-sequencing, are high-dimensional, which makes it hard to perform cluster analysis and to obtain clusters that are well separated. Well-separated clusters are desirable because they indicate that objects are unlikely to have been placed in the wrong cluster. This report investigates, in an experiment, whether K-means and hierarchical clustering are suitable for clustering gene expressions in RNA-sequence data from various tumors. Dimensionality reduction methods are also applied to see whether they help create well-separated clusters. The results show that well-separated clusters are achieved only when using PCA for dimensionality reduction and K-means on correlation. The main contribution of this paper is showing that applying K-means or hierarchical clustering to the full natural dimensionality of RNA-sequence data yields an undesirably low average silhouette width, below 0.4.
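A minimal sketch of the pipeline described above, assuming scikit-learn defaults (illustrative, not the report's code): standardize the expression matrix, reduce it with PCA, cluster with K-means, and judge separation by the average silhouette width, with 0.4 as the report's rough threshold for well-separated clusters.

```python
# Sketch of the described pipeline: PCA reduction, K-means clustering,
# and evaluation by average silhouette width.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_and_score(X, n_clusters=5, n_components=10, seed=0):
    """X: (samples x genes) expression matrix. Returns the average silhouette width."""
    Xs = StandardScaler().fit_transform(X)            # put genes on comparable scales
    Xr = PCA(n_components=n_components, random_state=seed).fit_transform(Xs)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Xr)
    return silhouette_score(Xr, labels)               # > 0.4 ~ reasonably separated
```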
63
Dimensionality Reduction in High-Dimensional Profile Analysis Using Scores. Vikbladh, Jonathan, January 2022 (has links)
Profile analysis is a multivariate statistical method for comparing the mean vectors of different groups. It consists of three tests: parallelism, level, and flatness. The results of each test give information about the behaviour of the groups and of the variables within the groups. The test statistics used when there are more than two groups are likelihood-ratio tests. However, indeterminate test statistics arise in the high-dimensional setting, that is, when there are more variables than observations. This thesis investigates a way to approach this problem by reducing the dimensionality of the data using scores, that is, linear combinations of the variables. Three ways of choosing the score are compared: the eigendecomposition and two variations of the non-negative matrix factorization. The methods are compared using simulations for five different types of mean parameter settings. The results show that the eigendecomposition is the best technique for choosing the score, and that using more scores only slightly improves the results. Moreover, the results for the parallelism and flatness tests are very good, but the results for the level hypothesis deviate from expectation.
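As a hedged sketch of the score idea (the exact construction, centring, and test statistics follow the thesis, not this example), one can build a few scores from the leading eigenvectors of the pooled within-group covariance and carry out the profile-analysis tests on the projected, low-dimensional data:

```python
# Hedged sketch: replace the p original variables by a few linear combinations
# ("scores") built from the leading eigenvectors of the pooled sample covariance,
# so that the classical likelihood-ratio tests become computable when p > n.
import numpy as np

def eigen_scores(groups, n_scores=2):
    """groups: list of (n_g x p) arrays, one per group. Returns the projected groups."""
    pooled = np.vstack([g - g.mean(axis=0) for g in groups])     # within-group centred data
    cov = pooled.T @ pooled / (pooled.shape[0] - len(groups))    # pooled covariance estimate
    vals, vecs = np.linalg.eigh(cov)                             # eigenvalues in ascending order
    W = vecs[:, -n_scores:]                                      # top eigenvectors as score weights
    return [g @ W for g in groups]                               # each group reduced to n_scores columns
```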
64
PLASMA-HD: Probing the LAttice Structure and MAkeup of High-dimensional Data. Fuhry, David P., January 2015 (has links)
No description available.
65
Variable Selection in High-Dimensional Data. Reichhuber, Sarah; Hallberg, Johan, January 2021 (has links)
Estimating the variables of importance in inferential modelling is of significant interest in many fields of science, engineering, biology, medicine, finance and marketing. However, variable selection in high-dimensional data, where the number of variables is large compared to the number of observed data points, is a major challenge and requires more research in order to enhance reliability and accuracy. In this bachelor thesis project, several known variable selection methods, namely orthogonal matching pursuit (OMP), ridge regression, lasso, adaptive lasso, elastic net, adaptive elastic net and multivariate adaptive regression splines (MARS), were applied to a high-dimensional dataset. The aim of the project was to analyze and compare these variable selection methods. Furthermore, their performance on the same dataset but extended, with the number of variables and observations of similar size, was analyzed and compared as well. This was done by generating models for the different variable selection methods using built-in packages in R and code written in MATLAB. The models were then used to predict the observations, and these estimates were compared to the real observations. The performance of the different variable selection methods was analyzed using several evaluation methods. It could be concluded that some of the variable selection methods provided more accurate models for the high-dimensional dataset than others. Elastic net, for example, was one of the methods that performed better. Additionally, combining the final models could provide further insight into which variables are crucial for the observations in the given dataset, where, for example, variables 112 and 23 appeared to be of importance. / Bachelor's degree project in electrical engineering 2021, KTH, Stockholm
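The thesis works in R and MATLAB; purely as an illustration of the selection step, the following Python sketch fits lasso and elastic net with cross-validated penalties and reports which variable indices receive non-zero coefficients (the CV folds and l1 ratios are arbitrary example values, and a continuous response is assumed):

```python
# Illustrative counterpart to the described workflow: variables with non-zero
# cross-validated coefficients are treated as "selected".
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV

def selected_variables(X, y, seed=0):
    """X: (n x p) predictors, y: continuous response. Returns selected indices per method."""
    lasso = LassoCV(cv=5, random_state=seed).fit(X, y)
    enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5, random_state=seed).fit(X, y)
    return {
        "lasso": np.flatnonzero(lasso.coef_),
        "elastic_net": np.flatnonzero(enet.coef_),
    }
```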
66
Visual Analysis of High-Dimensional Point Clouds using Topological Abstraction. Oesterling, Patrick, 17 May 2016 (has links) (PDF)
This thesis is about visualizing a kind of data that is trivial for computers to process but difficult for humans to imagine, because nature does not equip us with intuition for this type of information: high-dimensional data. Such data often result from representing observations of objects under various aspects or with different properties. In many applications, a typical, laborious task is to find related objects or to group those that are similar to each other. One classic solution for this task is to imagine the data as vectors in a Euclidean space with object variables as dimensions. Utilizing Euclidean distance as a measure of similarity, objects with similar properties and values accumulate into groups, so-called clusters, that are exposed by cluster analysis on the high-dimensional point cloud. Because similar vectors can be thought of as objects that are alike in terms of their attributes, the point cloud's structure and individual cluster properties, like their size or compactness, summarize data categories and their relative importance. The contribution of this thesis is a novel analysis approach for visual exploration of high-dimensional point clouds that does not suffer from structural occlusion. The work is based on two key concepts. The first idea is to discard those geometric properties that cannot be preserved and thus lead to the typical artifacts; topological concepts are used instead to shift the focus from a point-centered view of the data to a more structure-centered perspective. The advantage is that topology-driven clustering information can be extracted in the data's original domain and preserved without loss in low dimensions. The second idea is to split the analysis into a topology-based global overview and a subsequent geometric local refinement. The occlusion-free overview enables the analyst to identify features and to link them to other visualizations that permit analysis of those properties not captured by the topological abstraction, e.g. cluster shape or value distributions in particular dimensions or subspaces. The advantage of separating structure from data point analysis is that restricting local analysis to data subsets significantly reduces artifacts and the visual complexity of standard techniques. That is, the additional topological layer enables the analyst to identify structure that was hidden before and to focus on particular features by suppressing irrelevant points during local feature analysis. This thesis addresses the topology-based visual analysis of high-dimensional point clouds for both the time-invariant and the time-varying case. Time-invariant means that the points do not change in number or position; the analyst explores the clustering of a fixed and constant set of points. The extension to the time-varying case implies the analysis of a varying clustering, where clusters appear, merge, split, or vanish. Especially for high-dimensional data, both tracking, that is, relating features over time, and visualizing the changing structure are difficult problems to solve.
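The following is not the thesis's algorithm, only a small illustration of the underlying idea of a topological overview: summarize the cluster structure of a point cloud in its original high-dimensional domain (no projection) from single-linkage merge heights, which correspond to the 0-dimensional persistence of connected components.

```python
# Illustration of a topology-style overview: the single-linkage merge tree
# records at which distance thresholds components of the point cloud join,
# which summarizes clustering structure without embedding the points in 2D.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def topological_overview(X, n_clusters=5):
    """X: (n x d) point cloud. Returns cluster labels and sorted merge heights."""
    Z = linkage(pdist(X), method="single")        # merge tree of connected components
    persistence = np.sort(Z[:, 2])[::-1]          # merge heights, most persistent first
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return labels, persistence
```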
67
Maximum-likelihood kernel density estimation in high-dimensional feature spaces. Van der Walt, Christiaan Maarten, January 2014 (has links)
With the advent of the internet and advances in computing power, the collection of very large high-dimensional datasets has become feasible; understanding and modelling high-dimensional data has thus become a crucial activity, especially in the field of pattern recognition. Since non-parametric density estimators are data-driven and do not require or impose a pre-defined probability density function on data, they are very powerful tools for probabilistic data modelling and analysis. Conventional non-parametric density estimation methods, however, originated from the field of statistics and were not originally intended to perform density estimation in high-dimensional feature spaces, as is often encountered in real-world pattern recognition tasks. We therefore address the fundamental problem of non-parametric density estimation in high-dimensional feature spaces in this study. Recent advances in maximum-likelihood (ML) kernel density estimation have shown that kernel density estimators hold much promise for estimating non-parametric probability density functions in high-dimensional feature spaces. We therefore derive two new iterative kernel bandwidth estimators from the maximum-likelihood (ML) leave-one-out objective function and also introduce a new non-iterative kernel bandwidth estimator (based on the theoretical bounds of the ML bandwidths) for the purpose of bandwidth initialisation. We name the iterative kernel bandwidth estimators the minimum leave-one-out entropy (MLE) and global MLE estimators, and name the non-iterative kernel bandwidth estimator the MLE rule-of-thumb estimator. We compare the performance of the MLE rule-of-thumb estimator and conventional kernel density estimators on artificial data with data properties that are varied in a controlled fashion and on a number of representative real-world pattern recognition tasks, to gain a better understanding of the behaviour of these estimators in high-dimensional spaces and to determine whether these estimators are suitable for initialising the bandwidths of iterative ML bandwidth estimators in high dimensions. We find that there are several regularities in the relative performance of conventional kernel density estimators across different tasks and dimensionalities, and that the Silverman rule-of-thumb bandwidth estimator performs reliably across most tasks and dimensionalities of the pattern recognition datasets considered, even in high-dimensional feature spaces. Based on this empirical evidence and the intuitive theoretical motivation that the Silverman estimator optimises the asymptotic mean integrated squared error (assuming a Gaussian reference distribution), we select this estimator to initialise the bandwidths of the iterative ML kernel bandwidth estimators compared in our simulation studies. We then perform a comparative simulation study of the newly introduced iterative MLE estimators and other state-of-the-art iterative ML estimators on a number of artificial and real-world high-dimensional pattern recognition tasks. We illustrate with artificial data (guided by theoretical motivations) under what conditions certain estimators should be preferred, and we empirically confirm on real-world data that no estimator performs optimally on all tasks and that the optimal estimator depends on the properties of the underlying density function being estimated.
We also observe an interesting case of the bias-variance trade-off where ML estimators with fewer parameters than the MLE estimator perform exceptionally well on a wide variety of tasks; however, in the cases where these estimators do not perform well, the MLE estimator generally does. The newly introduced MLE kernel bandwidth estimators prove to be a useful contribution to the field of pattern recognition, since they perform optimally on a number of the real-world pattern recognition tasks investigated and provide researchers and practitioners with two alternative estimators to employ for the task of kernel density estimation. / PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014
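A minimal sketch of the general recipe discussed above, assuming a single isotropic Gaussian kernel bandwidth (this is not the MLE or global MLE estimator introduced in the thesis): initialise with the multivariate Silverman rule of thumb, then maximise the leave-one-out log-likelihood over a grid of bandwidths around it.

```python
# Sketch only: Silverman rule-of-thumb initialisation followed by a grid search
# that maximises the leave-one-out log-likelihood of a Gaussian kernel density
# estimate with a single bandwidth h.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.special import logsumexp

def silverman_bandwidth(X):
    n, d = X.shape
    sigma = X.std(axis=0, ddof=1).mean()               # one common choice of scale
    return sigma * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))

def loo_log_likelihood(X, h):
    n, d = X.shape
    D2 = squareform(pdist(X, "sqeuclidean"))
    np.fill_diagonal(D2, np.inf)                        # leave each point out of its own sum
    log_f = logsumexp(-D2 / (2 * h * h), axis=1) - np.log(n - 1) \
            - d * np.log(h) - 0.5 * d * np.log(2 * np.pi)
    return log_f.sum()

def ml_bandwidth(X, factors=np.linspace(0.2, 3.0, 29)):
    grid = factors * silverman_bandwidth(X)             # rule-of-thumb initialisation
    return grid[np.argmax([loo_log_likelihood(X, h) for h in grid])]
```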
68
Order in the random forest. Karlsson, Isak, January 2017 (has links)
In many domains, repeated measurements are systematically collected to capture the characteristics of objects or situations that evolve over time or along other logical orderings. Although the classification of such data series shares many similarities with traditional multidimensional classification, inducing accurate machine learning models with traditional algorithms is typically infeasible, since the order of the values must be taken into account. In this thesis, the challenges of inducing predictive models from data series using a class of algorithms known as random forests are studied, with the aim of efficiently and effectively classifying (i) univariate, (ii) multivariate and (iii) heterogeneous data series, either directly in their sequential form or indirectly after transformation to sparse and high-dimensional representations. Methods are developed to address the challenges of (a) handling sparse and high-dimensional data, (b) data series classification and (c) early time series classification using random forests. The proposed algorithms are empirically evaluated in large-scale experiments and practically evaluated in the context of detecting adverse drug events. In the first part of the thesis, it is demonstrated that minor modifications to the random forest algorithm and the use of a random projection technique can improve the effectiveness of random forests when faced with discrete data series projected to sparse and high-dimensional representations. In the second part of the thesis, an algorithm for inducing random forests directly from univariate, multivariate and heterogeneous data series using phase-independent patterns is introduced and shown to be highly effective in terms of both computational and predictive performance. Then, leveraging the notion of phase-independent patterns, the random forest is extended to allow early classification of time series and is shown to perform favorably compared to alternatives. The conclusions of the thesis not only reaffirm the empirical effectiveness of random forests for traditional multidimensional data, but also indicate that the random forest framework can successfully be extended to sequential data representations.
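As a rough illustration of one ingredient only (not the thesis's algorithms), the sketch below compresses a sparse, high-dimensional representation of data series with a random projection and trains a random forest on the projected features; the projection size and number of trees are arbitrary example values, and the projection dimension must be smaller than the input dimensionality.

```python
# Illustration: random projection of a sparse, high-dimensional representation
# followed by a random forest classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import SparseRandomProjection

def projected_forest(n_components=256, n_trees=500, seed=0):
    return make_pipeline(
        SparseRandomProjection(n_components=n_components, random_state=seed),
        RandomForestClassifier(n_estimators=n_trees, random_state=seed),
    )

# usage: projected_forest().fit(X_sparse_train, y_train).predict(X_sparse_test)
```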
69
Generation of semantic layouts for interactive multidimensional data visualization / Geração de layouts semânticos para a visualização interativa de dados multidimensionais. Gomez Nieto, Erick Mauricio, 24 February 2017 (has links)
Visualization methods make use of interactive graphical representations embedded in a display area in order to enable data exploration and analysis. These typically rely on geometric primitives to represent data or to build more sophisticated representations that assist the visual analysis process. One of the most challenging tasks in this context is to determine an optimal layout of these primitives that is both effective and informative. Existing algorithms for building layouts from geometric primitives are typically designed to cope with requirements such as orthogonal alignment, overlap removal, optimal area usage, hierarchical organization, and dynamic update, among others. However, most techniques are able to tackle just a few of those requirements simultaneously, impairing their use and flexibility. In this dissertation, we propose a set of approaches for building layouts from geometric primitives that concurrently address a wider range of requirements. Relying on multidimensional projection and optimization formulations, our methods arrange geometric objects in the visual space so as to generate well-structured layouts that preserve the semantic relation among objects while still making efficient use of the display area. A comprehensive set of quantitative comparisons against existing layout-generation methods, together with applications to text, image, and video dataset visualization, proves the effectiveness of our approaches.
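A very small sketch of the general recipe, under assumptions that are not the dissertation's formulations: place items with a multidimensional projection so that similar items stay close, then iteratively push apart overlapping square glyphs of equal size until the layout is overlap-free, trading a little semantic fidelity for readability.

```python
# Sketch: semantic placement by MDS, then a naive pairwise overlap-removal pass
# for equally sized square glyphs centered at the projected positions.
import numpy as np
from sklearn.manifold import MDS

def semantic_layout(features, glyph_size=1.0, iters=200, seed=0):
    """features: (n x d) item descriptors. Returns overlap-reduced 2D positions."""
    pos = MDS(n_components=2, random_state=seed).fit_transform(features)
    n = len(pos)
    for _ in range(iters):
        moved = False
        for i in range(n):
            for j in range(i + 1, n):
                delta = pos[j] - pos[i]
                overlap = glyph_size - np.abs(delta)       # per-axis overlap of the squares
                if np.all(overlap > 0):                    # glyphs intersect
                    axis = int(np.argmin(overlap))         # separate along the cheapest axis
                    direction = 1.0 if delta[axis] >= 0 else -1.0
                    shift = 0.5 * overlap[axis] * direction
                    pos[i, axis] -= shift
                    pos[j, axis] += shift
                    moved = True
        if not moved:                                      # stop once no pair overlaps
            break
    return pos
```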
70
A concentration inequality based statistical methodology for inference on covariance matrices and operators. Kashlak, Adam B., January 2017 (has links)
In the modern era of high and infinite dimensional data, classical statistical methodology is often rendered inefficient and ineffective when confronted with such big-data problems as arise in genomics, medical imaging, speech analysis, and many other areas of research. Many problems manifest when the practitioner must take the covariance structure of the data into account during the analysis, which takes the form of either a high-dimensional low-rank matrix or a finite-dimensional representation of an infinite-dimensional operator acting on some underlying function space. Thus, novel methodology is required to estimate, analyze, and make inferences about such covariances. In this manuscript, we propose using tools from the concentration of measure literature, a theory that arose in the latter half of the 20th century from connections between geometry, probability, and functional analysis, to construct rigorous descriptive and inferential statistical methodology for covariance matrices and operators. A variety of concentration inequalities are considered, which allow for the construction of nonasymptotic dimension-free confidence sets for the unknown matrices and operators. Given such confidence sets, a wide range of estimation and inferential procedures can be, and subsequently are, developed. For high-dimensional data, we propose a method that searches a concentration-inequality-based confidence set with a binary search algorithm to estimate large sparse covariance matrices. Both sub-Gaussian and sub-exponential concentration inequalities are considered and applied to simulated data and to a set of gene expression data from a study of small round blue-cell tumours. For infinite-dimensional data, also referred to as functional data, we use a celebrated result, Talagrand's concentration inequality, in the Banach space setting to construct confidence sets for covariance operators. From these confidence sets, three different inferential techniques emerge: the first is a k-sample test for equality of covariance operators; the second is a functional data classifier, which makes its decisions based on the covariance structure of the data; the third is a functional data clustering algorithm, which incorporates the concentration-inequality-based confidence sets into the framework of an expectation-maximization algorithm. These techniques are applied to simulated data and to speech samples from a set of spoken phoneme data. Lastly, we take a closer look at a key tool used in the construction of concentration-based confidence sets: Rademacher symmetrization. The symmetrization inequality, which arises in the probability-in-Banach-spaces literature, is shown to be connected with optimal transport theory and specifically the Wasserstein distance. This insight is used to improve the symmetrization inequality, resulting in tighter concentration bounds for the construction of nonasymptotic confidence sets. A variety of other applications are considered, including tests for data symmetry and tightening inequalities in Banach spaces. An R package for inference on covariance operators is briefly discussed in an appendix chapter.
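A heavily hedged sketch of the sparse-covariance idea only (the radius would come from the chosen sub-Gaussian or sub-exponential concentration inequality and is treated as an input here, not derived): hard-threshold the off-diagonal entries of the sample covariance and binary-search the largest threshold whose result still lies inside a confidence ball around the sample covariance.

```python
# Sketch, not the thesis's estimator: search over hard-thresholding levels for
# the sparsest covariance estimate that stays inside a confidence ball of the
# given radius around the sample covariance (operator-norm distance).
import numpy as np

def sparse_cov_in_ball(X, radius, tol=1e-6):
    """X: (n x p) data, radius: confidence-ball radius from a concentration bound."""
    S = np.cov(X, rowvar=False)                       # sample covariance
    lo, hi = 0.0, np.abs(S - np.diag(np.diag(S))).max()   # candidate threshold range
    while hi - lo > tol:
        t = 0.5 * (lo + hi)
        T = np.where(np.abs(S) >= t, S, 0.0)
        np.fill_diagonal(T, np.diag(S))               # never threshold the diagonal
        if np.linalg.norm(S - T, ord=2) <= radius:
            lo = t                                    # still inside the confidence set
        else:
            hi = t
    T = np.where(np.abs(S) >= lo, S, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T
```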