Global ETD Search

101	Advances on Dimension Reduction for Multivariate Linear Regression Guo, Wenxing January 2020 Multivariate linear regression methods are widely used statistical tools in data analysis. They apply when several response variables are studied simultaneously, and the aim is to study the relationship between predictor variables and response variables through the regression coefficient matrix. Rapid improvements in information technology have brought large volumes of large-scale data, but also great challenges in data processing. When dealing with high-dimensional data, classical least squares estimation is not applicable in multivariate linear regression analysis. In recent years, several approaches have been developed to deal with high-dimensional data problems, among which dimension reduction is one of the main ones. In some of the literature, random projection methods have been used to reduce dimension in large datasets. In Chapter 2, a new random projection method with low-rank matrix approximation is proposed to reduce the dimension of the parameter space in the high-dimensional multivariate linear regression model. Some statistical properties of the proposed method are studied, and explicit expressions are derived for the accuracy loss of the method with Gaussian random projection and orthogonal random projection. These expressions are precise rather than bounds up to constants. In multivariate regression analysis, reduced rank regression is also a dimension reduction method, which has become an important tool due to its simplicity, computational efficiency and good predictive performance. In practical situations, however, the performance of the reduced rank estimator is not satisfactory when the predictor variables are highly correlated or the signal-to-noise ratio is small. 
To overcome this problem, in Chapter 3, we incorporate matrix projections into the reduced rank regression method, and develop reduced rank regression estimators based on random projection and orthogonal projection in high-dimensional multivariate linear regression models. We also propose a consistent estimator of the rank of the coefficient matrix and obtain prediction performance bounds for the proposed estimators based on mean squared errors. Envelope methods have also become popular in recent years for reducing estimative and predictive variation in multivariate regression; they include a class of methods that improve efficiency without changing the traditional objectives. Variable selection is the process of selecting a subset of relevant feature variables for use in model construction. Its purpose is to avoid the curse of dimensionality, simplify models to make them easier to interpret, shorten training time and reduce overfitting. In Chapter 4, we combine envelope models with a group variable selection method to propose an envelope-based sparse reduced rank regression estimator in high-dimensional multivariate linear regression models, and establish its consistency, asymptotic normality and oracle property. Tensor data are in frequent use today in a variety of fields in science and engineering. Processing tensor data is a practical but challenging problem. Recently, the prevalence of tensor data has resulted in several tensor versions of envelope models. In Chapter 5, we incorporate the envelope technique into tensor regression analysis and propose a partial tensor envelope model, which leads to a parsimonious version of tensor response regression when some predictors are of special interest; consistency and asymptotic normality of the coefficient estimators are then proved. 
The proposed method achieves significant gains in efficiency compared to the standard tensor response regression model in terms of the estimation of the coefficients for the selected predictors. Finally, in Chapter 6, we summarize the work carried out in the thesis, and then suggest some problems of further research interest. / Dissertation / Doctor of Philosophy (PhD)
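The random projection idea described in the abstract above can be sketched in a few lines. This is only a generic illustration, not the estimator proposed in the thesis: the sizes `n`, `p`, `q`, `k`, the simulated data and the plain least squares fit on the projected predictors are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n samples, p predictors (p > n), q responses, k projected dims.
n, p, q, k = 50, 200, 3, 10

# Simulated data (an assumption for illustration only).
X = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, q))
Y = X @ B_true + 0.1 * rng.standard_normal((n, q))

# Gaussian random projection matrix, scaled so that squared norms are
# preserved in expectation.
R = rng.standard_normal((p, k)) / np.sqrt(k)

# Project the predictors from p to k dimensions, then fit ordinary least
# squares in the reduced space, where it is well posed because k < n.
Z = X @ R                                        # shape (n, k)
C_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)    # shape (k, q)

# Map the reduced coefficients back to the original predictor space.
B_hat = R @ C_hat                                # shape (p, q)
```

The fitted values `X @ B_hat` coincide with `Z @ C_hat`, so the projection only restricts the coefficient matrix to the column space of `R`; how much accuracy is lost by that restriction is the kind of question the abstract says the thesis quantifies.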
102	Adaptive risk management Chen, Ying 13 February 2007 In den vergangenen Jahren ist die Untersuchung des Risikomanagements vom Baselkomitee angeregt, um die Kredit- und Bankwesen regelmäßig zu aufsichten. Für viele multivariate Risikomanagementmethoden gibt es jedoch Beschränkungen von: 1) verlässt sich die Kovarianzschätzung auf eine zeitunabhängige Form, 2) die Modelle beruhen auf eine unrealistischen Verteilungsannahme und 3) numerische Problem, die bei hochdimensionalen Daten auftreten. Es ist das primäre Ziel dieser Doktorarbeit, präzise und schnelle Methoden vorzuschlagen, die diesen Beschränkungen überwinden. Die Grundidee besteht darin, zuerst aus einer hochdimensionalen Zeitreihe die stochastisch unabhängigen Komponenten (IC) zu extrahieren und dann die Verteilungsparameter der resultierenden IC beruhend auf eindimensionale Heavy-Tailed Verteilungsannahme zu identifizieren. Genauer gesagt werden zwei lokale parametrische Methoden verwendet, um den Varianzprozess jeder IC zu schätzen, das lokale Moving Window Average (MVA) Methode und das lokale Exponential Smoothing (ES) Methode. Diese Schätzungen beruhen auf der realistischen Annahme, dass die IC Generalized Hyperbolic (GH) verteilt sind. Die Berechnung ist schneller und erreicht eine höhere Genauigkeit als viele bekannte Risikomanagementmethoden. / In recent years, research on risk management has been prompted by the Basel Committee's requirement for regular banking supervision. Many risk management methods, however, have limitations: 1) covariance estimation relies on a time-invariant form, 2) models are based on unrealistic distributional assumptions, and 3) numerical problems appear when the methods are applied to high-dimensional portfolios. The primary aim of this dissertation is to propose adaptive methods that overcome these limitations and can measure the risk exposures of multivariate portfolios accurately and quickly. 
The basic idea is to first extract stochastically independent components (ICs) from a high-dimensional time series and then identify the distributional behavior of each resulting IC in univariate space. More specifically, two local parametric approaches, the local moving window average (MWA) method and the local exponential smoothing (ES) method, are used to estimate the volatility process of each IC under a heavy-tailed distributional assumption, namely that the ICs follow the generalized hyperbolic (GH) distribution. This speeds up the computation of risk measures and achieves much better accuracy than many popular risk management methods. Risikomanagement Heavy-Tailed Verteilung Lokale parametrische Methoden Hochdimensionale Datenanalyse Risk management Heavy-tailed distribution Local parametric methods High-dimensional data analysis 330 Wirtschaft 17 Wirtschaft QP 300 ddc:330
103	Understanding High-Dimensional Data Using Reeb Graphs Harvey, William John 14 August 2012 No description available. Bioinformatics Computer Science computer science computational geometry computational topology geometry topology high dimensions high dimensional data Reeb graph contour tree visualization visual analytics Morse theory protein folding molecular dynamics survivin
104	Inference for Generalized Multivariate Analysis of Variance (GMANOVA) Models and High-dimensional Extensions Jana, Sayantee 11 1900 A Growth Curve Model (GCM) is a multivariate linear model used for analyzing longitudinal data with short to moderate time series. It is a special case of Generalized Multivariate Analysis of Variance (GMANOVA) models. Analysis using the GCM involves comparison of mean growth among different groups. The classical GCM, however, has some limitations: it makes distributional assumptions, it assumes polynomials of identical degree for all groups, and it requires a sample size larger than the number of time points. In this thesis, we relax some of the assumptions of the traditional GCM and develop appropriate inferential tools for its analysis, with the aim of reducing bias, improving precision and gaining power, as well as overcoming the limitations of high dimensionality. Existing methods for estimating the parameters of the GCM assume that the underlying distribution of the error terms is multivariate normal. In practical problems, however, we often come across skewed data, and hence estimation techniques developed under the normality assumption may not be optimal. Simulation studies conducted in this thesis show that existing methods are indeed sensitive to skewness in the data: when the normality assumption is violated, the estimators have increased bias and mean square error (MSE). Methods appropriate for skewed distributions are therefore required. In this thesis, we relax the distributional assumption of the GCM and provide estimators for the mean and covariance matrices of the GCM under the multivariate skew normal (MSN) distribution. An estimator for the additional skewness parameter of the MSN distribution is also provided. 
The estimators are derived using the expectation maximization (EM) algorithm, and extensive simulations are performed to examine their performance. Comparisons show that our estimators perform better than existing estimators when the underlying distribution is multivariate skew normal. An illustration using a real data set is also provided, in which triglyceride levels from the Framingham Heart Study are modelled over time. The GCM assumes polynomials of equal degree for each group. Therefore, when group means follow polynomials of different shapes, the GCM fails to accommodate this difference in one model. We consider an extension of the GCM in which mean responses from different groups can have different shapes, represented by polynomials of different degrees. Such a model is referred to as the Extended Growth Curve Model (EGCM). We extend our work on the GCM to the EGCM, and develop estimators for the mean and covariance matrices under MSN errors. We adopted the Restricted Expectation Maximization (REM) algorithm, which is based on the multivariate Newton-Raphson (NR) method and Lagrangian optimization. However, the multivariate NR method, and hence the existing REM algorithm, applies to vector parameters, whereas the parameters of interest in this study are matrices. We therefore extended the NR approach to matrix parameters, which in turn allowed us to extend the REM algorithm to matrix parameters. The performance of the proposed estimators was examined using extensive simulations, and a motivating real data example was provided to illustrate their application. Finally, this thesis deals with high-dimensional applications of the GCM. Existing methods for the GCM are developed under the assumption of ‘small p large n’ (n >> p) and are not appropriate for analyzing high-dimensional longitudinal data, due to singularity of the sample covariance matrix. 
In a previous work, we used the Moore-Penrose generalized inverse to overcome this challenge. However, that method has some limitations near singularity, when p ~ n. In this thesis, a Bayesian framework was used to derive a test of the linear hypothesis on the mean parameter of the GCM that is applicable in high-dimensional situations. Extensive simulations are performed to investigate the performance of the test statistic and establish optimality characteristics. Results show that this test performs well under different conditions, including the near-singularity zone. The sensitivity of the test to mis-specification of the parameters of the prior distribution is also examined empirically. A numerical example is provided to illustrate the usefulness of the proposed method in practical situations. / Thesis / Doctor of Philosophy (PhD) Growth Curve Model (GCM) GMANOVA models Bayesian methods High-dimensional data Longitudinal analysis Multivariate Skew Normal distribution Extended Growth Curve Model (EGCM) EM algorithm Restricted EM algorithm Matrix Newton Raphson method
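The singularity problem mentioned in the abstract above, and the Moore-Penrose workaround it refers to, can be illustrated with a small numerical sketch. The sizes `n` and `p`, the simulated Gaussian data and the tolerance `rcond=1e-10` are assumptions chosen for the example; nothing here reproduces the thesis's own procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical high-dimensional setting: fewer observations than dimensions.
n, p = 10, 25
X = rng.standard_normal((n, p))

# p x p sample covariance matrix (rows of X are observations).
S = np.cov(X, rowvar=False)

# After centering, the rank of S is at most n - 1 < p, so S has no ordinary inverse.
rank = np.linalg.matrix_rank(S)

# The Moore-Penrose generalized inverse exists for any matrix.
# hermitian=True exploits the symmetry of S; rcond discards numerically
# zero eigenvalues instead of inverting them.
S_pinv = np.linalg.pinv(S, rcond=1e-10, hermitian=True)

# Defining property of the generalized inverse: S S^+ S = S.
satisfies_identity = np.allclose(S @ S_pinv @ S, S)
```

When p is close to n, the smallest nonzero eigenvalues of `S` tend to be tiny, so their reciprocals dominate `S_pinv`; this is one way to picture the near-singularity limitation the abstract describes.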
105	Détection d'anomalies à la volée dans des signaux vibratoires / Anomaly detection in high-dimensional datastreams Bellas, Anastasios 28 January 2014 (has links) Le thème principal de cette thèse est d’étudier la détection d’anomalies dans des flux de données de grande dimension avec une application spécifique au Health Monitoring des moteurs d’avion. Dans ce travail, on considère que le problème de la détection d’anomalies est un problème d’apprentissage non supervisée. Les données modernes, notamment celles issues de la surveillance des systèmes industriels sont souvent des flux d’observations de grande dimension, puisque plusieurs mesures sont prises à de hautes fréquences et à un horizon de temps qui peut être infini. De plus, les données peuvent contenir des anomalies (pannes) du système surveillé. La plupart des algorithmes existants ne peuvent pas traiter des données qui ont ces caractéristiques. Nous introduisons d’abord un algorithme de clustering probabiliste offline dans des sous-espaces pour des données de grande dimension qui repose sur l’algorithme d’espérance-maximisation (EM) et qui est, en plus, robuste aux anomalies grâce à la technique du trimming. Ensuite, nous nous intéressons à la question du clustering probabiliste online de flux de données de grande dimension en développant l’inférence online du modèle de mélange d’analyse en composantes principales probabiliste. Pour les deux méthodes proposées, nous montrons leur efficacité sur des données simulées et réelles, issues par exemple des moteurs d’avion. Enfin, nous développons une application intégrée pour le Health Monitoring des moteurs d’avion dans le but de détecter des anomalies de façon dynamique. Le système proposé introduit des techniques originales de détection et de visualisation d’anomalies reposant sur les cartes auto-organisatrices. Des résultats de détection sont présentés et la question de l’identification des anomalies est aussi discutée. 
/ The subject of this thesis is anomaly detection in high-dimensional data streams, with a specific application to aircraft engine Health Monitoring. In this work, we treat anomaly detection as an unsupervised learning problem. Modern data, especially those originating from industrial systems, are often streams of high-dimensional data samples, since multiple measurements can be taken at a high frequency and over a possibly infinite time horizon. Moreover, the data can contain anomalies (malfunctions, failures) of the system being monitored. Most existing unsupervised learning methods cannot handle data with these features. We first introduce an offline subspace clustering algorithm for high-dimensional data based on the expectation-maximization (EM) algorithm, which is also robust to anomalies through the use of the trimming technique. We then address the problem of online clustering of high-dimensional data streams by developing an online inference algorithm for the popular mixture of probabilistic principal component analyzers (MPPCA) model. We show the efficiency of both methods on synthetic and real datasets, including aircraft engine data with anomalies. Finally, we develop a comprehensive application for the aircraft engine Health Monitoring domain, which aims at detecting anomalies in aircraft engine data in a dynamic manner and introduces novel anomaly detection and visualization techniques based on Self-Organizing Maps. Detection results are presented and anomaly identification is also discussed. Classification Détection d’anomalies Données de grande dimension Flux de données Trimming Clustering online Mélange de PPCA online Cartes auto-organisatrices Moteurs d’avion Health Monitoring. Classification, anomaly detection High-dimensional data Data streams Trimming Online clustering Online mixture of PPCA Self-Organizing Maps Aircraft engine Health Monitoring 510
106	Metody pro predikci s vysokodimenzionálními daty genových expresí / Methods for class prediction with high-dimensional gene expression data Šilhavá, Jana Unknown Date The dissertation deals with prediction from high-dimensional gene expression data. The amount of available genomic data has grown significantly over the last decade. Combining gene expression data with other data finds application in many areas. For example, in clinical cancer management it can contribute to a more accurate determination of disease prognosis. The main part of this dissertation focuses on combining gene expression data and clinical data. We use logistic regression models built with various regularization techniques. Generalized linear models make it possible to combine models with different data structures. The dissertation shows that combining a model of gene expression data with clinical data can improve prediction accuracy compared with a model built from gene expression data alone or from clinical data alone. The proposed procedures are not computationally demanding. Testing is performed first on simulated data sets in various settings and then on real benchmark data. We also address how to determine the additional value of microarray data. The dissertation includes a comparison of the features selected by a gene expression classifier on five different breast cancer data sets. We also propose a feature selection procedure that combines gene expression data with knowledge from gene ontologies.
107	Hard and fuzzy block clustering algorithms for high dimensional data / Algorithmes de block-clustering dur et flou pour les données en grande dimension Laclau, Charlotte 14 April 2016 (has links) Notre capacité grandissante à collecter et stocker des données a fait de l'apprentissage non supervisé un outil indispensable qui permet la découverte de structures et de modèles sous-jacents aux données, sans avoir à \étiqueter les individus manuellement. Parmi les différentes approches proposées pour aborder ce type de problème, le clustering est très certainement le plus répandu. Le clustering suppose que chaque groupe, également appelé cluster, est distribué autour d'un centre défini en fonction des valeurs qu'il prend pour l'ensemble des variables. Cependant, dans certaines applications du monde réel, et notamment dans le cas de données de dimension importante, cette hypothèse peut être invalidée. Aussi, les algorithmes de co-clustering ont-ils été proposés: ils décrivent les groupes d'individus par un ou plusieurs sous-ensembles de variables au regard de leur pertinence. La structure des données finalement obtenue est composée de blocs communément appelés co-clusters. Dans les deux premiers chapitres de cette thèse, nous présentons deux approches de co-clustering permettant de différencier les variables pertinentes du bruit en fonction de leur capacité \`a révéler la structure latente des données, dans un cadre probabiliste d'une part et basée sur la notion de métrique, d'autre part. L'approche probabiliste utilise le principe des modèles de mélanges, et suppose que les variables non pertinentes sont distribuées selon une loi de probabilité dont les paramètres sont indépendants de la partition des données en cluster. L'approche métrique est fondée sur l'utilisation d'une distance adaptative permettant d'affecter à chaque variable un poids définissant sa contribution au co-clustering. 
D'un point de vue théorique, nous démontrons la convergence des algorithmes proposés en nous appuyant sur le théorème de convergence de Zangwill. Dans les deux chapitres suivants, nous considérons un cas particulier de structure en co-clustering, qui suppose que chaque sous-ensemble d'individus et décrit par un unique sous-ensemble de variables. La réorganisation de la matrice originale selon les partitions obtenues sous cette hypothèse révèle alors une structure de blocks homogènes diagonaux. Comme pour les deux contributions précédentes, nous nous plaçons dans le cadre probabiliste et métrique. L'idée principale des méthodes proposées est d'imposer deux types de contraintes : (1) nous fixons le même nombre de cluster pour les individus et les variables; (2) nous cherchons une structure de la matrice de données d'origine qui possède les valeurs maximales sur sa diagonale (par exemple pour le cas des données binaires, on cherche des blocs diagonaux majoritairement composés de valeurs 1, et de 0 à l’extérieur de la diagonale). Les approches proposées bénéficient des garanties de convergence issues des résultats des chapitres précédents. Enfin, pour chaque chapitre, nous dérivons des algorithmes permettant d'obtenir des partitions dures et floues. Nous évaluons nos contributions sur un large éventail de données simulées et liées a des applications réelles telles que le text mining, dont les données peuvent être binaires ou continues. Ces expérimentations nous permettent également de mettre en avant les avantages et les inconvénients des différentes approches proposées. Pour conclure, nous pensons que cette thèse couvre explicitement une grande majorité des scénarios possibles découlant du co-clustering flou et dur, et peut être vu comme une généralisation de certaines approches de biclustering populaires. 
/ With the increasing amount of data available, unsupervised learning has become an important tool for discovering underlying patterns without the need to label instances manually. Among the approaches proposed to tackle this problem, clustering is arguably the most popular. Clustering is usually based on the assumption that each group, also called a cluster, is distributed around a center defined in terms of all features; in some real-world applications dealing with high-dimensional data, this assumption may be false. For such cases, co-clustering algorithms were proposed to describe clusters by the subsets of features that are most relevant to them. The resulting latent structure of the data is composed of blocks, usually called co-clusters. In the first two chapters, we describe two co-clustering methods that differentiate the relevance of features according to their capability of revealing the latent structure of the data, in a probabilistic framework and in a distance-based framework. The probabilistic approach uses the mixture model framework, where the irrelevant features are assumed to follow a probability distribution whose parameters are independent of the co-clustering structure. The distance-based (also called metric-based) approach relies on an adaptive metric in which each variable is assigned a weight that defines its contribution to the resulting co-clustering. From a theoretical point of view, we show the global convergence of the proposed algorithms using Zangwill's convergence theorem. In the last two chapters, we consider a special case of co-clustering where, contrary to the original setting, each subset of instances is described by a unique subset of features, resulting in a block-diagonal structure of the reorganized data matrix. As for the first two contributions, we consider both probabilistic and metric-based approaches. 
The main idea of the proposed contributions is to impose two different kinds of constraints: (1) we set the number of row clusters equal to the number of column clusters; (2) we seek a structure of the original data matrix that has the maximum values on its diagonal (for instance, for binary data, we look for diagonal blocks composed mostly of ones, with zeros outside the main diagonal). The proposed approaches enjoy the convergence guarantees derived from the results of the previous chapters. Finally, we present both hard and fuzzy versions of the proposed algorithms. We evaluate our contributions on a wide variety of synthetic and real-world benchmark data sets, binary and continuous, related to applications such as text mining, and analyze the advantages and drawbacks of each approach. To conclude, we believe that this thesis explicitly covers a vast majority of the possible scenarios arising in hard and fuzzy co-clustering and can be seen as a generalization of some popular biclustering approaches. Classification Flou Classification croisée Modèle de mélange Approche métrique Modèle à bloc latent Données sparses Données binaires Classification de document Théorème de Zangwill Sélection de variable Données en grande dimension Algorithme Clustering Fuzzy Co-clustering Mixture model Metric approach Latent block model Sparse data Binary data Document clustering Zangwill theorem Feature selection High dimensional data Algorithm 004
108	Von Mises-Fisher based (co-)clustering for high-dimensional sparse data : application to text and collaborative filtering data / Modèles de mélange de von Mises-Fisher pour la classification simple et croisée de données éparses de grande dimension Salah, Aghiles 21 November 2016 (has links) La classification automatique, qui consiste à regrouper des objets similaires au sein de groupes, également appelés classes ou clusters, est sans aucun doute l’une des méthodes d’apprentissage non-supervisé les plus utiles dans le contexte du Big Data. En effet, avec l’expansion des volumes de données disponibles, notamment sur le web, la classification ne cesse de gagner en importance dans le domaine de la science des données pour la réalisation de différentes tâches, telles que le résumé automatique, la réduction de dimension, la visualisation, la détection d’anomalies, l’accélération des moteurs de recherche, l’organisation d’énormes ensembles de données, etc. De nombreuses méthodes de classification ont été développées à ce jour, ces dernières sont cependant fortement mises en difficulté par les caractéristiques complexes des ensembles de données que l’on rencontre dans certains domaines d’actualité tel que le Filtrage Collaboratif (FC) et de la fouille de textes. Ces données, souvent représentées sous forme de matrices, sont de très grande dimension (des milliers de variables) et extrêmement creuses (ou sparses, avec plus de 95% de zéros). En plus d’être de grande dimension et sparse, les données rencontrées dans les domaines mentionnés ci-dessus sont également de nature directionnelles. En effet, plusieurs études antérieures ont démontré empiriquement que les mesures directionnelles, telle que la similarité cosinus, sont supérieurs à d’autres mesures, telle que la distance Euclidiennes, pour la classification des documents textuels ou pour mesurer les similitudes entre les utilisateurs/items dans le FC. 
Cela suggère que, dans un tel contexte, c’est la direction d’un vecteur de données (e.g., représentant un document texte) qui est pertinente, et non pas sa longueur. Il est intéressant de noter que la similarité cosinus est exactement le produit scalaire entre des vecteurs unitaires (de norme 1). Ainsi, d’un point de vue probabiliste l’utilisation de la similarité cosinus revient à supposer que les données sont directionnelles et réparties sur la surface d’une hypersphère unité. En dépit des nombreuses preuves empiriques suggérant que certains ensembles de données sparses et de grande dimension sont mieux modélisés sur une hypersphère unité, la plupart des modèles existants dans le contexte de la fouille de textes et du FC s’appuient sur des hypothèses populaires : distributions Gaussiennes ou Multinomiales, qui sont malheureusement inadéquates pour des données directionnelles. Dans cette thèse, nous nous focalisons sur deux challenges d’actualité, à savoir la classification des documents textuels et la recommandation d’items, qui ne cesse d’attirer l’attention dans les domaines de la fouille de textes et celui du filtrage collaborative, respectivement. Afin de répondre aux limitations ci-dessus, nous proposons une série de nouveaux modèles et algorithmes qui s’appuient sur la distribution de von Mises-Fisher (vMF) qui est plus appropriée aux données directionnelles distribuées sur une hypersphère unité. / Cluster analysis or clustering, which aims to group together similar objects, is undoubtedly a very powerful unsupervised learning technique. With the growing amount of available data, clustering is increasingly gaining in importance in various areas of data science for several reasons such as automatic summarization, dimensionality reduction, visualization, outlier detection, speed up research engines, organization of huge data sets, etc. 
Existing clustering approaches are, however, severely challenged by the high dimensionality and extreme sparsity of the data sets arising in some current areas of interest, such as Collaborative Filtering (CF) and text mining. Such data often consist of thousands of features and more than 95% zero entries. In addition to being high dimensional and sparse, the data sets encountered in the aforementioned domains are also directional in nature. In fact, several previous studies have empirically demonstrated that directional measures such as the cosine similarity, which compare objects by the angle between them, are substantially superior to other measures such as Euclidean distortions for clustering text documents or assessing the similarities between users/items in CF. This suggests that in such contexts only the direction of a data vector (e.g., a text document) is relevant, not its magnitude. It is worth noting that the cosine similarity is exactly the scalar product between unit-length data vectors, i.e., L2-normalized vectors. Thus, from a probabilistic perspective, using the cosine similarity is equivalent to assuming that the data are directional data distributed on the surface of a unit hypersphere. Despite the substantial empirical evidence that certain high-dimensional sparse data sets, such as those encountered in the above domains, are better modeled as directional data, most existing models in text mining and CF are based on popular assumptions such as Gaussian, Multinomial or Bernoulli distributions, which are inadequate for L2-normalized data. In this thesis, we focus on the two challenging tasks of text document clustering and item recommendation, which are still attracting a lot of attention in the domains of text mining and CF, respectively. 
In order to address the above limitations, we propose a suite of new models and algorithms which rely on the von Mises-Fisher (vMF) assumption that arises naturally for directional data lying on a unit-hypersphere. Apprentissage statistique Classification Classification croisée Modèles de mélanges Statistiques directionnelles Distribution de von Mises-Fisher Fouille de textes Systèmes de recommandation Filtrage collaboratif Matrices creuses Grande dimension Machine learning Clustering Co-clustering Mixture models Directional statistics Von Mises-Fisher distribution Text mining Recommender systems Collaborative filtering Sparse data High dimensional data 003.3
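The remark in the abstract above that cosine similarity is exactly the scalar product of L2-normalized vectors can be checked numerically. The following is a minimal sketch; the helper names `cosine_similarity` and `l2_normalize` and the two example vectors are hypothetical, chosen only for illustration.

```python
import numpy as np


def cosine_similarity(x, y):
    """Cosine of the angle between two nonzero vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


def l2_normalize(x):
    """Rescale a nonzero vector to unit Euclidean length."""
    return x / np.linalg.norm(x)


# Two small sparse vectors standing in for term-count documents (hypothetical).
x = np.array([3.0, 0.0, 4.0, 0.0])
y = np.array([1.0, 2.0, 2.0, 0.0])

cos_xy = cosine_similarity(x, y)

# The same quantity, computed as a plain dot product of unit-length vectors.
dot_unit = float(l2_normalize(x) @ l2_normalize(y))

# Rescaling a vector changes its magnitude but not its direction,
# so the cosine similarity is unchanged.
cos_scaled = cosine_similarity(10.0 * x, y)
```

Here `cos_xy`, `dot_unit` and `cos_scaled` all agree (up to floating-point rounding), which is the sense in which only the direction of a data vector matters once the data are placed on the unit hypersphere.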
109	Augmenting High-Dimensional Data with Deep Generative Models / Högdimensionell dataaugmentering med djupa generativa modeller Nilsson, Mårten January 2018 Data augmentation is a technique that can be performed in various ways to improve the training of discriminative models. Recent developments in deep generative models offer new ways of augmenting existing data sets. In this thesis, a framework for augmenting annotated data sets with deep generative models is proposed, together with a method for quantitatively evaluating the quality of the generated data sets. Using this framework, two data sets for pupil localization were generated with different generative models, including both well-established models and a novel model proposed for this purpose. The novel model was shown both qualitatively and quantitatively to generate the best data sets. A set of smaller experiments on standard data sets also revealed cases where this generative model could improve the performance of an existing discriminative model. The results indicate that generative models can be used to augment or replace existing data sets when training discriminative models. / Dataaugmentering är en teknik som kan utföras på flera sätt för att förbättra träningen av diskriminativa modeller. De senaste framgångarna inom djupa generativa modeller har öppnat upp nya sätt att augmentera existerande dataset. I detta arbete har ett ramverk för augmentering av annoterade dataset med hjälp av djupa generativa modeller föreslagits. Utöver detta så har en metod för kvantitativ evaulering av kvaliteten hos genererade data set tagits fram. Med hjälp av detta ramverk har två dataset för pupillokalisering genererats med olika generativa modeller. Både väletablerade modeller och en ny modell utvecklad för detta syfte har testats. Den unika modellen visades både kvalitativt och kvantitativt att den genererade de bästa dataseten. 
Ett antal mindre experiment på standardiserade dataset visade exempel på fall där denna generativa modell kunde förbättra prestandan hos en existerande diskriminativ modell. Resultaten indikerar att generativa modeller kan användas för att augmentera eller ersätta existerande dataset vid träning av diskriminativa modeller. GAN GANs machine learning deep learning generative model generative models deep generative model deep generative models generative adversarial networks VAE VAEs variational autoencoder variational autoencoders autoencoder auto encoder encoder decoder computer vision eye tracking pupil localization pupil eyes eye synthetic data big data data generation synthetic data generation neural networks neural network high-dimensional data high-resolution images. Computer Sciences Datavetenskap (datalogi)
