  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

Data mining in large sets of complex data

Robson Leonardo Ferreira Cordeiro, 29 August 2011
Due to the increasing amount and complexity of the data stored in enterprises' databases, knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs, which come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection, and missing-data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element per tuple. Although the same tasks are also necessary for more complex data, such as images, graphs, audio, and long texts, the complexity and computational cost of handling large amounts of such data increase considerably, making most existing techniques impractical. Special data mining techniques for this kind of data therefore need to be developed.

This Ph.D. work focuses on new data mining techniques for large sets of complex data, especially for the task of clustering, tightly associated with the other data mining tasks performed alongside it. Specifically, this doctoral dissertation presents three novel, fast, and scalable data mining algorithms well suited to analyzing large sets of complex data: Halite, for correlation clustering; BoW, for clustering Terabyte-scale datasets; and QMAS, for labeling and summarization. The algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always produced highly accurate results while being at least one order of magnitude faster than the fastest related work in almost all cases. The real data come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! as well as on the graph of all users and their connections in the Twitter social network. These results indicate that the algorithms enable real-time applications that, potentially, could not be built without this work, such as software to aid the diagnosis process on the fly in a worldwide healthcare information system, or a system to watch for deforestation in the Amazon Rainforest in real time.
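The basic clustering task that these methods scale up to complex data can be illustrated with a deliberately simple sketch: plain k-means on synthetic 2-D features. This is NOT the Halite, BoW, or QMAS algorithm (the abstract does not describe their internals); it is only a generic stand-in for the task.

```python
import numpy as np

# Generic k-means sketch -- a stand-in for the clustering task, not the
# thesis's own algorithms (Halite / BoW / QMAS internals are not given here).

def kmeans(X, k, n_iter=50):
    # deterministic farthest-point initialization
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[dists.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),   # tight blob around (0, 0)
               rng.normal(3.0, 0.3, (50, 2))])  # tight blob around (3, 3)
labels = kmeans(X, k=2)
```

On well-separated blobs like these, the two recovered clusters coincide with the two generating groups; the thesis's contribution is doing this kind of grouping at Terabyte scale on complex, non-tabular data.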

The Linear Lasso: a method for low- and high-dimensional data in linear regression

Watts, Yan 04 1900
In this thesis, we are interested in a geometric way of looking at the Lasso method in the context of linear regression. The Lasso simultaneously estimates the coefficients associated with the predictors and selects the predictors that are important for explaining the response variable; the coefficients are computed algorithmically. Despite its virtues, the Lasso is forced to select at most n variables in high-dimensional settings (p > n). Moreover, within a group of correlated variables, the Lasso selects one variable essentially at random, without regard for which one. To address these two problems, we turn to the Linear Lasso. The response vector is seen as the focal point of the space, and all explanatory-variable vectors orbit around it. The angles formed between the response vector and the explanatory variables are assumed fixed and serve as the basis for constructing the method. The information contained in the explanatory variables is projected onto the response vector, and the theory of normal linear models allows us to use ordinary least squares (OLS) for the coefficients of the Linear Lasso.

The Linear Lasso (LL) proceeds in two steps. First, variables are dropped from (or ordered in) the model based on their correlation with the response variable; the number of variables dropped at this step depends on a tuning parameter γ. Then, an exclusion criterion based on the variance of the distribution of the response variable is introduced to remove (or order) the remaining variables. Repeated cross-validation guides the choice of the final model. Simulations study the algorithm for different values of the tuning parameter γ, and comparisons are made between the Linear Lasso and competing methods in small dimensions (Ridge, Lasso, SCAD, etc.). Improvements to the implementation are suggested, for example using the 1se rule to obtain more parsimonious models. An implementation of the LL algorithm is provided in the R function linlasso, available at https://github.com/yanwatts/linlasso.

Regularisation and variable selection using penalized likelihood

El anbari, Mohammed 14 December 2011
In this thesis we are interested in variable selection for linear regression models. This work is motivated in particular by recent developments in genomics, proteomics, biomedical imaging, signal and image processing, marketing, and other fields. We study the problem from both the frequentist and the Bayesian point of view.

In a frequentist framework, we propose methods to deal with variable selection when the number of variables can be much larger than the sample size, possibly with additional structure among the predictor variables, such as strong correlations or an ordering between successive variables. The theoretical performance of the proposed methods is investigated; we prove that, under regularity conditions, the proposed estimators enjoy good statistical properties, such as sparsity oracle inequalities, variable-selection consistency, and asymptotic normality.

In a Bayesian framework, we propose a global, noninformative approach to Bayesian variable selection built on Zellner's g-priors, in an approach similar but not identical to that of Liang et al. (2008); our choice requires no calibration. We pay special attention to two calibration-free hierarchical Zellner g-priors: the first is the Jeffreys prior, which is not location invariant; the second avoids this problem by considering only models containing at least one variable. The practical performance of the proposed methods is illustrated through numerical experiments on simulated and real-world datasets, including a comparison between the Bayesian and frequentist approaches in a low-information setting where the number of variables is almost equal to the number of observations.
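The frequentist penalized-likelihood baseline that such selection methods build on can be sketched with a minimal coordinate-descent Lasso. This is the textbook algorithm, not the thesis's own estimators or its Bayesian g-prior machinery, and the demo data are synthetic.

```python
import numpy as np

# Textbook cyclic coordinate descent for the Lasso objective
#   (1/2n) * ||y - X beta||^2 + lam * ||beta||_1
# -- a generic baseline, not the estimators proposed in the thesis.

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0): the one-dimensional Lasso solution
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual
            z = X[:, j] @ r / n
            beta[j] = soft_threshold(z, lam) / (X[:, j] @ X[:, j] / n)
    return beta

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n)     # only predictor 0 matters
beta = lasso_cd(X, y, lam=0.1)
```

The soft-thresholding update sets irrelevant coefficients exactly to zero, which is the "selection" half of penalized likelihood; the relevant coefficient survives, shrunk slightly toward zero by the penalty.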

Dimension Flexible and Adaptive Statistical Learning

Khowaja, Kainat 02 March 2023
As interdisciplinary research, this thesis couples statistical learning with current advanced methods to deal with high dimensionality and nonstationarity. Chapter 2 provides tools for statistical inference, uniformly over the covariate space, on the parameter functions of Generalized Random Forests, identified as the solution of a local moment condition. This is done either through high-dimensional Gaussian approximation theory or via a multiplier bootstrap; the theoretical aspects of both approaches are discussed in detail alongside extensive simulations and real-world applications. In Chapter 3, we extend the local parametric approach to time-varying Poisson processes, providing a tool to find intervals of homogeneity within time series of count data in a nonstationary setting. The methodology involves recursive likelihood-ratio tests whose maximal test statistic has an unknown distribution; to approximate it and find the critical value, we use a multiplier bootstrap, and we demonstrate the utility of the algorithm on German M&A data. Chapter 4 is concerned with creating low-dimensional approximations of high-dimensional data from dynamical systems. Using resampling methods, Principal Component Analysis, and interpolation techniques, we construct reduced-dimensional surrogate models that respond faster than the original high-fidelity models. In Chapter 5, we aim to link the distributional characteristics of cryptocurrencies to their underlying mechanisms. We use characteristic-based spectral clustering to group cryptocurrencies with similar behaviour in terms of price, block time, and block size, and scrutinize these clusters to find mechanisms common to the various crypto clusters.
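The characteristic-based spectral clustering step can be sketched as follows, with synthetic "coins" described by three characteristics standing in for price behaviour, block time, and block size. The features, kernel, and number of clusters used in the thesis are not given in the abstract, so everything below is an illustrative assumption.

```python
import numpy as np

# Hedged spectral-clustering sketch: Gaussian-kernel similarity over
# standardized characteristics, normalized graph Laplacian, sign split
# of the second eigenvector. Synthetic data; not the thesis's pipeline.

rng = np.random.default_rng(0)
# two synthetic groups of "cryptocurrencies" with distinct characteristics
A = rng.normal([1.0, 10.0, 0.5], 0.1, size=(20, 3))
B = rng.normal([5.0, 60.0, 2.0], 0.1, size=(20, 3))
F = np.vstack([A, B])
F = (F - F.mean(0)) / F.std(0)            # standardize each characteristic

# Gaussian-kernel similarity graph and symmetric normalized Laplacian
d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2)
D = W.sum(1)
L = np.eye(len(F)) - W / np.sqrt(D[:, None] * D[None, :])

# the second-smallest eigenvector (Fiedler-style) separates the two groups
vals, vecs = np.linalg.eigh(L)
embedding = vecs[:, 1]
labels = (embedding > 0).astype(int)
```

For two well-separated groups the similarity matrix is nearly block diagonal, so the sign pattern of the second eigenvector recovers the grouping; with more clusters one would keep several eigenvectors and run k-means on the embedding.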
