Global ETD Search

201	Contributions to unsupervised learning from massive high-dimensional data streams : structuring, hashing and clustering / Contributions à l'apprentissage non supervisé à partir de flux de données massives en grande dimension : structuration, hashing et clustering
Morvan, Anne, 12 November 2018
This thesis studies how to perform unsupervised machine learning efficiently, focusing on two fundamentally linked tasks, nearest neighbor search and clustering, under time and space constraints for massive high-dimensional datasets. First, a new theoretical framework reduces the space cost and increases the throughput of the data-independent Cross-polytope LSH for approximate nearest neighbor search, with almost no loss of accuracy. Second, a novel streaming, data-dependent method is designed to learn compact binary codes from high-dimensional data points in only one pass. Besides some theoretical guarantees, the quality of the obtained embeddings is assessed on the approximate nearest neighbor search task. Finally, a space-efficient, parameter-free clustering algorithm is devised, based on the recovery of an approximate Minimum Spanning Tree of the sketched data dissimilarity graph, on which suitable cuts are performed.
Keywords: Unsupervised learning; Nearest neighbors search; Streaming; Clustering; Approximation; Dimensionality reduction; Hashing; Sketching
005.7
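As a rough illustration of the hashing scheme named in record 201, the sketch below shows the basic cross-polytope hash: a point is multiplied by a random Gaussian matrix and mapped to the closest signed basis vector. It is not the thesis's space-reduced variant; the function names, the seed, and the dense matrix are assumptions made for this example.

```python
import random


def make_gaussian_matrix(dim, seed=0):
    """Draw a dim x dim matrix of i.i.d. standard Gaussians (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def cross_polytope_hash(x, matrix):
    """Return a bucket in 0 .. 2*dim-1: the signed basis vector closest to matrix @ x."""
    projected = [sum(m * v for m, v in zip(row, x)) for row in matrix]
    # The closest vertex of the cross-polytope is the coordinate of largest magnitude.
    best = max(range(len(projected)), key=lambda i: abs(projected[i]))
    # Bucket i stands for +e_i, bucket i + dim for -e_i.
    return best if projected[best] >= 0 else best + len(projected)
```

Because only the direction of the projected point matters, rescaling a point leaves its bucket unchanged, and negating it moves it to the opposite vertex.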
202	Une approche basée sur les motifs fermés pour résoudre le problème de clustering par consensus / A closed patterns-based approach to the consensus clustering problem
Al-Najdi, Atheer, 30 November 2016
Clustering is the process of partitioning a dataset into groups so that the instances in the same group are more similar to each other than to instances in any other group. Many clustering algorithms have been proposed, but none of them has proved to provide a good-quality partition in all situations. Consensus clustering aims to improve the clustering process by combining different partitions obtained from different algorithms to yield a better-quality consensus solution. In this work, a new consensus clustering method, called MultiCons, is proposed. It uses frequent closed itemset mining to discover the similarities between the different base clustering solutions. The identified similarities are represented as clustering patterns, each of which defines an agreement between a set of base clusters on grouping a set of instances. By dividing these patterns into groups based on the number of base clusters that define each pattern, MultiCons generates a consensus solution from each group, yielding multiple consensus candidates. These solutions are presented in a tree-like structure, called ConsTree, that makes it easier to understand how the multiple consensuses are built, as well as the relationships between the data instances and their structure in the data space. Five consensus functions are proposed in this work to build a consensus solution from the clustering patterns. Approach 1 simply merges any intersecting clustering patterns. Approach 2 can either merge or split intersecting patterns based on a proposed measure, called the intersection ratio.
Keywords: Clustering; Unsupervised learning; Consensus clustering; Clusterings ensemble; Frequent closed itemsets
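The first consensus function in record 202 (merge any intersecting clustering patterns) can be sketched as follows. This is only an illustration under simplifying assumptions: each pattern is reduced to its set of instance identifiers, and the function name is hypothetical. The intersection ratio used by Approach 2 is not defined in the abstract, so it is not modeled here.

```python
def merge_intersecting(patterns):
    """Union instance sets that share at least one instance until all are disjoint."""
    clusters = []  # invariant: the sets in this list are pairwise disjoint
    for pattern in patterns:
        merged = set(pattern)
        remaining = []
        for cluster in clusters:
            if cluster & merged:
                merged |= cluster  # absorb every existing cluster it touches
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]
    return clusters
```

For example, the instance sets {1, 2}, {2, 3}, {4, 5}, {6} and {5, 6} collapse into two consensus clusters, {1, 2, 3} and {4, 5, 6}.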
203	Användandet av algoritmer inom investeringar kopplat till OMX30 : Tillämpning av maskininlärning inom portföljhantering: En K-Betydelsemetod [The use of algorithms in investing linked to the OMX30: applying machine learning in portfolio management: a k-means method]
Larsson Olsson, Simon, January 2020
Many investors use different types of data analysis methods before making a decision, whether it is long or short term. The choice of analysis method is generally determined by risk, removal of bias and cost. One method that has been investigated is the use of machine learning in data analysis. The advantage of machine learning is that it successfully handles complex, non-linear and non-stationary problems. This essay investigates whether unsupervised machine learning using the k-means method, which has not been investigated to any great extent either in practice or in theory, can create a beneficial portfolio. The data used for the k-means method was historical data from the Swedish stock market between 1 January 2018 and 2 November 2020. The k-means analysis used the return of all shares included in the OMX30 and the average deviation, and produced a cluster of 11 shares that could generate a relatively high return compared to the remaining shares. To analyze whether the generated cluster was acceptable, an analysis of the Sharpe ratio and downside risk was performed, which showed that the portfolio had a good risk-adjusted return but a worse result on downside risk.
Keywords: Machine learning; k-means; unsupervised learning; stock market; OMX30; portfolio; diversification
Subject: Business Administration
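The two evaluation measures named in record 203, the Sharpe ratio and downside ("downward") risk, might be computed roughly as below. The abstract does not give the exact definitions used, so the formulas are common textbook forms and should be read as assumptions: the Sharpe ratio as mean excess return over its sample standard deviation, and downside risk as the root mean squared shortfall below a target. The function names and the sample returns are hypothetical.

```python
import statistics


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample standard deviation of excess returns."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


def downside_deviation(returns, target=0.0):
    """Root mean square of the shortfalls below the target return."""
    shortfalls = [min(0.0, r - target) ** 2 for r in returns]
    return (sum(shortfalls) / len(returns)) ** 0.5


# Hypothetical periodic returns for one cluster of shares.
cluster_returns = [0.01, 0.03, -0.02, 0.02]
print(sharpe_ratio(cluster_returns), downside_deviation(cluster_returns))
```

A portfolio can score well on the first measure and poorly on the second, because the Sharpe ratio penalizes all volatility while downside deviation counts only losses.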
204	Designing an Interactive tool for Cluster Analysis of Clickstream Data
Collin, Sara; Möllerberg, Ingrid, January 2020
The purpose of this study was to develop an interactive tool that enables identification of different types of users of an application based on clickstream data. A complex hierarchical clustering algorithm tool called Recursive Hierarchical Clustering (RHC) was used. RHC provides a visualisation of user types as clusters, where each cluster has its own distinguishing action pattern, i.e., one or several consecutive actions made by the user in the application. A case study was conducted on the mobile application Plick, which is an application for selling and buying second-hand clothes. During the course of the project, the analysis and its results were found to be difficult for the operators of the tool to understand. The interactive tool had to be extended to visualise the complex analysis and its results in an intuitive way. A literature study of how humans interpret information, and how to present it to operators, was conducted and led to a redesign of the tool. More information was added to each cluster to enable further understanding of the clustering results. A clustering reconfiguration option was also created, giving operators of the tool the possibility to interact with the analysis. In the reconfiguration, the operator could change the input file of the cluster analysis and thus the end result. Usability tests showed that the added information about the clusters served as an amplification and a verification of the original results presented by RHC. In some cases the original result presented by RHC was used to verify a user group identification that the operator had made solely from the added information. The usability tests showed that the complex analysis and its results could be understood and configured without considerable comprehension of the algorithm. Instead, the tool could be used successfully to identify user types with the help of visual cues in the interface and default settings in the reconfiguration. The visualisation tool is shown to be successful in identifying and visualising user groups in an intuitive way.
Keywords: Hierarchical clustering; Unsupervised learning; User segmentation; Cluster visualization; Interactive tool; Cluster analysis; Clickstream; Interface design
Subject: Engineering and Technology
205	Automated sleep scoring using unsupervised learning of meta-features / Automatiserad sömnmätning med användning av oövervakad inlärning av meta-särdrag
Olsson, Sebastian, January 2016
Sleep is an important part of life as it affects the performance of one's activities during all waking hours. The study of sleep and wakefulness is therefore of great interest, particularly to the clinical and medical fields where sleep disorders are diagnosed. When studying sleep, it is common to talk about different types, or stages, of sleep. A common task in sleep research is to determine the sleep stage of the sleeping subject as a function of time. This process is known as sleep stage scoring. In this study, I seek to determine whether there is any benefit to using unsupervised feature learning in the context of electroencephalogram-based (EEG) sleep scoring. More specifically, the effect of generating and making use of new feature representations for hand-crafted features of sleep data (meta-features) is studied. For this purpose, two scoring algorithms have been implemented and compared. Both scoring algorithms involve segmentation of the EEG signal, feature extraction, feature selection and classification using a support vector machine (SVM). Unsupervised feature learning was implemented in the form of a dimensionality-reducing deep belief network (DBN) through which the feature space was processed. Both scorers were shown to have a classification accuracy of about 76 %. The application of unsupervised feature learning did not affect the accuracy significantly. It is speculated that with a better choice of parameters for the DBN in possible future work, the accuracy may improve significantly.
Keywords: EEG; sleep scoring; support vector machines; deep belief networks; AASM; unsupervised learning; feature extraction; genetic algorithms; meta-features
Subject: Computer Sciences
206	Semantic-Driven Unsupervised Image-to-Image Translation for Distinct Image Domains
Ackerman, Wesley, 15 September 2020
We expand the scope of image-to-image translation to include more distinct image domains, where the image sets have analogous structures, but may not share object types between them. Semantic-Driven Unsupervised Image-to-Image Translation for Distinct Image Domains (SUNIT) is built to more successfully translate images in this setting, where content from one domain is not found in the other. Our method trains an image translation model by learning encodings for semantic segmentations of images. These segmentations are translated between image domains to learn meaningful mappings between the structures in the two domains. The translated segmentations are then used as the basis for image generation. Beginning image generation with encoded segmentation information helps maintain the original structure of the image. We qualitatively and quantitatively show that SUNIT improves image translation outcomes, especially for image translation tasks where the image domains are very distinct.
Keywords: computer science; machine learning; image-to-image translation; generative adversarial network; deep learning; unsupervised learning; convolutional neural network
Subject: Physical Sciences and Mathematics
207	Classify part of day and snow on the load of timber stacks : A comparative study between partitional clustering and competitive learning
Nordqvist, My, January 2021
In today's society, companies are trying to find ways to utilize all the data they have, which contains valuable information and insights for making better decisions. This includes data used to keep track of timber that flows between forest and industry. The growth of Artificial Intelligence (AI) and Machine Learning (ML) has enabled the development of ML models that automate the measurement of timber on timber trucks, based on images. However, to improve the results, information must be extracted from unlabeled images in order to determine weather and lighting conditions. The objective of this study is to carry out an extensive investigation into classifying unlabeled images into the categories daylight, darkness, and snow on the load. A comparative study between partitional clustering and competitive learning is conducted to investigate which method gives the best results in terms of different clustering performance metrics. It also examines how dimensionality reduction affects the outcome. The algorithms K-means and Kohonen Self-Organizing Map (SOM) are selected for the clustering. Each model is investigated according to the number of clusters, size of dataset, clustering time, clustering performance, and manual samples from each cluster. The results indicate a noticeable discrepancy in clustering performance between the algorithms with respect to the number of clusters, dataset size, and manual samples. The use of dimensionality reduction led to shorter clustering time but slightly worse clustering performance. The evaluation results further show that the clustering time of Kohonen SOM is significantly higher than that of K-means.
Keywords: Machine Learning (ML); Unsupervised Learning; Cluster Analysis; Partitional Clustering; Competitive Learning; Dimensionality Reduction; Principal Component Analysis (PCA); K-means; Kohonen Self-Organizing Map (SOM); Timber
Subject: Computer Engineering
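For readers unfamiliar with the partitional method compared in record 207, here is a minimal sketch of K-means (Lloyd's algorithm). It is not the study's pipeline: the inputs are hypothetical 2-D points rather than image features, the initial centers are simply the first k points, and the function names are made up for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Alternate nearest-center assignment and center recomputation."""
    centers = [list(p) for p in points[:k]]  # naive deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centers
```

On two well-separated groups such as (0, 0), (0, 1), (1, 0) and (10, 10), (10, 11), (11, 10), the assignment settles after the first pass. A self-organizing map differs in that it updates neighboring units together, which is one reason its clustering time can be higher.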
208	Involving behavior in the formation of sensory representations
Weiller, Daniel, 07 July 2009
Neurons are sensitive to specific aspects of natural stimuli, which, according to different statistical criteria, form an optimal representation of the natural sensory input. Since these representations are purely sensory, it is still an open question whether they are suited to generating meaningful behavior. Here we introduce an optimization scheme that applies a statistical criterion to an agent's sensory input while taking its motor behavior into account. We first introduce a general cognitive model, and then develop an optimization scheme that increases the predictability of the sensory outcome of the agent's motor actions and apply it to a navigational paradigm. In the cognitive model, place cells divide the environment into discrete states, similar to hippocampal place cells. The agent learns the sensory outcome of its actions through the state-to-state transition probabilities, and the extent to which these motor actions are caused by sensory-driven reflexive behavior (obstacle avoidance). Navigational decision making integrates both learned components to derive the actions that are most likely to lead to a navigational goal. Next, we introduce an optimization process that modifies the state distributions to increase the predictability of the sensory outcome of the agent's actions. The cognitive model successfully performs the navigational task, and the differentiation between transitions and reflexive processing increases both behavioral accuracy and behavioral adaptation to changes in the environment. Further, the optimized sensory states are similar to place fields found in behaving animals. The spatial distribution of states depends on the agent's motor capabilities as well as on the environment. We demonstrated the generality of predictability as a coding principle by comparing it to existing ones. Our results suggest that the agent's motor apparatus can play a profound role in the formation of place fields and thus in higher sensory representations.
Keywords: sensory coding; sensorimotor space; place cell; predictability; unsupervised learning; sensory representation; navigation; reflex; four-arm-maze task
ddc:610
209	Unsupervised word discovery for computational language documentation / Découverte non-supervisée de mots pour outiller la linguistique de terrain
Godard, Pierre, 16 April 2019
Language diversity is under considerable pressure: half of the world's languages could disappear by the end of this century. This realization has sparked many initiatives in documentary linguistics in the past two decades, and 2019 has been proclaimed the International Year of Indigenous Languages by the United Nations, to raise public awareness of the issue and foster initiatives for language documentation and preservation. Yet documentation and preservation are time-consuming processes, and the supply of field linguists is limited. Consequently, the emerging field of computational language documentation (CLD) seeks to assist linguists by providing them with automatic processing tools. The Breaking the Unwritten Language Barrier (BULB) project, for instance, constitutes one of the efforts defining this new field, bringing together linguists and computer scientists. This thesis examines the particular problem of discovering words in an unsegmented stream of characters, or phonemes, transcribed from speech in a very-low-resource setting. This primarily involves a segmentation procedure, which can also be paired with an alignment procedure when a translation is available. Using two realistic Bantu corpora for language documentation, one in Mboshi (Republic of the Congo) and the other in Myene (Gabon), we benchmark various monolingual and bilingual unsupervised word discovery methods. We then show that using expert knowledge in the Adaptor Grammar framework can vastly improve segmentation results, and we indicate ways to use this framework as a decision tool for the linguist. We also propose a tonal variant for a strong nonparametric Bayesian segmentation algorithm, making use of a modified backoff scheme designed to capture tonal structure. Finally, to leverage the weak supervision given by a translation, we propose and extend an attention-based neural segmentation method, significantly improving the segmentation performance of an existing bilingual method.
Keywords: Unsupervised learning; Automatic word segmentation; Bilingual alignment; Bayesian models; Low-resource languages
210	Quelques applications de l’optimisation numérique aux problèmes d’inférence et d’apprentissage / Few applications of numerical optimization in inference and learning
Kannan, Hariprasad, 28 September 2018
Numerical optimization and machine learning have had a fruitful relationship, from the perspective of both theory and application. In this thesis, we present an application-oriented take on some inference and learning problems. Linear programming relaxations are central to maximum a posteriori (MAP) inference in discrete Markov Random Fields (MRFs). In particular, inference in higher-order MRFs presents challenges in terms of efficiency, scalability and solution quality. In this thesis, we study the benefit of using Newton methods to efficiently optimize the Lagrangian dual of a smooth version of the problem. We investigate their ability to achieve superior convergence behavior and to better handle the ill-conditioned nature of the formulation, as compared to first-order methods. We show that it is indeed possible to obtain an efficient trust region Newton method, which uses the true Hessian, for a broad range of MAP inference problems. Given the specific opportunities and challenges in the MAP inference formulation, we present details concerning (i) efficient computation of the Hessian and Hessian-vector products, (ii) a strategy to damp the Newton step that aids efficient and correct optimization, (iii) steps to improve the efficiency of the conjugate gradient method through a truncation rule and a pre-conditioner. We also demonstrate through numerical experiments how a quasi-Newton method could be a good choice for MAP inference in large graphs. MAP inference based on a smooth formulation could greatly benefit from efficient sum-product computation, which is required for computing the gradient and the Hessian. We show a way to perform sum-product computation for trees with sparse clique potentials. This result could also be readily used by other algorithms. We show results demonstrating the usefulness of our approach using higher-order MRFs. Then, we discuss potential research topics regarding tightening the LP relaxation and parallel algorithms for MAP inference.
Unsupervised learning is an important topic in machine learning and it could potentially help high-dimensional problems like inference in graphical models. We show a general framework for unsupervised learning based on optimal transport and sparse regularization. Optimal transport presents interesting challenges from an optimization point of view with its simplex constraints on the rows and columns of the transport plan. We show one way to formulate efficient optimization problems inspired by optimal transport. This could be done by imposing only one set of the simplex constraints and by imposing structure on the transport plan through sparse regularization. We show how unsupervised learning algorithms like exemplar clustering, center-based clustering and kernel PCA could fit into this framework based on different forms of regularization. We especially demonstrate a promising approach to address the pre-image problem in kernel PCA. Several methods have been proposed over the years, which generally assume certain types of kernels, have too many hyper-parameters, or make restrictive approximations of the underlying geometry. We present a more general method, with only one hyper-parameter to tune and with some interesting geometric properties. From an optimization point of view, we show how to compute the gradient of a smooth version of the Schatten p-norm and how it can be used within a majorization-minimization scheme. Finally, we present results from our various experiments.
Keywords: Graphical models; Machine learning; Computer vision; Unsupervised learning; Numerical optimization; MAP inference
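To make the Schatten p-norm remark in record 210 concrete, one common way to smooth the norm is shown below together with its gradient. The abstract does not state which smoothing the thesis uses, so the form of f_epsilon, the parameter epsilon, and the symbol names are assumptions for illustration only.

```latex
% Assumed smoothing, with X \in \mathbb{R}^{m \times n}, \epsilon > 0 and p > 0.
\[
  f_\epsilon(X) = \operatorname{tr}\!\left[ \left( X^\top X + \epsilon I \right)^{p/2} \right],
  \qquad
  \nabla f_\epsilon(X) = p \, X \left( X^\top X + \epsilon I \right)^{p/2 - 1}.
\]
% As \epsilon \to 0, f_\epsilon(X) tends to \sum_i \sigma_i(X)^p, the p-th power of the Schatten p-norm.
```

The gradient follows from the trace rule d tr g(A) = tr(g'(A) dA) with A = X^T X + epsilon I and dA = dX^T X + X^T dX. Since the gradient is available in closed form, a smooth surrogate of this kind can serve as the function that is majorized at each step of a majorization-minimization scheme.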
