41 |
Iterated learning framework for unsupervised part-of-speech inductionChristodoulopoulos, Christos January 2013 (has links)
Computational approaches to linguistic analysis have been used for more than half a century. The main tools come from the field of Natural Language Processing (NLP) and are based on rule-based or corpora-based (supervised) methods. Despite the undeniable success of supervised learning methods in NLP, they have two main drawbacks: on the practical side, it is expensive to produce the manual annotation (or the rules) required and it is not easy to find annotators for less common languages. A theoretical disadvantage is that the computational analysis produced is tied to a specific theory or annotation scheme. Unsupervised methods offer the possibility to expand our analyses into more resourcepoor languages, and to move beyond the conventional linguistic theories. They are a way of observing patterns and regularities emerging directly from the data and can provide new linguistic insights. In this thesis I explore unsupervised methods for inducing parts of speech across languages. I discuss the challenges in evaluation of unsupervised learning and at the same time, by looking at the historical evolution of part-of-speech systems, I make the case that the compartmentalised, traditional pipeline approach of NLP is not ideal for the task. I present a generative Bayesian system that makes it easy to incorporate multiple diverse features, spanning different levels of linguistic structure, like morphology, lexical distribution, syntactic dependencies and word alignment information that allow for the examination of cross-linguistic patterns. I test the system using features provided by unsupervised systems in a pipeline mode (where the output of one system is the input to another) and show that the performance of the baseline (distributional) model increases significantly, reaching and in some cases surpassing the performance of state-of-the-art part-of-speech induction systems. I then turn to the unsupervised systems that provided these sources of information (morphology, dependencies, word alignment) and examine the way that part-of-speech information influences their inference. Having established a bi-directional relationship between each system and my part-of-speech inducer, I describe an iterated learning method, where each component system is trained using the output of the other system in each iteration. The iterated learning method improves the performance of both component systems in each task. Finally, using this iterated learning framework, and by using parts of speech as the central component, I produce chains of linguistic structure induction that combine all the component systems to offer a more holistic view of NLP. To show the potential of this multi-level system, I demonstrate its use ‘in the wild’. I describe the creation of a vastly multilingual parallel corpus based on 100 translations of the Bible in a diverse set of languages. Using the multi-level induction system, I induce cross-lingual clusters, and provide some qualitative results of my approach. I show that it is possible to discover similarities between languages that correspond to ‘hidden’ morphological, syntactic or semantic elements.
|
42 |
Geographic Relevance for Travel Search: The 2014-2015 Harvey Mudd College Clinic Project for Expedia, Inc.Long, Hannah 01 January 2015 (has links)
The purpose of this Clinic project is to help Expedia, Inc. expand the search capabilities it offers to its users. In particular, the goal is to help the company respond to unconstrained search queries by generating a method to associate hotels and regions around the world with the higher-level attributes that describe them, such as “family- friendly” or “culturally-rich.” Our team utilized machine-learning algorithms to extract metadata from textual data about hotels and cities. We focused on two machine-learning models: decision trees and Latent Dirichlet Allocation (LDA). The first appeared to be a promising approach, but would require more resources to replicate on the scale Expedia needs. On the other hand, we were able to generate useful results using LDA. We created a website to visualize these results.
|
43 |
Unsupervised learning for text-to-speech synthesisWatts, Oliver Samuel January 2013 (has links)
This thesis introduces a general method for incorporating the distributional analysis of textual and linguistic objects into text-to-speech (TTS) conversion systems. Conventional TTS conversion uses intermediate layers of representation to bridge the gap between text and speech. Collecting the annotated data needed to produce these intermediate layers is a far from trivial task, possibly prohibitively so for languages in which no such resources are in existence. Distributional analysis, in contrast, proceeds in an unsupervised manner, and so enables the creation of systems using textual data that are not annotated. The method therefore aids the building of systems for languages in which conventional linguistic resources are scarce, but is not restricted to these languages. The distributional analysis proposed here places the textual objects analysed in a continuous-valued space, rather than specifying a hard categorisation of those objects. This space is then partitioned during the training of acoustic models for synthesis, so that the models generalise over objects' surface forms in a way that is acoustically relevant. The method is applied to three levels of textual analysis: to the characterisation of sub-syllabic units, word units and utterances. Entire systems for three languages (English, Finnish and Romanian) are built with no reliance on manually labelled data or language-specific expertise. Results of a subjective evaluation are presented.
|
44 |
Supervision Beyond Manual Annotations for Learning Visual RepresentationsDoersch, Carl 01 April 2016 (has links)
For both humans and machines, understanding the visual world requires relating new percepts with past experience. We argue that a good visual representation for an image should encode what makes it similar to other images, enabling the recall of associated experiences. Current machine implementations of visual representations can capture some aspects of similarity, but fall far short of human ability overall. Even if one explicitly labels objects in millions of images to tell the computer what should be considered similar—a very expensive procedure—the labels still do not capture everything that might be relevant. This thesis shows that one can often train a representation which captures similarity beyond what is labeled in a given dataset. That means we can begin with a dataset that has uninteresting labels, or no labels at all, and still build a useful representation. To do this, we propose to using pretext tasks: tasks that are not useful in and of themselves, but serve as an excuse to learn a more general-purpose representation. The labels for a pretext task can be inexpensive or even free. Furthermore, since this approach assumes training labels differ from the desired outputs, it can handle output spaces where the correct answer is ambiguous, and therefore impossible to annotate by hand. The thesis explores two broad classes of supervision. The first isweak image-level supervision, which is exploited to train mid-level discriminative patch classifiers. For example, given a dataset of street-level imagery labeled only with GPS coordinates, patch classifiers are trained to differentiate one specific geographical region (e.g. the city of Paris) from others. The resulting classifiers each automatically collect and associate a set of patches which all depict the same distinctive architectural element. In this way, we can learn to detect elements like balconies, signs, and lamps without annotations. The second type of supervision requires no information about images other than the pixels themselves. Instead, the algorithm is trained to predict the context around image patches. The context serves as a sort of weak label: to predict well, the algorithm must associate similar-looking patches which also have similar contexts. After training, the feature representation learned using this within-image context indeed captures visual similarity across images, which ultimately makes it useful for real tasks like object detection and geometry estimation.
|
45 |
Clustering Via Supervised Support Vector MachinesMerat, Sepehr 07 August 2008 (has links)
An SVM-based clustering algorithm is introduced that clusters data with no a priori knowledge of input classes. The algorithm initializes by first running a binary SVM classifier against a data set with each vector in the set randomly labeled. Once this initialization step is complete, the SVM confidence parameters for classification on each of the training instances can be accessed. The lowest confidence data (e.g., the worst of the mislabeled data) then has its labels switched to the other class label. The SVM is then re-run on the data set (with partly re-labeled data). The repetition of the above process improves the separability until there is no misclassification. Variations on this type of clustering approach are shown.
|
46 |
Anomaly Detection in an e-Transaction System using Data Driven Machine Learning Models : An unsupervised learning approach in time-series dataAvdic, Adnan, Ekholm, Albin January 2019 (has links)
Background: Detecting anomalies in time-series data is a task that can be done with the help of data driven machine learning models. This thesis will investigate if, and how well, different machine learning models, with an unsupervised approach,can detect anomalies in the e-Transaction system Ericsson Wallet Platform. The anomalies in our domain context is delays on the system. Objectives: The objectives of this thesis work is to compare four different machine learning models ,in order to find the most relevant model. The best performing models are decided by the evaluation metric F1-score. An intersection of the best models are also being evaluated in order to decrease the number of False positives in order to make the model more precise. Methods: Investigating a relevant time-series data sample with 10-minutes interval data points from the Ericsson Wallet Platform was used. A number of steps were taken such as, handling data, pre-processing, normalization, training and evaluation.Two relevant features was trained separately as one-dimensional data sets. The two features that are relevant when finding delays in the system which was used in this thesis is the Mean wait (ms) and the feature Mean * N were the N is equal to the Number of calls to the system. The evaluation metrics that was used are True positives, True Negatives, False positives, False Negatives, Accuracy, Precision, Recall, F1-score and Jaccard index. The Jaccard index is a metric which will reveal how similar each algorithm are at their detection. Since the detection are binary, it’s classifying the each data point in the time-series data. Results: The results reveals the two best performing models regards to the F1-score.The intersection evaluation reveals if and how well a combination of the two best performing models can reduce the number of False positives. Conclusions: The conclusion to this work is that some algorithms perform better than others. It is a proof of concept that such classification algorithms can separate normal from non-normal behavior in the domain of the Ericsson Wallet Platform.
|
47 |
Categorizing conference room climate using K-meansAsp, Jin, Bergdahl, Saga January 2019 (has links)
Smart environments are increasingly common. By utilizing sensor data from the indoor environment and applying methods like machine learning, they can autonomously control and increase productivity, comfort, and well-being of occupants. The aim of this thesis was to model indoor climate in conference rooms and use K-means clustering to determine quality levels. Together, they enable categorization of conference room quality level during meetings. Theoretically, by alerts to the user, this may enhance occupant productivity, comfort, and well-being. Moreover, the objective was to determine which features and which k would produce the highest quality clusters given chosen evaluation measures. To do this, a quasi-experiment was used. CO2, temperature, and humidity sensors were placed in four conference rooms and were sampled continuously. K-means clustering was then used to generate clusters with 10 days of sensor data. To evaluate which feature combination and which k created optimal clusters, we used Silhouette, Davis Bouldin, and the Elbow method. The resulting model, using three clusters to represent quality levels, enabled categorization of the quality of specific meetings. Additionally, all three methods indicated that a feature combination of CO2 and humidity, with k = 2 or k = 3, was suitable.
|
48 |
Factor analysis of dynamic PET imagesCruz Cavalcanti, Yanna 31 October 2018 (has links) (PDF)
Thanks to its ability to evaluate metabolic functions in tissues from the temporal evolution of a previously injected radiotracer, dynamic positron emission tomography (PET) has become an ubiquitous analysis tool to quantify biological processes. Several quantification techniques from the PET imaging literature require a previous estimation of global time-activity curves (TACs) (herein called \textit{factors}) representing the concentration of tracer in a reference tissue or blood over time. To this end, factor analysis has often appeared as an unsupervised learning solution for the extraction of factors and their respective fractions in each voxel. Inspired by the hyperspectral unmixing literature, this manuscript addresses two main drawbacks of general factor analysis techniques applied to dynamic PET. The first one is the assumption that the elementary response of each tissue to tracer distribution is spatially homogeneous. Even though this homogeneity assumption has proven its effectiveness in several factor analysis studies, it may not always provide a sufficient description of the underlying data, in particular when abnormalities are present. To tackle this limitation, the models herein proposed introduce an additional degree of freedom to the factors related to specific binding. To this end, a spatially-variant perturbation affects a nominal and common TAC representative of the high-uptake tissue. This variation is spatially indexed and constrained with a dictionary that is either previously learned or explicitly modelled with convolutional nonlinearities affecting non-specific binding tissues. The second drawback is related to the noise distribution in PET images. Even though the positron decay process can be described by a Poisson distribution, the actual noise in reconstructed PET images is not expected to be simply described by Poisson or Gaussian distributions. Therefore, we propose to consider a popular and quite general loss function, called the $\beta$-divergence, that is able to generalize conventional loss functions such as the least-square distance, Kullback-Leibler and Itakura-Saito divergences, respectively corresponding to Gaussian, Poisson and Gamma distributions. This loss function is applied to three factor analysis models in order to evaluate its impact on dynamic PET images with different reconstruction characteristics.
|
49 |
Exploring Unsupervised Learning as a Way of Revealing User Patterns in a Mobile Bank ApplicationBergman, Elsa, Eriksson, Anna January 2019 (has links)
The purpose of this interdisciplinary study was to explore whether it is possible to conduct a data-driven study using pattern recognition in order to gain an understanding of user behavior within a mobile bank application. This knowledge was in turn used to propose ways of tailoring the application to better suit the actual needs of the users. In this thesis, unsupervised learning in the form of clustering was applied to a data set containing information about user interactions with a mobile bank application. By pre-processing the data, finding the best value for the number of clusters to use and applying these results to the K-means algorithm, clustering into distinct subgroups was possible. Visualization of the clusters was possible due to combining K-means with a Principal Component Analysis. Through clustering, patterns regarding how the different functionalities are used in the application were revealed. Thereafter, using relevant concepts within the field of human-computer interaction, a proposal was made of how the application could be altered to better suit the discovered needs of the users. The results show that most sessions are passive, that the device model is of high importance in the clusters, that some features are seldom used and that hidden functionalities are not used in full measure. This is either due to the user not wanting to use some functionalities or because there is a lack of discoverability or understanding among the users, causing them to refrain from using these functionalities. However, determining the actual cause requires further qualitative studies. Removing features which are seldom used, adding signifiers, active discovery as well as conducting user-tests are identified as possible actions in order to minimize issues with discoverability and understanding. Finally, future work and possible improvements to the research methods used in this study were proposed.
|
50 |
Apprentissage non supervisé de flux de données massives : application aux Big Data d'assurance / Unsupervided learning of massive data streams : application to Big Data in insuranceGhesmoune, Mohammed 25 November 2016 (has links)
Le travail de recherche exposé dans cette thèse concerne le développement d'approches à base de growing neural gas (GNG) pour le clustering de flux de données massives. Nous proposons trois extensions de l'approche GNG : séquentielle, distribuée et parallèle, et une méthode hiérarchique; ainsi qu'une nouvelle modélisation pour le passage à l'échelle en utilisant le paradigme MapReduce et l'application de ce modèle pour le clustering au fil de l'eau du jeu de données d'assurance. Nous avons d'abord proposé la méthode G-Stream. G-Stream, en tant que méthode "séquentielle" de clustering, permet de découvrir de manière incrémentale des clusters de formes arbitraires et en ne faisant qu'une seule passe sur les données. G-Stream utilise une fonction d'oubli an de réduire l'impact des anciennes données dont la pertinence diminue au fil du temps. Les liens entre les nœuds (clusters) sont également pondérés par une fonction exponentielle. Un réservoir de données est aussi utilisé an de maintenir, de façon temporaire, les observations très éloignées des prototypes courants. L'algorithme batchStream traite les données en micro-batch (fenêtre de données) pour le clustering de flux. Nous avons défini une nouvelle fonction de coût qui tient compte des sous ensembles de données qui arrivent par paquets. La minimisation de la fonction de coût utilise l'algorithme des nuées dynamiques tout en introduisant une pondération qui permet une pénalisation des données anciennes. Une nouvelle modélisation utilisant le paradigme MapReduce est proposée. Cette modélisation a pour objectif de passer à l'échelle. Elle consiste à décomposer le problème de clustering de flux en fonctions élémentaires (Map et Reduce). Ainsi de traiter chaque sous ensemble de données pour produire soit les clusters intermédiaires ou finaux. Pour l'implémentation de la modélisation proposée, nous avons utilisé la plateforme Spark. Dans le cadre du projet Square Predict, nous avons validé l'algorithme batchStream sur les données d'assurance. Un modèle prédictif combinant le résultat du clustering avec les arbres de décision est aussi présenté. L'algorithme GH-Stream est notre troisième extension de GNG pour la visualisation et le clustering de flux de données massives. L'approche présentée a la particularité d'utiliser une structure hiérarchique et topologique, qui consiste en plusieurs arbres hiérarchiques représentant des clusters, pour les tâches de clustering et de visualisation. / The research outlined in this thesis concerns the development of approaches based on growing neural gas (GNG) for clustering of data streams. We propose three algorithmic extensions of the GNG approaches: sequential, distributed and parallel, and hierarchical; as well as a model for scalability using MapReduce and its application to learn clusters from the real insurance Big Data in the form of a data stream. We firstly propose the G-Stream method. G-Stream, as a “sequential" clustering method, is a one-pass data stream clustering algorithm that allows us to discover clusters of arbitrary shapes without any assumptions on the number of clusters. G-Stream uses an exponential fading function to reduce the impact of old data whose relevance diminishes over time. The links between the nodes are also weighted. A reservoir is used to hold temporarily the distant observations in order to reduce the movements of the nearest nodes to the observations. The batchStream algorithm is a micro-batch based method for clustering data streams which defines a new cost function taking into account that subsets of observations arrive in discrete batches. The minimization of this function, which leads to a topological clustering, is carried out using dynamic clusters in two steps: an assignment step which assigns each observation to a cluster, followed by an optimization step which computes the prototype for each node. A scalable model using MapReduce is then proposed. It consists of decomposing the data stream clustering problem into the elementary functions, Map and Reduce. The observations received in each sub-dataset (within a time interval) are processed through deterministic parallel operations (Map and Reduce) to produce the intermediate states or the final clusters. The batchStream algorithm is validated on the insurance Big Data. A predictive and analysis system is proposed by combining the clustering results of batchStream with decision trees. The architecture and these different modules from the computational core of our Big Data project, called Square Predict. GH-Stream for both visualization and clustering tasks is our third extension. The presented approach uses a hierarchical and topological structure for both of these tasks.
|
Page generated in 0.1081 seconds