• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 196
  • 21
  • 18
  • 9
  • 5
  • 4
  • 2
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • Tagged with
  • 328
  • 328
  • 119
  • 109
  • 81
  • 80
  • 78
  • 64
  • 62
  • 61
  • 55
  • 49
  • 47
  • 47
  • 46
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

A Graph Theoretic Clustering Algorithm based on the Regularity Lemma and Strategies to Exploit Clustering for Prediction

Trivedi, Shubhendu 30 April 2012 (has links)
The fact that clustering is perhaps the most used technique for exploratory data analysis is only a semaphore that underlines its fundamental importance. The general problem statement that broadly describes clustering as the identification and classification of patterns into coherent groups also implicitly indicates it's utility in other tasks such as supervised learning. In the past decade and a half there have been two developments that have altered the landscape of research in clustering: One is improved results by the increased use of graph theoretic techniques such as spectral clustering and the other is the study of clustering with respect to its relevance in semi-supervised learning i.e. using unlabeled data for improving prediction accuracies. In this work an attempt is made to make contributions to both these aspects. Thus our contributions are two-fold: First, we identify some general issues with the spectral clustering framework and while working towards a solution, we introduce a new algorithm which we call "Regularity Clustering" which makes an attempt to harness the power of the Szemeredi Regularity Lemma, a remarkable result from extremal graph theory for the task of clustering. Secondly, we investigate some practical and useful strategies for using clustering unlabeled data in boosting prediction accuracy. For all of these contributions we evaluate our methods against existing ones and also apply these ideas in a number of settings.
52

Contributions au clustering collaboratif et à ses potentielles applications en imagerie à très haute résolution / Contributions to collaborative clustering and its potential applications on very high resolution satellite images

Sublime, Jérémie 09 November 2016 (has links)
Cette thèse présente plusieurs algorithmes développés dans le cadre du projet ANR COCLICO et contient deux axes principaux :Le premier axe concerne l'introduction d'un algorithme applicable aux images satellite à très haute résolution, qui est basé sur les champs aléatoires de Markov et qui apporte des notions sémantiques sur les clusters découverts. Cet algorithme est inspiré de l'algorithme Iterated conditional modes (ICM) et permet de faire un clustering sur des segments d'images pré-traitées. La méthode que nous proposons permet de gérer des voisinages irréguliers entre segments et d'obtenir des informations sémantiques de bas niveau sur les clusters de l'image traitée.Le second axe porte sur le développement de méthodes de clustering collaboratif applicables à autant d'algorithmes que possible, ce qui inclut les algorithmes du premier axe. La caractéristique principale des méthodes proposées dans cette thèse est leur applicabilité aux deux cas suivants : 1) plusieurs algorithmes travaillant sur les mêmes objets dans des espaces de représentation différents, 2) plusieurs algorithmes travaillant sur des données différentes ayant des distributions similaires. Les méthodes que nous proposons peuvent s'appliquer à de nombreux algorithmes comme l'ICM, les K-Moyennes, l'algorithme EM, ou les cartes topographiques (SOM et GTM). Contrairement aux méthodes précédemment proposées, notre modèle permet à des algorithmes très différents de collaborer ensemble, n'impose pas de contrainte sur le nombre de clusters recherchés et a une base mathématique solide. / This thesis presents several algorithms developed in the context of the ANR COCLICO project and contains two main axis: The first axis is concerned with introducing Markov Random Fields (MRF) based models to provide a semantic rich and suited algorithm applicable to images that are already segmented. This method is based on the Iterated Conditional Modes Algorithm (ICM algorithm) and can be applied to the segments of very high resolution (VHR) satellite pictures. Our proposed method can cope with highly irregular neighborhood dependencies and provides some low level semantic information on the clusters and their relationship within the image. The second axis deals with collaborative clustering methods developed with the goal of being applicable to as many clustering algorithms as possible, including the algorithms used in the first axis of this work. A key feature of the methods proposed in this thesis is that they can deal with either of the following two cases: 1) several clustering algorithms working together on the same data represented in different feature spaces, 2) several clustering algorithms looking for similar clusters in different data sets having similar distributions. Clustering algorithms to which these methods are applicable include the ICM algorithm, the K-Means algorithm, density based algorithms such as DB-scan, all Expectation-Maximization (EM) based algorithms such as the Self-Organizing Maps (SOM) and the Generative Topographic Mapping (GTM) algorithms. Unlike previously introduced methods, our models have no restrictions in term of types of algorithms that can collaborate together, do not require that all methods be looking for the same number of clusters, and are provided with solid mathematical foundations.
53

Feature distribution learning for covariate shift adaptation using sparse filtering

Zennaro, Fabio January 2017 (has links)
This thesis studies a family of unsupervised learning algorithms called feature distribution learning and their extension to perform covariate shift adaptation. Unsupervised learning is one of the most active areas of research in machine learning, and a central challenge in this field is to develop simple and robust algorithms able to work in real-world scenarios. A traditional assumption of machine learning is the independence and identical distribution of data. Unfortunately, in realistic conditions this assumption is often unmet and the performances of traditional algorithms may be severely compromised. Covariate shift adaptation has then developed as a lively sub-field concerned with designing algorithms that can account for covariate shift, that is for a difference in the distribution of training and test samples. The first part of this dissertation focuses on the study of a family of unsupervised learning algorithms that has been recently proposed and has shown promise: feature distribution learning; in particular, sparse filtering, the most representative feature distribution learning algorithm, has commanded interest because of its simplicity and state-of-the-art performance. Despite its success and its frequent adoption, sparse filtering lacks any strong theoretical justification. This research questions how feature distribution learning can be rigorously formalized and how the dynamics of sparse filtering can be explained. These questions are answered by first putting forward a new definition of feature distribution learning based on concepts from information theory and optimization theory; relying on this, a theoretical analysis of sparse filtering is carried out, which is validated on both synthetic and real-world data sets. In the second part, the use of feature distribution learning algorithms to perform covariate shift adaptation is considered. Indeed, because of their definition and apparent insensitivity to the problem of modelling data distributions, feature distribution learning algorithms seems particularly fit to deal with covariate shift. This research questions whether and how feature distribution learning may be fruitfully employed to perform covariate shift adaptation. After making explicit the conditions of success for performing covariate shift adaptation, a theoretical analysis of sparse filtering and another novel algorithm, periodic sparse filtering, is carried out; this allows for the determination of the specific conditions under which these algorithms successfully work. Finally, a comparison of these sparse filtering-based algorithms against other traditional algorithms aimed at covariate shift adaptation is offered, showing that the novel algorithm is able to achieve competitive performance. In conclusion, this thesis provides a new rigorous framework to analyse and design feature distribution learning algorithms; it sheds light on the hidden assumptions behind sparse filtering, offering a clear understanding of its conditions of success; it uncovers the potential and the limitations of sparse filtering-based algorithm in performing covariate shift adaptation. These results are relevant both for researchers interested in furthering the understanding of unsupervised learning algorithms and for practitioners interested in deploying feature distribution learning in an informed way.
54

Detecção de novidade com aplicação a fluxos contínuos de dados / Novelty detection with application to data streams

Spinosa, Eduardo Jaques 20 February 2008 (has links)
Neste trabalho a detecção de novidade é tratada como o problema de identificação de conceitos emergentes em dados que podem ser apresentados em um fluxo contínuo. Considerando a relação intrínseca entre tempo e novidade e os desafios impostos por fluxos de dados, uma nova abordagem é proposta. OLINDDA (OnLIne Novelty and Drift Detection Algorithm) vai além da classficação com uma classe e concentra-se no aprendizado contínuo não-supervisionado de novos conceitos. Tendo aprendido uma descrição inicial de um conceito normal, prossegue à análise de novos dados, tratando-os como um fluxo contínuo em que novos conceitos podem aparecer a qualquer momento. Com o uso de técnicas de agrupamento, OLINDDA pode empregar diversos critérios de validação para avaliar grupos em termos de sua coesão e representatividade. Grupos considerados válidos produzem conceitos que podem sofrer fusão, e cujo conhecimento é continuamente incorporado. A técnica é avaliada experimentalmente com dados artificiais e reais. O módulo de classificação com uma classe é comparado a outras técnicas de detecção de novidade, e a abordagem como um todo é analisada sob vários aspectos por meio da evolução temporal de diversas métricas. Os resultados reforçam a importância da detecção contínua de novos conceitos, assim como as dificuldades e desafios do aprendizado não-supervisionado de novos conceitos em fluxos de dados / In this work novelty detection is treated as the problem of identifying emerging concepts in data that may be presented in a continuous ow. Considering the intrinsic relationship between time and novelty and the challenges imposed by data streams, a novel approach is proposed. OLINDDA, an OnLIne Novelty and Drift Detection Algorithm, goes beyond one-class classification and focuses on the unsupervised continuous learning of novel concepts. Having learned an initial description of a normal concept, it proceeds to the analysis of new data, treating them as a continuous ow where novel concepts may appear at any time. By the use of clustering techniques, OLINDDA may employ several validation criteria to evaluate clusters in terms of their cohesiveness and representativeness. Clusters considered valid produce concepts that may be merged, and whose knowledge is continuously incorporated. The technique is experimentally evaluated with artificial and real data. The one-class classification module is compared to other novelty detection techniques, and the whole approach is analyzed from various aspects through the temporal evolution of several metrics. Results reinforce the importance of continuous detection of novel concepts, as well as the dificulties and challenges of the unsupervised learning of novel concepts in data streams
55

Técnicas de aprendizado não supervisionado baseadas no algoritmo da caminhada do turista / Unsupervised learning techniques based on the tourist walk algorithm

Porto Filho, Carlos Humberto 07 November 2017 (has links)
Nas últimas décadas, a quantidade de informações armazenadas no formato digital tem crescido de forma exponencial, levando à necessidade cada vez maior de produção de ferramentas computacionais que auxiliem na geração do conhecimento a partir desses dados. A área de Aprendizado de Máquina fornece diversas técnicas capazes de identificar padrões nesses conjuntos de dados. Dentro dessas técnicas, este trabalho destaca o Aprendizado de Máquina Não Supervisionado onde o objetivo é classificar as entidades em clusters (grupos) mutuamente exclusivos baseados na similaridade entre as instâncias. Os clusters não são pré-definidos e daí o elemento não supervisionado. Organizar esses dados em clusters que façam sentido é uma das maneiras mais fundamentais de entendimento e aprendizado. A análise de clusters é o estudo dos métodos para agrupamento e se divide entre hierárquico e particional. A classificação hierárquica é uma sequência encadeada de partições enquanto que na particional há somente uma partição. O interesse deste trabalho são as técnicas baseadas em uma caminhada determinística parcialmente auto repulsiva conhecida como caminhada do turista. Partindo da hipótese de que é possível utilizar a caminhada do turista como uma técnica de Aprendizado de Máquina Não Supervisionado, foi implementado um algoritmo hierárquico baseado na caminhada do turista proposto por Campiteli et al. (2006). Foi avaliado, através de diferentes conjuntos de imagens médicas, como essa técnica se compara com técnicas hierárquicas tradicionais. Também é proposto um novo algoritmo de Aprendizado de Máquina Não Supervisionado particional baseado na caminhada do turista, chamado de Tourist Walk Partitional Clustering (TWPC). Os resultados mostraram que a técnica hierárquica baseada na caminhada do turista é capaz de identificar clusters em conjuntos de imagens médicas através de uma árvore que não impõe uma estrutura binária, com um número menor de hierarquias e uma invariabilidade à escala dos dados, resultando em uma estrutura mais organizada. Mesmo que a árvore não seja diretamente baseada nas distâncias dos dados, mas em um ranking de vizinhos, ela ainda preserva uma correlação entre suas distâncias cofenéticas e as distâncias reais entre os dados. O método particional proposto TWPC foi capaz de encontrar, de forma eficiente, formas arbitrárias de clusters com variações inter-cluster e intra-cluster. Além disso o algoritmo tem como vantagens: ser determinístico; funcionar com interações locais, sem a necessidade de conhecimento a priori de todos os itens do conjunto; incorporar o conceito de ruído e outlier; e funcionar com um ranking de vizinhos, que pode ser construído através de qualquer medida. / In the last decades, the amount of data stored in digital format has grown exponentially, leading to the increasing need to produce computational tools that help generate knowledge from these data. The Machine Learning field provides several techniques capable of identifying patterns in these data sets. Within these techniques we highlight the Unsupervised Machine Learning where the objective is to classify the entities in mutually exclusive clusters based on the similarity between the instances. Clusters are not predefined and hence the unsupervised element. Organizing this data into clusters that make sense is one of the most fundamental ways of understanding and learning. Cluster analysis is the study of methods for clustering and is divided between hierarchical and partitional. A hierarchical clustering is a sequence of partitions whereas in the partitional clustering there is only one partition. Here we are interested in techniques based on a deterministic partially self-avoiding walk, known as tourist walk. Based on the hypothesis that it is possible to use the tourist walk as an unsupervised machine learning technique, we have implemented a hierarchical algorithm based on the tourist walk proposed by Campiteli et al. (2006). We evaluate this algorithm using different sets of medical images and compare it with traditional hierarchical techniques. We also propose a new algorithm for partitional clustering based on the tourist talk, called Tourist Walk Partitional Clustering (TWPC). The results showed that the hierarchical technique based on the tourist walk is able to identify clusters in sets of medical images through a tree that does not impose a binary structure, with a smaller number of hierarchies and is invariable to scale transformation, resulting in a more organized structure. Even though the tree is not directly based on the distances of the data but on a ranking of neighbors, it still preserves a correlation between its cophenetic distances and the actual distances between the data. The proposed partitional clustering method TWPC was able to find, in an efficient way, arbitrary shapes of clusters with inter-cluster and intra-cluster variations. In addition, the algorithm has the following advantages: it is deterministic; it operates based on local interactions, without the need for a priori knowledge of all the items in the set; it is capable of incorporate the concept of noise and outlier; and work with a ranking of neighbors, which can be built through any measure.
56

Detecting Metagame Shifts in League of Legends Using Unsupervised Learning

Peabody, Dustin P 18 May 2018 (has links)
Over the many years since their inception, the complexity of video games has risen considerably. With this increase in complexity comes an increase in the number of possible choices for players and increased difficultly for developers who try to balance the effectiveness of these choices. In this thesis we demonstrate that unsupervised learning can give game developers extra insight into their own games, providing them with a tool that can potentially alert them to problems faster than they would otherwise be able to find. Specifically, we use DBSCAN to look at League of Legends and the metagame players have formed with their choices and attempt to detect when the metagame shifts possibly giving the developer insight into what changes they should affect to achieve a more balanced, fun game.
57

The Unsupervised Acquisition of a Lexicon from Continuous Speech

Marcken, Carl de 18 January 1996 (has links)
We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency.
58

Recognition and investigation of temporal patterns in seismic wavefields using unsupervised learning techniques

Köhler, Andreas January 2009 (has links)
Modern acquisition of seismic data on receiver networks worldwide produces an increasing amount of continuous wavefield recordings. Hence, in addition to manual data inspection, seismogram interpretation requires new processing utilities for event detection, signal classification and data visualization. Various machine learning algorithms, which can be adapted to seismological problems, have been suggested in the field of pattern recognition. This can be done either by means of supervised learning using manually defined training data or by unsupervised clustering and visualization. The latter allows the recognition of wavefield patterns, such as short-term transients and long-term variations, with a minimum of domain knowledge. Besides classical earthquake seismology, investigations of temporal patterns in seismic data also concern novel approaches such as noise cross-correlation or ambient seismic vibration analysis in general, which have moved into focus within the last decade. In order to find records suitable for the respective approach or simply for quality control, unsupervised preprocessing becomes important and valuable for large data sets. Machine learning techniques require the parametrization of the data using feature vectors. Applied to seismic recordings, wavefield properties have to be computed from the raw seismograms. For an unsupervised approach, all potential wavefield features have to be considered to reduce subjectivity to a minimum. Furthermore, automatic dimensionality reduction, i.e. feature selection, is required in order to decrease computational cost, enhance interpretability and improve discriminative power. This study presents an unsupervised feature selection and learning approach for the discovery, imaging and interpretation of significant temporal patterns in seismic single-station or network recordings. In particular, techniques permitting an intuitive, quickly interpretable and concise overview of available records are suggested. For this purpose, the data is parametrized by real-valued feature vectors for short time windows using standard seismic analysis tools as feature generation methods, such as frequency-wavenumber, polarization, and spectral analysis. The choice of the time window length is dependent on the expected durations of patterns to be recognized or discriminated. We use Self-Organizing Maps (SOMs) for a data-driven feature selection, visualization and clustering procedure, which is particularly suitable for high-dimensional data sets. Using synthetics composed of Rayleigh and Love waves and three different types of real-world data sets, we show the robustness and reliability of our unsupervised learning approach with respect to the effect of algorithm parameters and data set properties. Furthermore, we approve the capability of the clustering and imaging techniques. For all data, we find improved discriminative power of our feature selection procedure compared to feature subsets manually selected from individual wavefield parametrization methods. In particular, enhanced performance is observed compared to the most favorable individual feature generation method, which is found to be the frequency spectrum. The method is applied to regional earthquake records at the European Broadband Network with the aim to define suitable features for earthquake detection and seismic phase classification. For the latter, we find that a combination of spectral and polarization features favor S wave detection at a single receiver. However, SOM-based visualization of phase discrimination shows that clustering applied to the records of two stations only allows onset or P wave detection, respectively. In order to improve the discrimination of S waves on receiver networks, we recommend to consider additionally the temporal context of feature vectors. The application to continuous recordings of seismicity close to an active volcano (Mount Merapi, Java, Indonesia) shows that two typical volcano-seismic events (VTB and Guguran) can be detected and distinguished by clustering. In contrast, so-called MP events cannot be discriminated. Comparable results are obtained for selected features and recognition rates regarding a previously implemented supervised classification system. Finally, we test the reliability of wavefield clustering to improve common ambient vibration analysis methods such as estimation of dispersion curves and horizontal to vertical spectral ratios. It is found, that in general, the identified short- and long-term patterns have no significant impact on those estimates. However, for individual sites, effects of local sources can be identified. Leaving out the corresponding clusters, yields reduced uncertainties or allows for improving estimation of dispersion curves. / Die Anzahl der weltweit kontinuierlich aufzeichnenden seismischen Messstationen ist in den vergangenen Jahren immer weiter angestiegen. Aus diesem Grund steht eine große Menge von seismischen Datensätzen zu Forschungszwecken zur Verfügung. Insbesondere betrifft dies passive Verfahren zur geologischen Strukturerkundung entweder mittels transienter Ereignisse wie Erdbeben oder unter der Verwendung der permanent vorhandenen natürlichen seismischen Bodenunruhe. Die Bearbeitung dieser Daten erfordert neben der klassischen manuellen Seismogrammanalyse verstärkt auch den Einsatz automatischer Detektionssysteme. Mit Hilfe von überwachten Lernverfahren, d.h. unter Verwendung von seismischen Signalen deren Auftreten bekannt ist, ist es möglich, unbekannte Muster zu klassifizieren. Im Gegensatz dazu hatte die vorliegende Arbeit zum Ziel, ein allgemeines, unüberwachtes Verfahren zur quantitativen Zerlegung seismischer Wellenfelder zu entwickeln. Dies wird mittels einer automatischen Clusterung von Seismogrammzeitfenstern bzw. über die Visualisierung von zeitlichen Mustern auf unterschiedlichen Zeitskalen erreicht. Als unüberwachtes Lernverfahren, das neben der Clusterung auch eine einfach interpretierbare Visualisierung hoch-dimensionaler Datensätze über eine zweidimensionale Darstellung ermöglicht, wurde der Self-organizing-map Algorithmus (SOM) gewählt. Für automatische Lernverfahren ist die Parametrisierung der Seismogramme mittels Merkmalsvektoren erforderlich. Im vorliegenden Fall wurden möglichst viele potentielle Wellenfeldmerkmale unter Verwendung von verschiedenen seismischen Einzel- und Mehrstationsanalyseverfahren für aufeinanderfolgende kurze Zeitfenster berechnet. Um eine datenadaptive und effiziente Parametrisierung zu erreichen, wurde darüberhinaus ein quantitatives Auswahlverfahren für geeignete Merkmale entwickelt, das über einen mehrstufigen Filter bestehend aus einem Signifikanztest und einer SOM-basierenden Korrelationsanalyse redundante und irrelevante Eigenschaften aussortiert. Mit den neu implementierten Techniken wurden verschiedene Arten von seismischen Datensätzen unter Berücksichtigung verschiedener seismologischer Fragestellungen bearbeitet. Die Algorithmen und deren Parameter wurden zunächst intensiv und quantitativ mit Hilfe synthetischer Daten getestet und optimiert. Anschließend wurden reale Aufzeichnungen regionaler Erdbeben und vulkanischer Seismizität verwendet. Im ersten Fall konnten geeignete Merkmale zur Detektion und Klassifizierung von Erdbebenwellenphasen gefunden und die Diskriminierung dieser Signale mit Hilfe der SOM-Darstellung untersucht werden. Unter Verwendung des zweiten Datensatzes wurden Cluster typischer vulkano-seismischer Signale am Vulkan Mount Merapi (Java, Indonesien) detektiert, die sich zur Vorhersage von Eruptionen eignen. Beide Anwendungen haben gezeigt, dass, verglichen mit einzelnen Methoden, automatisch gefundene Kombinationen von Merkmalen verschiedener Parametrisierungsverfahren deutlich bessere Klassifizierungsraten zur Folge haben. Zudem können die Erkenntnisse über die Clusterung von seismischen Signalen dazu verwendet werden, verbesserte automatische Klassifizierungssysteme zu entwickeln. Abschließend wurden Aufzeichnungen der natürlichen seismischen Bodenunruhe bearbeitet. Insbesondere konnte der Einfluss kurzzeitiger und längerfristiger Variationen im Wellenfeld auf Methoden zur passiven Strukturerkundung untersucht werden. Es hat sich gezeigt, dass in einzelnen Fällen tageszeitabhängige Muster und lokale seismische Quellen die Ergebnisse negativ beeinflussen können. Die Wellenfeldzerlegung mittels Clusterung hat es erlaubt, diese Signale zu identifizieren und somit von der Analyse auszuschließen.
59

Learning from Partially Labeled Data: Unsupervised and Semi-supervised Learning on Graphs and Learning with Distribution Shifting

Huang, Jiayuan January 2007 (has links)
This thesis focuses on two fundamental machine learning problems:unsupervised learning, where no label information is available, and semi-supervised learning, where a small amount of labels are given in addition to unlabeled data. These problems arise in many real word applications, such as Web analysis and bioinformatics,where a large amount of data is available, but no or only a small amount of labeled data exists. Obtaining classification labels in these domains is usually quite difficult because it involves either manual labeling or physical experimentation. This thesis approaches these problems from two perspectives: graph based and distribution based. First, I investigate a series of graph based learning algorithms that are able to exploit information embedded in different types of graph structures. These algorithms allow label information to be shared between nodes in the graph---ultimately communicating information globally to yield effective unsupervised and semi-supervised learning. In particular, I extend existing graph based learning algorithms, currently based on undirected graphs, to more general graph types, including directed graphs, hypergraphs and complex networks. These richer graph representations allow one to more naturally capture the intrinsic data relationships that exist, for example, in Web data, relational data, bioinformatics and social networks. For each of these generalized graph structures I show how information propagation can be characterized by distinct random walk models, and then use this characterization to develop new unsupervised and semi-supervised learning algorithms. Second, I investigate a more statistically oriented approach that explicitly models a learning scenario where the training and test examples come from different distributions. This is a difficult situation for standard statistical learning approaches, since they typically incorporate an assumption that the distributions for training and test sets are similar, if not identical. To achieve good performance in this scenario, I utilize unlabeled data to correct the bias between the training and test distributions. A key idea is to produce resampling weights for bias correction by working directly in a feature space and bypassing the problem of explicit density estimation. The technique can be easily applied to many different supervised learning algorithms, automatically adapting their behavior to cope with distribution shifting between training and test data.
60

Contributions to Unsupervised and Semi-Supervised Learning

Pal, David 21 May 2009 (has links)
This thesis studies two problems in theoretical machine learning. The first part of the thesis investigates the statistical stability of clustering algorithms. In the second part, we study the relative advantage of having unlabeled data in classification problems. Clustering stability was proposed and used as a model selection method in clustering tasks. The main idea of the method is that from a given data set two independent samples are taken. Each sample individually is clustered with the same clustering algorithm, with the same setting of its parameters. If the two resulting clusterings turn out to be close in some metric, it is concluded that the clustering algorithm and the setting of its parameters match the data set, and that clusterings obtained are meaningful. We study asymptotic properties of this method for certain types of cost minimizing clustering algorithms and relate their asymptotic stability to the number of optimal solutions of the underlying optimization problem. In classification problems, it is often expensive to obtain labeled data, but on the other hand, unlabeled data are often plentiful and cheap. We study how the access to unlabeled data can decrease the amount of labeled data needed in the worst-case sense. We propose an extension of the probably approximately correct (PAC) model in which this question can be naturally studied. We show that for certain basic tasks the access to unlabeled data might, at best, halve the amount of labeled data needed.

Page generated in 0.0853 seconds