  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Aprendizado semissupervisionado multidescrição em classificação de textos / Multi-view semi-supervised learning in text classification

Ígor Assis Braga 23 April 2010
Semi-supervised learning algorithms learn from a combination of labeled and unlabeled data. Thus, they can be applied in domains where few labeled examples and a vast number of unlabeled examples are available. Furthermore, semi-supervised learning algorithms may achieve better performance than supervised learning algorithms trained on the same few labeled examples. A powerful approach to semi-supervised learning, called multi-view learning, can be used whenever the training examples are described by two or more disjoint sets of attributes. Text classification is a domain in which semi-supervised learning algorithms have shown some success. However, multi-view semi-supervised learning has not yet been well explored in this domain, despite the possibility of describing textual documents in a myriad of ways. The aim of this work is to analyze the effectiveness of multi-view semi-supervised learning in text classification using unigrams and bigrams as two distinct descriptions of text documents. To this end, we initially consider the widely adopted CO-TRAINING multi-view algorithm and propose some modifications to it in order to deal with the problem of contention points. We also propose the COAL algorithm, which further improves CO-TRAINING by incorporating active learning as a way of dealing with contention points. A thorough experimental evaluation of these algorithms was conducted on real text data sets. The results show that the COAL algorithm, using unigrams as one description of text documents and bigrams as another description, achieves significantly better performance than a single-view semi-supervised algorithm.
Taking into account the good results obtained by COAL, we conclude that the use of unigrams and bigrams as two distinct descriptions of text documents can be very effective.
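As an illustration of the two descriptions mentioned in this abstract, the following is a minimal sketch of building a unigram view and a bigram view of the same document. The function names and the whitespace tokenizer are assumptions made for this example; the thesis does not specify its pre-processing here.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return text.lower().split()


def unigram_view(text):
    """First description: counts of single tokens."""
    return Counter(tokenize(text))


def bigram_view(text):
    """Second description: counts of adjacent token pairs."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


doc = "semi-supervised learning uses labeled and unlabeled data"
view_1 = unigram_view(doc)  # keys are strings
view_2 = bigram_view(doc)   # keys are 2-tuples of strings
```

Because the unigram keys are strings and the bigram keys are pairs, the two attribute sets are disjoint, which is the condition the abstract states for multi-view learning.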
12

O algoritmo de aprendizado semi-supervisionado co-training e sua aplicação na rotulação de documentos / The semi-supervised learning algorithm co-training applied to label text documents

Edson Takashi Matsubara 26 May 2004
In Machine Learning, the supervised approach usually requires a large number of labeled training examples to induce accurate classifiers. However, labeling is often performed manually, making this process costly and time-consuming. By contrast, unlabeled examples are often inexpensive and easier to obtain than labeled examples. This is especially true for text classification tasks involving online data sources, such as web pages, email and scientific papers. Text classification is of great practical importance today given the massive volume of text available online. Semi-supervised learning, a relatively new area in Machine Learning, represents a blend of supervised and unsupervised learning, and has the potential to reduce the need for expensive labeled data whenever only a small set of labeled examples is available. This work describes the semi-supervised learning algorithm co-training, which requires the description of each example to be partitioned into two distinct views. It should be observed that the two different views required by co-training can be easily obtained from textual documents through pre-processing. In this work, several extensions of the co-training algorithm have been implemented. Furthermore, we have also implemented a computational environment for text pre-processing, called PreTexT, in order to apply the co-training algorithm to text classification problems. Experimental results using co-training on three data sets are described. Two data sets are related to text classification and the other one to web-page classification. The results, which range from excellent to poor, show that co-training, similarly to other semi-supervised learning algorithms, is affected by modelling assumptions in a rather complicated way.
13

Semi-supervised and transductive learning algorithms for predicting alternative splicing events in genes.

Tangirala, Karthik January 1900
Master of Science / Department of Computing and Information Sciences / Doina Caragea / As genomes are sequenced, a major challenge is their annotation: the identification of genes and regulatory elements, their locations and their functions. For years, it was believed that one gene corresponds to one protein, but the discovery of alternative splicing provided a mechanism for generating different gene transcripts (isoforms) from the same genomic sequence. In recent years, it has become obvious that a large fraction of genes undergo alternative splicing. Thus, understanding alternative splicing is a problem of great interest to biologists. Supervised machine learning approaches can be used to predict alternative splicing events at the genome level. However, supervised approaches require large amounts of labeled data to produce accurate classifiers. While large amounts of genomic data are produced by the new sequencing technologies, labeling these data can be costly and time-consuming. Therefore, semi-supervised learning approaches that can make use of large amounts of unlabeled data, in addition to small amounts of labeled data, are highly desirable. In this work, we study the usefulness of a semi-supervised learning approach, co-training, for classifying exons as alternatively spliced or constitutive. The co-training algorithm makes use of two views of the data to iteratively learn two classifiers that can inform each other, at each step, with their best predictions on the unlabeled data. We consider three sets of features for constructing views for the problem of predicting alternatively spliced exons: lengths of the exon of interest and its flanking introns, exonic splicing enhancers (a.k.a. ESE motifs) and intronic regulatory sequences (a.k.a. IRS motifs). Naive Bayes and Support Vector Machine (SVM) algorithms are used as base classifiers in our study.
Experimental results show that the use of unlabeled data can result in better classifiers than those obtained from the small amount of labeled data alone. In addition to semi-supervised approaches, we also study the usefulness of graph-based transductive learning approaches for predicting alternatively spliced exons. Similar to the semi-supervised learning algorithms, transductive learning algorithms can make use of unlabeled data, together with labeled data, to produce labels for the unlabeled data. However, a classification model that could be used to classify new unlabeled data is not learned in this case. Experimental results show that graph-based transductive approaches can make effective use of the unlabeled data.
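The co-training loop described in this abstract (two classifiers, one per view, each passing its best predictions on unlabeled data to the other) can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the `NearestCentroid` toy classifier, the `co_train` function, its parameters and the sample data are all hypothetical, and the thesis itself uses Naive Bayes and SVM as base classifiers. The sketch also assumes binary labels 0 and 1, with both classes present in the initial labeled set.

```python
def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dist(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class NearestCentroid:
    """Toy binary classifier (assumed for illustration): picks the closer class centroid."""

    def fit(self, X, y):
        self.c = {k: centroid([x for x, t in zip(X, y) if t == k]) for k in (0, 1)}
        return self

    def predict_with_confidence(self, x):
        d0, d1 = dist(x, self.c[0]), dist(x, self.c[1])
        # Confidence is the margin between the two centroid distances.
        return (0 if d0 <= d1 else 1), abs(d0 - d1)


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for v in (0, 1):  # alternate between the two views
            if not pool:
                break
            clf = NearestCentroid().fit([x[v] for x, _ in labeled],
                                        [t for _, t in labeled])
            ranked = sorted(pool,
                            key=lambda x: clf.predict_with_confidence(x[v])[1],
                            reverse=True)
            # The most confident predictions join the shared labeled set,
            # so the classifier on the other view learns from them next.
            for x in ranked[:per_round]:
                labeled.append((x, clf.predict_with_confidence(x[v])[0]))
                pool.remove(x)
    final = [NearestCentroid().fit([x[v] for x, _ in labeled],
                                   [t for _, t in labeled]) for v in (0, 1)]
    return final, labeled


seed = [(([0.0], [0.0]), 0), (([1.0], [1.0]), 1)]
pool = [([0.1], [0.2]), ([0.9], [0.8]), ([0.2], [0.1]), ([0.8], [0.9])]
classifiers, enlarged = co_train(seed, pool)
```

In this toy run the four unlabeled points are absorbed into the labeled set over two rounds. With real data, a confidence threshold or a class-balance constraint would usually be added so that early mistakes are not reinforced.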
14

Inferring Aspect-Specific Opinion Structure in Product Reviews

Carter, David January 2015
Identifying differing opinions on a given topic as expressed by multiple people (as in a set of written reviews for a given product, for example) presents challenges. Opinions about a particular subject are often nuanced: a person may have both negative and positive opinions about different aspects of the subject of interest, and these aspect-specific opinions can be independent of the overall opinion on the subject. Being able to identify, collect, and count these nuanced opinions in a large set of data offers more insight into the strengths and weaknesses of competing products and services than does aggregating the overall ratings of such products and services. I make two useful and useable contributions in working with opinionated text. First, I present my implementation of a semi-supervised co-training machine classification method for identifying both product aspects (features of products) and sentiments expressed about such aspects. It offers better precision than fully-supervised methods while requiring much less text to be manually tagged (a time-consuming process). This algorithm can also be run in a fully supervised manner when more data is available. Second, I apply this co-training approach to reviews of restaurants and various electronic devices; such text contains both factual statements and opinions about features/aspects of products. The algorithm automatically identifies the product aspects and the words that indicate aspect-specific opinion polarity, while largely avoiding the problem of misclassifying the products themselves as inherently positive or negative. This method performs well compared to other approaches. When run on a set of reviews of five technology products collected from Amazon, the system performed with some demonstrated competence (with an average precision of 0.83) at the difficult task of simultaneously identifying aspects and sentiments, though comparison to contemporaries' simpler rules-based approaches was difficult. 
When run on a set of opinionated sentences about laptops and restaurants that formed the basis of a shared challenge in the SemEval-2014 Task 4 competition, it was able to classify the sentiments expressed about aspects of laptops better than any team that competed in the task (achieving 0.72 accuracy). It was above the mean in its ability to identify the aspects of restaurants about which people expressed opinions, even when co-training using only half of the labelled training data at the outset. While the SemEval-2014 aspect-based sentiment extraction task considered only separately the tasks of identifying product aspects and determining their polarities, I take an extra step and evaluate sentences as a whole, inferring aspects and the aspect-specific sentiments expressed simultaneously, a more difficult task that seems more applicable to real-world tasks. I present first results of this sentence-level task. The algorithm uses both lexical and syntactic information in a manner that is shown to be able to handle new words that it has never before seen. It offers some demonstrated ability to adapt to new subject domains for which it has no training data. The system is characterizable by very high precision and weak-to-average recall and it estimates its own confidence in its predictions; this characteristic should make the algorithm suitable for use on its own or for combination in a confidence-based voting ensemble. The software created for and described in the course of this dissertation is made available online.
15

Web genre classification using feature selection and semi-supervised learning

Chetry, Roshan January 1900
Master of Science / Department of Computing and Information Sciences / Doina Caragea / As web pages continuously change and their number grows exponentially, the need for genre classification of web pages also increases. One simple reason for this is the need to group web pages into various genre categories in order to reduce the complexity of various web tasks (e.g., search). Experts unanimously agree on the huge potential of genre classification of web pages. However, while everybody agrees that genre classification of web pages is necessary, researchers face problems in finding enough labeled data to perform supervised classification of web pages into various genres. The high cost of skilled manual labor, the rapidly changing nature of the web and the never-ending growth in the number of web pages are the main reasons for the limited amount of labeled data. By contrast, unlabeled data can be acquired relatively inexpensively in comparison to labeled data. This suggests the use of semi-supervised learning approaches for genre classification, instead of supervised approaches. Semi-supervised learning makes use of both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data. Semi-supervised learning has been extensively used in text classification problems. Given the link structure of the web, for web-page classification one can use link features in addition to the content features that are used for general text classification. Hence, the feature set corresponding to web pages can be easily divided into two views, namely content-based and link-based feature views. Intuitively, the two feature views are conditionally independent given the genre category and have the ability to predict the class on their own.
The scarcity of labeled data, the availability of large amounts of unlabeled data, and a richer set of features than in conventional text classification tasks (specifically, complementary and sufficient views of features) have encouraged us to use co-training as a tool to perform semi-supervised learning. During co-training, labeled examples represented using the two views are used to learn distinct classifiers, which keep improving at each iteration by sharing their most confident predictions on the unlabeled data. In this work, we classify web pages of the .eu domain, consisting of 1232 labeled hosts and 20000 unlabeled hosts (provided by the European Archive Foundation [Benczur et al., 2010]), into six different genres, using co-training. We compare our results with the results produced by standard supervised methods. We find that co-training can be an effective and cheap alternative to costly supervised learning. This is mainly due to the two independent and complementary feature sets of the web: content-based features and link-based features.
16

Classification automatique pour la compréhension de la parole : vers des systèmes semi-supervisés et auto-évolutifs

Gotab, Pierre 04 December 2012
Spoken language understanding lies at the confluence of two major fields: automatic speech recognition and machine learning. One of the main problems in this area is obtaining a corpus large enough to train effective statistical models. Speech corpora for training understanding models require substantial human intervention, notably for transcription and semantic annotation. They are costly to produce, which is why they are available only in limited quantities. This thesis primarily aims to reduce this need for human intervention in two ways: first, by reducing the amount of annotated corpus needed to obtain a model, through semi-supervised learning techniques (Self-Training, Co-Training and Active-Learning); and second, by taking advantage of the responses of the system's user to improve the understanding model. This last point relates to a second problem faced by spoken language understanding systems and addressed by this thesis: the need to regularly adapt their models to variations in user behaviour or to changes in the services the system offers.
17

Adaptivni sistem za automatsku polu-nadgledanu klasifikaciju podataka / Adaptive System for Automated Semi-supervised Data Classification

Slivka Jelena 23 December 2014
Aim – The research presented in this thesis is aimed at developing a system for automatic semi-supervised classification. The system is designed to be applicable across a broad spectrum of practical domains where automatic classification of data is needed but it is hard or impossible to obtain a large enough training set.
Methodology – The described models combine the co-training algorithm with ensemble learning in order to overcome the problem of applying co-training to datasets without a natural feature split. The first step is to create an ensemble of co-training classifiers. For this purpose, the models presented in this thesis apply different configurations of co-training to the same training set. Compared to existing similar approaches, this approach requires a significantly smaller initial training set.
The ensemble of independently trained co-training classifiers is created by generating a predefined number of random feature splits of the initial training set. Using the same initial training set, but different feature splits, a group of co-training classifiers is trained. The two models differ in the way the predictions of the different co-training classifiers are combined.
The first approach is based on majority voting: each instance recorded in the enlarged training sets resulting from the application of co-training is classified by majority voting of the group of obtained co-training classifiers. After this, a genetic algorithm is applied in order to select the group of most reliably classified instances from this set. The most reliable instances are used to train a final classifier, which is used to classify new instances. The described algorithm is called the Random Split Statistic Algorithm (RSSalg).
The other approach to combining the individual predictions of the group of co-training classifiers is based on the GMM-MAPML technique for estimating the true hidden label from the multiple labels assigned by multiple annotators of unknown quality. In this model, called the Integration of Multiple Co-trained Classifiers (IMCC), each of the independently trained co-training classifiers predicts the label for each test instance. Each co-training classifier is treated as one of the annotators of unknown quality, and each test instance is assigned multiple labels (one by each of the classifiers). Finally, the GMM-MAPML technique is applied in order to estimate the true hidden label in the multi-annotator setting.
Results – Two models are developed in the dissertation: the Integration of Multiple Co-trained Classifiers (IMCC) and the Random Split Statistic Algorithm (RSSalg). The models are based on co-training and are aimed at enabling automatic classification in cases where the existing training set is insufficient for training a quality classification model. The models are designed to enable the application of the co-training algorithm to datasets that lack the natural feature split needed for its application, and also to improve co-training performance. The models are compared to their co-training alternatives on multiple datasets of different size, dimensionality and feature redundancy. It is shown that the developed models exhibit superior performance on the tested datasets compared to the considered co-training alternatives.
Practical application – The developed models are applicable across a wide spectrum of domains where there is a need for automatic classification and training data is insufficient. The dissertation presents the successful application of the models in several concrete situations where they are highly beneficial: subjectivity detection, multicategory classification and recommender systems.
Value – The models can greatly reduce the human effort needed for long and tedious annotation of large datasets. The conducted experiments show that the developed models are superior to the considered alternatives.
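Two building blocks named in this abstract, random feature splits and majority voting over an ensemble, can be sketched as follows. This is only an illustrative sketch: the function names, the half-and-half split and the use of the vote fraction as an agreement score are assumptions made here, and the genetic-algorithm selection step and the GMM-MAPML technique described in the abstract are not reproduced.

```python
import random
from collections import Counter


def random_feature_splits(n_features, n_splits, seed=0):
    """Generate n_splits random partitions of the feature indices into two views."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        idx = list(range(n_features))
        rng.shuffle(idx)
        half = n_features // 2  # assumed: views of (nearly) equal size
        splits.append((sorted(idx[:half]), sorted(idx[half:])))
    return splits


def majority_vote(predictions):
    """Return (winning label, fraction of classifiers that voted for it)."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes / len(predictions)


splits = random_feature_splits(n_features=6, n_splits=3)
# One hypothetical prediction per co-training classifier for a single instance.
label, agreement = majority_vote([1, 1, 0])
```

Each pair in `splits` would define the two views for one co-training classifier; the per-instance votes of those classifiers are then combined with `majority_vote`.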
18

Classification non supervisée : de la multiplicité des données à la multiplicité des analyses / Clustering : from multiple data to multiple analysis

Sublemontier, Jacques-Henri 07 December 2012
Data clustering is a major problem encountered mainly in the related fields of Artificial Intelligence, Data Analysis and Cognitive Sciences. This topic is concerned with the production of synthetic tools that are able to transform a mass of information into valuable knowledge. This knowledge extraction is done by grouping a set of objects associated with a set of descriptors such that two objects in the same group are similar or share the same behaviour, while two objects from different groups do not. This thesis presents a study of some extensions of the classical clustering problem for multi-view data, where each datum can be represented by several sets of descriptors exhibiting different behaviours or aspects of it. Our study requires exploring several related problems, such as semi-supervised clustering, multi-view clustering, and collaborative approaches for consensus or alternative clustering. In the first chapter, we propose an algorithm that solves the multi-view clustering problem. In the second chapter, we propose a boosting-inspired algorithm and an optimization-based algorithm closely related to boosting, which allow the integration of external knowledge to improve any clustering algorithm. This proposal provides an answer to the semi-supervised clustering problem. In the last chapter, we introduce a unifying framework that allows the discovery of either a set of consensus clustering solutions or a set of alternative clustering solutions for mono-view and/or multi-view data. Such a unifying approach offers a methodology for addressing some currently active topics in Data Mining and Knowledge Discovery in Data.
19

Localisation à partir de caméra vidéo portée

Dovgalecs, Vladislavs 05 December 2011
Content-based indexing of lifelogs from wearable sensors has emerged as a high-value challenge, enabling the exploitation of these new types of data. With the recent availability of miniaturized recording devices making such data more accessible, the need has grown for automatic extraction of relevant information from the content these devices generate. Among other applications, localization in indoor environments is one of the open problems we address in this thesis. Many existing localization solutions do not work well enough or require substantial manual intervention. In this thesis, we address the problem of topological localization from video sequences captured by a wearable camera, using a purely visual approach. This work covers the complete pipeline, from the extraction of low-level visual descriptors to the final location estimate, using automatic algorithms. Within this framework, the main contributions of this work concern the effective use of the information provided by multiple visual descriptors, by unlabeled images, and by the temporal continuity of the video. Early fusion and late fusion of visual data were examined, and the benefit of complementary visual descriptors was demonstrated on the localization problem. Because labeled data are difficult to obtain in sufficient quantities, the full data set was exploited: on the one hand, nonlinear dimensionality reduction approaches were applied to reduce the size of the data to be processed and the associated complexity; on the other hand, semi-supervised approaches were studied in order to use the additional information provided by unlabeled images during classification.
These elements were analysed separately and then combined in a new co-training method with temporal information. Finally, we also explored the question of descriptor invariance, proposing learning that is invariant to spatial transformation as another possible response to the lack of annotated data and to visual variability. These methods were evaluated on publicly available video sequences recorded in a controlled environment, in order to assess the specific gain of each contribution. This work was also applied within the IMMED project, which concerns the observation and indexing of activities of daily living, with the aim of supporting medical diagnosis, using a wearable video camera. We were thus able to deploy the wearable video acquisition device and show the potential of our approach for estimating topological location on a corpus with difficult conditions representative of real data.
