• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 141
  • 57
  • 16
  • 11
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • Tagged with
  • 269
  • 269
  • 243
  • 102
  • 73
  • 62
  • 59
  • 50
  • 40
  • 36
  • 31
  • 30
  • 28
  • 28
  • 28
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
31

Semi-Supervised Classification Using Gaussian Processes

Patel, Amrish 01 1900 (has links)
Gaussian Processes (GPs) are promising Bayesian methods for classification and regression problems. They have also been used for semi-supervised classification tasks. In this thesis, we propose new algorithms for solving semi-supervised binary classification problem using GP regression (GPR) models. The algorithms are closely related to semi-supervised classification based on support vector regression (SVR) and maximum margin clustering. The proposed algorithms are simple and easy to implement. Also, the hyper-parameters are estimated without resorting to expensive cross-validation technique. The algorithm based on sparse GPR model gives a sparse solution directly unlike the SVR based algorithm. Use of sparse GPR model helps in making the proposed algorithm scalable. The results of experiments on synthetic and real-world datasets demonstrate the efficacy of proposed sparse GP based algorithm for semi-supervised classification.
32

"Semi-supervised" trénování hlubokých neuronových sítí pro rozpoznávání řeči / Semi-Supervised Training of Deep Neural Networks for Speech Recognition

Veselý, Karel January 2018 (has links)
V této dizertační práci nejprve prezentujeme teorii trénování neuronových sítí pro rozpoznávání řeči společně s implementací trénovacího receptu 'nnet1', který je součástí toolkitu s otevřeným kódem Kaldi. Recept se skládá z předtrénování bez učitele pomocí algoritmu RBM, trénování klasifikátoru z řečových rámců s kriteriální funkcí Cross-entropy a ze sekvenčního trénování po větách s kriteriální funkcí sMBR. Následuje hlavní téma práce, kterým je semi-supervised trénování se smíšenými daty s přepisem i bez přepisu. Inspirováni konferenčními články a úvodními experimenty jsme se zaměřili na několik otázek: Nejprve na to, zda je lepší konfidence (t.j. důvěryhodnosti automaticky získaných anotací) počítat po větách, po slovech nebo po řečových rámcích. Dále na to, zda by konfidence měly být použity pro výběr dat nebo váhování dat - oba přístupy jsou kompatibilní s trénováním pomocí metody stochastického nejstrmějšího sestupu, kde jsou gradienty řečových rámců násobeny vahou. Dále jsme se zabývali vylepšováním semi-supervised trénování pomocí kalibrace kofidencí a přístupy, jak model dále vylepšit pomocí dat se správným přepisem. Nakonec jsme navrhli jednoduchý recept, pro který není nutné časově náročné ladění hyper-parametrů trénování, a který je prakticky využitelný pro různé datové sady. Experimenty probíhaly na několika sadách řečových dat: pro rozpoznávač vietnamštiny s 10 přepsaným hodinami (Babel) se chybovost snížila o 2.5%, pro angličtinu se 14 přepsanými hodinami (Switchboard) se chybovost snížila o 3.2%. Zjistili jsme, že je poměrně těžké dále vylepšit přesnost systému pomocí úprav konfidencí, zároveň jsme ale přesvědčení, že naše závěry mají značnou praktickou hodnotu: data bez přepisu je jednoduché nasbírat a naše navrhované řešení přináší dobrá zlepšení úspěšnosti a není těžké je replikovat.
33

Semi-Supervised Plant Leaf Detection and Stress Recognition / Semi-övervakad detektering av växtblad och möjlig stressigenkänning

Antal Csizmadia, Márk January 2022 (has links)
One of the main limitations of training deep learning-based object detection models is the availability of large amounts of data annotations. When annotations are scarce, semi-supervised learning provides frameworks to improve object detection performance by utilising unlabelled data. This is particularly useful in plant leaf detection and possible leaf stress recognition, where data annotations are expensive to obtain due to the need for specialised domain knowledge. This project aims to investigate the feasibility of the Unbiased Teacher, a semi-supervised object detection algorithm, for detecting plant leaves and recognising possible leaf stress in experimental settings where few annotations are available during training. We build an annotated data set for this task and implement the Unbiased Teacher algorithm. We optimise the Unbiased Teacher algorithm and compare its performance to that of a baseline model. Finally, we investigate which hyperparameters of the Unbiased Teacher algorithm most significantly affect its performance and its ability to utilise unlabelled images. We find that the Unbiased Teacher algorithm outperforms the baseline model in the experimental settings when limited annotated data are available during training. Amongst the hyperparameters we consider, we identify the confidence threshold as having the most effect on the algorithm’s performance and ability to leverage unlabelled data. Ultimately, we demonstrate the feasibility of improving object detection performance with the Unbiased Teacher algorithm in plant leaf detection and possible stress recognition when few annotations are available. The improved performance reduces the amount of annotated data required for this task, reducing annotation costs and thereby increasing usage for real-world tasks. / En av huvudbegränsningarna med att träna djupinlärningsbaserade objektdetekteringsmodeller är tillgången på stora mängder annoterad data. Vid små mängder av tillgänglig data kan semi-övervakad inlärning erbjuda ett ramverk för att förbättra objektdetekteringsprestanda genom att använda icke-annoterad data. Detta är särskilt användbart vid detektering av växtblad och möjlig igenkänning av stressymptom hos bladen, där kostnaden för annotering av data är hög på grund av behovet av specialiserad kunskap inom området. Detta projekt syftar till att undersöka genomförbarheten av Opartiska Läraren (eng. ”Unbiased Teacher”), en semi-övervakad objektdetekteringsalgoritm, för att upptäcka växtblad och känna igen möjliga stressymptom hos blad i experimentella miljöer när endast en liten mängd annoterad data finns tillgänglig under träning. För att åstadkomma detta bygger vi ett annoterat dataset och implementerar Opartiska Läraren. Vi optimerar Opartiska Läraren och jämför dess prestanda med en baslinjemodell. Slutligen undersöker vi de hyperparametrar som mest påverkar Opartiska Lärarens prestanda och dess förmåga att använda icke-annoterade bilder. Vi finner att Opartiska Läraren överträffar baslinjemodellen i de experimentella inställningarna när det finns en begränsad mängd annoterad data under träningen. Bland hyperparametrarna vi överväger identifierar vi konfidensgränsen som har störst effekt på algoritmens prestanda och dess förmåga att utnyttja icke-annoterad data. Vi demonstrerar möjligheten att förbättra objektdetekteringsprestandan med Opartiska Läraren i växtbladsdetektering och möjlig stressigenkänning när få anteckningar finns tillgängliga. Den förbättrade prestandan minskar mängden annoterad data som krävs, vilket minskar anteckningskostnaderna och ökar därmed användbarheten för användning inom mer praktiska områden.
34

A Semi-Supervised Predictive Model to Link Regulatory Regions to Their Target Genes

Hafez, Dina Mohamed January 2015 (has links)
<p>Next generation sequencing technologies have provided us with a wealth of data profiling a diverse range of biological processes. In an effort to better understand the process of gene regulation, two predictive machine learning models specifically tailored for analyzing gene transcription and polyadenylation are presented.</p><p>Transcriptional enhancers are specific DNA sequences that act as ``information integration hubs" to confer regulatory requirements on a given cell. These non-coding DNA sequences can regulate genes from long distances, or across chromosomes, and their relationships with their target genes are not limited to one-to-one. With thousands of putative enhancers and less than 14,000 protein-coding genes, detecting enhancer-gene pairs becomes a very complex machine learning and data analysis challenge. </p><p>In order to predict these specific-sequences and link them to genes they regulate, we developed McEnhancer. Using DNAseI sensitivity data and annotated in-situ hybridization gene expression clusters, McEnhancer builds interpolated Markov models to learn enriched sequence content of known enhancer-gene pairs and predicts unknown interactions in a semi-supervised learning algorithm. Classification of predicted relationships were 73-98% accurate for gene sets with varying levels of initial known examples. Predicted interactions showed a great overlap when compared to Hi-C identified interactions. Enrichment of known functionally related TF binding motifs, enhancer-associated histone modification marks, along with corresponding developmental time point was highly evident.</p><p>On the other hand, pre-mRNA cleavage and polyadenylation is an essential step for 3'-end maturation and subsequent stability and degradation of mRNAs. This process is highly controlled by cis-regulatory elements surrounding the cleavage site (polyA site), which are frequently constrained by sequence content and position. More than 50\% of human transcripts have multiple functional polyA sites, and the specific use of alternative polyA sites (APA) results in isoforms with variable 3'-UTRs, thus potentially affecting gene regulation. Elucidating the regulatory mechanisms underlying differential polyA preferences in multiple cell types has been hindered by the lack of appropriate tests for determining APAs with significant differences across multiple libraries. </p><p>We specified a linear effects regression model to identify tissue-specific biases indicating regulated APA; the significance of differences between tissue types was assessed by an appropriately designed permutation test. This combination allowed us to identify highly specific subsets of APA events in the individual tissue types. Predictive kernel-based SVM models successfully classified constitutive polyA sites from a biologically relevant background (auROC = 99.6%), as well as tissue-specific regulated sets from each other. The main cis-regulatory elements described for polyadenylation were found to be a strong, and highly informative, hallmark for constitutive sites only. Tissue-specific regulated sites were found to contain other regulatory motifs, with the canonical PAS signal being nearly absent at brain-specific sites. We applied this model on SRp20 data, an RNA binding protein that might be involved in oncogene activation and obtained interesting insights. </p><p>Together, these two models contribute to the understanding of enhancers and the key role they play in regulating tissue-specific expression patterns during development, as well as provide a better understanding of the diversity of post-transcriptional gene regulation in multiple tissue types.</p> / Dissertation
35

Graph-based approaches for semi-supervised and cross-domain sentiment analysis

Ponomareva, Natalia January 2014 (has links)
The rapid development of Internet technologies has resulted in a sharp increase in the number of Internet users who create content online. User-generated content often represents people's opinions, thoughts, speculations and sentiments and is a valuable source of information for companies, organisations and individual users. This has led to the emergence of the field of sentiment analysis, which deals with the automatic extraction and classification of sentiments expressed in texts. Sentiment analysis has been intensively researched over the last ten years, but there are still many issues to be addressed. One of the main problems is the lack of labelled data necessary to carry out precise supervised sentiment classification. In response, research has moved towards developing semi-supervised and cross-domain techniques. Semi-supervised approaches still need some labelled data and their effectiveness is largely determined by the amount of these data, whereas cross-domain approaches usually perform poorly if training data are very different from test data. The majority of research on sentiment classification deals with the binary classification problem, although for many practical applications this rather coarse sentiment scale is not sufficient. Therefore, it is crucial to design methods which are able to perform accurate multiclass sentiment classification. The aims of this thesis are to address the problem of limited availability of data in sentiment analysis and to advance research in semi-supervised and cross-domain approaches for sentiment classification, considering both binary and multiclass sentiment scales. We adopt graph-based learning as our main method and explore the most popular and widely used graph-based algorithm, label propagation. We investigate various ways of designing sentiment graphs and propose a new similarity measure which is unsupervised, easy to compute, does not require deep linguistic analysis and, most importantly, provides a good estimate for sentiment similarity as proved by intrinsic and extrinsic evaluations. The main contribution of this thesis is the development and evaluation of a graph-based sentiment analysis system that a) can cope with the challenges of limited data availability by using semi-supervised and cross-domain approaches b) is able to perform multiclass classification and c) achieves highly accurate results which are superior to those of most state-of-the-art semi-supervised and cross-domain systems. We systematically analyse and compare semi-supervised and cross-domain approaches in the graph-based framework and propose recommendations for selecting the most pertinent learning approach given the data available. Our recommendations are based on two domain characteristics, domain similarity and domain complexity, which were shown to have a significant impact on semi-supervised and cross-domain performance.
36

Apprentissage à partir du mouvement / Learning from motion

Tokmakov, Pavel 04 June 2018 (has links)
L’apprentissage faiblement supervisé cherche à réduire au minimum l’effort humain requis pour entrainer les modèles de l’état de l’art. Cette technique permet de tirer parti d’une énorme quantité de données. Toutefois, dans la pratique, les méthodes faiblement supervisées sont nettement moins efficaces que celles qui sont totalement supervisées. Plus particulièrement, dans l’apprentissage profond, où les approches de vision par ordinateur sont les plus performantes, elles restent entièrement supervisées, ce qui limite leurs utilisations dans les applications du monde réel. Cette thèse tente tout d’abord de combler le fossé entre les méthodes faiblement supervisées et entièrement supervisées en utilisant l’information de mouvement. Puis étudie le problème de la segmentation des objets en mouvement, en proposant l’une des premières méthodes basées sur l’apprentissage pour cette tâche.Dans une première partie de la thèse, nous nous concentrons sur le problème de la segmentation sémantique faiblement supervisée. Le défi est de capturer de manières précises les bordures des objets et d’éviter les optimums locaux (ex : segmenter les parties les plus discriminantes). Contrairement à la plupart des approches de l’état de l’art, qui reposent sur des images statiques, nous utilisons les données vidéo avec le mouvement de l’objet comme informations importantes. Notre méthode utilise une approche de segmentation vidéo de l’état de l’art pour segmenter les objets en mouvement dans les vidéos. Les masques d’objets approximatifs produits par cette méthode sont ensuite fusionnés avec le modèle de segmentation sémantique appris dans un EM-like framework, afin d’inférer pour les trames vidéo, des labels sémantiques au niveau des pixels. Ainsi, au fur et à mesure que l’apprentissage progresse, la qualité des labels s’améliore automatiquement. Nous intégrons ensuite cette architecture à notre approche basée sur l’apprentissage pour la segmentation de la vidéo afin d’obtenir un framework d’apprentissage complet pour l’apprentissage faiblement supervisé à partir de vidéos.Dans la deuxième partie de la thèse, nous étudions la segmentation vidéo non supervisée, plus précisément comment segmenter tous les objets dans une vidéo qui se déplace indépendamment de la caméra. De nombreux défis tels qu’un grand mouvement de la caméra, des inexactitudes dans l’estimation du flux optique et la discontinuité du mouvement, complexifient la tâche de segmentation. Nous abordons le problème du mouvement de caméra en proposant une méthode basée sur l’apprentissage pour la segmentation du mouvement : un réseau de neurones convolutif qui prend le flux optique comme entrée et qui est entraîné pour segmenter les objets qui se déplacent indépendamment de la caméra. Il est ensuite étendu avec un flux d’apparence et un module de mémoire visuelle pour améliorer la continuité temporelle. Le flux d’apparence tire profit de l’information sémantique qui est complémentaire de l’information de mouvement. Le module de mémoire visuelle est un paramètre clé de notre approche : il combine les sorties des flux de mouvement et d’apparence et agréger une représentation spatio-temporelle des objets en mouvement. La segmentation finale est ensuite produite à partir de cette représentation agrégée. L’approche résultante obtient des performances de l’état de l’art sur plusieurs jeux de données de référence, surpassant la méthode d’apprentissage en profondeur et heuristique simultanée. / Weakly-supervised learning studies the problem of minimizing the amount of human effort required for training state-of-the-art models. This allows to leverage a large amount of data. However, in practice weakly-supervised methods perform significantly worse than their fully-supervised counterparts. This is also the case in deep learning, where the top-performing computer vision approaches remain fully-supervised, which limits their usage in real world applications. This thesis attempts to bridge the gap between weakly-supervised and fully-supervised methods by utilizing motion information. It also studies the problem of moving object segmentation itself, proposing one of the first learning-based methods for this task.We focus on the problem of weakly-supervised semantic segmentation. This is especially challenging due to the need to precisely capture object boundaries and avoid local optima, as for example segmenting the most discriminative parts. In contrast to most of the state-of-the-art approaches, which rely on static images, we leverage video data with object motion as a strong cue. In particular, our method uses a state-of-the-art video segmentation approach to segment moving objects in videos. The approximate object masks produced by this method are then fused with the semantic segmentation model learned in an EM-like framework to infer pixel-level semantic labels for video frames. Thus, as learning progresses, the quality of the labels improves automatically. We then integrate this architecture with our learning-based approach for video segmentation to obtain a fully trainable framework for weakly-supervised learning from videos.In the second part of the thesis we study unsupervised video segmentation, the task of segmenting all the objects in a video that move independently from the camera. This task presents challenges such as strong camera motion, inaccuracies in optical flow estimation and motion discontinuity. We address the camera motion problem by proposing a learning-based method for motion segmentation: a convolutional neural network that takes optical flow as input and is trained to segment objects that move independently from the camera. It is then extended with an appearance stream and a visual memory module to improve temporal continuity. The appearance stream capitalizes on the semantic information which is complementary to the motion information. The visual memory module is the key component of our approach: it combines the outputs of the motion and appearance streams and aggregates a spatio-temporal representation of the moving objects. The final segmentation is then produced based on this aggregated representation. The resulting approach obtains state-of-the-art performance on several benchmark datasets, outperforming the concurrent deep learning and heuristic-based methods.
37

Algoritmos evolutivos para modelos de mistura de gaussianas em problemas com e sem restrições / Evolutionary algorithms for gausian mixture models with and without constraints

Covões, Thiago Ferreira 09 December 2014 (has links)
Nesta tese, são estudados algoritmos para agrupamento de dados, com particular ênfase em Agrupamento de Dados com Restrições, no qual, além dos objetos a serem agrupados, são fornecidos pelo usuário algumas informações sobre o agrupamento desejado. Como fundamentação para o agrupamento, são considerados os modelos de mistura finitos, em especial, com componentes gaussianos, usualmente chamados de modelos de mistura de gaussianas. Dentre os principais problemas que os algoritmos desenvolvidos nesta tese de doutorado buscam tratar destacam-se: (i) estimar parâmetros de modelo de mistura de gaussianas; (ii) como incorporar, de forma eficiente, restrições no processo de aprendizado de forma que tanto os dados quanto as restrições possam ser adicionadas de forma online; (iii) estimar, via restrições derivadas de conceitos pré-determinados sobre os objetos (usualmente chamados de classes), o número de grupos destes conceitos. Como ferramenta para auxiliar no desenvolvimento de soluções para tais problemas, foram utilizados algoritmos evolutivos que operam com mais de uma solução simultaneamente, além de utilizarem informações de soluções anteriores para guiar o processo de busca. Especificamente, foi desenvolvido um algoritmo evolutivo baseado na divisão e união de componentes para a estimação dos parâmetros de um modelo de mistura de gaussianas. Este algoritmo foi comparado com o algoritmo do mesmo gênero considerado estado-da-arte na literatura, apresentando resultados competitivos e necessitando de menos parâmetros e um menor custo computacional. Nesta tese, foram desenvolvidos dois algoritmos que incorporam as restrições no processo de agrupamento de forma online. Ambos os algoritmos são baseados em algoritmos bem-conhecidos na literatura e apresentaram, em comparações empíricas, resultados melhores que seus antecessores. Finalmente, foram propostos dois algoritmos para se estimar o número de grupos por classe. Ambos os algoritmos foram comparados com algoritmos reconhecidos na literatura de agrupamento de dados com restrições, e apresentaram resultados competitivos ou melhores que estes. A estimação bem sucedida do número de grupos por classe pode auxiliar em diversas tarefas de mineração de dados, desde a sumarização dos dados até a decomposição de problemas de classificação em sub-problemas potencialmente mais simples. / In the last decade, researchers have been giving considerable attention to the field of Constrained Clustering. Algorithms in this field assume that along with the objects to be clustered, the user also provides some constraints about which kind of clustering (s)he prefers. In this thesis, two scenarios are studied: clustering with and without constraints. The developments are based on finite mixture models, namely, models with Gaussian components, which are usually called Gaussian Mixture Models (GMMs). In this context the main problems addressed are: (i) parameter estimation of GMMs; (ii) efficiently integrating constraints in the learning process allowing both constraints and the data to be added in the modeling in an online fashion; (iii) estimating, by using constraints derived from pre-determined concepts (usually named classes), the number of clusters per concept. Evolutionary algorithms were adopted to develop solutions for such problems. These algorithms analyze more than one solution simultaneously and use information provided by previous solutions to guide the search process. Specifically, an evolutionary algorithm based on procedures that perform splitting and merging of components to estimate the parameters of a GMM was developed. This algorithm was compared to an algorithm considered as the state-of-the-art in the literature, obtaining competitive results while requiring less parameters and being more computationally efficient. Besides the aforementioned contributions, two algorithms for online constrained clustering were developed. Both algorithms are based on well known algorithms from the literature and get better results than their predecessors. Finally, two algorithms to estimate the number of clusters per class were also developed. Both algorithms were compared to well established algorithms from the literature of constrained clustering, and obtained equal or better results than the ones obtained by the contenders. The successful estimation of the number of clusters per class is helpful to a variety of data mining tasks, such as data summarization and problem decomposition of challenging classification problems.
38

Expansão de recursos para análise de sentimentos usando aprendizado semi-supervisionado / Extending sentiment analysis resources using semi-supervised learning

Brum, Henrico Bertini 23 March 2018 (has links)
O grande volume de dados que temos disponíveis em ambientes virtuais pode ser excelente fonte de novos recursos para estudos em diversas tarefas de Processamento de Linguagem Natural, como a Análise de Sentimentos. Infelizmente é elevado o custo de anotação de novos córpus, que envolve desde investimentos financeiros até demorados processos de revisão. Nossa pesquisa propõe uma abordagem de anotação semissupervisionada, ou seja, anotação automática de um grande córpus não anotado partindo de um conjunto de dados anotados manualmente. Para tal, introduzimos o TweetSentBR, um córpus de tweets no domínio de programas televisivos que possui anotação em três classes e revisões parciais feitas por até sete anotadores. O córpus representa um importante recurso linguístico de português brasileiro, e fica entre os maiores córpus anotados na literatura para classificação de polaridades. Além da anotação manual do córpus, realizamos a implementação de um framework de aprendizado semissupervisionado que faz uso de dados anotados e, de maneira iterativa, expande o mesmo usando dados não anotados. O TweetSentBR, que possui 15:000 tweets anotados é assim expandido cerca de oito vezes. Para a expansão, foram treinados modelos de classificação usando seis classificadores de polaridades, assim como foram avaliados diferentes parâmetros e representações a fim de obter um córpus confiável. Realizamos experimentos gerando córpus expandidos por cada classificador, tanto para a classificação em três polaridades (positiva, neutra e negativa) quanto para classificação binária. Avaliamos os córpus gerados usando um conjunto de held-out e comparamos a FMeasure da classificação usando como treinamento os córpus anotados manualmente e semiautomaticamente. O córpus semissupervisionado que obteve os melhores resultados para a classificação em três polaridades atingiu 62;14% de F-Measure média, superando a média obtida com as avaliações no córpus anotado manualmente (61;02%). Na classificação binária, o melhor córpus expandido obteve 83;11% de F1-Measure média, superando a média obtida na avaliação do córpus anotado manualmente (79;80%). Além disso, simulamos nossa expansão em córpus anotados da literatura, medindo o quão corretas são as etiquetas anotadas semi-automaticamente. Nosso melhor resultado foi na expansão de um córpus de reviews de produtos que obteve FMeasure de 93;15% com dados binários. Por fim, comparamos um córpus da literatura obtido por meio de supervisão distante e nosso framework semissupervisionado superou o primeiro na classificação de polaridades binária em cross-domain. / The high volume of data available in the Internet can be a good resource for studies of several tasks in Natural Language Processing as in Sentiment Analysis. Unfortunately there is a high cost for the annotation of new corpora, involving financial support and long revision processes. Our work proposes an approach for semi-supervised labeling, an automatic annotation of a large unlabeled set of documents starting from a manually annotated corpus. In order to achieve that, we introduced TweetSentBR, a tweet corpora on TV show programs domain with annotation for 3-point (positive, neutral and negative) sentiment classification partially reviewed by up to seven annotators. The corpus is an important linguistic resource for Brazilian Portuguese language and it stands between the biggest annotated corpora for polarity classification. Beyond the manual annotation, we implemented a semi-supervised learning based framework that uses this labeled data and extends it using unlabeled data. TweetSentBR corpus, containing 15:000 documents, had its size augmented in eight times. For the extending process, we trained classification models using six polarity classifiers, evaluated different parameters and representation schemes in order to obtain the most reliable corpora. We ran experiments generating extended corpora for each classifier, both for 3-point and binary classification. We evaluated the generated corpora using a held-out subset and compared the obtained F-Measure values with the manually and the semi-supervised annotated corpora. The semi-supervised corpus that obtained the best values for 3-point classification achieved 62;14% on average F-Measure, overcoming the results obtained by the same classification with the manually annotated corpus (61;02%). On binary classification, the best extended corpus achieved 83;11% on average F-Measure, overcoming the results on the manually corpora (79;80%). Furthermore, we simulated the extension of labeled corpora in literature, measuring how well the semi-supervised annotation works. Our best results were in the extension of a product review corpora, achieving 93;15% on F1-Measure. Finally, we compared a literature corpus which was labeled by using distant supervision with our semi-supervised corpus, and this overcame the first in binary polarity classification on cross-domain data.
39

Impacto da geração de grafos na classificação semissupervisionada / Impact of graph construction on semi-supervised classification

Sousa, Celso André Rodrigues de 18 July 2013 (has links)
Uma variedade de algoritmos de aprendizado semissupervisionado baseado em grafos e métodos de geração de grafos foram propostos pela comunidade científica nos últimos anos. Apesar de seu aparente sucesso empírico, a área de aprendizado semissupervisionado carece de um estudo empírico detalhado que avalie o impacto da geração de grafos na classificação semissupervisionada. Neste trabalho, é provido tal estudo empírico. Para tanto, combinam-se uma variedade de métodos de geração de grafos com uma variedade de algoritmos de aprendizado semissupervisionado baseado em grafos para compará-los empiricamente em seis bases de dados amplamente usadas na literatura de aprendizado semissupervisionado. Os algoritmos são avaliados em tarefas de classificação de dígitos, caracteres, texto, imagens e de distribuições gaussianas. A avaliação experimental proposta neste trabalho é subdividida em quatro partes: (1) análise de melhor caso; (2) avaliação da estabilidade dos classificadores semissupervisionados; (3) avaliação do impacto da geração de grafos na classificação semissupervisionada; (4) avaliação da influência dos parâmetros de regularização no desempenho de classificação dos classificadores semissupervisionados. Na análise de melhor caso, avaliam-se as melhores taxas de erro de cada algoritmo semissupervisionado combinado com os métodos de geração de grafos usando uma variedade de valores para o parâmetro de esparsificação, o qual está relacionado ao número de vizinhos de cada exemplo de treinamento. Na avaliação da estabilidade dos classificadores, avalia-se a estabilidade dos classificadores semissupervisionados combinados com os métodos de geração de grafos usando uma variedade de valores para o parâmetro de esparsificação. Para tanto, fixam-se os valores dos parâmetros de regularização (quando existirem) que geraram os melhores resultados na análise de melhor caso. Na avaliação do impacto da geração de grafos, avaliam-se os métodos de geração de grafos combinados com os algoritmos de aprendizado semissupervisionado usando uma variedade de valores para o parâmetro de esparsificação. Assim como na avaliação da estabilidade dos classificadores, para esta avaliação, fixam-se os valores dos parâmetros de regularização (quando existirem) que geraram os melhores resultados na análise de melhor caso. Na avaliação da influência dos parâmetros de regularização na classificação semissupervisionada, avaliam-se as superfícies de erro geradas pelos classificadores semissupervisionados em cada grafo e cada base de dados. Para tanto, fixam-se os grafos que geraram os melhores resultados na análise de melhor caso e variam-se os valores dos parâmetros de regularização. O intuito destes experimentos é avaliar o balanceamento entre desempenho de classificação e estabilidade dos algoritmos de aprendizado semissupervisionado baseado em grafos numa variedade de métodos de geração de grafos e valores de parâmetros (de esparsificação e de regularização, se houver). A partir dos resultados obtidos, pode-se concluir que o grafo k- vizinhos mais próximos mútuo (mutKNN) pode ser a melhor opção dentre os métodos de geração de grafos de adjacência, enquanto que o kernel RBF pode ser a melhor opção dentre os métodos de geração de matrizes ponderadas. Em adição, o grafo mutKNN tende a gerar superfícies de erro que são mais suaves que aquelas geradas pelos outros métodos de geração de grafos de adjacência. Entretanto, o grafo mutKNN é instável para valores relativamente pequenos de k. Os resultados obtidos neste trabalho indicam que o desempenho de classificação dos algoritmos semissupervisionados baseados em grafos é fortemente influenciado pela configuração de parâmetros. Poucos padrões evidentes foram encontrados para auxiliar o processo de seleção de parâmetros. As consequências dessa instabilidade são discutidas neste trabalho em termos de pesquisa e aplicações práticas / A variety of graph-based semi-supervised learning algorithms have been proposed by the research community in the last few years. Despite its apparent empirical success, the field of semi-supervised learning lacks a detailed empirical study that evaluates the influence of graph construction on semisupervised learning. In this work we provide such an empirical study. For such purpose, we combine a variety of graph construction methods with a variety of graph-based semi-supervised learning algorithms in order to empirically compare them in six benchmark data sets widely used in the semi-supervised learning literature. The algorithms are evaluated in tasks about digit, character, text, and image classification as well as classification of gaussian distributions. The experimental evaluation proposed in this work is subdivided into four parts: (1) best case analysis; (2) evaluation of classifiers stability; (3) evaluation of the influence of graph construction on semi-supervised learning; (4) evaluation of the influence of regularization parameters on the classification performance of semi-supervised learning algorithms. In the best case analysis, we evaluate the lowest error rates of each semi-supervised learning algorithm combined with the graph construction methods using a variety of sparsification parameter values. Such parameter is associated with the number of neighbors of each training example. In the evaluation of classifiers stability, we evaluate the stability of the semi-supervised learning algorithms combined with the graph construction methods using a variety of sparsification parameter values. For such purpose, we fixed the regularization parameter values (if any) with the values that achieved the best result in the best case analysis. In the evaluation of the influence of graph construction, we evaluate the graph construction methods combined with the semi-supervised learning algorithms using a variety of sparsification parameter values. In this analysis, as occurred in the evaluation of classifiers stability, we fixed the regularization parameter values (if any) with the values that achieved the best result in the best case analysis. In the evaluation of the influence of regularization parameters on the classification performance of semi-supervised learning algorithms, we evaluate the error surfaces generated by the semi-supervised classifiers in each graph and data set. For such purpose, we fixed the graphs that achieved the best results in the best case analysis and varied the regularization parameters values. The intention of our experiments is evaluating the trade-off between classification performance and stability of the graphbased semi-supervised learning algorithms in a variety of graph construction methods as well as parameter values (sparsification and regularization, if applicable). From the obtained results, we conclude that the mutual k-nearest neighbors (mutKNN) graph may be the best choice for adjacency graph construction while the RBF kernel may be the best choice for weighted matrix generation. In addition, mutKNN tends to generate error surfaces that are smoother than those generated by other adjacency graph construction methods. However, mutKNN is unstable for relatively small values of k. Our results indicate that the classification performance of the graph-based semi-supervised learning algorithms are heavily influenced by parameter setting. We found just a few evident patterns that could help parameter selection. The consequences of such instability are discussed in this work in research and practice
40

Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections / Agrupamento hierárquico semissupervisionado ativo baseado em confiança e sua aplicação para extração de hierarquias de tópicos a partir de coleções de documentos

Nogueira, Bruno Magalhães 16 December 2013 (has links)
Topic hierarchies are efficient ways of organizing document collections. These structures help users to manage the knowledge contained in textual data. These hierarchies are usually obtained through unsupervised hierarchical clustering algorithms. By not considering the context of the user in the formation of the hierarchical groups, unsupervised topic hierarchies may not attend the user\'s expectations in some cases. One possible solution for this problem is to employ semi-supervised clustering algorithms. These algorithms incorporate the user\'s knowledge through the usage of constraints to the clustering process. However, in the context of semi-supervised hierarchical clustering, the works in the literature do not efficient explore the selection of cases (instances or cluster) to add constraints, neither the interaction of the user with the clustering process. In this sense, in this work we introduce two semi-supervised hierarchical clustering algorithms: HCAC (Hierarchical Confidence-based Active Clustering) and HCAC-LC (Hierarchical Confidence-based Active Clustering with Limited Constraints). These algorithms employ an active learning approach based in the confidence of cluster merges. When a low confidence merge is detected, the user is invited to decide, from a pool of candidate pairs of clusters, the best cluster merge in that point. In this work, we employ HCAC and HCAC-LC in the extraction of topic hierarchies through the SMITH framework, which is also proposed in this thesis. This framework provides a series of well defined activities that allow the user\'s interaction in the generation of topic hierarchies. The active learning approach used in the HCAC-based algorithms, the kind of queries employed in these algorithms, as well as the SMITH framework for the generation of semi-supervised topic hierarchies are innovations to the state of the art proposed in this thesis. Our experimental results indicate that HCAC and HCAC-LC outperform other semi-supervised hierarchical clustering algorithms in diverse scenarios. The results also indicate that semi-supervised topic hierarchies obtained through the SMITH framework are more intuitive and easier to navigate than unsupervised topic hierarchies / Hierarquias de tópicos são formas eficientes de organização de coleções de documentos, auxiliando usuários a gerir o conhecimento materializado nessas publicações textuais. Tais hierarquias são usualmente construídas por meio de algoritmos de agrupamento hierárquico não supervisionado. Entretanto, por não considerarem o contexto do usuário na formação dos grupos, hierarquias de tópicos não supervisionadas nem sempre conseguem atender as suas expectativas. Uma solução para este problema e o emprego de algoritmos de agrupamento semissupervisionado, os quais incorporam o conhecimento de domínio do usuário por meio de restrições. Entretanto, para o contexto de agrupamento hierárquico semissupervisionado, não são eficientemente explorados na literatura métodos de seleção de casos (instâncias ou grupos) para receber restrições, bem como não há formas eficientes de interação do usuário com o processo de agrupamento hierárquico. Dessa maneira, neste trabalho, dois algoritmos de agrupamento hierárquico semissupervisionado são propostos: HCAC (Hierarchical Confidence-based Active Clustering) e HCAC-LC (Hierarchical Confidence-based Active Clustering with Limited Constraints). Estes algoritmos empregam uma abordagem de aprendizado ativo baseado na confiança de uma junção de clusters. Quando uma junção de baixa confiança e detectada, o usuário e convidado a decidir, em um conjunto de pares de grupos candidatos, a melhor junção naquele ponto. Estes algoritmos são aqui utilizados na extração de hierarquias de tópicos por meio do framework SMITH, também proposto nesse trabalho. Este framework fornece uma série de atividades bem definidas que possibilitam a interação do usuário para a obtenção de hierarquias de tópicos. A abordagem de aprendizado ativo utilizado nos algoritmos HCAC e HCAC-LC, o tipo de restrição utilizada nestes algoritmos, bem como o framework SMITH para obtenção de hierarquias de tópicos semissupervisionadas são inovações ao estado da arte propostos neste trabalho. Os resultados obtidos indicam que os algoritmos HCAC e HCAC-LC superam o desempenho de outros algoritmos hierárquicos semissupervisionados em diversos cenários. Os resultados também indicam que hierarquias de tópico semissupervisionadas obtidas por meio do framework SMITH são mais intuitivas e fáceis de navegar do que aquelas não supervisionadas

Page generated in 0.1107 seconds