Spelling suggestions: "subject:"semisupervised"" "subject:"semissupervised""
111 |
Semi-Supervised Learning for Object DetectionRosell, Mikael January 2015 (has links)
Many automotive safety applications in modern cars make use of cameras and object detection to analyze the surrounding environment. Pedestrians, animals and other vehicles can be detected and safety actions can be taken before dangerous situations arise. To detect occurrences of the different objects, these systems are traditionally trained to learn a classification model using a set of images that carry labels corresponding to their content. To obtain high performance with a variety of object appearances, the required amount of data is very large. Acquiring unlabeled images is easy, while the manual work of labeling is both time-consuming and costly. Semi-supervised learning refers to methods that utilize both labeled and unlabeled data, a situation that is highly desirable if it can lead to improved accuracy and at the same time alleviate the demand of labeled data. This has been an active area of research in the last few decades, but few studies have investigated the performance of these algorithms in larger systems. In this thesis, we investigate if and how semi-supervised learning can be used in a large-scale pedestrian detection system. With the area of application being automotive safety, where real-time performance is of high importance, the work is focused around boosting classifiers. Results are presented on a few publicly available UCI data sets and on a large data set for pedestrian detection captured in real-life traffic situations. By evaluating the algorithms on the pedestrian data set, we add the complexity of data set size, a large variety of object appearances and high input dimension. It is possible to find situations in low dimensions where an additional set of unlabeled data can be used successfully to improve a classification model, but the results show that it is hard to efficiently utilize semi-supervised learning in large-scale object detection systems. The results are hard to scale to large data sets of higher dimensions as pair-wise computations are of high complexity and proper similarity measures are hard to find.
|
112 |
Traitement des dossiers refusés dans le processus d'octroi de crédit aux particuliers. / Reject inference in the process for granting credit.Guizani, Asma 19 March 2014 (has links)
Le credit scoring est généralement considéré comme une méthode d’évaluation du niveau du risque associé à un dossier de crédit potentiel. Cette méthode implique l'utilisation de différentes techniques statistiques pour aboutir à un modèle de scoring basé sur les caractéristiques du client.Le modèle de scoring estime le risque de crédit en prévoyant la solvabilité du demandeur de crédit. Les institutions financières utilisent ce modèle pour estimer la probabilité de défaut qui va être utilisée pour affecter chaque client à la catégorie qui lui correspond le mieux: bon payeur ou mauvais payeur. Les seules données disponibles pour construire le modèle de scoring sont les dossiers acceptés dont la variable à prédire est connue. Ce modèle ne tient pas compte des demandeurs de crédit rejetés dès le départ ce qui implique qu'on ne pourra pas estimer leurs probabilités de défaut, ce qui engendre un biais de sélection causé par la non-représentativité de l'échantillon. Nous essayons dans ce travail en utilisant l'inférence des refusés de remédier à ce biais, par la réintégration des dossiers refusés dans le processus d'octroi de crédit. Nous utilisons et comparons différentes méthodes de traitement des refusés classiques et semi supervisées, nous adaptons certaines à notre problème et montrons sur un jeu de données réel, en utilisant les courbes ROC confirmé par simulation, que les méthodes semi-supervisé donnent de bons résultats qui sont meilleurs que ceux des méthodes classiques. / Credit scoring is generally considered as a method of evaluation of a risk associated with a potential loan applicant. This method involves the use of different statistical techniques to determine a scoring model. Like any statistical model, scoring model is based on historical data to help predict the creditworthiness of applicants. Financial institutions use this model to assign each applicant to the appropriate category : Good payer or Bad payer. The only data used to build the scoring model are related to the accepted applicants in which the predicted variable is known. The method has the drawback of not estimating the probability of default for refused applicants which means that the results are biased when the model is build on only the accepted data set. We try, in this work using the reject inference, to solve the problem of selection bias, by reintegrate reject applicants in the process of granting credit. We use and compare different methods of reject inference, classical methods and semi supervised methods, we adapt some of them to our problem and show, on a real dataset, using ROC curves, that the semi-supervised methods give good results and are better than classical methods. We confirmed our results by simulation.
|
113 |
Contextualisation d'un détecteur de piétons : application à la surveillance d'espaces publics / Contextualization of a pedestrian detector : application to the monitoring of public spacesChesnais, Thierry 24 June 2013 (has links)
La démocratisation de la « vidéosurveillance intelligente » nécessite le développement d’outils automatiques et temps réel d’analyse vidéo. Parmi ceux-ci, la détection de piétons joue un rôle majeur car de nombreux systèmes reposent sur cette technologie. Les approches classiques de détection de piétons utilisent la reconnaissance de formes et l’apprentissage statistique. Elles souffrent donc d’une dégradation des performances quand l’apparence des piétons ou des éléments de la scène est trop différente de celle étudiée lors de l’apprentissage. Pour y remédier, une solution appelée « contextualisation du détecteur » est étudiée lorsque la caméra est fixe. L’idée est d’enrichir le système à l’aide d’informations provenant de la scène afin de l’adapter aux situations qu’il risque de fréquemment rencontrer. Ce travail a été réalisé en deux temps. Tout d’abord, l’architecture d’un détecteur et les différents outils utiles à sa construction sont présentés dans un état de l’art. Puis la problématique de la contextualisation est abordée au travers de diverses expériences validant ou non les pistes d’amélioration envisagées. L’objectif est d’identifier toutes les briques du système pouvant bénéficier de cet apport afin de contextualiser complètement le détecteur. Pour faciliter l’exploitation d’un tel système, la contextualisation a été entièrement automatisée et s’appuie sur des algorithmes d’apprentissage semi-supervisé. Une première phase consiste à collecter le maximum d’informations sur la scène. Différents oracles sont proposés afin d’extraire l’apparence des piétons et des éléments du fond pour former une base d’apprentissage dite contextualisée. La géométrie de la scène, influant sur la taille et l’orientation des piétons, peut ensuite être analysée pour définir des régions, dans lesquelles les piétons, tout comme le fond, restent visuellement proches. Dans la deuxième phase, toutes ces connaissances sont intégrées dans le détecteur. Pour chaque région, un classifieur est construit à l’aide de la base contextualisée et fonctionne indépendamment des autres. Ainsi chaque classifieur est entraîné avec des données ayant la même apparence que les piétons qu’il devra détecter. Cela simplifie le problème de l’apprentissage et augmente significativement les performances du système. / With the rise of videosurveillance systems comes a logical need for automatic and real-time processes to analyze the huge amount of generated data. Among these tools, pedestrian detection algorithms are essential, because in videosurveillance locating people is often the first step leading to more complex behavioral analyses. Classical pedestrian detection approaches are based on machine learning and pattern recognition algorithms. Thus they generally underperform when the pedestrians’ appearance observed by a camera tends to differ too much from the one in the generic training dataset. This thesis studies the concept of the contextualization of such a detector. This consists in introducing scene information into a generic pedestrian detector. The main objective is to adapt it to the most frequent situations and so to improve its overall performances. The key hypothesis made here is that the camera is static, which is common in videosurveillance scenarios.This work is split into two parts. First a state of the art introduces the architecture of a pedestrian detector and the different algorithms involved in its building. Then the problem of the contextualization is tackled and a series of experiments validates or not the explored leads. The goal is to identify every part of the detector which can benefit from the approach in order to fully contextualize it. To make the contextualization process easier, our method is completely automatic and is based on semi-supervised learning methods. First of all, data coming from the scene are gathered. We propose different oracles to detect some pedestrians in order to catch their appearance and to form a contextualized training dataset. Then, we analyze the scene geometry, which influences the size and the orientation of the pedestrians and we divide the scene into different regions. In each region, pedestrians as well as background elements share a similar appearance.In the second step, all this information is used to build the final detector which is composed of several classifiers, one by region. Each classifier independently scans its dedicated piece of image. Thus, it is only trained with a region-specific contextualized dataset, containing less appearance variability than a global one. Consequently, the training stage is easier and the overall detection results on the scene are improved.
|
114 |
Detecção de contradições em um sistema de aprendizado sem fimOliverio, Vinicius 29 June 2012 (has links)
Made available in DSpace on 2016-06-02T19:05:58Z (GMT). No. of bitstreams: 1
4650.pdf: 1170921 bytes, checksum: a9b0c215ae5f2804a8bbdd8bd2267ca8 (MD5)
Previous issue date: 2012-06-29 / Universidade Federal de Sao Carlos / NELL (Never Ending Language Learning) is a system that seeks to learn in an infinite way, extracting structured information from unstructured web pages using the semi-supervised learning paradigm as one of its basic principles. The Read the Web (RTW) project is the project where the NELL system is contained, actually it consists of 5 modules, all of them working independently where one of the modules is called Rule Learner (RL). The RL is responsible for inducing first order rules, which are used by the system to identify patterns in the knowledge generated by the other four components of the system. These rules are induced and then represented in a syntax that has Horn clauses as base. These rules can present contradictions, and in this context this paper proposes investigate, develop and implement methods to detect and solve these contradictions so that the system can learn in a more efficient way / O NELL (Never Ending Language Learning) é um sistema que busca aprender de uma maneira contínua, extraindo informação estruturada de páginas web desestruturadas utilizando o paradigma de aprendizagem semissupervisionado como um de seus princípios básicos. O Read the Web (RTW) é o projeto no qual o sistema NELL se insere. Atualmente o NELL possui cinco módulos, todos eles trabalhando independentemente onde um desses módulos é chamado Rule Learner (RL). O RL é responsável por induzir regras de primeira ordem, as quais são utilizadas pelo sistema para identificar padrões presentes no conhecimento gerado pelos outros quatro componentes do sistema. Estas regras são induzidas e, na sequência, representadas através de uma sintaxe que tem cláusulas de Horn como base. Tais regras podem apresentar contradições, e neste contexto o presente trabalho propõe a investigação, desenvolvimento e implementação de métodos para detectar e resolver estas contradições de maneira a fazer a aprendizagem mais eficiente.
|
115 |
Aprendizado semi-supervisionado utilizando modelos de caminhada de partículas em grafos / Semi-supervised learning using walking particles model in graphsGuerreiro, Lucas [UNESP] 01 September 2017 (has links)
Submitted by Lucas Guerreiro null (lucasg@rc.unesp.br) on 2017-10-16T22:03:24Z
No. of bitstreams: 1
LucasGuerreiro_dissertacao.pdf: 2072249 bytes, checksum: 03cb08b42175616dd567a364cf201bcd (MD5) / Approved for entry into archive by Monique Sasaki (sayumi_sasaki@hotmail.com) on 2017-10-18T18:42:00Z (GMT) No. of bitstreams: 1
guerreiro_l_me_sjrp.pdf: 2072249 bytes, checksum: 03cb08b42175616dd567a364cf201bcd (MD5) / Made available in DSpace on 2017-10-18T18:42:00Z (GMT). No. of bitstreams: 1
guerreiro_l_me_sjrp.pdf: 2072249 bytes, checksum: 03cb08b42175616dd567a364cf201bcd (MD5)
Previous issue date: 2017-09-01 / O Aprendizado de Máquina é uma área que vem crescendo nos últimos anos e é um dos destaques dentro do campo de Inteligência Artificial. Atualmente, uma das subáreas mais estudadas é o Aprendizado Semi-Supervisionado, principalmente pela sua característica de ter um menor custo na rotulação de dados de exemplo. A categoria de modelos baseados em grafos é a mais ativa nesta subárea, fazendo uso de estruturas de redes complexas. O algoritmo de competição e cooperação entre partículas é uma das técnicas deste domínio. O algoritmo provê acurácia de classificação compatível com a de algoritmos do estado da arte, e oferece um custo computacional inferior à maioria dos métodos da mesma categoria. Neste trabalho é apresentado um estudo sobre Aprendizado Semi-Supervisionado, com ênfase em modelos baseados em grafos e, em particular, no Algoritmo de Competição e Cooperação entre Partículas (PCC). O objetivo deste trabalho é propor um novo algoritmo de competição e cooperação entre partículas baseado neste modelo, com mudanças na caminhada pelo grafo, com informações de dominância sendo mantidas nas arestas ao invés dos nós; as quais possam melhorar a acurácia de classificação ou ainda o tempo de execução em alguns cenários. É proposta também uma metodologia de avaliação da rede obtida com o modelo de competição e cooperação entre partículas, para se identificar a melhor métrica de distância a ser aplicada em cada caso. Nos experimentos apresentados neste trabalho, pode ser visto que o algoritmo proposto teve melhor acurácia do que o PCC em algumas bases de dados, enquanto o método de avaliação de métricas de distância atingiu também bom nível de precisão na maioria dos casos. / Machine Learning is an increasing area over the last few years and it is one of the highlights in Artificial Intelligence area. Nowadays, one of the most studied areas is Semi-supervised learning, mainly due to its characteristic of lower cost in labeling sample data. The most active category in this subarea is that of graph-based models, using complex networks concepts. The Particle Competition and Cooperation in Networks algorithm (PCC) is one of the techniques in this field. The algorithm provides accuracy compatible with state of the art algorithms, and it presents a lower computational cost when compared to most techniques in the same category. In this project, it is presented a research about semi-supervised learning, with focus on graphbased models and, in special, the Particle Competition and Cooperation in Networks algorithm. The objective of this study is to base proposals of new particle competition and cooperation algorithms based on this model, with new dynamics on the graph walking, keeping dominance information on the edges instead of the nodes; which may improve the accuracy classification or yet the runtime in some situations. It is also proposed a method of evaluation of the network built with the Particle Competition and Cooperation model, in order to infer the best distance metric to be used in each case. In the experiments presented in this work, it can be seen that the proposed algorithm presented better accuracy when compared to the PCC for some datasets, while the proposed distance metrics evaluation achieved a high precision level in most cases.
|
116 |
Classificadores baseados em vetores de suporte gerados a partir de dados rotulados e não-rotulados. / Learning support vector machines from labeled and unlabeled data.Clayton Silva Oliveira 30 March 2006 (has links)
Treinamento semi-supervisionado é uma metodologia de aprendizado de máquina que conjuga características de treinamento supervisionado e não-supervisionado. Ela se baseia no uso de bases semi-rotuladas (bases contendo dados rotulados e não-rotulados) para o treinamento de classificadores. A adição de dados não-rotulados, mais baratos e geralmente disponíveis em maior quantidade do que os dados rotulados, pode aumentar o desempenho e/ou baratear o custo de treinamento desses classificadores (a partir da diminuição da quantidade de dados rotulados necessários). Esta dissertação analisa duas estratégias para se executar treinamento semi-supervisionado, especificamente em Support Vector Machines (SVMs): formas direta e indireta. A estratégia direta é atualmente mais conhecida e estudada, e permite o uso de dados rotulados e não-rotulados, ao mesmo tempo, em tarefas de aprendizagem de classificadores. Entretanto, a inclusão de muitos dados não-rotulados pode tornar o treinamento demasiadamente lento. Já a estratégia indireta é mais recente, sendo capaz de agregar os benefícios do treinamento semi-supervisionado direto com tempos menores para o aprendizado de classificadores. Esta opção utiliza os dados não-rotulados para pré-processar a base de dados previamente à tarefa de aprendizagem do classificador, permitindo, por exemplo, a filtragem de eventuais ruídos e a reescrita da base em espaços de variáveis mais convenientes. Dentro do escopo da forma indireta, está a principal contribuição dessa dissertação: idealização, implementação e análise do algoritmo split learning. Foram obtidos ótimos resultados com esse algoritmo, que se mostrou eficiente em treinar SVMs de melhor desempenho e em períodos menores a partir de bases semi-rotuladas. / Semi-supervised learning is a machine learning methodology that mixes features of supervised and unsupervised learning. It allows the use of partially labeled databases (databases with labeled and unlabeled data) to train classifiers. The addition of unlabeled data, which are cheaper and generally more available than labeled data, can enhance the performance and/or decrease the costs of learning such classifiers (by diminishing the quantity of required labeled data). This work analyzes two strategies to perform semi-supervised learning, specifically with Support Vector Machines (SVMs): direct and indirect concepts. The direct strategy is currently more popular and studied; it allows the use of labeled and unlabeled data, concomitantly, in learning classifiers tasks. However, the addition of many unlabeled data can lead to very long training times. The indirect strategy is more recent; it is able to attain the advantages of the direct semi-supervised learning with shorter training times. This alternative uses the unlabeled data to pre-process the database prior to the learning task; it allows denoising and rewriting the data in better feature espaces. The main contribution of this Master thesis lies within the indirect strategy: conceptualization, experimentation, and analysis of the split learning algorithm, that can be used to perform indirect semi-supervised learning using SVMs. We have obtained promising empirical results with this algorithm, which is efficient to train better performance SVMs in shorter times from partially labeled databases.
|
117 |
Abordagens para aprendizado semissupervisionado multirrótulo e hierárquico / Multi-label and hierarchical semi-supervised learning approachesJean Metz 25 October 2011 (has links)
A tarefa de classificação em Aprendizado de Máquina consiste da criação de modelos computacionais capazes de identificar automaticamente a classe de objetos pertencentes a um domínio pré-definido a partir de um conjunto de exemplos cuja classe é conhecida. Existem alguns cenários de classificação nos quais cada objeto pode estar associado não somente a uma classe, mas a várias classes ao mesmo tempo. Adicionalmente, nesses cenários denominados multirrótulo, as classes podem ser organizadas em uma taxonomia que representa as relações de generalização e especialização entre as diferentes classes, definindo uma hierarquia de classes, o que torna a tarefa de classificação ainda mais específica, denominada classificação hierárquica. Os métodos utilizados para a construção desses modelos de classificação são complexos e dependem fortemente da disponibilidade de uma quantidade expressiva de exemplos previamente classificados. Entretanto, para muitas aplicações é difícil encontrar um número significativo desses exemplos. Além disso, com poucos exemplos, os algoritmos de aprendizado supervisionado não são capazes de construir modelos de classificação eficazes. Nesses casos, é possível utilizar métodos de aprendizado semissupervisionado, cujo objetivo é aprender as classes do domínio utilizando poucos exemplos conhecidos conjuntamente com um número considerável de exemplos sem a classe especificada. Neste trabalho são propostos, entre outros, métodos que fazem uso do aprendizado semissupervisionado baseado em desacordo coperspectiva, tanto para a tarefa de classificação multirrótulo plana quanto para a tarefa de classificação hierárquica. São propostos, também, outros métodos que utilizam o aprendizado ativo com intuito de melhorar a performance de algoritmos de classificação semissupervisionada. Além disso, são propostos dois métodos para avaliação de algoritmos multirrótulo e hierárquico, os quais definem estratégias para identificação dos multirrótulos majoritários, que são utilizados para calcular os valores baseline das medidas de avaliação. Foi desenvolvido um framework para realizar a avaliação experimental da classificação hierárquica, no qual foram implementados os métodos propostos e um módulo completo para realizar a avaliação experimental de algoritmos hierárquicos. Os métodos propostos foram avaliados e comparados empiricamente, considerando conjuntos de dados de diversos domínios. A partir da análise dos resultados observa-se que os métodos baseados em desacordo não são eficazes para tarefas de classificação complexas como multirrótulo e hierárquica. Também é observado que o problema central de degradação do modelo dos algoritmos semissupervisionados agrava-se nos casos de classificação multirrótulo e hierárquica, pois, nesses casos, há um incremento nos fatores responsáveis pela degradação nos modelos construídos utilizando aprendizado semissupervisionado baseado em desacordo coperspectiva / In machine learning, the task of classification consists on creating computational models that are able to automatically identify the class of objects belonging to a predefined domain from a set of examples whose class is known a priori. There are some classification scenarios in which each object can be associated to more than one class at the same time. Moreover, in such multilabeled scenarios, classes can be organized in a taxonomy that represents the generalization and specialization relationships among the different classes, which defines a class hierarchy, making the classification task, known as hierarchical classification, even more specific. The methods used to build such classification models are complex and highly dependent on the availability of an expressive quantity of previously classified examples. However, for a large number of applications, it is difficult to find a significant number of such examples. Moreover, when few examples are available, supervised learning algorithms are not able to build efficient classification models. In such situations it is possible to use semi-supervised learning, whose aim is to learn the classes of the domain using a few classified examples in conjunction to a considerable number of examples with no specified class. In this work, we propose methods that use the co-perspective disagreement based learning approach for both, the flat multilabel classification and the hierarchical classification tasks, among others. We also propose other methods that use active learning, aiming at improving the performance of semi-supervised learning algorithms. Additionally, two methods for the evaluation of multilabel and hierarchical learning algorithms are proposed. These methods define strategies for the identification of the majority multilabels, which are used to estimate the baseline evaluation measures. A framework for the experimental evaluation of the hierarchical classification was developed. This framework includes the implementations of the proposed methods as well as a complete module for the experimental evaluation of the hierarchical algorithms. The proposed methods were empirically evaluated considering datasets from various domains. From the analysis of the results, it can be observed that the methods based on co-perspective disagreement are not effective for complex classification tasks, such as the multilabel and hierarchical classification. It can also be observed that the main degradation problem of the models of the semi-supervised algorithms worsens for the multilabel and hierarchical classification due to the fact that, for these cases, there is an increase in the causes of the degradation of the models built using semi-supervised learning based on co-perspective disagreement
|
118 |
Vers un système interactif de structuration des index pour une recherche par le contenu dans des grandes bases d'images / Towards an interactive index structuring system for content-based image retrieval in large image databasesLai, Hien Phuong 02 October 2013 (has links)
Cette thèse s'inscrit dans la problématique de l'indexation et la recherche d'images par le contenu dans des bases d'images volumineuses. Les systèmes traditionnels de recherche d'images par le contenu se composent généralement de trois étapes : l'indexation, la structuration et la recherche. Dans le cadre de cette thèse, nous nous intéressons plus particulièrement à l'étape de structuration qui vise à organiser, dans une structure de données, les signatures visuelles des images extraites dans la phase d'indexation afin de faciliter, d'accélérer et d'améliorer les résultats de la recherche ultérieure. A la place des méthodes traditionnelles de structuration, nous étudions les méthodes de regroupement des données (clustering) qui ont pour but d'organiser les signatures en groupes d'objets homogènes (clusters), sans aucune contrainte sur la taille des clusters, en se basant sur la similarité entre eux. Afin de combler le fossé sémantique entre les concepts de haut niveau sémantique exprimés par l'utilisateur et les signatures de bas niveau sémantique extraites automatiquement dans la phase d'indexation, nous proposons d'impliquer l'utilisateur dans la phase de clustering pour qu'il puisse interagir avec le système afin d'améliorer les résultats du clustering, et donc améliorer les résultats de la recherche ultérieure. En vue d'impliquer l'utilisateur dans la phase de clustering, nous proposons un nouveau modèle de clustering semi-supervisé interactif en utilisant les contraintes par paires (must-link et cannot-link) entre les groupes d'images. Tout d'abord, les images sont regroupées par le clustering non supervisé BIRCH (Zhang et al., 1996). Ensuite, l'utilisateur est impliqué dans la boucle d'interaction afin d'aider le clustering. Pour chaque itération interactive, l'utilisateur visualise les résultats de clustering et fournit des retours au système via notre interface interactive. Par des simples cliques, l'utilisateur peut spécifier les images positives ainsi que les images négatives pour chaque cluster. Il peut aussi glisser les images entre les clusters pour demander de changer l'affectation aux clusters des images. Les contraintes par paires sont ensuite déduites en se basant sur les retours de l'utilisateur ainsi que les informations de voisinage. En tenant compte de ces contraintes, le système réorganise les clusters en utilisant la méthode de clustering semi-supervisé proposée dans cette thèse. La boucle d'interaction peut être répétée jusqu'à ce que le résultat du clustering satisfasse l'utilisateur. Différentes stratégies pour déduire les contraintes par paires entre les images sont proposées. Ces stratégies sont analysées théoriquement et expérimentalement. Afin d'éviter que les résultats expérimentaux dépendent subjectivement de l'utilisateur humain, un agent logiciel simulant le comportement de l'utilisateur humain pour donner des retours est utilisé pour nos expérimentations. En comparant notre méthode avec la méthode de clustering semi-supervisé la plus populaire HMRF-kmeans (Basu et al., 2004), notre méthode donne de meilleurs résultats. / This thesis deals with the problem of Content-Based Image Retrieval (CBIR) on large image databases. Traditional CBIR systems generally rely on three phases : feature extraction, feature space structuring and retrieval. In this thesis, we are particularly interested in the structuring phase, which aims at organizing the visual feature descriptors of all images into an efficient data structure in order to facilitate, accelerate and improve further retrieval. The visual feature descriptor of each image is extracted from the feature extraction phase. Instead of traditional structuring methods, clustering methods which aim at organizing image descriptors into groups of similar objects (clusters), without any constraint on the cluster size, are studied. In order to reduce the “semantic gap” between high-level semantic concepts expressed by the user and the low-level features automatically extracted from the images, we propose to involve the user in the clustering phase so that he/she can interact with the system so as to improve the clustering results, and thus improve the results of further retrieval. With the aim of involving the user in the clustering phase, we propose a new interactive semi-supervised clustering model based on pairwise constraints (must-link and cannot-link) between groups of images. Firstly, images are organized into clusters by using the unsupervised clustering method BIRCH (Zhang et al., 1996). Then the user is involved into the interaction loop in order to guide the clustering process. In each interactive iteration, the user visualizes the clustering results and provide feedback to the system via our interactive interface. With some simple clicks, the user can specify the positive and/or negative images for each cluster. The user can also drag and drop images between clusters in order to change the cluster assignment of some images. Pairwise constraints are then deduced based on the user feedback as well as the neighbourhood information. By taking into account these constraints, the system re-organizes the data set, using the semi-supervised clustering proposed in this thesis. The interaction loop can be iterated until the clustering result satisfies the user. Different strategies for deducing pairwise constraints are proposed. These strategies are theoretically and experimentally analyzed. In order to avoid the subjective dependence of the clustering results on the human user, a software agent simulating the behaviour of the human user for providing feedback to the system is used in our experiments. By comparing our method with the most popular semi-supervised clustering HMRF-kmeans (Basu et al., 2004), our method gives better results.
|
119 |
Improving Classification and Attribute Clustering: An Iterative Semi-supervised ApproachSeifi, Farid January 2015 (has links)
This thesis proposes a novel approach to attribute clustering. It exploits the strength of semi-supervised learning to improve the quality of attribute clustering particularly when labeled data is limited. The significance of this work derives in part from the broad, and increasingly important, usage of attribute clustering to address outstanding problems within the machine learning community. This form of clustering has also been shown to have strong practical applications, being usable in heavyweight industrial applications.
Although researchers have focused on supervised and unsupervised attribute clustering in recent years, semi-supervised attribute clustering has not received substantial attention. In this research, we propose an innovative two step iterative semi-supervised attribute clustering framework. This new framework, in each iteration, uses the result of attribute clustering to improve a classifier. It then uses the classifier to augment the training data used by attribute clustering in next iteration. This iterative framework outputs an improved classifier and attribute clustering at the same time. It gives more accurate clusters of attributes which better fit the real relations between attributes.
In this study we proposed two new usages for attribute clustering to improve classification: solving the automatic view definition problem for multi-view learning and improving missing attribute-value handling at induction and prediction time. The application of these two new usages of attribute clustering in our proposed semi-supervised attribute clustering is evaluated using real world data sets from different domains.
|
120 |
Inferring Aspect-Specific Opinion Structure in Product ReviewsCarter, David January 2015 (has links)
Identifying differing opinions on a given topic as expressed by multiple people (as in a set of written reviews for a given product, for example) presents challenges. Opinions about a particular subject are often nuanced: a person may have both negative and positive opinions about different aspects of the subject of interest, and these aspect-specific opinions can be independent of the overall opinion on the subject. Being able to identify, collect, and count these nuanced opinions in a large set of data offers more insight into the strengths and weaknesses of competing products and services than does aggregating the overall ratings of such products and services.
I make two useful and useable contributions in working with opinionated text.
First, I present my implementation of a semi-supervised co-training machine classification method for identifying both product aspects (features of products) and sentiments expressed about such aspects. It offers better precision than fully-supervised methods while requiring much less text to be manually tagged (a time-consuming process). This algorithm can also be run in a fully supervised manner when more data is available.
Second, I apply this co-training approach to reviews of restaurants and various electronic devices; such text contains both factual statements and opinions about features/aspects of products. The algorithm automatically identifies the product aspects and the words that indicate aspect-specific opinion polarity, while largely avoiding the problem of misclassifying the products themselves as inherently positive or negative.
This method performs well compared to other approaches. When run on a set of reviews of five technology products collected from Amazon, the system performed with some demonstrated competence (with an average precision of 0.83) at the difficult task of simultaneously identifying aspects and sentiments, though comparison to contemporaries' simpler rules-based approaches was difficult. When run on a set of opinionated sentences about laptops and restaurants that formed the basis of a shared challenge in the SemEval-2014 Task 4 competition, it was able to classify the sentiments expressed about aspects of laptops better than any team that competed in the task (achieving 0.72 accuracy). It was above the mean in its ability to identify the aspects of restaurants about which people expressed opinions, even when co-training using only half of the labelled training data at the outset.
While the SemEval-2014 aspect-based sentiment extraction task considered only separately the tasks of identifying product aspects and determining their polarities, I take an extra step and evaluate sentences as a whole, inferring aspects and the aspect-specific sentiments expressed simultaneously, a more difficult task that seems more applicable to real-world tasks. I present first results of this sentence-level task.
The algorithm uses both lexical and syntactic information in a manner that is shown to be able to handle new words that it has never before seen. It offers some demonstrated ability to adapt to new subject domains for which it has no training data. The system is characterizable by very high precision and weak-to-average recall and it estimates its own confidence in its predictions; this characteristic should make the algorithm suitable for use on its own or for combination in a confidence-based voting ensemble. The software created for and described in the course of this dissertation is made available online.
|
Page generated in 0.0625 seconds