Spelling suggestions: "subject:"[een] SEMI-SUPERVISED LEARNING"" "subject:"[enn] SEMI-SUPERVISED LEARNING""
41 |
Rotulação de indivíduos representativos no aprendizado semissupervisionado baseado em redes: caracterização, realce, ganho e filosofia / Representatives labeling for network-based semi-supervised learning:characterization, highlighting, gain and philosophyBilzã Marques de Araújo 29 April 2015 (has links)
Aprendizado semissupervisionado (ASS) é o nome dado ao paradigma de aprendizado de máquina que considera tanto dados rotulados como dados não rotulados. Embora seja considerado frequentemente como um meio termo entre os paradigmas supervisionado e não supervisionado, esse paradigma é geralmente aplicado a tarefas preditivas ou descritivas. Na tarefa preditiva de classificação, p. ex., o objetivo é rotular dados não rotulados de acordo com os rótulos dos dados rotulados. Nesse caso, enquanto que os dados não rotulados descrevem as distribuições dos dados e mediam a propagação dos rótulos, os itens de dados rotulados semeiam a propagação de rótulos e guiam-na à estabilidade. No entanto, dados são gerados tipicamente não rotulados e sua rotulação requer o envolvimento de especialistas no domínio, rotulando-os manualmente. Dificuldades na visualização de grandes volumes de dados, bem como o custo associado ao envolvimento do especialista, são desafios que podem restringir o desempenho dessa tarefa. Por- tanto, o destacamento automático de bons candidatos a dados rotulados, doravante denominados indivíduos representativos, é uma tarefa de grande importância, e pode proporcionar uma boa relação entre o custo com especialista e o desempenho do aprendizado. Dentre as abordagens de ASS discriminadas na literatura, nosso interesse de estudo se concentra na abordagem baseada em redes, onde conjuntos de dados são representados relacionalmente, através da abstração gráfica. Logo, o presente trabalho tem como objetivo explorar a influência dos nós rotulados no desempenho do ASS baseado em redes, i.e., estudar a caracterização de nós representativos, como a estrutura da rede pode realçá-los, o ganho de desempenho de ASS proporcionado pela rotulação manual dos mesmos, e aspectos filosóficos relacionados. Em relação à caracterização, critérios de caracterização de nós centrais em redes são estudados considerando-se redes com estruturas modulares bem definidas. Contraintuitivamente, nós bastantes conectados (hubs) não são muito representativos. Nós razoavelmente conectados em vizinhanças pouco conectadas, por outro lado, são; estritamente local, esse critério de caracterização é escalável a grandes volumes de dados. Em redes com distribuição de grau homogênea - modelo Girvan-Newman (GN), nós com alto coeficiente de agrupamento também mostram-se representativos. Por outro lado, em redes com distribuição de grau heterogênea - modelo Lancichinetti-Fortunato-Radicchi (LFR), nós com alta intermedialidade se destacam. Nós com alto coeficiente de agrupamento em redes GN estão tipicamente situados em motifs do tipo quase-clique; nós com alta intermedialidade em redes LFR são hubs situados na borda das comunidades. Em ambos os casos, os nós destacados são excelentes regularizadores. Além disso, como critérios diversos se destacam em redes com características diversas, abordagens unificadas para a caracterização de nós representativos também foram estudadas. Crítica para o realce de indivíduos representativos e o bom desempenho da classificação semissupervisionada, a construção de redes a partir de bases de dados vetoriais também foi estudada. O método denominado AdaRadius foi proposto, e apresenta vantagens tais como adaptabilidade em bases de dados com densidade variada, baixa dependência da configuração de seus parâmetros, e custo computacional razoável, tanto sobre dados pool-based como incrementais. As redes resultantes, por sua vez, são esparsas, porém conectadas, e permitem que a classificação semissupervisionada se favoreça da rotulação prévia de indivíduos representativos. Por fim, também foi estudada a validação de métodos de construção de redes para o ASS, sendo proposta a medida denominada coerência grafo-rótulos de Katz. Em suma, os resultados discutidos apontam para a validade da seleção de indivíduos representativos para semear a classificação semissupervisionada, corroborando a hipótese central da presente tese. Analogias são encontrados em diversos problemas modelados em redes, tais como epidemiologia, propagação de rumores e informações, resiliência, letalidade, grandmother cells, e crescimento e auto-organização. / Semi-supervised learning (SSL) is the name given to the machine learning paradigm that considers both labeled and unlabeled data. Although often defined as a mid-term between unsupervised and supervised machine learning, this paradigm is usually applied to predictive or descriptive tasks. In the classification task, for example, the goal is to label the unlabeled data according to the labels of the labeled data. In this case, while the unlabeled data describes the data distributions and mediate the label propagation, the labeled data seeds the label propagation and guide it to the stability. However, as a whole, data is generated unlabeled, and to label data requires the involvement of domain specialists, labeling it by hand. Difficulties on visualizing huge amounts of data, as well as the cost of the specialists involvement, are challenges which may constraint the labeling task performance. Therefore, the automatic highlighting of good candidates to label by hand, henceforth called representative individuals, is a high value task, which may result in a good tradeoff between the cost with the specialist and the machine learning performance. Among the SSL approaches in the literature, our study is focused on the network--based approache, where datasets are represented relationally, through the graphic abstraction. Thus, the current study aims to explore and exploit the influence of the labeled data on the SSL performance, that is, the proper characterization of representative nodes, how the network structure may enhance them, the SSL performance gain due to labeling them by hand, and related philosophical aspects. Concerning the characterization, central nodes characterization criteria were studied on networks with well-defined modular structures. Counterintuitively, highly connected nodes (hubs) are not much representatives. Not so connected nodes placed in low connectivity neighborhoods are, though. Strictly local, this characterization is scalable to huge volumes of data. In networks with homogeneous degree distribution - Girvan-Newman networks (GN), nodes with high clustering coefficient also figure out as representatives. On the other hand, in networks with inhomogeneous degree distribution - Lancichinetti-Fortunato-Radicchi networks (LFR), nodes with high betweenness stand out. Nodes with high clustering coefficient in GN networks typically lie in almost-cliques motifs; nodes with high betweenness in LFR networks are highly connected nodes, which lie in communities borders. In both cases, the highlighted nodes are outstanding regularizers. Besides that, unified approaches to characterize representative nodes were studied because diverse criteria stand out for diverse networks. Crucial for highlighting representative nodes and ensure good SSL performance, the graph construction from vector-based datasets was also studied. The method called AdaRadius was introduced and presents advantages such as adaptability to data with variable density, low dependency on parameters settings, and reasonable computational cost on both pool based and incremental data. Yielding networks are sparse but connected and allow the semi-supervised classification to take great advantage of the manual labeling of representative nodes. Lastly, the validation of graph construction methods for SSL was studied, being proposed the validation measure called graph-labels Katz coherence. Summing up, the discussed results give rise to the validity of representative individuals selection to seed the semi-supervised classification, supporting the central assumption of current thesis. Analogies may be found in several real-world network problems, such as epidemiology, rumors and information spreading, resilience, lethality, grandmother cells, and network evolving and self-organization.
|
42 |
Hyper-parameter optimization for manifold regularization learning = Otimização de hiperparâmetros para aprendizado do computador por regularização em variedades / Otimização de hiperparâmetros para aprendizado do computador por regularização em variedadesBecker, Cassiano Otávio, 1977- 08 December 2013 (has links)
Orientador: Paulo Augusto Valente Ferreira / Dissertação (mestrado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação / Made available in DSpace on 2018-08-23T18:31:10Z (GMT). No. of bitstreams: 1
Becker_CassianoOtavio_M.pdf: 861514 bytes, checksum: 07ea364d206309cbabdf79f51037f481 (MD5)
Previous issue date: 2013 / Resumo: Esta dissertação investiga o problema de otimização de hiperparâmetros para modelos de aprendizado do computador baseados em regularização. Uma revisão destes algoritmos é apresentada, abordando diferentes funções de perda e tarefas de aprendizado, incluindo Máquinas de Vetores de Suporte, Mínimos Quadrados Regularizados e sua extensão para modelos de aprendizado semi-supervisionado, mais especificamente, Regularização em Variedades. Uma abordagem baseada em otimização por gradiente é proposta, através da utilização de um método eficiente de cálculo da função de validação por exclusão unitária. Com o intuito de avaliar os métodos propostos em termos de qualidade de generalização dos modelos gerados, uma aplicação deste método a diferentes conjuntos de dados e exemplos numéricos é apresentada / Abstract: This dissertation investigates the problem of hyper-parameter optimization for regularization based learning models. A review of different learning algorithms is provided in terms of different losses and learning tasks, including Support Vector Machines, Regularized Least Squares and their extension to semi-supervised learning models, more specifically, Manifold Regularization. A gradient based optimization approach is proposed, using an efficient calculation of the Leave-one-out Cross Validation procedure. Datasets and numerical examples are provided in order to evaluate the methods proposed in terms of their generalization capability of the generated models / Mestrado / Automação / Mestre em Engenharia Elétrica
|
43 |
Performance evaluation based on data from code reviewsAndrej, Sekáč January 2016 (has links)
Context. Modern code review tools such as Gerrit have made available great amounts of code review data from different open source projects as well as other commercial projects. Code reviews are used to keep the quality of produced source code under control but the stored data could also be used for evaluation of the software development process. Objectives. This thesis uses machine learning methods for an approximation of review expert’s performance evaluation function. Due to limitations in the size of labelled data sample, this work uses semisupervised machine learning methods and measure their influence on the performance. In this research we propose features and also analyse their relevance to development performance evaluation. Methods. This thesis uses Radial Basis Function networks as the regression algorithm for the performance evaluation approximation and Metric Based Regularisation as the semi-supervised learning method. For the analysis of feature set and goodness of fit we use statistical tools with manual analysis. Results. The semi-supervised learning method achieved a similar accuracy to supervised versions of algorithm. The feature analysis showed that there is a significant negative correlation between the performance evaluation and three other features. A manual verification of learned models on unlabelled data achieved 73.68% accuracy. Conclusions. We have not managed to prove that the used semisupervised learning method would perform better than supervised learning methods. The analysis of the feature set suggests that the number of reviewers, the ratio of comments to the change size and the amount of code lines modified in later parts of development are relevant to performance evaluation task with high probability. The achieved accuracy of models close to 75% leads us to believe that, considering the limited size of labelled data set, our work provides a solid base for further improvements in the performance evaluation approximation.
|
44 |
Active Cleaning of Label Noise Using Support Vector MachinesEkambaram, Rajmadhan 19 June 2017 (has links)
Large scale datasets collected using non-expert labelers are prone to labeling errors. Errors in the given labels or label noise affect the classifier performance, classifier complexity, class proportions, etc. It may be that a relatively small, but important class needs to have all its examples identified. Typical solutions to the label noise problem involve creating classifiers that are robust or tolerant to errors in the labels, or removing the suspected examples using machine learning algorithms. Finding the label noise examples through a manual review process is largely unexplored due to the cost and time factors involved. Nevertheless, we believe it is the only way to create a label noise free dataset. This dissertation proposes a solution exploiting the characteristics of the Support Vector Machine (SVM) classifier and the sparsity of its solution representation to identify uniform random label noise examples in a dataset. Application of this method is illustrated with problems involving two real-world large scale datasets. This dissertation also presents results for datasets that contain adversarial label noise. A simple extension of this method to a semi-supervised learning approach is also presented. The results show that most mislabels are quickly and effectively identified by the approaches developed in this dissertation.
|
45 |
Expansão de recursos para análise de sentimentos usando aprendizado semi-supervisionado / Extending sentiment analysis resources using semi-supervised learningHenrico Bertini Brum 23 March 2018 (has links)
O grande volume de dados que temos disponíveis em ambientes virtuais pode ser excelente fonte de novos recursos para estudos em diversas tarefas de Processamento de Linguagem Natural, como a Análise de Sentimentos. Infelizmente é elevado o custo de anotação de novos córpus, que envolve desde investimentos financeiros até demorados processos de revisão. Nossa pesquisa propõe uma abordagem de anotação semissupervisionada, ou seja, anotação automática de um grande córpus não anotado partindo de um conjunto de dados anotados manualmente. Para tal, introduzimos o TweetSentBR, um córpus de tweets no domínio de programas televisivos que possui anotação em três classes e revisões parciais feitas por até sete anotadores. O córpus representa um importante recurso linguístico de português brasileiro, e fica entre os maiores córpus anotados na literatura para classificação de polaridades. Além da anotação manual do córpus, realizamos a implementação de um framework de aprendizado semissupervisionado que faz uso de dados anotados e, de maneira iterativa, expande o mesmo usando dados não anotados. O TweetSentBR, que possui 15:000 tweets anotados é assim expandido cerca de oito vezes. Para a expansão, foram treinados modelos de classificação usando seis classificadores de polaridades, assim como foram avaliados diferentes parâmetros e representações a fim de obter um córpus confiável. Realizamos experimentos gerando córpus expandidos por cada classificador, tanto para a classificação em três polaridades (positiva, neutra e negativa) quanto para classificação binária. Avaliamos os córpus gerados usando um conjunto de held-out e comparamos a FMeasure da classificação usando como treinamento os córpus anotados manualmente e semiautomaticamente. O córpus semissupervisionado que obteve os melhores resultados para a classificação em três polaridades atingiu 62;14% de F-Measure média, superando a média obtida com as avaliações no córpus anotado manualmente (61;02%). Na classificação binária, o melhor córpus expandido obteve 83;11% de F1-Measure média, superando a média obtida na avaliação do córpus anotado manualmente (79;80%). Além disso, simulamos nossa expansão em córpus anotados da literatura, medindo o quão corretas são as etiquetas anotadas semi-automaticamente. Nosso melhor resultado foi na expansão de um córpus de reviews de produtos que obteve FMeasure de 93;15% com dados binários. Por fim, comparamos um córpus da literatura obtido por meio de supervisão distante e nosso framework semissupervisionado superou o primeiro na classificação de polaridades binária em cross-domain. / The high volume of data available in the Internet can be a good resource for studies of several tasks in Natural Language Processing as in Sentiment Analysis. Unfortunately there is a high cost for the annotation of new corpora, involving financial support and long revision processes. Our work proposes an approach for semi-supervised labeling, an automatic annotation of a large unlabeled set of documents starting from a manually annotated corpus. In order to achieve that, we introduced TweetSentBR, a tweet corpora on TV show programs domain with annotation for 3-point (positive, neutral and negative) sentiment classification partially reviewed by up to seven annotators. The corpus is an important linguistic resource for Brazilian Portuguese language and it stands between the biggest annotated corpora for polarity classification. Beyond the manual annotation, we implemented a semi-supervised learning based framework that uses this labeled data and extends it using unlabeled data. TweetSentBR corpus, containing 15:000 documents, had its size augmented in eight times. For the extending process, we trained classification models using six polarity classifiers, evaluated different parameters and representation schemes in order to obtain the most reliable corpora. We ran experiments generating extended corpora for each classifier, both for 3-point and binary classification. We evaluated the generated corpora using a held-out subset and compared the obtained F-Measure values with the manually and the semi-supervised annotated corpora. The semi-supervised corpus that obtained the best values for 3-point classification achieved 62;14% on average F-Measure, overcoming the results obtained by the same classification with the manually annotated corpus (61;02%). On binary classification, the best extended corpus achieved 83;11% on average F-Measure, overcoming the results on the manually corpora (79;80%). Furthermore, we simulated the extension of labeled corpora in literature, measuring how well the semi-supervised annotation works. Our best results were in the extension of a product review corpora, achieving 93;15% on F1-Measure. Finally, we compared a literature corpus which was labeled by using distant supervision with our semi-supervised corpus, and this overcame the first in binary polarity classification on cross-domain data.
|
46 |
Generalized Domain Adaptation for Visual DomainsJanuary 2020 (has links)
abstract: Humans have a great ability to recognize objects in different environments irrespective of their variations. However, the same does not apply to machine learning models which are unable to generalize to images of objects from different domains. The generalization of these models to new data is constrained by the domain gap. Many factors such as image background, image resolution, color, camera perspective and variations in the objects are responsible for the domain gap between the training data (source domain) and testing data (target domain). Domain adaptation algorithms aim to overcome the domain gap between the source and target domains and learn robust models that can perform well across both the domains.
This thesis provides solutions for the standard problem of unsupervised domain adaptation (UDA) and the more generic problem of generalized domain adaptation (GDA). The contributions of this thesis are as follows. (1) Certain and Consistent Domain Adaptation model for closed-set unsupervised domain adaptation by aligning the features of the source and target domain using deep neural networks. (2) A multi-adversarial deep learning model for generalized domain adaptation. (3) A gating model that detects out-of-distribution samples for generalized domain adaptation.
The models were tested across multiple computer vision datasets for domain adaptation.
The dissertation concludes with a discussion on the proposed approaches and future directions for research in closed set and generalized domain adaptation. / Dissertation/Thesis / Masters Thesis Computer Science 2020
|
47 |
Ichthyoplankton Classification Tool using Generative Adversarial Networks and Transfer LearningAljaafari, Nura 15 April 2018 (has links)
The study and the analysis of marine ecosystems is a significant part of the marine science research. These systems are valuable resources for fisheries, improving water quality and can even be used in drugs production. The investigation of ichthyoplankton inhabiting these ecosystems is also an important research field. Ichthyoplankton are fish in their early stages of life. In this stage, the fish have relatively similar shape and are small in size. The currently used way of identifying them is not optimal. Marine scientists typically study such organisms by sending a team that collects samples from the sea which is then taken to the lab for further investigation. These samples need to be studied by an expert and usually end needing a DNA sequencing. This method is time-consuming and requires a high level of experience. The recent advances in AI have helped to solve and automate several difficult tasks which motivated us to develop a classification tool for ichthyoplankton. We show that using machine learning techniques, such as generative adversarial networks combined with transfer learning solves such a problem with high accuracy. We show that using traditional machine learning algorithms fails to solve it. We also give a general framework for creating a classification tool when the dataset used for training is a limited dataset. We aim to build a user-friendly tool that can be used by any user for the classification task and we aim to give a guide to the researchers so that they can follow in creating a classification tool.
|
48 |
Learning in the Presence of Skew and Missing Labels Through Online Ensembles and Meta-reinforcement LearningVafaie, Parsa 07 September 2021 (has links)
Data streams are large sequences of data, possibly endless and temporarily ordered, that are common-place in Internet of Things (IoT) applications such as intrusion detection in computer networking, fraud detection in financial institutions, real-time tumor tracking in radiotherapy and social media analysis. Algorithms learning from such streams need to be able to construct near real-time models that continuously adapt to potential changes in patterns, in order to retain high performance throughout the stream. It follows that there are numerous challenges involved in supervised learning (or so-called classification) in such environments. One of the challenges in learning from streams is multi-class imbalance, in which the rates of instances in the different class labels differ substantially. Notably, classification algorithms may become biased towards the classes with more frequent instances, sacrificing the performance of the less frequent or so-called minority classes. Further, minority instances often arrive infrequently and in bursts, making accurate model construction problematic. For example, network intrusion detection systems must be able to distinguish between normal traffic and multiple minority classes corresponding to a variety of different types of attacks.
Further, having labels for all instances are often infeasible, since we might have missing or late-arriving labels. For instance, when learning from a stream regarding the task of detecting network intrusions, the true label for all instances might not be available, or it might take time until the label is made available, especially for new types of attacks.
In this thesis, we contribute to the advancements of online learning from evolving streams by focusing on the above-mentioned areas of multi-class imbalance and missing labels. First, we introduce a multi-class online ensemble algorithm designed to maintain a balanced performance over all classes. Specifically, our approach samples instances with replacement while dynamically increasing the weights of under-represented classes, in order to produce models that benefit all classes. Our experimental results show that our online ensemble method performs well against multi-class imbalanced data in various datasets.
We further continue our study by introducing an approach to dealing with missing labels that utilize both labelled and unlabelled data to increase a model’s performance. That is, our method utilizes labelled data for pseudo-labelling unlabelled instances, allowing the model to perform better in environments where labels are scarce. More specifically, our approach features a meta-reinforcement learning agent, trained on multiple-source streams, that can effectively select the prediction of a K nearest neighbours (K-NN) classifier as the label for unlabelled instances. Extensive experiments on benchmark datasets demonstrate the value and effectiveness of our approach and confirm that our method outperforms state-of-the-art.
|
49 |
Semi-supervised učení z nepříznivě distribuovaných dat / Semi-supervised Learning from Unfavorably Distributed DataSochor, Matěj January 2020 (has links)
Semi-supervised learning (SSL) is a branch of machine learning focusing on using not only labeled data samples, but also unlabeled ones, in an effort to decrease the need for labeled data and thus allow using machine learning even when labeling large amounts of data would be too costly. Despite its quick development in the recent years, there are still issues left to be solved before it can be broadly deployed in practice. One of those issues is class distribution mismatch. It arises when the unlabeled data contains samples not belonging to the classes present in the labeled data. This confuses the training and can even lead to getting a classifier performing worse than a classifier trained on the available data in purely supervised fashion. We designed a filtration method called Unfavorable Data Filtering (UDF) which extracts important features from the data and then uses a similarity-based filter to filter the irrelevant data out according to those features. The filtering happens before any of the SSL training takes places, making UDF usable with any SSL algorithm. To judge its effectiveness, we performed many experiments, mainly on the CIFAR-10 dataset. We found out that UDF is capable of significantly improving the resulting accuracy when compared to not filtering the data, identified basic guidelines...
|
50 |
A Unified Generative and Discriminative Approach to Automatic Chord Estimation for Music Audio Signals / 音楽音響信号に対する自動コード推定のための生成・識別統合的アプローチWu, Yiming 24 September 2021 (has links)
京都大学 / 新制・課程博士 / 博士(情報学) / 甲第23540号 / 情博第770号 / 新制||情||131(附属図書館) / 京都大学大学院情報学研究科知能情報学専攻 / (主査)准教授 吉井 和佳, 教授 河原 達也, 教授 西野 恒, 教授 鹿島 久嗣 / 学位規則第4条第1項該当 / Doctor of Informatics / Kyoto University / DFAM
|
Page generated in 0.0451 seconds