Spelling suggestions: "subject:"semisupervised learning"" "subject:"semissupervised learning""
51 |
Expansão de recursos para análise de sentimentos usando aprendizado semi-supervisionado / Extending sentiment analysis resources using semi-supervised learningHenrico Bertini Brum 23 March 2018 (has links)
O grande volume de dados que temos disponíveis em ambientes virtuais pode ser excelente fonte de novos recursos para estudos em diversas tarefas de Processamento de Linguagem Natural, como a Análise de Sentimentos. Infelizmente é elevado o custo de anotação de novos córpus, que envolve desde investimentos financeiros até demorados processos de revisão. Nossa pesquisa propõe uma abordagem de anotação semissupervisionada, ou seja, anotação automática de um grande córpus não anotado partindo de um conjunto de dados anotados manualmente. Para tal, introduzimos o TweetSentBR, um córpus de tweets no domínio de programas televisivos que possui anotação em três classes e revisões parciais feitas por até sete anotadores. O córpus representa um importante recurso linguístico de português brasileiro, e fica entre os maiores córpus anotados na literatura para classificação de polaridades. Além da anotação manual do córpus, realizamos a implementação de um framework de aprendizado semissupervisionado que faz uso de dados anotados e, de maneira iterativa, expande o mesmo usando dados não anotados. O TweetSentBR, que possui 15:000 tweets anotados é assim expandido cerca de oito vezes. Para a expansão, foram treinados modelos de classificação usando seis classificadores de polaridades, assim como foram avaliados diferentes parâmetros e representações a fim de obter um córpus confiável. Realizamos experimentos gerando córpus expandidos por cada classificador, tanto para a classificação em três polaridades (positiva, neutra e negativa) quanto para classificação binária. Avaliamos os córpus gerados usando um conjunto de held-out e comparamos a FMeasure da classificação usando como treinamento os córpus anotados manualmente e semiautomaticamente. O córpus semissupervisionado que obteve os melhores resultados para a classificação em três polaridades atingiu 62;14% de F-Measure média, superando a média obtida com as avaliações no córpus anotado manualmente (61;02%). Na classificação binária, o melhor córpus expandido obteve 83;11% de F1-Measure média, superando a média obtida na avaliação do córpus anotado manualmente (79;80%). Além disso, simulamos nossa expansão em córpus anotados da literatura, medindo o quão corretas são as etiquetas anotadas semi-automaticamente. Nosso melhor resultado foi na expansão de um córpus de reviews de produtos que obteve FMeasure de 93;15% com dados binários. Por fim, comparamos um córpus da literatura obtido por meio de supervisão distante e nosso framework semissupervisionado superou o primeiro na classificação de polaridades binária em cross-domain. / The high volume of data available in the Internet can be a good resource for studies of several tasks in Natural Language Processing as in Sentiment Analysis. Unfortunately there is a high cost for the annotation of new corpora, involving financial support and long revision processes. Our work proposes an approach for semi-supervised labeling, an automatic annotation of a large unlabeled set of documents starting from a manually annotated corpus. In order to achieve that, we introduced TweetSentBR, a tweet corpora on TV show programs domain with annotation for 3-point (positive, neutral and negative) sentiment classification partially reviewed by up to seven annotators. The corpus is an important linguistic resource for Brazilian Portuguese language and it stands between the biggest annotated corpora for polarity classification. Beyond the manual annotation, we implemented a semi-supervised learning based framework that uses this labeled data and extends it using unlabeled data. TweetSentBR corpus, containing 15:000 documents, had its size augmented in eight times. For the extending process, we trained classification models using six polarity classifiers, evaluated different parameters and representation schemes in order to obtain the most reliable corpora. We ran experiments generating extended corpora for each classifier, both for 3-point and binary classification. We evaluated the generated corpora using a held-out subset and compared the obtained F-Measure values with the manually and the semi-supervised annotated corpora. The semi-supervised corpus that obtained the best values for 3-point classification achieved 62;14% on average F-Measure, overcoming the results obtained by the same classification with the manually annotated corpus (61;02%). On binary classification, the best extended corpus achieved 83;11% on average F-Measure, overcoming the results on the manually corpora (79;80%). Furthermore, we simulated the extension of labeled corpora in literature, measuring how well the semi-supervised annotation works. Our best results were in the extension of a product review corpora, achieving 93;15% on F1-Measure. Finally, we compared a literature corpus which was labeled by using distant supervision with our semi-supervised corpus, and this overcame the first in binary polarity classification on cross-domain data.
|
52 |
Generalized Domain Adaptation for Visual DomainsJanuary 2020 (has links)
abstract: Humans have a great ability to recognize objects in different environments irrespective of their variations. However, the same does not apply to machine learning models which are unable to generalize to images of objects from different domains. The generalization of these models to new data is constrained by the domain gap. Many factors such as image background, image resolution, color, camera perspective and variations in the objects are responsible for the domain gap between the training data (source domain) and testing data (target domain). Domain adaptation algorithms aim to overcome the domain gap between the source and target domains and learn robust models that can perform well across both the domains.
This thesis provides solutions for the standard problem of unsupervised domain adaptation (UDA) and the more generic problem of generalized domain adaptation (GDA). The contributions of this thesis are as follows. (1) Certain and Consistent Domain Adaptation model for closed-set unsupervised domain adaptation by aligning the features of the source and target domain using deep neural networks. (2) A multi-adversarial deep learning model for generalized domain adaptation. (3) A gating model that detects out-of-distribution samples for generalized domain adaptation.
The models were tested across multiple computer vision datasets for domain adaptation.
The dissertation concludes with a discussion on the proposed approaches and future directions for research in closed set and generalized domain adaptation. / Dissertation/Thesis / Masters Thesis Computer Science 2020
|
53 |
Ichthyoplankton Classification Tool using Generative Adversarial Networks and Transfer LearningAljaafari, Nura 15 April 2018 (has links)
The study and the analysis of marine ecosystems is a significant part of the marine science research. These systems are valuable resources for fisheries, improving water quality and can even be used in drugs production. The investigation of ichthyoplankton inhabiting these ecosystems is also an important research field. Ichthyoplankton are fish in their early stages of life. In this stage, the fish have relatively similar shape and are small in size. The currently used way of identifying them is not optimal. Marine scientists typically study such organisms by sending a team that collects samples from the sea which is then taken to the lab for further investigation. These samples need to be studied by an expert and usually end needing a DNA sequencing. This method is time-consuming and requires a high level of experience. The recent advances in AI have helped to solve and automate several difficult tasks which motivated us to develop a classification tool for ichthyoplankton. We show that using machine learning techniques, such as generative adversarial networks combined with transfer learning solves such a problem with high accuracy. We show that using traditional machine learning algorithms fails to solve it. We also give a general framework for creating a classification tool when the dataset used for training is a limited dataset. We aim to build a user-friendly tool that can be used by any user for the classification task and we aim to give a guide to the researchers so that they can follow in creating a classification tool.
|
54 |
Learning in the Presence of Skew and Missing Labels Through Online Ensembles and Meta-reinforcement LearningVafaie, Parsa 07 September 2021 (has links)
Data streams are large sequences of data, possibly endless and temporarily ordered, that are common-place in Internet of Things (IoT) applications such as intrusion detection in computer networking, fraud detection in financial institutions, real-time tumor tracking in radiotherapy and social media analysis. Algorithms learning from such streams need to be able to construct near real-time models that continuously adapt to potential changes in patterns, in order to retain high performance throughout the stream. It follows that there are numerous challenges involved in supervised learning (or so-called classification) in such environments. One of the challenges in learning from streams is multi-class imbalance, in which the rates of instances in the different class labels differ substantially. Notably, classification algorithms may become biased towards the classes with more frequent instances, sacrificing the performance of the less frequent or so-called minority classes. Further, minority instances often arrive infrequently and in bursts, making accurate model construction problematic. For example, network intrusion detection systems must be able to distinguish between normal traffic and multiple minority classes corresponding to a variety of different types of attacks.
Further, having labels for all instances are often infeasible, since we might have missing or late-arriving labels. For instance, when learning from a stream regarding the task of detecting network intrusions, the true label for all instances might not be available, or it might take time until the label is made available, especially for new types of attacks.
In this thesis, we contribute to the advancements of online learning from evolving streams by focusing on the above-mentioned areas of multi-class imbalance and missing labels. First, we introduce a multi-class online ensemble algorithm designed to maintain a balanced performance over all classes. Specifically, our approach samples instances with replacement while dynamically increasing the weights of under-represented classes, in order to produce models that benefit all classes. Our experimental results show that our online ensemble method performs well against multi-class imbalanced data in various datasets.
We further continue our study by introducing an approach to dealing with missing labels that utilize both labelled and unlabelled data to increase a model’s performance. That is, our method utilizes labelled data for pseudo-labelling unlabelled instances, allowing the model to perform better in environments where labels are scarce. More specifically, our approach features a meta-reinforcement learning agent, trained on multiple-source streams, that can effectively select the prediction of a K nearest neighbours (K-NN) classifier as the label for unlabelled instances. Extensive experiments on benchmark datasets demonstrate the value and effectiveness of our approach and confirm that our method outperforms state-of-the-art.
|
55 |
Semi-supervised učení z nepříznivě distribuovaných dat / Semi-supervised Learning from Unfavorably Distributed DataSochor, Matěj January 2020 (has links)
Semi-supervised learning (SSL) is a branch of machine learning focusing on using not only labeled data samples, but also unlabeled ones, in an effort to decrease the need for labeled data and thus allow using machine learning even when labeling large amounts of data would be too costly. Despite its quick development in the recent years, there are still issues left to be solved before it can be broadly deployed in practice. One of those issues is class distribution mismatch. It arises when the unlabeled data contains samples not belonging to the classes present in the labeled data. This confuses the training and can even lead to getting a classifier performing worse than a classifier trained on the available data in purely supervised fashion. We designed a filtration method called Unfavorable Data Filtering (UDF) which extracts important features from the data and then uses a similarity-based filter to filter the irrelevant data out according to those features. The filtering happens before any of the SSL training takes places, making UDF usable with any SSL algorithm. To judge its effectiveness, we performed many experiments, mainly on the CIFAR-10 dataset. We found out that UDF is capable of significantly improving the resulting accuracy when compared to not filtering the data, identified basic guidelines...
|
56 |
A Unified Generative and Discriminative Approach to Automatic Chord Estimation for Music Audio Signals / 音楽音響信号に対する自動コード推定のための生成・識別統合的アプローチWu, Yiming 24 September 2021 (has links)
京都大学 / 新制・課程博士 / 博士(情報学) / 甲第23540号 / 情博第770号 / 新制||情||131(附属図書館) / 京都大学大学院情報学研究科知能情報学専攻 / (主査)准教授 吉井 和佳, 教授 河原 達也, 教授 西野 恒, 教授 鹿島 久嗣 / 学位規則第4条第1項該当 / Doctor of Informatics / Kyoto University / DFAM
|
57 |
Learning from Scholarly Attributed Graphs for Scientific DiscoveryAkujuobi, Uchenna Thankgod 18 October 2020 (has links)
Research and experimentation in various scientific fields are based on the knowledge and ideas from scholarly literature. The advancement of research and development has, thus, strengthened the importance of literary analysis and understanding. However, in recent years, researchers have been facing massive scholarly documents published at an exponentially increasing rate. Analyzing this vast number of publications is far beyond the capability of individual researchers.
This dissertation is motivated by the need for large scale analyses of the exploding number of scholarly literature for scientific knowledge discovery. In the first part of this dissertation, the interdependencies between scholarly literature are studied. First, I develop Delve – a data-driven search engine supported by our designed semi-supervised edge classification method. This system enables users to search and analyze the relationship between datasets and scholarly literature. Based on the Delve system, I propose to study information extraction as a node classification problem in attributed networks. Specifically, if we can learn the research topics of documents (nodes in a network), we can aggregate documents by topics and retrieve information specific to each topic (e.g., top-k popular datasets).
Node classification in attributed networks has several challenges: a limited number of labeled nodes, effective fusion of topological structure and node/edge attributes, and the co-existence of multiple labels for one node. Existing node classification approaches can only address or partially address a few of these challenges. This dissertation addresses these challenges by proposing semi-supervised multi-class/multi-label node classification models to integrate node/edge attributes and topological relationships.
The second part of this dissertation examines the problem of analyzing the interdependencies between terms in scholarly literature. I present two algorithms for the automatic hypothesis generation (HG) problem, which refers to the discovery of meaningful implicit connections between scientific terms, including but not limited to diseases, drugs, and genes extracted from databases of biomedical publications. The automatic hypothesis generation problem is modeled as a future connectivity prediction in a dynamic attributed graph. The key is to capture the temporal evolution of node-pair (term-pair) relations. Experiment results and case study analyses highlight the effectiveness of the proposed algorithms compared to the baselines’ extension.
|
58 |
Novel Semi-Supervised Learning Models to Balance Data Inclusivity and Usability in Healthcare ApplicationsJanuary 2019 (has links)
abstract: Semi-supervised learning (SSL) is sub-field of statistical machine learning that is useful for problems that involve having only a few labeled instances with predictor (X) and target (Y) information, and abundance of unlabeled instances that only have predictor (X) information. SSL harnesses the target information available in the limited labeled data, as well as the information in the abundant unlabeled data to build strong predictive models. However, not all the included information is useful. For example, some features may correspond to noise and including them will hurt the predictive model performance. Additionally, some instances may not be as relevant to model building and their inclusion will increase training time and potentially hurt the model performance. The objective of this research is to develop novel SSL models to balance data inclusivity and usability. My dissertation research focuses on applications of SSL in healthcare, driven by problems in brain cancer radiomics, migraine imaging, and Parkinson’s Disease telemonitoring.
The first topic introduces an integration of machine learning (ML) and a mechanistic model (PI) to develop an SSL model applied to predicting cell density of glioblastoma brain cancer using multi-parametric medical images. The proposed ML-PI hybrid model integrates imaging information from unbiopsied regions of the brain as well as underlying biological knowledge from the mechanistic model to predict spatial tumor density in the brain.
The second topic develops a multi-modality imaging-based diagnostic decision support system (MMI-DDS). MMI-DDS consists of modality-wise principal components analysis to incorporate imaging features at different aggregation levels (e.g., voxel-wise, connectivity-based, etc.), a constrained particle swarm optimization (cPSO) feature selection algorithm, and a clinical utility engine that utilizes inverse operators on chosen principal components for white-box classification models.
The final topic develops a new SSL regression model with integrated feature and instance selection called s2SSL (with “s2” referring to selection in two different ways: feature and instance). s2SSL integrates cPSO feature selection and graph-based instance selection to simultaneously choose the optimal features and instances and build accurate models for continuous prediction. s2SSL was applied to smartphone-based telemonitoring of Parkinson’s Disease patients. / Dissertation/Thesis / Doctoral Dissertation Industrial Engineering 2019
|
59 |
Semi Supervised Learning for Accurate Segmentation of Roughly Labeled DataRajan, Rachel 01 September 2020 (has links)
No description available.
|
60 |
Semi-Supervised Learning with Sparse Autoencoders in Automatic Speech Recognition / Semi-övervakad inlärning med glesa autoencoders i automatisk taligenkänningDHAKA, AKASH KUMAR January 2016 (has links)
This work is aimed at exploring semi-supervised learning techniques to improve the performance of Automatic Speech Recognition systems. Semi-supervised learning takes advantage of unlabeled data in order to improve the quality of the representations extracted from the data.The proposed model is a neural network where the weights are updated by minimizing the weighted sum of a supervised and an unsupervised cost function, simultaneously. These costs are evaluated on the labeled and unlabeled portions of the data set, respectively. The combined cost is optimized through mini-batch stochastic gradient descent via standard backpropagation.The model was tested on a phone classification task on the TIMIT American English data set and on a written digit classification task on the MNIST data set. Our results show that the model outperforms a network trained with standard backpropagation on the labelled material alone. The results are also in line with state-of-the-art graph-based semi-supervised training methods. / Detta arbete syftar till att utforska halvövervakade inlärningstekniker (semi-supervised learning techniques) för att förbättra prestandan hos automatiska taligenkänningssystem.Halvövervakad maskininlärning använder sig av data ej märkt med klasstillhörighetsinformation för att förbättra kvaliteten hos den från datan extraherade representationen.Modellen som beskrivs i arbetet är ett neuralt nätverk där vikterna uppdateras genom att samtidigt minimera den viktade summan av en övervakad och en oövervakad kostnadsfunktion.Dessa kostnadsfunktioner evalueras på den märkta respektive den omärkta datamängden.De kombinerade kostnadsfunktionerna optimeras genom gradient descent med hjälp av traditionell backpropagation.Modellen har evaluerats genom en fonklassificeringsuppgift på datamängden TIMIT American English, samt en sifferklassificeringsuppgift på datamängden MNIST.Resultaten visar att modellen presterar bättre än ett nätverk tränat med backpropagation på endast märkt data.Resultaten är även konkurrenskraftiga med rådande state of the art, grafbaserade halvövervakade inlärningsmetoder.
|
Page generated in 0.1132 seconds