Global ETD Search

151	Técnicas para o problema de dados desbalanceados em classificação hierárquica / Techniques for the problem of imbalanced data in hierarchical classification Victor Hugo Barella 24 July 2015 Os recentes avanços da ciência e tecnologia viabilizaram o crescimento de dados em quantidade e disponibilidade. Junto com essa explosão de informações geradas, surge a necessidade de analisar dados para descobrir conhecimento novo e útil. Desse modo, áreas que visam extrair conhecimento e informações úteis de grandes conjuntos de dados se tornaram grandes oportunidades para o avanço de pesquisas, tal como o Aprendizado de Máquina (AM) e a Mineração de Dados (MD). Porém, existem algumas limitações que podem prejudicar a acurácia de alguns algoritmos tradicionais dessas áreas, por exemplo o desbalanceamento das amostras das classes de um conjunto de dados. Para mitigar tal problema, algumas alternativas têm sido alvos de pesquisas nos últimos anos, tal como o desenvolvimento de técnicas para o balanceamento artificial de dados, a modificação dos algoritmos e propostas de abordagens para dados desbalanceados. Uma área pouco explorada sob a visão do desbalanceamento de dados são os problemas de classificação hierárquica, em que as classes são organizadas em hierarquias, normalmente na forma de árvore ou DAG (Directed Acyclic Graph). O objetivo deste trabalho foi investigar as limitações e maneiras de minimizar os efeitos de dados desbalanceados em problemas de classificação hierárquica. Os experimentos realizados mostram que é necessário levar em consideração as características das classes hierárquicas para a aplicação (ou não) de técnicas para tratar problemas de dados desbalanceados em classificação hierárquica. / Recent advances in science and technology have enabled data to grow in both quantity and availability. Along with this explosion of generated information comes the need to analyze data to discover new and useful knowledge. 
Thus, fields that aim to extract knowledge and useful information from large datasets, such as Machine Learning (ML) and Data Mining (DM), have become great opportunities for research. However, some limitations may reduce the accuracy of traditional algorithms in these fields, for example imbalance in the number of samples per class in a dataset. To mitigate this problem, several alternatives have been researched in recent years, such as techniques for artificially balancing data, modifications to algorithms, and new approaches for imbalanced data. One area little explored from the perspective of data imbalance is hierarchical classification, in which the classes are organized into hierarchies, commonly in the form of a tree or a DAG (Directed Acyclic Graph). The goal of this work was to investigate the limitations of, and ways to minimize, the effects of imbalanced data in hierarchical classification problems. The experimental results show that the characteristics of the hierarchical classes must be taken into account when deciding whether or not to apply techniques for imbalanced data in hierarchical classification. Aprendizado supervisionado Classificação hierárquica Dados desbalanceados Desbalanceamento de dados Data imbalance Hierarchical classification Imbalanced data Supervised learning
152	Interpretação de clusters gerados por algoritmos de clustering hierárquico / Interpreting clusters generated by hierarchical clustering algorithms Jean Metz 04 August 2006 O processo de Mineração de Dados (MD) consiste na extração automática de padrões que representam o conhecimento implícito em grandes bases de dados. Em geral, a MD pode ser classificada em duas categorias: preditiva e descritiva. Tarefas da primeira categoria, tal como a classificação, realizam inferências preditivas sobre os dados enquanto que tarefas da segunda categoria, tal como o clustering, exploram o conjunto de dados em busca de propriedades que o descrevem. Diferentemente da classificação, que analisa exemplos rotulados, o clustering utiliza exemplos para os quais o rótulo da classe não é previamente conhecido. Nessa tarefa, agrupamentos são formados de modo que exemplos de um mesmo cluster apresentam alta similaridade, ao passo que exemplos em clusters diferentes apresentam baixa similaridade. O clustering pode ainda facilitar a organização de clusters em uma hierarquia de agrupamentos, na qual são agrupados eventos similares, criando uma taxonomia que pode simplificar a interpretação de clusters. Neste trabalho, é proposto e desenvolvido um módulo de aprendizado não-supervisionado, que agrega algoritmos de clustering hierárquico e ferramentas de análise de clusters para auxiliar o especialista de domínio na interpretação dos resultados do clustering. Uma vez que o clustering hierárquico agrupa exemplos de acordo com medidas de similaridade e organiza os clusters em uma hierarquia, o usuário/especialista pode analisar e explorar essa hierarquia de agrupamentos em diferentes níveis para descobrir conceitos descritos por essa estrutura. O módulo proposto está integrado em um sistema maior, em desenvolvimento no Laboratório de Inteligência Computacional (LABIC), 
que contempla todas as etapas do processo de MD, desde o pré-processamento de dados ao pós-processamento de conhecimento. Para avaliar o módulo proposto e seu uso para descoberta de conceitos a partir da estrutura hierárquica de clusters, foram realizados diversos experimentos sobre conjuntos de dados naturais, assim como um estudo de caso utilizando um conjunto de dados real. Os resultados mostram a viabilidade da metodologia proposta para interpretação dos clusters, apesar da complexidade do processo ser dependente das características do conjunto de dados. / The Data Mining (DM) process consists of the automated extraction of patterns representing knowledge implicitly stored in large databases. In general, DM tasks can be classified into two categories: predictive and descriptive. Tasks in the first category, such as classification and prediction, perform inference on the data in order to make predictions, while tasks in the second category, such as clustering, characterize the general properties of the data. Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without a known class label. Clusters are formed so that objects in the same cluster are highly similar to one another but very dissimilar to objects in other clusters. Clustering can also facilitate the organization of clusters into a hierarchy that groups similar events together, forming a taxonomy that can simplify the interpretation of clusters. In this work, we propose and develop a non-supervised learning module that combines hierarchical clustering algorithms with several cluster analysis tools, aiming to help the domain specialist interpret the clustering results. 
Since hierarchical clustering groups objects according to similarity measures and organizes the clusters into a hierarchy, the user/specialist can analyze and explore this hierarchy at different levels in order to discover concepts described by the structure. The proposed module is integrated into a larger system under development by researchers from the Computational Intelligence Laboratory (LABIC), which covers all steps of the DM process, from data pre-processing to knowledge post-processing. To evaluate the implemented module and its use for discovering concepts from the hierarchical structure of clusters, several experiments on natural databases were carried out, as well as a case study using a real database. Results show the viability of the proposed methodology, although the complexity of the process depends on the characteristics of the database. Aprendizado não-supervisionado Exploração de dados Extração de padrões Data exploration Non-supervised learning Pattern extraction
153	Expansão de recursos para análise de sentimentos usando aprendizado semi-supervisionado / Extending sentiment analysis resources using semi-supervised learning Henrico Bertini Brum 23 March 2018 O grande volume de dados que temos disponíveis em ambientes virtuais pode ser excelente fonte de novos recursos para estudos em diversas tarefas de Processamento de Linguagem Natural, como a Análise de Sentimentos. Infelizmente é elevado o custo de anotação de novos córpus, que envolve desde investimentos financeiros até demorados processos de revisão. Nossa pesquisa propõe uma abordagem de anotação semissupervisionada, ou seja, anotação automática de um grande córpus não anotado partindo de um conjunto de dados anotados manualmente. Para tal, introduzimos o TweetSentBR, um córpus de tweets no domínio de programas televisivos que possui anotação em três classes e revisões parciais feitas por até sete anotadores. O córpus representa um importante recurso linguístico de português brasileiro, e fica entre os maiores córpus anotados na literatura para classificação de polaridades. Além da anotação manual do córpus, realizamos a implementação de um framework de aprendizado semissupervisionado que faz uso de dados anotados e, de maneira iterativa, expande o mesmo usando dados não anotados. O TweetSentBR, que possui 15.000 tweets anotados, é assim expandido cerca de oito vezes. Para a expansão, foram treinados modelos de classificação usando seis classificadores de polaridades, assim como foram avaliados diferentes parâmetros e representações a fim de obter um córpus confiável. Realizamos experimentos gerando córpus expandidos por cada classificador, tanto para a classificação em três polaridades (positiva, neutra e negativa) quanto para classificação binária. Avaliamos os córpus gerados usando um conjunto de held-out e comparamos a F-Measure da classificação usando como treinamento os córpus anotados manualmente e semiautomaticamente. 
O córpus semissupervisionado que obteve os melhores resultados para a classificação em três polaridades atingiu 62,14% de F-Measure média, superando a média obtida com as avaliações no córpus anotado manualmente (61,02%). Na classificação binária, o melhor córpus expandido obteve 83,11% de F1-Measure média, superando a média obtida na avaliação do córpus anotado manualmente (79,80%). Além disso, simulamos nossa expansão em córpus anotados da literatura, medindo o quão corretas são as etiquetas anotadas semi-automaticamente. Nosso melhor resultado foi na expansão de um córpus de reviews de produtos que obteve F-Measure de 93,15% com dados binários. Por fim, comparamos um córpus da literatura obtido por meio de supervisão distante e nosso framework semissupervisionado superou o primeiro na classificação de polaridades binária em cross-domain. / The high volume of data available on the Internet can be a good source of new resources for several Natural Language Processing tasks, such as Sentiment Analysis. Unfortunately, annotating new corpora is costly, involving both financial investment and lengthy revision processes. Our work proposes a semi-supervised annotation approach, that is, the automatic annotation of a large unlabeled corpus starting from a manually annotated set of documents. To that end, we introduce TweetSentBR, a corpus of tweets in the TV show domain annotated for 3-point (positive, neutral and negative) sentiment classification and partially reviewed by up to seven annotators. The corpus is an important linguistic resource for Brazilian Portuguese and ranks among the largest annotated corpora for polarity classification in the literature. Beyond the manual annotation, we implemented a semi-supervised learning framework that uses this labeled data and iteratively extends it using unlabeled data. The TweetSentBR corpus, containing 15,000 annotated tweets, was thereby expanded roughly eightfold. 
For the expansion, we trained classification models using six polarity classifiers and evaluated different parameters and representation schemes in order to obtain a reliable corpus. We ran experiments generating an expanded corpus with each classifier, both for 3-point and for binary classification. We evaluated the generated corpora on a held-out set, comparing the F-Measure obtained when training on the manually annotated corpus with that obtained when training on the semi-automatically annotated corpora. The semi-supervised corpus with the best results for 3-point classification achieved an average F-Measure of 62.14%, exceeding the average obtained with the manually annotated corpus (61.02%). In binary classification, the best expanded corpus achieved an average F-Measure of 83.11%, exceeding the average obtained with the manually annotated corpus (79.80%). Furthermore, we simulated our expansion on annotated corpora from the literature, measuring how correct the semi-automatically assigned labels are. Our best result was in the expansion of a product review corpus, achieving an F1-Measure of 93.15% on binary data. Finally, we compared our semi-supervised framework against a corpus from the literature built with distant supervision, and our framework outperformed it in cross-domain binary polarity classification. Análise de sentimentos Anotação de córpus Aprendizado semisupervisionado Corpus annotation Semi-supervised learning Sentiment analysis
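The iterative expansion described in entry 153 can be illustrated with a generic self-training loop. This is only a minimal sketch under stated assumptions, not the thesis's framework: the function names (`self_train`, `fit_word_votes`, `predict_word_votes`), the confidence threshold, and the toy word-vote classifier are all hypothetical stand-ins for the six polarity classifiers mentioned in the abstract.

```python
from collections import Counter, defaultdict


def fit_word_votes(labeled):
    """Toy classifier (assumption): count how often each word occurs per label."""
    votes = defaultdict(Counter)
    for text, label in labeled:
        for word in text.split():
            votes[word][label] += 1
    return votes


def predict_word_votes(model, text):
    """Return (label, confidence); confidence is the winning label's vote share."""
    tally = Counter()
    for word in text.split():
        tally.update(model.get(word, {}))
    if not tally:
        return None, 0.0  # no known words: no prediction
    label, top = tally.most_common(1)[0]
    return label, top / sum(tally.values())


def self_train(labeled, unlabeled, fit, predict, threshold=0.9, max_rounds=10):
    """Iteratively move confidently predicted items from the pool to the labeled set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = fit(labeled)
        confident, remaining = [], []
        for text in pool:
            label, confidence = predict(model, text)
            if confidence >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:  # nothing new was labeled, so stop early
            break
        labeled.extend(confident)
        pool = remaining
    return labeled, pool


seed = [("great show", "pos"), ("awful show", "neg")]
pool = ["great episode", "awful finale", "episode tonight", "show"]
expanded, leftover = self_train(seed, pool, fit_word_votes, predict_word_votes)
print(len(expanded), leftover)
```

Note that "episode tonight" only becomes labelable in the second round, after "great episode" has been pseudo-labeled; the same mechanism can also propagate labeling errors, which is why the abstract evaluates the expanded corpora on a held-out set.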
154	Generalized Domain Adaptation for Visual Domains January 2020 abstract: Humans have a great ability to recognize objects in different environments irrespective of their variations. However, the same does not apply to machine learning models, which are unable to generalize to images of objects from different domains. The generalization of these models to new data is constrained by the domain gap. Many factors such as image background, image resolution, color, camera perspective and variations in the objects are responsible for the domain gap between the training data (source domain) and testing data (target domain). Domain adaptation algorithms aim to overcome the domain gap between the source and target domains and learn robust models that can perform well across both domains. This thesis provides solutions for the standard problem of unsupervised domain adaptation (UDA) and the more generic problem of generalized domain adaptation (GDA). The contributions of this thesis are as follows. (1) A Certain and Consistent Domain Adaptation model for closed-set unsupervised domain adaptation that aligns the features of the source and target domains using deep neural networks. (2) A multi-adversarial deep learning model for generalized domain adaptation. (3) A gating model that detects out-of-distribution samples for generalized domain adaptation. The models were tested across multiple computer vision datasets for domain adaptation. The dissertation concludes with a discussion of the proposed approaches and future directions for research in closed-set and generalized domain adaptation. / Dissertation/Thesis / Masters Thesis Computer Science 2020 Computer science Adversarial Computer Vision Deep Learning Domain Adaptation Machine Learning semi-supervised learning
155	Ichthyoplankton Classification Tool using Generative Adversarial Networks and Transfer Learning Aljaafari, Nura 15 April 2018 The study and analysis of marine ecosystems is a significant part of marine science research. These systems are valuable resources for fisheries and for improving water quality, and can even be used in drug production. The investigation of ichthyoplankton inhabiting these ecosystems is also an important research field. Ichthyoplankton are fish in their early stages of life. At this stage, the fish are small and relatively similar in shape. The current way of identifying them is not optimal. Marine scientists typically study such organisms by sending a team to collect samples from the sea, which are then taken to the lab for further investigation. These samples need to be studied by an expert and usually end up requiring DNA sequencing. This method is time-consuming and requires a high level of expertise. Recent advances in AI have helped to solve and automate several difficult tasks, which motivated us to develop a classification tool for ichthyoplankton. We show that machine learning techniques such as generative adversarial networks combined with transfer learning solve this problem with high accuracy, whereas traditional machine learning algorithms fail to solve it. We also give a general framework for creating a classification tool when the training dataset is limited. We aim to build a user-friendly tool that anyone can use for the classification task, and to give researchers a guide they can follow when creating such a tool. Deep learning transfer learning ichthyoplankton semi-supervised learning marine Generative adversarial Networks
156	Learning in the Presence of Skew and Missing Labels Through Online Ensembles and Meta-reinforcement Learning Vafaie, Parsa 07 September 2021 Data streams are large sequences of data, possibly endless and temporally ordered, that are commonplace in Internet of Things (IoT) applications such as intrusion detection in computer networking, fraud detection in financial institutions, real-time tumor tracking in radiotherapy and social media analysis. Algorithms learning from such streams need to be able to construct near real-time models that continuously adapt to potential changes in patterns, in order to retain high performance throughout the stream. It follows that there are numerous challenges involved in supervised learning (or so-called classification) in such environments. One of the challenges in learning from streams is multi-class imbalance, in which the rates of instances in the different class labels differ substantially. Notably, classification algorithms may become biased towards the classes with more frequent instances, sacrificing the performance of the less frequent or so-called minority classes. Further, minority instances often arrive infrequently and in bursts, making accurate model construction problematic. For example, network intrusion detection systems must be able to distinguish between normal traffic and multiple minority classes corresponding to a variety of different types of attacks. Further, having labels for all instances is often infeasible, since labels may be missing or arrive late. For instance, when learning from a stream to detect network intrusions, the true label might not be available for all instances, or it might take time until the label is made available, especially for new types of attacks. In this thesis, we contribute to the advancement of online learning from evolving streams by focusing on the above-mentioned areas of multi-class imbalance and missing labels. 
First, we introduce a multi-class online ensemble algorithm designed to maintain a balanced performance over all classes. Specifically, our approach samples instances with replacement while dynamically increasing the weights of under-represented classes, in order to produce models that benefit all classes. Our experimental results show that our online ensemble method performs well against multi-class imbalanced data in various datasets. We further continue our study by introducing an approach to dealing with missing labels that utilizes both labelled and unlabelled data to increase a model's performance. That is, our method utilizes labelled data for pseudo-labelling unlabelled instances, allowing the model to perform better in environments where labels are scarce. More specifically, our approach features a meta-reinforcement learning agent, trained on multiple source streams, that can effectively select the prediction of a K nearest neighbours (K-NN) classifier as the label for unlabelled instances. Extensive experiments on benchmark datasets demonstrate the value and effectiveness of our approach and confirm that our method outperforms state-of-the-art methods. Machine learning Data streams Imbalanced learning Semi-supervised learning Meta-learning
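The idea in entry 156 of sampling with replacement while up-weighting under-represented classes can be sketched in the style of online bagging, where each arriving instance is presented to a base learner k times with k drawn from a Poisson distribution. The sketch below is an illustration under assumptions, not the thesis's algorithm: the class name `ClassWeightedResampler` and the specific rate formula (majority-class count divided by the instance's class count) are hypothetical choices.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class ClassWeightedResampler:
    """Decides how many times a base learner should train on each stream instance."""

    def __init__(self, seed=0):
        self.counts = Counter()  # running class frequencies seen so far
        self.rng = random.Random(seed)

    def rate(self, label):
        # Assumption: the Poisson rate grows as a class becomes rarer,
        # and equals 1.0 for the currently most frequent class.
        return max(self.counts.values()) / self.counts[label]

    def copies(self, label):
        """Update the class counts, then sample the number of training repetitions."""
        self.counts[label] += 1
        return poisson(self.rate(label), self.rng)


resampler = ClassWeightedResampler(seed=42)
stream = ["normal"] * 9 + ["attack"]
for label in stream:
    k = resampler.copies(label)
    # A real ensemble member would call its own incremental update k times here.
print(resampler.rate("normal"), resampler.rate("attack"))
```

In a full ensemble each member would hold its own random generator, so that members see different resampled versions of the stream.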
157	Optimization of Insert-Tray Matching using Machine Learning Hedberg, Karolina January 2021 The manufacturing process of carbide inserts at Sandvik Coromant consists of several operations. During some of these, the inserts are positioned on trays. For some inserts the trays are pre-defined, but for others the insert-tray matching is partly improvised. The goal of this thesis project is to examine whether machine learning can be used to predict which tray to use for a given insert. It also investigates which insert features determine the choice of tray. The study uses insert and tray data from four blasting operations and considers a set of standardized inserts, since the tray matching for these is assumed to be well tuned. The predictions are made with the supervised learning algorithm k-nearest neighbors. Identifying the determining features is treated as a feature selection problem and is done with the ReliefF algorithm. The classification results show that the classifiers are overfitting. The main reason is probably that the datasets contain features that together uniquely determine which tray is used. This was not detected during feature selection, since ReliefF identifies features that are individually relevant to the output. One idea for avoiding overfitting is to exclude these defining features from the dataset. Further work is thus recommended. Machine learning Supervised learning Feature selection Computer and Information Sciences Data- och informationsvetenskap
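For readers unfamiliar with the classifier named in entry 157, a k-nearest-neighbors prediction can be sketched in a few lines. This is a minimal illustration only: the feature vectors (two made-up insert dimensions) and the tray labels are invented for the example and do not come from the thesis data.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k training points nearest to query.

    train is a list of (feature_tuple, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical data: (length_mm, width_mm) of an insert -> tray used.
train = [
    ((10.0, 10.0), "tray_A"),
    ((11.0, 10.0), "tray_A"),
    ((12.0, 11.0), "tray_A"),
    ((30.0, 30.0), "tray_B"),
    ((31.0, 29.0), "tray_B"),
    ((29.0, 31.0), "tray_B"),
]
print(knn_predict(train, (10.5, 10.2)))
```

Because the vote depends on raw distances, features measured on different scales would normally be standardized first; the sketch skips that step.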
158	Identifying Crime Hotspot: Evaluating the suitability of Supervised and Unsupervised Machine learning Hussein, Abdul Aziz 05 October 2021 No description available. Information Technology Crime hotspots machine learning supervised learning unsupervised learning classification clustering
159	Object Detection and Semantic Segmentation Using Self-Supervised Learning Gustavsson, Simon January 2021 In this thesis, three well-known self-supervised methods have been implemented and trained on road scene images. The three so-called pretext tasks RotNet, MoCov2, and DeepCluster were used to train a neural network in a self-supervised manner. The self-supervised networks were then evaluated with different amounts of labeled data on two downstream tasks, object detection and semantic segmentation. The performance of the self-supervised methods is compared to that of networks trained from scratch on the respective downstream task. The results show that it is possible to achieve a performance increase using self-supervision on a dataset containing road scene images only. When only a small amount of labeled data is available, the performance increase can be substantial, e.g., an mIoU increase from 33 to 39 when training semantic segmentation on 1750 images with a RotNet pre-trained backbone compared to training from scratch. However, when a large number of labeled images is available (>70000 images), the self-supervised pretraining appears to increase the performance less, or not at all. Self-supervised learning Computer vision
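Of the pretext tasks named in entry 159, RotNet is the simplest to illustrate: each unlabeled image is rotated by 0, 90, 180 and 270 degrees, and the network is trained to predict which rotation was applied, so the labels come for free. The sketch below only generates such training pairs on a tiny grid of numbers; the function names are hypothetical, and a real pipeline would operate on image tensors and feed the pairs to a neural network.

```python
def rotate90(image):
    """Rotate a 2-D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotnet_samples(image):
    """Return four (rotated_image, label) pairs; label k means a k * 90 degree rotation."""
    samples = []
    current = [list(row) for row in image]  # copy so the input is not modified
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples


tiny_image = [[1, 2],
              [3, 4]]
for rotated, label in rotnet_samples(tiny_image):
    print(label, rotated)
```

The backbone trained on this four-way classification task is then reused for the downstream tasks, which is the pre-training step the abstract compares against training from scratch.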
160	How to annotate in video for training machine learning with a good workflow Jakob, Persson January 2021 Artificial intelligence and machine learning are used in many different areas, one of which is image recognition. In the production of a TV show or film, image recognition can help the editors find specific objects, scenes, or people in the video content, which speeds up production. But image recognition does not always work perfectly, and so cannot yet be used in the production of a TV show or film as intended. The image recognition algorithms therefore need to be trained on large datasets to improve. Creating these datasets takes time, and tools are needed that let users create specific datasets and retrain algorithms so that they improve. The aim of this master's thesis was to investigate whether it was possible to create a tool that can annotate objects and people in video content and use the data as training sets, as well as a tool that can retrain an image recognition algorithm based on its output so that it improves. It was also important that the tools offer a good workflow for the users. The study consisted of a theoretical study to gain more knowledge about annotation and about how to make a good UX design with a good workflow. Interviews were also held to learn more about the requirements for the product. This resulted in a user scenario and a workflow that were used, together with the knowledge from the theoretical study, to create a hi-fi prototype through an iterative process with usability testing. The result was a final hi-fi prototype with a good design and a good workflow for the users, in which it is possible to annotate objects and people with a bounding box and to retrain an image recognition program that has been run on video content. / Artificiell intelligens och maskininlärning används inom många olika områden, ett av dessa områden är bildigenkänning. 
Vid produktionen av ett TV-program eller av en film kan bildigenkänning användas för att hjälpa redigerarna att hitta specifika objekt, scener eller personer i videoinnehållet, vilket påskyndar produktionen. Men bildigenkänningsprogram fungerar inte alltid helt perfekt och kan inte användas i produktionen av ett TV-program eller film som det är tänkt att användas i det sammanhanget. För att förbättra bildigenkänningsprogram så behöver dess algoritm tränas på stora datasets av bilder och labels. Men att skapa dessa datasets tar tid och det behövs program som kan skapa datasets och återträna algoritmer för bildigenkänning så att de fungerar bättre. Syftet med detta examensarbete var att undersöka om det var möjligt att skapa ett verktyg som kan markera(annotera) objekt och personer i video och använda datat som träningsdata för algoritmer. Men även att skapa ett verktyg som kan återträna algoritmer för bildigenkänning så att de blir bättre utifrån datat man får från ett bildigenkänningprogram. Det var också viktigt att dessa verktyg hade ett bra arbetsflöde för användarna. Studien bestod av en teoretisk studie för att få mer kunskap om annoteringar i video och hur man skapar bra UX-design med ett bra arbetsflöde. Intervjuer hölls också för att få mer kunskap om kraven på produkten och vilka som skulle använda den. Det resulterade i ett användarscenario och ett arbetsflöde som användes tillsammans med kunskapen från den teoretiska studien för att skapa en hi-fi prototyp, där en iterativ process med användbarhetstestning användes. Detta resulterade i en slutlig hi-fi prototyp med bra design och ett bra arbetsflöde för användarna där det är möjligt att markera(annotera) objekt och personer med en bounding box och där det är möjligt att återträna algoritmer för bildigenkänning som har körts på video. Video annotation tool Machine learning Logger User experience Supervised learning Interaction Technologies Interaktionsteknik
