281 |
Nouvelles approches itératives avec garanties théoriques pour l'adaptation de domaine non supervisée / New iterative approaches with theoretical guarantees for unsupervised domain adaptation
Peyrache, Jean-Philippe, 11 July 2014
Over the past few years, interest in machine learning has grown steadily in domains as varied as image recognition and medical data analysis. However, a limitation of the classical PAC framework has recently been highlighted. It led to the emergence of a new line of research, Domain Adaptation (DA), in which the training data are assumed to come from a distribution (the source) different from the one (the target) that generates the test data. The first theoretical works concluded that good performance on the target domain can be obtained by simultaneously minimizing the source error and a divergence term between the two distributions. Three main categories of approaches derive from this idea: reweighting, reprojection, and self-labeling. In this thesis, we propose two contributions. The first is a reprojection approach based on boosting theory and designed for numerical data. It offers interesting theoretical guarantees and also appears able to achieve good generalization performance. Our second contribution is twofold. First, we propose a framework that addresses the lack of theoretical results for self-labeling methods by giving necessary conditions for this kind of algorithm to succeed. Second, within this framework, we propose a new approach that uses the theory of (epsilon, gamma, tau)-good similarity functions to get around the limitations imposed by kernel theory in the context of structured data.
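The trade-off stated in this abstract, source error plus a divergence term, is usually written as a generalization bound. The form below is one commonly cited version from the domain adaptation literature, given only as an illustration: the abstract does not say which bound the thesis relies on, and all symbols (hypothesis class \mathcal{H}, source and target distributions D_S and D_T, errors \epsilon_S and \epsilon_T) are assumptions introduced here.

```latex
\epsilon_T(h) \;\le\; \epsilon_S(h)
  + \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T)
  + \lambda,
\qquad
\lambda = \min_{h' \in \mathcal{H}} \bigl[\, \epsilon_S(h') + \epsilon_T(h') \,\bigr]
```

Read this way, reweighting and reprojection methods act mainly on the divergence term, while the last term measures whether any single hypothesis can do well on both domains at once.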
|
282 |
Anotação automática semissupervisionada de papéis semânticos para o português do Brasil / Automatic semi-supervised semantic role labeling for Brazilian Portuguese
Fernando Emilio Alva Manchego, 22 January 2013
Semantic role labeling (SRL) is a natural language processing (NLP) task that analyzes part of the meaning of sentences by detecting the events they describe and the participants involved, which is essential for computers to make effective use of the information encoded in text. Most SRL research has been carried out on English texts, taking into account the grammatical and semantic particularities of that language, which prevents those tools and results from being transferred directly to other languages such as Portuguese. Most current SRL systems use supervised machine learning methods and therefore require a large corpus of sentences annotated with semantic roles in order to learn the task properly. For Brazilian Portuguese, a lexical resource that provides this type of information has recently become available: PropBank.Br. However, compared with corpora for other languages such as English, the corpus provided by that project is small, so a classifier trained on it in a supervised manner would not be expected to achieve high performance on the labeling task. To deal with this problem, this dissertation uses a semi-supervised approach capable of extracting relevant information from both the annotated and the unannotated data available, making it less dependent on the training corpus. We implemented the self-training algorithm with logistic regression (maximum entropy) models as the base classifier to label the Bosque corpus (the section corresponding to CETENFolha) of the Floresta Sintá(c)tica with the PropBank.Br semantic role tags. To the original algorithm we added balancing and similarity measures between the arguments of a specific verb in order to improve performance on the argument classification task. Using an evaluation benchmark implemented in this work, the proposed semi-supervised approach achieved performance statistically comparable to that of a supervised classifier trained with a larger amount of annotated data (80.5 vs. 82.3 F1, p > 0.01).
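The self-training loop named in this abstract can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation: it swaps the logistic regression base classifier for a one-dimensional nearest-centroid classifier so that it runs on the standard library alone, it omits the balancing and verb-specific similarity steps, and every function name, the confidence measure, and the `threshold` parameter are assumptions made for the example.

```python
from statistics import mean


def fit_centroids(points, labels):
    """Fit a stand-in classifier: one centroid per label, for 1-D points."""
    return {c: mean(p for p, l in zip(points, labels) if l == c)
            for c in set(labels)}


def predict_with_confidence(centroids, x):
    """Return (label, margin); margin is how much closer x is to the best
    centroid than to the runner-up. Requires at least two labels."""
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    best, second = ranked[0], ranked[1]
    margin = abs(x - centroids[second]) - abs(x - centroids[best])
    return best, margin


def self_train(labeled_x, labeled_y, unlabeled_x, threshold=1.0, max_rounds=10):
    """Repeatedly train, label the confident unlabeled points, and retrain.

    Returns the final model and the points that were never labeled.
    """
    x, y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        model = fit_centroids(x, y)
        confident, remaining = [], []
        for u in pool:
            label, margin = predict_with_confidence(model, u)
            if margin >= threshold:
                confident.append((u, label))
            else:
                remaining.append(u)
        if not confident:
            break  # nothing new to learn from; stop early
        for u, label in confident:
            x.append(u)
            y.append(label)
        pool = remaining
    return fit_centroids(x, y), pool
```

Calling `self_train([0.0, 10.0], ["A", "B"], [1.0, 2.0, 9.0, 8.0, 5.0])` absorbs the four points that lie clearly on one side and leaves the ambiguous midpoint unlabeled. The threshold is the main design choice: a low value grows the training set quickly but risks reinforcing early mistakes.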
|
283 |
On the Effectiveness of Multi-Task Learning: An evaluation of Multi-Task Learning techniques in deep learning models
Tovedal, Sofiea, January 2020
Multi-Task Learning is today an interesting and promising field that many describe as essential for the next level of advancement in machine learning. In practice, however, Multi-Task Learning is used far less often in real-world implementations than its more popular cousin, Transfer Learning. The question is why that is, and whether Multi-Task Learning outperforms its Single-Task counterparts. In this thesis, different Multi-Task Learning architectures were used to build a model that labels real technical issues within two categories. The model faces a challenging, imbalanced data set with many labels to choose from and only short texts on which to base its predictions. Can task sharing be the answer to these problems? This thesis investigated three Multi-Task Learning architectures and compared their performance to that of a Single-Task model. An authentic data set and two labeling tasks were used to train the models with supervised learning. The four model architectures (Single-Task, Multi-Task, Cross-Stitched, and Shared-Private) first went through hyperparameter tuning using one of two layer options, LSTM or GRU. They were then boosted by auxiliary tasks and finally evaluated against each other.
|
284 |
Flexible Structured Prediction in Natural Language Processing with Partially Annotated Corpora
Xiao Zhang (8776265), 29 April 2020
Structured prediction makes coherent decisions in the form of structured objects that represent the interrelations of the predicted variables. It has been widely used in many areas, such as bioinformatics, computer vision, speech recognition, and natural language processing. Machine learning with reduced supervision aims to reduce laborious and error-prone annotation effort and to benefit low-resource languages. In this dissertation we study structured prediction with reduced supervision for two sets of problems, sequence labeling and dependency parsing, both of which are representative structured prediction problems in NLP. We investigate three different approaches.
The first approach is learning with a modular architecture through task decomposition. By decomposing the labels into a location sub-label and a type sub-label, we designed neural modules to tackle each sub-label, with an additional module to fuse the information. Experiments on benchmark datasets show that the modular architecture outperforms existing models and can use partially labeled data together with fully labeled data to improve on the performance obtained with fully labeled data alone.
The second approach builds the neural CRF autoencoder (NCRFAE) model, which combines a discriminative component and a generative component for semi-supervised sequence labeling. The model has a unified structure with shared parameters and uses different loss functions for labeled and unlabeled data. We developed a variant of the EM algorithm for optimizing the model with tractable inference. Experiments on POS tagging in several languages show that the model outperforms existing systems in both supervised and semi-supervised setups.
The third approach builds two models for semi-supervised dependency parsing, namely the local autoencoding parser (LAP) and the global autoencoding parser (GAP). LAP assumes the chain-structured sentence has a latent representation and uses this representation to construct the dependency tree, while GAP treats the dependency tree itself as a latent variable. Both models have unified structures for sentences with and without annotated parse trees. Experiments on several languages show that both parsers can use unlabeled sentences to improve on the performance obtained with labeled sentences alone; LAP is faster, while GAP outperforms existing models.
|
285 |
Self-supervised Representation Learning via Image Out-painting for Medical Image Analysis
January 2020
In recent years, Convolutional Neural Networks (CNNs) have been widely used in both the computer vision community and the medical imaging community. In particular, using CNNs pre-trained on large-scale datasets (e.g., ImageNet) via transfer learning has become the de facto standard for a variety of medical imaging applications.
However, to fit the current paradigm, 3D imaging tasks have to be reformulated and solved in 2D, losing rich 3D contextual information. Moreover, models pre-trained on natural images never see any biomedical images and have no knowledge of the anatomical structures present in medical images. To overcome these limitations, this thesis proposes an image out-painting self-supervised proxy task for developing pre-trained models directly from medical images without systematic annotations. The idea is to randomly mask an image and train the model to predict the missing region. It is demonstrated that, by predicting missing anatomical structures when seeing only parts of the image, the model learns generic representations that yield better performance on various medical imaging applications via transfer learning.
Extensive experiments demonstrate that the proposed proxy task outperforms training from scratch in six out of seven medical imaging applications covering 2D and 3D classification and segmentation. Moreover, the image out-painting proxy task offers performance competitive with state-of-the-art models pre-trained on ImageNet and with other self-supervised baselines such as in-painting. Owing to its strong performance, out-painting is used as one of the self-supervised proxy tasks to provide generic 3D pre-trained models for medical image analysis. / Dissertation/Thesis / Masters Thesis Computer Science 2020
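The masking step behind the out-painting proxy task can be illustrated with a small sketch. It assumes the usual reading of out-painting, in which a randomly placed window of the image is kept and everything outside it is hidden, so that the model must reconstruct the surroundings. The abstract does not give the window shape, size, or fill value, so the function names and the parameters `keep_h`, `keep_w`, and `fill` are hypothetical, and a plain nested list stands in for an image.

```python
import random


def outpainting_mask(height, width, keep_h, keep_w, rng=None):
    """Return (mask, (top, left)); mask is 1 inside a random kept window, else 0."""
    if keep_h > height or keep_w > width:
        raise ValueError("kept window must fit inside the image")
    rng = rng or random.Random()
    top = rng.randint(0, height - keep_h)    # randint is inclusive at both ends
    left = rng.randint(0, width - keep_w)
    mask = [[1 if top <= r < top + keep_h and left <= c < left + keep_w else 0
             for c in range(width)]
            for r in range(height)]
    return mask, (top, left)


def apply_mask(image, mask, fill=0):
    """Keep pixels where mask is 1 and overwrite the rest with `fill`."""
    return [[px if m else fill for px, m in zip(row, mask_row)]
            for row, mask_row in zip(image, mask)]
```

In training, the masked image would be the model input and the original image the reconstruction target. In-painting, the baseline named in the abstract, is the complementary case in which the window is hidden and the surroundings are kept.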
|
286 |
Positive unlabeled learning applications in music and healthcare
Arjannikov, Tom, 10 September 2021
The supervised and semi-supervised machine learning paradigms hinge on the idea that the training data is labeled. Label quality is often called into question, and problems related to noisy, inaccurate, or missing labels are widely studied. One of these is an interesting and prevalent problem in semi-supervised classification in which only some positive labels are known, while the rest of the available data, often the majority, is unlabeled; that is, there are no labeled negative examples. Known as Positive-Unlabeled (PU) learning, this problem has been identified with increasing frequency across many disciplines, including but not limited to health science, biology, bioinformatics, geoscience, physics, business, and politics. There are also several closely related machine learning problems, such as cost-sensitive learning and mixture proportion estimation.
This dissertation explores the PU learning problem from the perspective of density estimation and proposes a new modular method compatible with the relabeling framework that is common in PU learning literature. This approach is compared with two existing algorithms throughout the manuscript, one from a seminal work by Elkan and Noto and a current state-of-the-art algorithm by Ivanov. Furthermore, this thesis identifies two machine learning application domains that can benefit from PU learning approaches, which were not previously seen that way: predicting length of stay in hospitals and automatic music tagging. Experimental results with multiple synthetic and real-world datasets from different application domains validate the proposed approach.
Accurately predicting the in-hospital length of stay (LOS) at the time of admission can positively impact healthcare metrics, particularly in novel response scenarios such as the Covid-19 pandemic. During the regular steady-state operation, traditional classification algorithms can be used for this purpose to inform planning and resource management. However, when there are sudden changes to the admission and patient statistics, such as during the onset of a pandemic, these approaches break down because reliable training data becomes available only gradually over time. This thesis demonstrates the effectiveness of PU learning approaches in such situations through experiments by simulating the positive-unlabeled scenario using two fully-labeled publicly available LOS datasets.
Music auto-tagging systems are typically trained using tag labels provided by human listeners. In many cases, this labeling is weak, which means that the provided tags are valid for the associated tracks, but there can be tracks for which a tag would be valid but not present. This situation is analogous to PU learning with the additional complication of being a multi-label scenario. Experimental results on publicly available music datasets with tags representing three different labeling paradigms demonstrate the effectiveness of PU learning techniques in recovering the missing labels and improving auto-tagger performance. / Graduate
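The Elkan and Noto baseline mentioned in this abstract rests on a simple correction, sketched below. Under the assumption that labeled positives are selected completely at random from all positives, a classifier g(x) trained to separate labeled from unlabeled examples estimates P(s=1 | x), and the true posterior is P(y=1 | x) = g(x) / c, where c = P(s=1 | y=1) can be estimated as the mean of g over held-out labeled positives. The sketch takes classifier scores as given; the function names are hypothetical, and it does not reproduce the density-estimation method proposed in the dissertation.

```python
def estimate_label_frequency(scores_on_labeled_positives):
    """Estimate c = P(s=1 | y=1) as the mean score over held-out labeled positives."""
    if not scores_on_labeled_positives:
        raise ValueError("need at least one labeled positive score")
    return sum(scores_on_labeled_positives) / len(scores_on_labeled_positives)


def posterior_positive(score, c):
    """Convert g(x), an estimate of P(s=1 | x), into P(y=1 | x) = g(x) / c.

    The result is clipped to 1.0 because estimation noise can push the
    ratio above a valid probability.
    """
    if c <= 0:
        raise ValueError("label frequency c must be positive")
    return min(1.0, score / c)
```

For example, if held-out labeled positives receive an average score of 0.5, an unlabeled example scored 0.25 is assigned a corrected positive probability of 0.5.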
|
287 |
Automatic Prediction of Human Age based on Heart Rate Variability Analysis using Feature-Based Methods
Al-Mter, Yusur, January 2020
Heart rate variability (HRV) is the variation in time between adjacent heartbeats. This variation is regulated by the autonomic nervous system (ANS) and its two branches, the sympathetic and parasympathetic nervous systems. HRV is considered an essential clinical tool for estimating the imbalance between the two branches, and hence an indicator of age and cardiac-related events. This thesis focuses on ECG recordings made during nocturnal rest to assess how well HRV predicts the age decade of healthy individuals. Time-domain, frequency-domain, and non-linear methods are explored to extract the HRV features. Three feature-based methods (support vector machine (SVM), random forest, and extreme gradient boosting (XGBoost)) were employed, and the overall test accuracy in predicting the actual class was relatively low (below 30%). The SVM classifier had the lowest performance, while random forest and XGBoost performed slightly better. Although the difference is negligible, random forest had the highest test accuracy, approximately 29%, using a subset of ten optimal HRV features. Furthermore, to validate the findings, the original dataset was shuffled and used as a test set, and the performance was compared with other related research results.
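As an illustration of the time-domain side of HRV feature extraction, the sketch below computes four standard measures from a list of RR intervals in milliseconds. The abstract does not list which features the thesis used, so this selection (mean RR, SDNN, RMSSD, pNN50) and the function name are assumptions; frequency-domain and non-linear features are not covered.

```python
from math import sqrt
from statistics import mean, stdev


def hrv_time_domain(rr_ms):
    """Compute common time-domain HRV features from RR intervals (milliseconds).

    Needs at least two intervals so that successive differences exist.
    """
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return {
        "mean_rr": mean(rr_ms),                       # average interval
        "sdnn": stdev(rr_ms),                         # sample std of intervals
        "rmssd": sqrt(mean(d * d for d in diffs)),    # RMS of successive differences
        "pnn50": 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs),
    }
```

A feature vector like this, computed per recording, is the kind of input a feature-based classifier such as a random forest would consume.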
|
288 |
Classification d’objets au moyen de machines à vecteurs supports dans les images de sonar de haute résolution du fond marin / Object classification using support vector machines in high resolution sonar seabed imagery
Rousselle, Denis, 28 November 2016
This thesis aims to improve the classification of underwater objects in high-resolution sonar images. In particular, we seek to distinguish mines from harmless objects within a collection of mine-like objects. Our research was guided by two classical constraints of mine warfare: the lack of data and the need for readable decisions. We therefore built a database that is as representative as possible and simulated objects in order to complete it. The lack of examples led us to use a compact representation that originated in face recognition: the Structural Binary Gradient Patterns (SBGP). To the same end, we derived a semi-supervised domain adaptation method, based on optimal transport, that is easy to interpret. Finally, we developed a new classification algorithm, the Ensemble of Exemplar-Maximum Excluding Ball (EE-MEB), which is suited to small datasets and whose decision function is also easy to analyze.
|
289 |
FAULT DETECTION FOR SMALL-SCALE PHOTOVOLTAIC POWER INSTALLATIONS: A Case Study of a Residential Solar Power System
Brüls, Maxim, January 2020
Fault detection for residential photovoltaic power systems is an often-ignored problem. This thesis introduces a novel method for detecting power losses caused by faults in solar panel performance. Five years of data from a residential system in Dalarna, Sweden, were used to fit a random forest regression model that estimates power production. Estimated power was compared with true power to assess the performance of the power generating system. Faults can be identified from trends in the difference between true and estimated power production. The model is able to identify consistent energy losses of 10% or more of the expected power output, while requiring only minimal modifications to existing power generating systems.
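The comparison step described in this abstract, estimated versus true power, can be sketched as follows. The expected values would come from the regression model; here they are plain inputs. The 10% threshold is taken from the abstract, while the trailing-window averaging used to capture "consistent" losses, the `window` parameter, and the function names are assumptions made for illustration.

```python
def relative_loss(expected, measured):
    """Fraction of expected power that was not produced (0 when nothing was expected)."""
    return [(e - m) / e if e > 0 else 0.0 for e, m in zip(expected, measured)]


def flag_consistent_losses(expected, measured, threshold=0.10, window=3):
    """Return indices where the mean loss over the trailing window reaches the threshold."""
    losses = relative_loss(expected, measured)
    flagged = []
    for i in range(window - 1, len(losses)):
        recent = losses[i - window + 1:i + 1]
        if sum(recent) / window >= threshold:
            flagged.append(i)
    return flagged
```

Averaging over a window rather than flagging single readings keeps brief dips, such as a passing cloud the model did not anticipate, from being reported as faults; the right window length depends on the sampling interval of the data.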
|
290 |
A contemporary machine learning approach to detect transportation mode - A case study of Borlänge, Sweden
Golshan, Arman, January 2020
Understanding travel behavior and identifying the mode of transportation are essential for sound urban design and transportation planning. Global positioning system (GPS) tracking data is the main source used to find human mobility patterns in cities. Some travel information, such as the most visited location, temporal changes, and trip speed, can be easily extracted from raw GPS tracking data. GPS trajectories can also be used to infer the mobility modes of commuters. Most previous studies have applied traditional machine learning algorithms to manually computed features, which makes the models error-prone. There is thus a need for a new model that addresses these weaknesses. The primary purpose of this study is to propose a semi-supervised model that identifies transportation mode using a contemporary machine learning algorithm and GPS tracking data. The model accepts GPS trajectories of adjustable length and extracts their latent information with an LSTM autoencoder. A deep neural network with three hidden layers then maps the latent information to a transportation mode. Different case studies are performed to evaluate the proposed model's efficiency. The model reaches an accuracy of 93.6%, which significantly outperforms similar studies.
|