About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Active visual category learning

Vijayanarasimhan, Sudheendra 02 June 2011 (has links)
Visual recognition research develops algorithms and representations to autonomously recognize visual entities such as objects, actions, and attributes. The traditional protocol involves manually collecting training image examples, annotating them in specific ways, and then learning models to explain the annotated examples. However, this is a rather limited way to transfer human knowledge to visual recognition systems, particularly considering the immense number of visual concepts that are to be learned. I propose new forms of active learning that facilitate large-scale transfer of human knowledge to visual recognition systems in a cost-effective way. The approach is cost-effective in the sense that the division of labor between the machine learner and the human annotators respects any cues regarding which annotations would be easy (or hard) for either party to provide. The approach is large-scale in that it can deal with a large number of annotation types, multiple human annotators, and huge pools of unlabeled data. In particular, I consider three important aspects of the problem: (1) cost-sensitive multi-level active learning, where the expected informativeness of any candidate image annotation is weighed against the predicted cost of obtaining it in order to choose the best annotation at every iteration; (2) budgeted batch active learning, a novel active learning setting that perfectly suits automatic learning from crowd-sourcing services where there are multiple annotators and each annotation task may vary in difficulty; and (3) sub-linear time active learning, where one needs to retrieve those points that are most informative to a classifier in time that is sub-linear in the number of unlabeled examples, i.e., without having to exhaustively scan the entire collection. Using the proposed solutions for each aspect, I then demonstrate a complete end-to-end active learning system for scalable, autonomous, online learning of object detectors. The approach provides state-of-the-art recognition and detection results, while using minimal total manual effort. Overall, my work enables recognition systems that continuously improve their knowledge of the world by learning to ask the right questions of human supervisors.
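To make the budgeted cost-benefit trade-off concrete, here is a minimal sketch of a batch selection step. The `informativeness` and `predicted_cost` functions are hypothetical stand-ins; the thesis's actual selection criteria are more sophisticated than this greedy knapsack-style heuristic.

```python
# Hypothetical sketch: greedy budgeted batch selection for active learning.
# `informativeness` and `predicted_cost` stand in for the learner's actual
# criteria, which the abstract does not specify in detail.

def select_batch(candidates, informativeness, predicted_cost, budget):
    """Greedily pick annotations with the best value per unit cost until
    the annotation budget is exhausted (a knapsack-style heuristic)."""
    ranked = sorted(candidates,
                    key=lambda x: informativeness(x) / predicted_cost(x),
                    reverse=True)
    batch, spent = [], 0.0
    for x in ranked:
        c = predicted_cost(x)
        if spent + c <= budget:
            batch.append(x)
            spent += c
    return batch
```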
12

Cost-sensitive boosting : a unified approach

Nikolaou, Nikolaos January 2016 (has links)
In this thesis we provide a unifying framework for two decades of work in an area of Machine Learning known as cost-sensitive Boosting algorithms. This area is concerned with the fact that most real-world prediction problems are asymmetric, in the sense that different types of errors incur different costs. Adaptive Boosting (AdaBoost) is one of the most well-studied and utilised algorithms in the field of Machine Learning, with a rich theoretical depth as well as practical uptake across numerous industries. However, its inability to handle asymmetric tasks has been the subject of much criticism. As a result, numerous cost-sensitive modifications of the original algorithm have been proposed. Each of these has its own motivations, and its own claims to superiority. With a thorough analysis of the literature from 1997 to 2016, we find 15 distinct cost-sensitive Boosting variants, discounting minor variations. We critique the literature using four powerful theoretical frameworks: Bayesian decision theory, the functional gradient descent view, margin theory, and probabilistic modelling. From each framework, we derive a set of properties which must be obeyed by boosting algorithms. We find that only 3 of the published AdaBoost variants are consistent with the rules of all the frameworks, and even they require their outputs to be calibrated to achieve this. Experiments on 18 datasets, across 21 degrees of cost asymmetry, all support the hypothesis, showing that once calibrated, the three variants perform equivalently, outperforming all others. Our final recommendation, based on theoretical soundness, simplicity, flexibility and performance, is to use the original AdaBoost algorithm, albeit with a shifted decision threshold and calibrated probability estimates. The conclusion is that novel cost-sensitive boosting algorithms are unnecessary if proper calibration is applied to the original.
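A sketch of that recommendation, assuming scikit-learn; the calibration method, estimator count, and cost values below are illustrative placeholders rather than the thesis's settings:

```python
# Sketch: calibrate AdaBoost's scores, then shift the decision threshold
# by the cost ratio (the Bayes-optimal threshold under asymmetric costs).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.calibration import CalibratedClassifierCV

def cost_sensitive_adaboost(X_train, y_train, cost_fp, cost_fn):
    base = AdaBoostClassifier(n_estimators=200)
    # Calibrate probability estimates (isotonic regression here).
    model = CalibratedClassifierCV(base, method="isotonic", cv=5)
    model.fit(X_train, y_train)
    # Predict positive when p(y=1|x) * cost_fn > (1 - p) * cost_fp,
    # i.e. when p >= cost_fp / (cost_fp + cost_fn).
    threshold = cost_fp / (cost_fp + cost_fn)
    return model, threshold

# Usage: model, t = cost_sensitive_adaboost(X, y, cost_fp=1, cost_fn=5)
#        y_pred = (model.predict_proba(X_new)[:, 1] >= t).astype(int)
```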
13

Similaridade de algoritmos em cenários sensíveis a custo / Algorithm similarity in cost-sensitive scenarios

MELO, Carlos Eduardo Castor de 27 August 2015 (has links)
The analysis of the similarity between machine learning algorithms is an important aspect of Meta-Learning, where knowledge gathered from known learning processes can be used to guide the selection of algorithms to tackle new learning problems. This similarity is usually calculated through global performance metrics, which omit important information about algorithm behavior. There are also approaches where performance is verified individually on each instance of a problem. Neither approach considers the costs associated with each problem class, hence they neglect information that can be very important in different learning contexts. In this study, metrics are presented to evaluate the performance of algorithms in cost-sensitive scenarios. Each scenario is described by a threshold choice method, used to build a crisp classifier from a learned model. Based on the performance values for each problem instance, a method is proposed to measure the similarity between algorithms at a local level (for each problem) and at a global level (across all problems observed). The experiments used to illustrate these metrics were performed in a Meta-Learning study using 19 algorithms for the classification of the instances of 152 learning problems. The similarity measures were used to create hierarchical clusters. The clusters created show how the behavior of the algorithms varies according to the cost scenario to be treated.
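A rough sketch of this kind of instance-level comparison, under simplifying assumptions of my own (binary problems, a cost-proportional threshold choice, and Euclidean distance between per-instance loss vectors; the dissertation's measures may differ):

```python
import numpy as np

def instance_losses(scores, y_true, cost_fp, cost_fn):
    """Per-instance misclassification cost for one model on one problem,
    using a cost-proportional threshold choice (one of several options)."""
    t = cost_fp / (cost_fp + cost_fn)
    y_pred = (scores >= t).astype(int)
    return np.where(y_true == 1,
                    cost_fn * (y_pred == 0),   # false negatives
                    cost_fp * (y_pred == 1))   # false positives

def similarity(scores_a, scores_b, y_true, cost_fp, cost_fn):
    """Problem-level similarity of two algorithms: distance between their
    per-instance loss vectors (smaller = more similar behavior)."""
    la = instance_losses(scores_a, y_true, cost_fp, cost_fn)
    lb = instance_losses(scores_b, y_true, cost_fp, cost_fn)
    return np.linalg.norm(la - lb)
```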
14

Detecção de fraudes em cartões: um classificador baseado em regras de associação e regressão logística / Card fraud detection: a classifier based on association rules and logistic regression

Paulo Henrique Maestrello Assad Oliveira 11 December 2015 (has links)
Credit and debit cards are highly utilized payment methods, a fact that attracts fraudsters. The card market treats fraud as an operating cost, which is passed on to consumers and to society at large. Moreover, the high volume of transactions and the need to combat fraud open space for Machine Learning techniques, among them classifiers. A type of classifier widely used in this domain is the rule-based classifier. However, in practice these classifiers depend heavily on domain experts: professionals who detect the patterns of fraudulent transactions, turn them into rules, and implement those rules in the classification systems. Given this scenario, the aim of this thesis is to propose an architecture based on association rules and logistic regression - techniques studied in Machine Learning - to mine rules from the data and produce, as a result, rule sets for detecting fraudulent transactions, making them available to the domain experts. These professionals will thus have the aid of computers to discover and generate the rules that underpin the classifier, reducing the chance of fraudulent patterns going unrecognized and making the generation and maintenance of rules more efficient. To test the proposal, the experimental part of the work used about 7.7 million real card transactions provided by a company in the card market. Since the classifier can make errors (false positives and false negatives), cost-sensitive analysis was applied so that most of these errors incur a lower cost. In addition, after a long analysis of the database, 141 features were combined, using the FP-Growth algorithm, to generate 38,003 rules which, after a filtering and selection process, were grouped into five rule sets, the largest of which has 1,285 rules. Each of the five sets was submitted to logistic regression modeling so that its rules were validated and weighted by statistical criteria. At the end of the process, the goodness-of-fit metrics revealed well-fitted models, and the classifiers' performance indicators showed, in general, very good classification power (AUC between 0.788 and 0.820). In conclusion, the combined application of the statistical techniques - cost-sensitive analysis, association rules, and logistic regression - proved conceptually and theoretically cohesive and coherent. Finally, the experiment and its results demonstrated the technical and practical feasibility of the proposal.
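An illustrative sketch of the mine-then-weight pipeline, assuming the mlxtend and scikit-learn libraries; the support and confidence thresholds are placeholders, and the thesis's actual filtering and selection process is far more elaborate:

```python
# Sketch: mine association rules with FP-Growth, then let logistic
# regression validate and weight them. Assumes `onehot_df` is a one-hot
# (boolean) encoding of transaction features.
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules
from sklearn.linear_model import LogisticRegression

def rules_to_classifier(onehot_df, fraud_labels, min_support=0.01):
    # 1. Mine frequent itemsets over the one-hot transaction features.
    itemsets = fpgrowth(onehot_df, min_support=min_support, use_colnames=True)
    rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
    # 2. One binary feature per mined rule: does the rule fire on the row?
    rule_features = pd.DataFrame({
        i: onehot_df[list(r)].all(axis=1)
        for i, r in enumerate(rules["antecedents"])
    })
    # 3. Logistic regression weights the rules by statistical criteria;
    #    class_weight="balanced" is a stand-in for the cost-sensitive step.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(rule_features, fraud_labels)
    return rules, clf
```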
15

Design and Analysis of Techniques for Multiple-Instance Learning in the Presence of Balanced and Skewed Class Distributions

Wang, Xiaoguang January 2015 (has links)
With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, the Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Existing knowledge discovery and data analysis techniques have shown great success in many real-world applications, such as applying Automatic Target Recognition (ATR) methods to detect targets of interest in imagery, drug activity prediction, and computer vision recognition. Among these techniques, Multiple-Instance (MI) learning is different from standard classification since it uses a set of bags containing many instances as input. The instances in each bag are not labeled; instead, the bags themselves are labeled. Much work and progress has been made in this area, but some problems remain uncovered. In this thesis, we focus on two topics of MI learning: (1) investigating the relationship between MI learning and other multiple pattern learning methods, including multi-view learning, data fusion methods, and multi-kernel SVM; and (2) dealing with the class imbalance problem of MI learning. For the first topic, three different learning frameworks are presented for general MI learning. The first uses multiple-view approaches to deal with the MI problem, the second is a data fusion framework, and the third, an extension of the first, uses multiple-kernel SVM. Experimental results show that the approaches presented work well on solving the MI problem. The second topic is concerned with the imbalanced MI problem. Here we investigate the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. For this problem, we propose three solution frameworks: a data re-sampling framework, a cost-sensitive boosting framework, and an adaptive instance-weighted boosting SVM (named IB_SVM) for MI learning. Experimental results, on both benchmark datasets and application datasets, show that the proposed frameworks are effective solutions for the imbalanced problem of MI learning.
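A minimal sketch of the multiple-instance setting described above, where labels attach to bags rather than instances. The classic "standard MI assumption" (a bag is positive iff it contains at least one positive instance) is used here purely as an illustration, not as the thesis's formulation:

```python
from typing import Callable
import numpy as np

def predict_bag(bag: np.ndarray,
                instance_scorer: Callable[[np.ndarray], np.ndarray],
                threshold: float = 0.5) -> int:
    """bag: (n_instances, n_features) array. Under the standard MI
    assumption, a bag is positive if any instance's score crosses
    the threshold."""
    scores = instance_scorer(bag)
    return int(scores.max() >= threshold)

# Usage with any per-instance scorer:
# labels = [predict_bag(b, scorer) for b in bags]
```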
16

Practical Cost-Conscious Active Learning for Data Annotation in Annotator-Initiated Environments

Haertel, Robbie A. 12 August 2013 (has links) (PDF)
Many projects exist whose purpose is to augment raw data with annotations that increase the usefulness of the data. The number of these projects is rapidly growing, and in the age of "big data" the amount of data to be annotated is likewise growing within each project. One common use of such data is in supervised machine learning, which requires labeled data to train a predictive model. Annotation is often a very expensive proposition, particularly for structured data. The purpose of this dissertation is to explore methods of reducing the cost of creating such data sets, including annotated text corpora. We focus on active learning to address the annotation problem. Active learning employs models trained using machine learning to identify instances in the data that are most informative and least costly. We introduce novel techniques for adapting vanilla active learning to situations wherein data instances are of varying benefit and cost, annotators request work "on-demand," and there are multiple, fallible annotators of differing levels of accuracy and cost. In order to account for data instances of varying cost, we build a model of cost from real annotation data based on a user study. We also introduce a novel cost-conscious active learning algorithm, which we call return-on-investment, that selects for annotation the instances with the most benefit per unit cost. To address the issue of annotators that request instances "on-demand," we develop a parallel, "no-wait" framework that performs computation while the annotator is annotating. As a result, annotators need not wait for the computer to determine the best instance for them to annotate, a common problem with existing approaches. Finally, we introduce a Bayesian model designed to simultaneously infer ground truth annotations from noisy annotations, infer each individual annotator's accuracy, and predict its own accuracy on unseen data, without the use of a held-out set. We extend ROI-based active learning and our annotation framework to handle multiple annotators using this model. As a whole, our work shows that the techniques introduced in this dissertation reduce the cost of annotation in scenarios that are more true-to-life than previous research.
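A hedged sketch of return-on-investment selection; the benefit and cost models here (prediction entropy and a learned cost estimator) are stand-ins for the dissertation's actual models:

```python
import numpy as np

def roi_select(pool_probs: np.ndarray, predicted_costs: np.ndarray) -> int:
    """pool_probs: (n, k) class probabilities for unlabeled instances.
    predicted_costs: (n,) predicted annotation cost per instance.
    Returns the index of the instance with the highest benefit per
    unit cost."""
    # Benefit proxy: entropy of the model's prediction (uncertainty).
    entropy = -(pool_probs * np.log(pool_probs + 1e-12)).sum(axis=1)
    return int(np.argmax(entropy / predicted_costs))
```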
17

Neonatal Sepsis Detection With Random Forest Classification for Heavily Imbalanced Data

Osman Abubaker, Ayman January 2022 (has links)
Neonatal sepsis is associated with most cases of mortality in the neonatal intensive care unit. Major challenges in detecting sepsis using suitable biomarkers have led people to look for alternative approaches in the form of Machine Learning techniques. In this project, Random Forest classification was performed on a sepsis data set provided by Karolinska Hospital. We particularly focused on tackling class imbalance in the data using sampling and cost-sensitive techniques. We compare the classification performance of Random Forests in six different setups: four using oversampling and undersampling techniques, one using cost-sensitive learning, and one basic Random Forest. The performance with the oversampling techniques was better and could identify more sepsis patients than the other setups. The overall performance was also good, making the methods potentially useful in practice. / Bachelor's thesis in electrical engineering, KTH, Stockholm, 2022
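A sketch of the two imbalance strategies compared above, assuming scikit-learn and imbalanced-learn; SMOTE and the hyperparameters are my own placeholders, since the abstract names the technique families but not the exact methods:

```python
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

# Cost-sensitive setup: weight the minority (sepsis) class more heavily.
cost_sensitive_rf = RandomForestClassifier(
    n_estimators=500, class_weight="balanced")

# Oversampling setup: synthesize minority-class samples before training.
def fit_oversampled(X, y):
    X_res, y_res = SMOTE().fit_resample(X, y)
    rf = RandomForestClassifier(n_estimators=500)
    rf.fit(X_res, y_res)
    return rf
```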
18

Classification of Transcribed Voice Recordings : Determining the Claim Type of Recordings Submitted by Swedish Insurance Clients / Klassificering av Transkriberade Röstinspelningar

Piehl, Carl January 2021 (has links)
In this thesis, we investigate the problem of building a text classifier for transcribed voice recordings submitted by insurance clients. We compare different models in the context of two tasks. The first is a binary classification problem, where the models are tasked with determining whether a transcript belongs to a particular type or not. The second is a multiclass problem, where the models have to choose between several types when labelling transcripts, resulting in a data set with a highly imbalanced class distribution. We evaluate four different models: pretrained BERT and three LSTMs with different word embeddings. The word embeddings used are ELMo, word2vec, and a baseline with a randomly initialized embedding layer. In the binary task, we are more concerned with false positives than false negatives. Thus, we use weighted cross entropy loss to achieve high precision for the positive class, while sacrificing recall. In the multiclass task, we use focal loss and weighted cross entropy loss to reduce bias toward majority classes. We find that BERT outperforms the other models and that the baseline model is worst across both tasks. The difference in performance is greatest in the multiclass task on classes with fewer samples. This demonstrates the benefit of using large language models in data-constrained scenarios. In the binary task, we find that weighted cross entropy loss provides a simple, yet effective, framework for conditioning the model to favor certain types of errors. In the multiclass task, both focal loss and weighted cross entropy loss are shown to reduce bias toward majority classes. However, we also find that BERT fine-tuned with regular cross entropy loss does not show bias toward majority classes, having high recall across all classes.
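A sketch of the two loss functions discussed, in PyTorch; the class weights and gamma below are placeholders, not the thesis's values:

```python
import torch
import torch.nn.functional as F

# Weighted cross entropy: upweight rare classes or costly error types.
class_weights = torch.tensor([1.0, 4.0, 8.0])  # placeholder weights

def weighted_ce(logits, targets):
    return F.cross_entropy(logits, targets, weight=class_weights)

def focal_loss(logits, targets, gamma=2.0):
    """Down-weights easy, well-classified examples so training focuses
    on hard (often minority-class) ones."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-((1 - pt) ** gamma) * log_pt).mean()
```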
19

Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

Razzaghi, Talayeh 01 January 2014 (has links)
Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real-world data are frequently affected by outliers, uncertain labels, and uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning method (CSL) to deal with the classification of imperfect data. Typically, most traditional approaches for classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with the Support Vector Machine, a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures to tackle imperfect data, along with addressing real problems in quality control and business analytics.
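A minimal sketch of cost-sensitive SVM learning in this spirit, using scikit-learn; per-class misclassification costs enter through class_weight, though the dissertation's exact formulation may differ:

```python
from sklearn.svm import SVC

# Misclassifying class 1 (e.g., the rare class) costs 10x more here;
# the 10.0 is an illustrative placeholder.
clf = SVC(kernel="rbf", class_weight={0: 1.0, 1: 10.0})
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```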
20

Enhancing supervised learning with complex aggregate features and context sensitivity / Amélioration de l'apprentissage supervisé par l'utilisation d'agrégats complexes et la prise en compte du contexte

Charnay, Clément 30 June 2016 (has links)
In this thesis, we study model adaptation in supervised learning. Firstly, we adapt existing learning algorithms to the relational representation of data. Secondly, we adapt learned prediction models to context change. In the relational setting, data is modeled by multiple entities linked by relationships. We handle these relationships using complex aggregate features. We propose stochastic optimization heuristics to include complex aggregates in relational decision trees and Random Forests, and assess their predictive performance on real-world datasets. We adapt prediction models to two kinds of context change. Firstly, we propose an algorithm to tune thresholds on pairwise scoring models to adapt to a change of misclassification costs. Secondly, we reframe numerical attributes with affine transformations to adapt to a change of attribute distribution between a learning and a deployment context. Finally, we extend these transformations to complex aggregates.
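A sketch of attribute reframing with an affine transformation, under my own simplifying assumption that matching the first two moments of the learning and deployment distributions is the goal; the thesis's reframing procedure may be more general:

```python
import numpy as np

def affine_reframe(x_deploy: np.ndarray,
                   learn_mean: float, learn_std: float) -> np.ndarray:
    """Map a deployment-context attribute onto the scale the model was
    trained on: standardize in the deployment context, then re-express
    it in the learning context's location and scale."""
    a = learn_std / x_deploy.std()
    b = learn_mean - a * x_deploy.mean()
    return a * x_deploy + b
```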
