Global ETD Search

1	Text Document Categorization by Machine Learning Sendur, Zeynel 01 January 2008 Because of the explosion of digital and online text information, automatic organization of documents has become a very important research area. There are two main machine learning approaches to organizing digital documents. One is the supervised approach, where pre-defined category labels are assigned to documents based on the likelihood suggested by a training set of labeled documents; the other is the unsupervised approach, which needs no human intervention or labeled documents at any point in the process. In this thesis, we concentrate on the supervised learning task of document classification. One of the most important tasks of information retrieval is to induce classifiers capable of categorizing text documents. The same document can belong to two or more categories, a situation referred to as multi-label classification. Multi-label classification domains have been encountered in diverse fields. Most existing machine learning techniques for multi-label classification domains are extremely expensive because the documents are characterized by an extremely large number of features. In this thesis, we try to reduce these computational costs by applying different types of algorithms to documents characterized by a large number of features. We also aim to achieve the highest possible accuracy while keeping high computational performance in text document categorization.
2	Evaluating loss minimization in multi-label classification via stochastic simulation using beta distribution MELLO, L. H. S. 20 May 2016 The objective of this work is to present the effectiveness and efficiency of algorithms for solving the loss minimization problem in Multi-Label Classification (MLC). We first prove that a specific case of loss minimization in MLC is NP-complete for the loss functions Coverage and Search Length, and therefore no efficient algorithm for solving such problems exists unless P=NP. Furthermore, we show a novel approach for evaluating multi-label algorithms that has the advantage of not being limited to some chosen base learners, such as K-Nearest Neighbor and Support Vector Machine, by simulating the distribution of labels according to multiple Beta distributions. multi-label classification loss minimization data mining
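The simulation idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's procedure: the function names, the Beta parameters `a` and `b`, and the particular definition of coverage (the rank of the lowest-ranked relevant label) are assumptions made here for illustration. Scores for relevant labels are drawn from a Beta distribution skewed toward 1 and scores for irrelevant labels from one skewed toward 0, so no base learner is involved.

```python
import random

def simulate_scores(relevant, a=2.0, b=5.0, rng=None):
    """Draw one score per label (hypothetical parameters a and b).

    Relevant labels use Beta(b, a), which is skewed high when b > a;
    irrelevant labels use Beta(a, b), which is skewed low.
    """
    rng = rng or random.Random()
    return [rng.betavariate(b, a) if r else rng.betavariate(a, b)
            for r in relevant]

def coverage(relevant, scores):
    """Rank (1-based) of the lowest-ranked relevant label, or 0 if none."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [rank for rank, i in enumerate(order, start=1) if relevant[i]]
    return max(ranks) if ranks else 0

# Estimate the expected coverage for one label set by repeated simulation.
rng = random.Random(0)
relevant = [True, False, True, False, False]
losses = [coverage(relevant, simulate_scores(relevant, rng=rng))
          for _ in range(1000)]
mean_coverage = sum(losses) / len(losses)
```

Because the scores come from a parametric distribution, the separation between relevant and irrelevant labels can be tuned directly through `a` and `b` instead of depending on a trained classifier.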
3	Meta-aprendizado para análise de desempenho de métodos de classificação multi-label / Meta-learning for performance analysis of multi-label classification methods PINTO, Eduardo Ribas 31 January 2009 In recent years, many applications have emerged that use supervised machine learning algorithms to solve classification problems across various domains. However, many of these applications are restricted to single-label algorithms, that is, algorithms that assign only one class to a given instance. Such applications become inadequate when that same instance, in the real world, belongs to more than one class simultaneously. This problem is known in the literature as the multi-label classification problem. There is currently a variety of strategies for solving multi-label problems. Some of them belong to a group called problem transformation methods. The name comes from the fact that this type of strategy seeks to divide a multi-label classification problem into several single-label problems in order to reduce its complexity. Others seek to handle multi-label data sets directly and are known as algorithm adaptation methods. Because of the large number of existing multi-label methods, it is quite difficult to choose the most suitable one for a given domain. In view of this, the present dissertation pursued two objectives: a comparative study of problem transformation methods widely used today, and the application of two meta-learning strategies to multi-label classification in order to predict, based on the descriptive characteristics of a data set, which algorithm is most likely to perform better than the others.
The meta-learning approaches used in our work were derived from correlation analysis and rule mining techniques. The use of meta-learning in the context of multi-label classification is original and gave satisfactory results in our experiments, which suggests that it can serve as an initial guide for the development of related future research. Multi-Label Classification Meta-Learning Machine Learning
4	Apprentissage Ensembliste, Étude comparative et Améliorations via Sélection Dynamique / Ensemble Learning, Comparative Analysis and Further Improvements with Dynamic Ensemble Selection Narassiguin, Anil 04 May 2018 Ensemble methods have been a very popular research topic during the last decade. Their success arises largely from the fact that they offer an appealing solution to several interesting learning problems, such as improving prediction accuracy, feature selection, metric learning, scaling inductive algorithms to large databases, learning from multiple physically distributed data sets, and learning from concept-drifting data streams. In this thesis, we first present an extensive empirical comparison of nineteen prototypical supervised ensemble learning algorithms proposed in the literature, on various benchmark data sets.
We not only compare their performance in terms of standard performance metrics (Accuracy, AUC, RMS) but also analyze their kappa-error diagrams, calibration and bias-variance properties. We then address the problem of improving the performance of ensemble learning approaches with dynamic ensemble selection (DES). Dynamic pruning is the problem of finding, given an input x, a subset of models in the ensemble that achieves the best possible prediction accuracy. The idea behind DES approaches is that different models have different areas of expertise in the instance space. Most methods proposed for this purpose estimate the individual relevance of the base classifiers within a local region of competence, usually given by the nearest neighbours in Euclidean space. We propose and discuss two novel DES approaches. The first, called ST-DES, is designed for decision-tree-based ensemble models. This method prunes the trees using an internal supervised tree-based metric; it is motivated by the fact that in high-dimensional data sets, usual metrics such as Euclidean distance suffer from the curse of dimensionality. The second approach, called PCC-DES, formulates the DES problem as a multi-label learning task with a specific loss function. Labels correspond to the base classifiers, and multi-label training examples are formed based on the ability of each classifier to correctly classify each original training example. This allows us to take advantage of recent advances in the area of multi-label learning. PCC-DES works on both homogeneous and heterogeneous ensembles. Its advantage is that it explicitly captures the dependencies between the classifiers' predictions. These algorithms are tested on a variety of benchmark data sets, and the results demonstrate their effectiveness against competitive state-of-the-art alternatives. Ensemble learning Dynamic ensemble selection Multi-label 004
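The way the abstract above describes forming multi-label training examples for DES (one label per base classifier, set when that classifier classifies the example correctly) can be sketched as follows. This covers only the construction of the label matrix; the function name and the toy classifiers are hypothetical, and the loss function and learner that PCC-DES actually uses are not reproduced here.

```python
def build_meta_labels(classifiers, X, y):
    """Return a matrix where entry [i][j] is 1 if classifier j predicts
    the true class of example i, and 0 otherwise."""
    return [[1 if clf(x) == target else 0 for clf in classifiers]
            for x, target in zip(X, y)]

# Three toy base classifiers (hypothetical), each a function x -> class.
classifiers = [
    lambda x: int(x[0] > 0.5),   # thresholds the first feature
    lambda x: int(x[1] > 0.5),   # thresholds the second feature
    lambda x: 1,                 # always predicts class 1
]
X = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.3]]
y = [1, 1, 0]

meta = build_meta_labels(classifiers, X, y)
# Each row of `meta` is the multi-label target for the matching row of X.
```

A multi-label learner trained on `(X, meta)` would then predict, for a new input, which base classifiers are likely to be correct.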
5	A Common Misconception in Multi-Label Learning Brodie, Michael Benjamin 01 November 2016 The majority of current multi-label classification research focuses on learning dependency structures among output labels. This paper provides a novel theoretical view on the purported assumption that effective multi-label classification models must exploit output dependencies. We submit that the flurry of recent dependency-exploiting, multi-label algorithms may stem from the deficiencies in existing datasets, rather than an inherent need to better model dependencies. We introduce a novel categorization of multi-label metrics, namely, evenly and unevenly weighted label metrics. We explore specific features that predispose datasets to improved classification by methods that model label dependence. Additionally, we provide an empirical analysis of 15 benchmark datasets, 1 real-life dataset, and a variety of synthetic datasets. We assert that binary relevance (BR) yields similar, if not better, results than dependency-exploiting models for metrics with evenly weighted label contributions. We qualify this claim with discussions on specific characteristics of datasets and models that render negligible the differences between BR and dependency-learning models. binary relevance multi-label classification multi-dimensional classification Computer Sciences
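Binary relevance, the baseline named in the abstract above, trains one independent binary classifier per label. A minimal sketch of the data transformation step is given below; the function name and the toy data are illustrative assumptions, and the choice of binary learner is left open.

```python
def binary_relevance_split(X, Y, labels):
    """Split a multi-label data set into one binary problem per label.

    X is a list of feature vectors, Y a list of label sets. The result maps
    each label to (X, y), where y[i] is 1 if the label is in Y[i], else 0.
    """
    return {lab: (X, [1 if lab in ys else 0 for ys in Y]) for lab in labels}

X = [[0.1, 1.0], [0.7, 0.2], [0.5, 0.5]]
Y = [{"sports"}, {"politics", "economy"}, {"sports", "economy"}]
problems = binary_relevance_split(X, Y, ["sports", "politics", "economy"])
# problems["sports"][1] holds the 0/1 targets for the "sports" classifier.
```

Each binary problem is solved without reference to the others, which is exactly why the method ignores dependencies between labels.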
6	Induction in Hierarchical Multi-label Domains with Focus on Text Categorization Dendamrongvit, Sareewan 02 May 2011 Induction of classifiers from sets of preclassified training examples is one of the most popular machine learning tasks. This dissertation focuses on the techniques needed in the field of automated text categorization. Here, each document can be labeled with more than one class, sometimes with many classes. Moreover, the classes are hierarchically organized, the mutual relations being typically expressed in terms of a generalization tree. Both aspects (multi-label classification and hierarchically organized classes) have so far received inadequate attention. The existing literature largely assumes that it is enough to induce a separate binary classifier for each class, and the question of class hierarchy is rarely addressed. This, however, ignores some serious problems. For one thing, induction of thousands of classifiers from hundreds of thousands of examples described by tens of thousands of features (a common case in automated text categorization) incurs prohibitive computational costs; even a single binary classifier in domains of this kind often takes hours, even days, to induce. For another, the circumstance that the classes are hierarchically organized affects the way we view the classification performance of the induced classifiers. The presented work proposes a technique referred to by the acronym "H-kNN-plus." The technique combines support vector machines and nearest neighbor classifiers with the intention to capitalize on the strengths of both. As for performance evaluation, a variety of measures have been used to evaluate hierarchical classifiers, including the standard non-hierarchical criteria that assign the same weight to different types of error. The author proposes a performance measure that overcomes some of their weaknesses. The dissertation begins with a study of (non-hierarchical) multi-label classification.
One of the reasons for the poor performance of earlier techniques is the class-imbalance problem: a small number of positive examples is outnumbered by a great many negative examples. Another difficulty is that each of the classes tends to be characterized by a different set of characteristic features. This means that most of the binary classifiers are induced from examples described by predominantly irrelevant features. Addressing these weaknesses by majority-class undersampling and feature selection, the proposed technique significantly improves the overall classification performance. Even more challenging is the issue of hierarchical classification. Here, the dissertation introduces a new induction mechanism, H-kNN-plus, and subjects it to extensive experiments with two real-world datasets. The results indicate its superiority, in these domains, over earlier work in terms of prediction performance as well as computational costs. Induction Text categorization Hierarchical classification Multi-label examples Imbalanced classes
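Majority-class undersampling, mentioned in the abstract above as a remedy for class imbalance, can be sketched generically as follows. This is random undersampling under the assumption that the negative class is the majority; the function name, the `ratio` parameter and the fixed seed are illustrative choices, not details taken from the dissertation.

```python
import random

def undersample_majority(examples, labels, ratio=1.0, seed=0):
    """Keep every positive example and a random subset of negatives.

    At most ratio * (number of positives) negatives are kept. Assumes
    binary 0/1 labels with the negatives forming the majority class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = min(len(neg), int(ratio * len(pos)))
    kept = sorted(pos + rng.sample(neg, keep_neg))  # preserve original order
    return [examples[i] for i in kept], [labels[i] for i in kept]

# 3 positives against 97 negatives, reduced to a balanced 3 + 3 sample.
X = list(range(100))
y = [1] * 3 + [0] * 97
X_small, y_small = undersample_majority(X, y)
```

In a binary-relevance setting this would be applied separately to the training set of each per-class classifier, since the imbalance differs from class to class.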
7	Effective Gene Expression Annotation Approaches for Mouse Brain Images January 2016 Understanding the complexity of the temporal and spatial characteristics of gene expression over brain development is one of the crucial research topics in neuroscience. An accurate description of the locations and expression status of the relevant genes requires extensive experimental resources. The Allen Developing Mouse Brain Atlas provides a large number of in situ hybridization (ISH) images of gene expression over seven different mouse brain developmental stages. Studying mouse brain models helps us understand gene expression in human brains. This atlas covers thousands of genes, which are currently annotated manually by biologists. Due to the high labor cost of manual annotation, an efficient approach to automated gene expression annotation on mouse brain images has become necessary. In this thesis, a novel, efficient approach based on a machine learning framework is proposed. Features are extracted from raw brain images, and both binary classification and multi-class classification models are built with supervised learning methods. To generate features, one of the most widely adopted methods in current research is the bag-of-words (BoW) algorithm. However, neither the efficiency nor the accuracy of BoW is outstanding when dealing with large-scale data. Thus, an augmented sparse coding method called Stochastic Coordinate Coding is adopted to generate high-level features in this thesis. In addition, a new multi-label classification model is proposed in this thesis. The label hierarchy is built based on the given brain ontology structure. Experiments have been conducted on the atlas, and the results show that this approach is efficient and classifies the images with relatively high accuracy.
Masters Thesis, Computer Science, 2016. Computer science Gene Expression Image Annotation Multi-label Sparse Coding
8	Emergency Medical Service EMR-Driven Concept Extraction From Narrative Text George, Susanna Serene 08 1900 Indiana University-Purdue University Indianapolis (IUPUI) In the midst of a pandemic, with cases ranging from patients whose minor symptoms quickly become fatal to patients with conditions such as a STEMI heart attack or a fatal accident injury, medical research that improves the speed and efficiency of patient care has become more important. As computing researchers work to automate technology that assists first responders, the priorities have been to reduce the cognitive load on the field crew, shorten the time taken to document each patient case, and improve the accuracy of report details. This paper presents an information extraction algorithm that customizes existing natural language processing techniques, combining MetaMap with a syntactic dependency parser such as spaCy to analyze sentence structure and regular expressions to match recurring patterns, in order to retrieve patient-specific information from medical narratives. These concept-value pairs automatically populate the fields of an EMR form, which can be reviewed and modified manually if needed. The report can then be reused for various medical and billing purposes related to the patient. Concept extraction Natural Language Processing EMR-driven Multi-label classification
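The regular-expression part of the pipeline described above (matching recurring patterns to produce concept-value pairs) can be illustrated with a small sketch. The concept names, the patterns and the sample narrative are hypothetical and are not taken from the thesis; the MetaMap and spaCy components are not shown.

```python
import re

# Hypothetical patterns for two recurring vital-sign phrasings.
PATTERNS = {
    "blood_pressure": re.compile(r"\bBP\s*(?:of\s*)?(\d{2,3}/\d{2,3})\b",
                                 re.IGNORECASE),
    "pulse": re.compile(r"\bpulse\s*(?:of\s*)?(\d{2,3})\b", re.IGNORECASE),
}

def extract_concepts(narrative):
    """Return a dict of concept -> value for every pattern that matches."""
    found = {}
    for concept, pattern in PATTERNS.items():
        match = pattern.search(narrative)
        if match:
            found[concept] = match.group(1)
    return found

fields = extract_concepts("Pt alert and oriented, BP 120/80, pulse of 88.")
# `fields` could then pre-fill the matching fields of a form for review.
```

Concepts that do not appear in the narrative are simply absent from the result, which leaves the corresponding form fields empty for manual entry.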
9	To Encourage or to Restrict: the Label Dependency in Multi-Label Learning Yang, Zhuo 06 1900 Multi-label learning addresses the problem that one instance can be associated with multiple labels simultaneously. Understanding and exploiting the Label Dependency (LD) is well accepted as the key to building high-performance multi-label classifiers, i.e., classifiers with abilities that include, but are not limited to, generalizing well on clean data and being robust under evasion attack. From the perspective of generalization on clean data, previous works have proved the advantage of exploiting LD in multi-label classification. To further verify the positive role of LD in multi-label classification and address previous limitations, we propose an approach named Prototypical Networks for Multi-Label Learning (PNML). Specifically, PNML addresses multi-label classification by estimating the positive and negative class distribution of each label in a shared nonlinear embedding space. PNML achieves State-Of-The-Art (SOTA) classification performance on clean data. From the perspective of robustness under evasion attack, we are the first to define the attackability of a multi-label classifier as the expected maximum number of decision outputs flipped by injecting budgeted perturbations into the feature distribution of the data. We denote the attackability of a multi-label classifier as C∗; the empirical evaluation of C∗ is an NP-hard problem. We thus develop a method named Greedy Attack Space Exploration (GASE) to estimate C∗ efficiently. More interestingly, we derive an information-theoretic upper bound for the adversarial risk faced by multi-label classifiers. The bound unveils the key factors determining the attackability of multi-label classifiers and points out the negative role of LD in multi-label classifiers' adversarial robustness, i.e., LD helps the transfer of attack across labels, which makes multi-label classifiers more attackable.
Going one step further, inspired by the derived bound, we propose a Soft Attackability Estimator (SAE) and further develop Adversarial Robust Multi-label learning with regularized SAE (ARM-SAE) to improve the adversarial robustness of multi-label classifiers. This work gives a more comprehensive understanding of LD in multi-label learning. Exploiting LD should be encouraged because of its positive role in models' generalization on clean data, but restricted because of its negative role in models' adversarial robustness. multi-label learning label dependency adversarial evasion attackability
10	Improving Multi-label Classification by Avoiding Implicit Negativity with Incomplete Data Heath, Derrall L. 11 October 2011 Many real-world problems require multi-label classification, in which each training instance is associated with a set of labels. There are many existing learning algorithms for multi-label classification; however, these algorithms assume implicit negativity, where missing labels in the training data are automatically assumed to be negative. Additionally, many of the existing algorithms do not handle incremental learning, in which new labels could be encountered later in the learning process. A novel multi-label adaptation of the backpropagation algorithm is proposed that does not assume implicit negativity. In addition, this algorithm can, using a naive Bayesian approach, infer missing labels in the training data. This algorithm can also be trained incrementally, as it dynamically considers new labels. This solution is compared with existing multi-label algorithms using data sets from multiple domains, and the performance is measured with standard multi-label evaluation metrics. It is shown that our algorithm improves classification performance for all metrics by an overall average of 7.4% when at least 40% of the labels are missing from the training data, and improves by 18.4% when at least 90% of the labels are missing. implicit negativity multi-label classification thesis Computer Sciences
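The difference between assuming implicit negativity and leaving missing labels out of the training signal can be shown with a small loss computation. This is a generic illustration and not the backpropagation adaptation proposed in the thesis: the function name is hypothetical, missing labels are encoded here as `None`, and binary cross-entropy is an assumed choice of loss.

```python
import math

def masked_bce(probs, targets):
    """Mean binary cross-entropy over known labels only.

    probs are predicted probabilities strictly between 0 and 1;
    targets are 1, 0, or None, where None marks a missing label.
    """
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for p, t in zip(probs, targets) if t is not None]
    return sum(terms) / len(terms) if terms else 0.0

probs = [0.9, 0.2, 0.5]
targets = [1, None, 0]          # the second label is unknown

# Missing label ignored: only the first and third labels contribute.
known_only = masked_bce(probs, targets)

# Implicit negativity: the missing label is treated as a negative.
implicit_negative = masked_bce(probs, [0 if t is None else t for t in targets])
```

Under implicit negativity the unknown label pushes its predicted probability toward 0, whereas the masked version leaves that output unconstrained by this example.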

Search results