Global ETD Search

21	Classificação com algoritmo AdaBoost.M1: o mito do limiar de erro de treinamento [Classification with the AdaBoost.M1 algorithm: the myth of the training error threshold] Leões Neto, Antônio do Nascimento 20 November 2017
The accelerated growth of data repositories across different fields of activity opens space for research in data mining, in particular on classification methods and on the combination of classifiers. Boosting is one such method: it combines the results of several classifiers in order to obtain better results. The main purpose of this dissertation is to experiment with alternatives for increasing the effectiveness and performance of AdaBoost.M1, the algorithm often employed to implement Boosting. An empirical study was performed, taking stochastic aspects into account, to shed some light on an obscure internal parameter: the algorithm's creators and other researchers assumed that the training error threshold should be correlated with the number of classes in the target data set and that, logically, most data sets should use a value of 0.5. We present empirical evidence that this is not a fact but probably a myth, originating from the mistaken application of the theoretical assumption of the joint effect. To this end, adaptations of the algorithm were proposed, focusing on finding a better way to define this threshold in the general case.
Data mining; Classification; Combination of classifiers; Boosting; AdaBoost.M1; Ensemble methods
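The role of the training error threshold discussed in this abstract can be illustrated with a minimal sketch of AdaBoost.M1. This is not the dissertation's code: the one-feature decision stump, the function names and the `error_threshold` parameter are assumptions made for illustration, and the handling of a zero training error (clamping) is one common convention among several.

```python
import math
from collections import defaultdict


def train_stump(xs, ys, weights):
    """Fit a weighted decision stump on one numeric feature.

    Returns (threshold, left_label, right_label)."""
    labels = sorted(set(ys))
    best = None
    for thr in sorted(set(xs)):
        # Accumulate class weights on each side of the candidate split.
        sides = (defaultdict(float), defaultdict(float))
        for x, y, w in zip(xs, ys, weights):
            sides[x > thr][y] += w
        left = max(labels, key=lambda c: sides[0][c])
        right = max(labels, key=lambda c: sides[1][c])
        err = sum(w for x, y, w in zip(xs, ys, weights)
                  if (right if x > thr else left) != y)
        if best is None or err < best[0]:
            best = (err, thr, left, right)
    return best[1:]


def stump_predict(stump, x):
    thr, left, right = stump
    return right if x > thr else left


def adaboost_m1(xs, ys, rounds=10, error_threshold=0.5):
    """AdaBoost.M1 sketch; `error_threshold` is the parameter under study."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (stump, vote weight)
    for _ in range(rounds):
        stump = train_stump(xs, ys, weights)
        wrong = [stump_predict(stump, x) != y for x, y in zip(xs, ys)]
        eps = sum(w for w, bad in zip(weights, wrong) if bad)
        if eps >= error_threshold:
            break  # weak learner judged too weak: stop boosting
        eps = max(eps, 1e-10)  # assumption: clamp to avoid division by zero
        beta = eps / (1.0 - eps)
        ensemble.append((stump, math.log(1.0 / beta)))
        # Down-weight correctly classified examples, then renormalise.
        weights = [w if bad else w * beta for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote over the stumps; requires a non-empty ensemble."""
    votes = defaultdict(float)
    for stump, alpha in ensemble:
        votes[stump_predict(stump, x)] += alpha
    return max(votes, key=votes.get)
```

Passing a different `error_threshold` changes only the stopping test, which makes it easy to compare the conventional value of 0.5 with alternatives on a given data set.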
22	Resource Efficient Representation of Machine Learning Models: investigating optimization options for decision trees in embedded systems / Resurseffektiv Representation av Maskininlärningsmodeller Lundberg, Jacob January 2019
Combining embedded systems and machine learning models is an exciting prospect. However, to target any embedded system, including those with the most stringent resource requirements, the models have to be designed with care so as not to overwhelm it. This thesis targets decision tree ensembles. A benchmark model is created with LightGBM, a popular framework for gradient boosted decision trees. This model is first transformed and regularized with RuleFit, a LASSO regression framework. It is then further optimized with quantization and weight sharing, techniques used when compressing neural networks. The entire process is combined into a novel framework called ESRule. The data come from the domain of frequency measurements in cellular networks, where there is a clear use case for embedded systems running the resulting resource-optimized models. Compared with LightGBM, ESRule uses 72× less internal memory on average while simultaneously increasing predictive performance. The models use 4 kilobytes on average. The serialized variant of ESRule uses 104× less hard disk space than LightGBM. ESRule is also clearly faster at predicting a single sample.
machine learning; rule fit; decision trees; embedded systems; resources; ensemble methods; lasso regression; optimization; Computer Engineering
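As a rough illustration of the quantization and weight sharing mentioned above, the following sketch maps a list of model weights onto a small shared codebook, so that each weight is stored as a short index instead of a full floating-point value. It is a generic uniform quantizer written for this listing, not ESRule's actual procedure; the function names and the `n_levels` parameter are hypothetical.

```python
def quantize_shared(weights, n_levels=4):
    """Uniformly quantize `weights` to `n_levels` shared values.

    Returns (codebook, indices): every weight is replaced by an index
    into the codebook, so many weights share the same stored value."""
    lo, hi = min(weights), max(weights)
    if hi == lo or n_levels < 2:
        return [lo], [0] * len(weights)
    step = (hi - lo) / (n_levels - 1)
    codebook = [lo + i * step for i in range(n_levels)]
    indices = [round((w - lo) / step) for w in weights]
    return codebook, indices


def dequantize(codebook, indices):
    """Recover the approximate weights from the shared codebook."""
    return [codebook[i] for i in indices]
```

With `n_levels=4` each index fits in two bits, which is where the memory saving comes from; the price is the rounding error introduced by `dequantize`.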
23	Supervised Learning for Sequential and Uncertain Decision Making Problems - Application to Short-Term Electric Power Generation Scheduling Cornélusse, Bertrand 21 December 2010
Our work is driven by a class of practical problems of sequential decision making in the context of electric power generation under uncertainty. These problems are usually treated as receding-horizon deterministic optimization problems and/or as scenario-based stochastic programs. Stochastic programming makes it possible to compute a first-stage decision that is hedged against the possible futures and, if a possibility of recourse exists, this decision can then be particularized to possible future scenarios thanks to the information gathered until the recourse opportunity. Although many decomposition techniques exist, stochastic programming is currently not tractable in the context of day-ahead electric power generation, and furthermore it does not provide an explicit recourse strategy. The latter observation also makes this approach cumbersome when one wants to evaluate its value on independent scenarios. We propose a supervised learning methodology to learn an explicit recourse strategy for a given generation schedule from optimal adjustments of the system under simulated perturbed conditions. This methodology may thus be complementary to an approach based on stochastic programming. Compared with receding-horizon optimization, it has the advantage of moving the heavy computation offline while providing the ability to quickly infer decisions during online operation of the generation system. Furthermore, the learned strategy can be validated offline on an independent set of scenarios.
On a realistic instance of the intra-day electricity generation rescheduling problem, we explain how to generate disturbance scenarios, how to compute adjusted schedules, how to formulate the supervised learning problem to obtain a recourse strategy, how to restore feasibility of the predicted adjustments and how to evaluate the recourse strategy on independent scenarios. We analyze different settings: either predicting the detailed adjustment of all the generation units, or predicting more qualitative variables that speed up the adjustment computation by simplifying the "classical" optimization problem. Our approach is intrinsically scalable to large-scale generation management problems and may in principle handle all kinds of uncertainties and practical constraints. Our results show the feasibility of the approach and are also promising in terms of the economic efficiency of the resulting strategies. The solutions of the generation (re)scheduling optimization problem must satisfy many constraints. However, a classical learning algorithm that is by nature unaware of the constraints the data are subject to may still successfully capture the sensitivity of the solution to the model parameters. This has nevertheless drawn our attention to one particular aspect of the relation between machine learning algorithms and optimization algorithms. When we apply a supervised learning algorithm to search a hypothesis space based on data that satisfy a known set of constraints, can we guarantee that the hypothesis we select will make predictions that satisfy the constraints? Can we at least use our knowledge of the constraints to eliminate some hypotheses while learning, and thus hope that the selected hypothesis has a better generalization error?
In the second part of this thesis, where we try to answer these questions, we propose a generic extension of tree-based ensemble methods that can incorporate incomplete data as well as prior knowledge about the problem. The framework is based on a convex optimization problem that regularizes a tree-based ensemble model by adjusting the labels attached to the leaves of an ensemble of regression trees, the outputs of the observations of the training sample, or both. It can incorporate weak additional information in the form of partial information about output labels (as in censored data or semi-supervised learning) or, more generally, cope with observations of varying degrees of precision, as well as strong priors in the form of structural knowledge about the sought model. In addition to improving precision by exploiting information that classical supervised learning algorithms cannot use, the proposed approach may be used to produce models that naturally comply with the feasibility constraints that must be satisfied in many practical decision making problems, especially where the output space is high-dimensional and/or structured by invariances, symmetries and other kinds of constraints.
Simulation; Unit commitment; Machine Learning; Electricity generation scheduling; Optimization
24	Εξόρυξη γνώσης από ιατροβιολογικά δεδομένα / Biomedical data mining Καλλά, Μαρία-Παυλίνα 28 February 2013
Behind all the data that exist lies a huge treasure of knowledge that we cannot perceive, because the form of the information does not allow it. Methods and techniques were therefore developed that help us find this hidden knowledge and use it, mainly for the benefit of the public. The best known of these, and the one studied here, is Data Mining. This work discusses the use of Data Mining methods on biomedical data. We begin with an overview of Molecular Biology and Bioinformatics. We then turn to knowledge discovery in databases, and examine Data Mining in detail, with an emphasis on classification methods. Finally, we apply the algorithms to biomedical data and present the resulting conclusions along with future extensions.
610.285; Bioinformatics; Data mining; Classification algorithms; Biological databases; Ensemble methods
25	Integrative Analyses of Diverse Biological Data Sources January 2011
The technology expansion seen in genomics research over the last decade has permitted the generation of large-scale data sources pertaining to molecular biological assays, genomics, proteomics, transcriptomics and other modern omics catalogs. New methods to analyze, integrate and visualize these data types are essential to unveil relevant disease mechanisms. Towards these objectives, this research focuses on data integration in two scenarios: (1) transcriptomic, proteomic and functional information and (2) real-time sensor-based measurements motivated by single-cell technology. To assess relationships between protein abundance, transcriptomic and functional data, a nonlinear model was explored at static and temporal levels. The successful integration of these heterogeneous data sources through the stochastic gradient boosted tree approach, and its improved predictability, are among the highlights of this work. The lack of a priori knowledge for undetected proteins was overcome through the development of an innovative validation subroutine based on a permutation approach and the use of external information (i.e., operons). The integrative methodologies allowed the identification of undetected proteins for Desulfovibrio vulgaris and Shewanella oneidensis for further biological exploration in laboratories towards finding functional relationships. In an effort to better understand diseases such as cancer at different developmental stages, the Microscale Life Science Center headquartered at Arizona State University is pursuing single-cell studies by developing novel technologies. This research arranged and applied a statistical framework that tackled the following challenges: random noise, heterogeneous dynamic systems with multiple states, and understanding cell behavior within and across different Barrett's esophageal epithelial cell lines using oxygen consumption curves.
These curves were characterized with good empirical fit using nonlinear models with simple structures, which allowed the extraction of a large number of features. Applying a supervised classification model to these features and integrating experimental factors allowed the identification of subtle patterns among different cell types, visualized through multidimensional scaling. Motivated by the challenges of analyzing real-time measurements, we further explored a two-dimensional representation of multiple time series using a wavelet approach, which showed promising results towards less complex approximations. The benefits of external information for improving the image representation were also explored.
Ph.D. dissertation, Industrial Engineering, 2011
Bioinformatics; Biostatistics; Data integration; Data mining; Ensemble methods; Genomics; Multiple time series; Single-cell studies
26	Uma abordagem baseada em Perceptrons balanceados para geração de ensembles e redução do espaço de versões [An approach based on balanced Perceptrons for ensemble generation and version space reduction] Enes, Karen Braga 08 January 2016
Funding: CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Recently, ensemble learning has received much attention in the machine learning community, since it has proved to be an effective alternative for building more accurate predictors with better generalization. The improvement in the generalization performance of an ensemble is directly related to the diversity and accuracy of its individual classifiers. This work presents two main contributions: an ensemble method built by combining Balanced Perceptrons, and a method for generating a hypothesis equivalent to the majority vote of an ensemble. For the ensemble method, the components are selected using diversity measures, including a newly introduced dissimilarity measure. Two strategies are applied to combine the decisions of the individual classifiers: unweighted majority vote and the average of all components. Under the majority vote strategy, each unseen sample must be evaluated against all the generated hypotheses. The method for generating the equivalent hypothesis is applied to reduce the cost of this test phase. The hypothesis is obtained by an iterative strategy of successive reductions of the version space. An experimental study was conducted to evaluate the proposed methods. The reported results show that the proposed methods outperform, in most cases, other classifiers tested, such as SVM and AdaBoost. The results for the version space reduction method show that the generated hypothesis is in fact equivalent to the majority vote of an ensemble of Balanced Perceptrons.
Perceptron; Binary Classification; Ensemble Methods; Version Space
27	[pt] APRENDIZADO EM DOIS ESTÁGIOS PARA MÉTODOS DE COMITÉ DE ÁRVORES DE DECISÃO / [en] TWO-STAGE LEARNING FOR TREE ENSEMBLE METHODS ALEXANDRE WERNECK ANDREZA 23 November 2020
In supervised learning, tree ensemble methods are recognized for their high performance in a wide range of applications. Moreover, several references report that such methods are resistant to overfitting. This work investigates this observed resistance by proposing a method that pushes against it. When predicting an instance, tree ensemble methods determine the leaf of each tree into which the instance falls. In our method, the prediction is then obtained from a new function defined over all the leaves of the ensemble, chosen by minimizing a loss function or an error estimator on the training set, which in some sense overfits in the learning phase. The method can be interpreted either as automated feature engineering or as a prediction optimizer.
MACHINE LEARNING; OPTIMIZER PREDICTION; FEATURE CONSTRUCTION; ENSEMBLE METHODS
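The two-stage idea in this abstract can be sketched as follows. Stage one maps an instance to the leaf it reaches in each tree. Stage two fits one weight per leaf by minimizing squared loss on the training set. This is only an illustrative sketch under simplifying assumptions: each "tree" is a hypothetical one-feature tree given as a sorted list of split points, the loss is squared error, and the optimizer is plain stochastic gradient descent. None of these choices is taken from the thesis.

```python
def leaf_ids(trees, x):
    """Stage 1: index of the leaf that x reaches in each tree.

    Each tree is a sorted list of split points on a single feature;
    the leaf index is the number of split points that x exceeds."""
    return [sum(x > s for s in splits) for splits in trees]


def predict(trees, weights, x):
    """Prediction is the sum of the weights of the leaves that x hits."""
    return sum(weights[t][leaf] for t, leaf in enumerate(leaf_ids(trees, x)))


def fit_leaf_weights(trees, xs, ys, epochs=500, lr=0.1):
    """Stage 2: learn one weight per (tree, leaf) by minimizing
    squared loss on the training set with stochastic gradient descent."""
    weights = [[0.0] * (len(splits) + 1) for splits in trees]
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            residual = predict(trees, weights, x) - y
            for t, leaf in enumerate(leaf_ids(trees, x)):
                # Gradient of 0.5 * residual**2 w.r.t. each active leaf weight.
                weights[t][leaf] -= lr * residual
    return weights
```

Because the leaf indicators act as learned features for a linear model, the same sketch reads equally well as feature construction or as re-optimizing the ensemble's predictions.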
28	An Unsupervised Consensus Control Chart Pattern Recognition Framework Haghtalab, Siavash 01 January 2014
Early identification and detection of abnormal time series patterns is vital in many manufacturing settings. Slight shifts and alterations in time series patterns may indicate an anomaly in the production process, such as a machinery malfunction. Because of the continuous flow of data, monitoring manufacturing processes usually requires automated Control Chart Pattern Recognition (CCPR) algorithms. The majority of the CCPR literature consists of supervised classification algorithms; fewer studies consider unsupervised versions of the problem. Although unsupervised methods have the clear advantage of requiring less manual data labeling, their use is limited because their performance is not robust enough for practical purposes. In this study we propose the use of a consensus clustering framework. Computational results show robust behavior compared to individual clustering algorithms.
Data mining; machine learning; unsupervised learning; control chart pattern recognition; clustering; consensus clustering; ensemble methods; Engineering; Industrial Engineering; Systems Engineering
29	Using supervised machine learning and sentiment analysis techniques to predict homophobia in Portuguese tweets Pereira, Vinicius Gomes 16 April 2018
This work studies the identification of homophobic tweets using natural language processing and machine learning. The goal is to build a predictive model that can detect, with reasonable accuracy, whether or not a tweet contains content offensive to LGBT individuals. The database used to train the predictive models was built by aggregating tweets from users who interacted with politicians and/or political parties in Brazil. Tweets containing LGBT-related terms or references to openly LGBT individuals were collected and manually classified. A large part of this work lies in constructing features that accurately capture not only the text of the tweet but also specific characteristics of the users and of their language choices, including colloquial Portuguese expressions. In particular, the use of swear words and strong vocabulary is a strong predictor of offensive tweets. Naturally, n-grams and term weighting schemes were also considered as model features. A total of 12 sets of features were constructed. A broad range of machine learning techniques was employed in the classification task: naive Bayes, regularized logistic regression, feedforward neural networks, extreme gradient boosting (XGBoost), random forest and support vector machines. After each model was estimated and tuned, the models were combined using voting and stacking. Voting over 10 models obtained the best result, with 89.42% accuracy.
Sentiment Analysis; Machine Learning; Supervised learning; Ensemble Methods; Homophobia; Data mining (Computing); Data modeling
30	Combined decision making with multiple agents Simpson, Edwin Daniel January 2014
In a wide range of applications, decisions must be made by combining information from multiple agents with varying levels of trust and expertise. For example, citizen science involves large numbers of human volunteers with differing skills, while disaster management requires aggregating information from multiple people and devices to make timely decisions. This thesis introduces efficient and scalable Bayesian inference for decision combination, allowing us to fuse the responses of multiple agents in large, real-world problems and to account for the agents' unreliability in a principled manner. As the behaviour of individual agents can change significantly, for example if agents move in a physical space or learn to perform an analysis task, this work proposes a novel combination method that accounts for these time variations in a fully Bayesian manner using a dynamic generalised linear model. This approach can also be used to augment agents' responses with continuous feature data, thus permitting decision-making when agents' responses are in limited supply. Working with information inferred using the proposed Bayesian techniques, an information-theoretic approach is developed for choosing optimal pairs of tasks and agents. This approach is demonstrated by an algorithm that maintains a trustworthy pool of workers and enables efficient learning by selecting informative tasks. The novel methods developed here are compared theoretically and empirically to a range of existing decision combination methods, using both simulated and real data. The results show that the methodology proposed in this thesis improves accuracy and computational efficiency over alternative approaches and yields insights into the behavioural groupings of agents.
519.5
