31

Μηχανική μάθηση σε ανομοιογενή δεδομένα / Machine learning in imbalanced data sets

Λυπιτάκη, Αναστασία Δήμητρα Δανάη 07 July 2015 (has links)
Machine learning algorithms should ideally generalise to every class with the same accuracy: in a two-class problem with positive and negative cases, the algorithm should predict positive and negative examples equally well. In many applications, however, the algorithms must learn from a data set that contains far more examples of one class than of the other. Inductive algorithms are generally designed to minimise the overall error, so classes with few cases can be largely ignored, because the cost of misclassifying the over-represented class outweighs the cost of misclassifying the minority class. The problem of imbalanced data sets arises in many real-world applications, such as medical diagnosis, robotics, industrial production processes, fault detection in communication networks, automated testing of electronic equipment, and many other areas. This thesis, entitled 'Machine Learning with Imbalanced Data', addresses the problem of applying machine learning algorithms effectively to imbalanced data.
The thesis includes a general description of basic ML algorithms and of methods for dealing with imbalanced data sets. A number of algorithmic techniques for handling imbalanced data are presented, such as AdaCost, cost-sensitive boosting, MetaCost and other algorithms. The evaluation metrics for ML methods on imbalanced data sets are presented, including ROC (receiver operating characteristic) curves, PR (precision-recall) curves and cost curves. A new hybrid ML algorithm combining OverBagging and Rotation Forest is introduced, and the proposed procedure is compared with related algorithms in the WEKA environment. Experimental results demonstrate the performance superiority of the proposed algorithm. Finally, the conclusions of this work are presented and several future research directions are given.
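As an illustration of the resampling idea behind the proposed method, the sketch below shows a plain OverBagging-style ensemble in Python: each bootstrap bag is balanced by oversampling the minority class before a base tree is trained, and the bags vote by majority. It is a minimal sketch assuming NumPy arrays and integer-coded class labels, not the thesis's OverBagging + Rotation Forest hybrid or its WEKA implementation.

```python
# Minimal OverBagging-style sketch (illustrative; not the thesis's hybrid algorithm).
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def over_bagging_fit(X, y, n_estimators=25, base=None, seed=0):
    """Fit an ensemble on class-balanced bootstrap bags (X, y: NumPy arrays, labels 0..K-1)."""
    base = base or DecisionTreeClassifier()
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    bag_size = counts.max()                       # every class is resampled up to this size
    models = []
    for _ in range(n_estimators):
        idx = np.concatenate([rng.choice(np.where(y == c)[0], size=bag_size, replace=True)
                              for c in classes])
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def over_bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])          # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

Balancing each bag directly targets the minority-class neglect described above; in the thesis's hybrid, a Rotation Forest style feature rotation would additionally be applied per bag.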
32

Arbres de décisions symboliques, outils de validations et d'aide à l'interprétation / Symbolic decision trees, tools for validation and interpretation assistance

Seck, Djamal 20 December 2012 (has links)
In this thesis, we propose the STREE methodology for the construction of decision trees with symbolic data. This data type allows us to characterise higher-level individuals, which may be classes or categories of individuals, or concepts in the sense of the Galois lattice. The values of the variables, called symbolic variables, may be sets, intervals or histograms. The recursive partitioning criterion is a combination of a criterion on the explanatory variables and a criterion on the dependent variable. The first criterion is the variation of the variance of the explanatory variables; when it is applied alone, STREE acts as a top-down clustering method. The second criterion enables us to build a decision tree: it is the variation of the Gini index if the dependent variable is nominal, and the variation of the variance if the dependent variable is continuous or symbolic. Conventional data are a special case of symbolic data, on which STREE also obtains good results; it performs well on several UCI data sets compared with standard data-mining methods such as CART, C4.5, Naive Bayes, KNN, MLP and SVM. The STREE methodology also allows the construction of ensembles of symbolic decision trees, either by bagging or by boosting. The use of such ensembles is intended to overcome the shortcomings of individual decision trees and to obtain a final decision that is, in principle, more reliable than that obtained from a single tree.
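For reference, the splitting criterion based on the variation of the Gini index can be written in its standard form as below; the thesis combines it with the variance criterion on the explanatory variables and extends it to symbolic data, so this is only the familiar special case.

```latex
\[
  G(t) = 1 - \sum_{k=1}^{K} p_k(t)^2,
  \qquad
  \Delta G(t) = G(t) - \frac{n_{t_L}}{n_t}\, G(t_L) - \frac{n_{t_R}}{n_t}\, G(t_R),
\]
% p_k(t): proportion of class k in node t; t_L, t_R: candidate child nodes.
% The split that maximises the decrease in impurity \Delta G(t) is retained.
```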
33

Combinação de classificadores para inferência dos rejeitados / Combination of classifiers for reject inference

Rocha, Ricardo Ferreira da 16 March 2012 (has links)
In credit scoring problems, the interest is in associating with each credit applicant a probability of default. Traditional models, however, use biased samples, because the available data cover only applicants whose previous credit requests were approved. In order to reduce the sample bias of these models, we use strategies that extract information about the rejected applicants so that a response (good or bad payer) can be inferred for them; this is what we call reject inference. Together with these strategies we use bagging (bootstrap aggregating), which consists of building several models on bootstrap samples of the training data and combining them into a new predictor. In this work we discuss some of the combination methods in the literature, in particular combination by logistic regression, which is still little used but yields interesting results. We also discuss the main strategies for reject inference. The analyses are carried out through a simulation study, on both generated data sets and publicly available real data sets.
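A minimal sketch of the combination step highlighted here, assuming NumPy arrays, 0/1 default labels and a base learner that implements predict_proba: bagged base models are fitted on bootstrap replicates and their predicted default probabilities are combined by a logistic regression rather than by majority vote. This is illustrative only, not the dissertation's exact procedure.

```python
# Hedged sketch: bagging whose member forecasts are combined by a logistic regression.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def fit_bagged_logistic_combiner(X, y, base, n_bags=20, seed=0):
    """X, y: NumPy arrays with y coded 0/1 (1 = default); base must implement predict_proba."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)                      # bootstrap replicate
        models.append(clone(base).fit(X[idx], y[idx]))
    # meta-features: each member's predicted probability of default
    Z = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    combiner = LogisticRegression(max_iter=1000).fit(Z, y)
    return models, combiner

def predict_default_probability(models, combiner, X_new):
    Z_new = np.column_stack([m.predict_proba(X_new)[:, 1] for m in models])
    return combiner.predict_proba(Z_new)[:, 1]
```

In practice the combiner would typically be fitted on out-of-bag rather than in-sample member predictions to limit optimism; the sketch keeps the in-sample version for brevity.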
34

Analyse und Vergleich des Modal Splits in den Jahren 2013 und 2018 auf Basis der SrV-Daten mithilfe von Random Forest / Analysis and comparison of the modal split in 2013 and 2018 based on SrV data using random forest

Lins, Stefan Martin 04 March 2021 (has links)
The high share of transport in total emissions, its contribution to climate change and the extensive land consumption of private motorised transport reinforce the political demands for a transport transition. The aim of this thesis is to develop an optimal classification model using machine-learning methods that are presented in methodological detail. The model enables the evaluation and forecasting of mode choice, and thus of the modal split, on the basis of various influencing factors, in particular over the period between 2013 and 2018. Previous studies have focused on non-European areas and one-off surveys. The analysis uses the mobility survey 'SrV - Mobilität in Städten', carried out by the Technische Universität Dresden for the 25 large German comparison cities in 2013 and 2018. After data preparation, the individual feature variables are assessed for their suitability in the modelling process using descriptive methods and measures of association, in order to obtain the most meaningful model results possible. Based on CART decision trees, models with the bagging, random forest and boosting algorithms are built for both years. To put the effectiveness of these models into context, artificial neural networks and multinomial logistic regression are also examined for both years. Based on random forest, which achieves the best quality measures in the study with an overall accuracy of 82.9 % (AUC 0.9458) for 2013 and 79.8 % (AUC 0.9377) for 2018, the influencing factors are described and evaluated using a variable importance plot and partial dependence plots. In particular, it is found that the length and duration of a trip and the availability of a public-transport season ticket have the greatest influence on mode choice. Over time, it is noticeable that car trips in particular are substituted by cycling and public transport, while walking shows only minor changes. Most of the estimated classification models achieve excellent predictions of mode choice, although predictions are most difficult for the bicycle.
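A condensed sketch of this workflow in Python (the thesis presumably works with other tooling, and a column name such as trip_length_km is a placeholder, not an SrV variable code): fit a random forest on numerically coded trip records, rank variable importance, and compute the partial dependence of the predicted mode on one feature.

```python
# Hedged sketch of the random-forest workflow described above; feature names are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence
from sklearn.model_selection import train_test_split

def fit_mode_choice_forest(df: pd.DataFrame, target: str = "mode"):
    """df: numerically coded trip records; target: chosen mode (e.g. walk/bike/PT/car)."""
    X, y = df.drop(columns=[target]), df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
    forest = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_tr, y_tr)
    # variable importance ranking, analogous to a variable importance plot
    importance = (pd.Series(forest.feature_importances_, index=X.columns)
                    .sort_values(ascending=False))
    # partial dependence of the predicted class probabilities on a single feature
    pdp = partial_dependence(forest, X_tr, features=["trip_length_km"], kind="average")
    return forest, forest.score(X_te, y_te), importance, pdp
```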
35

Ensemble Classifier Design and Performance Evaluation for Intrusion Detection Using UNSW-NB15 Dataset

Zoghi, Zeinab 30 November 2020 (has links)
No description available.
36

The evaluation of different banana bunch protection materials on selected banana cultivars for optimum fruit production and quality in Nampula Province, Mozambique

Kutinyu, Rodrick 14 January 2015 (has links)
Mozambique has the potential to boost its banana exports. To fully realise this, agronomic practices should be developed to combat physiological disorders associated with banana production within the region. Currently, lower temperatures are being experienced at some production sites, consequently affecting yield and quality. The objective of this study was to evaluate the use of bunch protection covers on the banana cultivars Grand Nain and Williams under different fruit protection materials, in order to determine the protection bag best suited to Metocheria, Nampula. Plants were not selected near plantation borders, drainage canals, the cable way or roads, as this would influence plant growth and fruit development. Treatments consisted of a control (no bag on bunches), white perforated polyethylene, white non-perforated polyethylene, blue perforated polyethylene, blue non-perforated polyethylene, green perforated polyethylene, green non-perforated polyethylene and cheese-cloth bags, arranged in a completely randomised block design (CRBD) with 26 plants replicated eight times. During 2012/2013, bagging treatments did not considerably improve hand weight, finger weight, total fruit weight, marketable weight, percentage marketable fruit weight or box-stem ratio (BSR) of Grand Nain; however, fruit defects were reduced in all bagging treatments compared with the control (no bags). In Williams, bagging treatments improved weight during the 2013 season, but no significant differences were observed in hand weight in 2012. Bagging of banana bunches reduced defects in both seasons. Both green and blue perforated bags improved the box-stem ratio, and bagging treatments increased the yield (per ton) of the Williams cultivar in both seasons. / Agriculture and Animal Health / M. Sc. (Agriculture)
37

Investigation of multivariate prediction methods for the analysis of biomarker data

Hennerdal, Aron January 2006 (has links)
This thesis describes predictive modelling of biomarker data from patients suffering from multiple sclerosis. Improvements to multivariate analyses of the data are investigated, with the goal of increasing the ability to assign samples to the correct subgroups from the data alone. The effects of different prior scalings of the data are investigated, and combinations of multivariate modelling methods and variable selection methods are evaluated. Attempts are made to merge the predictive capabilities of the method combinations through voting procedures. A technique for improving the results of PLS modelling, called bagging, is evaluated. The best of the multivariate analysis methods tried are found to be partial least squares (PLS) and support vector machines (SVM). It is concluded that the scaling has little effect on prediction performance for most methods. The method combinations have interesting properties: the default variable selections of the multivariate methods are not always the best. Bagging improves performance, but at a high cost. No reasons for drastically changing the workflow of the biomarker data analysis are found, but slight improvements are possible. Further research is needed.
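A small sketch of the bagging-for-PLS idea evaluated here, assuming a numeric biomarker matrix and a binary 0/1 group label (illustrative only, not the thesis's pipeline): PLS regression models are fitted on bootstrap replicates and their predicted scores are averaged before thresholding, a simple PLS-DA variant of bagging.

```python
# Hedged sketch: bagged PLS-DA by averaging predicted scores over bootstrap replicates.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def bagged_pls_da(X, y, X_new, n_bags=30, n_components=3, seed=0):
    """X, y: NumPy arrays with y coded 0/1; assumes each bootstrap bag contains both classes."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = np.zeros(len(X_new))
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)                      # bootstrap replicate
        pls = PLSRegression(n_components=n_components).fit(X[idx], y[idx])
        scores += pls.predict(X_new).ravel()
    scores /= n_bags
    return (scores > 0.5).astype(int), scores                 # predicted class, averaged score
```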
38

SVM-Based Negative Data Mining to Binary Classification

Jiang, Fuhua 03 August 2006 (has links)
The properties of a training data set, such as its size, distribution and number of attributes, contribute significantly to the generalization error of a learning machine. A poorly distributed data set is prone to produce a partially overfitted model. The two approaches proposed in this dissertation for binary classification enhance the useful data information by mining negative data. First, an error-driven compensating hypothesis approach is based on Support Vector Machines (SVMs) with (1+k)-iteration learning, where the base learning hypothesis is iteratively compensated k times. This approach produces a new hypothesis on a new data set in which each label is a transformation of the label from the negative data set, further producing positive and negative child data subsets in subsequent iterations. This procedure refines the base hypothesis by the k child hypotheses created in the k iterations. A prediction method is also proposed to trace the relationship between the negative subsets and the testing data set by a vector similarity technique. Second, a statistical negative-example learning approach based on theoretical analysis improves the performance of the base learning algorithm, learner, by creating one or two additional hypotheses, audit and booster, to mine the negative examples output by learner. The learner employs a regular Support Vector Machine to classify the main examples and recognise which examples are negative. The audit works on the negative training data created by learner to predict whether an instance is negative. The boosting learner booster is applied when audit is not accurate enough to judge learner correctly; booster works on the training data subsets on which learner and audit do not agree. The classifier used for testing is the combination of learner, audit and booster: for a specific instance it returns learner's result if audit acknowledges learner's result or learner agrees with audit's judgment, and otherwise returns booster's result. The error of the classifier is decreased to O(e^2) compared with the error O(e) of the base learning algorithm.
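The sketch below is one possible reading of the learner/audit/booster combination, heavily simplified and hypothetical (the dissertation's actual construction may differ): the audit learns to flag instances the cross-validated learner misclassifies, the booster is trained on the instances where learner and audit disagree, and at test time the learner's answer is kept unless the audit flags it.

```python
# One hedged reading of the learner/audit/booster scheme; simplified and illustrative only.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def fit_learner_audit_booster(X, y):
    """X, y: NumPy arrays; assumes the cross-validated learner makes some errors and that
    learner and audit disagree on instances of more than one class."""
    learner = SVC(kernel="rbf").fit(X, y)
    wrong = (cross_val_predict(SVC(kernel="rbf"), X, y, cv=5) != y).astype(int)  # 1 = "negative" example
    audit = SVC(kernel="rbf").fit(X, wrong)
    disagree = audit.predict(X) != wrong
    booster = SVC(kernel="rbf").fit(X[disagree], y[disagree])
    return learner, audit, booster

def predict_combined(learner, audit, booster, X_new):
    out = learner.predict(X_new)
    flagged = audit.predict(X_new) == 1           # audit does not endorse the learner here
    if flagged.any():
        out[flagged] = booster.predict(X_new[flagged])
    return out
```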
39

Agregação via bootstrap: uma investigação de desempenho em classificadores estatísticos e redes neurais, avaliação numérica e aplicação no suporte ao diagnóstico de câncer de mama / Bootstrap agregating : an investigation of performance in statistics and neural networks classifiers, numerical evaluation and application on breast cancer diagnostic support

SIMÕES, Simone Castelo Branco 27 February 2007 (has links)
In pattern recognition, medical diagnosis has received great attention. In general, the emphasis has been on identifying the best model for diagnostic prediction, assessed according to its generalization ability. In this context, methods that combine classifiers have proved very effective and can be considered for improving performance in diagnostic tasks that demand greater precision. The bagging method, proposed by Breiman (1996), uses the bootstrap to generate different samples of the training set, builds classifiers on the generated samples and combines the different predictions by majority vote. Empirical studies are usually carried out to evaluate the performance of bagging. In this dissertation, we investigate the generalization ability of bagging for usual statistical classifiers and the multilayer perceptron network through stochastic simulation. Different population-separation structures are built from specific distributions. Additionally, we present an application to breast cancer diagnostic support. The results were obtained using the R environment for data analysis and graphics. In general, the simulations indicate that the performance of bagging depends on how the populations are separated. In the application, bagging proved effective in improving sensitivity.
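In the spirit of the simulation study described here (illustrative only: the dissertation examines statistical classifiers and an MLP, while this sketch uses a decision tree as the base learner because bagging mostly benefits unstable learners), two Gaussian populations with a chosen mean separation are generated and a single classifier is compared against its bagged, majority-vote version.

```python
# Illustrative two-Gaussian simulation: single tree vs. bagged trees (majority vote).
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
separation = 1.5                                              # shift between population means
X = np.vstack([rng.normal(0.0, 1.0, (500, 5)),
               rng.normal(separation, 1.0, (500, 5))])
y = np.repeat([0, 1], 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)

single = DecisionTreeClassifier(random_state=7).fit(X_tr, y_tr)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=7),
                           n_estimators=100, random_state=7).fit(X_tr, y_tr)
print("single tree accuracy :", round(single.score(X_te, y_te), 3))
print("bagged trees accuracy:", round(bagged.score(X_te, y_te), 3))
```

Varying the separation parameter reproduces the kind of dependence on population separation that the dissertation reports.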
