Global ETD Search

381	Computing Random Forests Variable Importance Measures (VIM) on Mixed Numerical and Categorical Data / Beräkning av Random Forests variable importance measures (VIM) på kategoriska och numeriska prediktorvariabler Hjerpe, Adam January 2016 The Random Forest model is commonly used as a predictor function and has proven useful in a variety of applications. Its popularity stems from its high prediction accuracy, its ability to model high-dimensional complex data, and its applicability under predictor correlations. This report investigates the random forest variable importance measure (VIM) as a means to find a ranking of important variables. The robustness of the VIM under imputation of categorical noise and its capability to differentiate informative predictors from non-informative variables are investigated. The selection of variables may improve robustness of the predictor, improve the prediction accuracy, reduce computational time, and serve as an exploratory data analysis tool. In addition, the partial dependency plot obtained from the random forest model is examined as a means to find underlying relations in a non-linear simulation study. / Random Forest (RF) är en populär prediktormodell som visat goda resultat vid en stor uppsättning applikationsstudier. Modellen ger hög prediktionsprecision, har förmåga att modellera komplex högdimensionell data och modellen har vidare visat goda resultat vid interkorrelerade prediktorvariabler. Detta projekt undersöker ett mått, variabel importance measure (VIM) erhållna från RF modellen, för att beräkna graden av association mellan prediktorvariabler och målvariabeln. Projektet undersöker känsligheten hos VIM vid kvalitativt prediktorbrus och undersöker VIMs förmåga att differentiera prediktiva variabler från variabler som endast, med avseende på målvariabeln, beskriver brus.
Att differentiera prediktiva variabler vid övervakad inlärning kan användas till att öka robustheten hos klassificerare, öka prediktionsprecisionen, reducera data dimensionalitet och VIM kan användas som ett verktyg för att utforska relationer mellan prediktorvariabler och målvariabeln. machine learning ml variable importance vim random forests rf feature selection variable selection exploratory data analysis eda Computer Sciences Datavetenskap (datalogi)
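A minimal sketch of one common variable importance measure, permutation importance, may help make the idea in entry 381 concrete. The abstract does not say which VIM variant the thesis uses, so this is an illustration under assumptions: the toy data, the `predict` rule standing in for a fitted forest, and all function names are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single column is shuffled, per column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between column j and the label
            shuffled = [row[:j] + (v,) + row[j + 1:]
                        for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: column 0 (numeric) determines the label,
# column 1 (categorical) is pure noise.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.choice("abc")) for _ in range(200)]
labels = [int(x > 0.5) for x, _ in rows]

def predict(row):
    # Stand-in for a fitted model; it only looks at column 0.
    return int(row[0] > 0.5)

vim = permutation_importance(predict, rows, labels)
```

Here `vim[0]` is clearly positive while `vim[1]` is zero, which is the kind of separation between informative and non-informative predictors that the abstract describes.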
382	Comparison of Automatic Classifiers’ Performances using Word-based Feature Extraction Techniques in an E-government setting Marin Rodenas, Alfonso January 2011 Nowadays email is commonly used by citizens to communicate with their government. Among the received emails, governments deal with common queries and subjects that handling officers have to answer manually. Automatic classification of the incoming emails makes it possible to increase communication efficiency by decreasing the delay between the query and its response. This thesis is part of the IMAIL project, which aims to provide an automatic answering solution to the Swedish Social Insurance Agency (SSIA) (“Försäkringskassan” in Swedish). The goal of this thesis is to analyze and compare the classification performance of different sets of features extracted from SSIA emails on different automatic classifiers. The features extracted from the emails also depend on the preprocessing carried out beforehand. Compound splitting, lemmatization, stop-word removal, Part-of-Speech tagging and n-grams are the processes applied to the data set. Classification is performed using Support Vector Machines, k-Nearest Neighbors and Naive Bayes. For the analysis and comparison of the results, precision, recall and F-measure are used. In the results obtained in this thesis, SVM provides the best classification with an F-measure value of 0.787. However, Naive Bayes provides a better classification than SVM for most of the email categories. Thus, it cannot be concluded whether SVM classifies better than Naive Bayes or not. Furthermore, a comparison to Dalianis et al. (2011) is made. The results obtained with this approach outperformed the earlier results. SVM provided an F-measure value of 0.858 when using PoS tagging on the original emails. This result improves on the 0.83 obtained in Dalianis et al. (2011) by almost 3%.
In this case, SVM was clearly better than Naive Bayes. E-government machine learning WEKA SVM Naive Bayes kNN Swedish PoS tagging feature extraction feature selection automatic e-mail classification Computer and Information Sciences Data- och informationsvetenskap
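Two of the preprocessing steps named in entry 382, stop-word removal and word n-grams, can be sketched as follows. This is only an illustration: the stop-word list and the example sentence are hypothetical (the thesis works on Swedish emails and uses WEKA), and the function names are not taken from the thesis.

```python
# Hypothetical English stop-word list, for illustration only.
STOP_WORDS = {"how", "do", "i", "for", "the", "a"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "How do I apply for parental benefit".lower().split()
content_tokens = remove_stop_words(tokens)
bigrams = word_ngrams(content_tokens, 2)
```

After stop-word removal the tokens are `apply`, `parental`, `benefit`, so the bigram features are `apply parental` and `parental benefit`. Each distinct n-gram would then become one feature for the classifier.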
383	Sélection de caractéristiques stables pour la segmentation d'images histologiques par calcul haute performance / Robust feature selection for histology images through high performance computing Bouvier, Clément 18 January 2019 L’histologie produit des images à l’échelle cellulaire grâce à des microscopes optiques très performants. La quantification du tissu marqué comme les neurones s’appuie de plus en plus sur des segmentations par apprentissage automatique. Cependant, l’apprentissage automatique nécessite une grande quantité d’informations intermédiaires, ou caractéristiques, extraites de la donnée brute multipliant d’autant la quantité de données à traiter. Ainsi, le nombre important de ces caractéristiques est un obstacle au traitement robuste et rapide de séries d’images histologiques. Les algorithmes de sélection de caractéristiques pourraient réduire la quantité d’informations nécessaires mais les ensembles de caractéristiques sélectionnés sont peu reproductibles. Nous proposons une méthodologie originale fonctionnant sur des infrastructures de calcul haute-performance (CHP) visant à sélectionner des petits ensembles de caractéristiques stables afin de permettre des segmentations rapides et robustes sur des images histologiques acquises à très haute-résolution. Cette sélection se déroule en deux étapes : la première à l’échelle des familles de caractéristiques. La deuxième est appliquée directement sur les caractéristiques issues de ces familles. Dans ce travail, nous avons obtenu des ensembles généralisables et stables pour deux marquages neuronaux différents. Ces ensembles permettent des réductions significatives des temps de traitement et de la mémoire vive utilisée. Cette méthodologie rendra possible des études histologiques exhaustives à haute-résolution sur des infrastructures CHP que ce soit en recherche préclinique et possiblement clinique.
/ In preclinical research and more specifically in neurobiology, histology uses images produced by increasingly powerful optical microscopes digitizing entire sections at cell scale. Quantification of stained tissue such as neurons relies on machine learning driven segmentation. However, such methods need a lot of additional information, or features, which are extracted from raw data, multiplying the quantity of data to process. As a result, the number of features is becoming an obstacle to processing large series of histological images in a fast and robust manner. Feature selection methods could reduce the amount of required information, but the selected subsets lack stability. We propose a novel methodology operating on high performance computing (HPC) infrastructures and aiming at finding small and stable sets of features for fast and robust segmentation on high-resolution histological whole sections. This selection has two steps: the first operates at the scale of feature families (an intermediate pool of features, between space and individual feature); the second is performed on the features from the pre-selected families. In this work, the selected sets of features are stable for two different neuronal stainings. Furthermore, the feature selection results in a significant reduction of computation time and memory cost. This methodology can potentially enable exhaustive histological studies at a high-resolution scale on HPC infrastructures for both preclinical and clinical research settings. Apprentissage automatique Données massives Sélection de caractéristiques Calcul haute performance Histologie Traitement d'images Machine learning High performance computing Feature selection High Performance Computing Histology Image processing 005.7
384	Locality-Dependent Training and Descriptor Sets for QSAR Modeling Hobocienski, Bryan Christopher 21 September 2020 No description available. Chemical Engineering
385	Data Engineering and Failure Prediction for Hard Drive S.M.A.R.T. Data Ramanayaka Mudiyanselage, Asanga 08 September 2020 No description available. Computer Science Machine Learning Data Engineering Python Data Analysis Big Data Predictive Analytics, Feature Selection Resampling Techniques Hard Drive Failure Prediction SMART Attributes Scikit-Learn PySpark
386	Ein Framework zur Optimierung der Energieeffizienz von HPC-Anwendungen auf der Basis von Machine-Learning-Methoden Gocht-Zech, Andreas 03 November 2022 Ein üblicher Ansatzpunkt zur Verbesserung der Energieeffizienz im High Performance Computing (HPC) ist, neben Verbesserungen an der Hardware oder einer effizienteren Nachnutzung der Wärme des Systems, die Optimierung der ausgeführten Programme. Dazu können zum Beispiel energieoptimale Einstellungen, wie die Frequenzen des Prozessors, für verschiedene Programmfunktionen bestimmt werden, um diese dann im späteren Verlauf des Programmes anwenden zu können. Mit jeder Änderung des Programmes kann sich dessen optimale Einstellung ändern, weshalb diese zeitaufwendig neu bestimmt werden muss. Das stellt eine wesentliche Hürde für die Anwendung solcher Verfahren dar. Dieser Prozess des Bestimmens der optimalen Frequenzen kann mithilfe von Machine-Learning-Methoden vereinfacht werden, wie in dieser Arbeit gezeigt wird. So lässt sich mithilfe von sogenannten Performance-Events ein neuronales Netz erstellen, mit dem während der Ausführung des Programmes die optimalen Frequenzen automatisch geschätzt werden können. Performance-Events sind prozessorintern und können Einblick in die Abläufe im Prozessor gewähren. Bei dem Einsatz von Performance-Events gilt es einige Fallstricke zu vermeiden. So werden die Performance-Events von Performance-Countern gezählt. Die Anzahl der Counter ist allerdings begrenzt, womit auch die Anzahl der Events, die gleichzeitig gezählt werden können, limitiert ist. Eine für diese Arbeit wesentliche Fragestellung ist also: Welche dieser Events sind relevant und müssen gezählt werden? Bei der Beantwortung dieser Frage sind Merkmalsauswahlverfahren hilfreich, besonders sogenannte Filtermethoden, bei denen die Merkmale vor dem Training ausgewählt werden. Viele bekannte Methoden gehen dabei entweder davon aus, dass die Zusammenhänge zwischen den Merkmalen linear sind, wie z. B.
bei Verfahren, die den Pearson-Korrelationskoeffizienten verwenden, oder die Daten müssen in Klassen eingeteilt werden, wie etwa bei Verfahren, die auf der Transinformation beruhen. Beides ist für Performance-Events nicht ideal. Auf der einen Seite können keine linearen Zusammenhänge angenommen werden. Auf der anderen Seite bedeutet das Einteilen in Klassen einen Verlust an Information. Um diese Probleme zu adressieren, werden in dieser Arbeit bestehende Merkmalsauswahlverfahren mit den dazugehörigen Algorithmen analysiert, neue Verfahren entworfen und miteinander verglichen. Es zeigt sich, dass mit neuen Verfahren, die auf sogenannten Copulas basieren, auch nichtlineare Zusammenhänge erkannt werden können, ohne dass die Daten in Klassen eingeteilt werden müssen. So lassen sich schließlich einige Events identifizieren, die zusammen mit neuronalen Netzen genutzt werden können, um die Energieeffizienz von HPC-Anwendungen zu steigern. Das in dieser Arbeit erstellte Framework erfüllt dabei neben der Auswahl der Performance-Events weitere Aufgaben: Es stellt sicher, dass diverse Programmteile mit verschiedenen optimalen Einstellungen voneinander unterschieden werden können. Darüber hinaus sorgt das Framework dafür, dass genügend Daten erzeugt werden, um ein neuronales Netz zu trainieren, und dass dieses Netz später einfach genutzt werden kann. Dabei ist das Framework so flexibel, dass auch andere Machine-Learning-Methoden getestet werden können. Die Leistungsfähigkeit des Frameworks wird abschließend in einer Ende-zu-Ende-Evaluierung an einem beispielhaften Programm demonstriert.
Die Evaluierung illustriert, dass bei nur 7% längerer Laufzeit eine Energieeinsparung von 24% erzielt werden kann und zeigt damit, dass mit Machine-Learning-Methoden wesentliche Energieeinsparungen erreicht werden können.:1 Einleitung und Motivation 2 Energieeffizienz und Machine-Learning – eine thematische Einführung 2.1 Energieeffizienz von Programmen im Hochleistungsrechnen 2.1.1 Techniken zur Energiemessung oder -abschätzung 2.1.2 Techniken zur Beeinflussung der Energieeffizienz in der Hardware 2.1.3 Grundlagen zur Performanceanalyse 2.1.4 Regionsbasierte Ansätze zur Erhöhung der Energieeffizienz 2.1.5 Andere Ansätze zur Erhöhung der Energieeffizienz 2.2 Methoden zur Merkmalsauswahl 2.2.1 Merkmalsauswahlmethoden basierend auf der Informationstheorie 2.2.2 Merkmalsauswahl für stetige Merkmale 2.2.3 Andere Verfahren zur Merkmalsauswahl 2.3 Machine-Learning mit neuronalen Netzen 2.3.1 Neuronale Netze 2.3.2 Backpropagation 2.3.3 Aktivierungsfunktionen 3 Merkmalsauswahl für mehrdimensionale nichtlineare Abhängigkeiten 3.1 Analyse der Problemstellung, Merkmale und Zielgröße 3.2 Merkmalsauswahl mit mehrdimensionaler Transinformation für stetige Merkmale 3.2.1 Mehrdimensionale Copula-Entropie und mehrdimensionale Transinformation 3.2.2 Schätzung der mehrdimensionalen Transinformation basierend auf Copula-Dichte 3.3 Normierung 3.4 Vergleich von Copula-basierten Maßzahlen mit der klassischen Transinformation und dem Pearson-Korrelationskoeffizienten 3.4.1 Deterministische Abhängigkeit zweier Variablen 3.4.2 Unabhängigkeit 3.5 Vergleich verschiedener Methoden zur Auswahl stetiger Merkmale 3.5.1 Erzeugung synthetischer Daten 3.5.2 Szenario 1 – fünf relevante Merkmale 3.5.3 Szenario 2 – fünf relevante Merkmale, fünf wiederholte Merkmale 3.5.4 Schlussfolgerungen aus den Simulationen 3.6 Zusammenfassung 4 Entwicklung und Umsetzung des Frameworks 4.1 Erweiterungen der READEX Runtime Library 4.1.1 Grundlegender
Aufbau der READEX Runtime Library 4.1.2 Call-Path oder Call-Tree 4.1.3 Calibration-Module 4.2 Testsystem 4.2.1 Architektur 4.2.2 Bestimmung des Offsets zur Energiemessung mit RAPL 4.3 Verwendete Benchmarks zur Erzeugung der Datengrundlage 4.3.1 Datensatz 1: Der Stream-Benchmark 4.3.2 Datensatz 2: Eine Sammlung verschiedener Benchmarks 4.4 Merkmalsauswahl und Modellgenerierung 4.4.1 Datenaufbereitung 4.4.2 Merkmalsauswahl Algorithmus 4.4.3 Performance-Events anderer Arbeiten zum Vergleich 4.4.4 Erzeugen und Validieren eines Modells mithilfe von TensorFlow und Keras 4.5 Zusammenfassung 5 Evaluierung des Ansatzes 5.1 Der Stream-Benchmark 5.1.1 Analyse der gewählten Merkmale 5.1.2 Ergebnisse des Trainings 5.2 Verschiedene Benchmarks 5.2.1 Ausgewählte Merkmale 5.2.2 Ergebnisse des Trainings 5.3 Energieoptimierung einer Anwendung 6 Zusammenfassung und Ausblick Literatur Abbildungsverzeichnis Tabellenverzeichnis Quelltextverzeichnis / There are a variety of different approaches to improve energy efficiency in High Performance Computing (HPC). Besides advances to the hardware or cooling systems, optimising the executed programmes' energy efficiency is another promising approach. Energy-optimal settings, such as the processor frequency, can be determined for individual program functions and then applied during the program's execution to reduce energy consumption. However, when the program is modified, the optimal setting might change. Therefore, the energy-optimal settings need to be determined again, which is a time-consuming process and a significant impediment for applying such methods. Fortunately, finding the optimal frequencies can be simplified using machine learning methods, as shown in this thesis. With the help of so-called performance events, a neural network can be trained, which can automatically estimate the optimal processor frequencies during program execution. Performance events are processor-specific and can provide insight into the procedures of a processor.
However, there are some pitfalls to be avoided when using performance events. Performance events are counted by performance counters, but as the number of counters is limited, the number of events that can be counted simultaneously is also limited. This poses the question of which of these events are relevant and need to be counted. Though the issue has received some attention in several publications, a convincing solution remains to be found. In answering this question, feature selection methods are helpful, especially so-called filter methods, where features are selected before the training. Unfortunately, many feature selection methods either assume a linear correlation between the features, as do methods using the Pearson correlation coefficient, or require the data to be split into classes, as do methods based on mutual information. Neither can be applied to performance events, as linear correlation cannot be assumed and splitting the data into classes would result in a loss of information. In order to address that problem, this thesis analyses existing feature selection methods together with their corresponding algorithms, designs new methods, and compares different feature selection methods. By utilising new methods based on the mathematical concept of copulas, it was possible to detect non-linear correlations without splitting the data into classes. Thus, several performance events could be identified, which can be utilised together with neural networks to increase the energy efficiency of HPC applications. In addition to selecting performance events, the created framework ensures that different programme parts, which might have different optimal settings, can be identified. Moreover, it ensures that sufficient data for the training of the neural networks is generated and that the network can easily be applied. Furthermore, the framework is flexible enough to evaluate other machine learning methods.
Finally, an end-to-end evaluation with a sample application demonstrated the framework's performance. The evaluation illustrates that, while extending the runtime by only 7%, energy savings of 24% can be achieved, showing that substantial energy savings can be attained using machine learning approaches. info:eu-repo/classification/ddc/006 ddc:006
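The limitation of linear correlation that entry 386 mentions can be illustrated with a small sketch. It does not reproduce the copula-based measures of the thesis; it only shows, on hypothetical synthetic data, that the Pearson coefficient can be essentially zero even when one variable is fully determined by the other.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data: x is symmetric around zero, y depends on x
# deterministically but not linearly.
xs = [i / 10 for i in range(-50, 51)]
ys = [x * x for x in xs]

r_nonlinear = pearson(xs, ys)                    # close to zero
r_linear = pearson(xs, [3 * x + 1 for x in xs])  # close to one
```

A filter method that ranks features by `abs(pearson(feature, target))` would discard `xs` here, even though it predicts `ys` perfectly. This is the kind of case that motivates dependence measures which do not assume linearity.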
387	Automated Intro Detection For TV Series / Automatiserad detektion av intron i TV-serier Redaelli, Tiago, Ekedahl, Jacob January 2020 Media consumption has shown a tremendous increase in recent years, and with this increase, new audience expectations are put on the features offered by media-streaming services. One of these expectations is the ability to skip redundant content, which most probably is not of interest to the user. In this work, intro sequences which have sufficient length and a high degree of image similarity across all episodes of a show are targeted for detection. A statistical prediction model for classifying video intros based on these features was proposed. The model tries to identify frame similarities across videos from the same show and then filter out incorrect matches. The performance evaluation of the prediction model shows that the proposed solution for unguided predictions had an accuracy of 90.1%, and precision and recall rates of 93.8% and 95.8% respectively. The mean margin of error for a predicted start and end was 1.4 and 2.0 seconds. The performance was even better if the model had prior knowledge of one or more intro sequences from the same TV series confirmed by a human. However, due to dataset limitations the result is inconclusive. The prediction model was integrated into an automated system for processing internet videos available on SVT Play, and included administrative capabilities for correcting invalid predictions. / Under de senaste åren så har konsumtionen av TV-serier ökat markant och med det tillkommer nya förväntningar på den funktionalitet som erbjuds av webb-TV-tjänster. En av dessa förväntningar är förmågan att kunna hoppa över redundant innehåll, vilket troligen inte är av intresse för användaren. I detta arbete så ligger fokus på att detektera video intron som bedöms som tillräckligt långa och har en hög grad av bildlikhet över flera episoder från samma TV-program.
En statistisk modell för att klassificera intron baserat på dessa egenskaper föreslogs. Modellen jämför bilder från samma TV-program för att försöka identifiera matchande sekvenser och filtrera bort inkorrekta matchningar. Den framtagna modellen hade en träffsäkerhet på 90.1%, precision på 93.8% och en återkallelseförmåga på 95.8%. Medelfelmarginalen uppgick till 1.4 sekunder för start och 2.0 sekunder för slut av ett intro. Modellen presterade bättre om den hade tillgång till en eller fler liknande introsekvenser från relaterade videor från samma TV-program bekräftat av en människa. Eftersom datasetet som användes för testning hade vissa brister så ska resultatet endast ses som vägledande. Modellen integrerades i ett system som automatiskt processar internet videos från SVT Play. Ett tillhörande administrativt verktyg skapades även för att kunna rätta felaktiga gissningar. intro detection Hidden Markov model feature selection image similarity comparison average hash SVT intro detektion dold Markovmodell attributselektion bildlikhet average hash SVT Computer and Information Sciences Data- och informationsvetenskap
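The keywords of entry 387 list "average hash" as the image similarity measure. A minimal sketch of that general technique follows. How the thesis downsamples frames or thresholds distances is not stated in the abstract, so the tiny 4x4 grayscale "frames" and the function names here are hypothetical.

```python
def average_hash(pixels):
    """Bit list: 1 where a pixel is brighter than the frame's mean, else 0.

    `pixels` is a 2D list of grayscale values that is assumed to be
    already downscaled to a small fixed size.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]

def hamming(hash_a, hash_b):
    """Number of positions in which two equally long hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

# Hypothetical frames: left half dark, right half bright.
frame_a = [[10, 10, 200, 200] for _ in range(4)]
# Same picture, slightly brighter overall.
frame_b = [[value + 15 for value in row] for row in frame_a]
# Different picture: top half bright, bottom half dark.
frame_c = [[200] * 4, [200] * 4, [10] * 4, [10] * 4]

same_scene = hamming(average_hash(frame_a), average_hash(frame_b))
other_scene = hamming(average_hash(frame_a), average_hash(frame_c))
```

Because each bit is relative to the frame's own mean, a uniform brightness shift leaves the hash unchanged (`same_scene` is 0), while a different layout changes half of the bits (`other_scene` is 8). Frames from different episodes whose hashes are within a small Hamming distance can then be treated as candidate matches.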
388	Automatic Feature Extraction for Human Activity Recognition on the Edge Cleve, Oscar, Gustafsson, Sara January 2019 This thesis evaluates two methods for automatic feature extraction to classify the accelerometer data of periodic and sporadic human activities. The first method selects features using individual hypothesis tests and the second uses a random forest classifier as an embedded feature selector. The hypothesis test was combined with a correlation filter in this study. Both methods used the same initial pool of automatically generated time series features. A decision tree classifier was used to perform the human activity recognition task for both methods. The possibility of running the developed model on a processor with limited computing power was taken into consideration when selecting methods for evaluation. The classification results showed that the random forest method was good at prioritizing among features. With 23 features selected it had a macro average F1 score of 0.84 and a weighted average F1 score of 0.93. The first method, however, only had a macro average F1 score of 0.40 and a weighted average F1 score of 0.63 when using the same number of features. In addition to the classification performance, this thesis studies the potential business benefits that automation of feature extraction can result in. / Denna studie utvärderar två metoder som automatiskt extraherar features för att klassificera accelerometerdata från periodiska och sporadiska mänskliga aktiviteter. Den första metoden väljer features genom att använda individuella hypotestester och den andra metoden använder en random forest-klassificerare som en inbäddad feature-väljare. Hypotestestmetoden kombinerades med ett korrelationsfilter i denna studie. Båda metoderna använde samma initiala samling av automatiskt genererade features. En decision tree-klassificerare användes för att utföra klassificeringen av de mänskliga aktiviteterna för båda metoderna.
Möjligheten att använda den slutliga modellen på en processor med begränsad hårdvarukapacitet togs i beaktning då studiens metoder valdes. Klassificeringsresultaten visade att random forest-metoden hade god förmåga att prioritera bland features. Med 23 utvalda features erhölls ett makromedelvärde av F1 score på 0,84 och ett viktat medelvärde av F1 score på 0,93. Hypotestestmetoden resulterade i ett makromedelvärde av F1 score på 0,40 och ett viktat medelvärde av F1 score på 0,63 då lika många features valdes ut. Utöver resultat kopplade till klassificeringsproblemet undersöker denna studie även potentiella affärsmässiga fördelar kopplade till automatisk extrahering av features. Human Activity Recognition Automatic Feature Extraction Automatic Feature Selection Automated Machine Learning Random Forest Classifier Hypothesis Test Computer and Information Sciences Data- och informationsvetenskap
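Entry 388 reports both macro and weighted average F1 scores, which can differ considerably when classes are imbalanced. The sketch below shows how the two averages are computed. The activity labels and predictions are hypothetical and are not data from the thesis.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 score of every class that occurs in y_true."""
    scores = {}
    for cls in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores

def macro_and_weighted_f1(y_true, y_pred):
    """Unweighted mean of class F1 scores, and mean weighted by class size."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)
    return macro, weighted

# Hypothetical labels: a frequent periodic activity and a rare sporadic one.
y_true = ["walk"] * 8 + ["fall"] * 2
y_pred = ["walk"] * 8 + ["walk", "fall"]

macro_f1, weighted_f1 = macro_and_weighted_f1(y_true, y_pred)
```

In this toy case the rare class has the lower F1, so the macro average (every class counts equally) comes out below the weighted average (frequent classes dominate). A gap of the same direction appears in the scores quoted in the abstract, although the abstract itself does not break results down by class.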
389	Exploring Feature Selection Techniques for Machine Learning-based Melanoma Skin Cancer Classification / Utforskar tekniker för attributurval för maskininlärningsbaserad klassificering av melanomhudcancer Eriksson Mueller, Thomas, Fornstad, Viktor January 2023 One of the most globally common types of cancer is skin cancer, where melanoma is the most deadly form. An important and promising tool for diagnosing diseases such as skin cancer is computer aided diagnostics, a tool which utilizes machine learning to predict and classify cancer. Limiting the complexity of the data, known as feature selection, can potentially improve classification accuracy. This report evaluates the accuracy of four different classifiers - Support Vector Machine, Naive Bayes, Decision Tree and Artificial Neural Network - with four different feature selection methods - Sequential Forward Selection, Sequential Backward Selection, Entropy and Principal Component Analysis - on the PH2 skin cancer dataset, containing dermoscopic images of skin lesions and their respective metadata. The findings reveal that all feature selection methods led to an improved accuracy rate on at least one classifier compared to not using feature selection. Furthermore, certain feature selection methods resulted in a significant gain in accuracy, indicating the potential value of feature selection techniques in improving the accuracy and efficiency of machine learning classifiers in computer-aided diagnosis systems for melanoma skin cancer detection. However, the results also underscore the importance of careful selection of the number of features to avoid adverse effects on model performance. This research contributes to the field by demonstrating the impact of feature selection methods on melanoma skin cancer detection and highlighting considerations for their application. / En av de globalt vanligaste typerna av cancer är hudcancer, där melanom är den mest dödliga typen.
Ett viktigt och effektivt verktyg för att diagnostisera sjukdomar som hudcancer är datorstödd diagnostik, ett verktyg som använder maskininlärning för att förutse och klassificera cancer. Att begränsa komplexiteten i data, känt som attributurval, kan potentiellt förbättra klassificeringsnoggrannheten. Denna rapport utvärderar noggrannheten hos fyra olika klassificerare - ”Support Vector Machine”, ”Naive Bayes”, ”Decision Tree” och ”Artificial Neural Network” - med fyra olika attributurvalsmetoder - ”Sequential Forward Selection”, ”Sequential Backward Selection”, ”Entropy” och ”Principal Component Analysis” - på PH2 hudcancerdatasetet, som innehåller dermoskopiska bilder av hudlesioner och deras respektive metadata. Resultaten visar att alla attributurvalsmetoder ledde till en förbättrad noggrannhetsgrad på minst en klassificerare jämfört med att inte använda attributurval. Dessutom resulterade vissa attributurvalsmetoder i en betydande ökning i noggrannhet, vilket indikerar det potentiella värdet av attributurvalstekniker för att förbättra noggrannheten och effektiviteten hos maskininlärningsklassificerare i datorstödda diagnossystem för detektering av melanom hudcancer. Däremot understryker resultaten också vikten av noggrant urval av antalet attribut för att undvika negativa effekter på modellens prestanda. Denna forskning bidrar till fältet genom att demonstrera inverkan av attributurvalsmetoder på detektering av melanom hudcancer och belysa överväganden för deras tillämpning. Machine Learning Feature Selection Melanoma Computer Aided Diagnosis Bachelor Thesis Maskininlärning Attributurval Melanom Datorstödd Diagnostik Kandidatarbete Computer and Information Sciences Data- och informationsvetenskap
390	Improving classification accuracy for machine learning / 機械学習における分類精度の向上 / キカイガクシュウニオケルブンルイセイドノコウジョウ 鄭弯弯, Wanwan Zheng 22 March 2021 本論文は，5章より構成されている。第1章では，機械学習の現状，応用及び構成を述べた上，本研究で扱った三つの課題を挙げた。第2章では，小サンプルデータの特徴選択方法を提案した。第3章では，クラスの不均衡性と学習データのサイズが分類器精度への影響を検討した。第4章では，ノイズが分類器の学習を妨げる問題点に対して，多要素ベースの学習に基づいた高速クラスノイズの検出方法を提案した。第5章では，分析の主な結果をまとめ，今後の課題と展望を述べた。 / This thesis is organized into five chapters. Chapter 1 gives a brief explanation of what machine learning is and why it matters. Chapter 2 proposes a way to improve the performance of feature selection methods on low-sample-size data. Chapter 3 empirically studies the effects of class imbalance and training data size on classifier learning. Chapter 4 proposes a fast noise detector that addresses the problems of existing noise detection algorithms, namely over-cleansing, high computational complexity and long response time. Chapter 5 summarizes the main results and closes with future work and outlook. / 博士(文化情報学) / Doctor of Culture and Information Science / 同志社大学 / Doshisha University 特徴選択 クラスの不均衡性 学習データサイズ ノイズ検出 Feature Selection Imbalanced Data Training Data Size Noise Detection

Search results