41

Data mining file sharing metadata : A comparison between Random Forests Classification and Bayesian Networks

Petersson, Andreas January 2015 (has links)
In this experiment-based comparative study, it is demonstrated that the two evaluated machine learning techniques, Bayesian networks and random forests, have similar predictive power in the domain of classifying torrents on BitTorrent file sharing networks. The work was performed in two steps. First, a literature analysis was performed to gain insight into how the two techniques work and what types of attacks exist against BitTorrent file sharing networks. After the literature analysis, an experiment was performed to evaluate the accuracy of the two techniques. The results show no significant advantage of using one algorithm over the other when considering accuracy alone. However, ease of use lies in random forests' favour, because the technique requires little pre-processing of the data and still generates accurate results with few false positives.
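For a concrete picture of this kind of head-to-head comparison, here is a minimal Python sketch. It is a hedged illustration only: scikit-learn's GaussianNB stands in for a full Bayesian network (which scikit-learn does not provide), and the synthetic data is a placeholder for real torrent metadata features.

```python
# Minimal sketch: cross-validated accuracy of a random forest vs. a naive
# Bayesian classifier on generic tabular features (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic placeholder for torrent metadata (size, seeders, name tokens, ...)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for name, clf in [("Random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
                  ("Naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```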
43

Predicting movie ratings : A comparative study on random forests and support vector machines

Persson, Karl January 2015 (has links)
The aim of this work is to evaluate the prediction performance of random forests in comparison to support vector machines when predicting the numerical user rating of a movie from pre-release attributes such as its cast, directors, budget and genres. To answer this question, an experiment was conducted on predicting the overall user rating of 3376 Hollywood movies, using data from the well-established movie database IMDb. The prediction performance of the two algorithms was assessed and compared over three commonly used performance and error metrics, and evaluated by means of significance testing to investigate whether any significant differences could be identified. The results indicate some differences between the two algorithms, with consistently better performance from random forests over all of the metrics, as well as significantly better results for two out of three metrics. Although a slight difference is indicated by the results, both algorithms show great similarities in their prediction performance, making it hard to draw any general conclusion on which algorithm yields the more accurate movie rating predictions.
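A hedged sketch of this evaluation protocol follows, with synthetic regression data standing in for the IMDb features and a Wilcoxon signed-rank test on paired absolute errors as one plausible significance test (the abstract does not specify which test was used).

```python
# Sketch: random forest vs. SVM regression, compared on MAE/RMSE plus a
# paired significance test over per-sample absolute errors.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=3376, n_features=30, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
svm = SVR(kernel="rbf").fit(X_tr, y_tr)

pred_rf, pred_svm = rf.predict(X_te), svm.predict(X_te)
for name, pred in [("RF", pred_rf), ("SVM", pred_svm)]:
    print(name, "MAE:", mean_absolute_error(y_te, pred),
          "RMSE:", mean_squared_error(y_te, pred) ** 0.5)

# Paired test: do the two absolute-error distributions differ significantly?
stat, p = wilcoxon(np.abs(y_te - pred_rf), np.abs(y_te - pred_svm))
print("Wilcoxon p-value:", p)
```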
44

Time Dependent Kernel Density Estimation: A New Parameter Estimation Algorithm, Applications in Time Series Classification and Clustering

Wang, Xing 23 May 2016 (has links)
The Time Dependent Kernel Density Estimation (TDKDE) developed by Harvey & Oryshchenko (2012) is a kernel density estimate adjusted by the Exponentially Weighted Moving Average (EWMA) weighting scheme. The Maximum Likelihood Estimation (MLE) procedure for estimating its parameters proposed by Harvey & Oryshchenko (2012) is easy to apply but has two inherent problems. In this study, we evaluate the performance of the probability density estimation in terms of the uniformity of Probability Integral Transforms (PITs) on various kernel functions combined with different preset numbers. Furthermore, we develop a new estimation algorithm, which can be conducted using Artificial Neural Networks, to eliminate the inherent problems of the MLE method and to improve estimation performance. Based on the new estimation algorithm, we develop the TDKDE-based Random Forests time series classification algorithm, which is significantly superior to the commonly used statistical-feature-based Random Forests method as well as the Kernel Density Estimation (KDE)-based Random Forests approach. Furthermore, the proposed TDKDE-based Self-Organizing Map (SOM) clustering algorithm is demonstrated to be superior to the widely used Discrete Wavelet Transform (DWT)-based SOM method in terms of the Adjusted Rand Index (ARI).
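The core idea of an EWMA-weighted density estimate can be sketched in a few lines of numpy. The decay and bandwidth values below are illustrative assumptions, not the parameters estimated in the study.

```python
# Sketch of a time-dependent KDE in the spirit of TDKDE: each observation's
# kernel gets an EWMA weight, so newer observations contribute more.
import numpy as np

def tdkde(x_grid, samples, omega=0.99, h=0.3):
    T = len(samples)
    w = omega ** np.arange(T - 1, -1, -1)   # oldest sample gets omega^(T-1), newest gets 1
    w /= w.sum()
    # Gaussian kernels: f(x) = sum_t w_t * K_h(x - x_t)
    z = (x_grid[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * h)
    return K @ w

rng = np.random.default_rng(0)
# A regime shift: recent samples come from a different distribution
samples = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 100)])
grid = np.linspace(-4, 7, 200)
density = tdkde(grid, samples)
print("density integrates to ~1:", np.trapz(density, grid))
```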
45

Automatic Pain Assessment from Infants’ Crying Sounds

Pai, Chih-Yun 01 November 2016 (has links)
Crying is the means by which infants express their emotional state. It provides parents and nurses with a cue for understanding an infant's physiological state. Many researchers have analyzed infants' crying sounds to diagnose specific diseases or determine the reasons for crying. This thesis presents an automatic crying level assessment system to classify infants' crying sounds, recorded under realistic conditions in the Neonatal Intensive Care Unit (NICU), as whimpering or vigorous crying. To analyze the crying signal, Welch's method and Linear Predictive Coding (LPC) are used to extract spectral features; the average and the standard deviation of the frequency signal and the maximum power spectral density are the other spectral features used in classification. For classification, three state-of-the-art classifiers, namely K-nearest Neighbors, Random Forests, and Least Squares Support Vector Machine, are tested in this work. The highest accuracy in classifying whimpering and vigorous crying, 90%, is achieved on the clean dataset, which is sampled 10 seconds before scoring and 5 seconds after scoring, using K-nearest Neighbors as the classifier.
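A minimal sketch of the spectral feature-extraction step, assuming one-second frames at a 16 kHz sampling rate and synthetic stand-in signals; the LPC coefficients used in the thesis are omitted here for brevity.

```python
# Sketch: Welch power-spectral-density features plus simple spectral
# statistics, fed to a k-nearest-neighbours classifier.
import numpy as np
from scipy.signal import welch
from sklearn.neighbors import KNeighborsClassifier

FS = 16_000  # assumed sampling rate

def spectral_features(signal):
    freqs, psd = welch(signal, fs=FS, nperseg=1024)
    return [psd.mean(), psd.std(), psd.max(), freqs[np.argmax(psd)]]

rng = np.random.default_rng(0)
# Toy stand-ins: "whimper" = quiet noise, "vigorous" = loud tonal cry + noise
whimpers = [rng.normal(0, 0.2, FS) for _ in range(20)]
vigorous = [np.sin(2 * np.pi * 800 * np.arange(FS) / FS) + rng.normal(0, 1.0, FS)
            for _ in range(20)]

X = np.array([spectral_features(s) for s in whimpers + vigorous])
y = np.array([0] * 20 + [1] * 20)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("training accuracy:", knn.score(X, y))
```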
46

Vývoj kredit skóringových modelov s využitím vybraných štatistických metód v R / Building credit scoring models using selected statistical methods in R

Jánoš, Andrej January 2016 (has links)
Credit scoring is an important and rapidly developing discipline. The aim of this thesis is to describe the basic methods used for building and interpreting credit scoring models, with an example application of these methods: designing such models using the statistical software R. The thesis is organized into five chapters. Chapter one explains the term credit scoring, with the main examples of its application and the motivation for studying this topic. The following chapters introduce the three methods most often used in financial practice for building credit scoring models. Chapter two discusses the most developed of them, logistic regression. The main emphasis is put on the logistic regression model, which is characterized from a mathematical point of view, and various ways to assess the quality of the model are presented. The other two methods presented in this thesis are decision trees and Random forests, covered in chapters three and four. An important part of the thesis is a detailed application of the described models to a specific data set, Default, using R. The final, fifth chapter is a practical demonstration of building credit scoring models, their diagnostics and the subsequent evaluation of their applicability in practice using R. The appendices include the R code used, including functions developed for testing the final model and code used throughout the thesis. The key aspect of the work is to provide enough theoretical knowledge and practical skills for a reader to fully understand the mentioned models and be able to apply them in practice.
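The thesis itself works in R; as a rough, language-neutral illustration of the core step (fitting a logistic regression on default data and reading model quality off the AUC and the related Gini coefficient, both standard credit-scoring diagnostics), here is a hedged Python analogue on synthetic data.

```python
# Sketch: logistic-regression scorecard on a synthetic Default-style dataset
# (label 1 = default), evaluated with AUC and Gini = 2*AUC - 1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~10% default rate
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"AUC = {auc:.3f}, Gini = {2 * auc - 1:.3f}")
```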
47

Modelos computacionais prognósticos de lesões traumáticas do plexo braquial em adultos / Prognostic computational models for traumatic brachial plexus injuries in adults

Luciana de Melo e Abud 20 June 2018 (has links)
Studies of prognosis refer to the prediction of the course of a disease in patients and are employed by health professionals in order to improve patients' recovery chances and quality. From a computational perspective, the creation of a prognostic model is a classification task that aims to identify to which class (within a predefined set of classes) a new sample belongs. The goal of this project is the creation of prognostic models for traumatic injuries of the brachial plexus, a network of nerves that innervates the upper limbs, using data from adult patients with this kind of injury. The data come from the Neurology Institute Deolindo Couto (INDC) of Rio de Janeiro Federal University (UFRJ) and are characterized by dozens of clinical features collected by means of electronic questionnaires. With these prognostic models we intended to automatically identify possible predictors of the course of brachial plexus injuries. Decision trees are classifiers frequently used for the creation of prognostic models, since they are a transparent technique whose results can be clinically examined and interpreted. Random Forests are a technique that uses a set of decision trees to determine the final classification result and can significantly improve a model's accuracy and generalization, yet they are still not commonly used for the creation of prognostic models. In this project we explored the use of random forests for that purpose, as well as the use of interpretation methods for the resulting models, since model transparency is an important aspect in clinical domains. Model assessment was achieved by means of methods suitable for application over a small set of samples, since the available prognostic data refer to only 44 patients from the INDC. Additionally, we adapted the random forests technique to handle missing values, which are frequent among the data used in this project. Four prognostic models were created, one for each recovery goal: absence of pain, and satisfactory strength evaluated over shoulder abduction, elbow flexion and external shoulder rotation. The models' accuracies were estimated between 77% and 88%, calculated through the leave-one-out cross-validation method. These models will evolve with the inclusion of new data from new patients arriving at the INDC, and they will be used as part of a clinical decision support system, with the purpose of predicting a patient's recovery considering his or her clinical characteristics.
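A hedged sketch of the evaluation setup described here: a random forest on a 44-sample dataset with missing values, assessed with leave-one-out cross-validation. Simple median imputation stands in for the thesis's own missing-value adaptation, and all data below are synthetic placeholders.

```python
# Sketch: imputation + random forest pipeline, scored with leave-one-out CV,
# mirroring a 44-patient clinical dataset with ~10% missing entries.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(44, 12))             # 44 patients, 12 clinical features (toy)
X[rng.random(X.shape) < 0.1] = np.nan     # ~10% missing values
y = rng.integers(0, 2, size=44)           # recovery outcome (toy labels)

model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestClassifier(n_estimators=200, random_state=0))
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())
```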
48

Head impact detection with sensor fusion and machine learning

Strandberg, Aron January 2022 (has links)
Head injury is common in many different sports and elsewhere, and is often associated with different difficulties. One major problem is to identify and assess the injury or its severity. Sometimes there is no sign of head injury, but a serious neck distortion has occurred, causing symptoms similar to head injuries, e.g. concussion or mild TBI (traumatic brain injury). This study investigated whether direct and indirect measurements of head kinematics, combined with machine learning and 3D visualization, can be used to identify head injury and assess its severity. Injury statistics have found that many severe head injuries are caused by oblique impacts. An oblique impact gives rise to both linear and rotational kinematics. Since the human brain is very sensitive to rotational kinematics, violent rotations of the head can result in large shear strains in the brain. This is when white matter and white matter connections are disrupted in the brain by acceleration and deceleration, or by rotational acceleration kinematics, which in turn causes traumatic brain injuries such as diffuse axonal injury (DAI). Lately there have been many studies in this field using different types of new technologies, but the most prevalent development is the rise of wearable sensors, which have become smaller, faster and more energy efficient and have been integrated into mouthguards and inertial measurement units (IMUs) the size of a SIM card that measure and report a body's specific force. It has been shown that a 6-axis IMU (3-axis rotational and 3-axis acceleration measurements) may improve head injury prediction, but more data is needed to confirm existing head injury criteria, and new criteria that consider directional sensitivity need to be developed. Today, IMUs are typically used in self-driving cars, aircraft, spacecraft, satellites, etc. More and more studies that have evaluated and utilized IMUs in new, uncharted fields have shown promise, especially in sports and in the neuroscience and medical fields. This study proposed a method to 3D-visualize head kinematics during the event of a possible head injury, so that the injury can be indirectly identified and assessed by medical professionals, as well as a direct method to identify and assess the severity of head injury with machine learning. Data from reconstructed head impacts and non-head impacts were recorded using an open-source 9-axis IMU sensor and a proprietary 6-axis IMU sensor. To assess the head injury or its severity, existing head injury criteria such as the Abbreviated Injury Scale (AIS), Head Injury Criterion (HIC), Head Impact Power (HIP), Severity Index (SI) and Generalized Acceleration Model for Brain Injury Threshold (GAMBIT) were introduced. To detect head impacts, including their severity, and non-head impacts, a Random Forests (RF) classifier and Support Vector Machine (SVM) classifiers with linear and radial basis function kernels were proposed; the prediction results have been promising.
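Of the criteria listed, the Head Injury Criterion has a compact closed form: HIC is the maximum over time windows [t1, t2] (capped at 15 ms for HIC15) of (t2 - t1) times the mean resultant acceleration, in g, over that window raised to the power 2.5. A direct numpy sketch on a sampled acceleration trace, with a toy half-sine pulse as input:

```python
# Sketch: HIC15 computed by exhaustive search over windows of a sampled trace.
import numpy as np

def hic(t, a_g, max_window=0.015):
    """t: time in seconds (ascending), a_g: resultant head acceleration in g."""
    # Cumulative integral of a(t) via the trapezoidal rule
    cum = np.concatenate([[0.0], np.cumsum(0.5 * (a_g[1:] + a_g[:-1]) * np.diff(t))])
    best = 0.0
    for i in range(len(t)):
        for j in range(i + 1, len(t)):
            dt = t[j] - t[i]
            if dt > max_window:
                break
            avg = (cum[j] - cum[i]) / dt      # mean acceleration over [t_i, t_j]
            best = max(best, dt * avg ** 2.5)
    return best

# Toy half-sine impact pulse: 150 g peak over 10 ms, sampled at 10 kHz
t = np.arange(0, 0.02, 1e-4)
a = np.where(t < 0.01, 150 * np.sin(np.pi * t / 0.01), 0.0)
print("HIC15 ~", round(hic(t, a)))
```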
49

Predicting the size of a company winning a procurement: an evaluation study of three classification models

Björkegren, Ellen January 2022 (has links)
In this thesis, the performance of the classification methods Linear Discriminant Analysis (LDA), Random Forests (RF), and Support Vector Machines (SVM) are compared using procurement data to predict what size company will win a procurement. This is useful information for companies, since bidding on a procurement takes time and resources, which they can save if they know their chances of winning are low. The data used in the models are collected from OpenTender and allabolag.se and represent procurements that were awarded to companies in 2020. A total of 8 models are created, two versions of the LDA model, two versions of the RF model, and four versions of the SVM model, where some models are more complex than others. All models are evaluated on overall performance using hit rate, Huberty’s I Index, mean average error, and Area Under the Curve. The most complex SVM model performed the best across all evaluation measurements, whereas the less complex LDA model performed overall worst. Hit rates and mean average errors are also calculated within each class, and the complex SVM models performed best on all company sizes, except the small companies which were best predicted by the less complex Random Forest model.
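A hedged sketch of such a three-way comparison, with synthetic features standing in for the OpenTender and allabolag.se variables, reporting hit rate (accuracy) and one-vs-rest AUC over three company-size classes:

```python
# Sketch: LDA vs. random forest vs. SVM on a three-class problem, compared on
# hit rate and multiclass (one-vs-rest) AUC.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=15, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {"LDA": LinearDiscriminantAnalysis(),
          "RF": RandomForestClassifier(n_estimators=200, random_state=0),
          "SVM": SVC(kernel="rbf", probability=True, random_state=0)}

for name, m in models.items():
    m.fit(X_tr, y_tr)
    proba = m.predict_proba(X_te)
    print(name,
          "hit rate:", round(accuracy_score(y_te, m.predict(X_te)), 3),
          "AUC (ovr):", round(roc_auc_score(y_te, proba, multi_class="ovr"), 3))
```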
50

Post-Pruning of Random Forests

Diyar, Jamal January 2018 (has links)
Context. In machine learning, ensemble methods continue to receive increased attention. Since machine learning approaches that generate a single classifier or predictor have shown limited capabilities in some contexts, ensemble methods are used to yield better predictive performance. One of the most interesting and effective ensemble algorithms introduced in recent years is Random Forests. A common approach to ensure that Random Forests achieve high predictive accuracy is to use a large number of trees. If the predictive accuracy is to be increased with a higher number of trees, the result is a more complex model, which may be more difficult to interpret or analyse. In addition, generating an increased number of trees results in higher computational power and memory requirements. Objectives. This thesis explores automatic simplification of Random Forest models via post-pruning as a means to reduce the size of the model and increase interpretability while retaining or increasing predictive accuracy. The aim of the thesis is twofold. First, it compares and empirically evaluates a set of state-of-the-art post-pruning techniques on the simplification task. Second, it investigates the trade-off between predictive accuracy and model interpretability. Methods. The primary research method used to conduct this study and to address the research questions is experimentation. All post-pruning techniques are implemented in Python. The Random Forest models are trained, evaluated, and validated on five selected datasets with varying characteristics. Results. There is no significant difference in predictive performance between the compared techniques, and none of the studied post-pruning techniques outperforms the others on all included datasets. The experimental results also show that model interpretability trades off against model accuracy, at least for the studied settings: a positive change in model interpretability is accompanied by a negative change in model accuracy. Conclusions. It is possible to reduce the size of a complex Random Forest model while retaining or improving the predictive accuracy. Moreover, the suitability of a particular post-pruning technique depends on the application area and the amount of training data available. Significantly simplified models may be less accurate than the original model but tend to be perceived as more comprehensible.
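One concrete baseline instance of post-pruning, not any specific technique from the thesis: train a large forest, rank the trees by individual validation accuracy, and keep only the best few. A hedged Python sketch (note that selecting and reporting on the same validation set, as done here for brevity, is optimistic):

```python
# Sketch: greedy tree selection as a simple post-pruning baseline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Rank member trees by their individual validation accuracy, keep the top 20
ranked = sorted(rf.estimators_, key=lambda tree: tree.score(X_val, y_val), reverse=True)
pruned = ranked[:20]

def ensemble_predict(trees, X):
    # Majority vote over the retained trees
    votes = np.stack([tree.predict(X) for tree in trees])   # (n_trees, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

full_acc = rf.score(X_val, y_val)
pruned_acc = (ensemble_predict(pruned, X_val) == y_val).mean()
print(f"200 trees: {full_acc:.3f}  ->  20 trees: {pruned_acc:.3f}")
```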
