111 |
Result Prediction by Mining Replays in Dota 2. Johansson, Filip; Wikström, Jesper. January 2015 (has links)
Context: Real-time games like Dota 2 lack the extensive mathematical modeling of turn-based games that can be used to make objective statements about how to best play them. Understanding a real-time computer game through the same kind of modeling as a turn-based game is practically impossible. Objectives: In this thesis an attempt was made to create a model using machine learning that can predict the winning team of a Dota 2 game given partial data collected as the game progressed. Several classifiers were tested; of these, Random Forest was chosen to be studied in more depth. Methods: A method was devised for retrieving Dota 2 replays and parsing them into a format that can be used to train classifier models. An experiment was conducted comparing the accuracy of several machine learning algorithms with the Random Forest algorithm on predicting the outcome of Dota 2 games. A further experiment was conducted comparing the average accuracy of 25 Random Forest models using different settings for the number of trees and attributes. Results: Random Forest had the highest accuracy of the different algorithms, with the best parameter setting having an average of 88.83% accuracy, and an 82.23% accuracy at the five-minute point. Conclusions: Given the results, it was concluded that partial game-state data can be used to accurately predict the results of an ongoing game of Dota 2 in real-time with the application of machine learning techniques.
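The classifier comparison and the tree-count/attribute sweep described above can be sketched with scikit-learn. The snippet below uses a synthetic dataset in place of the parsed replay features (gold, kills, etc. at a time point), so the feature count and parameter grid are illustrative assumptions, not the thesis setup.

```python
# Sketch: compare classifiers on stand-in game-state snapshots, then
# sweep Random Forest settings (number of trees, attributes per split).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("forest", RandomForestClassifier(random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")

# Average cross-validated accuracy over a small (trees x attributes) grid.
best = max(
    ((n, m, cross_val_score(RandomForestClassifier(n_estimators=n,
                                                   max_features=m,
                                                   random_state=0),
                            X, y, cv=3).mean())
     for n in (50, 100) for m in ("sqrt", "log2")),
    key=lambda t: t[2])
print("best (trees, max_features, accuracy):", best)
```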
|
112 |
A knowledge based approach of toxicity prediction for drug formulation: modelling drug vehicle relationships using soft computing techniques. Mistry, Pritesh. January 2015 (has links)
This multidisciplinary thesis is concerned with the prediction of drug formulations for the reduction of drug toxicity. Both scientific and computational approaches are utilised to make original contributions to the field of predictive toxicology. The first part of this thesis provides a detailed scientific discussion on all aspects of drug formulation and toxicity. Discussions are focused on the principal mechanisms of drug toxicity and how drug toxicity is studied and reported in the literature. Furthermore, a review of the current technologies available for formulating drugs for toxicity reduction is provided, along with examples of studies reported in the literature that have used these technologies to reduce drug toxicity. The thesis also provides an overview of the computational approaches currently employed in the field of in silico predictive toxicology. This overview focuses on the machine learning approaches used to build predictive QSAR classification models, with examples drawn from the literature. Two methodologies have been developed as part of the main work of this thesis. The first is focused on the use of directed bipartite graphs and Venn diagrams for the visualisation and extraction of drug-vehicle relationships from large un-curated datasets which show changes in the patterns of toxicity. These relationships can be rapidly extracted and visualised using the methodology proposed in chapter 4. The second methodology involves mining large datasets for the extraction of drug-vehicle toxicity data. It uses an area-under-the-curve principle to make pairwise comparisons of vehicles, which are classified according to the toxicity protection they offer, and from which predictive classification models based on random forests and decision trees are built. The results of this methodology are reported in chapter 6.
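As a rough illustration of the area-under-the-curve principle (with invented response curves, not data from the thesis), one can integrate each vehicle's toxicity-response curve and label the pair by which vehicle offers more protection:

```python
# Hedged sketch: pairwise vehicle comparison by area under the
# toxicity-response curve. All numbers are illustrative placeholders.
import numpy as np

dose = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
tox_a = np.array([0.00, 0.10, 0.25, 0.50, 0.80])   # invented response, vehicle A
tox_b = np.array([0.00, 0.05, 0.12, 0.30, 0.55])   # invented response, vehicle B

def auc(y, x):
    """Trapezoidal area under a toxicity-response curve."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

auc_a, auc_b = auc(tox_a, dose), auc(tox_b, dose)
# Lower area = lower cumulative toxicity = more protection from that vehicle.
label = "B protects more" if auc_b < auc_a else "A protects more"
print(auc_a, auc_b, label)  # 3.575 2.23 B protects more
```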
|
113 |
Evaluation and optimisation of automatic forest stand delineation. Brehmer, Dan. January 2016 (has links)
Stand delineation of forest is largely a manual process that requires a great deal of time. Over the past 20 years, techniques such as Airborne Laser Scanning (ALS) have helped make the process more efficient by generating laser data from which easily interpreted images of forest areas can be created. From the laser and image data, forest attributes such as tree height, tree density and ground elevation can be extracted. The aim of this study was to evaluate which attributes are most relevant for distinguishing forest stands in a system that delineates stands automatically. Classification models were used to analyse the relevance of the attributes, professionals were interviewed, and the literature was reviewed. During the study, the system's algorithms were modified with the ambition of raising its results to a satisfactory level. The study showed that attributes linked to silviculture are the most relevant for automatic stand delineation. Despite the modifications and the use of relevant attributes, the study could not show that the system works as a standalone solution for stand delineation. However, the resulting delineation is suitable as a complement to manual stand delineation.
|
114 |
Modelling of patterns between operational data, diagnostic trouble codes and workshop history using big data and machine learning. Virkkala, Linda; Haglund, Johanna. January 2016 (has links)
The work presented in this thesis is part of a large research and development project on condition-based maintenance for heavy trucks and buses at Scania. The aim of this thesis was to predict the status of a component (the starter motor) using data mining methods and to create models that can predict its failure. Based on workshop history data, error codes and operational data, three sets of classification models were built and evaluated. The first model aims to find patterns in a set of error codes, to see which codes are related to a starter motor failure. The second model aims to see if there are patterns in operational data that lead to the occurrence of an error code. Finally, the two data sets were merged and a classifier was trained and evaluated on this larger data set. Two machine learning algorithms were used and compared throughout the model building: AdaBoost and random forest. There is no statistically significant difference in their performance; both algorithms had error rates of around 13%, 5% and 13% for the three classification models respectively. However, random forest is much faster, and is therefore the preferable option for an industrial implementation. Variable analysis was conducted for the error codes and operational data, resulting in rankings of informative variables. From the evaluation metric precision, it can be derived that if our random forest model predicts a starter motor failure, there is an 85.7% chance that it actually has failed. This model finds 32% (the model's recall) of the failed starter motors. It is also shown that four error codes (2481, 2639, 2657 and 2597) have the highest predictive power for starter motor failure classification. For the operational data, variables that concern the starter motor lifetime and battery health are generally ranked as important by the models. The random forest model finds 81.9% of the cases where the 2481 error code occurs.
If the random forest model predicts that the error code 2481 will occur, there is an 88.2% chance that it will. The classification performance was not increased when the two data sets were merged, indicating that the patterns detected by the first two classification models do not add value to one another.
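The quoted precision and recall figures follow directly from a confusion matrix. The counts below are invented to roughly reproduce the reported 85.7% precision and 32% recall; they are not Scania's actual data.

```python
# Precision: P(component actually failed | model predicted failure).
# Recall: share of all true failures that the model finds.
def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

tp, fp, fn = 30, 5, 64   # hypothetical counts chosen to match the rates
p, r = precision_recall(tp, fp, fn)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.857 recall=0.319
```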
|
115 |
Time Series Online Empirical Bayesian Kernel Density Segmentation: Applications in Real Time Activity Recognition Using Smartphone Accelerometer. Na, Shuang. 28 June 2017 (has links)
Time series analysis has been explored by researchers in many areas, such as statistical research, engineering applications, medical analysis, and finance. To represent the data more efficiently, the mining process is supported by time series segmentation. A time series segmentation algorithm looks for the change points between two different patterns and develops a suitable model depending on the data observed in each segment. Given limited computing and storage capability, it is necessary to consider an adaptive and incremental online segmentation method. In this study, we propose Online Empirical Bayesian Kernel Segmentation (OBKS), which combines Online Multivariate Kernel Density Estimation (OMKDE) and the Online Empirical Bayesian Segmentation (OBS) algorithm. This method uses the online multivariate kernel density as the predictive distribution in Online Empirical Bayesian segmentation, instead of the posterior predictive distribution. The benefit of Online Multivariate Kernel Density Estimation is that it does not require the assumption of a pre-defined prior, which makes OMKDE more adaptive and adjustable than the posterior predictive distribution.
Human Activity Recognition (HAR) with the sensors embedded in smartphones is a modern time series application used in many areas, such as therapeutic applications and automotive sensing. The main procedures in a HAR pipeline include classification, clustering, feature extraction, dimension reduction, and segmentation. Segmentation, as the first step of HAR analysis, attempts to represent the time interval more effectively and efficiently. The traditional segmentation method for HAR is to partition the time series into short, fixed-length segments; however, such segments may not be long enough to capture sufficient information about the entire activity interval. In this research, we instead segment the observations of a whole activity as a single interval, using the Online Empirical Bayesian Kernel Segmentation algorithm as the first step. The observations of these activities are generated by the smartphone's built-in accelerometer.
Based on the segmentation result, we introduce a two-layer random forest classification method: the first layer identifies the main group, and the second layer resolves the subgroup within each main group. With real-time activity recognition on smartphones in mind, the first layer classifies activities as static or dynamic, and the second layer classifies each main group into its sub-classes, depending on the first layer's result. Detecting walking_upstairs and walking_downstairs automatically requires more information and more detailed, and hence more complicated, features, since these two activities are very similar. We evaluate the performance of our method on six activities (sitting, standing, lying, walking, walking_upstairs, and walking_downstairs) recorded from 30 volunteers. For the data collected, we obtain an overall accuracy of 91.4% over the six activities, and an overall accuracy of 100% when distinguishing only the dynamic activities (walking, walking_upstairs, walking_downstairs) from the static activities (sitting, standing, lying).
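A minimal sketch of such a two-layer scheme, with synthetic features and labels standing in for the accelerometer segments (the six-class encoding and the static/dynamic split are assumptions for illustration):

```python
# Layer 1 separates static from dynamic activities; layer 2 trains one
# forest per main group to resolve the sub-activity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=12, n_informative=8,
                           n_classes=6, random_state=0)
DYNAMIC = {3, 4, 5}                           # e.g. walking, upstairs, downstairs
group = np.isin(y, list(DYNAMIC)).astype(int)

layer1 = RandomForestClassifier(random_state=0).fit(X, group)
layer2 = {g: RandomForestClassifier(random_state=0).fit(X[group == g],
                                                        y[group == g])
          for g in (0, 1)}

def predict(x):
    g = layer1.predict(x.reshape(1, -1))[0]        # static vs dynamic
    return layer2[g].predict(x.reshape(1, -1))[0]  # sub-activity within group

print(predict(X[0]), "true:", y[0])
```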
|
116 |
Lending Sociodynamics and Drivers of the Financial Business Cycle. Hawkins, Raymond J.; Kuang, Hengyu. January 2017 (has links)
We extend sociodynamic modeling of the financial business cycle to the Euro Area and Japan. Using an opinion-formation model and machine learning techniques, we obtain stable model estimates of the financial business cycle from central bank lending surveys and a few selected macroeconomic variables. We find that banks respond asymmetrically to good and bad economic information, and that banks adapt to their peers' opinions when changing lending policies.
|
117 |
A pipeline for the identification and examination of proteins implicated in frontotemporal dementia. Waury, Katharina. January 2020 (has links)
Frontotemporal dementia is a neurodegenerative disorder with high heterogeneity at the genetic, pathological and clinical levels. The familial form of the disease is mainly caused by pathogenic variants of three genes: C9orf72, MAPT and GRN. As there is no clear correlation between the mutation and the clinical phenotype, symptom severity or age of onset, the demand for predictive biomarkers is high. While no fluid biomarker for frontotemporal dementia is in use yet, there is strong hope that changes in protein concentrations in the blood or cerebrospinal fluid can aid prognosis many years before symptoms develop. Increasing amounts of data are becoming available through long-term studies of families affected by familial frontotemporal dementia, but their analysis is time-consuming and labour-intensive. Within the scope of this project, a pipeline was built for the automated analysis of proteomics data. Specifically, it aims to identify proteins useful for differentiating between two groups using random forest, a supervised machine learning method. The pipeline's results for a data set containing blood plasma protein concentrations of healthy controls and participants affected by frontotemporal dementia were promising, and its generalised functioning was demonstrated on an independent breast cancer proteomics data set.
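The core differentiation step of such a pipeline can be sketched as follows; the data and protein names are synthetic placeholders, not the study's plasma measurements.

```python
# Train a random forest to separate two groups (e.g. controls vs. cases)
# from protein concentrations, then rank proteins by importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
proteins = [f"protein_{i}" for i in range(X.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = sorted(zip(proteins, forest.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, importance in ranking[:5]:   # candidate discriminating proteins
    print(f"{name}: {importance:.3f}")
```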
|
118 |
Time prediction and process discovery of administration process. Öberg, Johanna. January 2020 (has links)
Machine learning and process mining are two techniques that are becoming more and more popular among organisations for business intelligence purposes. Results from these techniques can be very useful for organisations' decision-making. The Swedish National Forensic Centre (NFC), an organisation that performs forensic analyses, needs a way to visualise and understand its administration process. In addition, the organisation would like to be able to predict how long analyses will take to perform. In this project, it was evaluated whether machine learning and process mining could be used on NFC's administration-process data to satisfy these needs. Using the process mining tool Mehrwerk Process Mining, implemented in the software Qlik Sense, different process variants were discovered from the data and visualised in a comprehensible way. The process variants were easy to interpret and useful for NFC. Machine learning regression models were trained on the data to predict analysis length. Two different datasets were tried, a large dataset with few features and a smaller dataset with more features, and the models were then evaluated on test datasets. The models did not predict the length of analyses acceptably. A reason for this could be that the information in the data was not sufficient for this prediction.
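A hedged sketch of the kind of regression experiment described, with synthetic features standing in for NFC's case data (the model choice and error metric are illustrative assumptions):

```python
# Train a regression model on case features to predict analysis
# duration, then score it on a held-out test set.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
mae = mean_absolute_error(y_te, model.predict(X_te))
print(f"test MAE: {mae:.1f}")   # average error in predicted duration
```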
|
119 |
Identification of customers with fraudulent electricity consumption patterns. Pereira Bizama, Nicole. January 2014 (has links)
Thesis for the degree of Industrial Civil Engineer / The electricity distribution industry in Chile suffers losses every year; in 2012 alone, the company under study recorded losses of more than 6 billion Chilean pesos, whether through theft or metering-equipment failures, so distributors have a strong interest in finding ways to mitigate this problem.
This work aims to build data mining models that identify consumers with a high propensity for electricity theft. The historical customer information available from January 2012 to March 2014 was used, including monthly consumption, previous inspections and supply cuts, among other sources. The data were split into two sets according to whether or not a customer had an inspection record during the study period. This split reflects the fact that an inspected customer has already passed through an inspection filter and, unlike an uninspected customer, it is known with certainty whether they committed fraud.
Three classification models were built on the inspected-customer data: logistic regression, decision tree and random forest. In addition, because the data are unbalanced, with only 2.2% fraud cases, a weighted logistic regression model was fitted in parallel; it produced results similar to the unweighted model, leading to the conclusion that the class imbalance does not affect the problem.
Using a gain curve as the evaluation metric, the random forest model performed best, capturing 39% of the fraud in the first decile of customers, versus 35% for the regression model. In terms of running time, the random forest took more than a day to build, while the regression and decision-tree models took two to three minutes. Given the simplicity of interpreting its results and its short running time, the (unweighted) logistic regression model was chosen to generate each customer's fraud probability; applied to the uninspected-customer data, it achieves an expected fraud rate of 8.6%, well above the 2.2% captured in practice, which would also translate into an average monthly recovery of more than 7 million pesos if the suggested number of inspections were carried out.
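The first-decile gain computation can be illustrated as follows, with synthetic scores and a roughly 2.2% base rate standing in for the real customer data:

```python
# Sort customers by predicted fraud probability and measure the share
# of all fraud captured in the top 10% (first decile of the gain curve).
import numpy as np

rng = np.random.default_rng(0)
labels = (rng.random(1000) < 0.022).astype(int)   # ~2.2% fraud base rate
scores = labels * 0.5 + rng.random(1000)          # fraud cases score higher

order = np.argsort(-scores)                       # descending by score
top_decile = order[:100]
capture = labels[top_decile].sum() / labels.sum()
print(f"fraud captured in first decile: {capture:.0%}")
```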
As a complement, a clustering model was built on the uninspected-customer data to group customers with similar characteristics and to identify anomalous cases far from their group. To establish a point of comparison, the regression model was applied to the list of anomalous cases, yielding an expected fraud rate of 3.1%.
Finally, as future work, the incorporation of other data sources is expected to contribute substantially to energy-fraud detection, such as more detailed demographic information on customers and a more precise economic analysis allowing better estimates of the benefits to be obtained.
|
120 |
Inclusion of Gabor textural transformations and hierarchical structures within an object based analysis of a riparian landscape. Kutz, Kain Markus. 01 May 2018 (has links)
Land cover mapping is an important part of resource management, planning, and economic predictions. Improvements in remote sensing, machine learning, image processing, and object based image analysis (OBIA) have made the process of identifying land cover types increasingly fast and reliable, but these advances are unable to utilize the amount of information encompassed within ultra-high (sub-meter) resolution imagery.
Previously, users have typically reduced the resolution of imagery in an attempt to more closely represent the interpretation or object scale in an image and to rid the image of extraneous information that may cause the OBIA process to identify objects that are too small when performing semi-automated delineation of objects based on an image's properties (Mas et al., 2015; Eiesank et al., 2014; Hu et al., 2010). There have been few known attempts to maximize this detailed information in high resolution imagery using advanced textural components.
In this study we try to circumvent the inherent problems associated with high resolution imagery by combining well researched data transformations that aid the OBIA process with a seldom used texture transformation in Geographic Object Based Image Analysis (GEOBIA), known as the Gabor transform, and the hierarchical organization of landscapes. We observe the difference in segmentation and classification accuracy of a random forest classifier when we fuse a Gabor transformed image with a Normalized Difference Vegetation Index (NDVI), high resolution multi-spectral imagery (RGB and NIR) and a Light Detection and Ranging (LiDAR) derived canopy height model (CHM) within a riparian area in Southeast Iowa. Additionally, we observe the effects on classification accuracy of adding multi-scale land cover data to objects. Both the hierarchical information and the Gabor textural information could aid the GEOBIA process in delineating and classifying the same objects that human experts would delineate within this riparian landscape.
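A small sketch of the Gabor texture transform mentioned above (kernel size, wavelength and orientations are illustrative choices, not the study's parameters): the image is convolved with a bank of Gabor kernels, a sinusoid under a Gaussian window, yielding per-pixel texture responses that could be stacked with NDVI/CHM bands before segmentation.

```python
# Build a Gabor filter bank at several orientations and convolve a
# (random stand-in) image to obtain texture response bands.
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=9, sigma=2.0, theta=0.0, wavelength=4.0):
    """Real part of a Gabor filter: Gaussian-windowed cosine grating."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    gauss = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
    return gauss * np.cos(2 * np.pi * xr / wavelength)

image = np.random.default_rng(0).random((64, 64))  # stand-in for a band
bank = [gabor_kernel(theta=t) for t in (0, np.pi / 4, np.pi / 2)]
responses = np.stack([convolve2d(image, k, mode="same") for k in bank])
print(responses.shape)  # one texture band per orientation
```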
|