91 |
Computing Random Forests Variable Importance Measures (VIM) on Mixed Numerical and Categorical Data / Beräkning av Random Forests variable importance measures (VIM) på kategoriska och numeriska prediktorvariabler
Hjerpe, Adam, January 2016
The Random Forest model is commonly used as a predictor function and has proven useful in a variety of applications. Its popularity stems from its high prediction accuracy, its ability to model high-dimensional complex data, and its applicability under predictor correlations. This report investigates the random forest variable importance measure (VIM) as a means to find a ranking of important variables. The robustness of the VIM under imputation of categorical noise and its capability to differentiate informative predictors from non-informative variables are investigated. Variable selection may improve the robustness of the predictor, improve prediction accuracy, reduce computational time, and serve as an exploratory data analysis tool. In addition, the partial dependency plot obtained from the random forest model is examined as a means to find underlying relations in a non-linear simulation study. / Random Forest (RF) is a popular predictive model that has shown good results in a large number of application studies. The model gives high prediction accuracy, can model complex high-dimensional data, and has also shown good results with intercorrelated predictor variables. This project investigates a measure, the variable importance measure (VIM) obtained from the RF model, for computing the degree of association between predictor variables and the target variable. The project investigates the sensitivity of the VIM to qualitative predictor noise and the VIM's ability to differentiate predictive variables from variables that, with respect to the target variable, describe only noise. Differentiating predictive variables in supervised learning can be used to increase the robustness of classifiers, increase prediction accuracy, and reduce data dimensionality, and the VIM can be used as a tool for exploring relations between predictor variables and the target variable.
|
92 |
Analyse des leviers : effets de colinéarité et hiérarchisation des impacts dans les études de marché et sociales / Driver Analysis : consequenses of multicollinearity quantification of relative impact of drivers in market research applications.
Wallard, Henri, 18 December 2015
Collinearity makes it difficult to use linear regression to estimate the importance of variables in market research. Other approaches have therefore been used. Concerning the decomposition of explained variance, a proof of the equality between the lmg-Shapley method and Johnson's method with two predictors is proposed. It has also been shown that Fabbris's method differs from the methods of Genizi and Johnson, and that the CAR scores of two predictors do not equalize when their correlation tends to 1. A new method, weifila (weighted first last), was defined and published in 2015. The estimation of variable importance with random forests was also analyzed, and the results show that non-linearities are handled well. With Bayesian networks, the multiplicity of solutions and the reliance on restrictions and expert choices argue for cautious use, even though the available tools help in choosing models. The use of weifila or random forests is recommended rather than lmg-Shapley, without neglecting structural approaches and conceptual models. Keywords: regression, variance decomposition, importance, Shapley value, random forests, Bayesian networks. / Abstract: Linear regression is used in market research but faces difficulties due to multicollinearity. Other methods have been considered. A demonstration of the equality between the lmg-Shapley and Johnson methods for variance decomposition has been proposed. This research has also shown that the decomposition proposed by Fabbris is not identical to those proposed by Genizi and Johnson, and that the CAR scores of two predictors do not equalize when their correlation tends towards 1.
A new method, weifila (weighted first last), has been proposed and published in 2015. We have also shown that permutation importance using Random Forest makes it possible to take non-linear relationships into account and deserves broader usage in marketing research. Regarding Bayesian Networks, the multiple solutions available and the reliance on expert-driven restrictions and decisions support the recommendation to be careful in their usage and presentation, even if they make it possible to explore possible structures and run simulations. In the end, weifila or random forests are recommended instead of lmg-Shapley, bearing in mind that the benefit of structural and conceptual models should not be underestimated. Keywords: Linear regression, Variable Importance, Shapley Value, Random Forests, Bayesian Networks
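The permutation importance mentioned in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the fixed function `toy_model` stands in for a fitted random forest, and the data, the mean squared error metric, and all names are assumptions made for illustration. The idea shown is only the general one: shuffle one predictor at a time and record how much the prediction error increases.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link with the target.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical stand-in for a fitted model: non-linear in feature 0,
    ignores feature 1 entirely."""
    return row[0] ** 2


data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)  # feature 0 gets a positive score, feature 1 gets zero
```

In this toy setting the target depends on feature 0 only through its square, so a linear correlation would be close to zero, while the permutation score still flags the feature as important.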
|
93 |
Méthodes Non-Paramétriques de Post-Traitement des Prévisions d'Ensemble / Non-parametric Methods of post-processing for Ensemble Forecasting
Taillardat, Maxime, 11 December 2017
In numerical weather prediction, ensemble forecasting models have become an essential tool for quantifying forecast uncertainty and providing probabilistic forecasts. Unfortunately, these models are not perfect, and a simultaneous correction of their bias and their dispersion is needed. This thesis presents new methods for the statistical post-processing of ensemble forecasts, whose distinguishing feature is that they are based on random forests. Unlike most usual techniques, these non-parametric methods can take into account the non-linear dynamics of the atmosphere. They also make it easy to add covariates (other weather variables, temporal variables, geographic variables...) and select by themselves the most useful predictors in the regression. Moreover, we make no assumption about the distribution of the variable being processed. This new approach outperforms existing methods for variables such as temperature and wind speed. For variables known to be difficult to calibrate, such as six-hourly precipitation, hybrid versions of our techniques were created. We show that these hybrid versions (as well as our original versions) are better than existing methods. In particular, they bring real added value for extreme rainfall. The last part of this thesis concerns the evaluation of ensemble forecasts for extreme events. We have shown some properties of the Continuous Ranked Probability Score (CRPS) for extreme values. We have also defined a new measure combining the CRPS and extreme value theory, whose consistency we examine on a simulation as well as in an operational setting. The results of this work are intended to be integrated into the forecasting and verification chain at Météo-France.
/ In numerical weather prediction, ensemble forecast systems have become an essential tool to quantify forecast uncertainty and to provide probabilistic forecasts. Unfortunately, these models are not perfect and a simultaneous correction of their bias and their dispersion is needed. This thesis presents new statistical post-processing methods for ensemble forecasting. These are based on random forest algorithms, which are non-parametric. Contrary to state-of-the-art procedures, random forests can take into account non-linear features of atmospheric states. They easily allow the addition of covariables (such as other weather variables, seasonal or geographic predictors) by a self-selection of the most useful predictors for the regression. Moreover, we do not make assumptions on the distribution of the variable of interest. This new approach outperforms the existing methods for variables such as surface temperature and wind speed. For variables well known to be tricky to calibrate, such as six-hour accumulated rainfall, hybrid versions of our techniques have been created. We show that these versions (and our original methods) are better than existing ones. In particular, they provide added value for extreme precipitation. The last part of this thesis deals with the verification of ensemble forecasts for extreme events. We have shown several properties of the Continuous Ranked Probability Score (CRPS) for extreme values. We have also defined a new index combining the CRPS and extreme value theory, whose consistency is investigated on both simulations and real cases. The contributions of this work are intended to be inserted into the forecasting and verification chain at Météo-France.
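For readers unfamiliar with the CRPS named in this abstract, the sketch below computes it for a finite ensemble using the commonly cited form, mean |x_i − y| minus half the mean |x_i − x_j| over all member pairs. This is an illustration under stated assumptions, not the estimator used in the thesis: the function name `ensemble_crps` and the example values are hypothetical.

```python
def ensemble_crps(members, observation):
    """CRPS of an ensemble treated as an empirical distribution.

    First term: average absolute error of the members.
    Second term: half the average absolute difference between members,
    which rewards ensembles whose spread matches their error.
    Lower values indicate a better probabilistic forecast.
    """
    m = len(members)
    if m == 0:
        raise ValueError("the ensemble must contain at least one member")
    accuracy = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


# Hypothetical three-member temperature ensemble and observed value.
score = ensemble_crps([1.0, 2.0, 3.0], 2.0)
print(score)
```

With a single member the spread term vanishes and the score reduces to the absolute error, which is why the CRPS is often described as a generalization of the mean absolute error to probabilistic forecasts.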
|
94 |
A Multi-Scale Analysis of Jaguar (Panthera onca) and Puma (Puma concolor) Habitat Selection and Conservation in the Narrowest Section of Panama.
Craighead, Kimberly A., 02 May 2019
No description available.
|
95 |
Modeling distributions of Cantharellus formosus using natural history and citizen science data
Armstrong, Zoey Nicole, 21 April 2021
No description available.
|
96 |
Seasonal Habitat Selection by Greater Sage Grouse in Strawberry Valley Utah
Peck, Riley D., 09 December 2011
This study examined winter habitat use and nesting ecology of greater sage grouse (Centrocercus urophasianus) in Strawberry Valley (SV), Utah located in the north-central part of the state. We monitored sage grouse with the aid of radio telemetry throughout the year, but specifically used information from the winter and nesting periods for this study. Our study provided evidence that sage grouse show fidelity to nesting areas in subsequent years regardless of nest success. We found only 57% of our nests located within the 3 km distance from an active lek typically used to delineate critical nesting habitat. We suggest a more conservative distance of 10 km for our study area. Whenever possible, we urge consideration of nest-area fidelity in conservation planning across the range of greater sage grouse. We also evaluated winter-habitat selection at multiple spatial scales. Sage grouse in our study area selected gradual slopes with high amounts of sagebrush exposed above the snow. We produced a map that identified suitable winter habitat for sage grouse in our study area. This map highlighted core areas that should be conserved and will provide a basis for management decisions affecting Strawberry Valley, Utah.
|
97 |
Habitat Selection by Two K-Selected Species: An Application to Bison and Sage Grouse
Kaze, Joshua Taft, 01 December 2013
Population growth for species with long lifespans and low reproductive rates (i.e., K-selected species) is influenced primarily by both survival of adult females and survival of young. Because survival of adults and young is influenced by habitat quality and resource availability, it is important for managers to understand factors that influence habitat selection during the period of reproduction. My thesis contains two chapters addressing this issue for K-selected species in Utah. Chapter one evaluates habitat selection of greater sage-grouse (Centrocercus urophasianus) on Diamond Mountain during the critical nesting and brood-rearing period. Chapter two addresses selection of birth sites by bison (Bison bison) on Antelope Island, Utah. We collected micro-habitat data for 88 nests and 138 brood locations of greater sage-grouse from 2010-2012 to determine habitat preferences of nesting and brooding sage-grouse. Using random forests modeling techniques, we found that percent sagebrush, percent canopy cover, percent total shrubs, and percent obscurity (Robel pole) best differentiated nest locations from random locations, with selection of higher values in each case. We used a 26-day nesting period to determine an average nest survival rate of 0.35 (95% CI = 0.23 – 0.47) for adults and 0.31 (95% CI = 0.14 – 0.50) for juvenile grouse. Brood sites were closer to habitat edges and contained more forbs and less rock than random locations. Average annual adult female survival across the two-year study period was 0.52 (95% CI = 0.38 – 0.65) compared to 0.43 (95% CI = 0.28 – 0.59) for yearlings. Brooding and nesting habitat at use locations on Diamond Mountain met or exceeded published guidelines for everything but forb cover at nest sites. Adult and juvenile survival rates were in line with average values from around the range, whereas nest success was on the low end of reported values.
For bison, we quantified variables surrounding 35 birth sites and 100 random sites during 2010 and 2011 on Antelope Island State Park. We found females selected birth sites based on landscape attributes such as curvature and elevation, but also distance to anthropogenic features (i.e., human structures such as roads or trails). Models with variables quantifying the surrounding vegetation received no support. Coefficients associated with top models indicated that areas near anthropogenic features had a lower probability of selection as birth sites. Our model predicted 91% of observed birth sites in medium-high or high probability categories. This model of birthing habitat, combined with data on birth timing, provides biologists with a map of high-probability birthing areas and a time of year in which human access to trails or roads could be minimized to reduce conflict between recreation and female bison.
|
98 |
Comparison of Recommendation Systems for Auto-scaling in the Cloud Environment
Boyapati, Sai Nikhil, January 2023
Background: Cloud computing’s rapid growth has highlighted the need for efficient resource allocation. While cloud platforms offer scalability and cost-effectiveness for a variety of applications, managing resources to match dynamic workloads remains a challenge. Auto-scaling, the dynamic allocation of resources in response to real-time demand and performance metrics, has emerged as a solution. Traditional rule-based methods struggle with the increasing complexity of cloud applications. Machine Learning models offer promising accuracy by learning from performance metrics and adapting resource allocations accordingly. Objectives: This thesis addresses auto-scaling recommendations for cloud environments, emphasizing the integration of Machine Learning models and significant application metrics. Its primary objectives are determining the critical metrics for accurate recommendations and evaluating the best recommendation techniques for auto-scaling. Methods: The study first identifies, through experimentation and analysis, the crucial metrics, such as CPU usage and memory consumption, that have a substantial impact on auto-scaling selections. Machine Learning (ML) techniques are selected based on a literature review and then evaluated through further experimentation and analysis. These findings establish a foundation for the subsequent evaluation of ML techniques for auto-scaling recommendations. Results: The performance of Random Forests (RF), K-Nearest Neighbors (KNN), and Support Vector Machines (SVM) is investigated in this research. The results show that RF has higher accuracy, precision, and recall, which is consistent with the significance of the metrics identified earlier. Conclusions: This thesis enhances the understanding of auto-scaling recommendations by combining the findings from metric importance and recommendation technique performance.
The findings show the complex interactions between metrics and recommendation methods, paving the way for the development of adaptive auto-scaling systems that improve resource efficiency and application functionality.
|
99 |
Predicting user churn using temporal information : Early detection of churning users with machine learning using log-level data from a MedTech application / Förutsägning av användaravhopp med tidsinformation : Tidig identifiering av avhoppande användare med maskininlärning utifrån systemloggar från en medicinteknisk produkt
Marcus, Love, January 2023
User retention is a critical aspect of any business or service. Churn is the continuous loss of active users. A low churn rate enables companies to focus more resources on providing better services rather than on recruiting new users. Current published research on predicting user churn disregards time of day and time variability of events and actions through its feature selection or data preprocessing. This thesis empirically investigates the practical benefits of including accurate temporal information for binary prediction of user churn by training a set of Machine Learning (ML) classifiers on differently prepared data. One data preparation approach was based on temporally sorted logs (log-level data set), and the other on stacked aggregations (aggregated data set) with additional engineered temporal features. The additional temporal features included information about relative time, time of day, and temporal variability. The inclusion of the temporal information was evaluated by training and evaluating the classifiers with the different features on a real-world data set from a MedTech application. Artificial Neural Networks (ANNs), Random Forests (RFs), Decision Trees (DTs) and naïve approaches were applied and benchmarked. The classifiers were compared using, among other metrics, the Area Under the Receiver Operating Characteristic Curve (AUC), Positive Predictive Value (PPV, a.k.a. precision) and True Positive Rate (TPR, a.k.a. recall). The PPV scores the classifiers by their accuracy among the positively labeled class, the TPR measures the recognized proportion of the positive class, and the AUC is a metric of general performance. The results demonstrate a statistically significant value of including time variation features overall, and in particular that the classifiers performed better on the log-level data set. An ANN trained on temporally sorted logs performs best, followed by an RF on the same data set.
/ Retaining users is a critical aspect for all companies and service providers. A low user churn makes it possible for companies to focus more resources on providing better services instead of recruiting new users. Previously published research on predicting user churn disregards time of day and temporal variation in logged user activity through the choice of preprocessing methods or variable selection. This thesis empirically investigates the practical benefits of including information about temporal variables, comprising time of day and temporal variation, for binary prediction of user churn by training classifiers on data preprocessed in different ways. Two preprocessing methods are used, one based on temporally sorted logs (log level) and the other on stacked aggregations (aggregated) extended with engineered temporal variables. The inclusion of the temporal variables was evaluated by training and evaluating a set of ML classifiers with the different temporal variables on a real-world data set from a digital medical technology product. ANNs, RFs, DTs and naive approaches were applied and compared on the aggregated data set with and without the temporal variation variables and on the log-level data set. The classifiers were compared using, among others, AUC, PPV and TPR. PPV scores the algorithms by their accuracy among the positively labeled class, TPR evaluates how large a share of the positive class was identified, and AUC is a measure of the classifiers' general performance. The results show a significant value of including the temporal variation variables overall, and in particular that the classifiers performed better on the log-level data set. An ANN trained on temporally sorted logs performs best, followed by an RF on the same data set.
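The three metrics described in this abstract can be made concrete with a small sketch. It assumes binary labels where 1 marks a churning user; the function names and example values are hypothetical and do not come from the thesis. The AUC is computed here by its pairwise definition, the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half.

```python
def ppv_tpr(y_true, y_pred):
    """Return (PPV, TPR) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # PPV (precision): share of predicted positives that are truly positive.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    # TPR (recall): share of true positives that were recognized.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    return ppv, tpr


def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = churned), hard predictions, and model scores.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.2, 0.8]

ppv, tpr = ppv_tpr(y_true, y_pred)
area = auc(y_true, scores)
print(ppv, tpr, area)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for an illustration; a rank-based computation would be the usual choice for large data sets.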
|
100 |
A Comparison of Classification Methods in Predicting the Presence of DNA Profiles in Sexual Assault Kits
Heckman, Derek J., 11 January 2018
No description available.
|