61 |
Spatial random forests for brain lesions segmentation in MRIs and model-based tumor cell extrapolation
Geremia, Ezequiel 30 January 2013
The large size of the datasets produced by medical imaging protocols contributes to the success of supervised discriminative methods for semantic labelling of images. Our study makes use of a general and efficient emerging framework, discriminative random forests, for the detection of brain lesions in multi-modal magnetic resonance images (MRIs). The contribution is three-fold. First, we focus on the segmentation of brain lesions, which is an essential task for diagnosis, prognosis and therapy planning. A context-aware random forest is designed for the automatic multi-class segmentation of MS lesions and of low-grade and high-grade gliomas in MR images. It uses multi-channel MRIs, prior knowledge on tissue classes, and symmetrical and long-range spatial context to discriminate lesions from background. Then, we investigate the promising perspective of estimating the brain tumor cell density from MRIs. A generative-discriminative framework is presented to learn the latent and clinically unavailable tumor cell density from model-based estimations associated with synthetic MRIs. The generative model is a validated and publicly available biophysiological tumor growth simulator. The discriminative model builds on multi-variate regression random forests to estimate the voxel-wise distribution of tumor cell density from input MRIs. Finally, we present the "Spatially Adaptive Random Forests", which merge the benefits of multi-scale and random forest methods, and apply them to the previously cited classification and regression settings. Quantitative evaluations of the proposed methods are carried out on publicly available labeled datasets and demonstrate state-of-the-art performance.
|
62 |
Alternative Sampling and Analysis Methods for Digital Soil Mapping in Southwestern Utah
Brungard, Colby W. 01 May 2009
Digital soil mapping (DSM) relies on quantitative relationships between easily measured environmental covariates and field and laboratory data. We applied innovative sampling and inference techniques to predict the distribution of soil attributes, taxonomic classes, and dominant vegetation across a 30,000-ha complex Great Basin landscape in southwestern Utah. This arid rangeland was characterized by rugged topography, diverse vegetation, and intricate geology. Environmental covariates calculated from digital elevation models (DEM) and spectral satellite data were used to represent factors controlling soil development and distribution. We investigated optimal sample size and sampled the environmental covariates using conditioned Latin Hypercube Sampling (cLHS). We demonstrated that cLHS, a type of stratified random sampling, closely approximated the full range of variability of environmental covariates in feature and geographic space with small sample sizes. Site and soil data were collected at 300 locations identified by cLHS. Random forests was used to generate spatial predictions and associated probabilities of site and soil characteristics. Balanced random forests and balanced and weighted random forests were investigated for their use in producing an overall soil map. Overall and class errors (referred to as out-of-bag [OOB] error) were within acceptable levels. Quantitative covariate importance was useful in determining what factors were important for soil distribution. Random forest spatial predictions were evaluated based on the conceptual framework developed during field sampling.
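The stratified idea behind Latin hypercube sampling can be illustrated with a minimal sketch. This is the plain, unconditioned variant on the unit cube, not the conditioned Latin Hypercube Sampling (cLHS) used in the thesis, which additionally matches the samples to observed covariate values. The function name `latin_hypercube` and the sample sizes are illustrative assumptions.

```python
import random

def latin_hypercube(n_samples, n_dims, rng):
    """Plain Latin hypercube sample on the unit cube [0, 1)^n_dims."""
    columns = []
    for _ in range(n_dims):
        # Draw one point uniformly inside each of n_samples equal-width strata.
        points = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that strata are paired at random across dimensions.
        rng.shuffle(points)
        columns.append(points)
    # Transpose the per-dimension columns into a list of sample points.
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

rng = random.Random(0)
samples = latin_hypercube(10, 3, rng)
```

Each of the 10 strata of every covariate axis receives exactly one sample, which is why a small sample can still cover the full range of each covariate.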
|
63 |
Bewertung der Erfassungswahrscheinlichkeit für globales Biodiversitäts-Monitoring: Ergebnisse von Sampling GRIDs aus unterschiedlichen klimatischen Regionen / An assessment of sampling detectability for global biodiversity monitoring: results from sampling GRIDs in different climatic regions
Nemitz, Dirk 05 December 2008
No description available.
|
64 |
Αναγνώριση και κατάταξη ονομάτων-οντοτήτων σε ελληνικά κείμενα με χρήση τυχαίων δασών / Name entity recognition in Greek texts with random forests
Ζαγγανά, Ελένη 08 January 2013
Named entity recognition and categorization is a very important subtask in several natural language processing applications. In this master's thesis, we present an attempt to recognize and categorize person names, temporal expressions (i.e. dates), areas (cities and countries) and organizations (e.g. the Public Electric Company) using a new supervised learning method for classification, Random Forests. This classification method uses a group of decision trees in which each tree votes for one classification category. The Random Forest then returns the category with the most votes.
Text processing techniques such as stemming and tokenization were applied to a Greek corpus (a collection of texts), yielding a set of features for each word. The set of features was divided into a training set and a test set. The training set was used to train the Random Forest, which then classifies each word into one of the four categories mentioned above. The Random Forest was evaluated on the test set and the results were very satisfactory: for classifying dates and organizations the performance was 96% and the classification accuracy was 93%. Furthermore, for the problem examined, the results obtained with Support Vector Machines and Neural Networks were compared with those of Random Forests.
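The voting step described in this abstract can be sketched as follows. The three "trees" are hypothetical hand-written rules standing in for trained decision trees, and the feature names (`capitalised`, `has_digit`, `all_caps`) are assumptions made for illustration; they are not the features used in the thesis.

```python
from collections import Counter

def forest_predict(trees, features):
    """Return the category that receives the most votes from the trees."""
    votes = Counter(tree(features) for tree in trees)
    label, _count = votes.most_common(1)[0]
    return label

# Hypothetical stand-ins for trained decision trees (illustration only).
def tree_capitalised(f):
    return "PERSON" if f["capitalised"] else "OTHER"

def tree_has_digit(f):
    return "DATE" if f["has_digit"] else "PERSON"

def tree_all_caps(f):
    return "ORGANIZATION" if f["all_caps"] else "PERSON"

trees = [tree_capitalised, tree_has_digit, tree_all_caps]
word = {"capitalised": True, "has_digit": True, "all_caps": False}
prediction = forest_predict(trees, word)  # two votes PERSON, one vote DATE
```

In a real forest each tree is grown on a bootstrap sample with random feature subsets, so the votes disagree in less contrived ways; only the aggregation step is shown here.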
|
65 |
Paralelní zpracování velkých objemů astronomických dat / Parallel Processing of Huge Astronomical Data
Haas, František January 2016
This master thesis focuses on the analysis and implementation of the Random Forests algorithm. Random Forests is a machine learning algorithm for data classification. The goal of the thesis is to implement the Random Forests algorithm using parallel programming techniques and technologies for CPU and GPGPU, together with a reference serial implementation for CPU. The functional and performance attributes of these implementations will be compared and evaluated. Various data sets will be used for the comparison, with an emphasis on real-world data obtained from astronomical observations of stellar spectra. The usefulness of these implementations for stellar spectra classification will be assessed from both the functional and the performance point of view.
|
66 |
Využití statistických metod při oceňování nemovitostí / Valuation of real estates using statistical methods
Funiok, Ondřej January 2017
The thesis deals with the valuation of real estate in the Czech Republic using statistical methods. The work focuses on a complex task based on data from an advertising web portal. The aim of the thesis is to create a prototype of a statistical prediction model for the valuation of residential properties in Prague and to evaluate the possibilities for its wider dissemination. The structure of the work follows the CRISP-DM methodology. Regression trees and random forests are tested on the pre-processed data and used to predict real estate prices.
|
67 |
Using machine learning to determine fold class and secondary structure content from Raman optical activity and Raman vibrational spectroscopy
Kinalwa-Nalule, Myra January 2012
The objective of this project was to apply machine learning methods to determine protein secondary structure content and protein fold class from ROA and Raman vibrational spectral data. Raman and ROA are sensitive to biomolecular structure: the bands of each spectrum correspond to structural elements in proteins and, when combined, give a fingerprint of the protein. However, there are many bands of which little is known. There is a need, therefore, to find ways of extrapolating information from spectral bands and to investigate which regions of the spectra contain the most useful structural information. Support Vector Machine (SVM) classification and Random Forest (RF) classification were used to mine protein fold class information, and Partial Least Squares (PLS) regression was used to determine the secondary structure content of proteins. The classification methods were used to group proteins into α-helix, β-sheet, α/β and disordered fold classes. The PLS regression was used to determine percentage protein structural content from Raman and ROA spectral data. The analyses were performed on spectral bins 10 cm⁻¹ wide and on the spectral amide regions I, II and III. The full spectra and different combinations of the amide regions were also analysed. The SVM analyses, both classification and regression, generally did not perform well. SVM classification models, for example, had low Matthews Correlation Coefficient (MCC) values, below 0.5, although this is still better than a value of zero or below, which would indicate a prediction no better than random chance. The SVM regression analyses also performed very poorly, with average R² values below 0.5. R² is the square of Pearson's correlation coefficient and shows how well predicted and observed structural content values correlate. An R² value close to 1 indicates a good correlation and therefore a good prediction model. The Partial Least Squares regression analyses yielded much improved results with very high accuracies. Analyses of the full spectrum and of the spectral amide regions produced high R² values of 0.8-0.9 for both ROA and Raman spectral data. This high accuracy was also seen in the analysis of the 850-1100 cm⁻¹ backbone region for both ROA and Raman spectra, which indicates that this region could make an important contribution to protein structure analysis. PLS regression on second-derivative Raman spectra showed markedly improved performance, with high R² values of 0.81-0.97. The Random Forest algorithm used here for classification showed good performance. The 2-dimensional plots used to visualise the classification clusters showed clear clusters in some analyses; for example, tighter clustering was observed for the amide I, amide I & III and amide I & II & III spectral regions than for the amide II, amide III and amide II & III regions. The Random Forest algorithm also determines variable importance, which showed which spectral bins were crucial in the classification decisions. The ROA Random Forest analyses generally performed better than the Raman Random Forest analyses: the highest percentage of correctly classified proteins was 75% for ROA and 50% for Raman. The analyses presented in this thesis have shown that Raman and ROA vibrational spectra contain information about protein secondary structure and that this information can be extracted using mathematical methods such as the machine learning techniques presented here. The machine learning methods applied in this project were used to mine information about protein secondary structure, and the work presented here demonstrates that these techniques are useful and could be powerful tools in the determination of protein structure from spectral data.
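The two evaluation measures quoted in this abstract can be sketched as below. Two simplifying assumptions are made: the MCC is shown for the binary case only (the thesis classifies into four fold classes), and R² is taken to be the squared Pearson correlation between observed and predicted values. The function names are illustrative.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Binary MCC (1 = positive, 0 = negative); ranges from -1 to +1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def pearson_r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

mcc_chance = matthews_corrcoef([1, 1, 0, 0], [1, 0, 1, 0])   # half right, half wrong
mcc_perfect = matthews_corrcoef([1, 1, 0, 0], [1, 1, 0, 0])
r2_linear = pearson_r_squared([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

An MCC near 0 corresponds to chance-level agreement and an MCC of 1 to perfect agreement, which is the scale against which the reported values below 0.5 should be read.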
|
68 |
Forêts aléatoires et sélection de variables : analyse des données des enregistreurs de vol pour la sécurité aérienne / Random forests and variable selection : analysis of the flight data recorders for aviation safety
Gregorutti, Baptiste 11 March 2015
New regulations require airlines to establish a risk management strategy to further reduce the number of accidents. Flight data recorder data, which have been little exploited so far, must be analysed systematically in order to identify, measure and monitor how risks evolve. The aim of this thesis is to propose a set of methodological tools for flight data analysis. The work revolves around two statistical topics: variable selection in supervised learning and functional data analysis. Random forests are used because they provide importance measures that can be embedded in variable selection procedures. First, we study the permutation importance measure when the variables are correlated. This criterion is then extended to groups of variables, and a new selection procedure for functional variables is introduced. These methods are applied to the risks of long landing and hard landing, two important questions for airlines. Finally, we present the integration of the proposed methods into the FlightScanner software developed by Safety Line. This innovative solution for air transport allows safety managers both to monitor risks and to track the factors that influence them.
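The permutation importance measure mentioned above can be illustrated with a minimal sketch: shuffle one variable and record how much the prediction error grows. The `model` function is a hypothetical stand-in for a fitted forest, and the data are synthetic; in an actual random forest the measure is usually computed per tree on out-of-bag samples and averaged, which is not shown here.

```python
import random

def mean_squared_error(model, X, y):
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, rng):
    """Increase in error after randomly permuting one feature column."""
    baseline = mean_squared_error(model, X, y)
    column = [row[feature] for row in X]
    rng.shuffle(column)
    # Rebuild the data set with only the chosen column permuted.
    X_perm = [row[:feature] + [value] + row[feature + 1:]
              for row, value in zip(X, column)]
    return mean_squared_error(model, X_perm, y) - baseline

# Hypothetical fitted model (assumption): depends on feature 0 only.
def model(row):
    return 3.0 * row[0]

rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]

imp_relevant = permutation_importance(model, X, y, 0, rng)  # positive
imp_noise = permutation_importance(model, X, y, 1, rng)     # exactly zero
```

Here the two variables are independent, so the measure separates them cleanly; the thesis studies the harder case in which the variables are correlated.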
|
69 |
Bayesian statistical inference for intractable likelihood models / Inférence statistique bayésienne pour les modélisations donnant lieu à un calcul de vraisemblance impossible
Raynal, Louis 10 September 2019
In a statistical inference process, when the likelihood function associated with the observed data cannot be computed, approximations are needed. This is a fairly common case in some application fields, especially for population genetics models. To address this issue, we are interested in approximate Bayesian computation (ABC) methods. These are based solely on simulated data, which are then summarised and compared to the observed data. The comparisons depend on a distance, a similarity threshold and a set of low-dimensional summary statistics, all of which must be chosen carefully. In a parameter inference framework, we propose an approach combining ABC simulations and the random forest machine learning algorithm. We use different strategies depending on the posterior quantity of the parameter that we would like to approximate. Our proposal avoids the usual ABC tuning difficulties, while providing good results and interpretation tools for practitioners. In addition, we introduce posterior measures of prediction error (i.e., conditional on the observed data of interest) computed by means of forests. In a model choice setting, we present a strategy based on groups of models to determine, in population genetics, which events of an evolutionary scenario are more or less well identified. All these approaches are implemented in the R package abcrf. In addition, we investigate how to build local random forests, which take into account the observation to predict during their training phase in order to improve prediction accuracy. Finally, we present two case studies that benefited from these developments, dealing with the reconstruction of the evolutionary history of Pygmy populations and of two subspecies of the desert locust Schistocerca gregaria.
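The classical ABC scheme that this abstract starts from (simulate, summarise, compare with a distance and a threshold) can be sketched as a rejection sampler. This is the baseline method, not the random forest approach of the thesis or the abcrf package. The Gaussian simulator, the uniform prior on [-5, 5], the sample mean as summary statistic and the threshold of 0.2 are all assumptions chosen for illustration.

```python
import random

def simulate(theta, n, rng):
    """Toy simulator (assumption): n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def summary(data):
    """Low-dimensional summary statistic: the sample mean."""
    return sum(data) / len(data)

def abc_rejection(observed, n_sims, threshold, rng):
    """Keep prior draws whose simulated summary is close to the observed one."""
    obs_summary = summary(observed)
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(-5.0, 5.0)             # draw from the prior
        simulated = simulate(theta, len(observed), rng)
        distance = abs(summary(simulated) - obs_summary)
        if distance < threshold:                   # similarity threshold
            accepted.append(theta)
    return accepted

rng = random.Random(42)
observed = simulate(2.0, 50, rng)                  # pseudo-observed data
posterior_sample = abc_rejection(observed, 5000, 0.2, rng)
```

The accepted values approximate the posterior of `theta`. The result depends on the distance, the threshold and the summary statistic, which are exactly the tuning choices the abstract says the forest-based approach is designed to avoid.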
|
70 |
Coreference Resolution for Swedish / Koreferenslösning för svenska
Vällfors, Lisa January 2022
This report explores possible avenues for developing coreference resolution methods for Swedish. Coreference resolution is an important topic within natural language processing, as it is used as a preprocessing step in various information extraction tasks. The topic has been studied extensively for English, but much less so for smaller languages such as Swedish. In this report we adapt two coreference resolution algorithms, originally used for English, for use on Swedish texts. One algorithm is entirely rule-based, while the other uses machine learning. We have also annotated parts of the Talbanken corpus with coreference relations, to be used for training and evaluation. Both algorithms showed promising results, and as neither clearly outperformed the other we conclude that both would be good candidates for further development. For the rule-based algorithm, more advanced rules, especially ones that could incorporate some semantic knowledge, were identified as the most important avenue of improvement. For the machine learning algorithm, more training data would likely be the most beneficial. For both algorithms, improved detection of mention spans would also help, as this was identified as one of the most error-prone components.
|