101
Intrångsdetektering på CAN bus data : En studie för likvärdig jämförelse av metoder / Intrusion detection on CAN bus data: a study for an equivalent comparison of methods. Hedman, Pontus; Skepetzis, Vasilios. January 2020.
Hacker attacks carried out on modern vehicles illustrate the need for early threat detection in this environment, especially given the industry trend whereby modern vehicles can now be classed as IoT devices. Known attacks, some carried out remotely, allow a perpetrator to stop a vehicle in operation or to disable its brakes. This study examines the detection of attacks carried out on a real car by studying CAN bus messages. Two methods, CUSUM from the field of change point detection and Random Forests from the field of machine learning, are applied to real data and then compared against each other on simulated data. A new hypothesis definition is introduced which allows the evaluation method conditional expected delay to be used in the Random Forests case, so that the results can be compared with the corresponding evaluation results for CUSUM. Conditional expected delay has not previously been studied for a machine learning method. Both methods are also evaluated with ROC curves. The combined hypothesis definition for the two separate fields allows the two models to be compared against each other using each other's established evaluation methods. The study thereby presents a method and a hypothesis that bridge change point detection and machine learning, so that the two can be evaluated under jointly motivated parameter values.
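The abstract contrasts a CUSUM change point detector with a Random Forests classifier for flagging anomalous CAN traffic. The sketch below illustrates only the CUSUM side on a univariate stream with a known in-control mean; the drift and threshold values are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def cusum_detector(x, mu0, drift=0.5, h=5.0):
    """One-sided CUSUM: return the first index where the cumulative positive
    deviation from the in-control mean mu0 exceeds the threshold h, else None.
    Parameter values are illustrative only."""
    s = 0.0
    for t, xt in enumerate(x):
        s = max(0.0, s + (xt - mu0 - drift))
        if s > h:
            return t  # detection delay = t minus the true change point
    return None

# Toy usage: a mean shift injected at t = 200.
rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 100)])
print(cusum_detector(stream, mu0=0.0))
```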
102
It’s a Match: Predicting Potential Buyers of Commercial Real Estate Using Machine Learning. Hellsing, Edvin; Klingberg, Joel. January 2021.
This thesis explored the development and potential effects of an intelligent decision support system (IDSS) for predicting potential buyers of commercial real estate. The overarching need for an IDSS of this type was identified to stem from information overload, which the IDSS aims to reduce. By shortening the time needed to process data, time can instead be allocated to making sense of the environment together with colleagues. The system architecture explored consisted of clustering commercial real estate buyers into groups based on their characteristics, and training a prediction model on historical transaction data for the Swedish market from the cadastral and land registration authority (Lantmäteriet). The prediction model was trained to predict which of the cluster groups is most likely to buy a given property. For the clustering, three different algorithms were used and evaluated: one density-based, one centroid-based and one hierarchical. The best-performing clustering model was the centroid-based one (K-means). For the predictions, three supervised machine learning algorithms were used and evaluated: Naive Bayes, Random Forests and Support Vector Machines. The model based on Random Forests performed best, with an accuracy of 99.9%.
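A minimal sketch of the two-stage architecture described above (cluster buyers, then predict the buyer cluster for a given property), assuming synthetic stand-in data and hypothetical feature dimensions; it is not the thesis's implementation and the random data make the reported accuracy meaningless.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
buyer_features = rng.normal(size=(500, 6))      # hypothetical buyer characteristics
property_features = rng.normal(size=(500, 10))  # hypothetical transaction/property features

# Stage 1: group buyers by their characteristics (centroid-based clustering).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(buyer_features)
buyer_group = kmeans.labels_                    # one group label per historical transaction

# Stage 2: learn to predict the buyer group from the property that was sold.
X_tr, X_te, y_tr, y_te = train_test_split(property_features, buyer_group, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy (toy data):", clf.score(X_te, y_te))
```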
103
Machine Learning Based Prediction and Classification for Uplift Modeling / Maskininlärningsbaserad prediktion och klassificering för inkrementell responsanalys. Börthas, Lovisa; Krange Sjölander, Jessica. January 2020.
The desire to model the true gain from targeting an individual for marketing purposes has led to the widespread use of uplift modeling. Uplift modeling requires both a treatment group and a control group, and the objective is to estimate the difference between the success probabilities in the two groups. The probabilities in uplift models can be estimated efficiently with statistical machine learning methods. In this project the uplift modeling approaches Subtraction of Two Models, Modeling Uplift Directly and the Class Variable Transformation are investigated. The statistical machine learning methods applied are Random Forests and Neural Networks, along with the standard method Logistic Regression. The data come from a well-established retail company, and the purpose of the project is to investigate which uplift modeling approach and statistical machine learning method yield the best performance given the data used in this project. The variable selection step was shown to be a crucial component of the modeling process, as was the amount of control data in each data set. For the uplift modeling to be successful, the method of choice should be either Modeling Uplift Directly using Random Forests, or the Class Variable Transformation using Logistic Regression. Neural network based approaches are sensitive to uneven class distributions and were hence not able to produce stable models given the data used in this project. Furthermore, the Subtraction of Two Models approach did not perform well, because each model tended to focus too much on modeling the class within its own data set instead of modeling the difference between the class probabilities. The conclusion is hence to use an approach that models the uplift directly, and to use a relatively large amount of control data in each data set.
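A short sketch of the Class Variable Transformation mentioned above, under the standard assumption of roughly balanced (50/50) treatment assignment: the transformed label equals 1 for treated responders and control non-responders, and the uplift is recovered from the model's class probability. Data, features and coefficients are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 5))              # hypothetical customer features
t = rng.integers(0, 2, size=n)           # 1 = received the campaign (assumed ~50/50)
p = np.clip(1 / (1 + np.exp(-X[:, 0])) + 0.1 * t * (X[:, 1] > 0), 0, 1)
y = rng.binomial(1, p)                   # observed response

# Class Variable Transformation: z = y*t + (1-y)*(1-t), i.e. z = 1 when y == t.
z = (y == t).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, z)

# Under 50/50 treatment assignment, uplift(x) ~= 2 * P(z = 1 | x) - 1.
uplift = 2 * model.predict_proba(X)[:, 1] - 1
print(uplift[:5])
```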
104
Dimension Flexible and Adaptive Statistical Learning. Khowaja, Kainat. 02 March 2023.
As interdisciplinary research, this thesis couples statistical learning with current advanced methods to deal with high dimensionality and nonstationarity. Chapter 2 provides tools for statistical inference (uniformly over the covariate space) on the parameter functions of Generalized Random Forests, identified as the solution of a local moment condition. This is done either via a high-dimensional Gaussian approximation theorem or via the multiplier bootstrap. The theoretical aspects of both approaches are discussed in detail, alongside extensive simulations and real-life applications. In Chapter 3, we extend the local parametric approach to time-varying Poisson processes, providing a tool to find intervals of homogeneity within time series of count data in a nonstationary setting. The methodology involves recursive likelihood ratio tests whose test statistic is a maximum with unknown distribution. To approximate it and find the critical value, we use the multiplier bootstrap and demonstrate the utility of the algorithm on German M&A data. Chapter 4 is concerned with creating low-dimensional approximations of high-dimensional data from dynamical systems. Using various resampling methods, Principal Component Analysis and interpolation techniques, we construct reduced-dimensional surrogate models that provide faster responses than the original high-fidelity models. In Chapter 5, we aim to link the distributional characteristics of cryptocurrencies to their underlying mechanisms. We use characteristic-based spectral clustering to cluster cryptocurrencies with similar behaviour in terms of price, block time and block size, and scrutinize these clusters to find common mechanisms between the various crypto clusters.
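The multiplier bootstrap recurs in Chapters 2 and 3 above. The sketch below shows the generic idea only, for a max-type statistic built from centred column means perturbed by i.i.d. Gaussian multipliers; it is not the thesis's Generalized Random Forests procedure, and the dimensions and level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 400, 50
scores = rng.normal(size=(n, p))         # hypothetical centred score/influence values

# Statistic: maximum absolute scaled column mean.
stat = np.max(np.abs(scores.mean(axis=0)) * np.sqrt(n))

# Multiplier bootstrap: perturb observations with i.i.d. N(0,1) multipliers and
# recompute the maximum to approximate the statistic's null distribution.
B = 2000
centered = scores - scores.mean(axis=0)
boot = np.empty(B)
for b in range(B):
    xi = rng.normal(size=(n, 1))
    boot[b] = np.max(np.abs((xi * centered).mean(axis=0)) * np.sqrt(n))

crit = np.quantile(boot, 0.95)           # bootstrap critical value at the 5% level
print(stat, crit)
```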
105
Moderní regresní metody při dobývání znalostí z dat / Modern regression methods in data mining. Kopal, Vojtěch. January 2015.
The thesis compares several non-linear regression methods on synthetic data sets generated using standard benchmarks for continuous black-box optimization. For this comparison, we have chosen the following regression methods: radial basis function networks, Gaussian processes, support vector regression and random forests. We have also included polynomial regression, which we use to explain the basic principles of regression. The comparison of these methods is discussed in the context of black-box optimization problems, where the selected methods can be applied as surrogate models. The methods are evaluated based on their mean squared error and on Kendall's rank correlation coefficient between the ordering of function values according to the model and according to the function used to generate the data.
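A brief illustration of the two evaluation criteria named above (mean squared error and Kendall's rank correlation between model-based and true orderings), using a random forest surrogate on a toy sphere benchmark; the benchmark, sample sizes and hyperparameters are assumptions, not the thesis setup.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.uniform(-5, 5, size=(600, 3))
y = np.sum(X**2, axis=1)                 # toy sphere function as the black box

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = surrogate.predict(X_te)

print("MSE:", mean_squared_error(y_te, pred))
tau, _ = kendalltau(y_te, pred)          # agreement between true and predicted orderings
print("Kendall's tau:", tau)
```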
106
Modélisation de l’incertitude sur les trajectoires d’avions / Uncertainty modeling on aircraft trajectories. Fouemkeu, Norbert. 22 October 2010.
In this thesis we propose probabilistic and statistical models for multidimensional data analysis aimed at forecasting the uncertainty on aircraft trajectories. Assuming that during the flight each aircraft follows the 3D trajectory contained in its filed flight plan, we use the characteristics of the flight environment as explanatory variables for the time at which an aircraft crosses the points of its planned trajectory. These characteristics are: weather and atmospheric conditions, the current flight parameters, the information contained in the filed flight plans, and the traffic complexity. The dependent variable in this study is the difference between the times observed during the flight and the times planned in the flight plans for crossing the points of the planned trajectory: the temporal difference. Using a technique based on recursive partitioning of a data sample, we built four models. The first, called classical CART, is based on Breiman's CART method: a regression tree is used to build a typology of the points of the flight trajectories as a function of the above characteristics and to forecast the crossing times of aircraft at these points. The second model, called amended CART, is an improved version of the first: the forecasts given by the mean of the dependent variable inside the terminal nodes of classical CART are replaced by new forecasts given by multiple regressions fitted inside these nodes. This new model, developed with a stepwise variable selection and elimination algorithm, is parsimonious: for each terminal node it explains the flight time with the explanatory variables most relevant to that node. The third model is based on the MARS method, multivariate adaptive regression splines. Besides the continuity of the estimator of the dependent variable, this model makes it possible to assess the direct effects of the predictors, and of their interactions, on the crossing times of aircraft at the points of their planned trajectory. The fourth model uses bootstrap sampling, namely random forests: for each bootstrap sample of the initial data set a regression tree is built, and the forecast of the overall model is obtained by aggregating the forecasts over the set of trees. Despite the overfitting observed for this model, it is robust and offers a solution to the instability problem of regression trees inherent to the CART method. The models were assessed and validated using test data. Applying them to forecast sector load, measured as the number of aircraft entering a sector, showed that a forecast horizon of about 20 minutes, for a time window larger than 20 minutes, gives forecasts with relative errors below 10%. Among these models, classical CART and random forests performed best. Hence, for the authority regulating air traffic flows, these models can serve as a decision aid for regulating and planning the load of the sectors of controlled airspace.
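A minimal sketch of the "amended CART" idea described above: fit a regression tree, then replace each terminal node's constant prediction with a multiple regression fitted on the observations in that node. Data, tree size and features are hypothetical, and the thesis additionally uses stepwise variable selection inside each node, which is omitted here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 4))                     # hypothetical flight/weather features
y = np.where(X[:, 0] > 0, 3 * X[:, 1], -2 * X[:, 2]) + rng.normal(0, 0.3, 2000)

tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)
leaf = tree.apply(X)                               # terminal node id of each observation

# Replace each leaf's constant prediction with a local multiple regression.
leaf_models = {l: LinearRegression().fit(X[leaf == l], y[leaf == l])
               for l in np.unique(leaf)}

def predict_amended(X_new):
    leaves = tree.apply(X_new)
    return np.array([leaf_models[l].predict(x.reshape(1, -1))[0]
                     for l, x in zip(leaves, X_new)])

print(predict_amended(X[:5]), y[:5])
```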
107
O consumo de filmes em cinemas no Brasil: uma análise de florestas aleatórias / Movies' consumption in Brazil: a random forests analysis. Justa, Ticiana Sá da. 04 January 2019.
Based on two distinct moments of Brazilian reality, the years 2002/03 and 2008/09, this study assesses whether the profile of consumers of films at movie theaters changed, using observable characteristics of consumers and non-consumers. Random forest data-mining techniques are applied to microdata from the Family Budget Surveys (POF) of 2002/2003, when broadband use in Brazil was practically nil, and of 2008/2009, when broadband was already established in the country. This difference in time and in broadband access provides a window of opportunity for our objective, given the high degree of technological change over the period. Although it is not methodologically possible to fully isolate the effect of Internet access on film consumption, it is expected that the higher the access speed and the more advanced the file-compression technologies, the greater the distribution of content on the network, enabling film consumption through alternatives to movie theaters. In this context, besides identifying whether the profile of movie-theater consumers in Brazil changed after the popularization of broadband in the country, which can officially be dated to 2006, we also assess whether these consumers differ significantly from non-consumers in their observable characteristics. Additionally, in a second study, a difference-in-differences intervention model is used to investigate the effect of broadband on cinema consumption. The results point to clear distinctions between the profiles of consumers and non-consumers of films at movie theaters.
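One way to read "a random forests analysis" of consumer profiles is to train a classifier that separates consumers from non-consumers in a survey wave and inspect which observable characteristics drive the separation. The sketch below uses entirely synthetic stand-in data with hypothetical variable names; it is not the POF data or the author's code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
# Hypothetical stand-in for survey microdata: household characteristics plus a
# 0/1 indicator of spending on movie theaters.
df = pd.DataFrame({
    "income": rng.lognormal(8, 1, 3000),
    "age": rng.integers(18, 80, 3000),
    "urban": rng.integers(0, 2, 3000),
    "schooling_years": rng.integers(0, 20, 3000),
})
p = 1 / (1 + np.exp(-(0.00005 * df["income"] + 0.1 * df["urban"] - 2)))
df["cinema_consumer"] = rng.binomial(1, p)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(df.drop(columns="cinema_consumer"), df["cinema_consumer"])
# Variable importances indicate which characteristics best separate consumers
# from non-consumers in this (synthetic) survey wave.
print(dict(zip(df.columns[:-1], rf.feature_importances_.round(3))))
```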
109
Segmentation d'image par intégration itérative de connaissances / Image segmentation by iterative knowledge integration. Chaibou salaou, Mahaman Sani. 02 July 2019.
Image processing has been a very active area of research for years. The interpretation of images is one of its most important branches because of its socio-economic and scientific applications. However, interpretation, like most image processing pipelines, requires a segmentation phase to delimit the regions to be analyzed. Interpretation is the process that gives meaning to the regions detected by the segmentation phase, so the interpretation phase can only analyze the regions detected during segmentation. Although the ultimate objective of automatic interpretation is to produce the same result as a human, the logic of classical techniques in this field does not match that of human interpretation. Most conventional approaches separate the segmentation phase from the interpretation phase: the images are first segmented and the detected regions are then interpreted. In addition, conventional segmentation techniques scan images sequentially, in the order in which the pixels are stored. This does not necessarily reflect the way a human expert explores an image. Indeed, a human usually starts by scanning the image for possible regions of interest. When a potential area is found, it is analyzed from three viewpoints in an attempt to recognize what object it is. First, the expert analyzes the area based on its physical characteristics. Then the surrounding areas are considered, and finally the expert zooms out to the whole image in order to have a wider view while still considering the information local to the region and to its neighbours. In addition to information gathered directly from the physical characteristics of the image, the expert draws on and fuses several sources of information to interpret the image: knowledge acquired through professional experience, known constraints between the objects in this type of image, and so on. The idea of the approach presented here is that simulating the visual activity of the expert should allow better agreement between the results of the interpretation and those of the expert. From this analysis we retain three important aspects of the image interpretation process that are modelled in this work: 1. Unlike what most segmentation techniques suggest, the segmentation process is not necessarily sequential, but rather a series of decisions, each of which may call the results of its predecessors into question; the main objective is to produce the best possible classification of regions, and interpretation should not be limited by segmentation. 2. The process of characterizing an area of interest is not one-way: the expert can move from a local view restricted to the region of interest to a wider view including its neighbours, and back again. 3. During the characterization decision, several sources of information are gathered and fused for greater certainty. The proposed model of these three levels places particular emphasis on the knowledge used and on the reasoning that leads to the segmentation of images.
110
隨機森林分類方法於基因組顯著性檢定上之應用 / Assessing the significance of a Gene Set. 卓達瑋. Unknown Date.
Nowadays microarray data analysis has become an important issue in biomedical research, and one major goal is to explore the relationship between gene expressions and specific phenotypes. Many methods developed so far are single-gene-based: they use only the information of individual genes and cannot appropriately take the relationships among genes into account. This research focuses on gene set analysis, which carries out a statistical test of the significance of a set of genes with respect to a phenotype. In order to capture the relationship between a gene set and the phenotype, we combine a classical testing framework with classification and propose using the performance of a complex classifier in the statistical test: the test error rate of a Random Forests classification is adopted as the test statistic, and the statistical conclusion is drawn from its permutation-based p-value. We compare our test with seven existing gene set analysis methods through simulation studies and find that our method has leading performance in terms of a controlled type I error rate and high power. Finally, the method is applied to several real gene expression data sets and the results are briefly discussed.
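A minimal sketch of the permutation scheme described above: the test error of a Random Forests classifier is the test statistic, and the null distribution is obtained by permuting the phenotype labels. The expression matrix, sample sizes and number of permutations are hypothetical, not the thesis data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def rf_test_error(X, y, seed=0):
    """Held-out misclassification rate of a Random Forests classifier."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed, stratify=y)
    clf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X_tr, y_tr)
    return 1.0 - clf.score(X_te, y_te)

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 25))            # hypothetical expression of a 25-gene set
y = rng.integers(0, 2, size=80)          # binary phenotype
X[y == 1, :5] += 0.8                     # toy signal in a few genes

observed = rf_test_error(X, y)
# Permutation null: break the gene-phenotype link by shuffling the phenotype labels.
B = 200
null = np.array([rf_test_error(X, rng.permutation(y), seed=b) for b in range(B)])
# A small observed test error relative to the null indicates a significant gene set.
p_value = (1 + np.sum(null <= observed)) / (B + 1)
print(observed, p_value)
```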