301

Anomaly Detection in Categorical Data with Interpretable Machine Learning: A random forest approach to classify imbalanced data

Yan, Ping January 2019 (has links)
Metadata refers to "data about data", which contains information needed to understand the process of data collection. In this thesis, we investigate whether metadata features can be used to detect broken data and how a tree-based interpretable machine learning algorithm can be used for effective classification. The goal of this thesis is two-fold. Firstly, we apply a classification schema using metadata features for detecting broken data. Secondly, we generate the feature importance rate to understand the model's logic and reveal the key factors that lead to broken data. The given task from the Swedish automotive company Veoneer is a typical problem of learning from an extremely imbalanced data set, with 97 percent of the data belonging to healthy data and only 3 percent belonging to broken data. Furthermore, the whole data set contains only categorical variables on nominal scales, which brings challenges to the learning algorithm. Handling imbalanced problems for continuous data is relatively well studied, but for categorical data the solution is not straightforward. In this thesis, we propose a combination of tree-based supervised learning and hyper-parameter tuning to identify the broken data in a large data set. Our method is composed of three phases: data cleaning, which eliminates ambiguous and redundant instances; supervised learning with a random forest; and, lastly, a random search for hyper-parameter optimization of the random forest model. Our results show empirically that the tree-based ensemble method together with a random search for hyper-parameter optimization improves random forest performance in terms of the area under the ROC curve. The model achieved an acceptable classification result and showed that metadata features are capable of detecting broken data and of providing an interpretable result by identifying the key features of the classification model.
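As a rough illustration of the pipeline this abstract describes (one-hot encoding of nominal metadata features, a random forest, a random search over hyper-parameters scored by the area under the ROC curve, then feature-importance inspection), a sketch in Python with scikit-learn might look as follows. This is not the thesis code: the file name, column names and parameter ranges are assumptions.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical metadata table: nominal feature columns plus a binary label
# (1 = broken, 0 = healthy), with roughly 3 percent positives as in the thesis.
df = pd.read_csv("metadata.csv")
X, y = df.drop(columns=["broken"]), df["broken"]

pipe = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),                       # nominal -> dummy variables
    ("rf", RandomForestClassifier(class_weight="balanced", random_state=0)),
])

# Random search over a few forest hyper-parameters, scored by ROC AUC,
# which is more informative than accuracy under a 97/3 class split.
param_dist = {
    "rf__n_estimators": [100, 300, 500],
    "rf__max_depth": [None, 10, 20, 40],
    "rf__min_samples_leaf": [1, 5, 10],
    "rf__max_features": ["sqrt", "log2"],
}
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)
search = RandomizedSearchCV(pipe, param_dist, n_iter=20, scoring="roc_auc", cv=5, random_state=0)
search.fit(X_tr, y_tr)
print("test ROC AUC:", search.score(X_te, y_te))

# Impurity-based feature importances for interpretability (per one-hot column).
best = search.best_estimator_
names = best.named_steps["encode"].get_feature_names_out()
importances = best.named_steps["rf"].feature_importances_
print(sorted(zip(importances, names), reverse=True)[:10])

The particular hyper-parameter ranges can be widened or narrowed; the point is only that the search is scored by ROC AUC rather than accuracy.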
302

Unconstrained Gaze Estimation Using RGB-D Camera. / Estimation du regard avec une caméra RGB-D dans des environnements utilisateur non-contraints

Kacete, Amine 15 December 2016 (has links)
In this thesis, we tackle the problem of automatic gaze estimation in unconstrained user environments. This work falls within computer vision applied to the perception of humans and their behavior. Several industrial solutions are commercialized today and provide accurate gaze estimates, but many rely on complex hardware, such as infrared cameras embedded in a head-mounted device or in a remote system, making them intrusive, strongly constrained by the user's environment and unsuitable for large-scale public use. This thesis aims to produce a gaze estimation system that increases the user's freedom of movement with respect to the camera (head movement, user-sensor distance) and reduces system complexity by using relatively simple, cheap, low-resolution and non-intrusive sensors such as the Kinect. We develop new methods to address challenging conditions such as head pose changes, illumination conditions and large user-sensor distances. We investigated several gaze estimation paradigms. We first developed two automatic gaze estimation systems following two classical approaches: a feature-based and a semi-appearance-based approach. The major limitation of these paradigms lies in designing gaze systems that assume total independence between the eye appearance and head pose blocks. To overcome this limitation, we converged towards a novel paradigm that unifies the two previous components by building a global gaze manifold, and we explored two global approaches using synthetic and real RGB-D gaze samples respectively.
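The "global" paradigm sketched here, in which eye appearance and head pose feed a single learner rather than two independent blocks, can be illustrated roughly as below. The random forest regressor and the randomly generated stand-in data are assumptions for illustration only; the thesis's actual feature extraction from RGB-D frames is not reproduced.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Stand-in data: each sample is a flattened 15x9 grey-level eye patch (135 values)
# plus a 3-D head pose (yaw, pitch, roll); the target is a 2-D gaze angle.
eye_patches = rng.random((n, 135))
head_pose = rng.uniform(-30, 30, (n, 3))
gaze = rng.uniform(-20, 20, (n, 2))

# "Global" paradigm: appearance and head pose enter one joint input space
# instead of being handled by two independent blocks.
X = np.hstack([eye_patches, head_pose])
X_tr, X_te, y_tr, y_te = train_test_split(X, gaze, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print("mean gaze error (degrees):", np.mean(np.linalg.norm(pred - y_te, axis=1)))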
303

Sinergismo entre eventos climáticos extremos, desmatamento e aumento da suscetibilidade a incêndios florestais no Estado do Acre / Synergism between extreme weather events, deforestation and increased susceptibility and risk of forest fires in Acre state

Tostes, Juliana de Oliveira 29 February 2016 (has links)
This research analyzes the temporal and spatial variables that can affect the distribution and frequency of hot spots in the state of Acre. Given the scarcity of regularly gridded data with long time series for the study area, a validation was first carried out between gridded air temperature and precipitation data from the Global Precipitation Climatology Centre (GPCC), the University of Delaware (UDEL) and the Global Historical Climatology Network (GHCN) and data from five conventional weather stations (EMC) in Acre and its surroundings, through an analysis of the precision and accuracy of the data. For precipitation, both GPCC and UDEL represented the mean variability well throughout the series. For air temperature, although the precision of GHCN and UDEL was low, their accuracy was satisfactory according to the statistical methods. Assuming that extreme weather events increase susceptibility to forest fires, an analysis was then carried out of the influence of climate variability modes on the generation of categorized scenarios of dry or wet years, based on the Standardized Precipitation Index (SPI) and on Harmonic and Spectral Analysis (AHE). The AHE was not able to identify the intensity of the events, but was satisfactory in identifying the sign of the anomaly cycles, i.e., whether the SPI anomaly was positive or negative. The Atlantic signal had a greater influence on precipitation than the Pacific. For the regions corresponding to Groups 1, 2 and 3, an inverse precipitation pattern with respect to ENSO was observed compared to the northern and eastern Amazon: negative precipitation anomalies were identified during La Niña events and positive anomalies during El Niño events, for both the dry and rainy seasons. For the region corresponding to Group 4 the effect was the opposite. The natural climate variability patterns identified in this study may contribute to the establishment of strategies for prevention of and adaptation to extreme events. Finally, in Chapter 3 an analysis was carried out of the spatial and temporal patterns of fire in Acre, through a discussion of the various climatic, environmental and anthropogenic variables that contribute to its occurrence. Using the Random Forest algorithm, susceptibility maps were generated that estimate the probability of fires and burnings in the state. Although drought triggers an increase in the number of hot spots, their spatial pattern is more related to human factors, such as proximity to already deforested areas.
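A minimal sketch of the susceptibility-mapping step described above, in which a Random Forest estimates a fire probability per grid cell from climatic, environmental and anthropogenic predictors, might look as follows in Python with scikit-learn. The file name and predictor columns are hypothetical.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical per-grid-cell table: climatic, environmental and anthropogenic
# predictors plus a binary label (1 = hot spot detected in the cell).
df = pd.read_csv("acre_cells.csv")
predictors = ["spi", "mean_temp", "dist_to_deforested_km", "dist_to_roads_km", "land_cover_code"]
X, y = df[predictors], df["fire"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
rf.fit(X_tr, y_tr)

# The susceptibility map is the predicted fire probability per cell,
# which can then be written back onto the spatial grid.
df["susceptibility"] = rf.predict_proba(X)[:, 1]
print(dict(zip(predictors, rf.feature_importances_)))   # relative weight of climatic vs. human factors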
304

THREE ESSAYS ON THE APPLICATION OF MACHINE LEARNING METHODS IN ECONOMICS

Lawani, Abdelaziz 01 January 2018 (has links)
Over the last few decades, economics as a field has experienced a profound transformation from theoretical work toward an emphasis on empirical research (Hamermesh, 2013). One common constraint of empirical studies is access to data, the quality of the data and the time span it covers. In general, applied studies rely on surveys, administrative or private sector data. These data are limited and rarely have universal or near-universal population coverage. The growth of the internet has made available a vast amount of digital information. These big digital data are generated through social networks, sensors, and online platforms. They account for an increasing part of economic activity, yet for economists the availability of these big data also raises many new challenges related to the techniques needed to collect, manage, and derive knowledge from them. The data are in general unstructured, complex and voluminous, and the traditional software used for economic research is not always effective in dealing with these types of data. Machine learning is a branch of computer science that uses statistics to deal with big data. The objective of this dissertation is to reconcile machine learning and economics. It uses three case studies to demonstrate how data freely available online can be harvested and used in economics. The dissertation uses web scraping to collect large volumes of unstructured data online. It uses machine learning methods to derive information from the unstructured data and shows how this information can be used to answer economic questions or address econometric issues. The first essay shows how machine learning can be used to derive sentiments from reviews and, using the sentiments as a measure of quality, examines an old economic theory: price competition in oligopolistic markets. The essay confirms the economic theory that agents compete on price. It also confirms that the quality measure derived from sentiment analysis of the reviews is a valid proxy for quality and influences price. The second essay uses a random forest algorithm to show that reviews can be harnessed to predict consumers' preferences. The third essay shows how property descriptions can be used to address an old but still current problem in hedonic pricing models: omitted variable bias. Using the Least Absolute Shrinkage and Selection Operator (LASSO), it shows that pricing errors in hedonic models can be reduced by including the descriptions of the properties in the models.
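The third essay's idea, reducing omitted variable bias by letting LASSO select informative terms from property descriptions alongside the usual hedonic controls, could be sketched roughly as below. The listings file, column names and tuning choices are assumptions, not the dissertation's actual specification.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical listings table: sale price, structural attributes and the free-text ad description.
df = pd.read_csv("listings.csv")
y = np.log(df["price"])

pre = ColumnTransformer([
    ("text", TfidfVectorizer(max_features=5000), "description"),    # words from the property description
    ("struct", "passthrough", ["bedrooms", "bathrooms", "sqft"]),   # usual hedonic controls
])
model = make_pipeline(pre, LassoCV(cv=5))   # LASSO zeroes out most text terms, keeping the informative ones

X_tr, X_te, y_tr, y_te = train_test_split(df, y, random_state=0)
model.fit(X_tr, y_tr)
print("out-of-sample R^2:", model.score(X_te, y_te))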
305

Characterization Of Taxonomically Related Some Turkish Oak (Quercus L.) Species In An Isolated Stand: A Morphometric Analysis Approach

Aktas, Caner 01 June 2010 (has links) (PDF)
The genus Quercus L. is represented by more than 400 species in the world, and 18 of these species are found naturally in Turkey. Despite its taxonomical, phytogeographical and dendrological importance, the genus Quercus is still taxonomically one of the most problematic woody genera in the Turkish flora. In this study, a multivariate morphometric approach was used to analyze oak specimens collected from an isolated forest (Beynam Forest, Ankara) where Quercus pubescens Willd., Q. infectoria Olivier subsp. boissieri (Reuter) O. Schwarz and Q. macranthera Fisch. & C. A. Mey. ex Hohen. subsp. syspirensis (C.Koch) Menitsky, taxa belonging to section Quercus sensu stricto (s.s.), are found. Additional oak specimens were included in the analysis for comparison. The morphometric study was based on 52 leaf characters, including distance, angle and area measurements as well as counted, descriptive and calculated variables. The morphometric variables were computed automatically from landmark and outline data. The random forest classification method was used to select discriminating variables and to predict unidentified specimens using a pre-identified training group. The results of the random forest variable selection procedure and the principal component analysis (PCA) showed that the morphometric variables could distinguish the specimens of Q. pubescens and Q. macranthera subsp. syspirensis mostly based on overall leaf size and the number of intercalary veins, while the specimens of Q. infectoria subsp. boissieri were separated from the others based on lobe and lamina base shape. Finally, micromorphological observations of the abaxial lamina surface were performed with a scanning electron microscope (SEM) on selected specimens and were found useful for differentiating, in particular, the specimens of Q. macranthera subsp. syspirensis and its putative hybrids from the other taxa.
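The combination of random forest variable selection and PCA described above can be sketched as follows; the file name, the number of retained variables and the column layout are assumptions for illustration.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Hypothetical table: 52 leaf variables per specimen plus the taxon label
# of the pre-identified training specimens.
df = pd.read_csv("leaf_morphometrics.csv")
X, y = df.drop(columns=["taxon"]), df["taxon"]

rf = RandomForestClassifier(n_estimators=1000, oob_score=True, random_state=0)
rf.fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)

# Keep the most discriminating variables according to forest importance,
# then look at the grouping with a PCA on that reduced variable set.
ranked = sorted(zip(rf.feature_importances_, X.columns), reverse=True)
top_vars = [name for _, name in ranked[:10]]
scores = PCA(n_components=2).fit_transform(X[top_vars])
print(top_vars)
print(scores[:5])   # first two principal components of the first few specimens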
306

Virtualaus objekto valdymo sistemos smegenų kompiuterio sąsajos tyrimas / Virtual object management system of the brain computer interface research

Šidlauskas, Kęstutis 26 August 2013 (has links)
This work analyzes a brain–computer interface (BCI) system and the use of artificial neural network and random forest classification algorithms in BCI systems. A prototype of a brain–computer interface was developed. The prototype lets the user control the computer mouse using an electroencephalogram or electromyogram reader. A study of mouse control through the BCI system was carried out on practical tasks, and the results were compared with a conventionally controlled computer mouse. The study used an OCZ NIA electroencephalogram and electromyogram signal reader. The classification algorithms were also compared to determine which achieves the highest accuracy. Conclusions were drawn about the advantages and shortcomings of the BCI prototype.
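A rough sketch of the classifier comparison mentioned above (a random forest versus an artificial neural network on windowed EEG/EMG features) is given below; the randomly generated features and the four-command labelling are stand-ins, not the actual signals recorded from the OCZ NIA reader.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Stand-in for windowed EEG/EMG features (e.g. band powers per channel),
# each window labelled with one of four intended mouse commands.
X = rng.random((600, 24))
y = rng.integers(0, 4, 600)

for name, clf in [("random forest", RandomForestClassifier(n_estimators=300, random_state=0)),
                  ("neural network", MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean cross-validated accuracy = {acc:.2f}")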
307

Evaluation of Supervised Machine Learning Algorithms for Detecting Anomalies in Vehicle's Off-Board Sensor Data

Wahab, Nor-Ul January 2018 (has links)
A diesel particulate filter (DPF) is designed to physically remove diesel particulate matter, or soot, from the exhaust gas of a diesel engine. Frequently replacing the DPF is a waste of resources, while waiting for full utilization is risky and very costly, so what is the optimal time/mileage at which to change the DPF? Answering this question is very difficult without knowing when the DPF was changed in a vehicle. We approach the answer with supervised machine learning algorithms for detecting anomalies in vehicles' off-board sensor data (operational data of vehicles). A filter change is considered an anomaly because it is rare compared to the normal data. Non-sequential machine learning algorithms for anomaly detection, namely one-class support vector machine (OC-SVM), k-nearest neighbor (K-NN), and random forest (RF), are applied for the first time to the DPF dataset. The dataset is unbalanced, and accuracy was found to be a misleading performance measure for the algorithms. Precision, recall, and F1-score were found to be good measures of the performance of the machine learning algorithms when the data is unbalanced. RF gave the highest F1-score, 0.55, compared to K-NN (0.52) and OC-SVM (0.51). This means that RF performs better than K-NN and OC-SVM, but after further investigation it is concluded that the results are not satisfactory. However, a sequential approach should have been tried, which could yield a better result.
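The evaluation protocol described above, supervised RF and K-NN plus a one-class SVM trained on normal data only, scored with precision, recall and F1 rather than accuracy, can be sketched as follows. The simulated data and its anomaly rate are stand-ins for the off-board sensor dataset.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-in for the operational (off-board) sensor data; 1 marks the rare filter-change windows.
X = rng.random((5000, 12))
y = (rng.random(5000) < 0.03).astype(int)   # anomaly rate chosen arbitrarily for the sketch
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def report(name, y_pred):
    p, r, f1, _ = precision_recall_fscore_support(y_te, y_pred, average="binary", zero_division=0)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")   # accuracy alone would look deceptively high

# Supervised learners see both classes during training.
for name, clf in [("RF", RandomForestClassifier(class_weight="balanced", random_state=0)),
                  ("K-NN", KNeighborsClassifier(n_neighbors=5))]:
    report(name, clf.fit(X_tr, y_tr).predict(X_te))

# The one-class SVM is trained on normal data only and flags outliers (-1) as anomalies.
oc = OneClassSVM(nu=0.03).fit(X_tr[y_tr == 0])
report("OC-SVM", (oc.predict(X_te) == -1).astype(int))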
308

Random forest em dados desbalanceados: uma aplicação na modelagem de churn em seguro saúde / Random forest on imbalanced data: an application to churn modeling in health insurance

Lento, Gabriel Carneiro 27 March 2017 (has links)
In this work we study churn in health insurance, that is, predicting which clients will cancel the product or service within a preset time frame. Traditionally, the probability that a client will cancel the service is modeled using logistic regression. Recently, modern machine learning techniques have become popular in churn modeling, having been applied in the areas of telecommunications, banking, and car insurance, among others. One of the big challenges in this problem is that only a small fraction of all customers cancel the service, meaning that we have to deal with highly imbalanced class probabilities. Under-sampling and over-sampling techniques have been used to overcome this issue. We use random forests, which are ensembles of decision trees in which each tree fits a subsample of the data constructed using either under-sampling or over-sampling. We compare the distinct specifications of random forests using various metrics that are robust to imbalanced classes, both in-sample and out-of-sample. We observe that random forests using imbalanced random subsamples with fewer observations than the original sample present the best overall performance. Random forests also perform better than the classical logistic regression often used by health insurance companies to model churn.
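A sketch of the under-sampling variant described above, where each tree in the forest is grown on a rebalanced bootstrap, is shown below using the imbalanced-learn library's BalancedRandomForestClassifier as one readily available implementation of that idea; the dissertation's own sampling scheme and data are not reproduced, and the simulated churn rate is an assumption.

from imblearn.ensemble import BalancedRandomForestClassifier   # each tree grown on an under-sampled bootstrap
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for the policy data set: about 5 percent of customers churn.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

brf = BalancedRandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for name, model in [("balanced random forest", brf), ("logistic regression", logit)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])   # a metric robust to class imbalance
    print(f"{name}: AUC = {auc:.3f}")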
309

Computational studies of biomolecules

Chen, Sih-Yu January 2017 (has links)
In modern drug discovery, lead discovery is a term used to describe the overall process from hit discovery to lead optimisation, with the goal being to identify drug candidates. This can be greatly facilitated by the use of computer-aided (or in silico) techniques, which can reduce experimentation costs along the drug discovery pipeline. The range of relevant techniques includes: molecular modelling to obtain structural information; molecular dynamics (covered in Chapter 2); activity or property prediction by means of quantitative structure activity/property models (QSAR/QSPR), where machine learning techniques are introduced (covered in Chapter 1); and quantum chemistry, used to explain chemical structure, properties and reactivity. This thesis is divided into five parts. Chapter 1 starts with an outline of the early stages of drug discovery, introducing the use of virtual screening for hit and lead identification. Such approaches may roughly be divided into structure-based (docking, by far the most often referred to) and ligand-based, leading to a set of promising compounds for further evaluation. The chapter then introduces machine learning techniques, which will be frequently encountered, followed by a brief review of the "no free lunch" theorem, which states that no learning algorithm can perform optimally on all problems; this implies that validation of predictive accuracy across multiple models is required for optimal model selection. As the dimensionality of the feature space increases, the issue referred to as "the curse of dimensionality" becomes a challenge. The last sections focus on supervised classification with Random Forests. Computer-based analyses are an integral part of drug discovery. Chapter 2 begins with a discussion of molecular docking, including strategies incorporating protein flexibility at global and local levels, followed by a specific focus on an automated docking program, AutoDock, which uses a Lamarckian genetic algorithm and an empirical binding free energy function. In the second part of the chapter, a brief introduction to molecular dynamics is given. Chapter 3 describes how we constructed a dataset of known binding sites with co-crystallised ligands, used to extract features characterising the structural and chemical properties of the binding pocket. A machine learning algorithm was adopted to create a three-way predictive model, capable of assigning each case to one of the classes (regular, orthosteric and allosteric) for in silico selection of allosteric sites, and a feature selection criterion (Gini) was used to rationalise the selection of the descriptors most influential in classifying the binding pockets. In Chapter 4, we made use of structure-based virtual screening, focusing on docking a fluorescent sensor to a non-canonical DNA quadruplex structure. The preferred binding poses, binding site and interactions are scored, followed by application of an ONIOM model to re-score the binding poses of some DNA-ligand complexes, focusing only on the best pose (with the lowest binding energy) from AutoDock. The use of a conformational ensemble pre-generated with MD to account for the receptor's flexibility, followed by docking, is termed a "relaxed complex" scheme. Chapter 5 concerns the BLUF domain photocycle. We focus on the conformational preferences of some critical residues in the flavin binding site after a charge redistribution has been introduced. This work provides another activation model to address controversial features of the BLUF domain.
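The three-way binding-pocket classifier with Gini-based descriptor ranking outlined in Chapter 3 could be approximated along these lines; the descriptor table, its column names and the use of scikit-learn's impurity-based importances are assumptions for illustration.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

# Hypothetical descriptor table: one row per binding site, with structural/chemical
# features and a label in {"regular", "orthosteric", "allosteric"}.
df = pd.read_csv("pockets.csv")
X, y = df.drop(columns=["site_class"]), df["site_class"]

rf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
pred = cross_val_predict(rf, X, y, cv=5)          # three-way prediction for every pocket
print(classification_report(y, pred))

# Impurity (Gini) importances highlight the descriptors driving the classification.
rf.fit(X, y)
print(sorted(zip(rf.feature_importances_, X.columns), reverse=True)[:10])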
310

Cartographie de l'occupation des sols à partir de séries temporelles d'images satellitaires à hautes résolutions : identification et traitement des données mal étiquetées / Land cover mapping by using satellite image time series at high resolutions : identification and processing of mislabeled data

Pelletier, Charlotte 11 December 2017 (has links)
Land surface monitoring has become a major challenge at the global scale for the management and monitoring of territories, notably regarding the consumption of agricultural land and urban sprawl, and it supports diverse applications such as environment, forestry, hydrology and geology as well as the prediction of climate trends. For this purpose, land cover maps characterizing the biophysical cover of land surfaces play an essential role, and mapping approaches that employ satellite-based Earth observations at different spatial and temporal scales are used to obtain land surface characteristics over large areas frequently and at low cost. More precisely, supervised classification algorithms that exploit satellite data present many advantages compared to other mapping methods. In addition, the recent launches of new satellite constellations, Landsat-8 and Sentinel-2, enable the acquisition of satellite image time series at high spatial and spectral resolutions, which are of great interest for describing vegetation land cover. These satellite data open new perspectives, but also raise questions about the choice of classification algorithms and the choice of input data. Besides the satellite data, supervised classification algorithms rely on labelled training samples, i.e., samples whose land cover class is known, to define their decision rule, so the quality of the land cover map is directly linked to the quality of the training labels. Moreover, learning classification algorithms over large areas requires a substantial number of instances per land cover class, describing landscape variability, and collecting reference data is a long and tedious task. Accordingly, training data are often extracted from existing maps or existing databases, such as farmers' crop parcel declarations or government databases. When using these databases, the main drawbacks are the lack of accuracy and update problems due to a long production time. Unfortunately, the use of these imperfect training data leads to the presence of mislabeled training instances that may degrade the classification performance, and thus the quality of the produced land cover map. Taking into account the above challenges, this Ph.D. work aims at improving the classification of new satellite image time series at high resolutions. The work has been divided into two main parts. The first goal consists in studying different classification systems by evaluating two classification algorithms with several input datasets; the stability and robustness of the classification methods over large areas, their sensitivity to their parameters and input data, and their robustness to imperfect data are discussed. The second goal deals with the errors contained in the training data. Firstly, methods for the detection of mislabeled data are proposed and analyzed. Secondly, a filtering method is proposed to take the mislabeled data into account in the classification framework. The objective is to reduce the influence of mislabeled data on the classification performance, and thus to improve the produced land cover map.
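One simple way to realise the detection-then-filtering idea described in the second part is to flag training samples whose own label receives a low cross-validated probability from an ensemble, as sketched below. This is an illustrative approach under assumed inputs, not the specific detection and filtering methods proposed in the thesis.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def filter_mislabeled(X, y, threshold=0.5):
    """Flag training samples whose own class receives a low cross-validated probability."""
    rf = RandomForestClassifier(n_estimators=300, random_state=0)
    proba = cross_val_predict(rf, X, y, cv=5, method="predict_proba")
    classes = list(np.unique(y))                       # column order of predict_proba
    own = proba[np.arange(len(y)), [classes.index(c) for c in y]]
    return own >= threshold                            # True = keep, False = suspected mislabeled

# X_train: per-pixel time-series features, y_train: labels from an outdated database (assumed names).
# keep = filter_mislabeled(X_train, y_train)
# clean_model = RandomForestClassifier().fit(X_train[keep], y_train[keep])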
