151

Sélection de variables pour l’analyse des données semi-supervisées dans les systèmes d’Information décisionnels / Feature selection for semi-supervised data analysis in decisional information systems

Hindawi, Mohammed 21 February 2013 (has links)
Feature selection is an important task in data mining and machine learning. The problem is well studied in both the supervised and unsupervised contexts, whereas semi-supervised feature selection is still under development and far from mature. Machine learning methods that exploit partially labeled data have matured in recent years, which gives feature selection special importance in the semi-supervised context and makes it better suited to real-world applications, where labels are costly and difficult to obtain. In this thesis, we present a literature review of semi-supervised feature selection, with regard to the supervised and unsupervised contexts. The goal is to show the importance of a good compromise between the geometric structure carried by the unlabeled part of the data and the background information carried by its labeled part. In particular, we are interested in the so-called «small labeled-sample problem», where the labeled part of the data is much smaller than the unlabeled part. To deal with semi-supervised feature selection, we propose two groups of approaches. The first group is of the «Filter» type: we propose algorithms that evaluate the relevance of each feature with a scoring function based on spectral graph theory and the integration of pairwise constraints extracted from the data at hand. The second group is of the «Embedded» type, where feature selection becomes an internal function of the learning process. To realize embedded feature selection, we propose feature-weighting algorithms built on constrained clustering, developed in two visions: (1) a global vision based on relaxed satisfaction of the pairwise constraints, which are integrated directly into the objective function of the proposed clustering model; and (2) a local vision based on strict control of constraint violations. Both approaches evaluate the relevance of features through weights learned during the construction of the clustering model. In addition to the main task of feature selection, we address redundancy elimination: we propose a novel algorithm combining mutual information with a maximum-spanning-tree search over the relevant features, in order to optimize the number of features finally selected. Finally, all the methods proposed in this thesis are analyzed in terms of algorithmic complexity and validated on high-dimensional data against representative methods from the literature.
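To make the filter idea concrete, here is a minimal Python sketch of a constraint-aware spectral relevance score in the spirit of the Laplacian score; the constraint-weighting scheme (hard-setting must-link edges to 1 and cannot-link edges to 0) is an illustrative assumption, not the thesis's exact formulation.

```python
import numpy as np

def constrained_spectral_scores(X, must_link, cannot_link, sigma=1.0):
    """Score each feature; lower = better preserves the graph structure.

    X           : (n_samples, n_features) data matrix
    must_link   : list of (i, j) pairs known to share a label
    cannot_link : list of (i, j) pairs known to differ
    """
    # RBF similarity graph over all samples (the unlabeled structure).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    # Inject supervision: strengthen must-link edges, cut cannot-link edges.
    for i, j in must_link:
        W[i, j] = W[j, i] = 1.0
    for i, j in cannot_link:
        W[i, j] = W[j, i] = 0.0
    D = np.diag(W.sum(1))
    L = D - W                                    # graph Laplacian
    scores = []
    for f in range(X.shape[1]):
        x = X[:, f]
        x = x - x.dot(W.sum(1)) / W.sum()        # degree-weighted centering
        num = x @ L @ x                          # local variation on the graph
        den = x @ D @ x + 1e-12                  # degree-weighted variance
        scores.append(num / den)
    return np.array(scores)
```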
152

Mapeamento de dados genômicos usando escalonamento multidimensional / Representation of genomics data with multidimensional scaling

Espezúa Llerena, Soledad 04 June 2008 (has links)
In this work, various multidimensional scaling (MDS) techniques are explored with the goal of studying their applicability to mapping genomic data obtained with the RFLP-PCR technique. The mapping is performed into low-dimensional spaces (2D or 3D) in order to exploit the human capability for visual analysis and interpretation. A comparative analysis of several MDS algorithms was carried out to assess their suitability for mapping genomic data, covering performance indices such as mapping precision, computational cost, and the capacity to induce good groupings. To support this analysis we developed the tool "MDSExplorer", which integrates the studied algorithms and offers options for comparing them and for visualizing the mappings. The analysis, carried out over several datasets cited in the literature, suggests that the LANDMARK algorithm has the lowest computational time, a mapping precision similar to that of the other algorithms, and a good capacity to preserve the structures present in the data. Finally, MDSExplorer was used to map a real genomic dataset: a Brazilian collection of strains of nitrogen-fixing bacteria of the genus Bradyrhizobium (known for their capability to transform atmospheric nitrogen into compounds useful to the host plants), with the aim of helping the specialist visually infer a taxonomy for these strains. The dimensionality-reduction results on this dataset suggest that the relevant information (above 60% of cumulative variance) for the 16S, 23S, and IGS regions lies in the first 5, 4, and 9 dimensions, respectively.
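For reference, classical MDS — the baseline behind the algorithms compared in MDSExplorer — can be sketched in a few lines; this is the textbook eigendecomposition variant, not the tool's own code (LANDMARK additionally restricts the eigendecomposition to a small subset of landmark points and triangulates the rest, which is what makes it fast).

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed n points in `dim` dimensions from an (n, n) distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]       # keep the largest ones
    vals = np.clip(vals[idx], 0, None)       # guard against numerical negatives
    return vecs[:, idx] * np.sqrt(vals)      # (n, dim) coordinates
```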
153

Propagação em grafos bipartidos para extração de tópicos em fluxo de documentos textuais / Propagation in bipartite graphs for topic extraction in stream of textual data

Faleiros, Thiago de Paulo 08 June 2016 (has links)
Handling large amounts of data is a requirement for modern text-mining algorithms. In some applications documents are published continuously, which makes long-term storage costly; this calls for easily adaptable methods that treat documents as a stream and analyze the data in a single pass, without requiring large amounts of storage. Another requirement is that such an approach be able to exploit heuristics in order to improve the quality of its results. Several models for automatic extraction of latent information from a document collection have been proposed in the literature, prominent among them the probabilistic topic models. Probabilistic topic models achieve good practical results and have been extended into many variants incorporating different types of information. However, properly describing these models, deriving them, and then obtaining appropriate inference algorithms are difficult tasks, requiring rigorous mathematical treatment of the operations performed in the latent-dimension discovery process. Thus, developing a simple and efficient method for the discovery of latent dimensions requires a suitable representation of the data. The hypothesis of this thesis is that, by representing textual data as a bipartite graph, one can address the task of discovering latent patterns in the relationships between objects — for example between documents and words — in a simple and intuitive way. To validate this hypothesis, we developed a framework based on a label propagation algorithm over the bipartite graph representation. The framework, called PBG (Propagation in Bipartite Graph), was first applied in the unsupervised context to a static collection of documents. A semi-supervised version was then proposed, which needs only a small number of labeled documents for the transductive classification task. Finally, it was applied in the dynamic context, where a stream of textual documents is considered. Comparative analyses were performed, and the results indicate that PBG is a viable and competitive alternative for tasks in the unsupervised and semi-supervised contexts.
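As an illustration of propagation over a document-word bipartite graph, the sketch below alternates topic scores between the two node sets; the update rule here is a simple row-stochastic averaging assumed for illustration, not the exact PBG algorithm.

```python
import numpy as np

def bipartite_propagation(A, k, iters=50, seed=0):
    """Alternate propagation over a document-word bipartite graph.

    A : (n_docs, n_words) nonnegative weight matrix (e.g., term counts)
    k : number of latent topics
    Returns (doc_topics, word_topics) with rows summing to 1.
    """
    rng = np.random.default_rng(seed)
    F_w = rng.random((A.shape[1], k))               # initial word-topic scores
    F_w /= F_w.sum(1, keepdims=True)
    row = A / (A.sum(1, keepdims=True) + 1e-12)     # doc -> word transition
    col = (A / (A.sum(0, keepdims=True) + 1e-12)).T # word -> doc transition
    F_d = row @ F_w
    for _ in range(iters):
        F_d = row @ F_w                             # docs inherit from their words
        F_w = col @ F_d                             # words inherit from their docs
    return F_d, F_w
```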
154

"Redução de dimensionalidade utilizando entropia condicional média aplicada a problemas de bioinformática e de processamento de imagens" / Dimensionality reduction using mean conditional entropy applied for bioinformatics and image processing problems

Martins Junior, David Correa 22 September 2004 (has links)
Dimensionality reduction is a very important pattern recognition problem with applications in many fields. Among dimensionality-reduction techniques, feature selection was the main focus of this research. In general, most dimensionality-reduction methods in the literature favor cases in which the data are linearly separable and only two distinct classes exist. Aiming to cover more generic cases, this work proposes a criterion function, based on the statistical theory principles of entropy and mutual information, to be embedded in existing feature-selection algorithms. This approach makes it possible to classify data, linearly separable or not, into two or more classes while taking into account only a small feature subspace. Results on synthetic and real data were obtained, corroborating the utility of this technique. This work addressed two bioinformatics problems. The first is distinguishing two biological phenomena through the selection of an appropriate subset of genes. We studied a strong-genes selection technique using support vector machines (SVM) that had already been applied to SAGE data from the human genome. Most of the strong genes found by that technique to distinguish brain tumors (glioblastoma and astrocytoma) were validated by the methodology presented in this work. The second problem is the identification of gene regulatory networks, using the proposed methodology, from data produced by the work of DeRisi et al. on microarrays of the genome of Plasmodium falciparum, the agent of malaria, over the 48 hours of its life cycle. This text presents evidence that using mean conditional entropy to estimate probabilistic genetic networks (PGN) can be a very promising approach in this type of application. In the image processing context, the technique was successfully applied to obtain minimal W-operators for image filtering and texture recognition.
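The criterion at the heart of this work — the mean conditional entropy H(Y|X) of the label given a candidate feature subset — can be estimated directly from discrete data. The sketch below is one straightforward plug-in estimator, to be plugged into a subset-search loop; it omits any penalization of rarely observed patterns that the thesis may additionally apply.

```python
import numpy as np
from collections import Counter

def mean_conditional_entropy(X, y):
    """Plug-in estimate of H(Y | X) from discrete data.

    X : (n_samples, n_features) discrete values; each row is the observed
        pattern of the candidate feature subset for one sample.
    y : (n_samples,) discrete class labels.
    Lower values indicate a subset that better predicts the label.
    """
    n = len(y)
    groups = {}
    for xi, yi in zip(map(tuple, X), y):     # group labels by feature pattern
        groups.setdefault(xi, []).append(yi)
    h = 0.0
    for labels in groups.values():
        p_pattern = len(labels) / n          # P(X = pattern)
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()            # P(Y | X = pattern)
        h += p_pattern * -(p * np.log2(p)).sum()
    return h
```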
155

A method for reducing dimensionality in large design problems with computationally expensive analyses

Berguin, Steven Henri 08 June 2015 (has links)
Strides in modern computational fluid dynamics and leaps in high-power computing have led to unprecedented capabilities for handling large aerodynamic problems. In particular, the emergence of adjoint design methods has been a breakthrough in the field of aerodynamic shape optimization. It enables expensive, high-dimensional optimization problems to be tackled efficiently using gradient-based methods in CFD, a task that was previously inconceivable. However, adjoint design methods are intended for gradient-based optimization; the curse of dimensionality is still very much alive when it comes to design space exploration, where gradient-free methods cannot be avoided. This research describes a novel approach for reducing dimensionality in large, computationally expensive design problems to a point where gradient-free methods become possible. This is done through an innovative application of Principal Component Analysis (PCA), in which PCA is applied to the gradient distribution of the objective function, something that had not been done before. This yields a linear transformation that maps a high-dimensional problem onto an equivalent low-dimensional subspace. None of the original variables are discarded; they are simply linearly combined into a new, smaller set of variables. The method is tested on a range of analytical functions, a two-dimensional staggered-airfoil test problem, and a three-dimensional Over-Wing Nacelle (OWN) integration problem. In all cases the method performed as expected and was found to be cost-effective, requiring only a relatively small number of samples to achieve a large dimensionality reduction.
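The construction resembles what is now known as the active-subspace method: assemble sampled gradients of the objective and take the leading eigenvectors of their (uncentered) covariance as the reduced directions. A minimal sketch, assuming the gradients (e.g., from an adjoint solver) are already available:

```python
import numpy as np

def gradient_pca_subspace(grads, k):
    """Reduce a design space using PCA on sampled objective gradients.

    grads : (n_samples, n_vars) gradients of the objective at sample points
    k     : target dimensionality
    Returns a (n_vars, k) matrix W whose columns span the reduced subspace.
    """
    C = grads.T @ grads / grads.shape[0]       # uncentered gradient covariance
    vals, vecs = np.linalg.eigh(C)
    W = vecs[:, np.argsort(vals)[::-1][:k]]    # top-k eigenvectors
    return W

# New low-dimensional variables: z = x @ W — a linear recombination of all
# original variables, none of which is discarded.
```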
156

Novel computationally intelligent machine learning algorithms for data mining and knowledge discovery

Gheyas, Iffat A. January 2009 (has links)
This thesis addresses three major issues in data mining: feature subset selection in high-dimensional domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting of univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm, SAGA. SAGA combines the ability of simulated annealing to avoid being trapped in local minima with the very high convergence rate of the crossover operator of genetic algorithms, the strong local search ability of greedy algorithms, and the high computational efficiency of generalized regression neural networks (GRNN). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble: a committee of GRNNs trained on different subsets of features generated by SAGA, whose base predictions are combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features that make it stand out among ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNN is used both for the base classifiers and for the top-level combiner. Because of GRNN, the proposed ensemble is a dynamic weighting scheme, in contrast to existing ensemble approaches that rely on simple voting or static weighting. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are most similar to the new one. The simulation results demonstrate the validity of the proposed ensemble model.
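A GRNN is essentially Nadaraya-Watson kernel regression, which is what makes the ensemble a dynamic weighting scheme: each prediction weights training scenarios by their similarity to the query. A minimal sketch (the bandwidth sigma is chosen arbitrarily here):

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=0.5):
    """Generalized Regression Neural Network (Nadaraya-Watson) prediction."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))           # kernel weight per training point
    return (w @ y_train) / (w.sum(1) + 1e-12)    # similarity-weighted average
```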
157

Fast Algorithms for Mining Co-evolving Time Series

Li, Lei 01 September 2011 (has links)
Time series data arise in many applications, from motion capture and environmental monitoring to temperatures in data centers and physiological signals in health care. In this thesis, I focus on the theme of learning and mining large collections of co-evolving sequences, with the goal of developing fast algorithms for finding patterns, summarization, and anomalies. In particular, the thesis answers the following recurring challenges for time series: 1. Forecasting and imputation: how to forecast and recover missing values in time series data? 2. Pattern discovery and summarization: how to identify patterns in the time sequences that facilitate further mining tasks such as compression, segmentation, and anomaly detection? 3. Similarity and feature extraction: how to extract compact and meaningful features from multiple co-evolving sequences that enable better clustering and similarity queries? 4. Scale-up: how to handle large datasets on modern computing hardware? We develop models to mine time series with missing values, to extract compact representations from time sequences, to segment the sequences, and to do forecasting. For large-scale data, we propose algorithms for learning time series models, in particular Linear Dynamical Systems (LDS) and Hidden Markov Models (HMM). We also develop a distributed algorithm for finding patterns in large web-click streams. The thesis also presents special models and algorithms that incorporate domain knowledge. For motion capture, we describe natural motion stitching and occlusion filling for human motion; in particular, we provide a metric for evaluating the naturalness of motion stitching, based on which we choose the best stitching. Thanks to domain knowledge (body structure and bone lengths), our algorithm is capable of recovering occlusions in mocap sequences with better accuracy and over longer missing periods. We also develop an algorithm for forecasting thermal conditions in a warehouse-sized data center; the forecast helps control and manage the data center in an energy-efficient way, which can save a significant fraction of the electric power consumed by data centers.
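To illustrate the forecasting task with one of the models named above, here is a minimal point-forecast step for a Linear Dynamical System; the parameters A, C and the current hidden state z0 are assumed to have been learned already (e.g., by EM with Kalman smoothing), which is the hard part the thesis addresses.

```python
import numpy as np

def lds_forecast(A, C, z0, steps):
    """Point forecast from a learned Linear Dynamical System.

    Model: z[t+1] = A z[t],  x[t] = C z[t]  (noise omitted for point forecasts)
    A  : (h, h) state transition matrix
    C  : (m, h) observation matrix
    z0 : (h,) current hidden-state estimate
    """
    z, out = z0, []
    for _ in range(steps):
        z = A @ z                  # evolve the hidden state
        out.append(C @ z)          # project into observation space
    return np.array(out)           # (steps, m) forecast
```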
158

Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease

Duan, Haoyang 15 May 2014 (has links)
From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs) from the Ontario Heart Genomics Study (OHGS). First, the thesis explains the k-Nearest Neighbour (k-NN) and Random Forest learning algorithms, and includes a complete proof that k-NN is universally consistent in finite-dimensional normed vector spaces. Second, the thesis introduces two dimensionality-reduction techniques: Random Projections and a new method termed Mass Transportation Distance (MTD) Feature Selection. The thesis then compares the performance of Random Projections with k-NN against MTD Feature Selection with Random Forest for predicting coronary artery disease. The results demonstrate that MTD Feature Selection with Random Forest is superior: Random Forest obtains an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS dataset when 3335 SNPs are selected by MTD Feature Selection for classification. This is considerably better than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.
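A sketch of the select-then-evaluate pipeline described above, assuming scikit-learn is available; the scoring function is passed in as `score_fn`, a hypothetical stand-in for the thesis's Mass Transportation Distance criterion, which is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def select_and_evaluate(X, y, score_fn, n_keep=3335):
    """Rank SNPs by a per-feature score, keep the top n_keep, evaluate AUC."""
    scores = np.array([score_fn(X[:, j], y) for j in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:n_keep]          # top-ranked SNP columns
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    auc = cross_val_score(clf, X[:, keep], y, scoring="roc_auc", cv=5)
    return keep, auc.mean()
```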
159

From Human to Robot Grasping

Romero, Javier January 2011 (has links)
Imagine that a robot fetched this thesis for you from a book shelf. How do you think the robot would have been programmed? One possibility is that experienced engineers had written low-level descriptions of all imaginable tasks, including grasping a small book from this particular shelf. A second option would be for the robot to try to learn how to grasp books from your shelf autonomously, resulting in hours of trial and error and several books on the floor. In this thesis, we argue in favor of a third approach, in which you teach the robot how to grasp books from your shelf through grasping by demonstration. It is based on the idea of robots learning grasping actions by observing humans performing them. This imposes minimum requirements on the human teacher: no programming knowledge and, in this thesis, no need for special sensory devices. It also maximizes the number of sources from which the robot can learn: any video footage showing a task performed by a human could potentially be used in the learning process. And hopefully it reduces the number of books that end up on the floor. This document explores the challenges involved in the creation of such a system. First, the robot should be able to understand what the teacher is doing with their hands; that is, it needs to estimate the pose of the teacher's hands by visual observation, in the absence of markers or any other input devices that could interfere with the demonstration. Second, the robot should translate the human representation, acquired in terms of hand poses, to its own embodiment. Since the kinematics of the robot are potentially very different from the human's, defining a similarity measure applicable to very different bodies becomes a challenge. Third, the execution of the grasp should be continuously monitored to react to inaccuracies in the robot's perception or changes in the grasping scenario. While visual data can help correct the reaching movement toward the object, tactile data enables accurate adaptation of the grasp itself, thereby adjusting the robot's internal model of the scene to reality. Finally, acquiring compact models of human grasping actions can help both in perceiving human demonstrations more accurately and in executing them in a more human-like manner. Moreover, modeling human grasps can provide insights into what makes an artificial hand design anthropomorphic, assisting the design of new robotic manipulators and hand prostheses. All these modules address particular subproblems of a grasping-by-demonstration system. We hope the research on these subproblems performed in this thesis will both bring us closer to our dream of a learning robot and contribute to the multiple research fields from which these subproblems originate.
160

Automatic classification of natural signals for environmental monitoring / Classification automatique de signaux naturels pour la surveillance environnementale

Malfante, Marielle 03 October 2018 (has links)
This manuscript summarizes three years of work on the use of machine learning for the automatic analysis of natural signals. The main goal of this PhD is to produce efficient and operational frameworks for the analysis of environmental signals, in order to gather knowledge and better understand the considered environment. In particular, we focus on the automatic tasks of detection and classification of natural events. This thesis proposes two tools based on supervised machine learning (Support Vector Machine, Random Forest) for (i) the automatic classification of events and (ii) the automatic detection and classification of events. The success of the proposed approaches lies in the feature space used to represent the signals, which relies on a detailed description of the raw recordings in several domains: temporal, spectral, and cepstral. A comparison with features extracted by convolutional neural networks (deep learning) is also made, and it favours the physics-based features over deep learning methods for representing transient signals. The proposed tools are tested and validated on real-world recordings from two different environments: (i) underwater and (ii) volcanic areas. The first application is devoted to the monitoring of coastal underwater areas using acoustic signals: continuous recordings are analysed to automatically detect and classify fish sounds, revealing a day-to-day pattern in fish behaviour. The second application targets volcano monitoring: the proposed system classifies seismic events into categories that can be associated with different phases of the internal activity of a volcano. The study is conducted on six years of volcano-seismic data recorded on Ubinas volcano (Peru); in particular, the outcomes of the automatic classification system helped uncover misclassifications in the original manual annotation of the recordings. In addition, the automatic classification framework for volcano-seismic signals has been deployed and tested in an observatory in Indonesia for the monitoring of Mount Merapi. The software implementation of the framework developed in this thesis has been collected in the Automatic Analysis Architecture (AAA) package and is freely available.
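To illustrate the kind of multi-domain description used here, the sketch below computes a handful of temporal, spectral, and cepstral features from a raw signal; these particular features are chosen only for illustration, and the actual AAA feature set is much richer.

```python
import numpy as np

def describe_signal(x, fs):
    """Toy feature vector over the temporal, spectral and cepstral domains.

    x  : 1-D signal array (assumed reasonably long, e.g. >= 64 samples)
    fs : sampling rate in Hz
    """
    spec = np.abs(np.fft.rfft(x))                    # magnitude spectrum
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    cepstrum = np.abs(np.fft.irfft(np.log(spec + 1e-12)))
    return np.array([
        x.mean(), x.std(),                           # temporal statistics
        ((x[:-1] * x[1:]) < 0).mean(),               # zero-crossing rate
        (freqs * spec).sum() / (spec.sum() + 1e-12), # spectral centroid
        spec.std(),                                  # crude spectral spread
        cepstrum[1:20].mean(), cepstrum[1:20].std(), # cepstral summary
    ])
```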
