271

Learning from Asymmetric Models and Matched Pairs

January 2013 (has links)
abstract: With the increase in computing power and the availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexity of the information hidden within it, so knowledge discovery through machine learning techniques is necessary if we want to better understand our data. In this dissertation, we explore the topics of asymmetric loss and asymmetric data in machine learning and propose new algorithms to address some of the problems in these areas. We also study variable selection for matched data sets and propose a solution for when the matched data exhibit non-linearity. The research is divided into three parts. The first part addresses the problem of asymmetric loss: a proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy, and was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets, in which variables are predictive for only a subset of the predictor classes; an Asymmetric Random Forest (ARF) is proposed to detect such variables. The third part explores variable selection for matched data sets: a Matched Random Forest (MRF) is proposed to find variables that distinguish case from control without the restrictions that exist in linear models, even in the presence of interactions and qualitative variables. / Dissertation/Thesis / Ph.D. Industrial Engineering 2013
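The aSVM's exact formulation is not given in this abstract. As a loose illustration of the underlying idea (penalizing one type of misclassification more heavily, so that the class of interest is predicted with higher precision), the sketch below uses the standard class_weight option of scikit-learn's SVC; the weights and the synthetic data are assumptions, not the dissertation's algorithm.

```python
# Sketch: biasing an SVM with asymmetric misclassification costs so that
# one class is predicted with higher precision. Uses scikit-learn's
# standard class_weight option as a stand-in for the dissertation's aSVM;
# the weights and the synthetic data are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_score

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.7, 0.3],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Symmetric loss: both error types cost the same.
plain = SVC(kernel="rbf").fit(X_tr, y_tr)

# Asymmetric loss: misclassifying class-0 samples costs 5x more, which
# suppresses false positives for class 1 and raises its precision.
asym = SVC(kernel="rbf", class_weight={0: 5.0, 1: 1.0}).fit(X_tr, y_tr)

for name, model in [("symmetric", plain), ("asymmetric", asym)]:
    print(name, "precision on class 1:",
          round(precision_score(y_te, model.predict(X_te)), 3))
```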
272

Développement d'une méthodologie robuste de sélection de gènes dans le cadre d'une activation pharmacologique de la voie PPAR / Development of a robust methodology of selected genes in the context of pharmacological activation of the PPAR pathway

Cotillard, Aurélie 03 December 2009 (has links)
De par leur dimension élevée, les données de puces à ADN nécessitent l’application de méthodes statistiques pour en extraire une information pertinente. Dans le cadre de l’étude des différences entre deux agonistes de PPAR (Peroxisome Proliferator-Activated Receptor), nous avons sélectionné trois méthodes de sélection de variables : T-test, Nearest Shrunken Centroids (NSC) et Support Vector Machine – Recursive Feature Elimination. Ces méthodes ont été testées sur des données simulées et sur les données réelles de l’étude PPAR. En parallèle, une nouvelle méthodologie, MetRob, a été développée afin d’améliorer la robustesse de ces méthodes vis-à-vis de la variabilité technique des puces à ADN, ainsi que leur reproductibilité. Cette nouvelle méthodologie permet principalement d’améliorer la valeur prédictive positive, c’est-à-dire la confiance accordée aux résultats. La méthode NSC s’est révélée la plus robuste et ce sont donc les résultats de cette méthode, associée à MetRob, qui ont été étudiés d’un point de vue biologique. / The microarray technology produces high-dimensional data that must be statistically treated to extract relevant information. In the context of the study of the differences between two PPAR (Peroxisome Proliferator-Activated Receptor) agonists, we selected three feature selection methods: T-test, Nearest Shrunken Centroids (NSC) and Support Vector Machine – Recursive Feature Elimination. These methods were tested on simulated data and on the real data of the PPAR study. In parallel, a new methodology, MetRob, was developed in order to improve the robustness of these methods with respect to the technical variability of microarrays, as well as their reproducibility. This new methodology mainly improves the positive predictive value, that is, the confidence placed in the results. The NSC method proved to be the most robust, and the results of NSC combined with MetRob were therefore studied from a biological point of view.
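Of the three methods, NSC (the one found most robust here) has an off-the-shelf implementation: scikit-learn's NearestCentroid with a shrink_threshold. The sketch below shows how shrinkage implicitly selects genes: features whose class centroids all collapse onto the overall mean no longer influence classification. The simulated "microarray" data and the threshold value are illustrative assumptions, and MetRob itself is not reproduced here.

```python
# Sketch: Nearest Shrunken Centroids (NSC) as an implicit feature selector.
# Features whose shrunken class centroids coincide across classes carry no
# discriminative information. The simulated data are illustrative only.
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
n_per_class, n_genes, n_informative = 40, 500, 20

# Two classes that differ only in the first 20 "genes".
X0 = rng.normal(0.0, 1.0, size=(n_per_class, n_genes))
X1 = rng.normal(0.0, 1.0, size=(n_per_class, n_genes))
X1[:, :n_informative] += 1.5
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

nsc = NearestCentroid(shrink_threshold=0.5).fit(X, y)

# A feature is effectively discarded when its class centroids collapse
# onto a single value after shrinkage.
kept = np.ptp(nsc.centroids_, axis=0) > 1e-12
print("features kept:", int(kept.sum()), "of", n_genes)
print("informative features kept:", int(kept[:n_informative].sum()))
```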
273

Método de mineração de dados para diagnóstico de câncer de mama baseado na seleção de variáveis / A data mining method for breast cancer diagnosis based on selected features

Holsbach, Nicole January 2012 (has links)
A presente dissertação propõe métodos para mineração de dados para diagnóstico de câncer de mama (CM) baseados na seleção de variáveis. Partindo-se de uma revisão sistemática, sugere-se um método para a seleção de variáveis para classificação das observações (pacientes) em duas classes de resultado, benigno ou maligno, baseado na análise citopatológica de amostras de células da mama de pacientes. O método de seleção de variáveis para categorização das observações baseia-se em 4 passos operacionais: (i) dividir o banco de dados original em porções de treino e de teste, e aplicar a ACP (Análise de Componentes Principais) na porção de treino; (ii) gerar índices de importância das variáveis baseados nos pesos da ACP e na percentagem da variância explicada pelos componentes retidos; (iii) classificar a porção de treino utilizando as técnicas KVP (k-vizinhos mais próximos) ou AD (Análise Discriminante). Em seguida eliminar a variável com o menor índice de importância, classificar o banco de dados novamente e calcular a acurácia de classificação; continuar tal processo iterativo até restar uma variável; e (iv) selecionar o subgrupo de variáveis responsável pela máxima acurácia de classificação e classificar a porção de teste utilizando tais variáveis. Quando aplicado ao WBCD (Wisconsin Breast Cancer Database), o método proposto apresentou acurácia média de 97,77%, retendo uma média de 5,8 variáveis. Uma variação do método é proposta, utilizando quatro diferentes tipos de kernels polinomiais para remapear o banco de dados original; os passos (i) a (iv) acima descritos são então aplicados aos kernels propostos. Ao aplicar-se a variação do método ao WBCD, obteve-se acurácia média de 98,09%, retendo uma média de 17,24 variáveis de um total de 54 variáveis geradas pelo kernel polinomial recomendado. O método proposto pode auxiliar o médico na elaboração do diagnóstico, selecionando um menor número de variáveis (envolvidas na tomada de decisão) com a maior acurácia, obtendo assim o maior acerto possível. / This dissertation presents a data mining method for breast cancer (BC) diagnosis based on selected features. We first carried out a systematic literature review and then proposed a method for feature selection and classification of observations, i.e., patients, into benign or malignant classes based on patients' breast tissue measures. The proposed method relies on four operational steps: (i) split the original dataset into training and testing sets and apply PCA (Principal Component Analysis) to the training set; (ii) generate feature importance indices based on the PCA weights and the percentage of variance explained by the retained components; (iii) classify the training set using the KNN (k-Nearest Neighbor) or DA (Discriminant Analysis) techniques, then eliminate the feature with the lowest importance index, classify the dataset again and recompute the accuracy, continuing this iterative process until a single feature is left; and (iv) choose the subset of features yielding the maximum classification accuracy and classify the testing set based on those features. When applied to the WBCD (Wisconsin Breast Cancer Database), the proposed method led to an average classification accuracy of 97.77% while retaining 5.8 features on average. A variation of the proposed method is also presented, based on four different types of polynomial kernels aimed at remapping the original database; steps (i) to (iv) are then applied to such kernels. When applied to the WBCD, the proposed modification increased the average accuracy to 98.09% while retaining an average of 17.24 features out of the 54 variables generated by the recommended kernel. The proposed method can assist the physician in making the diagnosis, selecting a smaller number of variables (those involved in the decision-making) while achieving the highest possible classification accuracy.
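A minimal sketch of the four-step procedure just described, under illustrative assumptions (synthetic data standing in for the WBCD, KNN as the classifier): PCA loadings weighted by explained variance give the importance index, and the least important feature is dropped iteratively.

```python
# Sketch of the 4-step selection method described above. Illustrative
# assumptions: synthetic data in place of the WBCD, KNN as the classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)  # step (i)

pca = PCA(n_components=0.95).fit(X_tr)       # retain 95% of the variance
# Step (ii): importance index = |loading| weighted by explained variance.
importance = np.abs(pca.components_.T) @ pca.explained_variance_ratio_

# Step (iii): drop the least important feature, re-classify, repeat.
active = list(range(X.shape[1]))
history = []
while active:
    knn = KNeighborsClassifier().fit(X_tr[:, active], y_tr)
    history.append((knn.score(X_tr[:, active], y_tr), list(active)))
    worst = min(active, key=lambda j: importance[j])
    active.remove(worst)

# Step (iv): keep the subset with maximum accuracy and classify the test set.
best_acc, best_subset = max(history, key=lambda t: t[0])
knn = KNeighborsClassifier().fit(X_tr[:, best_subset], y_tr)
print(len(best_subset), "features; test accuracy:",
      round(knn.score(X_te[:, best_subset], y_te), 3))
```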
274

Seleção de características apoiada por mineração visual de dados / Feature selection supported by visual data mining

Glenda Michele Botelho 17 February 2011 (has links)
Devido ao crescimento do volume de imagens e, consequentemente, da grande quantidade e complexidade das características que as representam, surge a necessidade de selecionar características mais relevantes que minimizam os problemas causados pela alta dimensionalidade e correlação e que melhoram a eficiência e a eficácia das atividades que utilizarão o conjunto de dados. Existem diversos métodos tradicionais de seleção que se baseiam em análises estatísticas dos dados ou em redes neurais artificiais. Este trabalho propõe a inclusão de técnicas de mineração visual de dados, particularmente, projeção de dados multidimensionais, para apoiar o processo de seleção. Projeção de dados busca mapear dados de um espaço m-dimensional em um espaço p-dimensional, p < m e geralmente igual a 2 ou 3, preservando ao máximo as relações de distância existentes entre os dados. Tradicionalmente, cada imagem é representada por um ponto e pontos projetados próximos uns aos outros indicam agrupamentos de imagens que compartilham as mesmas propriedades. No entanto, este trabalho propõe a projeção de características. Dessa forma, ao selecionarmos apenas algumas amostras de cada agrupamento da projeção, teremos um subconjunto de características, configurando um processo de seleção. A qualidade dos subconjuntos de características selecionados é avaliada comparando-se as projeções obtidas para estes subconjuntos com a projeção obtida com o conjunto original de dados. Isto é feito quantitativamente, por meio da medida de silhueta, e qualitativamente, pela observação visual da projeção. Além da seleção apoiada por projeção, este trabalho propõe um aprimoramento no seletor de características baseado no cálculo de saliências de uma rede neural Multilayer Perceptron. Esta alteração, que visa selecionar características mais discriminantes e reduzir a quantidade de cálculos para se obter as saliências, utiliza informações provenientes dos agrupamentos de características, de forma a alterar a topologia da rede neural em que se baseia o seletor. Os resultados mostraram que a seleção de características baseada em projeção obtém subconjuntos capazes de gerar novas projeções com qualidade visual satisfatória. Em relação ao seletor por saliência proposto, este também gera subconjuntos responsáveis por altas taxas de classificação de imagens e por novas projeções com bons valores de silhueta. / Due to the ever-growing amount of digital images and, consequently, the quantity and complexity of their features, there is a need to select the most relevant features, so that problems caused by high-dimensional, correlated data sets can be minimized and the efficiency and efficacy of the tasks that employ such features can be enhanced. Many feature selection methods are based on statistical analysis or on neural network approaches. This work proposes the addition of visual data mining techniques, particularly multidimensional data projection approaches, to aid the feature selection process. Multidimensional data projection seeks to map an m-dimensional data space onto a p-dimensional space, with p < m and usually 2 or 3, while preserving the distance relationships among data instances. Traditionally, each image is represented by a point, and points projected close to each other indicate clusters of images which share common properties. However, this work proposes the projection of features.
Hence, if we select only a few samples from each cluster of features in the projection, we end up with a subset of features, yielding a feature selection process. The quality of a selected feature subset is assessed by comparing the projection obtained from that subset with the projection obtained from the original data set. This can be done quantitatively, by means of the silhouette measure, and qualitatively, by visual inspection of the projection. In addition to projection-based selection, this work proposes an enhancement to the salience-based feature selector built on a Multilayer Perceptron neural network. This enhancement, which aims to select more discriminant features while reducing the amount of computation needed to obtain the saliences, employs information from the feature clusters to change the topology of the neural network on which the selector is based. Results show that projection-based feature selection produces subsets capable of generating new data projections of satisfactory visual quality. The proposed salience-based selector likewise yields subsets responsible for high image classification rates and for new projections with good silhouette values.
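A hedged sketch of the core idea, projecting the features rather than the samples and keeping one representative per group, under illustrative assumptions: PCA as the projection technique and k-means as the grouping step, neither of which is necessarily what the thesis uses.

```python
# Sketch: feature selection by projecting *features* instead of samples.
# Assumptions: PCA as the 2-D projection, k-means as the grouping step,
# and the medoid of each cluster kept as its representative.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_digits().data                    # samples x features
F = X.T                                   # one row per feature

proj = PCA(n_components=2).fit_transform(F)     # 2-D map of the features
k = 10
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(proj)

selected = []
for c in range(k):
    members = np.where(labels == c)[0]
    center = proj[members].mean(axis=0)
    # medoid: the member feature closest to the cluster centre
    medoid = members[np.argmin(np.linalg.norm(proj[members] - center, axis=1))]
    selected.append(int(medoid))

print("selected feature indices:", sorted(selected))
X_reduced = X[:, selected]                # dataset restricted to 10 features
```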
275

Algoritmo genético compacto com dominância para seleção de variáveis / Compact genetic algorithm with dominance for variable selection

Nogueira, Heber Valdo 20 April 2017 (has links)
The feature selection problem consists in selecting a subset of attributes able to reduce processing and storage requirements, decrease the effects of the curse of dimensionality and improve the performance of predictive models. Among the strategies used to solve this type of problem, evolutionary algorithms such as the Genetic Algorithm stand out. Despite the relative success of the Genetic Algorithm on a variety of problems, different improvements have been proposed to enhance its performance, focusing mainly on population representation, search mechanisms and evaluation methods. One of these proposals is the Compact Genetic Algorithm (CGA), which introduces new ways of representing the population and of guiding the search for better solutions. Applying this type of strategy to the variable selection problem often leads to overfitting; several studies in the area have indicated that a multi-objective approach can mitigate this kind of problem. In this context, this work proposes a version of the Compact Genetic Algorithm that minimizes more than one objective simultaneously. The algorithm uses the concept of Pareto dominance and is therefore called the Compact Genetic Algorithm with Dominance (CGA-D). As a case study, to evaluate the performance of the proposed algorithm, CGA-D is combined with Multiple Linear Regression (MLR) to select variables that better predict protein concentration in wheat samples. The proposed algorithm is compared with the CGA and with a mutation-based Compact Genetic Algorithm. The results indicate that CGA-D is able to select a small set of variables, reducing the prediction error of the calibration model and the possibility of overfitting. / O problema de seleção de variáveis consiste em selecionar um subconjunto de atributos que seja capaz de reduzir os recursos computacionais de processamento e armazenamento, diminuir os efeitos da maldição da dimensionalidade e melhorar a performance de modelos de predição. Dentre as estratégias utilizadas para solucionar esse tipo de problema, destacam-se os algoritmos evolutivos, como o Algoritmo Genético. Apesar do relativo sucesso do Algoritmo Genético na solução de variados tipos de problemas, diferentes propostas de melhoria têm sido apresentadas no sentido de aprimorar seu desempenho. Tais melhorias focam, sobretudo, na representação da população, nos mecanismos de busca e nos métodos de avaliação. Em uma dessas propostas, surgiu o Algoritmo Genético Compacto (AGC), que propõe novas formas de representar a população e de conduzir a busca por melhores soluções.
A aplicação desse tipo de estratégia para solucionar o problema de seleção de variáveis muitas vezes implica em overfitting. Diversas pesquisas na área têm indicado que a abordagem multiobjetivo pode ser capaz de mitigar esse tipo de problema. Nesse contexto, este trabalho propõe a implementação de uma versão do Algoritmo Genético Compacto capaz de minimizar mais de um objetivo simultaneamente. Tal algoritmo faz uso do conceito de dominância de Pareto e, por isso, é chamado de Algoritmo Genético Compacto com Dominância (AGC-D). Como estudo de caso, para avaliar o desempenho dos algoritmos propostos, o AGC-D é combinado com a Regressão Linear Múltipla (RLM) com o objetivo de selecionar variáveis para melhor predizer a concentração de proteína em amostras de trigo. O algoritmo proposto é comparado ao AGC e ao AGC com operador de mutação. Os resultados obtidos indicam que o AGC-D é capaz de selecionar um pequeno conjunto de variáveis, reduzindo o erro de predição do modelo de calibração e minimizando a possibilidade de overfitting.
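A compact GA replaces the explicit population with a probability vector. The sketch below is an illustrative reconstruction, not the thesis' AGC-D: the synthetic regression data, the two objectives (least-squares error and subset size) and the update rule are all assumptions. Each generation samples two candidate feature masks, lets them compete by Pareto dominance, and nudges the probability vector toward the winner.

```python
# Sketch of a compact GA with Pareto dominance for variable selection.
# Illustrative assumptions: synthetic regression data, ordinary
# least-squares error as the first objective, subset size as the second.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 30
X = rng.normal(size=(n, d))
beta = np.zeros(d)
beta[:5] = rng.normal(size=5)           # only 5 truly useful variables
y = X @ beta + 0.1 * rng.normal(size=n)

def error(mask):
    if not mask.any():
        return np.inf
    coef, *_ = np.linalg.lstsq(X[:, mask], y, rcond=None)
    return np.mean((y - X[:, mask] @ coef) ** 2)

def dominates(a, b):
    """Pareto dominance: no worse in every objective, strictly better in one."""
    return all(x <= z for x, z in zip(a, b)) and a != b

p = np.full(d, 0.5)                     # probability vector = "population"
step = 1.0 / 50                         # virtual population size of 50
for _ in range(2000):
    m1, m2 = rng.random(d) < p, rng.random(d) < p
    f1 = (error(m1), int(m1.sum()))
    f2 = (error(m2), int(m2.sum()))
    if dominates(f2, f1):
        m1, m2 = m2, m1                 # m1 is now the winner
    elif not dominates(f1, f2):
        continue                        # mutually non-dominated: no update
    p += step * (m1.astype(float) - m2.astype(float))
    np.clip(p, 0.02, 0.98, out=p)       # keep exploration alive

print("selection probabilities, true variables:", np.round(p[:5], 2))
print("spurious variables (mean):", round(float(p[5:].mean()), 2))
```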
276

Prediction of material properties based on non-destructive Barkhausen noise measurement

Sorsa, A. (Aki) 22 January 2013 (has links)
Abstract Barkhausen noise measurement is an intriguing non-destructive testing method suitable for ferromagnetic materials. It is based on the stochastic movements of magnetic domain walls when the tested sample is placed in an external varying magnetic field. Barkhausen noise is typically utilised by calculating features from the signal and comparing them with the studied material properties. Typical features are, for example, the root-mean-square value (RMS) and the peak height, width and position. Better utilisation of the method, however, requires quantitative predictions of material properties. The aim of this thesis is to study and select a suitable methodology for the quantitative prediction of material properties based on Barkhausen noise measurement. The prediction task is divided into four steps: feature generation, feature selection, model identification and model validation. In feature generation, a large set of features is calculated with different mathematical procedures. This feature set is explored in the feature selection step to find the features most significant for prediction. A model with the selected features is identified, and independent data are usually used for model validation. This thesis presents the procedures developed for feature generation and the results of studies using different feature selection strategies and modelling techniques. The studied feature selection methods are forward selection, simulated annealing and genetic algorithms. In addition, two-step algorithms are investigated, in which a pre-selection step is used before the actual selection. The modelling techniques used are multivariable linear regression, partial least squares regression, principal component regression and artificial neural networks. The studies also consider the use and effect of different objective functions. The results show that the proposed modelling scheme can be used for the prediction task. The identified models mainly include reasonable terms, and the prediction accuracy is fairly good considering the challenge. However, the application of Barkhausen noise measurement is very case-dependent, and thus conflicts with the literature may occur. Furthermore, changes in unmeasured material properties may lead to unexpected behaviour of some features. The results show that linear models are adequate for capturing the major interactions between material properties and Barkhausen noise, but indicate that the use of neural networks would lead to better model performance. The results also show that genetic algorithms give better selection results, but at the expense of higher computational cost. / Tiivistelmä Barkhausen-kohina-mittaus on ferromagneettisille materiaaleille soveltuva materiaalia rikkomaton testausmenetelmä. Mittaus perustuu magneettisten alueiden välisten rajapintojen stokastisiin liikkeisiin, kun testattava kappale asetetaan vaihtuvaan magneettikenttään. Tyypillisesti Barkhausen-kohina-mittaussignaalista lasketaan piirteitä, joita sitten verrataan tutkittaviin materiaaliominaisuuksiin. Usein käytettyjä piirteitä ovat signaalin keskineliön neliöjuuri (RMS-arvo) sekä piikin korkeus, leveys ja paikka. Menetelmää voidaan soveltaa paremmin, jos tutkittavia materiaaliominaisuuksia voidaan ennustaa kvantitatiivisesti. Tämän tutkimuksen tavoitteena on tutkia ja valita menetelmiä, jotka soveltuvat materiaaliominaisuuksien kvantitatiiviseen ennustamiseen Barkhausen-kohina-mittauksen perusteella.
Ennustusmallit luodaan neljässä vaiheessa: piirteiden laskenta, piirteiden valinta, mallin identifiointi ja mallin validointi. Piirteiden laskennassa yhdistellään erilaisia matemaattisia laskutoimituksia, joista tuloksena saadaan suuri joukko erilaisia piirteitä. Tästä joukosta valitaan ennustukseen soveltuvimmat piirteiden valinta -vaiheessa. Tämän jälkeen ennustusmalli identifioidaan ja viimeisessä vaiheessa sen toimivuus todennetaan riippumattomalla testausaineistolla. Väitöskirjassa esitetään piirteiden laskentaan kehitettyjä algoritmeja sekä mallinnustuloksia käytettäessä erilaisia piirteiden valintamenetelmiä ja mallinnustekniikoita. Tutkitut valintamenetelmät ovat eteenpäin valinta, taaksepäin eliminointi, simuloitu jäähtyminen ja geneettiset algoritmit. Väitöskirjassa esitellään myös kaksivaiheisia valintamenettelyjä, joissa ennen varsinaista piirteiden valintaa suoritetaan esivalinta. Käytetyt mallinnustekniikat ovat monimuuttujaregressio, osittainen pienimmän neliösumman regressio, pääkomponenttiregressio ja neuroverkot. Tarkasteluissa huomioidaan myös erilaisten kustannusfunktioiden vaikutukset. Esitetyt tulokset osoittavat, että käytetyt menetelmät soveltuvat materiaaliominaisuuksien kvantitatiiviseen ennustamiseen. Identifioidut mallit sisältävät pääasiassa perusteltavia termejä ja mallinnustarkkuus on tyydyttävä. Barkhausen-kohina-mittaus on kuitenkin erittäin tapauskohtainen ja täten ristiriitoja kirjallisuuden kanssa voidaan joskus havaita. Näihin ristiriitoihin vaikuttavat myös ei-mitattavat muutokset materiaaliominaisuuksissa. Esitetyt tulokset osoittavat, että lineaariset mallit kykenevät ennustamaan suurimmat vuorovaikutukset materiaaliominaisuuksien ja Barkhausen-kohinan välillä. Tulokset kuitenkin viittaavat siihen, että neuroverkoilla päästäisiin vielä parempiin mallinnustuloksiin. Tulokset osoittavat myös, että geneettiset algoritmit toimivat piirteiden valinnassa paremmin kuin muut tutkitut menetelmät.
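As an illustration of the simplest of the selection strategies studied, the sketch below implements sequential forward selection with a linear regression model and cross-validated RMSE as the objective; the synthetic data and the choice of objective function are assumptions, not the thesis' exact setup.

```python
# Sketch: sequential forward selection of signal features for a linear
# regression model, scored by cross-validated RMSE. Synthetic data and
# the RMSE objective are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 300, 40                       # 40 candidate Barkhausen-noise features
X = rng.normal(size=(n, d))
y = 2 * X[:, 0] - X[:, 3] + 0.5 * X[:, 7] + 0.1 * rng.normal(size=n)

def cv_rmse(cols):
    scores = cross_val_score(LinearRegression(), X[:, cols], y, cv=5,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

selected, remaining = [], list(range(d))
best_rmse = np.inf
while remaining:
    # try adding each remaining feature; keep the best single addition
    cand = min(remaining, key=lambda j: cv_rmse(selected + [j]))
    rmse = cv_rmse(selected + [cand])
    if rmse >= best_rmse:            # stop when no addition improves the fit
        break
    best_rmse = rmse
    selected.append(cand)
    remaining.remove(cand)

print("selected features:", selected, "CV-RMSE:", round(best_rmse, 3))
```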
277

Feature Engineering and Machine Learning for Driver Sleepiness Detection

Keelan, Oliver, Mårtensson, Henrik January 2017 (has links)
Falling asleep while operating a moving vehicle is a contributing factor in road accident statistics: it has been estimated that 20% of all accidents involving a vehicle are due to sleepiness behind the wheel. Preventing accidents and saving lives are of utmost importance. In this thesis, given the world's largest dataset of driver participants, two methods of evaluating driver sleepiness have been evaluated. The first method was based on the creation of epochs from lane departures and KSS, whilst the second method was based solely on the creation of epochs from KSS. From the epochs, a number of features were extracted from both physiological signals and the car's controller area network. The most important features were selected via a feature selection step using sequential forward floating selection. The selected features were trained and evaluated on linear SVM, Gaussian SVM, KNN, random forest and AdaBoost; the random forest classifier was chosen in all cases for classifying previously unseen data. The results show that method 1 was prone to overfitting. Method 2 proved considerably better and did not suffer from overfitting; its test results were: sensitivity = 80.3%, specificity = 96.3% and accuracy = 93.5%. The most prominent features overall were found in the EEG and EOG domains, together with the sleep/wake predictor feature. There are also indications that complexity measures, especially Higuchi's fractal dimension, might contribute to the detection of sleepiness.
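Higuchi's fractal dimension, singled out above, can be computed directly on a signal window. The sketch below is a standard implementation of Higuchi's algorithm; the k_max value and the test signals are illustrative assumptions.

```python
# Sketch: Higuchi's fractal dimension, the complexity feature highlighted
# above, computed over a 1-D signal window (e.g. an EEG epoch). k_max and
# the synthetic test signals are illustrative assumptions.
import numpy as np

def higuchi_fd(x, k_max=10):
    x = np.asarray(x, dtype=float)
    N = x.size
    log_inv_k, log_L = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):
            sub = x[m::k]                       # x(m), x(m+k), x(m+2k), ...
            if sub.size < 2:
                continue
            # normalised curve length for starting offset m
            L = np.abs(np.diff(sub)).sum() * (N - 1) / ((sub.size - 1) * k**2)
            lengths.append(L)
        log_inv_k.append(np.log(1.0 / k))
        log_L.append(np.log(np.mean(lengths)))
    # the fractal dimension is the slope of log L(k) against log(1/k)
    slope, _ = np.polyfit(log_inv_k, log_L, 1)
    return slope

rng = np.random.default_rng(0)
print(higuchi_fd(rng.normal(size=1000)))             # white noise: near 2
print(higuchi_fd(np.cumsum(rng.normal(size=1000))))  # random walk: near 1.5
```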
278

Optimisation de la configuration d'un instrument superspectral aéroporté pour la classification : application au milieu urbain / Spectral optimization to design a superspectral sensor : application to urban areas

Le Bris, Arnaud 07 December 2015 (has links)
Ce travail s'inscrit dans la perspective de l'enrichissement des bases de données d'occupation du sol. La description de l'occupation du sol permet de produire des indicateurs environnementaux pour la gestion des écosystèmes et des territoires, en réponse à des besoins sociétaux, réglementaires et scientifiques. Aussi, des bases de données décrivant l'occupation du sol existent à différents niveaux (local, national, européen) ou sont en cours de constitution. Il est toutefois apparu que la connaissance de l'occupation du sol nécessaire pour certaines applications de modélisation de la ville (simulateurs de micro-météorologie, d'hydrologie, ou de suivi de pollutions), voire de suivi réglementaire (imperméabilisation des sols) est plus fine (au niveau sémantique et géométrique) que ce que contiennent ces bases de données. Des cartes de matériaux sont donc nécessaires pour certaines applications. Elles pourraient constituer une couche supplémentaire, à la fois dans des bases de données sur l'occupation du sol (comme l'occupation du sol à grande échelle de l'IGN) et dans des maquettes urbaines 3D.Aucune base de données existante ne contenant cette information, la télédétection apparaît comme la seule solution pour la produire. Néanmoins, du fait de la forte hétérogénéité des matériaux, de leur variabilité, mais aussi des fortes ressemblances entre classes distinctes, il apparaît que les capteurs optiques multispectraux classiques (limités aux 4 canaux rouge - vert - bleu - proche infrarouge) sont insuffisants pour bien discriminer des matériaux. Un capteur dit superspectral, c'est-à-dire plus riche spectralement, pourrait apporter une solution à cette limite. Ce travail s'est donc positionné dans l'optique de la conception d'un tel capteur et a consisté à identifier la meilleure configuration spectrale pour la classification des matériaux urbains, ou du moins à proposer des solutions s'en approchant. Un travail d'optimisation spectrale a donc été réalisé afin d'optimiser à la fois la position des bandes dans le spectre ainsi que leur largeur. Le travail s'est déroulé en deux temps. Une première tâche a consisté à définir et préciser les méthodes d'optimisation de bandes, et à les valider sur des jeux de données de référence de la littérature. Deux heuristiques d'optimisation classiques (l'une incrémentale, l'autre stochastique) ont été choisies du fait de leur généricité et de leur flexibilité, et donc de leur capacité à être utilisées pour différents critères de sélection d'attributs. Une comparaison de différentes mesures de la pertinence d'un jeu de bandes a été effectuée afin de définir le score à optimiser lors du processus de sélection de bandes. L'optimisation de la largeur des bandes a ensuite été étudiée : la méthode proposée consiste à préalablement construire une hiérarchie de bandes fusionnées en fonction de leur similarité, le processus de sélection de bandes se déroulant ensuite au sein de cette hiérarchie. La seconde partie du travail a consisté en l'application de ces algorithmes d'optimisation spectrale au cas d'étude des matériaux urbains. Une collection de spectres de matériaux urbains a d'abord été réunie à partir de différentes librairies spectrales (ASTER, MEMOIRES, ...). L'optimisation spectrale a ensuite été menée à partir de ce jeu de données. Il est apparu qu'un nombre limité de bandes bien choisies suffisait pour discriminer 9 classes de matériaux communs (ardoise - asphalte - ciment - gravier - métal - pavés en pierre - shingle - terre – tuile). 
L'apport de bandes issues du domaine de l'infrarouge onde courte (1400 - 2500 nm) pour la discrimination des matériaux a également été vérifié. La portée des résultats chiffrés obtenus en termes de confusions entre les matériaux reste toutefois à nuancer du fait de la très faible représentation de certains matériaux dans la librairie de spectres collectés, ne couvrant donc pas la totalité de leur variabilité. / This work was performed in the context of a possible enrichment of land cover databases. The description of land cover makes it possible to produce environmental indicators for the management of ecosystems and territories, in response to various societal, regulatory and scientific needs. Thus, different land cover databases already exist at various levels (global, European, national, regional or local) or are currently being produced. However, it appeared that knowledge about land cover should be more detailed in urban areas, since this is required by several city modelling applications (micro-meteorological, hydrological or pollution monitoring simulators) and by public regulation monitoring (e.g. concerning ground perviousness). Such material maps would be (both semantically and spatially) finer than what is contained in existing land cover databases. Therefore, they could be an additional layer, both in land cover databases (such as the IGN high resolution land cover database) and in 3D city models. No existing database contains such information about urban materials, so remote sensing is the only solution to produce it. However, due to the high heterogeneity of urban materials, their variability, but also the strong similarities between different material classes, usual optical multispectral sensors (with only the 4 red - green - blue - near infrared bands) are not sufficient to reach a good discrimination of materials. A superspectral sensor, that is to say a spectrally richer one, could therefore overcome this limit. Thus, this work was performed with the design of such a sensor in mind. It aimed at identifying the best spectral configuration for the classification of urban materials, or at least at proposing near-optimal solutions. In other words, a spectral optimization was carried out in order to optimize both the position of the bands in the spectrum and their width, using automatic feature selection methods. This work was performed in two steps. A first task aimed at defining the spectral optimization methods and at validating them on reference data sets from the literature. Two state-of-the-art optimization heuristics (Sequential Forward Floating Search and genetic algorithms) were chosen owing to their genericity and flexibility, and therefore their ability to optimize different feature selection criteria. A benchmark of different scores measuring the relevance of a set of bands was performed to decide which score to optimize during the band selection process. Band width optimization was then studied: the proposed method consists in building a hierarchy of bands merged according to their similarities, band selection then being processed within this hierarchy. The second part of the work consisted in applying these spectral optimization algorithms to the case study of urban materials. A collection of urban material spectra was first gathered from various spectral libraries (ASTER, MEMOIRES, ...). Spectral optimization was then performed on this dataset.
A limited number (about 10) of well-chosen bands appeared to be sufficient to classify nine common materials (slate - asphalt - cement - gravel - metal - cobblestones - shingle - earth - tiles). Bands from the short wave infrared domain (1400 - 2500 nm) were confirmed to be very useful for discriminating urban materials. However, the quantitative results assessing the confusions between materials must be considered carefully, since some materials are very uncommon in the library of collected spectra, so that their full variability is not covered.
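A minimal sketch of the band-hierarchy idea under stated assumptions (Pearson correlation as the similarity measure, merging by averaging, synthetic spectra in place of a real library): the most similar pair of adjacent bands is merged repeatedly, producing progressively wider bands within which band selection can then take place.

```python
# Sketch: building a hierarchy of merged spectral bands. Assumptions:
# adjacent bands are merged by averaging, with Pearson correlation as
# the similarity measure; synthetic spectra stand in for a real library.
import numpy as np

rng = np.random.default_rng(0)
n_spectra, n_bands = 200, 60
base = rng.normal(size=(n_spectra, 1))
# neighbouring bands are strongly correlated, as in real reflectance data
spectra = base + np.cumsum(0.3 * rng.normal(size=(n_spectra, n_bands)), axis=1)

bands = [[j] for j in range(n_bands)]          # each leaf = one narrow band
signals = [spectra[:, j].copy() for j in range(n_bands)]

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

hierarchy = []                                 # records the merge order
while len(bands) > 10:                         # stop at 10 wide bands
    # find the most similar pair of *adjacent* bands
    i = max(range(len(bands) - 1),
            key=lambda t: corr(signals[t], signals[t + 1]))
    hierarchy.append((bands[i], bands[i + 1]))
    merged = bands[i] + bands[i + 1]           # union of original bands
    bands[i:i + 2] = [merged]
    signals[i:i + 2] = [spectra[:, merged].mean(axis=1)]

print("10 merged bands (as ranges of original band indices):")
print([(b[0], b[-1]) for b in bands])
```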
279

Développement de stratégies de test pour les systèmes de communications millimétriques / Development of test strategies for millimeter communications systems

Verdy, Matthieu 22 September 2016 (has links)
L’objectif de cette thèse est de développer une stratégie de test globale pour réduire le coût du test tout en garantissant une couverture de test complète. On s’intéressera plus particulièrement aux communications millimétriques à base de modulation OFDM. Les investigations devront être orientées vers l’implémentation de « BIST » dans le circuit pour relaxer les contraintes sur l’environnement de test. L’environnement de test est composé de l’ATE et de l’interface de test. Pour relaxer les contraintes sur l’environnement de test et ainsi réduire le coût du test, notre approche est d’opter pour un « ATE » standard et d’implémenter le minimum possible de composants dans l’interface de test. Les spécifications des BIST et éventuellement des modules à implémenter dans l’interface de test devront être suffisamment précises et réalistes pour permettre une implémentation physique. Pour atteindre ces objectifs, notre approche est de s’appuyer sur les modèles des différents blocs et de procéder à des simulations appropriées pour identifier d’abord les paramètres de test pertinents, et ensuite proposer une solution de test qui permet de mesurer chaque paramètre. Les paramètres de test pertinents sont les paramètres qui permettent de tester le système de communication en un temps minimal avec une couverture de test convenable. Ces paramètres de test peuvent être déterminés en combinant le test fonctionnel au test structurel. Le test fonctionnel permet de détecter l’existence de fautes catastrophiques en un minimum de temps et le test structurel permet de localiser les fautes catastrophiques et de déterminer les performances individuelles des blocs critiques pour améliorer le rendement. Pour le test structurel, les performances individuelles des blocs critiques peuvent être déterminées directement au moyen de BIST dédiés ou indirectement en procédant à une corrélation entre les paramètres des blocs et un paramètre global tel que l’EVM ou tout autre type de paramètre adapté. / The goal of this thesis is to develop a global test strategy in order to reduce test cost while guaranteeing full test coverage, with a particular focus on OFDM-based millimeter-wave communications. The investigation is oriented toward implementing BIST in the circuit in order to relax the constraints on the test environment, which consists of the ATE and the test interface. To relax these constraints and thus reduce test cost, our approach is to use a standard ATE and to implement as few components as possible on the test interface. The specifications of the BIST, and possibly of the modules to be implemented in the test interface, must be precise and realistic enough to allow a physical implementation. To reach these goals, we first rely on models of the different blocks and on appropriate simulations to identify the relevant test parameters, and then propose a test solution that allows each of them to be measured. The relevant test parameters are those that allow the system to be tested in minimal time with suitable test coverage. They can be determined by combining functional testing with structural testing: functional testing detects catastrophic faults in a minimum of time, while structural testing localizes them and determines the individual performance of critical blocks in order to improve yield. For the structural test, the individual performance of critical blocks can be determined directly by means of dedicated BIST, or indirectly by correlating block parameters with a global parameter such as EVM or any other suitable parameter.
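EVM, cited above as a global correlation parameter, has a compact definition: the RMS magnitude of the error vectors between received symbols and their ideal constellation points, normalized by the constellation power. A hedged sketch follows; QPSK and the impairment model are illustrative assumptions.

```python
# Sketch: RMS error vector magnitude (EVM) of received symbols against
# an ideal constellation. QPSK and the additive-noise-plus-phase-error
# channel are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Ideal QPSK constellation and a random transmitted symbol stream.
constellation = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
tx = constellation[rng.integers(0, 4, size=10_000)]

# Impaired reception: additive noise plus a small phase error.
rx = tx * np.exp(1j * 0.03) + 0.05 * (rng.normal(size=tx.size)
                                      + 1j * rng.normal(size=tx.size))

# Error vector of each symbol against its nearest ideal point.
nearest = constellation[np.argmin(np.abs(rx[:, None] - constellation), axis=1)]
evm_rms = np.sqrt(np.mean(np.abs(rx - nearest) ** 2)
                  / np.mean(np.abs(constellation) ** 2))
print(f"EVM: {100 * evm_rms:.2f}% ({20 * np.log10(evm_rms):.1f} dB)")
```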
280

Avaliação de métodos não-supervisionados de seleção de atributos para mineração de textos / Evaluation of unsupervised feature selection methods for Text Mining

Bruno Magalhães Nogueira 27 March 2009 (has links)
Selecionar atributos é, por vezes, uma atividade necessária para o correto desenvolvimento de tarefas de aprendizado de máquina. Em Mineração de Textos, reduzir o número de atributos em uma base de textos é essencial para a eficácia do processo e a compreensibilidade do conhecimento extraído, uma vez que se lida com espaços de alta dimensionalidade e esparsos. Quando se lida com contextos nos quais a coleção de textos é não-rotulada, métodos não-supervisionados de redução de atributos são utilizados. No entanto, não existe forma geral predefinida para a obtenção de medidas de utilidade de atributos em métodos não-supervisionados, demandando um esforço maior em sua realização. Assim, este trabalho aborda a seleção não-supervisionada de atributos por meio de um estudo exploratório de métodos dessa natureza, comparando a eficácia de cada um deles na redução do número de atributos em aplicações de Mineração de Textos. Dez métodos são comparados - Ranking por Term Frequency, Ranking por Document Frequency, Term Frequency-Inverse Document Frequency, Term Contribution, Term Variance, Term Variance Quality, Método de Luhn, Método LuhnDF, Método de Salton e Zone-Scored Term Frequency - sendo dois deles aqui propostos - Método LuhnDF e Zone-Scored Term Frequency. A avaliação se dá em dois focos, supervisionado, pela medida de acurácia de quatro classificadores (C4.5, SVM, KNN e Naïve Bayes), e não-supervisionado, por meio da medida estatística de Expected Mutual Information Measure. Aos resultados de avaliação, aplica-se o teste estatístico de Kruskal-Wallis para determinação de significância estatística na diferença de desempenho dos diferentes métodos de seleção de atributos comparados. Seis bases de textos são utilizadas nas avaliações experimentais, cada uma relativa a um grande domínio e contendo subdomínios, os quais correspondiam às classes usadas para avaliação supervisionada. Com esse estudo, este trabalho visa contribuir com uma aplicação de Mineração de Textos que visa extrair taxonomias de tópicos a partir de bases textuais não-rotuladas, selecionando os atributos mais representativos em uma coleção de textos. Os resultados das avaliações mostram que não há diferença estatística significativa entre os métodos não-supervisionados de seleção de atributos comparados. Além disso, comparações desses métodos não-supervisionados com outros supervisionados (Razão de Ganho e Ganho de Informação) apontam que é possível utilizar os métodos não-supervisionados em atividades supervisionadas de Mineração de Textos, obtendo eficiência compatível com os métodos supervisionados, dado que não se detectou diferença estatística nessas comparações, e com um custo computacional menor. / Feature selection is an activity sometimes necessary to obtain good results in machine learning tasks. In Text Mining, reducing the number of features in a text base is essential for the effectiveness of the process and the comprehensibility of the extracted knowledge, since it deals with high-dimensional and sparse spaces. When dealing with contexts in which the text collection is not labeled, unsupervised methods for feature reduction have to be used. However, there are no general predefined feature quality measures for unsupervised methods, which demands a greater effort for their execution. This work therefore addresses unsupervised feature selection through an exploratory study of methods of this kind, comparing their efficacy in reducing the number of features in the Text Mining process.
Ten methods are compared - Ranking by Term Frequency, Ranking by Document Frequency, Term Frequency-Inverse Document Frequency, Term Contribution, Term Variance, Term Variance Quality, Luhn's Method, LuhnDF Method, Salton's Method and Zone-Scored Term Frequency - two of which are proposed in this work: the LuhnDF Method and Zone-Scored Term Frequency. The evaluation is done in two ways: supervised, through the accuracy of four classifiers (C4.5, SVM, KNN and Naïve Bayes), and unsupervised, using the Expected Mutual Information Measure. The evaluation results are submitted to the Kruskal-Wallis statistical test in order to determine the statistical significance of the performance differences between the feature selection methods compared. Six text bases are used in the experimental evaluation, each related to one broad domain and containing subdomains, which correspond to the classes used for the supervised evaluation. Through this study, this work aims to contribute to a Text Mining application that extracts topic taxonomies from unlabeled text collections by selecting the most representative features in a text collection. The evaluation results show that there is no statistically significant difference between the unsupervised feature selection methods compared. Moreover, comparisons of these unsupervised methods with supervised ones (Gain Ratio and Information Gain) show that it is possible to use unsupervised methods in supervised Text Mining activities, with efficiency comparable to that of the supervised methods (no statistical difference was detected in these comparisons) and at a lower computational cost.
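Term Variance, one of the compared measures, needs no class labels: a term whose frequency varies strongly across documents is taken to be more informative than one spread uniformly. A short sketch under stated assumptions (toy corpus, arbitrary top-k cut-off) follows.

```python
# Sketch: unsupervised term selection by Term Variance. Terms whose
# frequency varies across documents are kept; uniformly distributed terms
# are dropped. The toy corpus and the top-k cut-off are assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "machine learning for text mining",
    "feature selection reduces the feature space in text mining",
    "unsupervised methods need no labels",
    "supervised methods use class labels for feature selection",
]

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray().astype(float)   # document-term matrix

# Term Variance: variance of each term's frequency over the documents.
tv = X.var(axis=0)

k = 5
top = np.argsort(tv)[::-1][:k]
terms = vec.get_feature_names_out()
print("top terms by Term Variance:", [terms[i] for i in top])
```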
