1 |
Categorização de dados quantitativos para estudos de diversidade genética / Categorization quantitative data for studies of genetic diversity
Barroso, Natália Caixeta, 15 December 2010
Previous issue date: 2010-12-15 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / The study of genetic diversity is an important tool for identifying genetically divergent individuals which, when combined, can increase the heterotic effect in the progeny. A statistical technique widely applied in this type of study is cluster analysis. Before it can be applied, however, a similarity (or distance) matrix between the genotypes must be obtained. These distances can be calculated in several ways, and different proposals are found in the literature for quantitative, binary and multicategorical variables. Transforming quantitative variables into multicategorical ones can facilitate their characterization with useful preliminary information. Several methods exist for this transformation, but they need to be better understood so that the loss of information it causes does not significantly harm the results of the analysis. The objectives of this study were therefore: to determine which of these categorization methods are efficient; to investigate the influence of the choice of dissimilarity coefficient on cluster analysis of simulated data using quantitative and multicategorical variables; and to check whether some hierarchical methods group the simulated data efficiently. To this end, 50 simulations were made of ten quantitative variables for twenty genotypes of a reference species such as corn, each with four replications. These data were converted into multicategorical data by the following methods: equitable division of amplitude, equitable percentage, square rule, Sturges rule and normal distribution. The number of classes had to be set for the first two methods; four and five classes were used for both.
The following dissimilarity measures were used to build the distance matrices for the original and multicategorical data: Euclidean distance, average Euclidean distance, squared Euclidean distance, Mahalanobis distance and weighted distance. Clustering was then performed by the nearest neighbor method and by average linkage between groups (UPGMA). Their efficiency was assessed with the cophenetic correlation coefficient, stress, and the degree of distortion between the phenetic and cophenetic matrices. The results showed that UPGMA was superior to the nearest neighbor method for all distance measures used. The Euclidean and average Euclidean distances showed similar performance in all cluster analyses. Moreover, these two measures had the best performance in all the clusterings performed. All data categorization methods achieved satisfactory performance when grouped by UPGMA, except the equitable percentage method with four and five classes. However, the data whose classes were estimated by the square rule gave the dendrogram most similar to the one obtained from the original data, so this is the recommended method for categorizing data.
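The categorization step described in this abstract can be illustrated with a minimal sketch. It is not the thesis's code: it assumes that the "square rule" means the square-root rule (k = sqrt(n)), that Sturges' rule is k = 1 + log2(n), and that both are rounded up; the thesis may use other conventions. The function names and the trait values are hypothetical.

```python
import math

def sturges_classes(n):
    """Number of classes by Sturges' rule, k = 1 + log2(n), rounded up (assumed rounding)."""
    return math.ceil(1 + math.log2(n))

def sqrt_classes(n):
    """Number of classes by the square-root rule, k = sqrt(n), rounded up (assumed rounding)."""
    return math.ceil(math.sqrt(n))

def equal_width(values, k):
    """Equitable division of amplitude: k classes of equal width over the range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum falls on the upper edge, so clamp it into the last class.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency(values, k):
    """Equitable percentage: each class holds roughly the same share of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, i in enumerate(order):
        classes[i] = rank * k // len(values)
    return classes

# Hypothetical trait means for 20 genotypes (illustrative numbers only).
trait = [4.1, 5.3, 6.0, 6.2, 6.8, 7.1, 7.4, 7.5, 7.9, 8.0,
         8.2, 8.6, 8.8, 9.1, 9.5, 9.9, 10.4, 11.0, 11.7, 12.9]

k_sturges = sturges_classes(len(trait))   # classes suggested by Sturges' rule
k_sqrt = sqrt_classes(len(trait))         # classes suggested by the square-root rule
by_width = equal_width(trait, k_sturges)  # class index per genotype
by_share = equal_frequency(trait, 4)      # four classes of five genotypes each
```

The resulting class indices would then replace the original values when the distance matrices are built, which is where the loss of information discussed in the abstract comes from.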
|
2 |
Appariement de formes, recherche par forme clef / Shape matching, shape retrieval
Mokhtari, Bilal, 10 November 2016
This thesis concerns shape matching and shape retrieval. It describes four contributions to this domain. The first is an improvement of the k-means method for finding the best partition of the voxels inside a given shape; the resulting partitions allow shapes to be matched using an optimal matching in a bipartite graph. The second contribution is the fusion of two descriptors, one local and one global, with the product rule. The third contribution considers the complete graph whose vertices are the shapes in the database and the query, and whose edges are labelled with several distances, one per descriptor. The method then computes, by linear programming, the convex combination of distances that maximizes either the sum of the lengths of the shortest paths from the query to all shapes in the database, or the length of the shortest path from the query to the shape currently being compared with it. The fourth contribution perturbs the query shape with a genetic algorithm to bring it closer to the shapes in the database, for one or more given descriptors; this method is massively parallel, and a multi-agent architecture is proposed. These methods are compared with classical methods in the field and achieve better retrieval performance in terms of precision.
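The product-rule fusion mentioned as the second contribution can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how distances are turned into scores, so the mapping 1 / (1 + d) is an assumption, and the function names and distance values are hypothetical.

```python
def to_similarity(distances):
    """Map distances to scores in (0, 1]; larger means more similar (assumed mapping)."""
    return [1.0 / (1.0 + d) for d in distances]

def product_rule(local_dist, global_dist):
    """Fuse two descriptors by multiplying their per-shape similarity scores."""
    local_sim = to_similarity(local_dist)
    global_sim = to_similarity(global_dist)
    return [a * b for a, b in zip(local_sim, global_sim)]

def rank(scores):
    """Indices of database shapes, best match first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical distances from one query to four database shapes.
local_dist = [0.2, 0.9, 0.4, 1.5]   # local descriptor
global_dist = [0.7, 0.1, 0.3, 1.2]  # global descriptor

fused = product_rule(local_dist, global_dist)
ranking = rank(fused)
```

With the product rule, a shape ranks highly only if both descriptors consider it close to the query; here shape 2 comes first even though neither descriptor alone ranks it first.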
|
3 |
Image Analysis Applications of the Maximum Mean Discrepancy Distance Measure
Diu, Michael, January 2013
The need to quantify distance between two groups of objects is prevalent throughout the signal processing world. The difference of group means computed using the Euclidean, or L2 distance, is one of the predominant distance measures used to compare feature vectors and groups of vectors, but many problems arise with it when high data dimensionality is present. Maximum mean discrepancy (MMD) is a recent unsupervised kernel-based pattern recognition method which may improve differentiation between two distinct populations over many commonly used methods such as the difference of means, when paired with the proper feature representations and kernels. MMD-based distance computation combines many powerful concepts from the machine learning literature, such as data distribution-leveraging similarity measures and kernel methods for machine learning.
Due to this heritage, we posit that dissimilarity-based classification and changepoint detection using MMD can lead to enhanced separation between different populations. To test this hypothesis, we conduct studies comparing MMD and the difference of means in two subareas of image analysis and understanding: first, to detect scene changes in video in an unsupervised manner, and secondly, in the biomedical imaging field, using clinical ultrasound to assess tumor response to treatment. We leverage effective computer vision data descriptors, such as the bag-of-visual-words and sparse combinations of SIFT descriptors, and choose from an assessment of several similarity kernels (e.g. Histogram Intersection, Radial Basis Function) in order to engineer useful systems using MMD. Promising improvements over the difference of means, measured primarily using precision/recall for scene change detection, and k-nearest neighbour classification accuracy for tumor response assessment, are obtained in both applications.
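To make the contrast between MMD and the difference of means concrete, here is a minimal sketch of the standard biased estimate of squared MMD with a Radial Basis Function kernel. It is not the thesis's implementation; the `gamma` value, the function names, and the toy feature vectors are assumptions chosen for illustration.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd_squared(xs, ys, kernel=rbf_kernel):
    """Biased estimate of squared MMD between two samples."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2.0 * mean_kernel(xs, ys, kernel))

def mean_difference(xs, ys):
    """Euclidean distance between the two group means, for comparison."""
    dim = len(xs[0])
    mx = [sum(x[d] for x in xs) / len(xs) for d in range(dim)]
    my = [sum(y[d] for y in ys) / len(ys) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mx, my)))

# Hypothetical 2-D feature vectors: both groups share the mean (0, 0)
# but are spread along different axes.
group_a = [(-2.0, 0.0), (2.0, 0.0), (-1.0, 0.0), (1.0, 0.0)]
group_b = [(0.0, -2.0), (0.0, 2.0), (0.0, -1.0), (0.0, 1.0)]

same = mmd_squared(group_a, group_a)       # a sample compared with itself
different = mmd_squared(group_a, group_b)  # two differently shaped samples
means = mean_difference(group_a, group_b)  # difference of group means
```

In this toy case the difference of means is zero because the two groups share a centroid, while the MMD estimate is clearly positive, which is the kind of separation the abstract attributes to MMD.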
|
4 |
Abstraction et comparaison de traces d'exécution pour l'analyse d'applications multimédias embarquées / Abstraction and comparison of execution traces for analysis of embedded multimedia applications
Kamdem Kengne, Christiane, 05 December 2014
The SoC-Trace project aims to develop a set of methods and tools based on execution traces of multicore embedded applications, in order to meet the growing needs for observability and 'debuggability' required by industry. In particular, the project targets the development of new analysis methods drawing on data analysis techniques such as probabilistic analysis, data mining, and data aggregation. These should enable the automatic identification of anomalies, the analysis of complex correlations and dependencies between the components of an embedded application, and control over the large volume of traces, which can now exceed a gigabyte. The aim of the thesis is to provide a high-level, semantics-based representation of the information contained in traces. The first step is to develop an efficient tool for comparing traces and to define a semantic distance suited to traces; the second is to analyze and interpret the results of trace comparisons based on the defined distance.
|