121 |
Avaliação da qualidade de funções de similaridade no contexto de consultas por abrangência / Quality evaluation of similarity functions for range queries
Stasiu, Raquel Kolitski January 2007
In real systems, stored data typically contain inconsistencies caused by typing errors, abbreviations, transposed characters, and similar problems. As a result, different representations of the same real-world object are stored as distinct elements, which causes problems during query processing. This thesis investigates range queries, which find objects that represent the same real-world object as the one being queried. This type of query cannot be processed by exact matching and therefore requires support for querying by similarity. For each query submitted to a given collection, the similarity function produces a ranked list of all elements in the collection, sorted in decreasing order of similarity score with respect to the query object. Only the variations of the query object are relevant and should be returned, so a threshold is needed to split the ranking. The first challenge of range queries is defining a proper threshold. Usually, a human specialist estimates it manually by identifying the relevant and irrelevant elements for each query and then applying measures such as recall and precision (R&P). This heavy dependence on the human specialist makes range queries difficult to use in practice, especially for large collections. The method presented in this thesis therefore estimates R&P at several thresholds with little human intervention. As a by-product, the method also makes it possible to select the most suitable threshold for a similarity function on a given collection. Because similarity functions are imperfect and vary in quality, the similarity function must be evaluated for each collection, as the result depends on the data. A threshold that works for one collection may be totally inappropriate for another, even when the same similarity function is used. As a measure of the quality of similarity functions for range queries, this thesis introduces discernability, which quantifies the ability of a similarity function to separate relevant from irrelevant elements. Compared with mean average precision, discernability captures variations that precision-based measures do not notice, which shows that it is better suited to range queries. An extensive experimental evaluation on real data shows the viability of both the estimation method and the discernability measure for range queries.
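As a minimal sketch of the range query described above, the following Python ranks a collection by similarity to a query object, cuts the ranking at a threshold, and reports precision and recall at several thresholds. The `difflib` ratio is only a stand-in for whatever similarity function is being evaluated, and the function names, sample strings, and relevance judgments are illustrative assumptions, not data or code from the thesis.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; difflib's ratio is an assumed stand-in similarity function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def range_query(query, collection, threshold):
    """Rank the collection by decreasing similarity and keep scores >= threshold."""
    ranked = sorted(
        ((item, similarity(query, item)) for item in collection),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(item, score) for item, score in ranked if score >= threshold]


def precision_recall(returned, relevant):
    """Precision and recall of a returned set against the known relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    # Convention assumed here: an empty result has precision 1.0.
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical collection; the first three entries are variations of the query.
    collection = ["Jon Smith", "John Smith", "J. Smith", "Joan Smythe", "Mary Jones"]
    relevant = {"Jon Smith", "John Smith", "J. Smith"}
    for threshold in (0.5, 0.7, 0.9):
        result = [item for item, _ in range_query("John Smith", collection, threshold)]
        print(threshold, precision_recall(result, relevant))
```

Raising the threshold can only shrink the returned set, which is why the choice of threshold trades recall against precision and depends on the collection.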
|
122 |
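Análise da produtividade da soja associada a fatores agrometeorológicos, por meio de estatística espacial de área na Região Oeste do Estado do Paraná / Analysis of soybean yield associated with agrometeorological factors by means of areal spatial statistics in the western region of the state of Paraná
Araújo, Everton Coimbra de 01 December 2012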
This work presents methods for applying areal spatial statistics to soybean yield and agrometeorological factors in the western region of the state of Paraná. The data cover the crop years from 2000/2001 to 2007/2008 and comprise the following variables: soybean yield (t ha⁻¹) and agrometeorological factors, namely rainfall (mm), average temperature (°C), and average global solar radiation (W m⁻²). In the first phase, spatial autocorrelation indices (global and local Moran) were used and multiple spatial regression models were presented, with performance evaluations. Model parameters were estimated by the Maximum Likelihood method, and model performance was evaluated using the coefficient of determination (R²), the maximum value of the log-likelihood function, and Schwarz's Bayesian information criterion. In the second phase, spatial cluster analyses were performed using multivariate statistics, seeking to identify associations in the same set of variables but over a larger number of crop years. Finally, the data from one crop year were used in an approach based on fuzzy clustering with the Fuzzy C-Means algorithm, with similarity measured by an index defined for this purpose. The first phase made it possible to verify the correlation and the spatial autocorrelation between soybean yield and the agrometeorological elements through areal spatial analysis, using techniques such as the univariate and bivariate global and local Moran's I and significance tests. The performance indicators showed that the SAR and CAR models gave better results than the classical multiple regression model. In the second phase, groups of municipalities were formed based on the similarities of the variables under analysis. Cluster analysis proved a useful tool for managing agricultural production activities: the grouping made it possible to establish similarities that provide parameters for better management of production processes and that bring the quantitatively and qualitatively better results sought by the farmer. In the final phase, the Fuzzy C-Means algorithm made it possible to form groups of municipalities with similar soybean yield, using the Decision Method by the Highest Degree of Membership (MDMGP) and the Decision Method by the β Threshold (MDL β). The adequate number of clusters was then identified using the modified partition entropy. To measure the degree of similarity within each cluster, a Cluster Similarity Index (ISCl) was designed and used, which considers the degree of membership of each municipality in the cluster to which it belongs. Within the scope of this study, the method proved adequate, identifying clusters of municipalities with degrees of similarity on the order of 60 to 78%.
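The global Moran's I index mentioned in this abstract can be illustrated with a short sketch. It uses the standard definition I = (n / S0) · Σᵢⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄) / Σᵢ (xᵢ − x̄)², where S0 is the sum of all weights. The binary contiguity weights and the four-area yield values below are made-up assumptions for illustration, not data from the study.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of area values and a square spatial weight matrix.

    Assumes the values are not all equal (otherwise the denominator is zero).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all spatial weights
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / s0) * cross / sum(d * d for d in dev)


# Hypothetical example: four areas in a row, each neighboring the next one.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
trend = morans_i([1.0, 2.0, 3.0, 4.0], chain)        # similar neighbors
alternating = morans_i([1.0, 4.0, 1.0, 4.0], chain)  # dissimilar neighbors
print(trend, alternating)
```

Positive values indicate that neighboring areas tend to have similar values; negative values indicate that neighbors tend to differ.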
|
123 |
The effect of referent similarity and phonological similarity on concurrent word learning
Zhao, Libo 01 May 2013
Similarity has been regarded as a primary means by which lexical representations are organized, and hence an important determinant of processing interactions between lexical items. A central question about lexical-semantic similarity is how it influences lexical processing. There have been far fewer investigations, however, of how lexical-semantic similarity might influence novel word learning. This dissertation aimed to fill this gap by examining the effect of one kind of lexical-semantic similarity, similarity among novel words that are being learned concurrently (concurrent similarity), on the learning of phonological word forms. Importantly, it used tests that eliminated the real-time processing confound at test, so as to provide convincing evidence on whether learning was indeed affected by similarity.
The first part of the dissertation addressed the effect of concurrent referent similarity on the learning of phonological word forms. Experiment 1 used a naming test to provide evidence on the direction of the effect. Experiments 2 and 3 used a stem completion test and a recognition-from-mispronunciation test, both of which controlled for real-time processing differences between conditions. A 4-layer Hebbian Normalized Recurrent Network was then developed to provide further evidence, from its connection weights, on whether learning was affected. Consistently across the three tasks and the simulation, referent similarity had a detrimental effect on phonological word form learning.
The second part of the dissertation addressed the effect of cohort similarity on the learning of phonological word forms. A recognition-from-mispronunciation test on partial words was developed to control for real-time processing differences between conditions and so capture the effect of learning. We examined the effect of cohort similarity at different syllable positions and found a detrimental effect at the second syllable and no effect at the third syllable. This is consistent with the previous finding that competition among cohorts diminishes as the stimulus is received, suggesting that the effect of cohort similarity depends on the state of the competition dynamics among cohorts.
The theoretical and methodological implications of this study are discussed.
|
124 |
Evaluation of the correlation between test cases dependency and their semantic text similarity
Andersson, Filip January 2020
An important step in software development is testing the system thoroughly. Testing requires generating test cases, which can be numerous and must be executed in the correct order. The information needed to schedule test cases in the correct order is critical but not always available, so getting the order right demands considerable manual work and valuable resources. Analyzing the test specifications instead could make it possible to detect functional dependencies between test cases. This study presents a natural language processing (NLP) based approach and performs cluster analysis on a set of test cases to evaluate the correlation between test case dependencies and their semantic similarities. After an initial feature selection, the test cases' similarities are calculated with the cosine distance function. The result of the similarity calculation is then clustered using the HDBSCAN clustering algorithm. The clusters represent relations between test cases: test cases with close similarities are placed in the same cluster because they are expected to share dependencies. The clusters are then validated against a ground truth containing the correct dependencies. The resulting F-score is 0.7741. The approach was applied to an industrial testing project at Bombardier Transportation in Sweden.
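Two of the building blocks named in this abstract, cosine similarity between test-case texts and an F-score against a ground truth, can be sketched as follows. This is an illustration under simplifying assumptions: it uses plain bag-of-words term counts rather than the study's feature selection, it omits the HDBSCAN clustering step, and the function names and inputs are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of bag-of-words term counts; cosine distance is 1 minus this."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def f_score(predicted_pairs, true_pairs) -> float:
    """F1 score of predicted dependent test-case pairs against ground-truth pairs."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)
```

For example, `f_score({("TC1", "TC2"), ("TC1", "TC3")}, {("TC1", "TC2"), ("TC2", "TC4")})` compares two predicted pairs with two true pairs that share one element.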
|
125 |
Fourier Decompositions of Graphs with Symmetries and Equitable Partitions
Lund, Darren Scott 31 March 2021
We show that equitable partitions, which are generalizations of graph symmetries, and Fourier transforms are fundamentally related. For a partition of a graph's vertices we define a Fourier similarity transform of the graph's adjacency matrix, built from the matrices used to carry out discrete Fourier transformations. We show that the matrix (graph) decomposes into a number of smaller matrices (graphs) under this transformation if and only if the partition is an equitable partition. To extend this result to directed graphs we define two new types of equitable partitions, equitable receiving and equitable transmitting partitions, and show that if a partition of a directed graph is both, then the graph's adjacency matrix similarly decomposes under this transformation. Since the transformation we use is a similarity transform, the collective eigenvalues of the resulting matrices (graphs) are the same as the eigenvalues of the original untransformed matrix (graph).
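For readers unfamiliar with the term, a partition of an undirected graph's vertices is equitable when every vertex in cell i has the same number of neighbors in cell j, for every pair of cells. The sketch below only checks that standard definition and builds the resulting divisor (quotient) matrix; it does not implement the Fourier similarity transform defined in the thesis, and the function name is an assumption.

```python
def divisor_matrix(adjacency, partition):
    """Return the divisor matrix of an equitable partition, or None if not equitable.

    adjacency: square 0/1 matrix as a list of lists.
    partition: list of cells, each a list of vertex indices.
    """
    matrix = []
    for cell_i in partition:
        row = []
        for cell_j in partition:
            # Number of neighbors in cell_j, for each vertex of cell_i.
            counts = {sum(adjacency[v][u] for u in cell_j) for v in cell_i}
            if len(counts) != 1:
                return None  # vertices of cell_i disagree: not equitable
            row.append(counts.pop())
        matrix.append(row)
    return matrix


# Path graph 0 - 1 - 2: swapping the end vertices is a symmetry.
path = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
print(divisor_matrix(path, [[0, 2], [1]]))  # equitable
print(divisor_matrix(path, [[0, 1], [2]]))  # not equitable
```

In the equitable case the divisor matrix is [[0, 1], [2, 0]], whose eigenvalues ±√2 are also eigenvalues of the path graph's adjacency matrix (√2, 0, −√2), a small instance of spectral information being preserved.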
|
126 |
Similarity Learning and Stochastic Language Models for Tree-Represented Music
Bernabeu Briones, José Francisco 20 July 2017
Similarity computation is a difficult issue in music information retrieval tasks, because it tries to emulate the special ability that humans show for pattern recognition in general, and particularly in the presence of noisy data. A number of works have addressed the problem of finding the best representation for symbolic music in this context. The tree representation, which uses rhythm to define the tree structure and pitch information to label leaves and nodes, has proven effective in melodic similarity computation. In this dissertation we try to build a system that can classify and generate melodies using the information from the tree encoding, capturing the inherent dependencies within this kind of structure and improving on current methods in terms of accuracy and running time. In this way, we look for more efficient methods, which are key to using the tree structure on large datasets. First, we study the possibilities of the tree edit similarity for classifying melodies, using a new approach to estimate the weights of the edit operations. We then turn to an alternative approach, in which grammatical inference is used to infer tree languages. Inferring these languages makes it possible to use them to classify new trees (melodies).
|
127 |
The Choice of Brand Extension: The Moderating Role of Brand Loyalty on Fit and Brand Familiarity
Liang, Beichen, Fu, Wei 01 March 2021
The purpose of this study is to investigate the role of loyalty in consumers’ selection of brand extensions in the presence of familiar competitors. The findings show that fit may not have a linear relationship with the choice of an extension when loyalty and brand familiarity are considered. Loyal consumers’ likelihood to choose high-fit and moderate-fit extensions is not much lower than their likelihood to choose products from familiar competitors. We also find an inverted-U-shaped relationship between choice behavior and degree of perceived fit for loyal and moderately loyal consumers. Moreover, brand concepts can make a brand more elastic and extendable, increasing loyal and moderately loyal consumers’ likelihood to choose moderate- and even low-fit extensions. However, disloyal consumers are highly unlikely to choose extensions over products from familiar competitors regardless of fit and types of similarity. Finally, the effect of similarity on consumers’ choice of extensions is fully mediated by loyalty and perceived risks.
|
128 |
Understanding convolutional networks and semantic similarity
Singh, Vineeta 22 October 2020
No description available.
|
129 |
Record Linkage
Larsen, Stasha Ann Bown 11 December 2013
This document explains the use of different metrics involved with record linkage. There are two forms of record linkage: deterministic and probabilistic. We will focus on probabilistic record linkage used in merging and updating two databases. Record pairs will be compared using character-based and phonetic-based similarity metrics to determine at what level they match. Performance measures are then calculated and Receiver Operating Characteristic (ROC) curves are formed. Finally, an economic model is applied that returns the optimal tolerance level two databases should use to determine a record pair match in order to maximize profit.
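A character-based similarity metric and the performance measures behind one point of an ROC curve can be sketched as below. The normalized Levenshtein similarity is one common character-based choice and is an assumption here, since the abstract does not name the specific metrics; the phonetic metrics and the economic model are not covered, and all names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + (char_a != char_b)  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def roc_point(scored_pairs, tolerance):
    """(false positive rate, true positive rate) when pairs scoring >= tolerance match.

    scored_pairs: iterable of (similarity score, is_true_match) tuples.
    """
    tp = fp = fn = tn = 0
    for score, is_match in scored_pairs:
        predicted = score >= tolerance
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr
```

Sweeping `tolerance` from 0 to 1 and plotting the resulting points traces the ROC curve from which a tolerance level can be chosen.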
|
130 |
Termediator-II: Identification of Interdisciplinary Term Ambiguity Through Hierarchical Cluster Analysis
Riley, Owen G. 23 April 2014
Technical disciplines are evolving rapidly leading to changes in their associated vocabularies. Confusion in interdisciplinary communication occurs due to this evolving terminology. Two causes of confusion are multiple definitions (overloaded terms) and synonymous terms. The formal names for these two problems are polysemy and synonymy. Termediator-I, a web application built on top of a collection of glossaries, uses definition count as a measure of term confusion. This tool was an attempt to identify confusing cross-disciplinary terms. As more glossaries were added to the collection, this measure became ineffective. This thesis provides a measure of term polysemy. Term polysemy is effectively measured by semantically clustering the text concepts, or definitions, of each term and counting the number of resulting clusters. Hierarchical clustering uses a measure of proximity between the text concepts. Three such measures are evaluated: cosine similarity, latent semantic indexing, and latent Dirichlet allocation. Two linkage types, for determining cluster proximity during the hierarchical clustering process, are also evaluated: complete linkage and average linkage. Crowdsourcing through a web application was unsuccessfully attempted to obtain a viable clustering threshold by public consensus. An alternate metric of polysemy, convergence value, is identified and tested as a viable clustering threshold. Six resulting lists of terms ranked by cluster count based on convergence values are generated, one for each similarity measure and linkage type combination. Each combination produces a competitive list, and no clear combination can be determined as superior. Semantic clustering successfully identifies polysemous terms, but each similarity measure and linkage type combination provides slightly different results.
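The idea of measuring polysemy by clustering a term's definitions and counting the clusters can be sketched as follows. This is a simplified illustration, not the Termediator-II implementation: it uses bag-of-words cosine similarity only (no latent semantic indexing or latent Dirichlet allocation), a fixed threshold instead of the convergence value, and hypothetical function names.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def count_senses(definitions, threshold, linkage="average"):
    """Agglomeratively merge definitions and return the number of clusters left.

    linkage: "complete" (least similar pair) or "average" (mean over pairs).
    Merging stops once the closest pair of clusters falls below the threshold.
    """
    vectors = [Counter(text.lower().split()) for text in definitions]
    clusters = [[index] for index in range(len(vectors))]

    def proximity(first, second):
        sims = [cosine(vectors[i], vectors[j]) for i in first for j in second]
        return min(sims) if linkage == "complete" else sum(sims) / len(sims)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), proximity(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[i] += clusters[j]  # i < j, so deleting j keeps index i valid
        del clusters[j]
    return len(clusters)
```

A term whose definitions collapse into one cluster would be treated as unambiguous under this sketch, while a higher cluster count suggests more distinct senses.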
|