821 |
Operadores físicos binários para consultas por similaridade em SGBDR / Physical binary operators for similarity queries in RDBMS
Luiz Olmes Carvalho 26 March 2018
Joins are important Relational Algebra operators: they pair tuples from two relations whose attribute values meet a given comparison condition. When the condition evaluates the similarity between values, the operator is called a Similarity Join. This operator applies to a variety of contexts, such as supporting data mining tasks and data analysis in general, and near-duplicate detection, data cleaning and string matching in particular. Among the existing types of similarity joins, the range join is the most explored in the literature. However, it has shortcomings, such as the difficulty of finding an adequate similarity threshold. In this context, the k-nearest neighbors join (kNN join) is considered more intuitive, and therefore more useful, than the range join. However, executing a kNN join is computationally much more expensive, which demands implementations based either on nested-loop techniques, which are generic, or on optimization techniques that are specific to a given data domain. To accelerate and generalize kNN join execution, the first contribution of this thesis was the development of the QuickNearest algorithm, based on the divide-and-conquer approach, which is independent of the data domain and of the distance function used, and which computes kNN joins very efficiently. Experiments performed with the QuickNearest algorithm show that it is up to four orders of magnitude faster than current methods.
Nevertheless, using similarity join operators in relational environments remains troublesome for two main reasons: (i) the result often has a cardinality much larger than what is actually needed or expected by most data analysis applications; and (ii) queries that use them almost always also require sorting operations, although the concept of order is not part of relational theory. The second contribution of the thesis addresses these two problems by defining the concept of Wide-joins, which treats the existing similarity join operators as particular cases of a broader set of binary operators. A wide-join operator retrieves the most similar pairs overall and incorporates ordering as an operation internal to its processing, which makes it compatible with relational theory. The concept also makes it possible to restrict the result cardinality to the tuples of greatest interest to the applications. The experiments show that wide-joins are fast enough to be used in real applications, return results of better quality than competing methods, and are more suitable for execution in a relational environment than the traditional similarity join operators.
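The generic nested-loop baseline mentioned in the abstract can be sketched as follows. This is only an illustration of a kNN join that works with any distance function; it is not the QuickNearest algorithm, and the names `knn_join_nested_loop` and `euclidean` as well as the toy data are assumptions made for this example.

```python
import heapq


def knn_join_nested_loop(left, right, k, distance):
    """For every element of `left`, pair it with its k nearest elements of `right`.

    Returns (left_index, right_index, distance) triples. Hypothetical helper:
    a plain nested-loop baseline, not the thesis's QuickNearest algorithm.
    """
    result = []
    for i, r in enumerate(left):
        # Inner loop: score every candidate of the right relation.
        scored = ((distance(r, s), j) for j, s in enumerate(right))
        # Keep only the k closest candidates for this left element.
        nearest = heapq.nsmallest(k, scored)
        result.extend((i, j, d) for d, j in nearest)
    return result


def euclidean(a, b):
    """Euclidean distance; any other distance function could be passed instead."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Toy relations (assumed data, for illustration only).
left = [(0.0, 0.0), (5.0, 5.0)]
right = [(1.0, 0.0), (0.0, 2.0), (5.0, 6.0), (9.0, 9.0)]
pairs = knn_join_nested_loop(left, right, k=2, distance=euclidean)
```

The cost is one distance computation per pair of tuples, which is the expense the abstract refers to; the result always holds exactly k pairs per left tuple, so no similarity threshold has to be chosen.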
|
822 |
Seleção de características por meio de algoritmos genéticos para aprimoramento de rankings e de modelos de classificação / Feature selection by genetic algorithms to improve ranking and classification models
Sérgio Francisco da Silva 25 April 2011
Content-based image retrieval (CBIR) and classification systems rely heavily on feature vectors extracted from images according to specific visual criteria. Feature vectors commonly have on the order of hundreds of elements. As the size (dimensionality) of the feature vector increases, so do the degrees of irrelevance and redundancy, leading to the "curse of dimensionality" problem. Thus, selecting the relevant features is a key step for CBIR and classification systems to work well. This thesis presents new feature selection methods based on genetic algorithms (GA), aimed at improving similarity queries and classification models. The proposed Fc ("Fitness coach") family of fitness functions takes advantage of single-valued ranking evaluation functions to build a new GA-based feature selection approach tailored to improve the accuracy of CBIR systems. The search ability of GA under the proposed evaluation criteria (the Fc family) improves the precision of similarity query answers by up to 22% on the analyzed databases when compared to traditional wrapper feature selection methods based on decision trees (C4.5), naive Bayes, support vector machine, 1-nearest neighbor and association rule mining. Other contributions of this thesis are two filter-based feature selection algorithms for image classification, which use the supervised calculation of the simplified silhouette statistic as the evaluation function: the silhouette-based greedy search (SiGS) and the silhouette-based genetic algorithm search (SiGAS). The proposed algorithms outperform competing methods from the literature (CFS, FCBF and ReliefF, among others).
It is important to stress that the gain in accuracy obtained by the Fc family and by the proposed SiGS and SiGAS methods comes with a significant decrease in the feature vector size, which can reach up to 90%.
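As a rough illustration of using a silhouette statistic to score a feature subset, the sketch below computes the commonly used simplified silhouette (distances to class centroids instead of to every sample), treating the class labels as clusters. It is a sketch under stated assumptions, not the SiGS or SiGAS implementation: the function name `simplified_silhouette`, the exact formula variant and the toy data are assumptions for this example, and at least two classes are required.

```python
from collections import defaultdict
from math import dist


def simplified_silhouette(samples, labels, selected):
    """Mean simplified silhouette of labelled samples, using only `selected` feature indices.

    Hypothetical helper: a = distance to the sample's own class centroid,
    b = distance to the nearest other class centroid, s = (b - a) / max(a, b).
    """
    # Project every sample onto the candidate feature subset.
    projected = [[row[f] for f in selected] for row in samples]

    # Group projected samples by class and compute one centroid per class.
    groups = defaultdict(list)
    for row, label in zip(projected, labels):
        groups[label].append(row)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

    total = 0.0
    for row, label in zip(projected, labels):
        a = dist(row, centroids[label])
        b = min(dist(row, c) for other, c in centroids.items() if other != label)
        # Guard against a zero denominator when the centroids coincide with the sample.
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(projected)


# Toy data (assumed): feature 0 separates the classes, feature 1 does not.
samples = [[0.0, 5.0], [1.0, 1.0], [10.0, 5.0], [11.0, 1.0]]
labels = ["a", "a", "b", "b"]
score_informative = simplified_silhouette(samples, labels, [0])
score_both = simplified_silhouette(samples, labels, [0, 1])
score_noise = simplified_silhouette(samples, labels, [1])
```

A greedy or genetic search would call such a function on each candidate subset and prefer subsets with higher scores; here the subset containing only the discriminative feature scores highest.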
|
823 |
Development of new computational methods for a synthetic gene set annotation / Développement de nouvelles méthodes informatiques pour une annotation synthétique d’un ensemble de gènes
Ayllón-Benítez, Aarón 05 December 2019
By strongly increasing the production of omics data, the revolution in sequencing technologies is leading to a new understanding of the relations between genotype and phenotype. To interpret and analyze data grouped according to a phenotype of interest, methods based on statistical enrichment have become a standard in biology. However, these methods synthesize the biological information by selecting the over-represented terms a priori, and they focus on the most studied genes, which may cover only a limited share of the annotated genes within a gene set. During this thesis, we explored different methods for annotating gene sets. Within this framework, we carried out three studies that enable the annotation of gene sets and thus improve the understanding of their biological context.
First, visualization approaches were applied to represent annotation results provided by enrichment analysis for a gene set or a repertoire of gene sets. In this work, a visualization prototype called MOTVIS (MOdular Term VISualization) was developed to provide an interactive representation of a repertoire of gene sets. It combines two visual metaphors: a treemap view that provides an overview and also displays detailed information about gene sets, and an indented tree view that can be used to focus on the annotation terms of interest. MOTVIS has the advantage of overcoming the limitations of each visual metaphor when used individually. This illustrates the value of combining different visual metaphors to facilitate the comprehension of biological results when complex data are represented.
Secondly, to address the limitations of enrichment analysis, a new method was proposed for analyzing the impact of using different semantic similarity measures on gene set annotation. To evaluate the impact of each measure, two criteria were considered relevant for characterizing a "good" synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced while maintaining a sufficient level of detail, and (ii) the number of genes described by the selected terms should be as large as possible. Thus, nine semantic similarity measures were analyzed to identify the best possible compromise between both criteria. Using the Gene Ontology (GO) to annotate the gene sets, we observed better results with node-based measures, which use the terms’ characteristics, than with edge-based measures, which use the relations between terms. The annotation of the gene sets achieved with the node-based measures did not exhibit major differences regardless of which term characteristics were used. Then, we developed GSAn (Gene Set Annotation), a novel gene set annotation web server that uses semantic similarity measures to synthesize GO annotation terms a priori. GSAn includes the interactive visualization MOTVIS, dedicated to visualizing the representative terms of gene set annotations together with the genes of the set under study. Compared to enrichment analysis tools, GSAn has shown excellent results in terms of maximizing the gene coverage while minimizing the number of terms.
Finally, the third work consisted in enriching the annotation results provided by GSAn. Since the knowledge described in GO may not be sufficient for interpreting gene sets, other biological information, such as pathways and diseases, may be useful to provide a wider biological context. Thus, two additional knowledge resources, Reactome and Disease Ontology (DO), were integrated within GSAn. In practice, GO terms were mapped to terms of Reactome and DO, before and after applying the GSAn method. The integration of these resources improved the results in terms of gene coverage without significantly affecting the number of terms involved. Two strategies were applied to find mappings (generated or extracted from the web) between each new resource and GO. We have shown that applying the mapping process before computing the GSAn method yields a larger number of inter-relations between the two knowledge resources.
|
824 |
TOWARDS TIME-AWARE COLLABORATIVE FILTERING RECOMMENDATION SYSTEM
Dawei Wang (9216029) 12 October 2021
As the technological capacity to store and exchange information progresses, the amount of available data grows explosively, which can lead to information overload. Making decisions effectively becomes harder when one has too much information about an issue. Recommendation systems are a subclass of information filtering systems that aim to predict a user’s opinion of or preference for a topic or item, thereby providing personalized recommendations to users by exploiting historical data. They are widely used in e-commerce such as Amazon.com, online movie streaming companies such as Netflix, and social media networks such as Facebook. Memory-based collaborative filtering (CF) is one of the recommendation system methods used to predict a user’s rating or preference by exploring historical ratings, without incorporating any content information about users or items. Many studies have been conducted on memory-based CFs to improve prediction accuracy, but none of them has achieved better prediction accuracy than state-of-the-art model-based CFs. Furthermore, a product or service is judged not only by its own characteristics but also by the characteristics of other products or services offered concurrently. It can also be judged by anchoring based on users’ memories. Rating or satisfaction is viewed as a function of the discrepancy or contrast between expected and obtained outcomes, documented as contrast effects. Thus, a rating given to an item by a user is a comparative opinion based on the user’s past experiences, and rating scores can therefore be affected by the sequence and time of ratings. However, in traditional CFs, pairwise similarities measured between items do not consider time factors such as the sequence of ratings, which could introduce biases caused by contrast effects. In this research, we proposed a new approach that combines structural and rating-based similarity measurements in memory-based CFs.
We found that, on the MovieLens and Netflix datasets, memory-based CF using the combined similarity measurement can achieve better prediction accuracy than model-based CFs in terms of lower MAE, and can reduce memory and time by using fewer neighbors than traditional memory-based CFs. We also proposed techniques to reduce the biases caused by users’ comparing, anchoring and adjustment behaviors by introducing time-aware similarity measurements into memory-based CFs. Finally, we introduced novel techniques to identify, quantify, and visualize user preference dynamics, and showed how they could be used to generate dynamic recommendation lists that fit each user’s current preferences.
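For readers unfamiliar with memory-based CF and the MAE metric mentioned above, a minimal item-based sketch follows. It uses plain cosine similarity between items and is not the combined structural, rating-based or time-aware measure proposed in the thesis; the function names, the neighborhood size `k` and the toy ratings are assumptions made for this example.

```python
from math import sqrt


def cosine_item_similarity(ratings, item_a, item_b):
    """Cosine similarity of two items over the users who rated both (hypothetical helper)."""
    common = [u for u, r in ratings.items() if item_a in r and item_b in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][item_a] * ratings[u][item_b] for u in common)
    norm_a = sqrt(sum(ratings[u][item_a] ** 2 for u in common))
    norm_b = sqrt(sum(ratings[u][item_b] ** 2 for u in common))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item, k=2):
    """Similarity-weighted average of the user's ratings on the k items most similar to `item`."""
    neighbours = sorted(
        (
            (cosine_item_similarity(ratings, item, other), rating)
            for other, rating in ratings[user].items()
            if other != item
        ),
        reverse=True,
    )[:k]
    weight = sum(sim for sim, _ in neighbours)
    if weight == 0:
        return None  # no usable neighbours for this user and item
    return sum(sim * rating for sim, rating in neighbours) / weight


def mean_absolute_error(pairs):
    """MAE over (predicted, actual) rating pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


# Toy ratings on a 1-5 scale (assumed data, for illustration only).
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2},
    "cat": {"A": 1, "B": 2, "C": 5},
    "dan": {"A": 5, "C": 1},
}
predicted = predict(ratings, "dan", "B", k=2)
error = mean_absolute_error([(3.0, 4.0), (5.0, 4.0)])
```

A time-aware variant would change only the similarity function, for example by weighting co-ratings according to when or in what order they were given; how the thesis does this is not specified in the abstract.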
|
825 |
Počítač jako inteligentní spoluhráč ve slovně-asociační hře Krycí jména / Computer as an Intelligent Partner in the Word-Association Game Codenames
Obrtlík, Petr January 2018
This thesis deals with associations between words. It describes the design and implementation of a system that can stand in for a human player in the word-association game Codenames. The system uses the Gensim and FastText libraries to create semantic models. The relationships between words are learned by analyzing the text corpus CWC-2011.
|
826 |
Určení místa původu hudebních interpretací české komorní a orchestrální hudby za pomoci technik Music Information Retrieval / Music information retrieval techniques for determining the place of origin of the Czech chamber and orchestral music interpretations
Miklánek, Štěpán January 2019
This diploma thesis focuses on the statistical analysis of recordings of chamber and orchestral classical music composed by Czech authors. One chapter is dedicated to the feature extraction process that precedes the statistical analysis. Music Information Retrieval techniques are used at several stages of the thesis. The databases used for the analysis are described and pre-processing steps are proposed. A tool for synchronizing the recordings was implemented in MATLAB. Finally, a system for classifying recordings by their geographical origin is proposed. A binary classifier sorts the recordings into two categories, Czech and world recordings. The first part of the statistical analysis examines the features individually and evaluates them by their discrimination strength. The second part focuses on feature selection, which can improve the overall accuracy of the binary classifier compared to the individual analysis of the features.
|
827 |
Určování podobnosti objektů na základě obrazové informace / Determination of Objects Similarity Based on Image Information
Rajnoha, Martin January 2021
Monitoring of public areas and automatic real-time processing of the recordings have become increasingly significant due to the changing security situation in the world. A persistent problem, however, is the analysis of low-quality recordings, where even state-of-the-art methods fail in some cases. This work investigates an important area of image similarity: biometric identification based on face images. It deals primarily with face super-resolution from a sequence of low-resolution images and compares this approach to single-frame methods, which are still considered the most accurate. A new dataset was created for this purpose; it is designed specifically for multi-frame face super-resolution from a low-resolution input sequence and is comparable in size to the leading datasets worldwide. The results were evaluated both by a survey of human perception and by defined objective metrics. A comparison of the two approaches confirmed the hypothesis that multi-frame methods achieve better results than single-frame methods. The architectures, source code and dataset were released, providing a basis for future research in this field.
|
828 |
Grafická reprezentace genomických a proteomických sekvencí / Graphical representation of DNA and protein sequences
Pražák, Ondřej January 2011
Modifying DNA sequences and representing them suitably is an important part of their analysis, comparison and further processing. The goal of this thesis is to find suitable methods for representing genomic and proteomic sequences. Because there is a great number of methods, the thesis introduces only some of them. All selected methods are described in the first part of the thesis and were implemented in MATLAB. The selected methods are illustrated on the coding sequences of the first exon of the β-globin gene of 11 different species. The results are compared with the results from the original papers. Some methods also allow further processing, such as cluster analysis. The outcome of the thesis is a comparison of the results obtained with the different methods and the identification of the most suitable one.
|
829 |
Biometrie sítnice pro účely rozpoznávání osob / Retinal biometry for human recognition
Sikorová, Eva January 2015
This master's thesis deals with recognizing a person by comparing feature sets extracted from images of the retinal vessel pattern. The first part provides an insight into biometric issues, a detailed analysis of human identification using retinal images and, in particular, a literature review of extraction and comparison methods. In the practical part, algorithms for human identification were implemented in MATLAB using nearest neighbor search (NS), translation, template matching (TM), and extended NS and TM variants that include more features. The thesis includes testing of the proposed programs on a biometric database of feature vectors, followed by an evaluation.
|
830 |
Sémantická blízkost pro vědecké články / Semantic Relatedness of Scientific Articles
Dresto, Erik January 2011
The main goal of the thesis is to explore basic methods that can be used to find semantically related scientific articles. All the methods are explained in detail, compared, and finally evaluated using standard metrics. Based on the evaluation, a new method for computing the semantic similarity of scientific articles is proposed. The proposed method builds on current state-of-the-art methods and adds another important factor for computing similarity: citations. Using citations is important, since they represent a static bond between articles. Finally, the proposed method is evaluated on real data and compared with the other described methods.
|