  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Graph-based Centrality Algorithms for Unsupervised Word Sense Disambiguation

Sinha, Ravi Som 12 1900 (has links)
This thesis introduces a methodology that combines traditional dictionary-based approaches to word sense disambiguation (semantic similarity measures and overlap of word glosses, both based on WordNet) with graph-based centrality methods, namely vertex degree, PageRank, closeness, and betweenness. The approach is completely unsupervised and is based on building graphs over the words to be disambiguated. In the first stage, we experiment with several possible combinations of the semantic similarity measures. The next stage scores the individual vertices of the resulting graphs using several graph connectivity measures. In the final stage, several voting schemes are applied to the results obtained from the different centrality algorithms. The main contributions of this work are not only that the approach is novel and performs well, but also that it has great potential for overcoming the knowledge-acquisition bottleneck that has brought research in supervised WSD to a plateau. Research of this kind, which requires no manually annotated data, holds considerable promise, and this work is an early step in that direction. The complete system is built and evaluated on standard benchmarks and is comparable with prior work on graph-based word sense disambiguation and on lexical chains. The evaluation indicates that the right combination of the above metrics can yield an unsupervised disambiguation engine as powerful as the state of the art in WSD.
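As an illustration of the general idea (not the exact system built in the thesis), the sketch below builds a sense graph with NetworkX, scores its vertices with degree, PageRank, closeness, and betweenness centrality, and lets the measures vote. The `sense_sim` function and the threshold are placeholders standing in for the WordNet-based similarity measures described in the abstract.

```python
# A minimal sketch of graph-centrality WSD, assuming NetworkX; sense_sim is a
# placeholder for a WordNet-based sense similarity function.
import itertools
import networkx as nx

def disambiguate(candidate_senses, sense_sim, threshold=0.1):
    """candidate_senses: dict mapping each word to a list of candidate sense ids."""
    g = nx.Graph()
    for word, senses in candidate_senses.items():
        for s in senses:
            g.add_node((word, s))
    # Connect senses of different words when they are sufficiently similar.
    for (w1, s1), (w2, s2) in itertools.combinations(list(g.nodes), 2):
        if w1 != w2:
            sim = sense_sim(s1, s2)
            if sim > threshold:
                g.add_edge((w1, s1), (w2, s2), weight=sim)
    # Score every vertex with several centrality measures.
    scorings = [
        nx.degree_centrality(g),
        nx.pagerank(g, weight="weight"),
        nx.closeness_centrality(g),
        nx.betweenness_centrality(g, weight="weight"),
    ]
    # Simple voting: each measure votes for its top-scoring sense of each word.
    result = {}
    for word, senses in candidate_senses.items():
        votes = {}
        for scores in scorings:
            best = max(senses, key=lambda s: scores.get((word, s), 0.0))
            votes[best] = votes.get(best, 0) + 1
        result[word] = max(votes, key=votes.get)
    return result
```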
22

Evaluation of the correlation between test cases dependency and their semantic text similarity

Andersson, Filip January 2020 (has links)
An important step in developing software is to test the system thoroughly. Testing software requires generating test cases, which can reach large numbers and must be executed in the correct order. The information needed to schedule the test cases in the correct order is not always available, so getting the order right requires a lot of manual work and valuable resources. By instead analyzing the test specifications, it could be possible to detect the functional dependencies between test cases. This study presents a natural language processing (NLP) based approach and performs cluster analysis on a set of test cases to evaluate the correlation between test case dependencies and their semantic similarities. After an initial feature selection, the similarities between test cases are calculated with the cosine distance function. The resulting similarities are then clustered using the HDBSCAN clustering algorithm. The clusters represent relations between test cases: test cases with close similarities are placed in the same cluster, as they are expected to share dependencies. The clusters are then validated against a ground truth containing the correct dependencies, yielding an F-score of 0.7741. The approach is applied to an industrial testing project at Bombardier Transportation in Sweden.
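A minimal sketch of this kind of pipeline is shown below, assuming scikit-learn and the hdbscan package and reducing the feature selection step to plain TF-IDF; the actual features and parameters used in the thesis may differ.

```python
# Sketch: vectorize test specifications, compute cosine distances, cluster with HDBSCAN.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
import hdbscan

def cluster_test_cases(test_specifications, min_cluster_size=2):
    """test_specifications: list of test-case specification texts."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(test_specifications)
    distances = cosine_distances(vectors).astype("float64")
    clusterer = hdbscan.HDBSCAN(metric="precomputed",
                                min_cluster_size=min_cluster_size)
    # Test cases sharing a label are expected to share dependencies;
    # label -1 marks noise (no detected relation).
    return clusterer.fit_predict(distances)
```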
23

Résumé automatique multi-document dynamique / Multi-document Update-summarization

Mnasri, Maali 20 September 2018 (has links)
This thesis focuses on automatic text summarization and, more specifically, on update summarization. This research problem aims to produce a differential summary of a set of new documents with respect to a set of old documents assumed to be known. It thus adds two issues to the task of generic automatic summarization: the temporal dimension of the information and the history of the user. In this context, the work presented here follows an extractive approach based on integer linear programming (ILP) and is organized around two main axes: detecting redundancy between the selected information and the user history, and maximizing its salience. For the first axis, we were particularly interested in exploiting inter-sentence similarities to detect redundancies between the information in the new documents and the information already present in the known ones, by defining a method for semantic clustering of sentences. Concerning the second axis, we studied the impact of taking into account the discursive structure of documents, in the framework of Rhetorical Structure Theory (RST), to favor the selection of the information considered most important. The benefit of the methods thus defined was demonstrated in evaluations carried out on the data of the TAC and DUC campaigns. Finally, the integration of these semantic and discursive criteria through a late fusion mechanism demonstrated the complementarity of the two axes and the benefit of their combination.
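For illustration, a concept-coverage ILP in the spirit of such extractive, ILP-based summarizers can be written with the PuLP library. This is a generic sketch, not the thesis's exact formulation: concept weights, redundancy handling against the user history, and discursive features are left abstract.

```python
# Generic concept-coverage ILP sketch (assumes the PuLP library): select
# sentences to maximize the weight of covered concepts under a length budget.
import pulp

def ilp_summarize(sentences, lengths, concept_weights, sentence_concepts, budget):
    """sentence_concepts[i] = set of concept ids appearing in sentence i."""
    prob = pulp.LpProblem("summary", pulp.LpMaximize)
    s = [pulp.LpVariable(f"s{i}", cat="Binary") for i in range(len(sentences))]
    c = {j: pulp.LpVariable(f"c{j}", cat="Binary") for j in concept_weights}
    # Objective: total weight of the concepts covered by the summary.
    prob += pulp.lpSum(concept_weights[j] * c[j] for j in concept_weights)
    # Length budget on the selected sentences.
    prob += pulp.lpSum(lengths[i] * s[i] for i in range(len(sentences))) <= budget
    # A concept counts as covered only if some selected sentence contains it.
    for j in concept_weights:
        prob += c[j] <= pulp.lpSum(s[i] for i in range(len(sentences))
                                   if j in sentence_concepts[i])
    prob.solve()
    return [sentences[i] for i in range(len(sentences)) if s[i].value() >= 0.5]
```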
24

Evaluation of BERT-like models for small scale ad-hoc information retrieval / Utvärdering av BERT-liknande modeller för småskalig ad-hoc informationshämtning

Roos, Daniel January 2021 (has links)
Measuring semantic similarity between two sentences is an active research field, with big advances every year. This thesis looks at using modern methods of semantic similarity measurement for an ad-hoc information retrieval (IR) system. The main challenge tackled is the question "What happens when you don’t have situation-specific data?". Using encoder-based transformer architectures pioneered by Devlin et al., which excel at fine-tuning to situation-specific domains, this thesis shows how well the presented methodology can work and makes recommendations for future attempts at similar domain-specific tasks. It also shows an example of how a web application can be created to make use of these fast-learning architectures.
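A minimal sketch of ad-hoc retrieval with an encoder-based transformer is shown below, assuming the sentence-transformers library; the checkpoint name is an illustrative choice, not necessarily the model evaluated in the thesis.

```python
# Sketch: embed query and documents, rank documents by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative pretrained checkpoint

def rank_documents(query, documents, top_k=10):
    doc_embeddings = model.encode(documents, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [(documents[int(i)], float(scores[i])) for i in ranked]
```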
25

A Method for Integrating Heterogeneous Datasets based on GO Term Similarity

Thanthiriwatte, Chamali Lankara 11 December 2009 (has links)
This thesis presents a method for integrating heterogeneous gene/protein datasets at the functional level based on Gene Ontology (GO) term similarity. Biologists often want to integrate heterogeneous datasets obtained from different biological samples, and a major challenge in this process is how to link them. Currently, the most common approach is to link datasets through common reference database identifiers, which tends to yield only a small number of matching identifiers because of the lack of standard accession schemes. As a result, biologists may not recognize the underlying biological phenomena that are revealed by the combined data but not by each dataset individually. We discuss an approach for integrating heterogeneous datasets by computing the similarity among them based on the similarity of their GO annotations. We then group the genes and/or proteins with similar annotations by applying a hierarchical clustering algorithm. The results demonstrate a more comprehensive understanding of the biological processes involved.
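For illustration, the grouping step can be sketched as follows, assuming SciPy and a placeholder `term_sim` function standing in for the GO term similarity measure; best-match averaging turns term-level similarities into a gene/protein-level similarity, which then drives average-linkage hierarchical clustering. The aggregation scheme and cut threshold are assumptions, not necessarily those used in the thesis.

```python
# Sketch: aggregate GO term similarities per gene/protein, then cluster hierarchically.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def annotation_similarity(terms_a, terms_b, term_sim):
    """terms_a, terms_b: sets of GO term ids annotating two genes/proteins."""
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))

def cluster_by_function(annotations, term_sim, cut=0.5):
    """annotations: list of GO term sets, one per gene/protein."""
    n = len(annotations)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim = annotation_similarity(annotations[i], annotations[j], term_sim)
            dist[i, j] = dist[j, i] = 1.0 - sim
    tree = linkage(squareform(dist), method="average")
    return fcluster(tree, t=cut, criterion="distance")
```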
26

SEMANTIC SIMILARITY IN THE EVALUATION OF ONTOLOGY ALIGNMENT

Hu, Xueheng 12 December 2011 (has links)
No description available.
27

Semi-automated co-reference identification in digital humanities collections

Croft, David January 2014 (has links)
Locating specific information within museum collections represents a significant challenge for collection users. Even when the collections and catalogues exist in a searchable digital format, formatting differences and the imprecise nature of the information to be searched mean that information can be recorded in a large number of different ways. This variation exists not just between different collections, but also within individual ones. Traditional information retrieval techniques are therefore badly suited to locating particular information in digital humanities collections, and searching takes an excessive amount of time and resources. This thesis focuses on a particular search problem, co-reference identification: the process of identifying when the same real-world item is recorded in multiple digital locations. A real-world example of a co-reference identification problem for digital humanities collections is identified and explored, in particular the time-consuming nature of identifying co-referent records. To address this problem, the thesis presents a novel method for co-reference identification between digitised records in humanities collections. Whilst the specific focus of this thesis is co-reference identification, elements of the method described also have applications for general information retrieval. The new co-reference method draws on a broad range of areas, including query expansion, co-reference identification, short-text semantic similarity and fuzzy logic. The new method was tested against real-world collections information, and the results suggest that, in terms of the quality of the co-referent matches found, it is at least as effective as a manual search, while the number of co-referent matches found is higher. The approach presented here is capable of searching collections stored using differing metadata schemas. More significantly, it is capable of identifying potential co-reference matches despite the highly heterogeneous and syntax-independent nature of the Gallery, Library, Archive and Museum (GLAM) search space and the photo-history domain in particular. The most significant benefit of the new method, however, is that it requires comparatively little manual intervention; a co-reference search using it therefore has significantly lower person-hour requirements than a manually conducted search. In addition to the overall co-reference identification method, this thesis also presents:
• A novel and computationally lightweight short-text semantic similarity metric, with significantly higher throughput than the current prominent techniques and only a negligible drop in accuracy.
• A novel method for comparing photographic processes in the presence of variable terminology and inaccurate field information, which is the first computational approach to do so.
28

Sdílená osobní databáze znalostí / Shared Personal Knowledge Database

Folk, Michal January 2013 (has links)
The goal of this thesis is to propose a solution to the inefficiency of repeatedly searching for information that has already been searched for and found. The proposed solution is based on a personal knowledge base built upon existing technologies and adapted to the needs of ordinary users. The thesis focuses especially on search based on semantic similarities between tags, using collective knowledge to find those similarities. The first part introduces the repetitive search problem through a few real-world scenarios. In the second part, the problem is analyzed from the personal knowledge base point of view. The third part explains the proposed solution, which is built upon the bookmarking service Delicious and DBpedia. The proposed solution is implemented as a prototype, which is tested and evaluated in the final part. The test results suggest that the presented solution can make repetitive search easier, but they also expose some performance issues that the proposed method brings up. The thesis recommends modifications that could improve performance and allow more extensive testing of the prototype.
29

Searching Documents With Semantically Related Keyphrases

Aygul, Ibrahim 01 December 2010 (has links) (PDF)
In this thesis, we developed SemKPSearch, a tool for searching documents by keyphrases that are semantically related to a given query phrase. By relating keyphrases semantically, we aim to provide users with extended search and browsing capability over a document collection and to increase the number of related results returned for a keyphrase query. Keyphrases provide a brief summary of the content of documents; they can be either assigned by the author or automatically extracted from the documents. SemKPSearch uses SemKPIndexes, which are generated from the keyphrases of the documents. A SemKPIndex is a keyphrase index extended with a keyphrase-to-keyphrase index that stores the semantic relation score between the keyphrases in the document collection. The semantic relation score between keyphrases is calculated using a metric that considers the similarity scores between the words of the keyphrases. The semantic similarity score between two words is determined with the help of two word-to-word semantic similarity metrics, namely the metric of Wu & Palmer and the metric of Li et al. SemKPSearch was evaluated by human evaluators, all of them computer engineers. For the evaluation, in addition to the author-assigned keyphrases, keyphrases automatically extracted with the state-of-the-art algorithm KEA were used to create the keyphrase indexes.
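For illustration, a word-to-word score based on the Wu & Palmer measure can be computed with NLTK's WordNet interface and aggregated over keyphrase words by best-match averaging. This is a simplified sketch, not SemKPSearch's exact relation metric, and the aggregation scheme is an assumption.

```python
# Sketch: Wu & Palmer word similarity via NLTK WordNet, aggregated over keyphrase
# words by best-match averaging. Requires nltk.download("wordnet") beforehand.
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2):
    """Best Wu & Palmer similarity over all synset pairs of the two words."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def keyphrase_relation(kp1, kp2):
    """Aggregate word-to-word scores between two keyphrases (assumed scheme)."""
    words1, words2 = kp1.lower().split(), kp2.lower().split()
    best1 = [max((word_similarity(a, b) for b in words2), default=0.0) for a in words1]
    best2 = [max((word_similarity(a, b) for a in words1), default=0.0) for b in words2]
    return (sum(best1) + sum(best2)) / (len(best1) + len(best2))

# Example usage:
# keyphrase_relation("semantic similarity", "word sense disambiguation")
```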
30

Clustering Frequent Navigation Patterns From Website Logs Using Ontology And Temporal Information

Kilic, Sefa 01 January 2012 (has links) (PDF)
Given a set of web pages labeled with ontological items, the level of similarity between two web pages is measured using the similarity between the ontological items with which the pages are labeled. Using this similarity measure between pages, the degree of similarity between two sequences of web page visits can be calculated as well. Similar frequent sequences are grouped with clustering algorithms, and representative sequences are selected from these groups. A new sequence is compared with all clusters and assigned to the most similar one. The representatives of the most similar cluster can be used in several real-world cases: for predicting and prefetching the next page a user will visit, for helping the user navigate the website, or for improving the structure of the website for easier navigation. This study also analyzes the effect of the time spent on each web page during the session.
