About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Using the SKOS Model for Standardizing Semantic Similarity and Relatedness Measures for Ontological Terminologies

Arockiasamy, Savarimuthu 14 August 2009 (has links)
No description available.
2

Combining text-based and vision-based semantics

Tran, Binh Giang January 2011 (has links)
Learning and representing semantics is one of the most important tasks in several growing research areas, as the recent survey of Turney and Pantel (2010) documents. In this thesis, we present an innovative (and first) framework for creating a multimodal distributional semantic model from state-of-the-art text- and image-based semantic models. We evaluate this multimodal semantic model on simulating similarity judgements, concept clustering and the newly introduced BLESS benchmark. We also propose an effective algorithm, namely Parameter Estimation, to integrate text- and image-based features in order to obtain a robust multimodal system. Our experiments show that this technique is very promising: across all experiments, our best multimodal model ranks first, and it remains competitive with other state-of-the-art text-based models. We explore various types of visual features, including SIFT and other color SIFT channels, to gain preliminary insights into how computer-vision techniques should be applied in the natural language processing domain. Importantly, in this thesis, we show evidence that adding visual features (as the perceptual information coming from...
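A minimal sketch of the kind of integration this abstract describes, assuming the two channels are mixed as a weighted sum of L2-normalized vectors with a single weight alpha; the vectors, dimensions, and alpha value are invented for illustration, and alpha stands in for whatever the thesis's Parameter Estimation step actually tunes.

```python
import numpy as np

def combine_modalities(text_vec, image_vec, alpha=0.5):
    """Mix L2-normalized text and image feature vectors.

    alpha weights the text channel, (1 - alpha) the image channel;
    it is the kind of parameter a tuning step would estimate on
    held-out similarity judgements.
    """
    t = text_vec / np.linalg.norm(text_vec)
    v = image_vec / np.linalg.norm(image_vec)
    return alpha * t + (1 - alpha) * v

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors for two concepts (real models would use distributional
# text features and SIFT-based image features of much higher dimension).
text = {"dog": np.array([0.9, 0.1, 0.3]), "cat": np.array([0.8, 0.2, 0.4])}
image = {"dog": np.array([0.2, 0.7, 0.1]), "cat": np.array([0.3, 0.6, 0.2])}

dog = combine_modalities(text["dog"], image["dog"], alpha=0.6)
cat = combine_modalities(text["cat"], image["cat"], alpha=0.6)
print(cosine(dog, cat))  # multimodal similarity estimate for the pair
```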
3

Hypothesis formulation in medical records space

Ba-Dhfari, Thamer Omer Faraj January 2017 (has links)
Patient medical records are a valuable resource that can be used for many purposes, including managing and planning for future health needs as well as clinical research. Health databases such as the Clinical Practice Research Datalink (CPRD) and many other similar initiatives can provide researchers with a useful data source on which to test their medical hypotheses. However, this is only the case when researchers have a good set of hypotheses to test on the data. Conversely, the data may have other equally important areas that remain unexplored, and there is a chance that some important signals in the data could be missed. Therefore, further analysis is required to make such hidden areas more obvious and attainable for future exploration and investigation. Data mining techniques can be effective tools for discovering patterns and signals in large-scale patient data sets, and they have been widely applied to different areas in the medical domain. Analysing patient data using such techniques therefore has the potential to explore the data and to provide a better understanding of the information in patient records. However, the heterogeneity and complexity of medical data can be an obstacle to applying data mining techniques, and much of the potential value of this data therefore goes untapped. This thesis describes a novel methodology that reduces the dimensionality of primary care data to make it more amenable to visualisation, mining and clustering. The methodology involves employing a combination of ontology-based semantic similarity and principal component analysis (PCA) to map the data into an appropriate and informative low-dimensional space. The aim of this thesis is to develop a novel methodology that provides a visualisation of patient records. This visualisation provides a systematic method that allows the formulation of new and testable hypotheses, which can be fed to researchers to carry out the subsequent phases of research. In a small-scale study based on Salford Integrated Record (SIR) data, I have demonstrated that this mapping provides informative views of patient phenotypes across a population and allows the construction of clusters of patients sharing common diagnoses and treatments. The next phase of the research was to develop this methodology and explore its application using larger patient cohorts. Such data contains more precise relationships between features than small-scale data; it also leads to an understanding of distinct population patterns and to the extraction of common features. For these reasons, I applied the mapping methodology to patient records from the CPRD database. The study data set consisted of anonymised patient records for a population of 2.7 million patients. The analysis shows that the methodology scales as O(n) and did not require large computing resources. The low-dimensional visualisation of high-dimensional patient data allowed the identification of different subpopulations of patients across the study data set, where each subpopulation consisted of patients sharing similar characteristics such as age, gender and certain types of diseases. A key finding of this research is the wealth of data that can be produced. In the first use case, looking at the stratification of patients with falls, the methodology yielded important hypotheses; however, this work has barely scratched the surface of how this mapping could be used.
It opens up the possibility of applying a wide range of data mining strategies that have not yet been explored. What the thesis has shown is one strategy that works, but there could be many more. Furthermore, there is no aspect of the implementation of this methodology that restricts it to medical data. The same methodology could equally be applied to the analysis and visualisation of many other sources of data that are described using terms from taxonomies or ontologies.
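A rough sketch of the mapping pipeline this abstract outlines (ontology-based semantic similarity followed by PCA), assuming a toy lookup table in place of a real ontology-based measure and invented patient code sets; only the overall shape of the computation reflects the described methodology.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for an ontology-based similarity between clinical codes.
SIM = {("asthma", "copd"): 0.8, ("asthma", "fracture"): 0.1,
       ("copd", "fracture"): 0.1}

def code_sim(a, b):
    return 1.0 if a == b else SIM.get((a, b), SIM.get((b, a), 0.0))

def patient_sim(p, q):
    """Average best-match similarity from patient p's codes to q's."""
    return sum(max(code_sim(a, b) for b in q) for a in p) / len(p)

patients = [{"asthma", "copd"}, {"asthma"}, {"fracture"}, {"copd"}]
S = np.array([[patient_sim(p, q) for q in patients] for p in patients])

# Project the patient-by-patient similarity matrix to 2-D with PCA; the
# resulting points can be plotted or clustered to surface subpopulations.
coords = PCA(n_components=2).fit_transform(S)
print(coords)
```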
4

Grouping Biological Data

Rundqvist, David January 2006 (has links)
Today, scientists in various biomedical fields rely on biological data sources in their research. Large amounts of information concerning, for instance, genes, proteins and diseases are publicly available on the internet, and are used daily for acquiring knowledge. Typically, biological data is spread across multiple sources, which has led to heterogeneity and redundancy.

The current thesis suggests grouping as one way of computationally managing biological data. A conceptual model for this purpose is presented, which takes properties specific to biological data into account. The model defines sub-tasks and key issues where multiple solutions are possible, and describes the approaches to these that have been used in earlier work. Further, an implementation of this model is described, as well as test cases which show that the model is indeed useful.

Since the use of ontologies is relatively new in the management of biological data, the main focus of the thesis is on how semantic similarity of ontological annotations can be used for grouping. The results of the test cases show, for example, that the implementation of the model, using the Gene Ontology, is capable of producing groups of data entries with similar molecular functions.
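A small sketch of grouping data entries by the similarity of their ontological annotations, assuming a Jaccard overlap of GO term sets in place of a real semantic similarity measure and an invented threshold; the thesis's conceptual model covers far more design choices than this greedy pass.

```python
# Toy Gene Ontology annotations (molecular-function terms) per entry.
GO = {"geneA": {"GO:0003824", "GO:0016301"},
      "geneB": {"GO:0003824"},
      "geneC": {"GO:0005215"}}

def jaccard(a, b):
    """Set-overlap stand-in for an ontology-aware similarity measure."""
    x, y = GO[a], GO[b]
    return len(x & y) / len(x | y)

def group_entries(entries, sim, threshold=0.4):
    """Greedy grouping: join the first group whose representative is
    similar enough, otherwise start a new group."""
    groups = []
    for e in entries:
        for g in groups:
            if sim(e, g[0]) >= threshold:
                g.append(e)
                break
        else:
            groups.append([e])
    return groups

print(group_entries(["geneA", "geneB", "geneC"], jaccard))
# [['geneA', 'geneB'], ['geneC']] -- similar molecular functions grouped
```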
5

A framework for data loss prevention using document semantic signature

Alhindi, Hanan 22 November 2019 (has links)
The theft and exfiltration of sensitive data (e.g., state secrets, trade secrets, company records, etc.) represent one of the most damaging threats that malicious insiders can carry out against institutions and organizations, because they can seriously diminish the confidentiality, integrity, and availability of the organization's data. Data protection and insider threat detection and prevention are significant steps for any organization to enhance its internal security. In the last decade, data loss prevention (DLP) has emerged as one of the key mechanisms organizations use to detect and block unauthorized data transfer across the organization's perimeter. However, existing DLP approaches face several practical challenges, such as their relatively low accuracy, which in turn affects their prevention capability. Current DLP approaches are also ineffective at handling unstructured data or searching and comparing content semantically when confronted with evasion tactics in which sensitive content is rewritten without changing its semantics. In this dissertation, we present a new DLP model that tracks sensitive data using a summarized version of the content semantics called the document semantic signature (DSS). The DSS can be updated dynamically as the protected content changes, and it is resilient against evasion tactics such as content rewriting. We use domain-specific ontologies to capture content semantics and track conceptual similarity and relevancy using adequate metrics to identify data leaks from sensitive documents. The evaluation of the DSS model on two public datasets from different domains of interest achieved very encouraging results in terms of detection effectiveness.
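A minimal sketch of the document semantic signature idea, assuming a toy keyword-to-concept lookup in place of a real domain ontology and cosine similarity between concept-frequency vectors; the point illustrated is that a rewritten document still maps to similar concepts, so the comparison survives rewording.

```python
import math
from collections import Counter

# Hypothetical keyword-to-concept lookup standing in for a domain
# ontology: synonyms land on the same concept, defeating rewriting.
ONTOLOGY = {"merger": "C_FINANCE", "acquisition": "C_FINANCE",
            "takeover": "C_FINANCE", "salary": "C_HR", "payroll": "C_HR"}

def semantic_signature(text):
    """Summarize a document as a concept-frequency vector."""
    return Counter(ONTOLOGY[w] for w in text.lower().split() if w in ONTOLOGY)

def signature_similarity(sig_a, sig_b):
    """Cosine similarity between two concept-frequency signatures."""
    dot = sum(sig_a[c] * sig_b[c] for c in set(sig_a) & set(sig_b))
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sensitive = semantic_signature("confidential merger and acquisition payroll plan")
rewritten = semantic_signature("notes on the takeover and salary schedule")
print(signature_similarity(sensitive, rewritten))  # high despite rewording
```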
6

Ontology Alignment using Semantic Similarity with Reference Ontologies

Silwal, Pramit January 2012 (has links)
No description available.
7

Semantic similarities at the core of generic indexing and clustering approaches

Fiorini, Nicolas 04 November 2015 (has links)
In order to improve the exploitation of an ever-growing number of electronic documents, Artificial Intelligence has dedicated a lot of effort to the creation and use of systems grounded on knowledge bases. In the information retrieval field in particular, such semantic approaches have proved their efficiency. Indexing documents is therefore a necessary task: it consists of associating them with sets of terms that describe their content. These terms can be keywords but also concepts from an ontology, in which case the annotation is said to be semantic and benefits from an inherent property of ontologies, the absence of ambiguity. Most approaches designed to annotate documents have to parse them and extract concepts from this parsing. This makes such approaches dependent on the type of document, since parsing requires dedicated algorithms. On the other hand, approaches that rely solely on semantic annotations can ignore the document type, enabling the creation of generic processes. This thesis capitalizes on genericity to build novel systems and compare them to state-of-the-art approaches. To this end, we rely on semantic annotations coupled with semantic similarity measures. Such generic approaches can, of course, be enriched with type-specific ones, which would further increase the quality of the results. First, this work explores the relevance of this paradigm for indexing documents. The idea is to rely on already-annotated close documents to annotate a target document. We define a heuristic algorithm for this purpose that uses the semantic annotations of these close documents and semantic similarities to provide a generic indexing method. The result is USI (User-oriented Semantic Indexer), which we show to perform as well as the best current systems while being faster. Second, this idea is extended to another task, clustering. Clustering is a very common and ancient process that is very useful for finding documents or understanding a set of documents. We propose a hierarchical clustering algorithm that reuses the same components as classical methods to provide a novel one applicable to any kind of documents. Another benefit of this approach is that when documents are grouped together, the group can be annotated using our indexing algorithm; the result is therefore not only a hierarchy of clusters containing documents, as the clusters themselves are described by concepts as well. This helps considerably in understanding the results of the clustering. This thesis shows that, apart from enhancing classical approaches, building conceptual approaches allows us to abstract them and provide a generic framework. While bringing easy-to-set-up methods (as long as documents are semantically annotated), genericity does not prevent us from mixing these methods with type-specific ones, in other words creating hybrid methods.
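A compact sketch of the neighbour-based indexing idea behind USI, assuming invented concepts and a toy concept similarity; the published heuristic optimizes the annotation set more carefully than this top-k scoring.

```python
# Toy concept similarity: identical concepts score 1.0, otherwise a lookup.
PAIRS = {("clustering", "classification"): 0.6}

def concept_sim(a, b):
    return 1.0 if a == b else PAIRS.get((a, b), PAIRS.get((b, a), 0.2))

def annotate_from_neighbors(neighbor_annotations, k=2):
    """Index a new document with the k concepts that best agree,
    by semantic similarity, with its close documents' annotations."""
    candidates = {c for ann in neighbor_annotations for c in ann}
    def score(c):
        return sum(max(concept_sim(c, x) for x in ann)
                   for ann in neighbor_annotations) / len(neighbor_annotations)
    return sorted(candidates, key=score, reverse=True)[:k]

# Annotations of documents judged close to the target document.
neighbors = [{"clustering", "ontology"}, {"clustering", "indexing"}]
print(annotate_from_neighbors(neighbors))  # 'clustering' ranks first
```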
8

Natural Language Processing Based Generator of Testing Instruments

Wang, Qianqian 01 September 2017 (has links)
Natural Language Processing (NLP) is the field of study that focuses on the interactions between human language and computers. By “natural language” we mean a language that humans use for everyday communication. Unlike programming languages, natural languages are hard to define with precise rules. NLP is developing rapidly and has been widely used in different industries; technologies based on NLP are becoming increasingly widespread, for example, intelligent personal assistants such as Siri or Alexa that use NLP algorithms to communicate with people. “Natural Language Processing Based Generator of Testing Instruments” is a stand-alone program that generates “plausible” multiple-choice selections by performing word sense disambiguation and calculating semantic similarity between two natural language entities. The core is Word Sense Disambiguation (WSD): identifying which sense of a word is used in a sentence when the word has multiple meanings. WSD is considered an AI-hard problem. The project presents several algorithms to solve the WSD problem and compute semantic similarity, along with experimental results demonstrating their effectiveness.
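As an illustration of the two components this abstract names, word sense disambiguation and semantic similarity, here is a sketch using NLTK's simplified Lesk algorithm and WordNet path similarity; the thesis does not state its tooling, so NLTK and the example words are assumptions.

```python
import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)  # WordNet data needed on first run

# Disambiguate "bank" in context with the simplified Lesk algorithm.
context = "I deposited the cheque at the bank on Friday".split()
sense = lesk(context, "bank", pos="n")
print(sense.name(), "-", sense.definition())

# Rank candidate distractors by similarity to the disambiguated sense:
# plausible multiple-choice distractors are related but not identical.
for word in ["credit_union", "museum", "river"]:
    candidate = wn.synsets(word, pos="n")[0]
    print(word, sense.path_similarity(candidate))
```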
9

Using semantic similarity measures across Gene Ontology to predict protein-protein interactions

Helgadóttir, Hanna Sigrún January 2005 (has links)
Living cells are controlled by proteins and genes that interact through complex molecular pathways to achieve a specific function. Determining protein-protein interactions is therefore fundamental for understanding the cell's lifecycle and functions, and the function of a protein is also largely determined by its interactions with other proteins. The amount of protein-protein interaction data available has multiplied with the emergence of large-scale technologies for detecting them, but the drawback of such technologies is the relatively high amount of noise present in the data. Because it is time-consuming to determine protein-protein interactions experimentally, the aim of this project is to create a computational method that predicts interactions with high sensitivity and specificity. Semantic similarity measures were applied across the Gene Ontology terms assigned to proteins in S. cerevisiae to predict protein-protein interactions. Three semantic similarity measures were tested to see which one performs best in predicting such interactions. Based on the results, a method that predicts the function of proteins in connection with connectivity was devised. The results show that semantic similarity is a useful measure for predicting protein-protein interactions.
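A toy sketch of one commonly used measure of this kind, Resnik similarity (information content of the most informative common ancestor), applied as an interaction predictor; the GO fragment, annotation probabilities, and threshold below are invented, and the abstract does not name the three measures it actually compared.

```python
import math

# Invented GO fragment (child -> parents) and annotation probabilities.
PARENTS = {"GO:kinase": ["GO:catalytic"], "GO:phosphatase": ["GO:catalytic"],
           "GO:catalytic": ["GO:root"], "GO:root": []}
PROB = {"GO:kinase": 0.05, "GO:phosphatase": 0.05,
        "GO:catalytic": 0.2, "GO:root": 1.0}

def ancestors(term):
    out = {term}
    for parent in PARENTS[term]:
        out |= ancestors(parent)
    return out

def resnik(t1, t2):
    """Information content of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(PROB[t]) for t in common)

def predict_interaction(annots_a, annots_b, threshold=1.0):
    """Call a protein pair interacting when the best GO-term similarity
    between their annotation sets clears a tuned threshold."""
    best = max(resnik(a, b) for a in annots_a for b in annots_b)
    return best, best >= threshold

print(predict_interaction({"GO:kinase"}, {"GO:phosphatase"}))
# (~1.61, True): shared ancestor GO:catalytic is informative enough
```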
