31

Unsupervised Information Extraction From Text - Extraction and Clustering of Relations between Entities

Wang, Wei 16 May 2013 (has links) (PDF)
Unsupervised information extraction in open domains has recently gained importance by loosening the constraints on the strict definition of the extracted information, allowing the design of more open information extraction systems. Within this new domain of unsupervised information extraction, this thesis focuses on the extraction and clustering of relations between entities at a large scale. The objective of relation extraction is to discover unknown relations in texts. A relation prototype is first defined, with which candidate relation instances are initially extracted under a minimal criterion. To guarantee the validity of the extracted relation instances, a two-step filtering procedure is applied: the first step uses filtering heuristics to efficiently remove a large number of false relations, and the second step uses statistical models to refine the selection of relation candidates. The objective of relation clustering is to organize the extracted relation instances into clusters so that their relation types can be characterized by the formed clusters and a synthetic view can be offered to end users. A multi-level clustering procedure is designed, which takes into account both the massive amount of data and the diversity of linguistic phenomena. First, basic clustering groups relation instances with similar linguistic expressions, using only simple similarity measures on a bag-of-words representation, to form highly homogeneous basic clusters. Second, semantic clustering groups basic clusters whose relation instances share the same meaning, dealing more particularly with phenomena such as synonymy and more complex paraphrases. Different similarity measures, based on resources such as WordNet or a distributional thesaurus, are analyzed at the level of words, relation instances, and basic clusters.
Moreover, topic-based relation clustering is proposed to take thematic information into account so that more precise semantic clusters can be formed. Finally, the thesis also tackles the problem of clustering evaluation in the context of unsupervised information extraction, using both internal and external measures. For the evaluations with external measures, an interactive and efficient way of building a reference of relation clusters is proposed. The application of this method to a newspaper corpus results in a large reference, against which different clustering methods are evaluated.
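The basic clustering step described above can be illustrated with a minimal sketch (hypothetical code, not the thesis implementation): relation instances are represented as bags of words, compared with cosine similarity, and greedily grouped against cluster seeds.

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words representations of two relation instances."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def basic_clusters(instances, threshold=0.5):
    """Greedy single-pass clustering: attach each instance to the first
    cluster whose seed is similar enough, otherwise start a new cluster."""
    clusters = []
    for inst in instances:
        for cl in clusters:
            if bow_cosine(inst, cl[0]) >= threshold:
                cl.append(inst)
                break
        else:
            clusters.append([inst])
    return clusters
```

The threshold and the greedy seed comparison are illustrative choices; the thesis applies more elaborate measures at the word, instance, and cluster levels.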
32

Analýza nizozemských idiomů s komponentem "peníze", srovnání míry ekvivalence v češtině / Analysis of the Dutch Idioms Containing the Word "Money" and their Semantic Similarity with the Czech Idioms

Šivecová, Barbora January 2017 (has links)
The thesis conducts a semantic analysis of Dutch idioms containing the word "money", categorizes the idioms into semantic categories, and compares their semantic similarity with Czech idioms. The theoretical part describes phraseology in general and addresses the features and classification of idioms. It also covers the theory of cognitive semantics and the history of phraseology, both within Dutch studies and in Czech phraseology. The practical part characterizes the Dutch corpus, carries out the semantic analysis of the Dutch idioms, and determines their semantic similarity with Czech. The corpus consists of 133 Dutch idioms containing the word "money". The results show that the idioms can be categorized on the basis of their prototypical qualities or concepts; the most dominant concept is "the power of money". In general, the meanings of the idioms are quite varied and do not necessarily relate to financial affairs. The analysis of semantic similarity with Czech idioms has shown that most of the Dutch idioms have no Czech equivalent. A further result of the thesis is a Dutch-Czech phraseological dictionary.
33

Development of new computational methods for a synthetic gene set annotation / Développement de nouvelles méthodes informatiques pour une annotation synthétique d’un ensemble de gènes.

Ayllón-Benítez, Aarón 05 December 2019 (has links)
The revolution in new sequencing technologies, by strongly improving the production of omics data, is leading to new understandings of the relations between genotype and phenotype. To interpret and analyze data grouped according to a phenotype of interest, methods based on statistical enrichment have become a standard in biology. However, these methods synthesize the biological information by a priori selecting the over-represented terms, and focus on the most studied genes, which may represent only a limited coverage of the annotated genes within a gene set. During this thesis, we explored different methods for annotating gene sets, developing three studies that improve the understanding of their biological context.
First, visualization approaches were applied to represent the annotation results provided by enrichment analysis for a gene set or a repertoire of gene sets. In this work, a visualization prototype called MOTVIS (MOdular Term VISualization) was developed to provide an interactive representation of a repertoire of gene sets, combining two visual metaphors: a treemap view that provides an overview and displays detailed information about gene sets, and an indented tree view that can be used to focus on the annotation terms of interest. MOTVIS has the advantage of overcoming the limitations of each visual metaphor when used individually. This illustrates the interest of combining visual metaphors to facilitate the comprehension of biological results when complex data are represented.
Secondly, to address the issues of enrichment analysis, a new method was proposed for analyzing the impact of using different semantic similarity measures on gene set annotation. To evaluate the impact of each measure, two criteria were considered relevant for characterizing a "good" synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced while maintaining a sufficient level of detail, and (ii) the number of genes described by the selected terms should be as large as possible. Thus, nine semantic similarity measures were analyzed to identify the best possible compromise between these criteria. Using the Gene Ontology (GO) to annotate the gene sets, we observed better results with node-based measures, which use the terms' characteristics, than with edge-based measures, which use the relations connecting terms. The annotations achieved with the different node-based measures did not exhibit major differences regardless of the term characteristics used. We then developed GSAn (Gene Set Annotation), a novel gene set annotation web server that uses semantic similarity measures to synthesize a priori GO annotation terms. GSAn integrates the interactive visualization MOTVIS to display the representative terms of gene set annotations. Compared to enrichment analysis tools, GSAn shows excellent results in maximizing gene coverage while minimizing the number of terms.
At last, the third work consisted in enriching the annotation results provided by GSAn. Since the knowledge described in GO may not be sufficient for interpreting gene sets, other biological information, such as pathways and diseases, may be useful to provide a wider biological context. Thus, two additional knowledge resources, Reactome and the Disease Ontology (DO), were integrated within GSAn: GO terms were mapped to terms of Reactome and DO, both before and after applying the GSAn method. The integration of these resources improved the results in terms of gene coverage without significantly affecting the number of involved terms. Two strategies were applied to find mappings (generated or extracted from the Web) between each new resource and GO. We have shown that a mapping process applied before running the GSAn method yields a larger number of inter-relations between the two knowledge resources.
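The compromise GSAn targets, few annotation terms but large gene coverage, resembles a set-cover problem. A hypothetical greedy sketch (not the actual GSAn algorithm) shows the idea:

```python
def representative_terms(term_to_genes, target_genes):
    """Greedy set cover: pick annotation terms until the target genes are
    covered, always preferring the term covering the most uncovered genes."""
    uncovered, chosen = set(target_genes), []
    while uncovered:
        # first term with maximal gain wins ties (dicts keep insertion order)
        term = max(term_to_genes, key=lambda t: len(term_to_genes[t] & uncovered))
        gain = term_to_genes[term] & uncovered
        if not gain:          # remaining genes have no annotating term
            break
        chosen.append(term)
        uncovered -= gain
    return chosen
```

Term names and the gene-to-term map are illustrative; GSAn additionally weighs terms by semantic similarity and level of detail rather than raw coverage alone.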
34

Počítač jako inteligentní spoluhráč ve slovně-asociační hře Krycí jména / Computer as an Intelligent Partner in the Word-Association Game Codenames

Obrtlík, Petr January 2018 (has links)
This thesis deals with associations between words. It describes the design and implementation of a system that can stand in for a human player in the word-association game Codenames. The system uses the Gensim and FastText libraries to create semantic models; the relationships between words are learned by analyzing the CWC-2011 text corpus.
35

Sémantická blízkost pro vědecké články / Semantic Relatedness of Scientific Articles

Dresto, Erik January 2011 (has links)
The main goal of the thesis is to explore basic methods that can be used to find semantically related scientific articles. All the methods are explained in detail, compared, and finally evaluated using standard metrics. Based on this evaluation, a new method for computing the semantic similarity of scientific articles is proposed. The proposed method builds on current state-of-the-art methods and adds another important factor for computing similarity: citations. Using citations is important, since they represent a stable link between articles. Finally, the proposed method is evaluated on real data and compared with the other described methods.
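The combination of content similarity with citation overlap might be sketched as a simple linear mix; this exact formula is an illustrative assumption, not taken from the thesis:

```python
def jaccard(c1: set, c2: set) -> float:
    """Overlap of two articles' citation sets."""
    return len(c1 & c2) / len(c1 | c2) if c1 | c2 else 0.0

def combined_similarity(text_sim, cites_a, cites_b, alpha=0.7):
    """Linear mix of a precomputed content similarity score and
    citation overlap; alpha controls the weight of the text signal."""
    return alpha * text_sim + (1 - alpha) * jaccard(cites_a, cites_b)
```

The citation term rewards article pairs that cite common work even when their vocabulary differs, which is the static bond the abstract refers to.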
36

Exploring State-of-the-Art Natural Language Processing Models with Regards to Matching Job Adverts and Resumes

Rückert, Lise, Sjögren, Henry January 2022 (has links)
The ability to automate the process of comparing and matching resumes with job adverts is a growing research field. This can be done using Natural Language Processing (NLP), the area of machine learning that enables a model to learn human language. This thesis explores and evaluates the application of the state-of-the-art NLP model SBERT to the task of comparing and calculating a measure of similarity between text extracted from resumes and adverts. It also investigates what type of data generates the best-performing model on this task. The results show that SBERT can quickly be trained on unlabeled data from the HR domain using a Triplet network, and achieves high performance when tested on various tasks. The models are shown to be bilingual, can handle unseen vocabulary, and capture the concept and descriptive context of entire sentences rather than only single words. The conclusion is therefore that the models have a solid grasp of semantic similarity and relatedness. However, in some cases the models become binary in their similarity calculations between inputs. Moreover, it is hard to tune a model that is exhaustively comprehensive of such a diverse domain as HR. A model fine-tuned on clean and generic data extracted from adverts shows the best overall performance in terms of loss and consistency.
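The triplet objective used to fine-tune such encoders can be sketched in a few lines. This is a generic formulation with Euclidean distance; the abstract does not specify the exact SBERT training setup:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet objective for fine-tuning sentence encoders: push the
    anchor-positive distance below the anchor-negative distance by `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

For the matching task itself, a positive would be a resume paired with a fitting advert and a negative an unrelated one; the loss is zero once the encoder separates them by the margin.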
37

Automated Extraction of Insurance Policy Information : Natural Language Processing techniques to automate the process of extracting information about the insurance coverage from unstructured insurance policy documents.

Hedberg, Jacob, Furberg, Erik January 2023 (has links)
This thesis investigates Natural Language Processing (NLP) techniques for extracting relevant information from long and unstructured insurance policy documents. The goal is to reduce the time readers need to understand the coverage within the documents. The study uses predefined insurance policy coverage parameters, created by industry experts, to represent what is covered in the policy documents. Three NLP approaches are used to classify text sequences into insurance parameter classes. The thesis shows that using SBERT to create vector representations of text, enabling cosine similarity calculations, is an effective approach. The top-scoring sequences for each parameter are assigned that parameter class. This approach substantially reduces the number of sequences a user has to read, but misclassifies some positive examples. To improve the model, the parameter definitions and training data were combined into a support set. Similarity scores were calculated between all sequences and the support sets for each parameter using different pooling strategies. This few-shot classification approach performed well for the use case, improving the model's performance significantly. In conclusion, this thesis demonstrates that NLP techniques can help readers understand unstructured insurance policy documents. The model developed in this study can be used to extract important information and reduce the time needed to understand the contents of an insurance policy document. A human expert would still be required to interpret the extracted text. The balance between the amount of relevant information and the amount of text shown depends on how many of the top-scoring sequences are classified for each parameter. The study also identifies some limitations of the approach depending on the available data.
Overall, this research provides insight into the potential implications of NLP techniques for information extraction and the insurance industry.
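The support-set scoring described above can be roughly sketched as follows, with mean pooling over each support set and hypothetical parameter names; the thesis also evaluated other pooling strategies:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def classify(seq_vec, support_sets, threshold=0.5):
    """Score a sequence embedding against each parameter's mean-pooled
    support set; return the best-matching class, or None below threshold."""
    best_label, best_score = None, threshold
    for label, vecs in support_sets.items():
        proto = np.mean(vecs, axis=0)   # mean pooling over the support set
        score = cosine(seq_vec, proto)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In practice the embeddings would come from an SBERT encoder; the 2-dimensional vectors and the threshold used below are illustrative only.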
38

Semantic Similarity Comparison of Political Statements by ChatGPT and Political Representatives / Jämförelse i semantisk likhet mellan politiska uttalanden från ChatGPT och från politiska representanter

Lihammer, Sebastian January 2023 (has links)
ChatGPT is a recently released chatbot that uses deep learning to generate human-like statements on a variety of topics. Deep learning models have the potential to affect politics: they can, for instance, serve as a source of political information or be used to create and spread political messages. ChatGPT is itself able to describe the stances of different political parties and can generate political messages based on these stances. In this thesis, a semantic similarity program using the models Stanza and Sentence-BERT is implemented. The program is used to compare the semantic similarity of political statements and information generated by ChatGPT to authentic statements and information written by Swedish political representatives prior to the 2022 general election. The results demonstrate that ChatGPT is able to correctly reflect the standpoints of Swedish political parties in specific political questions with relatively high accuracy (over 60% when three options are available). When compared to authentic political information using semantic similarity, there is no discernible difference between the scores achieved by ChatGPT's statements and those achieved by authentic statements from political representatives. This might indicate that ChatGPT performs well in semantically mimicking the style used by political representatives; alternatively, it could indicate limited usefulness of semantic similarity as a comparative method for political statements.
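Comparing a group of generated statements against a group of authentic ones can be sketched as an average over pairwise similarity scores. This is an illustrative helper; the thesis pipeline uses Stanza and Sentence-BERT to compute the underlying `sim` function:

```python
def mean_pairwise_similarity(group_a, group_b, sim):
    """Average similarity of every statement in group_a to every
    statement in group_b, using a caller-supplied similarity function."""
    scores = [sim(a, b) for a in group_a for b in group_b]
    return sum(scores) / len(scores)
```

If the mean score for ChatGPT-vs-reference is close to the mean for authentic-vs-reference, the two sources are indistinguishable under that measure, which is the null result the abstract reports.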
39

Biological and clinical data integration and its applications in healthcare

Hagen, Matthew 07 January 2016 (has links)
Answers to the most complex biological questions are rarely determined solely from the experimental evidence; they require subsequent analysis of many, often heterogeneous, data sources. Most biological data repositories focus on providing only one particular type of data, such as sequences, molecular interactions, protein structures, or gene expression, so researchers often must visit several different databases to answer a single scientific question. It is essential to develop efficient and seamless strategies for integrating disparate biological data sources, in order to facilitate the discovery of novel associations and validate existing hypotheses. This thesis presents the design and development of different integration strategies for biological and clinical systems. The BioSPIDA system is a data warehousing solution that integrates many NCBI databases and other biological sources on protein sequences, protein domains, and biological pathways. It utilizes a universal parser that facilitates integration without requiring separate source code for each data site. This enables users to execute fine-grained queries that can filter genes by their protein interactions, gene expression, functional annotation, and protein domain representation. Relational databases can quickly return filtered results to research questions, but they are not the most suitable solution in all cases. Clinical patients and genes are typically annotated with concepts from hierarchical ontologies, and the performance of relational databases weakens considerably when traversing and representing graph structures. This thesis illustrates when relational databases are most suitable and compares performance benchmarks of semantic web technologies and graph databases for comparing ontological concepts. Several approaches to analyzing the integrated data are discussed to demonstrate the advantages over depending on remote data centers.
Intensive care patients are prioritized by their length of stay, and their severity class is estimated from their diagnosis, to help minimize wait time and preferentially treat patients by condition. In a separate study, semantic clustering of patients is conducted by integrating a clinical database and a medical ontology to help identify multi-morbidity patterns. In the biological area, gene pathways, protein interaction networks, and functional annotation are integrated to help predict and prioritize candidate disease genes. The thesis presents the results generated from each project by utilizing a local repository of genes, functional annotations, protein interactions, clinical patients, and medical ontologies.
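Comparing ontological concepts typically requires traversing is_a hierarchies, which is what strains relational databases. A minimal traversal sketch, with a hypothetical parent map rather than the thesis code, collects all ancestors of a term:

```python
def ancestors(term, parents):
    """All ancestors of an ontology term, following parent (is_a) links
    transitively with an explicit stack to avoid recursion limits."""
    seen = set()
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen
```

Measures such as shared-ancestor counts between two concepts build directly on this traversal, which graph databases and semantic web stores execute natively while SQL needs recursive joins.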
40

Recomendação de conteúdo baseada em informações semânticas extraídas de bases de conhecimento / Content recommendation based on semantic information extracted from knowledge bases

Silva Junior, Salmo Marques da 10 May 2017 (has links)
In order to support users during the consumption of products, Web systems have incorporated recommendation modules. The most popular approaches are content-based filtering, which recommends items based on features of interest to the user, and collaborative filtering, which recommends items that were well rated by users with preferences similar to the target user's, or that are similar to items the target user rated positively. While the first approach has limitations such as over-specialization and limited content analysis, the second faces problems such as the new user and the new item, a limitation also known as cold start. In spite of the variety of techniques available, a common problem is the lack of semantic information to represent item features. Recent work in the field of recommender systems has studied the possibility of using knowledge bases from the Web as a source of semantic information. However, it is still necessary to investigate how to exploit such semantic information and integrate it efficiently into recommender systems. This work therefore investigates how semantic information gathered from knowledge bases can benefit recommender systems through the semantic description of items, and how semantic similarity can mitigate the challenge faced in the cold-start scenario. As a result, we obtained a technique that produces recommendations suited to users' profiles, including relevant new items in the collection. An improvement of up to 10% in RMSE can be observed in the cold-start scenario when comparing the proposed system with a system whose rating prediction is based on the correlation of ratings.
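The cold-start mitigation via semantic similarity can be sketched as a similarity-weighted rating prediction. This is a hypothetical helper; in the thesis the item similarities are derived from semantic descriptions extracted from knowledge bases:

```python
def predict_rating(user_ratings, item_sim, new_item):
    """Content-based prediction for a cold-start item: the similarity-weighted
    average of the ratings the target user gave to already-known items."""
    num = sum(item_sim(new_item, i) * r for i, r in user_ratings.items())
    den = sum(item_sim(new_item, i) for i in user_ratings)
    return num / den if den else None
```

Because the weights come from item descriptions rather than rating history, the new item needs no ratings of its own, which is exactly what collaborative filtering lacks in this scenario.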
