Spelling suggestions: "subject:"aemantic extual similarity"" "subject:"aemantic extual imilarity""
1 |
Methods for measuring semantic similarity of textsGaona, Miguel Angel Rios January 2014 (has links)
Measuring semantic similarity is a task needed in many Natural Language Processing (NLP) applications. For example, in Machine Translation evaluation, semantic similarity is used to assess the quality of the machine translation output by measuring the degree of equivalence between a reference translation and the machine translation output. The problem of semantic similarity (Corley and Mihalcea, 2005) is de ned as measuring and recognising semantic relations between two texts. Semantic similarity covers di erent types of semantic relations, mainly bidirectional and directional. This thesis proposes new methods to address the limitations of existing work on both types of semantic relations. Recognising Textual Entailment (RTE) is a directional relation where a text T entails the hypothesis H (entailment pair) if the meaning of H can be inferred from the meaning of T (Dagan and Glickman, 2005; Dagan et al., 2013). Most of the RTE methods rely on machine learning algorithms. de Marne e et al. (2006) propose a multi-stage architecture where a rst stage determines an alignment between the T-H pairs to be followed by an entailment decision stage. A limitation of such approaches is that instead of recognising a non-entailment, an alignment that ts an optimisation criterion will be returned, but the alignment by itself is a poor predictor for iii non-entailment. We propose an RTE method following a multi-stage architecture, where both stages are based on semantic representations. Furthermore, instead of using simple similarity metrics to predict the entailment decision, we use a Markov Logic Network (MLN). The MLN is based on rich relational features extracted from the output of the predicate-argument alignment structures between T-H pairs. This MLN learns to reward pairs with similar predicates and similar arguments, and penalise pairs otherwise. The proposed methods show promising results. A source of errors was found to be the alignment step, which has low coverage. However, we show that when an alignment is found, the relational features improve the nal entailment decision. The task of Semantic Textual Similarity (STS) (Agirre et al., 2012) is de- ned as measuring the degree of bidirectional semantic equivalence between a pair of texts. The STS evaluation campaigns use datasets that consist of pairs of texts from NLP tasks such as Paraphrasing and Machine Translation evaluation. Methods for STS are commonly based on computing similarity metrics between the pair of sentences, where the similarity scores are used as features to train regression algorithms. Existing methods for STS achieve high performances over certain tasks, but poor results over others, particularly on unknown (surprise) tasks. Our solution to alleviate this unbalanced performances is to model STS in the context of Multi-task Learning using Gaussian Processes (MTL-GP) ( Alvarez et al., 2012) and state-of-the-art iv STS features ( Sari c et al., 2012). We show that the MTL-GP outperforms previous work on the same datasets.
|
2 |
Použití neruonových sítí pro určení sémantické podobnosti dvou vět / Using Neural Networks to Determine Semantic Similarity of Two SentencesHrinčár, Peter January 2017 (has links)
Figuring out the degree of semantic similarity between two sentences is important for many practical applications of natural language processing. The goal is to determine the similarity of sentences on a scale from "sentences are unrelated" to "sentences are equivalent". In this thesis we examined application of di erent neural network architectures to solve this problem. We proposed models based on Recurrent neural networks, which convert text sequence to constant sized vector. We followed up with suitable representation of unknown words. Our experiments showed that simple architectures achieved better results on the used dataset. We see a future extension of this thesis by using bigger training dataset. 1
|
3 |
Modelo para sumarização computacional de textos científicos. / Scientific text computational summarization model.Tarafa Guzmán, Alejandro 07 March 2017 (has links)
Neste trabalho, propõe-se um modelo para a sumarização computacional extrativa de textos de artigos técnico-cientificos em inglês. A metodologia utilizada baseia-se em um módulo de avaliação de similaridade semântica textual entre sentenças, desenvolvido especialmente para integrar o modelo de sumarização. A aplicação deste módulo de similaridade à extração de sentenças é feita por intermédio do conceito de uma janela deslizante de comprimento variável, que facilita a detecção de equivalência semântica entre frases do artigo e aquelas de um léxico de frases típicas, atribuíveis a uma estrutura básica dos artigos. Os sumários obtidos em aplicações do modelo apresentam qualidade razoável e utilizável, para os efeitos de antecipar a informação contida nos artigos. / In this work a model is proposed for the computational extractive summarization of scientific papers in English. Its methodology is based on a semantic textual similarity module, for the evaluation of equivalence between sentences, specially developed to integrate the summarization model. A variable width window facilitates the application of this module to detect semantic similarity between phrases in the article and those in a basic structure, assignable to the articles. Practical summaries obtained with the model show usable quality to anticipate the information found in the papers.
|
4 |
Modelo para sumarização computacional de textos científicos. / Scientific text computational summarization model.Alejandro Tarafa Guzmán 07 March 2017 (has links)
Neste trabalho, propõe-se um modelo para a sumarização computacional extrativa de textos de artigos técnico-cientificos em inglês. A metodologia utilizada baseia-se em um módulo de avaliação de similaridade semântica textual entre sentenças, desenvolvido especialmente para integrar o modelo de sumarização. A aplicação deste módulo de similaridade à extração de sentenças é feita por intermédio do conceito de uma janela deslizante de comprimento variável, que facilita a detecção de equivalência semântica entre frases do artigo e aquelas de um léxico de frases típicas, atribuíveis a uma estrutura básica dos artigos. Os sumários obtidos em aplicações do modelo apresentam qualidade razoável e utilizável, para os efeitos de antecipar a informação contida nos artigos. / In this work a model is proposed for the computational extractive summarization of scientific papers in English. Its methodology is based on a semantic textual similarity module, for the evaluation of equivalence between sentences, specially developed to integrate the summarization model. A variable width window facilitates the application of this module to detect semantic similarity between phrases in the article and those in a basic structure, assignable to the articles. Practical summaries obtained with the model show usable quality to anticipate the information found in the papers.
|
5 |
Sentence Pair Modeling and BeyondLan, Wuwei January 2021 (has links)
No description available.
|
6 |
Similarités textuelles sémantiques translingues : vers la détection automatique du plagiat par traduction / Cross-lingual semantic textual similarity : towards automatic cross-language plagiarism detectionFerrero, Jérémy 08 December 2017 (has links)
La mise à disposition massive de documents via Internet (pages Web, entrepôts de données,documents numériques, numérisés ou retranscrits, etc.) rend de plus en plus aisée la récupération d’idées. Malheureusement, ce phénomène s’accompagne d’une augmentation des cas de plagiat.En effet, s’approprier du contenu, peu importe sa forme, sans le consentement de son auteur (ou de ses ayants droit) et sans citer ses sources, dans le but de le présenter comme sa propre œuvre ou création est considéré comme plagiat. De plus, ces dernières années, l’expansion d’Internet a également facilité l’accès à des documents du monde entier (écrits dans des langues étrangères)et à des outils de traduction automatique de plus en plus performants, accentuant ainsi la progression d’un nouveau type de plagiat : le plagiat translingue. Ce plagiat implique l’emprunt d’un texte tout en le traduisant (manuellement ou automatiquement) de sa langue originale vers la langue du document dans lequel le plagiaire veut l’inclure. De nos jours, la prévention du plagiat commence à porter ses fruits, grâce notamment à des logiciels anti-plagiat performants qui reposent sur des techniques de comparaison monolingue déjà bien éprouvées. Néanmoins, ces derniers ne traitent pas encore de manière efficace les cas translingues. Cette thèse est née du besoin de Compilatio, une société d’édition de l’un de ces logiciels anti-plagiat, de mesurer des similarités textuelles sémantiques translingues (sous-tâche de la détection du plagiat). Après avoir défini le plagiat et les différents concepts abordés au cours de cette thèse, nous établissons un état de l’art des différentes approches de détection du plagiat translingue. Nousprésentons également les différents corpus déjà existants pour la détection du plagiat translingue et exposons les limites qu’ils peuvent rencontrer lors d’une évaluation de méthodes de détection du plagiat translingue. Nous présentons ensuite le corpus que nous avons constitué et qui ne possède pas la plupart des limites rencontrées par les différents corpus déjà existants. Nous menons,à l’aide de ce nouveau corpus, une évaluation de plusieurs méthodes de l’état de l’art et découvrons que ces dernières se comportent différemment en fonction de certaines caractéristiques des textes sur lesquelles elles opèrent. Ensuite, nous présentons des nouvelles méthodes de mesure de similarités textuelles sémantiques translingues basées sur des représentations continues de mots(word embeddings). Nous proposons également une notion de pondération morphosyntaxique et fréquentielle de mots, qui peut aussi bien être utilisée au sein d’un vecteur qu’au sein d’un sac de mots, et nous montrons que son introduction dans ces nouvelles méthodes augmente leurs performances respectives. Nous testons ensuite différents systèmes de fusion et combinaison entre différentes méthodes et étudions les performances, sur notre corpus, de ces méthodes et fusions en les comparant à celles des méthodes de l’état de l’art. Nous obtenons ainsi de meilleurs résultats que l’état de l’art dans la totalité des sous-corpus étudiés. Nous terminons en présentant et discutant les résultats de ces méthodes lors de notre participation à la tâche de similarité textuelle sémantique (STS) translingue de la campagne d’évaluation SemEval 2017, où nous nous sommes classés 1er à la sous-tâche correspondant le plus au scénario industriel de Compilatio. / The massive amount of documents through the Internet (e.g. web pages, data warehouses anddigital or transcribed texts) makes easier the recycling of ideas. Unfortunately, this phenomenonis accompanied by an increase of plagiarism cases. Indeed, claim ownership of content, withoutthe consent of its author and without crediting its source, and present it as new and original, isconsidered as plagiarism. In addition, the expansion of the Internet, which facilitates access todocuments throughout the world (written in foreign languages) as well as increasingly efficient(and freely available) machine translation tools, contribute to spread a new kind of plagiarism:cross-language plagiarism. Cross-language plagiarism means plagiarism by translation, i.e. a texthas been plagiarized while being translated (manually or automatically) from its original languageinto the language of the document in which the plagiarist wishes to include it. While prevention ofplagiarism is an active field of research and development, it covers mostly monolingual comparisontechniques. This thesis is a joint work between an academic laboratory (LIG) and Compilatio (asoftware publishing company of solutions for plagiarism detection), and proposes cross-lingualsemantic textual similarity measures, which is an important sub-task of cross-language plagiarismdetection.After defining the plagiarism and the different concepts discussed during this thesis, wepresent a state-of-the-art of the different cross-language plagiarism detection approaches. Wealso present the preexisting corpora for cross-language plagiarism detection and show their limits.Then we describe how we have gathered and built a new dataset, which does not contain mostof the limits encountered by the preexisting corpora. Using this new dataset, we conduct arigorous evaluation of several state-of-the-art methods and discover that they behave differentlyaccording to certain characteristics of the texts on which they operate. We next present newmethods for measuring cross-lingual semantic textual similarities based on word embeddings.We also propose a notion of morphosyntactic and frequency weighting of words, which can beused both within a vector and within a bag-of-words, and we show that its introduction inthe new methods increases their respective performance. Then we test different fusion systems(mostly based on linear regression). Our experiments show that we obtain better results thanthe state-of-the-art in all the sub-corpora studied. We conclude by presenting and discussingthe results of these methods obtained during our participation to the cross-lingual SemanticTextual Similarity (STS) task of SemEval-2017, where we ranked 1st on the sub-task that bestcorresponds to Compilatio’s use-case scenario.
|
7 |
The Vectorian API – A Research Framework for Semantic Textual Similarity (STS) SearchesBurghardt, Manuel, Liebl, Bernhard 26 June 2024 (has links)
No description available.
|
8 |
O uso de recursos linguísticos para mensurar a semelhança semântica entre frases curtas através de uma abordagem híbridaSilva, Allan de Barcelos 14 December 2017 (has links)
Submitted by JOSIANE SANTOS DE OLIVEIRA (josianeso) on 2018-04-04T11:46:54Z
No. of bitstreams: 1
Allan de Barcelos Silva_.pdf: 2298557 bytes, checksum: dc876b1dd44e7a7095219195e809bb88 (MD5) / Made available in DSpace on 2018-04-04T11:46:55Z (GMT). No. of bitstreams: 1
Allan de Barcelos Silva_.pdf: 2298557 bytes, checksum: dc876b1dd44e7a7095219195e809bb88 (MD5)
Previous issue date: 2017-12-14 / Nenhuma / Na área de Processamento de Linguagem Natural, a avaliação da similaridade semântica textual é considerada como um elemento importante para a construção de recursos em diversas frentes de trabalho, tais como a recuperação de informações, a classificação de textos, o agrupamento de documentos, as aplicações de tradução, a interação através de diálogos, entre outras. A literatura da área descreve aplicações e técnicas voltadas, em grande parte, para a língua inglesa. Além disso, observa-se o uso prioritário de recursos probabilísticos, enquanto os aspectos linguísticos são utilizados de forma incipiente. Trabalhos na área destacam que a linguística possui um papel fundamental na avaliação de similaridade semântica textual, justamente por ampliar o potencial dos métodos exclusivamente probabilísticos e evitar algumas de suas falhas, que em boa medida são resultado da falta de tratamento mais aprofundado de aspectos da língua. Este contexto é potencializado no tratamento de frases curtas, que consistem no maior campo de utilização das técnicas de similaridade semântica textual, pois este tipo de sentença é composto por um conjunto reduzido de informações, diminuindo assim a capacidade de tratamento probabilístico eficiente. Logo, considera-se vital a identificação e aplicação de recursos a partir do estudo mais aprofundado da língua para melhor compreensão dos aspectos que definem a similaridade entre sentenças. O presente trabalho apresenta uma abordagem para avaliação da similaridade semântica textual em frases curtas no idioma português brasileiro. O principal diferencial apresentado é o uso de uma abordagem híbrida, na qual tanto os recursos de representação distribuída como os aspectos léxicos e linguísticos são utilizados. Para a consolidação do estudo, foi definida uma metodologia que permite a análise de diversas combinações de recursos, possibilitando a avaliação dos ganhos que são introduzidos com a ampliação de aspectos linguísticos e também através de sua combinação com o conhecimento gerado por outras técnicas. A abordagem proposta foi avaliada com relação a conjuntos de dados conhecidos na literatura (evento PROPOR 2016) e obteve bons resultados. / One of the areas of Natural language processing (NLP), the task of assessing the Semantic Textual Similarity (STS) is one of the challenges in NLP and comes playing an increasingly important role in related applications. The STS is a fundamental part of techniques and approaches in several areas, such as information retrieval, text classification, document clustering, applications in the areas of translation, check for duplicates and others. The literature describes the experimentation with almost exclusive application in the English language, in addition to the priority use of probabilistic resources, exploring the linguistic ones
in an incipient way. Since the linguistic plays a fundamental role in the analysis of semantic textual similarity between short sentences, because exclusively probabilistic works fails in some way (e.g. identification of far or close related sentences, anaphora) due to lack of understanding of the language. This fact stems from the few non-linguistic information in short sentences. Therefore, it is vital to identify and apply linguistic resources for better understand what make two or more sentences similar or not. The current work presents a hybrid approach, in which are used both of distributed, lexical and linguistic aspects for an evaluation of semantic textual similarity between short sentences in Brazilian Portuguese. We evaluated proposed approach with well-known and respected datasets in the literature (PROPOR 2016) and obtained good results.
|
9 |
Miljöpartiet and the never-ending nuclear energy debate : A computational rhetorical analysis of Swedish climate policyDickerson, Claire January 2022 (has links)
The domain of rhetoric has changed dramatically since its inception as the art of persuasion. It has adapted to encompass many forms of digital media, including, for example, data visualization and coding as a form of literature, but the approach has frequently been that of an outsider looking in. The use of comprehensive computational tools as a part of rhetorical analysis has largely been lacking. In this report, we attempt to address this lack by means of three case studies in natural language processing tasks, all of which can be used as part of a computational approach to rhetoric. At this same moment in time, it is becoming all the more important to transition to renewable energy in order to keep global warming under 1.5 degrees Celsius and ensure that countries meet the conditions of the Paris Agreement. Thus, we make use of speech data on climate policy from the Swedish parliament to ground these three analyses in semantic textual similarity, topic modeling, and political party attribution. We find that speeches are, to a certain extent, consistent within parties, given that a slight majority of most semantically similar speeches come from the same party. We also find that some of the most common topics discussed in these speeches are nuclear energy and the Swedish Green party, purported environmental risks due to renewable energy sources, and the job market. Finally, we find that though pairs of speeches are semantically similar, party rhetoric on the whole is generally not unique enough for speeches to be distinguishable by party. These results then open the door for a broader exploration of computational rhetoric for Swedish political science in the future.
|
10 |
Improving customer support efficiency through decision support powered by machine learningBoman, Simon January 2023 (has links)
More and more aspects of today’s healthcare are becoming integrated with medical technology and dependent on medical IT systems, which consequently puts stricter re-quirements on the companies delivering these solutions. As a result, companies delivering medical technology solutions need to spend a lot of resources maintaining high-quality, responsive customer support. In this report, possible ways of increasing customer support efficiency using machine learning and NLP is examined at Sectra, a medical technology company. This is done through a qualitative case study, where empirical data collection methods are used to elicit requirements and find ways of adding decision support. Next, a prototype is built featuring a ticket recommendation system powered by GPT-3 and based on 65 000 available support tickets, which is integrated with the customer supports workflow. Lastly, this is evaluated by having six end users test the prototype for five weeks, followed by a qualitative evaluation consisting of interviews, and a quantitative measurement of the user-perceivedusability of the proposed prototype. The results show some support that machine learning can be used to create decision support in a customer support context, as six out of six test users believed that their long-term efficiency could improve using the prototype in terms of reducing the average ticket resolution time. However, one out of the six test users expressed some skepticism towards the relevance of the recommendations generated by the system, indicating that improvements to the model must be made. The study also indicates that the use of state-of-the-art NLP models for semantic textual similarity can possibly outperform keyword searches.
|
Page generated in 0.1102 seconds