41 |
Lexicalisation souple en réalisation de texte / Flexible Lexicalization in Text Realization
Gazeau, Avril 08 1900 (has links)
GenDR is a symbolic text realizer: its input is a graph (a semantic representation) and its output is the set of corresponding syntactic dependency trees. One of the tasks GenDR performs to carry out this transduction is deep lexicalization, i.e. choosing the right lexical units to express the semantemes of the input semantic representation. To do so, GenDR needs a semantic dictionary that maps semantemes to the corresponding lexical units in a given language.
This study aims to develop a flexible lexicalization module that automatically builds a rich French semantic dictionary for GenDR, whose current dictionary is very sparse. The richer GenDR's dictionary, the wider its capacity to paraphrase, which allows it to generate the basis for varied, natural texts expressing the same meaning. To achieve this, we tested two methods.
The first method reorganized the data of the French Lexical Network (Réseau Lexical du Français) into a semantic dictionary, turning each node of the network into a dictionary entry and treating the nodes linked to it by a type of lexical relation we call semantically empty paradigmatic lexical functions as its lexicalizations.
The second method tested the ability of a contextual neural language model to generate potential additional lexicalizations: for each dictionary entry, a vector is computed and its nearest neighbours are retrieved in order to expand the dictionary's coverage.
The dictionary built from the French Lexical Network is compatible with GenDR, and its coverage has been broadened considerably. The additional lexicalizations generated by the neural model turned out to be of limited use, which leads us to conclude that the model we tested is not quite suited to the kind of task we asked of it.
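To illustrate the second method's enrichment step: the thesis queries a contextual language model, but the nearest-neighbour lookup itself is easiest to sketch with static vectors in gensim. A minimal sketch, assuming a hypothetical pre-trained French vector file and example words:

```python
from gensim.models import KeyedVectors

# Hypothetical pre-trained French vectors; any word2vec/fastText model in
# KeyedVectors format would do.
vectors = KeyedVectors.load_word2vec_format("fr_vectors.bin", binary=True)

def candidate_lexicalizations(entry, existing, topn=10):
    """Nearest neighbours of a dictionary entry, minus known lexicalizations."""
    if entry not in vectors.key_to_index:
        return []
    return [(word, sim) for word, sim in vectors.most_similar(entry, topn=topn)
            if word not in existing]

# e.g. candidate_lexicalizations("voiture", {"automobile", "auto"})
```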
|
42 |
Classification of Transcribed Voice Recordings : Determining the Claim Type of Recordings Submitted by Swedish Insurance Clients / Klassificering av Transkriberade Röstinspelningar
Piehl, Carl January 2021 (has links)
In this thesis, we investigate the problem of building a text classifier for transcribed voice recordings submitted by insurance clients. We compare different models in the context of two tasks. The first is a binary classification problem, where the models must determine whether or not a transcript belongs to a particular type. The second is a multiclass problem, where the models have to choose between several types when labelling transcripts, resulting in a data set with a highly imbalanced class distribution. We evaluate four models: pretrained BERT and three LSTMs with different word embeddings: ELMo, word2vec, and a baseline with a randomly initialized embedding layer. In the binary task, false positives are more costly than false negatives, so we use weighted cross-entropy loss to achieve high precision for the positive class at the expense of recall. In the multiclass task, we use focal loss and weighted cross-entropy loss to reduce bias toward the majority classes. We find that BERT outperforms the other models and that the baseline is worst across both tasks. The performance gap is largest in the multiclass task on classes with fewer samples, which demonstrates the benefit of large language models in data-constrained scenarios. In the binary task, weighted cross-entropy loss provides a simple yet effective way to condition the model to favor certain types of errors. In the multiclass task, both focal loss and weighted cross-entropy loss reduce bias toward the majority classes. However, BERT fine-tuned with regular cross-entropy loss shows no bias toward the majority classes, achieving high recall across all classes.
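As a sketch of the loss engineering described above, in PyTorch; the class weights and focusing parameter are illustrative placeholders, not values from the thesis:

```python
import torch
import torch.nn.functional as F

# Weighted cross-entropy: up-weighting the negative class pushes the model
# toward higher precision on the positive class (fewer false positives).
class_weights = torch.tensor([2.0, 1.0])  # placeholder weights, not the thesis's
loss_ce = torch.nn.CrossEntropyLoss(weight=class_weights)

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: down-weights easy examples so minority classes matter more."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-((1 - pt) ** gamma) * log_pt).mean()

logits = torch.randn(4, 2)           # batch of 4, 2 classes
targets = torch.tensor([0, 1, 1, 0])
print(loss_ce(logits, targets), focal_loss(logits, targets))
```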
|
43 |
Optimering av en chattbot för det svenska språket / Optimization of a Chatbot for the Swedish Language
Mutaliev, Mohammed, Almimar, Ibrahim January 2021 (has links)
Chatbot developers at Softronic currently use the Rasa framework and its default components for processing user input. This is problematic, as the default components are not optimized for the Swedish language. An evaluation of all Rasa components was therefore requested, with the purpose of identifying the most favorable components to maximize classification accuracy. In this thesis, several Rasa pipelines with different components for tokenization, feature extraction and classification were developed and compared. For tokenization, Rasa's WhitespaceTokenizer surpassed both SpacyTokenizer and StanzaTokenizer. For feature extraction, CountVectorsFeaturizer, LanguageModelFeaturizer (with the LaBSE model) and FastTextFeaturizer (with the official fastText model trained on Swedish Wikipedia) were the most effective components. The classifier that generally performed best was DIETClassifier, although SklearnIntentClassifier surpassed it on several occasions. This work resulted in several pipelines that exceeded Rasa's default pipeline, two of which performed best. The first implemented WhitespaceTokenizer, CountVectorsFeaturizer, FastTextFeaturizer (with the official fastText model trained on Swedish Wikipedia) and DIETClassifier, reaching a classification accuracy of 91% (F1 score). The second implemented WhitespaceTokenizer, LanguageModelFeaturizer (with the LaBSE model) and SklearnIntentClassifier, reaching a classification accuracy of 91.5% (F1 score).
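For illustration, the second (best-scoring) pipeline would correspond to a Rasa config.yml along these lines; the model_weights value is an assumption based on Rasa's published LaBSE weights, not a detail taken from the thesis:

```python
import yaml

# Sketch of a Rasa `config.yml` for the best-scoring pipeline above.
# The `model_weights` value is an assumption (Rasa's published LaBSE
# weights), not a setting reported in the thesis.
config = {
    "language": "sv",
    "pipeline": [
        {"name": "WhitespaceTokenizer"},
        {"name": "LanguageModelFeaturizer",
         "model_name": "bert",
         "model_weights": "rasa/LaBSE"},
        {"name": "SklearnIntentClassifier"},
    ],
}
print(yaml.safe_dump(config, sort_keys=False))
```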
|
44 |
„The Vectorian“ – Eine parametrisierbare Suchmaschine für intertextuelle Referenzen / "The Vectorian": A Parametrizable Search Engine for Intertextual References
Liebl, Bernhard, Burghardt, Manuel 20 June 2024 (has links)
No description available.
|
45 |
Étude sur les représentations continues de mots appliquées à la détection automatique des erreurs de reconnaissance de la parole / A study of continuous word representations applied to the automatic detection of speech recognition errors
Ghannay, Sahar 20 September 2017 (links)
This thesis presents a study of continuous word representations (word embeddings) applied to the automatic detection of errors in speech transcriptions. Our study focuses on the use of a neural approach to improve ASR error detection by exploiting word embeddings. This exploitation is motivated by the fact that ASR error detection consists in locating possible linguistic or acoustic incongruities in automatic transcriptions. The aim is therefore to find the word representation that captures the information relevant for detecting these anomalies. Our contribution concerns several axes. First, we begin with a preliminary study in which we propose a neural architecture able to integrate different types of features, including word embeddings. Then, we focus on an in-depth study of continuous word representations: on the one hand, the evaluation of different types of linguistic word embeddings and of their combinations, in order to take advantage of their complementarities; on the other hand, acoustic word embeddings. Next, we present an analysis of classification errors, aimed at identifying the errors that are difficult to detect; perspectives for improving the performance of our system by modeling errors at the sentence level are also proposed. Finally, we exploit the linguistic and acoustic embeddings, as well as the information provided by our ASR error detection system, in several downstream applications.
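A rough sketch of the kind of architecture described, a per-word classifier over heterogeneous features; the feature dimensions are illustrative assumptions, not the thesis's actual model:

```python
import torch
import torch.nn as nn

# Sketch: per-word error detector combining a word embedding with
# ASR-derived features (e.g. confidence score, duration). Dimensions
# are illustrative assumptions, not the thesis's architecture.
class ErrorDetector(nn.Module):
    def __init__(self, emb_dim=300, asr_feats=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + asr_feats, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # correct vs. erroneous
        )

    def forward(self, word_emb, asr_features):
        return self.net(torch.cat([word_emb, asr_features], dim=-1))

model = ErrorDetector()
scores = model(torch.randn(8, 300), torch.randn(8, 4))  # batch of 8 words
```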
|
46 |
Clustering of Distributed Word Representations and its Applicability for Enterprise Search
Korger, Christina 04 October 2016 (has links) (PDF)
Machine learning of distributed word representations with neural embeddings is a state-of-the-art approach to modelling semantic relationships hidden in natural language. The thesis "Clustering of Distributed Word Representations and its Applicability for Enterprise Search" covers different aspects of how such a model can be applied to knowledge management in enterprises. A review of distributed word representations and related language modelling techniques, combined with an overview of applicable clustering algorithms, constitutes the basis for practical studies. The latter have two goals: firstly, they examine the quality of German embedding models trained with gensim under a selection of parameter configurations. Secondly, clusterings conducted on the resulting word representations are evaluated against the objective of retrieving immediate semantic relations for a given term. The application of the final results to company-wide knowledge management is subsequently outlined by the example of the platform intergator and conceptual extensions.
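A minimal sketch of the train-then-cluster workflow, assuming a toy corpus and placeholder parameters rather than the configurations studied in the thesis:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Sketch: train embeddings with gensim and cluster them, so that a cluster
# can be looked up to retrieve terms semantically related to a query term.
sentences = [["suchmaschine", "index", "dokument"],
             ["urlaub", "antrag", "formular"]]
model = Word2Vec(sentences, vector_size=100, min_count=1, epochs=50)

words = list(model.wv.index_to_key)
vectors = np.stack([model.wv[w] for w in words])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)

def related_terms(query):
    """Return the words sharing the query's cluster."""
    cluster = labels[words.index(query)]
    return [w for w, c in zip(words, labels) if c == cluster and w != query]
```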
|
47 |
Duplicate Detection and Text Classification on Simplified Technical English / Dublettdetektion och textklassificering på Förenklad Teknisk Engelska
Lund, Max January 2019 (has links)
This thesis investigates the most effective way of classifying texts by label and clustering duplicate texts in technical documentation written in Simplified Technical English. Pre-trained language models from transformers (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs on the classification task. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that using tf-idf vectors with a low distance threshold in DBSCAN is preferable for duplicate detection.
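The preferred duplicate-detection setup can be sketched in a few lines of scikit-learn; the eps threshold is a placeholder to be tuned, not the thesis's reported value:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Sketch: group near-duplicate texts via tf-idf vectors and DBSCAN with
# cosine distance; texts closer than eps end up in the same cluster.
texts = [
    "Remove the bolt and the washer.",
    "Remove the bolt and washer.",
    "Install the new filter.",
]
X = TfidfVectorizer().fit_transform(texts)
labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(X)
print(labels)  # texts sharing a non-negative label are duplicate candidates
```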
|
48 |
Um analisador sintático neural multilíngue baseado em transições / A Multilingual Transition-Based Neural Parser
Costa, Pablo Botton da 24 January 2017 (links)
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
A dependency parser induces a model capable of extracting the correct dependency tree from an input natural-language sentence. Multilingual techniques are increasingly used in Natural Language Processing (NLP) (BROWN et al., 1995; COHEN; DAS; SMITH, 2011), especially in the dependency parsing task. Intuitively, a multilingual parser can be seen as a vector of different parsers, each trained individually on one language. However, this approach is very expensive in processing time and resources. As an alternative, many parsing techniques have been developed to address this problem (MCDONALD; PETROV; HALL, 2011; TACKSTROM; MCDONALD; USZKOREIT, 2012; TITOV; HENDERSON, 2007), but all of them depend on word alignment (TACKSTROM; MCDONALD; USZKOREIT, 2012) or word clustering, which increases complexity, since it is difficult to induce alignments between words and syntactic resources (TSARFATY et al., 2013; BOHNET et al., 2013a). A simple solution proposed recently (NIVRE et al., 2016a) uses a universally annotated corpus to reduce the complexity associated with building a multilingual parser. In this context, this work presents a universal model for dependency parsing: the NNParser. Our model is a modification of Chen and Manning (2014) with a greedier and more accurate model for capturing distributional representations (MIKOLOV et al., 2011). The NNParser reached 93.08% UAS on the English Penn Treebank (WSJ) and better results than the state-of-the-art Stack LSTM parser for Portuguese (87.93% vs. 86.2% LAS) and Spanish (86.95% vs. 85.7% LAS) on the Universal Dependencies corpus (UD 1.2).
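The transition-based approach behind Chen and Manning (2014), and hence the NNParser, can be sketched as an arc-standard system; the scorer below is a hypothetical stand-in for the neural network:

```python
# Sketch of arc-standard transitions as used in Chen & Manning (2014)-style
# parsers. `score_transitions` stands in for the neural network, which in the
# real parser scores SHIFT/LEFT-ARC/RIGHT-ARC from embedding features of the
# stack and buffer; here it is a hypothetical placeholder.
def parse(words, score_transitions):
    stack, buffer, arcs = [0], list(range(1, len(words) + 1)), []
    while buffer or len(stack) > 1:
        action = score_transitions(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and len(stack) > 1:
            dep = stack.pop(-2)            # second-from-top depends on top
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC" and len(stack) > 1:
            dep = stack.pop()              # top depends on the new top
            arcs.append((stack[-1], dep))
        else:
            break                          # no valid action: stop
    return arcs  # (head, dependent) pairs; 0 is the artificial ROOT
```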
|
49 |
Rozpoznávání pojmenovaných entit pomocí neuronových sítí / Neural Network Based Named Entity Recognition
Straková, Jana January 2017 (links)
Title: Neural Network Based Named Entity Recognition Author: Jana Straková Institute: Institute of Formal and Applied Linguistics Supervisor of the doctoral thesis: prof. RNDr. Jan Hajič, Dr., Institute of Formal and Applied Linguistics Abstract: Czech named entity recognition (the task of automatically identifying and classifying proper names in text, such as names of people, locations and organizations) has become a well-established field since the publication of the Czech Named Entity Corpus (CNEC). This doctoral thesis presents the author's research in named entity recognition, mainly for the Czech language. It describes work and research carried out during the publication of CNEC and its evaluation. It further covers the author's research results, which have improved the Czech state of the art in named entity recognition in recent years, with special focus on solutions based on artificial neural networks. Starting with a simple feed-forward neural network with a softmax output layer and a standard set of classification features for the task, the thesis presents the methodology and results that were later used in NameTag, an open-source software solution for named entity recognition. The thesis concludes with a recurrent neural network based recognizer with word embeddings and character-level word embeddings, ...
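The character-level word embeddings mentioned above can be sketched as a small recurrent encoder over a word's characters; sizes are illustrative, not those of the thesis:

```python
import torch
import torch.nn as nn

# Sketch: each word's characters run through a small recurrent encoder whose
# final hidden state becomes a word embedding, later concatenated with an
# ordinary word embedding. Sizes are illustrative assumptions.
class CharWordEmbedder(nn.Module):
    def __init__(self, n_chars=256, char_dim=32, out_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.rnn = nn.GRU(char_dim, out_dim, batch_first=True)

    def forward(self, char_ids):          # char_ids: (n_words, max_word_len)
        _, h = self.rnn(self.char_emb(char_ids))
        return h.squeeze(0)               # (n_words, out_dim)

embedder = CharWordEmbedder()
word_vecs = embedder(torch.randint(0, 256, (5, 12)))  # 5 words, 12 chars each
```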
|
50 |
Word Representations and Machine Learning Models for Implicit Sense Classification in Shallow Discourse Parsing
Callin, Jimmy January 2017 (links)
CoNLL 2015 featured a shared task on shallow discourse parsing. In 2016, the efforts continued with an increasing focus on sense classification. In the case of implicit sense classification, there was an interesting mix of traditional and modern machine learning classifiers using word representation models. In this thesis, we explore the performance of a number of these models, and investigate how they perform using a variety of word representation models. We show that there are large performance differences between word representation models for certain machine learning classifiers, while others are more robust to the choice of word representation model. We also show that with the right choice of word representation model, simple and traditional machine learning classifiers can reach competitive scores even when compared with modern neural network approaches.
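One simple instance of "traditional classifier plus word representation model" for this task: represent each discourse argument by its average word vector and concatenate the two arguments. The lookup object is a hypothetical embedding table (e.g. a gensim KeyedVectors instance):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: featurize an implicit discourse relation as the concatenated
# average word vectors of its two arguments. `lookup` is a hypothetical
# embedding lookup supporting `in` and `[]` (e.g. gensim KeyedVectors).
def featurize(arg1_tokens, arg2_tokens, lookup, dim=300):
    def avg(tokens):
        vecs = [lookup[t] for t in tokens if t in lookup]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    return np.concatenate([avg(arg1_tokens), avg(arg2_tokens)])

# X = np.stack([featurize(a1, a2, lookup) for a1, a2 in pairs])
# clf = LogisticRegression(max_iter=1000).fit(X, senses)
```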
|