51 |
Classification of Transcribed Voice Recordings : Determining the Claim Type of Recordings Submitted by Swedish Insurance Clients / Klassificering av Transkriberade Röstinspelningar
Piehl, Carl January 2021
In this thesis, we investigate the problem of building a text classifier for transcribed voice recordings submitted by insurance clients. We compare different models on two tasks. The first is a binary classification problem, where the models must determine whether a transcript belongs to a particular type or not. The second is a multiclass problem, where the models have to choose between several types when labelling transcripts, resulting in a data set with a highly imbalanced class distribution. We evaluate four different models: pretrained BERT and three LSTMs with different word embeddings. The LSTMs use ELMo embeddings, word2vec embeddings and, as a baseline, a randomly initialized embedding layer. In the binary task, we are more concerned with false positives than false negatives. Thus, we also use weighted cross entropy loss to achieve high precision for the positive class at the cost of recall. In the multiclass task, we use focal loss and weighted cross entropy loss to reduce bias toward majority classes. We find that BERT outperforms the other models and that the baseline model is worst across both tasks. The difference in performance is greatest in the multiclass task on classes with fewer samples. This demonstrates the benefit of using large pretrained language models in data-constrained scenarios. In the binary task, we find that weighted cross entropy loss provides a simple yet effective way of conditioning the model to favor certain types of errors. In the multiclass task, both focal loss and weighted cross entropy loss are shown to reduce bias toward majority classes. However, we also find that BERT fine-tuned with regular cross entropy loss does not show bias toward majority classes, having high recall across all classes. / I examensarbetet undersöks klassificering av transkriberade röstinspelningar från försäkringskunder. Flera modeller jämförs på två uppgifter. 
Den första är binär klassificering, där modellerna ska särskilja på inspelningar som tillhör en specifik klass av ärende från resterande inspelningar. I det andra inkluderas flera olika klasser som modellerna ska välja mellan när inspelningar klassificeras, vilket leder till en ojämn klassfördelning. Fyra modeller jämförs: förtränad BERT och tre LSTM-nätverk med olika varianter av förtränade inbäddningar. De inbäddningar som används är ELMo, word2vec och en basmodell som har inbäddningar som inte förtränats. I det binära klassificeringsproblemet ligger fokus på att minimera antalet falskt positiva klassificeringar, därför används viktad korsentropi. Utöver detta används även fokal förlustfunktion när flera klasser inkluderas, för att minska partiskhet mot majoritetsklasser. Resultaten indikerar att BERT är en starkare modell än de andra modellerna i båda uppgifterna. Skillnaden mellan modellerna är tydligast när flera klasser används, speciellt på de klasser som är underrepresenterade. Detta visar på fördelen av att använda stora, förtränade, modeller när mängden data är begränsad. I det binära klassificeringsproblemet ser vi även att en viktad förlustfunktion ger ett enkelt men effektivt sätt att reglera vilken typ av fel modellen ska vara partisk mot. När flera klasser inkluderas ser vi att viktad korsentropi, samt fokal förlustfunktion, kan bidra till att minska partiskhet mot överrepresenterade klasser. Detta var dock inte fallet för BERT, som visade bra resultat på minoritetsklasser även utan att modifiera förlustfunktionen.
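The two loss functions named in the abstract above can be sketched for a single example as follows. This is a minimal illustration in plain Python, not the thesis code: the function names, the class weights and the focusing parameter `gamma` are assumptions chosen for the example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(logits, target, class_weights):
    """Cross entropy for one example, scaled by the weight of its true class."""
    p = softmax(logits)[target]
    return -class_weights[target] * math.log(p)

def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: the factor (1 - p)^gamma shrinks the loss
    of examples the model already classifies confidently."""
    p = softmax(logits)[target]
    return -((1.0 - p) ** gamma) * math.log(p)

# Hypothetical binary setup: class 0 = negative, class 1 = positive.
# Weighting the negative class more makes false positives costlier.
weights = [2.0, 1.0]
loss_negative_example = weighted_cross_entropy([0.0, 0.0], 0, weights)
loss_positive_example = weighted_cross_entropy([0.0, 0.0], 1, weights)
```

With equal scores for both classes, the negative example above costs twice as much as the positive one, which is one plausible way to trade recall for precision on the positive class. In the multiclass case the same weighting is usually applied to rare classes instead.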
|
52 |
Optimering av en chattbot för det svenska språket / Optimization of a Chatbot for the Swedish Language
Mutaliev, Mohammed; Almimar, Ibrahim January 2021
Chattbotutvecklare på Softronic använder i dagsläget Rasa-ramverket och dess standardkomponenter för bearbetning av användarinmatning. Det här är problematiskt då standardkomponenterna inte är optimerade för det svenska språket. Till följd av detta efterfrågades en utvärdering av samtliga Rasa-komponenter med syfte att identifiera de mest gynnsamma komponenterna för att maximera klassificeringsträffsäkerhet. I detta examensarbete framtogs och jämfördes flera Rasa-pipelines med olika komponenter för tokenisering, känneteckensextrahering och klassificering. Resultaten av komponenterna för tokenisering visade att Rasas WhitespaceTokenizer överträffade både SpacyTokenizer och StanzaTokenizer. För känneteckensextrahering var CountVectorsFeaturizer, LanguageModelFeaturizer (med LaBSE-modellen) och FastTextFeaturizer (med den officiella fastText-modellen tränad på svenska Wikipedia) de mest optimala komponenterna. Den klassificerare som i allmänhet presterade bäst var DIETClassifier, men det fanns flera tillfällen där SklearnIntentClassifier överträffade den. Detta arbete resulterade i flera pipelines som överträffade Rasas standard-pipeline. Av dessa pipelines var det två som presterade bäst. Den första pipeline implementerade komponenterna WhitespaceTokenizer, CountVectorsFeaturizer, FastTextFeaturizer (med den officiella fastText-modellen tränad på svenska Wikipedia) och DIETClassifier med en klassificeringsträffsäkerhet på 91% (F1-score). Den andra pipeline implementerade komponenterna WhitespaceTokenizer, LanguageModelFeaturizer (med LaBSE-modellen) och SklearnIntentClassifier med en klassificeringsträffsäkerhet på 91,5% (F1-score). / Chatbot developers at Softronic currently use the Rasa framework and its default components for processing user input. This is problematic as the default components are not optimized for the Swedish language. 
Consequently, an evaluation of all Rasa components was requested in order to identify the components that maximize classification accuracy. In this thesis, several Rasa pipelines with different components for tokenization, feature extraction and classification were developed and compared. Among the tokenizers, Rasa's WhitespaceTokenizer outperformed both SpacyTokenizer and StanzaTokenizer. For feature extraction, CountVectorsFeaturizer, LanguageModelFeaturizer (with the LaBSE model) and FastTextFeaturizer (with the official fastText model trained on Swedish Wikipedia) performed best. The classifier that generally performed best was DIETClassifier, although SklearnIntentClassifier surpassed it on several occasions. This work resulted in several pipelines that outperformed Rasa’s standard pipeline, two of which performed best. The first combined WhitespaceTokenizer, CountVectorsFeaturizer, FastTextFeaturizer (with the official fastText model trained on Swedish Wikipedia) and DIETClassifier, with a classification performance of 91% (F1 score). The second combined WhitespaceTokenizer, LanguageModelFeaturizer (with the LaBSE model) and SklearnIntentClassifier, with a classification performance of 91.5% (F1 score).
|
53 |
„The Vectorian“ – Eine parametrisierbare Suchmaschine für intertextuelle Referenzen
Liebl, Bernhard; Burghardt, Manuel 20 June 2024
No description available.
|
54 |
Clustering of Distributed Word Representations and its Applicability for Enterprise Search
Korger, Christina 04 October 2016
Machine learning of distributed word representations with neural embeddings is a state-of-the-art approach to modelling semantic relationships hidden in natural language. The thesis “Clustering of Distributed Word Representations and its Applicability for Enterprise Search” covers different aspects of how such a model can be applied to knowledge management in enterprises. A review of distributed word representations and related language modelling techniques, combined with an overview of applicable clustering algorithms, constitutes the basis for practical studies. The latter have two goals: firstly, they examine the quality of German embedding models trained with gensim and a selected choice of parameter configurations. Secondly, clusterings conducted on the resulting word representations are evaluated against the objective of retrieving immediate semantic relations for a given term. The application of the final results to company-wide knowledge management is subsequently outlined by the example of the platform intergator and conceptual extensions.
|
55 |
Duplicate Detection and Text Classification on Simplified Technical English / Dublettdetektion och textklassificering på Förenklad Teknisk Engelska
Lund, Max January 2019
This thesis investigates the most effective way of performing classification of text labels and clustering of duplicate texts in technical documentation written in Simplified Technical English. Pre-trained language models from transformers (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs on the classification task. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that using tf-idf vectors with a low distance threshold in DBSCAN is preferable for duplicate detection.
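The duplicate-detection setup described above (tf-idf vectors clustered by DBSCAN with a low distance threshold) can be sketched in plain Python. This is an illustration under stated assumptions, not the thesis implementation: the smoothed idf formula, the cosine distance, the threshold `eps=0.2`, `min_samples=2` and the sample sentences are all chosen for the example.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one L2-normalised tf-idf vector (dict: term -> weight) per text."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    vectors = []
    for doc in docs:
        # Smoothed idf; one common variant, assumed here.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity; inputs are already unit length."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def dbscan(vectors, eps, min_samples=2):
    """Plain DBSCAN; returns one cluster label per vector, -1 meaning noise."""
    n = len(vectors)
    neighbours = [[j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels

texts = [
    "remove the four bolts from the access panel",
    "remove the four bolts from the access panel",
    "remove the four bolts from the panel",
    "check the oil level in the gearbox",
]
labels = dbscan(tfidf_vectors(texts), eps=0.2)
```

On these four hypothetical sentences the first three end up in one cluster and the last is labelled noise. A low `eps` keeps only near-identical texts together, which matches the preference for a low distance threshold reported in the abstract.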
|
56 |
Um analisador sintático neural multilíngue baseado em transições
Costa, Pablo Botton da 24 January 2017
Previous issue date: 2017-01-24 / Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) / Dependency parsing consists of inducing a model capable of extracting the correct
dependency tree from an input natural-language sentence. Multilingual techniques are
increasingly used in Natural Language Processing (NLP) (BROWN
et al., 1995; COHEN; DAS; SMITH, 2011), especially in the dependency parsing task.
Intuitively, a multilingual parser can be seen as a vector of different parsers, each
of which is trained individually on one language. However, this approach is costly
in terms of processing time and resources. As an alternative, many parsing
techniques have been developed to solve this problem (MCDONALD; PETROV;
HALL, 2011; TACKSTROM; MCDONALD; USZKOREIT, 2012; TITOV; HENDERSON,
2007), but all of them depend on word alignment (TACKSTROM; MCDONALD;
USZKOREIT, 2012) or word clustering, which increases the complexity, since it is difficult
to induce alignments between words and syntactic resources (TSARFATY et al., 2013;
BOHNET et al., 2013a). A simple solution proposed recently (NIVRE et al., 2016a)
uses a universally annotated corpus to reduce the complexity associated with the
construction of a multilingual parser. In this context, this work presents a universal
model for dependency parsing: the NNParser. Our model is a modification of Chen and
Manning (2014) with a greedier and more accurate model for capturing distributional representations
(MIKOLOV et al., 2011). The NNParser reached 93.08% UAS on the English
Penn Treebank (WSJ) and outperformed the state-of-the-art Stack LSTM parser for
Portuguese (87.93% vs. 86.2% LAS) and Spanish (86.95% vs. 85.7% LAS) on the Universal
Dependencies corpus (UD 1.2). / Um analisador sintático de dependência consiste em um modelo capaz de extrair a estrutura
de dependência de uma sentença em língua natural. No Processamento de Linguagem
Natural (PLN), os métodos multilíngues têm sido cada vez mais utilizados (BROWN et
al., 1995; COHEN; DAS; SMITH, 2011), inclusive na tarefa de análise de dependência.
Intuitivamente, um analisador sintático multilíngue pode ser visto como um vetor de analisadores
sintáticos treinados individualmente em cada língua. Contudo, a tarefa realizada
com base neste vetor torna-se inviável devido a sua alta demanda por recursos. Como
alternativa, diversos métodos de análise sintática foram propostos (MCDONALD; PETROV;
HALL, 2011; TACKSTROM; MCDONALD; USZKOREIT, 2012; TITOV; HENDERSON,
2007), mas todos dependentes de alinhamento entre palavras (TACKSTROM;
MCDONALD; USZKOREIT, 2012) ou de técnicas de agrupamento, o que também aumenta
a complexidade associada ao modelo (TSARFATY et al., 2013; BOHNET et al.,
2013a). Uma solução simples surgiu recentemente com a construção de recursos universais
(NIVRE et al., 2016a). Estes recursos universais têm o potencial de diminuir a complexidade
associada à construção de um modelo multilíngue, uma vez que não é necessário
um mapeamento entre as diferentes notações das línguas. Nesta linha, este trabalho apresenta
um modelo para análise sintática universal de dependência: o NNParser. O modelo
em questão é uma modificação da proposta de Chen e Manning (2014) com um modelo
mais guloso e preciso na captura de representações distribuídas (MIKOLOV et al., 2011).
Nos experimentos aqui apresentados o NNParser atingiu 93,08% de UAS para o inglês
no córpus Penn Treebank e resultados melhores do que o estado da arte, o Stack LSTM,
para o português (87,93% × 86,2% LAS) e o espanhol (86,95% × 85,7% LAS) no córpus
UD 1.2.
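The UAS and LAS figures quoted above are attachment scores: the share of tokens that receive the correct head (unlabeled) and the correct head together with the correct dependency label (labeled). A minimal sketch of the computation follows; the token representation as `(head_index, label)` pairs and the toy sentence are assumptions for illustration, and real evaluations often apply extra conventions such as excluding punctuation.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent.

    `gold` and `predicted` are equally long lists of (head_index, label)
    pairs, one pair per token, with head_index 0 standing for the root.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))  # head only
    full_hits = sum(g == p for g, p in zip(gold, predicted))        # head and label
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n

# Hypothetical four-token sentence: every head is right, one label is wrong.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

Here all four heads match but only three labels do, so UAS is 100.0 and LAS is 75.0. LAS can never exceed UAS, because a labeled hit requires a correct head.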
|
57 |
Word Representations and Machine Learning Models for Implicit Sense Classification in Shallow Discourse Parsing
Callin, Jimmy January 2017
CoNLL 2015 featured a shared task on shallow discourse parsing. In 2016, the efforts continued with an increasing focus on sense classification. In the case of implicit sense classification, there was an interesting mix of traditional and modern machine learning classifiers using word representation models. In this thesis, we explore the performance of a number of these models, and investigate how they perform using a variety of word representation models. We show that there are large performance differences between word representation models for certain machine learning classifiers, while others are more robust to the choice of word representation model. We also show that with the right choice of word representation model, simple and traditional machine learning classifiers can reach competitive scores even when compared with modern neural network approaches.
|
58 |
Word2vec modely s přidanou kontextovou informací / Word2vec Models with Added Context Information
Šůstek, Martin January 2017
This thesis explains the word2vec models. Although word2vec was introduced only recently (2013), many researchers have already tried to extend, understand or at least use the model because it provides surprisingly rich semantic information. This information is encoded in N-dimensional vector representations and can be recovered by performing algebraic operations on the vectors. In addition, I propose modifications to the model in order to obtain different word representations; to that end, I use public image datasets. The thesis also includes parts dedicated to a word2vec extension based on convolutional neural networks.
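The idea that semantic information can be read off word vectors through algebraic operations is usually illustrated with analogies. The sketch below uses hand-made three-dimensional vectors, which are purely hypothetical and far smaller than real word2vec embeddings; the function names are likewise assumptions for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' by finding the word whose vector is
    closest to vec(b) - vec(a) + vec(c), excluding the three query words."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors; the dimensions loosely stand for (royalty, male, female).
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}
answer = analogy(toy, "man", "king", "woman")
```

With these toy values the result is "queen". On trained embeddings the same arithmetic is applied to vectors with hundreds of dimensions, and the quality of the answers depends on the corpus and training parameters.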
|
59 |
Clustering of Distributed Word Representations and its Applicability for Enterprise Search
Korger, Christina 18 August 2016
Machine learning of distributed word representations with neural embeddings is a state-of-the-art approach to modelling semantic relationships hidden in natural language. The thesis “Clustering of Distributed Word Representations and its Applicability for Enterprise Search” covers different aspects of how such a model can be applied to knowledge management in enterprises. A review of distributed word representations and related language modelling techniques, combined with an overview of applicable clustering algorithms, constitutes the basis for practical studies. The latter have two goals: firstly, they examine the quality of German embedding models trained with gensim and a selected choice of parameter configurations. Secondly, clusterings conducted on the resulting word representations are evaluated against the objective of retrieving immediate semantic relations for a given term. The application of the final results to company-wide knowledge management is subsequently outlined by the example of the platform intergator and conceptual extensions.
1 Introduction
1.1 Motivation
1.2 Thesis Structure
2 Related Work
3 Distributed Word Representations
3.1 History
3.2 Parallels to Biological Neurons
3.3 Feedforward and Recurrent Neural Networks
3.4 Learning Representations via Backpropagation and Stochastic Gradient Descent
3.5 Word2Vec
3.5.1 Neural Network Architectures and Update Frequency
3.5.2 Hierarchical Softmax
3.5.3 Negative Sampling
3.5.4 Parallelisation
3.5.5 Exploration of Linguistic Regularities
4 Clustering Techniques
4.1 Categorisation
4.2 The Curse of Dimensionality
5 Training and Evaluation of Neural Embedding Models
5.1 Technical Setup
5.2 Model Training
5.2.1 Corpus
5.2.2 Data Segmentation and Ordering
5.2.3 Stopword Removal
5.2.4 Morphological Reduction
5.2.5 Extraction of Multi-Word Concepts
5.2.6 Parameter Selection
5.3 Evaluation Datasets
5.3.1 Measurement Quality Concerns
5.3.2 Semantic Similarities
5.3.3 Regularities Expressed by Analogies
5.3.4 Construction of a Representative Test Set for Evaluation of Paradigmatic Relations
5.3.5 Metrics
5.4 Discussion
6 Evaluation of Semantic Clustering on Word Embeddings
6.1 Qualitative Evaluation
6.2 Discussion
6.3 Summary
7 Conceptual Integration with an Enterprise Search Platform
7.1 The intergator Search Platform
7.2 Deployment Concepts of Distributed Word Representations
7.2.1 Improved Document Retrieval
7.2.2 Improved Query Suggestions
7.2.3 Additional Support in Explorative Search
8 Conclusion
8.1 Summary
8.2 Further Work
Bibliography
List of Figures
List of Tables
Appendix
|
60 |
Cooperative security log analysis using machine learning : Analyzing different approaches to log featurization and classification / Kooperativ säkerhetslogganalys med maskininlärning
Malmfors, Fredrik January 2022
This thesis evaluates the performance of different machine learning approaches to log classification on a dataset derived from simulating intrusive behavior towards an enterprise web application. The first experiment consists of performing attacks against the web app and correlating them with the logs to create a labeled dataset. The second experiment compares one unsupervised model, based on a variational autoencoder, with four supervised models that use either conventional feature-engineering techniques with deep neural networks or embedding-based features followed by long short-term memory architectures and convolutional neural networks. On this dataset, the embedding-based approaches performed much better than the conventional one, and the autoencoder did not perform well compared to the supervised models. To conclude, embedding-based approaches show promise even on datasets whose characteristics differ from those of natural language.
|