1 |
Exploração de informações contextuais para enriquecimento semântico em representações de textos / Exploration of contextual information for semantic enrichment in text representations. Ribeiro, João Vítor Antunes, 14 November 2018.
Due to the growing number of documents available in digital format, the importance of computational analysis of large volumes of data has become even more evident. Although most of these documents are available as natural-language text, analyzing them through processes such as text mining remains a challenge. Traditional text representation approaches such as the Bag of Words typically disregard semantic and contextual aspects of the text collections being analyzed, ignoring information that could improve the performance of the tasks carried out on them. The main problems associated with these approaches are high sparsity and high dimensionality, which considerably impair task performance. Since enriching text representations is one effective way to mitigate these problems, this dissertation investigates the joint application of semantic and contextual enrichment. To that end, a new text representation technique is proposed, whose main novelty is the way attribute (context) frequencies are computed, based on the similarities between attributes. The attributes extracted by the proposed technique are considered dependent, since they are formed by sets of correlated terms that may share similar information. The effectiveness of the technique was evaluated on the automatic text classification task, exploring different textual enrichment procedures and different versions of word-embedding language models. The results provide favorable evidence for the effectiveness and applicability of the proposed text representation technique. According to statistical significance tests, textual enrichment based on Named Entity Recognition and Word Sense Disambiguation can effectively improve the performance of automatic text classification, especially in approaches that also incorporate texts from external knowledge sources such as Wikipedia. It was empirically verified that the proposed technique can outperform traditional approaches in application scenarios based on the semantic information of the text collections, characterizing it as a promising alternative for generating text representations with a high density of semantic and contextual information that also stand out for their interpretability.
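As a concrete illustration of the sparsity and dimensionality problem this abstract attributes to traditional representations, the sketch below builds a plain Bag-of-Words matrix with scikit-learn on an invented toy corpus. It shows the baseline being argued against, not the proposed enrichment technique; the corpus and variable names are assumptions made for the example.

```python
# Minimal Bag-of-Words baseline of the kind the abstract contrasts against.
# The toy corpus below is invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "semantic enrichment of text representations",
    "named entity recognition and word sense disambiguation",
    "bag of words representations are sparse and high dimensional",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # sparse document-term matrix

# Most entries are zero, which illustrates the sparsity problem discussed above.
n_cells = bow.shape[0] * bow.shape[1]
sparsity = 1.0 - bow.nnz / n_cells
print(f"vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"matrix shape: {bow.shape}, sparsity: {sparsity:.2f}")
```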
|
2 |
Clustering and Summarization of Chat Dialogues: To understand a company’s customer base / Klustring och Summering av Chatt-Dialoger. Hidén, Oskar; Björelind, David, January 2021.
The Customer Success department at Visma handles about 200 000 customer chats each year; the chat dialogues are stored and contain both questions and answers. In order to get an idea of what customers ask about, the Customer Success department has to read a random sample of the chat dialogues manually. This thesis develops and investigates an analysis tool for the chat data based on clustering and summarization, aiming to decrease the time spent on the analysis and to increase its quality. Models for clustering (K-means, DBSCAN and HDBSCAN) and extractive summarization (K-means, LSA and TextRank) are compared. Each algorithm is combined with three different text representations (TFIDF, S-BERT and FastText) to create the models under evaluation. These models are evaluated against a test set created for the purpose of this thesis. Silhouette Index and Adjusted Rand Index are used to evaluate the clustering models, while the ROUGE measure, together with a qualitative evaluation, is used to evaluate the extractive summarization models. In addition, the best clustering model is further evaluated to understand how different data sizes impact performance. TFIDF unigrams together with HDBSCAN or K-means obtained the best results for clustering, whereas FastText together with TextRank obtained the best results for extractive summarization. This thesis applies known models to the textual domain of customer chat dialogues, something that, to our knowledge, has not previously been done in the literature.
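The following is a rough sketch of one of the configurations compared in this thesis (TF-IDF unigrams with K-means, scored with Silhouette Index and Adjusted Rand Index), written with scikit-learn. The chat texts, reference labels and the simple centroid-based "summary" are placeholders invented for the example, not the thesis's data or its exact summarization models.

```python
# Sketch of a TF-IDF + K-means clustering run, evaluated with Silhouette Index
# and Adjusted Rand Index against hypothetical reference labels.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

chats = [
    "how do I reset my password",
    "password reset link is not working",
    "invoice total looks wrong this month",
    "how can I correct an invoice",
]
reference_labels = [0, 0, 1, 1]  # hypothetical manual annotations

tfidf = TfidfVectorizer(ngram_range=(1, 1))  # unigram TF-IDF representation
X = tfidf.fit_transform(chats)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("silhouette:", silhouette_score(X, kmeans.labels_))
print("adjusted rand:", adjusted_rand_score(reference_labels, kmeans.labels_))

# Simplified stand-in for extractive summarization: report the chat closest to
# each cluster centroid as that cluster's representative utterance.
for c in range(2):
    members = np.where(kmeans.labels_ == c)[0]
    dists = kmeans.transform(X[members])[:, c]
    print(f"cluster {c} representative:", chats[members[np.argmin(dists)]])
```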
|
3 |
Evaluation of the performance of machine learning techniques for email classification / Utvärdering av prestationen av maskininlärningstekniker för e-post klassificering. Tapper, Isabella, January 2022.
Manual categorization of a mail inbox can often become time-consuming, and many attempts have therefore been made to use machine learning for this task. One essential Natural Language Processing (NLP) task is text classification, which is a big challenge since an NLP engine is not a native speaker of any human language and often fails at understanding sarcasm and underlying intent. One of the NLP challenges is how to represent text. Text embeddings can be learned, or they can be generated from a pre-trained model. The pre-trained Sentence Bidirectional Encoder Representations from Transformers (SBERT) model, which builds on Google’s BERT, is state of the art for generating vector representations of longer texts. In this project, different methods of classifying and clustering emails were studied. The performances of three supervised classification models were compared: a Support Vector Machine (SVM) and a Neural Network (NN) were trained on SBERT embeddings, while the third model, a Recurrent Neural Network (RNN), was trained on raw data. The motivation for this experiment was to see whether SBERT embeddings are a good choice of text representation when combined with simpler classification models in an email classification task. The results show that the SVM and the NN perform better than the RNN on the email classification task. Since most real data is unlabeled, this thesis also evaluated how well unsupervised methods perform at email clustering, using SBERT embeddings as text representations and the available labels for evaluation. Three unsupervised clustering models are reviewed: K-Means (KM), Spectral Clustering (SC), and Hierarchical Agglomerative Clustering (HAC). The results show that the unsupervised models all achieved similar performance in terms of precision, recall and F1-score, evaluated against the available labeled dataset. In conclusion, this thesis gives evidence that in an email classification task, supervised models do better when trained on pre-trained SBERT embeddings than when trained on raw data. It also shows that the output of the clustering methods was on a par with that of the selected supervised learning techniques.
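Below is a minimal sketch of the SBERT-plus-SVM pipeline described above, using the sentence-transformers and scikit-learn libraries. The checkpoint name, example emails and labels are assumptions made for illustration, not the thesis's actual dataset or configuration.

```python
# Sketch: encode emails with a pre-trained SBERT model, train an SVM on the
# embeddings, and report precision/recall/F1 as in the thesis's comparisons.
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC
from sklearn.metrics import classification_report

emails = [
    "Please find the attached invoice for March.",
    "My account is locked, can you help me log in?",
    "When is the payment for order 1234 due?",
    "I cannot access the portal after the update.",
]
labels = ["billing", "support", "billing", "support"]  # hypothetical categories

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT checkpoint
X = encoder.encode(emails)  # fixed-size sentence embeddings

clf = SVC(kernel="linear").fit(X, labels)
pred = clf.predict(X)

print(classification_report(labels, pred))
```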
|
4 |
Better representation learning for TPMS. Raza, Amir, 10 1900.
With the increase in popularity of AI and machine learning, participation numbers at AI/ML conferences have exploded. The large number of submitted papers and the evolving nature of topics pose additional challenges for the peer-review systems that are crucial to our scientific communities. Some conferences have moved towards automating reviewer assignment for submissions, TPMS [1] being one such existing system. Currently, TPMS builds content-based profiles of researchers and submitted papers to model the suitability of reviewer-submission pairs. In this work, we explore different approaches to self-supervised fine-tuning of BERT transformers on conference paper data. We demonstrate some new approaches to augmentation views for self-supervision in natural language processing, which until now has been more common in computer vision. We then use these individual paper representations to build an expertise model that learns to combine the representations of a reviewer's different published works and predict their relevance for reviewing a submitted paper. Finally, we show that better individual paper representations and better expertise modeling lead to better performance on the reviewer suitability prediction task.
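To make the idea of an expertise model concrete, the sketch below combines a reviewer's per-paper embeddings into a single profile by mean pooling and scores a submission by cosine similarity. This is an illustrative baseline under assumed inputs (random vectors standing in for paper embeddings), not the learned expertise model of the thesis nor the actual TPMS implementation.

```python
# Illustrative baseline: mean-pool a reviewer's paper embeddings into a profile
# and score a submission by cosine similarity.
import numpy as np

def reviewer_profile(paper_embeddings: np.ndarray) -> np.ndarray:
    """Combine per-paper vectors (n_papers x dim) into a single profile vector."""
    return paper_embeddings.mean(axis=0)

def suitability(profile: np.ndarray, submission: np.ndarray) -> float:
    """Cosine similarity between a reviewer profile and a submission embedding."""
    num = float(profile @ submission)
    den = float(np.linalg.norm(profile) * np.linalg.norm(submission) + 1e-12)
    return num / den

rng = np.random.default_rng(0)
reviewer_papers = rng.normal(size=(5, 768))  # e.g. 5 papers, BERT-sized vectors
submission = rng.normal(size=768)

score = suitability(reviewer_profile(reviewer_papers), submission)
print(f"reviewer-submission suitability: {score:.3f}")
```

In the thesis, the combination step is learned rather than fixed to a simple mean, which is where the abstract attributes part of the improvement on the reviewer suitability prediction task.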
|