21 |
Automatic Poetry Classification Using Natural Language Processing
Kesarwani, Vaibhav January 2018
Poetry, as a special form of literature, is crucial for computational linguistics. It has a high density of emotions, figures of speech, vividness, creativity, and ambiguity. Poetry poses a much greater challenge for the application of Natural Language Processing algorithms than any other literary genre.
Our system establishes a computational model that classifies poems based on similarity features like rhyme, diction, and metaphor.
For rhyme analysis, we investigate the methods used to classify poems based on rhyme patterns. First, an overview of the different types of rhyme is given, along with a detailed description of how rhyme types and sub-types are detected by applying a pronunciation dictionary to our poetry dataset. We achieve an accuracy of 96.51% in identifying rhymes in poetry by applying a phonetic similarity model. Then we derive a rhyme quantification metric, RhymeScore, based on the matching phonetic transcription of each poem. We also develop an application for visualizing this quantified RhymeScore as a scatter plot in 2 or 3 dimensions.
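The dictionary-based rhyme check described above can be illustrated with a minimal sketch. It assumes a tiny hand-written pronunciation dictionary in ARPAbet-style notation; the entries and the helper names `rhyme_part` and `is_rhyme` are hypothetical and are not taken from the thesis. The simplified criterion used here treats two words as rhyming when their phonemes match from the last primary-stressed vowel onward.

```python
# Toy pronunciation dictionary (ARPAbet-style; stress digits on vowels).
# Hypothetical entries for illustration only; a real system would load
# a full pronunciation dictionary instead.
PRONUNCIATIONS = {
    "night": ["N", "AY1", "T"],
    "light": ["L", "AY1", "T"],
    "delight": ["D", "IH0", "L", "AY1", "T"],
    "love": ["L", "AH1", "V"],
    "move": ["M", "UW1", "V"],
}


def rhyme_part(word):
    """Return the phonemes from the last primary-stressed vowel to the end."""
    phones = PRONUNCIATIONS[word.lower()]
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):  # "1" marks primary stress
            return phones[i:]
    return phones  # no stressed vowel marked: fall back to the whole word


def is_rhyme(a, b):
    """Simplified check: different words whose stressed endings sound alike."""
    return a.lower() != b.lower() and rhyme_part(a) == rhyme_part(b)
```

Under this criterion "love" and "move" do not rhyme even though they are spelled alike, which is the kind of case a phonetic comparison separates from a purely orthographic one.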
For diction analysis, we investigate the methods used to classify poems based on diction. First, the linguistic quantitative and semantic features that constitute diction are enumerated. Then we investigate the methodology used to compute these features from our poetry dataset. We also build a word embeddings model on our poetry dataset with 1.5 million words in 100 dimensions and compare it with GloVe embeddings.
Metaphor is a part of diction, but as it is a very complex topic in its own right, we address it as a stand-alone issue and develop several methods for it. Previous work on metaphor detection relies on either rule-based or statistical models, none of them applied to poetry. Our methods focus on metaphor detection in a poetry corpus, but we test on non-poetry data as well. We combine rule-based and statistical models (word embeddings) to develop a new classification system. Our first metaphor detection method achieves a precision of 0.759 and a recall of 0.804 in identifying one type of metaphor in poetry, by using a Support Vector Machine classifier with various types of features. Furthermore, our deep learning model based on a Convolutional Neural Network achieves a precision of 0.831 and a recall of 0.836 for the same task. We also develop an application for generic metaphor detection in any type of natural text.
22 |
Exploring the Compositionality of German Particle Verbs
Rawein, Carina January 2018
In this thesis we explore the compositionality of particle verbs using distributional similarity and pre-trained word embeddings. We investigate the compositionality of 100 pairs of particle verbs and their base verbs. The ranking of our findings is compared to a ranking of human ratings of compositionality. In our distributional approach we use features such as context window size and content words, and we consider only particle verbs with one word sense. We then compare the distributional approach to a ranking obtained with pre-trained word embeddings. While none of the results are statistically significant, it is shown that word embeddings are not automatically superior to the more traditional distributional approach.
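A minimal sketch of how such a ranking comparison could be computed is given here. It makes two assumptions that the abstract does not confirm: that compositionality is scored as the cosine similarity between the vector of a particle verb and the vector of its base verb, and that the system ranking is compared with the human ranking using Spearman's rank correlation. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """Rank values from 1 (largest) to n (smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

In use, `xs` would hold one similarity score per verb pair and `ys` the corresponding mean human rating, so a value near 1 means the system orders the pairs much as the annotators do.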
23 |
Optimizing Deep Neural Networks for Classification of Short Texts
Pettersson, Fredrik January 2019
This master's thesis investigates how a state-of-the-art (SOTA) deep neural network (NN) model can be created for a specific natural language processing (NLP) dataset, the effects of using different dimensionality reduction techniques on common pre-trained word embeddings, and how well this model generalizes to a secondary dataset. The research is motivated by two factors. One is that the construction of a machine learning (ML) text classification (TC) model is typically done around a specific dataset and often requires a lot of manual intervention. It is therefore hard to know exactly what procedures to implement for a specific dataset and how the result will be affected. The other is that, if the dimensionality of pre-trained embedding vectors can be lowered without losing accuracy, thus saving execution time, other techniques can be used during the time saved to achieve even higher accuracy. A handful of deep neural network architectures are used, namely a convolutional neural network (CNN), a long short-term memory neural network (LSTM) and a bidirectional LSTM (Bi-LSTM) architecture. These deep neural network architectures are combined with four different word embeddings: GoogleNews-vectors-negative300, glove.840B.300d, paragram_300_sl999 and wiki-news-300d-1M. Three main experiments are conducted in this thesis. In the first experiment, a top-performing TC model is created for a recent NLP competition held at Kaggle.com. Each implemented procedure is benchmarked on how the accuracy and execution time of the model are affected. In the second experiment, principal component analysis (PCA) and random projection (RP) are applied to the pre-trained word embeddings used in the top-performing model to investigate how the accuracy and execution time are affected when creating lower-dimensional embedding vectors.
In the third experiment, the same model is benchmarked on a separate dataset (Sentiment140) to investigate how well it generalizes to other data and how each implemented procedure affects the accuracy compared with the original dataset. The first experiment results in a bidirectional LSTM model and a combination of three embeddings (glove, paragram and wiki-news) concatenated together. The model gives predictions with an F1 score of 71%, which is good enough to reach 9th place out of 1,401 participating teams in the competition. In the second experiment, the execution time is improved by 13% by using PCA, while lowering the dimensionality of the embeddings by 66% and losing only half a percent of F1 accuracy. RP gave a constant accuracy of 66-67% regardless of the projected dimensions, compared to over 70% when using PCA. In the third experiment, the model gained around 12% accuracy from the initial to the final benchmarks, compared to 19% on the competition dataset. The best-achieved accuracy on the Sentiment140 dataset is 86% and thus higher than the 71% achieved on the Quora dataset.
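As an illustration of the random projection step, the following is a minimal sketch that maps a 300-dimensional embedding vector to a lower-dimensional one. It assumes a dense Gaussian projection matrix scaled by 1/sqrt(k); the abstract does not state which RP variant the thesis uses, and the target of 100 dimensions, the seed, and the function names are hypothetical choices for the example.

```python
import math
import random


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim Gaussian matrix scaled by 1/sqrt(out_dim).

    The scaling keeps expected squared vector lengths roughly unchanged.
    A fixed seed makes the projection reproducible across runs.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
        for _ in range(out_dim)
    ]


def project(vector, matrix):
    """Project one embedding vector into the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Example: reduce a (dummy) 300-dimensional embedding to 100 dimensions.
matrix = random_projection_matrix(300, 100)
reduced = project([1.0] * 300, matrix)
```

The same matrix has to be applied to every word vector in the vocabulary so that distances between the reduced vectors remain comparable.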
24 |
Extractive Text Summarization of Greek News Articles Based on Sentence-Clusters
Kantzola, Evangelia January 2020
This thesis introduces an extractive summarization system for Greek news articles based on sentence clustering. The main purpose of the paper is to evaluate the impact of three different types of text representation (Word2Vec embeddings, TF-IDF and LASER embeddings) on the summarization task. Using these techniques, we build three different versions of the initial summarizer. Moreover, we create a new corpus of gold standard summaries against which to evaluate the system summaries. The new collection of reference summaries is merged with a part of the MultiLing Pilot 2011 to constitute our main dataset. We perform both automatic and human evaluation. Our automatic ROUGE results suggest that System A, which employs average Word2Vec vectors to create sentence embeddings, outperforms the other two systems by yielding higher ROUGE-L F-scores. Contrary to our initial hypotheses, System C, which uses LASER embeddings, fails to surpass even the Word2Vec embeddings method, sometimes showing a weak sentence representation. With regard to the scores obtained in the manual evaluation task, we observe that System A, using average Word2Vec vectors, and System C, with LASER embeddings, tend to produce more coherent and adequate summaries than System B, which employs TF-IDF. Furthermore, the majority of system summaries are rated very high with respect to non-redundancy. Overall, System A, utilizing average Word2Vec embeddings, performs quite successfully according to both evaluations.
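The averaged-vector sentence representation used by System A can be sketched as follows. The averaging step follows the abstract; picking the sentence closest to the cluster centroid as the cluster's representative is an assumption made for this example, since the abstract does not say how sentences are selected from each cluster. The function names and the two-dimensional toy vectors in the usage note are hypothetical.

```python
import math


def sentence_embedding(tokens, vectors):
    """Average the vectors of all tokens that have an entry in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of this sentence is in the vocabulary
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_central(sentences, vectors):
    """Index of the sentence whose embedding is closest to the cluster centroid.

    Assumes every sentence contains at least one in-vocabulary token.
    """
    embeddings = [sentence_embedding(s, vectors) for s in sentences]
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return max(range(len(sentences)), key=lambda i: cosine(embeddings[i], centroid))
```

For example, with toy vectors `{"a": [1.0, 0.0], "b": [0.0, 1.0]}` and a cluster `[["a"], ["a", "b"], ["b"]]`, the middle sentence lies exactly on the centroid and would be chosen as the representative.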
25 |
Cluster selection for Clustered Federated Learning using Min-wise Independent Permutations and Word Embeddings / Kluster selektion för Klustrad Federerad Inlärning med användning av “Min-wise” Oberoende Permutations och Ordinbäddningar
Raveen Bandara Harasgama, Pulasthi January 2022
Federated learning is a widely established modern machine learning methodology where training is done directly on the client device with local client data and the local training results are shared to compute a global model. Federated learning emerged as a result of data ownership and the privacy concerns of traditional machine learning methodologies where data is collected and trained at a central location. However, in a distributed data environment, the training suffers significantly when the client data is not identically distributed. Hence, clustered federated learning was proposed where similar clients are clustered and trained independently to form specialized cluster models which are then used to compute a global model. In this approach, the cluster selection for clustered federated learning is a major factor that affects the effectiveness of the global model. This research presents two approaches for client clustering using local client data for clustered federated learning while preserving data privacy. The two proposed approaches use min-wise independent permutations to compute client signatures using text and word embeddings. These client signatures are then used as a representation of client data to cluster clients using agglomerative hierarchical clustering. Unlike previously proposed clustering methods, the two presented approaches do not use model updates, provide a better privacy-preserving mechanism and have a lower communication overhead. With extensive experimentation, we show that the proposed approaches outperform the random clustering approach. Finally, we present a client clustering methodology that can be utilized in a practical clustered federated learning environment. / Federerad inlärning är en etablerad och modern maskininlärnings metod. Träningen är utförd direkt på klientenheten med lokal klient data. Sen är dem lokala träningsresultat delad för att beräkna en global modell. 
Federerad inlärning har utvecklats på grund av dataägarskap- och dataintegritetsproblem vid traditionella maskininlärnings metoder. Dessa metoder samlar och tränar data på en central enhet. I den här metoden är kluster selektionen en viktig faktor som påverkar effektiviteten av den globala modellen. Detta forskningsarbete presenterar två metoder för klient klustring med hjälp av lokala klientdata för federerad inlärning samtidigt tar metoderna hänsyn på dataintegritet. Metoderna använder “min-wise” oberoende permutations och förtränade (“text och word”) inbäddningar. Dessa klientsignaturer används som en klientdata representation för att klustrar klienter med hjälp av agglomerativ hierarkisk klustring. Till skillnad från tidigare klustringsmetoder använder de två presenterade metoderna inte modelluppdateringar. Detta ger en bättre sekretessbevarande mekanism och har lägre kommunikationskostnader. De två presenterade metoderna överträffar den slumpmässiga klustringsmetoden genom omfattande experiment och analys. Till slut presenterar vi en klientklustermetodik som kan användas i en praktisk klustrad federerad inlärningsmiljö.
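The client-signature idea in this abstract can be illustrated with a minimal MinHash sketch. It approximates min-wise independent permutations with salted SHA-1 hashes, which is a common practical substitute and not necessarily what the thesis implements; the signature length of 64 and the function names are hypothetical. Each client would compute a signature over its local tokens and share only that signature, never the raw data.

```python
import hashlib


def _salted_hash(token, seed):
    """64-bit hash of a token under a given seed (stands in for one permutation)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=64):
    """Signature = minimum hash value per seed over the client's token set.

    Assumes `tokens` is non-empty.
    """
    token_set = set(tokens)
    return [min(_salted_hash(t, seed) for t in token_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

A distance such as `1 - estimated_jaccard(sig_a, sig_b)` between every pair of clients could then be passed to an agglomerative hierarchical clustering routine to form the client clusters.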
26 |
Extending a Text Classifier to Multiple Languages / Utöka en textklassificeringsmodell till flera språk
Byström, Albin January 2021
This thesis explores the possibility of extending monolingual and bilingual text classifiers to multiple languages. Two different language models are explored: language-aligned word embeddings and a transformer model. The goal was to take a classifier based on Swedish and English samples and extend it to Danish, German, and Finnish samples. The results show that extending a text classifier by word embedding alignment or by fine-tuning a multilingual transformer model is possible, but with varying accuracy depending on the language. / Denna avhandling undersöker möjligheten att utvidga enspråkiga och tvåspråkiga textklassificatorer till flera språk. Två olika språkmodeller utforskas, justeras ordinbäddningar och en transformatormodell. Målet var att ta en klassificerare baserad på svenska och engelska texter och utvidga den till danska, tyska och finska texter. Resultatet visar att det är möjligt att utöka en textklassificering med ordinbäddning eller genom att finjustera en flerspråkig transformatormodell, men träffsäkerheten varierar beroende på språk.
27 |
Readability: Man and Machine : Using readability metrics to predict results from unsupervised sentiment analysis / Läsbarhet: Människa och maskin : Användning av läsbarhetsmått för att förutsäga resultaten från oövervakad sentimentanalys
Larsson, Martin, Ljungberg, Samuel January 2021
Readability metrics assess the ease with which human beings read and understand written texts. With the advent of machine learning techniques that allow computers to also analyse text, this provides an interesting opportunity to investigate whether readability metrics can be used to inform on the ease with which machines understand texts. To that end, the specific machine analysed in this paper uses word embeddings to conduct unsupervised sentiment analysis. This specification minimises the need for labelling and human intervention, thus relying heavily on the machine instead of the human. Across two different datasets, sentiment predictions are made using Google’s Word2Vec word embedding algorithm, and are evaluated to produce a dichotomous output variable per sentiment. This variable, representing whether a prediction is correct or not, is then used as the dependent variable in a logistic regression with 17 readability metrics as independent variables. The resulting model has high explanatory power and the effects of readability metrics on the results from the sentiment analysis are mostly statistically significant. However, metrics affect sentiment classification in the two datasets differently, indicating that the metrics are expressions of linguistic behaviour unique to the datasets. The implication of the findings is that readability metrics could be used directly in sentiment classification models to improve modelling accuracy. Moreover, the results also indicate that machines are able to pick up on information that human beings do not pick up on, for instance that certain words are associated with more positive or negative sentiments. / Läsbarhetsmått bedömer hur lätt eller svårt det är för människor att läsa och förstå skrivna texter. Eftersom nya maskininlärningstekniker har utvecklats kan datorer numera också analysera texter. 
Därför är en intressant infallsvinkel huruvida läsbarhetsmåtten också kan användas för att bedöma hur lätt eller svårt det är för maskiner att förstå texter. Mot denna bakgrund använder den specifika maskinen i denna uppsats ordinbäddningar i syfte att utföra oövervakad sentimentanalys. Således minimeras behovet av etikettering och mänsklig handpåläggning, vilket resulterar i en mer djupgående analys av maskinen istället för människan. I två olika dataset jämförs rätt svar mot sentimentförutsägelser från Googles ordinbäddnings-algoritm Word2Vec för att producera en binär utdatavariabel per sentiment. Denna variabel, som representerar om en förutsägelse är korrekt eller inte, används sedan som beroende variabel i en logistisk regression med 17 olika läsbarhetsmått som oberoende variabler. Den resulterande modellen har högt förklaringsvärde och effekterna av läsbarhetsmåtten på resultaten från sentimentanalysen är mestadels statistiskt signifikanta. Emellertid är effekten på klassificeringen beroende på dataset, vilket indikerar att läsbarhetsmåtten ger uttryck för olika lingvistiska beteenden som är unika till datamängderna. Implikationen av resultaten är att läsbarhetsmåtten kan användas direkt i modeller som utför sentimentanalys för att förbättra deras prediktionsförmåga. Dessutom indikerar resultaten också att maskiner kan plocka upp på information som människor inte kan, exempelvis att vissa ord är associerade med positiva eller negativa sentiment.
28 |
Finding Street Gang Member Profiles on Twitter
Balasuriya, Lakshika January 2017
No description available.
29 |
Learning representations for Information Retrieval
Sordoni, Alessandro 03 1900
La recherche d'informations s'intéresse, entre autres, à répondre à des questions comme: est-ce qu'un document est pertinent à une requête ?
Est-ce que deux requêtes ou deux documents sont similaires ? Comment la similarité entre deux requêtes ou documents peut être utilisée pour améliorer
l'estimation de la pertinence ? Pour donner réponse à ces questions, il est nécessaire d'associer chaque document et requête à des représentations interprétables
par ordinateur. Une fois ces représentations estimées, la similarité peut correspondre, par exemple, à une distance ou une divergence qui opère dans l'espace de représentation.
On admet généralement que la qualité d'une représentation a un impact direct sur l'erreur d'estimation par rapport à la vraie pertinence, jugée par un humain.
Estimer de bonnes représentations des documents et des requêtes a longtemps été un problème central de la recherche d'informations.
Le but de cette thèse est de proposer des nouvelles méthodes pour estimer les représentations des documents et des requêtes, la relation de pertinence entre eux et ainsi modestement avancer l'état de l'art du domaine.
Nous présentons quatre articles publiés dans des conférences internationales et un article publié dans un forum d'évaluation. Les deux premiers articles concernent des méthodes qui créent l'espace de représentation selon une connaissance à priori sur les caractéristiques qui sont importantes pour la tâche à accomplir. Ceux-ci nous amènent à présenter un nouveau modèle de recherche d'informations qui diffère des modèles existants sur le plan théorique et de l'efficacité expérimentale. Les deux derniers articles marquent un changement fondamental dans l'approche de construction des représentations. Ils bénéficient notamment de l'intérêt de recherche dont les techniques d'apprentissage profond par réseaux de neurones, ou deep learning, ont fait récemment l'objet. Ces modèles d'apprentissage élicitent automatiquement les caractéristiques importantes pour la tâche demandée à partir d'une quantité importante de données. Nous nous intéressons à la modélisation des relations sémantiques entre documents et requêtes ainsi qu'entre deux ou plusieurs requêtes. Ces derniers articles marquent les premières applications de l'apprentissage de représentations par réseaux de neurones à la recherche d'informations. Les modèles proposés ont aussi produit une performance améliorée sur des collections de test standard. Nos travaux nous mènent à la conclusion générale suivante: la performance en recherche d'informations pourrait drastiquement être améliorée en se basant sur les approches d'apprentissage de représentations. / Information retrieval is generally concerned with answering questions such as: is this document relevant to this query?
How similar are two queries or two documents?
How can query and document similarity be used to enhance relevance estimation?
In order to answer these questions, it is necessary to access computational representations of documents and queries.
For example, similarities between documents and queries may correspond to a distance or a divergence defined on the representation space.
It is generally assumed that the quality of the representation has a direct impact on the bias with respect to the true similarity, estimated by means of human intervention.
Building useful representations for documents and queries has always been central to information retrieval research.
The goal of this thesis is to provide new ways of estimating such representations and the relevance relationship between them.
We present four articles that have been published in international conferences and one published in an information retrieval evaluation forum. The first two articles can be categorized as feature engineering approaches, which transduce a priori knowledge about the domain into the features of the representation.
We present a novel retrieval model that compares favorably to existing models in terms of both theoretical originality and experimental effectiveness.
The remaining two articles mark a significant change in our vision and originate from the widespread interest in deep learning research that took place during the time they were written.
Therefore, they naturally belong to the category of representation learning approaches, also known as feature learning. Unlike previous approaches, the learning model discovers on its own the most important features for the task at hand, given a considerable amount of labeled data. We propose to model the semantic relationships between documents and queries and between queries themselves.
The models presented have also shown improved effectiveness on standard test collections. These last articles are amongst the first applications of representation learning with neural networks for information retrieval. This line of research leads to the following observation: future improvements in information retrieval effectiveness will have to rely on representation learning techniques instead of manually defining the representation space.
30 |
Deep neural semantic parsing: translating from natural language into SPARQL / Análise semântica neural profunda: traduzindo de linguagem natural para SPARQL
Luz, Fabiano Ferreira 07 February 2019
Semantic parsing is the process of mapping a natural-language sentence into a machine-readable, formal representation of its meaning. The LSTM Encoder-Decoder is a neural architecture with the ability to map a source language into a target one. We are interested in the problem of mapping natural language into SPARQL queries, and we seek to contribute strategies that do not rely on handcrafted rules, high-quality lexicons, manually built templates or other complex handmade structures. In this context, we present two contributions to the problem of semantic parsing, starting from the LSTM encoder-decoder. While natural language has well-defined vector representation methods that use a very large volume of texts, formal languages, like SPARQL queries, suffer from a lack of suitable methods for vector representation. In the first contribution we improve the representation of SPARQL vectors. We start by obtaining an alignment matrix between the two vocabularies, natural language and SPARQL terms, which allows us to refine a vector representation of SPARQL items. With this refinement we obtained better results in the subsequent training of the semantic parsing model. In the second contribution we propose a neural architecture, which we call Encoder CFG-Decoder, whose output conforms to a given context-free grammar. Unlike the traditional LSTM encoder-decoder, our model provides a grammatical guarantee for the mapping process, which is particularly important for practical cases where grammatical errors can cause critical failures. Results confirm that any output generated by our model obeys the given CFG, and we observe a translation accuracy improvement when compared with other results from the literature. / A análise semântica é o processo de mapear uma sentença em linguagem natural para uma representação formal, interpretável por máquina, do seu significado.
O LSTM Encoder-Decoder é uma arquitetura de rede neural com a capacidade de mapear uma sequência de origem para uma sequência de destino. Estamos interessados no problema de mapear a linguagem natural em consultas SPARQL e procuramos contribuir com estratégias que não dependam de regras artesanais, léxico de alta qualidade, modelos construídos manualmente ou outras estruturas complexas feitas à mão. Neste contexto, apresentamos duas contribuições para o problema de análise semântica partindo da arquitetura LSTM Encoder-Decoder. Enquanto para a linguagem natural existem métodos de representação vetorial bem definidos que usam um volume muito grande de textos, as linguagens formais, como as consultas SPARQL, sofrem com a falta de métodos adequados para representação vetorial. Na primeira contribuição, melhoramos a representação dos vetores SPARQL. Começamos obtendo uma matriz de alinhamento entre os dois vocabulários, linguagem natural e termos SPARQL, o que nos permite refinar uma representação vetorial dos termos SPARQL. Com esse refinamento, obtivemos melhores resultados no treinamento posterior para o modelo de análise semântica. Na segunda contribuição, propomos uma arquitetura neural, que chamamos de Encoder CFG-Decoder, cuja saída está de acordo com uma determinada gramática livre de contexto. Ao contrário do modelo tradicional LSTM Encoder-Decoder, nosso modelo fornece uma garantia gramatical para o processo de mapeamento, o que é particularmente importante para casos práticos nos quais erros gramaticais podem causar falhas críticas em um compilador ou interpretador. Os resultados confirmam que qualquer resultado gerado pelo nosso modelo obedece à CFG dada, e observamos uma melhora na precisão da tradução quando comparada com outros resultados da literatura.