111
Automatic Poetry Classification Using Natural Language Processing. Kesarwani, Vaibhav. January 2018.
Poetry, as a special form of literature, is an important subject for computational linguistics: it has a high density of emotion, figures of speech, vividness, creativity, and ambiguity, and it poses a much greater challenge for the application of Natural Language Processing algorithms than any other literary genre.
We establish a computational model that classifies poems based on similarity features such as rhyme, diction, and metaphor.
For rhyme analysis, we investigate methods for classifying poems based on rhyme patterns. First, we give an overview of the different types of rhyme, along with a detailed description of how rhyme types and sub-types are detected by applying a pronunciation dictionary to our poetry dataset. We achieve an accuracy of 96.51% in identifying rhymes in poetry by applying a phonetic similarity model. We then define a rhyme quantification metric, RhymeScore, based on matching the phonetic transcriptions within each poem. We also develop an application that visualizes the resulting RhymeScore as a scatter plot in two or three dimensions.
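As an illustration of the kind of pronunciation-dictionary lookup this relies on (a minimal sketch, assuming NLTK's copy of the CMU Pronouncing Dictionary; it checks only perfect end rhyme and is not the thesis's RhymeScore implementation):

```python
# Hedged sketch: perfect-rhyme check via the CMU Pronouncing Dictionary.
# Requires: nltk.download('cmudict'). Rhyme sub-types are not handled here.
from nltk.corpus import cmudict

PHONES = cmudict.dict()

def rhyme_part(word):
    """Return the phonemes from the last stressed vowel to the end of the word."""
    prons = PHONES.get(word.lower())
    if not prons:
        return None
    phones = prons[0]
    # stressed vowels carry a '1' or '2' marker, e.g. 'AY1'
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":
            return tuple(phones[i:])
    return tuple(phones)

def is_perfect_rhyme(w1, w2):
    p1, p2 = rhyme_part(w1), rhyme_part(w2)
    return p1 is not None and p1 == p2

print(is_perfect_rhyme("night", "light"))    # True
print(is_perfect_rhyme("night", "nothing"))  # False
```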
For diction analysis, we investigate methods for classifying poems based on diction. First, the quantitative linguistic and semantic features that constitute diction are enumerated; we then describe the methodology used to compute these features from our poetry dataset. We also build a word embedding model on our poetry dataset of 1.5 million words, with 100 dimensions, and perform a comparative analysis against GloVe embeddings.
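A minimal sketch of how such a corpus-specific embedding model might be trained and compared against general-purpose vectors, assuming gensim 4.x and a tokenized poetry corpus on disk; the file names, hyperparameters, and query word are placeholders, not the thesis's configuration:

```python
# Hedged sketch: train 100-dimensional word2vec vectors on a poetry corpus.
# "poetry_corpus.txt" is a hypothetical file with one tokenized line of verse per line.
from gensim.models import Word2Vec

with open("poetry_corpus.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
model.wv.save("poetry_vectors.kv")

# Nearest neighbours in the poetry space can then be compared against those of a
# general-domain model (e.g. GloVe converted to word2vec format) for the same word.
print(model.wv.most_similar("rose", topn=10))
```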
Metaphor is a part of diction, but as it is a complex topic in its own right, we address it as a stand-alone problem and develop several methods for it. Previous work on metaphor detection relies on either rule-based or statistical models, none of which has been applied to poetry. Our methods focus on metaphor detection in a poetry corpus, but we test on non-poetry data as well. We combine rule-based and statistical models (word embeddings) to develop a new classification system. Our first metaphor detection method achieves a precision of 0.759 and a recall of 0.804 in identifying one type of metaphor in poetry, using a Support Vector Machine classifier with various types of features. Furthermore, our deep learning model based on a Convolutional Neural Network achieves a precision of 0.831 and a recall of 0.836 on the same task. We also develop an application for generic metaphor detection in any type of natural text.
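As a rough sketch of the statistical side of such a classifier (not the thesis's actual feature set), an SVM can be trained on embedding-derived features for candidate verb-noun pairs; the random stand-in vectors and toy labels below are assumptions made purely for illustration:

```python
# Hedged sketch: an SVM over simple embedding features for metaphor candidates.
# Random vectors stand in for real word embeddings; the feature design and the
# toy (verb, noun, label) examples are illustrative, not the thesis's setup.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
vocab = ["devour", "kill", "book", "meal", "time", "fly"]
emb = {w: rng.normal(size=100) for w in vocab}   # stand-in for poetry/GloVe vectors

def pair_features(verb, noun):
    v, n = emb[verb], emb[noun]
    cos = float(v @ n / (np.linalg.norm(v) * np.linalg.norm(n)))
    return np.concatenate([v, n, [cos]])

# label 1 = metaphorical usage, 0 = literal usage (toy examples only)
pairs = [("devour", "book", 1), ("devour", "meal", 0),
         ("kill", "time", 1), ("kill", "fly", 0)]
X = np.array([pair_features(v, n) for v, n, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict(pair_features("devour", "time").reshape(1, -1)))
```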
112
Exploring the Compositionality of German Particle Verbs. Rawein, Carina. January 2018.
In this thesis we explore the compositionality of particle verbs using distributional similarity and pre-trained word embeddings. We investigate the compositionality of 100 pairs of particle verbs and their base verbs. The resulting rankings are compared to a ranking based on human compositionality ratings. In our distributional approach we use features such as context window size and restriction to content words, and we use only particle verbs with a single word sense. We then compare the distributional approach to a ranking obtained with pre-trained word embeddings. While none of the results are statistically significant, we show that word embeddings are not automatically superior to the more traditional distributional approach.
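A sketch of the embedding-based ranking could look as follows, assuming pre-trained German vectors in word2vec format; the file name and the example verb pairs with ratings are placeholders, not the thesis's data:

```python
# Hedged sketch: score compositionality as the cosine similarity between a particle
# verb and its base verb, then correlate the ranking with human ratings.
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

kv = KeyedVectors.load_word2vec_format("german_vectors.bin", binary=True)  # placeholder path

# (particle verb, base verb, mean human compositionality rating) -- illustrative entries
data = [("anfangen", "fangen", 2.1), ("aufstehen", "stehen", 5.3), ("einkaufen", "kaufen", 6.0)]

model_scores, human_scores = [], []
for pv, base, rating in data:
    if pv in kv and base in kv:
        model_scores.append(kv.similarity(pv, base))
        human_scores.append(rating)

rho, p = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```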
113
DiSH: Democracy in State Houses. Russo, Nicholas A. 01 February 2019.
In our current political climate, state-level legislators have become increasingly important. Due to cuts in funding and a growing focus on the national level, public oversight of these legislators has drastically decreased. This makes it difficult for citizens and activists to understand the relationships and commonalities between legislators. This thesis provides three contributions to address this issue. First, we created a dataset containing over 1,200 features focused on a legislator's activity on bills. Second, we created embeddings that represent a legislator's level of activity and engagement on a given bill, using a custom model called Democracy2Vec. Third, we provided a case study focused on the 2015-2016 session of the California State Legislature and had our results verified by a political expert. Our results show that our embeddings can explain relationships between legislators and how they are likely to act during the legislative process.
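Democracy2Vec itself is not described here in reproducible detail, but as a loose illustration of deriving legislator embeddings from bill activity, one could factor a legislator-by-bill activity matrix with truncated SVD and compare legislators by cosine similarity; everything below (matrix construction, dimensions, random data) is an assumption and not the thesis's model:

```python
# Hedged sketch, NOT Democracy2Vec: embed legislators by factoring a
# legislator-by-bill activity matrix, then find each legislator's nearest peers.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_legislators, n_bills = 120, 2000
# activity[i, j] could encode authorship, co-sponsorship, votes, amendments, ...
activity = rng.poisson(0.05, size=(n_legislators, n_bills)).astype(float)

svd = TruncatedSVD(n_components=32, random_state=0)
embeddings = svd.fit_transform(activity)        # one 32-d vector per legislator

sims = cosine_similarity(embeddings)
nearest_to_first = np.argsort(-sims[0])[1:6]    # 5 most similar legislators to index 0
print(nearest_to_first)
```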
114
Optimizing Deep Neural Networks for Classification of Short Texts. Pettersson, Fredrik. January 2019.
This master's thesis investigates how a state-of-the-art (SOTA) deep neural network (NN) model can be created for a specific natural language processing (NLP) dataset, the effects of using different dimensionality reduction techniques on common pre-trained word embeddings, and how well such a model generalizes to a secondary dataset. The research is motivated by two factors. One is that the construction of a machine learning (ML) text classification (TC) model is typically done around a specific dataset and often requires a lot of manual intervention, so it is hard to know exactly which procedures to implement for a given dataset and how the result will be affected. The other is that, if the dimensionality of pre-trained embedding vectors can be lowered without losing accuracy, the execution time saved can be spent on other techniques to achieve even higher accuracy. A handful of deep neural network architectures are used, namely a convolutional neural network (CNN), a long short-term memory network (LSTM) and a bidirectional LSTM (Bi-LSTM). These architectures are combined with four different word embeddings: GoogleNews-vectors-negative300, glove.840B.300d, paragram_300_sl999 and wiki-news-300d-1M. Three main experiments are conducted. In the first experiment, a top-performing TC model is created for a recent NLP competition held at Kaggle.com, and each implemented procedure is benchmarked for its effect on the accuracy and execution time of the model. In the second experiment, principal component analysis (PCA) and random projection (RP) are applied to the pre-trained word embeddings used in the top-performing model to investigate how accuracy and execution time are affected when creating lower-dimensional embedding vectors. In the third experiment, the same model is benchmarked on a separate dataset (Sentiment140) to investigate how well it generalizes to other data and how each implemented procedure affects the accuracy compared to the original dataset. The first experiment results in a bidirectional LSTM model and a combination of three embeddings, glove, paragram and wiki-news, concatenated together. The model gives predictions with an F1 score of 71%, which was good enough to reach 9th place out of 1,401 participating teams in the competition. In the second experiment, using PCA improves the execution time by 13% while lowering the dimensionality of the embeddings by 66% and losing only half a percentage point of F1 score; RP gives a constant accuracy of 66-67% regardless of the projected dimension, compared to over 70% when using PCA. In the third experiment, the model gains around 12% accuracy from the initial to the final benchmarks, compared to 19% on the competition dataset. The best accuracy achieved on the Sentiment140 dataset is 86%, higher than the 71% achieved on the Quora dataset.
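A minimal sketch of the dimensionality-reduction step from the second experiment, assuming a 300-dimensional embedding matrix already loaded as a NumPy array (here random numbers stand in for real pre-trained vectors, and the target of 100 dimensions is chosen only to mirror the roughly 66% reduction mentioned above):

```python
# Hedged sketch: reduce 300-d pre-trained embeddings to 100 dimensions with PCA
# and with Gaussian random projection before feeding them to the classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(50000, 300))   # placeholder for e.g. GloVe vectors

pca = PCA(n_components=100)
emb_pca = pca.fit_transform(embedding_matrix)
print("PCA:", emb_pca.shape, "explained variance:", pca.explained_variance_ratio_.sum())

rp = GaussianRandomProjection(n_components=100, random_state=0)
emb_rp = rp.fit_transform(embedding_matrix)
print("Random projection:", emb_rp.shape)
```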
115
Methods for increasing cohesion in automatically extracted summaries of Swedish news articles: Using and extending multilingual sentence transformers in the data-processing stage of training BERT models for extractive text summarization. Andersson, Elsa. January 2022.
Developments in deep learning and machine learning overall have created a plethora of opportunities for training automatic text summarization (ATS) models that produce summaries of higher quality. ATS can be split into extractive and abstractive tasks: extractive models select sentences from the original text to create summaries, whereas abstractive models generate novel sentences. While extractive summaries are often preferred over abstractive ones, summaries created by extractive models trained on Swedish texts often lack cohesion, which affects the readability and overall quality of the summary. There is therefore a need to improve the process of training ATS models in terms of cohesion, while maintaining other text qualities such as content coverage. This thesis explores and implements methods at the data-processing stage aimed at improving the cohesion of generated summaries. The methods are based on Sentence-BERT, which creates advanced sentence embeddings that can be used to rank the sentences of a text by whether they should be included in the extractive summary. Three models are trained using different methods and evaluated using ROUGE and BERTScore for content coverage and Coh-Metrix for cohesion. The results suggest that the methods can indeed be used to create more cohesive summaries, although content coverage is reduced, which leaves considerable room for future exploration and further implementation.
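As a hedged sketch of how multilingual sentence embeddings can rank sentences for an extractive summary (the model name, the similarity-to-document ranking rule, and the sentence budget are assumptions; the thesis trains BERT-based summarizers rather than using this unsupervised ranking as-is):

```python
# Hedged sketch: rank sentences by cosine similarity to the whole-document embedding
# and keep the top-k sentences in their original order. Illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model name

def extractive_summary(sentences, k=3):
    sent_emb = model.encode(sentences, convert_to_tensor=True)
    doc_emb = model.encode(" ".join(sentences), convert_to_tensor=True)
    scores = util.cos_sim(doc_emb, sent_emb)[0]
    top = sorted(scores.argsort(descending=True)[:k].tolist())  # restore document order
    return [sentences[i] for i in top]

sentences = [
    "Regeringen presenterade i dag en ny budget.",
    "Satsningarna riktas främst mot skola och vård.",
    "Oppositionen är kritisk till finansieringen.",
    "Beslut väntas i riksdagen senare i höst.",
]
print(extractive_summary(sentences, k=2))
```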
116
Morphisms of real calculi from a geometric and algebraic perspective. Tiger Norkvist, Axel. January 2021.
Noncommutative geometry has over the past four decades grown into a rich field of study. Novel ideas and concepts are rapidly being developed, and a notable application of the theory outside of pure mathematics is quantum theory. This thesis focuses on a derivation-based approach to noncommutative geometry using the framework of real calculi, which is a rather direct approach to the subject. Due to their direct nature, real calculi are useful when studying classical concepts in Riemannian geometry and how they may be generalized to a noncommutative setting. The thesis aims to shed light on algebraic aspects of real calculi by introducing a concept of morphisms of real calculi, which enables the study of real calculi on a structural level. In particular, real calculi over matrix algebras are discussed from both an algebraic and a geometric perspective. Morphisms are also interpreted geometrically, giving a way to develop a noncommutative theory of embeddings. As an example, the noncommutative torus is minimally embedded into the noncommutative 3-sphere.
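For readers unfamiliar with the last example, the noncommutative torus referred to is the standard irrational rotation algebra; the relation below is textbook background rather than the thesis's own construction:

```latex
% Standard background: the noncommutative torus is the unital *-algebra generated
% by two unitaries U and V subject to the deformed commutation relation
\[
  VU \;=\; e^{2\pi i \theta}\, UV, \qquad \theta \in \mathbb{R},
\]
% for irrational \theta this is the irrational rotation algebra; at \theta = 0
% the generators commute and one recovers functions on the ordinary torus.
```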
117
Optimizing the Performance of Text Classification Models by Improving the Isotropy of the Embeddings using a Joint Loss Function. Attieh, Joseph. January 2022.
Recent studies show that the spatial distribution of the sentence representations generated by pre-trained language models is highly anisotropic, meaning that the representations are not uniformly distributed among the directions of the embedding space. The expressiveness of the embedding space is thus limited, as the embeddings are less distinguishable and less diverse, which degrades the performance of the models on downstream tasks. Most state-of-the-art methods in this area improve the isotropy of the sentence embeddings by refining the corresponding contextual word representations and then deriving the sentence embeddings from these refined representations. In this thesis, we propose to improve the quality and distribution of the sentence embeddings extracted from the [CLS] token of pre-trained language models by improving the isotropy of the embeddings. We add one feed-forward layer, referred to as the Isotropy Layer, between the model and the downstream task layers, and train this layer using a novel joint loss function that optimizes an isotropy quality measure together with the downstream task loss. This joint loss pushes the embeddings produced by the Isotropy Layer to be more isotropic while retaining the semantics needed to perform the downstream task. The proposed approach yields transformed embeddings with better isotropy that generalize better on the downstream task, and it requires training only one feed-forward layer instead of retraining the whole network. We quantify and evaluate the isotropy through multiple metrics, mainly the explained variance and the IsoScore. Experimental results on three GLUE datasets with classification as the downstream task show that our proposed method is on par with the state of the art, achieving performance gains of around 2-3% on the downstream tasks compared to the baseline. We also present a small case study on a language abuse detection dataset and interpret some of the findings in light of the results.
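A hedged sketch of what such a joint objective can look like (the isotropy penalty below, which pushes the batch covariance of the transformed [CLS] embeddings towards a scaled identity, is one possible differentiable proxy chosen for illustration, not necessarily the measure optimized in the thesis):

```python
# Hedged sketch: an "isotropy layer" on top of [CLS] embeddings trained with
# joint loss = task loss + lambda * isotropy penalty. The penalty used here
# (distance of the normalized batch covariance from the identity) is an
# illustrative proxy for an isotropy measure.
import torch
import torch.nn as nn

class IsotropyLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, cls_embeddings):
        return self.linear(cls_embeddings)

def isotropy_penalty(z):
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    cov = cov / cov.diagonal().mean()            # normalize overall scale
    eye = torch.eye(z.shape[1], device=z.device)
    return ((cov - eye) ** 2).mean()

# toy usage with random stand-ins for frozen encoder outputs
torch.manual_seed(0)
dim, batch, lam = 768, 32, 0.1
iso, head = IsotropyLayer(dim), nn.Linear(dim, 2)
opt = torch.optim.Adam(list(iso.parameters()) + list(head.parameters()), lr=1e-3)

cls = torch.randn(batch, dim)
labels = torch.randint(0, 2, (batch,))

opt.zero_grad()
z = iso(cls)
loss = nn.functional.cross_entropy(head(z), labels) + lam * isotropy_penalty(z)
loss.backward()
opt.step()
print(float(loss))
```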
118
Combining Node Embeddings From Multiple Contexts Using Multi Dimensional Scaling. Yandrapally, Aruna Harini. 04 October 2021.
No description available.
119
Increasing speaker invariance in unsupervised speech learning by partitioning probabilistic models using linear siamese networks. Fahlström Myrman, Arvid. January 2017.
Unsupervised learning of speech is concerned with automatically finding patterns such as words or speech sounds, without supervision in the form of orthographical transcriptions or a priori knowledge of the language. However, a fundamental problem is that unsupervised speech learning methods tend to discover highly speaker-specific and context-dependent representations of speech. We propose a method for improving the quality of posteriorgrams generated from an unsupervised model through partitioning of the latent classes discovered by the model. We do this by training a sparse siamese model to find a linear transformation of input posteriorgrams, extracted from the unsupervised model, to lower-dimensional posteriorgrams. The siamese model makes use of same-category and different-category speech fragment pairs obtained through unsupervised term discovery. After training, the model is converted into an exact partitioning of the posteriorgrams. We evaluate the model on the minimal-pair ABX task in the context of the Zero Resource Speech Challenge. We are able to demonstrate that our method significantly reduces the dimensionality of standard Gaussian mixture model posteriorgrams, while also making them more speaker invariant. This suggests that the model may be viable as a general post-processing step to improve probabilistic acoustic features obtained by unsupervised learning.
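A hedged sketch of the core training step, a linear siamese map over posteriorgram frames trained on same/different pairs, might look as follows; the cosine embedding loss, the dimensions, and the random stand-in data are assumptions, and the sparsity constraint and the conversion of the trained map into an exact partition are omitted:

```python
# Hedged sketch: learn a linear map from high-dimensional posteriorgram frames to
# lower-dimensional ones using same/different pairs from unsupervised term discovery.
import torch
import torch.nn as nn

n_classes, n_target = 1024, 64
linear = nn.Linear(n_classes, n_target, bias=False)
opt = torch.optim.Adam(linear.parameters(), lr=1e-3)
loss_fn = nn.CosineEmbeddingLoss()

def random_pair(same):
    """Stand-in for a pair of frame posteriorgrams returned by term discovery."""
    a = torch.softmax(torch.randn(1, n_classes), dim=-1)
    b = a + 0.01 * torch.randn(1, n_classes) if same else torch.softmax(torch.randn(1, n_classes), dim=-1)
    return a, b

for step in range(100):
    same = step % 2 == 0
    a, b = random_pair(same)
    target = torch.tensor([1.0 if same else -1.0])
    loss = loss_fn(linear(a), linear(b), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```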
120
Cluster selection for Clustered Federated Learning using Min-wise Independent Permutations and Word Embeddings. Raveen Bandara Harasgama, Pulasthi. January 2022.
Federated learning is a widely established modern machine learning methodology in which training is done directly on the client device with local client data, and the local training results are shared to compute a global model. Federated learning emerged as a result of data ownership and privacy concerns with traditional machine learning methodologies, where data is collected and models are trained at a central location. However, in a distributed data environment, training suffers significantly when the client data is not identically distributed. Hence, clustered federated learning was proposed, in which similar clients are clustered and trained independently to form specialized cluster models, which are then used to compute a global model. In this approach, cluster selection is a major factor that affects the effectiveness of the global model. This research presents two approaches for clustering clients based on their local data for clustered federated learning while preserving data privacy. The two proposed approaches use min-wise independent permutations to compute client signatures from text and word embeddings. These client signatures are then used as a representation of the client data to cluster clients using agglomerative hierarchical clustering. Unlike previously proposed clustering methods, the two presented approaches do not use model updates, provide a better privacy-preserving mechanism, and have a lower communication overhead. With extensive experimentation, we show that the proposed approaches outperform the random clustering approach. Finally, we present a client clustering methodology that can be utilized in a practical clustered federated learning environment.
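A hedged sketch of the signature-and-clustering idea, using seeded hash functions as the min-wise permutations and average-linkage hierarchical clustering from SciPy; the client token sets and the number of clusters are illustrative, and this is not the thesis's exact pipeline:

```python
# Hedged sketch: MinHash-style client signatures from local token sets, followed by
# agglomerative (average-linkage) clustering on the estimated Jaccard distances.
import hashlib
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

NUM_PERM = 64

def h(seed, token):
    return int(hashlib.sha1(f"{seed}:{token}".encode()).hexdigest(), 16)

def minhash_signature(tokens):
    return np.array([min(h(seed, t) for t in tokens) for seed in range(NUM_PERM)])

def jaccard_estimate(sig_a, sig_b):
    return float(np.mean(sig_a == sig_b))

# illustrative client token sets (in practice: tokens from each client's local data)
clients = {
    "client_0": {"football", "goal", "league", "match"},
    "client_1": {"goal", "match", "team", "league"},
    "client_2": {"stock", "market", "shares", "profit"},
    "client_3": {"market", "profit", "trading", "shares"},
}
names = list(clients)
sigs = [minhash_signature(clients[n]) for n in names]

n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = 1.0 - jaccard_estimate(sigs[i], sigs[j])

labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
print(dict(zip(names, labels)))
```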