Global ETD Search

1	A Confidence-based Hierarchical Word Clustering for Document Classification Yin, Kai-Tai 09 August 2007 (has links) We propose a novel feature reduction approach to group words hierarchically into clusters which can then be used as new features for document classification. Initially, each word constitutes a cluster. We calculate the mutual confidence between any two different words. The pair of clusters containing the two words with the highest mutual confidence are combined into a new cluster. This process of merging is iterated until all the mutual confidences between the un-processed pair of words are smaller than a predefined threshold or only one cluster exists. In this way, a hierarchy of word clusters is obtained. The user can decide the clusters, from a certain level, to be used as new features for document classification. Experimental results have shown that our method can perform better than other methods. Classification Word Clustering Confidence Hierarchical
2	Longitudinal Comparison of Word Associations in Shallow Word Embeddings Geetanjali Bihani (8815607) 08 May 2020 (has links) Word embeddings are utilized in various natural language processing tasks. Although effective in helping computers learn linguistic patterns employed in natural language, word embeddings also tend to learn unwanted word associations. This affects the performance of NLP tasks, as unwanted word associations propagate and amplify biases. Current word association evaluation methods for word embeddings do not account for changes in word embedding models and training corpora, when creating the rubric for word association evaluation. Current literature also lacks a consistent training and evaluation protocol for comparison of word associations across varying word embedding models and varying training corpora. In order to address this gap in prior literature, this research aims to evaluate different types of word associations, not limited to gender, racial or religious attributes, incorporating and evaluating the diachronic and variable nature of words over text data collected over a period of 200 years. This thesis introduces a framework to track changes in word associations between neutral words (proper nouns) and attributes (adjectives), across different word embedding models, over a temporal dimension, by evaluating clustering tendencies between neutral words (proper nouns) and attributive words (adjectives) over five different word embedding frameworks: Word2vec (CBOW), Word2vec (Skip-gram), GloVe, fastText (CBOW) and fastText (Skip-gram) and 20 decades of text data from 1810s to 2000s. <a>Finally, various cluster level and corpus level measurements will be compared across aforementioned word embedding frameworks, to find how</a> word associations evolve with changes in the embedding model and the training corpus. Applied Computer Science Information Systems Natural Language Processing word embeddings longitudinal corpora evaluation word clustering
3	Word Clustering in an Interactive Text Analysis Tool / Klustring av ord i ett interaktivt textanalysverktyg Gränsbo, Gustav January 2019 (has links) A central operation of users of the text analysis tool Gavagai Explorer is to look through a list of words and arrange them in groups. This thesis explores the use of word clustering to automatically arrange the words in groups intended to help users. A new word clustering algorithm is introduced, which attempts to produce word clusters tailored to be small enough for a user to quickly grasp the common theme of the words. The proposed algorithm computes similarities among words using word embeddings, and clusters them using hierarchical graph clustering. Multiple variants of the algorithm are evaluated in an unsupervised manner by analysing the clusters they produce when applied to 110 data sets previously analysed by users of Gavagai Explorer. A supervised evaluation is performed to compare clusters to the groups of words previously created by users of Gavagai Explorer. Results show that it was possible to choose a set of hyperparameters deemed to perform well across most data sets in the unsupervised evaluation. These hyperparameters also performed among the best on the supervised evaluation. It was concluded that the choice of word embedding and graph clustering algorithm had little impact on the behaviour of the algorithm. Rather, limiting the maximum size of clusters and filtering out similarities between words had a much larger impact on behaviour. word clustering word embedding distributional semantics hierarchical clustering text analytics language technology natural language processing gavagai

1

Page generated in 0.091 seconds