• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 19
  • 3
  • 2
  • Tagged with
  • 27
  • 27
  • 12
  • 11
  • 10
  • 10
  • 9
  • 7
  • 6
  • 5
  • 5
  • 5
  • 4
  • 4
  • 4
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

The Use of Distributional Semantics in Text Classification Models : Comparative performance analysis of popular word embeddings

Norlund, Tobias January 2016 (has links)
In the field of Natural Language Processing, supervised machine learning is commonly used to solve classification tasks such as sentiment analysis and text categorization. The classical way of representing the text has been to use the well known Bag-Of-Words representation. However lately low-dimensional dense word vectors have come to dominate the input to state-of-the-art models. While few studies have made a fair comparison of the models' sensibility to the text representation, this thesis tries to fill that gap. We especially seek insight in the impact various unsupervised pre-trained vectors have on the performance. In addition, we take a closer look at the Random Indexing representation and try to optimize it jointly with the classification task. The results show that while low-dimensional pre-trained representations often have computational benefits and have also reported state-of-the-art performance, they do not necessarily outperform the classical representations in all cases.
2

Quantifying Determiners from the Distributional Semantics View / Quantifying Determiners from the Distributional Semantics View

Gutiérrez Vasques, María Ximena January 2013 (has links)
Název práce: Quantifying Determiners from the Distributional Semantics View Autor: Maria Ximena Gutierrez Vasques Katedra: Ústav formální a aplikované lingvistiky Vedoucí diplomové práce: doc. RNDr. Markéta Lopatková, Ph.D. Abstrakt: Distribuční sémanika představuje moderní přístup k zachycení sémantiky přirozeného jazyka. Jedním z témat, kterým zatím v rámci tohoto přístupu nebyla věnována dostatečná pozornost, je možnost automatické detekce logických relací jako vyplývání. Tato diplomová práce navazuje na práci autorů Baroni, Bernar- di, Do and Shan (2012), kteří se zabývají relací vyplývání mezi kvantifikujícími výrazy. Citovaná práce využívá detekce pomocí SVN klasifikátorů natrénavaných na sémantických vektorech reprezentujících relaci vyplývání. Popisované exper- imenty se nezaměřovaly na nastaveni parametrů SVN klasifikátoru, proto se v této práci vracíme k původním experimentům popisujícím relaci vyplývání mezi kvantifikovanýmo jmennými konstrukcemi, navrhujeme nové konfigurace klasi- fikátoru a optimalizujeme nastavení parametrů. Dosaženou přesnost predikce porovnáváme s původními výsledky a ukazujeme, že SVM klasifikátor s kvadrat- ickým polynomiálním jádrem dosahuje lepších výsledků....
3

Distributional models of multiword expression compositionality prediction / Modelos distribucionais para a predição de composicionalidade de expressões multipalavras

Cordeiro, Silvio Ricardo January 2018 (has links)
Sistemas de processamento de linguagem natural baseiam-se com frequência na hipótese de que a linguagem humana é composicional, ou seja, que o significado de uma entidade linguística pode ser inferido a partir do significado de suas partes. Essa expectativa falha no caso de expressões multipalavras (EMPs). Por exemplo, uma pessoa caracterizada como pão-duro não é literalmente um pão, e também não tem uma consistência molecular mais dura que a de outras pessoas. Técnicas computacionais modernas para inferir o significado das palavras com base na sua distribuição no texto vêm obtendo um considerável sucesso em múltiplas tarefas, especialmente após o surgimento de abordagens de word embeddings. No entanto, a representação de EMPs continua a ser um problema em aberto na área. Em particular, não existe um método consolidado que prediga, com base em corpora, se uma determinada EMP deveria ser tratada como unidade indivisível (por exemplo olho gordo) ou como alguma combinação do significado de suas partes (por exemplo tartaruga marinha). Esta tese propõe um modelo de predição de composicionalidade de EMPs com base em representações de semântica distribucional, que são instanciadas no contexto de uma variedade de parâmetros. Também é apresentada uma avaliação minuciosa do impacto desses parâmetros em três novos conjuntos de dados que modelam a composicionalidade de EMP, abrangendo EMPs em inglês, francês e português. Por fim, é apresentada uma avaliação extrínseca dos níveis previstos de composicionalidade de EMPs, através da tarefa de identificação de EMPs. Os resultados obtidos sugerem que a escolha adequada do modelo distribucional e de parâmetros de corpus pode produzir predições de composicionalidade que são comparáveis às observadas no estado da arte. / Natural language processing systems often rely on the idea that language is compositional, that is, the meaning of a linguistic entity can be inferred from the meaning of its parts. This expectation fails in the case of multiword expressions (MWEs). For example, a person who is a sitting duck is neither a duck nor necessarily sitting. Modern computational techniques for inferring word meaning based on the distribution of words in the text have been quite successful at multiple tasks, especially since the rise of word embedding approaches. However, the representation of MWEs still remains an open problem in the field. In particular, it is unclear how one could predict from corpora whether a given MWE should be treated as an indivisible unit (e.g. nut case) or as some combination of the meaning of its parts (e.g. engine room). This thesis proposes a framework of MWE compositionality prediction based on representations of distributional semantics, which we instantiate under a variety of parameters. We present a thorough evaluation of the impact of these parameters on three new datasets of MWE compositionality, encompassing English, French and Portuguese MWEs. Finally, we present an extrinsic evaluation of the predicted levels of MWE compositionality on the task of MWE identification. Our results suggest that the proper choice of distributional model and corpus parameters can produce compositionality predictions that are comparable to the state of the art.
4

Automatic Supervised Thesauri Construction with Roget’s Thesaurus

Kennedy, Alistair H 07 December 2012 (has links)
Thesauri are important tools for many Natural Language Processing applications. Roget's Thesaurus is particularly useful. It is of high quality and has been in development for over a century and a half. Yet its applications have been limited, largely because the only publicly available edition dates from 1911. This thesis proposes and tests methods of automatically updating the vocabulary of the 1911 Roget’s Thesaurus. I use the Thesaurus as a source of training data in order to learn from Roget’s for the purpose of updating Roget’s. The lexicon is updated in two stages. First, I develop a measure of semantic relatedness that enhances existing distributional techniques. I improve existing methods by using known sets of synonyms from Roget’s to train a distributional measure to better identify near synonyms. Second, I use the new measure of semantic relatedness to find where in Roget’s to place a new word. Existing words from Roget’s are used as training data to tune the parameters of three methods of inserting words. Over 5000 new words and word-senses were added using this process. I conduct two kinds of evaluation on the updated Thesaurus. One is on the procedure for updating Roget’s. This is accomplished by removing some words from the Thesaurus and testing my system's ability to reinsert them in the correct location. Human evaluation of the newly added words is also performed. Annotators must determine whether a newly added word is in the correct location. They found that in most cases the new words were almost indistinguishable from those already existing in Roget's Thesaurus. The second kind of evaluation is to establish the usefulness of the updated Roget’s Thesaurus on actual Natural Language Processing applications. These applications include determining semantic relatedness between word pairs or sentence pairs, identifying the best synonym from a set of candidates, solving SAT-style analogy problems, pseudo-word-sense disambiguation, and sentence ranking for text summarization. The updated Thesaurus consistently performed at least as well or better the original Thesaurus on all these applications.
5

Automatic Supervised Thesauri Construction with Roget’s Thesaurus

Kennedy, Alistair H 07 December 2012 (has links)
Thesauri are important tools for many Natural Language Processing applications. Roget's Thesaurus is particularly useful. It is of high quality and has been in development for over a century and a half. Yet its applications have been limited, largely because the only publicly available edition dates from 1911. This thesis proposes and tests methods of automatically updating the vocabulary of the 1911 Roget’s Thesaurus. I use the Thesaurus as a source of training data in order to learn from Roget’s for the purpose of updating Roget’s. The lexicon is updated in two stages. First, I develop a measure of semantic relatedness that enhances existing distributional techniques. I improve existing methods by using known sets of synonyms from Roget’s to train a distributional measure to better identify near synonyms. Second, I use the new measure of semantic relatedness to find where in Roget’s to place a new word. Existing words from Roget’s are used as training data to tune the parameters of three methods of inserting words. Over 5000 new words and word-senses were added using this process. I conduct two kinds of evaluation on the updated Thesaurus. One is on the procedure for updating Roget’s. This is accomplished by removing some words from the Thesaurus and testing my system's ability to reinsert them in the correct location. Human evaluation of the newly added words is also performed. Annotators must determine whether a newly added word is in the correct location. They found that in most cases the new words were almost indistinguishable from those already existing in Roget's Thesaurus. The second kind of evaluation is to establish the usefulness of the updated Roget’s Thesaurus on actual Natural Language Processing applications. These applications include determining semantic relatedness between word pairs or sentence pairs, identifying the best synonym from a set of candidates, solving SAT-style analogy problems, pseudo-word-sense disambiguation, and sentence ranking for text summarization. The updated Thesaurus consistently performed at least as well or better the original Thesaurus on all these applications.
6

Distributional models of multiword expression compositionality prediction / Modelos distribucionais para a predição de composicionalidade de expressões multipalavras

Cordeiro, Silvio Ricardo January 2018 (has links)
Sistemas de processamento de linguagem natural baseiam-se com frequência na hipótese de que a linguagem humana é composicional, ou seja, que o significado de uma entidade linguística pode ser inferido a partir do significado de suas partes. Essa expectativa falha no caso de expressões multipalavras (EMPs). Por exemplo, uma pessoa caracterizada como pão-duro não é literalmente um pão, e também não tem uma consistência molecular mais dura que a de outras pessoas. Técnicas computacionais modernas para inferir o significado das palavras com base na sua distribuição no texto vêm obtendo um considerável sucesso em múltiplas tarefas, especialmente após o surgimento de abordagens de word embeddings. No entanto, a representação de EMPs continua a ser um problema em aberto na área. Em particular, não existe um método consolidado que prediga, com base em corpora, se uma determinada EMP deveria ser tratada como unidade indivisível (por exemplo olho gordo) ou como alguma combinação do significado de suas partes (por exemplo tartaruga marinha). Esta tese propõe um modelo de predição de composicionalidade de EMPs com base em representações de semântica distribucional, que são instanciadas no contexto de uma variedade de parâmetros. Também é apresentada uma avaliação minuciosa do impacto desses parâmetros em três novos conjuntos de dados que modelam a composicionalidade de EMP, abrangendo EMPs em inglês, francês e português. Por fim, é apresentada uma avaliação extrínseca dos níveis previstos de composicionalidade de EMPs, através da tarefa de identificação de EMPs. Os resultados obtidos sugerem que a escolha adequada do modelo distribucional e de parâmetros de corpus pode produzir predições de composicionalidade que são comparáveis às observadas no estado da arte. / Natural language processing systems often rely on the idea that language is compositional, that is, the meaning of a linguistic entity can be inferred from the meaning of its parts. This expectation fails in the case of multiword expressions (MWEs). For example, a person who is a sitting duck is neither a duck nor necessarily sitting. Modern computational techniques for inferring word meaning based on the distribution of words in the text have been quite successful at multiple tasks, especially since the rise of word embedding approaches. However, the representation of MWEs still remains an open problem in the field. In particular, it is unclear how one could predict from corpora whether a given MWE should be treated as an indivisible unit (e.g. nut case) or as some combination of the meaning of its parts (e.g. engine room). This thesis proposes a framework of MWE compositionality prediction based on representations of distributional semantics, which we instantiate under a variety of parameters. We present a thorough evaluation of the impact of these parameters on three new datasets of MWE compositionality, encompassing English, French and Portuguese MWEs. Finally, we present an extrinsic evaluation of the predicted levels of MWE compositionality on the task of MWE identification. Our results suggest that the proper choice of distributional model and corpus parameters can produce compositionality predictions that are comparable to the state of the art.
7

Distributional models of multiword expression compositionality prediction / Modelos distribucionais para a predição de composicionalidade de expressões multipalavras

Cordeiro, Silvio Ricardo January 2018 (has links)
Sistemas de processamento de linguagem natural baseiam-se com frequência na hipótese de que a linguagem humana é composicional, ou seja, que o significado de uma entidade linguística pode ser inferido a partir do significado de suas partes. Essa expectativa falha no caso de expressões multipalavras (EMPs). Por exemplo, uma pessoa caracterizada como pão-duro não é literalmente um pão, e também não tem uma consistência molecular mais dura que a de outras pessoas. Técnicas computacionais modernas para inferir o significado das palavras com base na sua distribuição no texto vêm obtendo um considerável sucesso em múltiplas tarefas, especialmente após o surgimento de abordagens de word embeddings. No entanto, a representação de EMPs continua a ser um problema em aberto na área. Em particular, não existe um método consolidado que prediga, com base em corpora, se uma determinada EMP deveria ser tratada como unidade indivisível (por exemplo olho gordo) ou como alguma combinação do significado de suas partes (por exemplo tartaruga marinha). Esta tese propõe um modelo de predição de composicionalidade de EMPs com base em representações de semântica distribucional, que são instanciadas no contexto de uma variedade de parâmetros. Também é apresentada uma avaliação minuciosa do impacto desses parâmetros em três novos conjuntos de dados que modelam a composicionalidade de EMP, abrangendo EMPs em inglês, francês e português. Por fim, é apresentada uma avaliação extrínseca dos níveis previstos de composicionalidade de EMPs, através da tarefa de identificação de EMPs. Os resultados obtidos sugerem que a escolha adequada do modelo distribucional e de parâmetros de corpus pode produzir predições de composicionalidade que são comparáveis às observadas no estado da arte. / Natural language processing systems often rely on the idea that language is compositional, that is, the meaning of a linguistic entity can be inferred from the meaning of its parts. This expectation fails in the case of multiword expressions (MWEs). For example, a person who is a sitting duck is neither a duck nor necessarily sitting. Modern computational techniques for inferring word meaning based on the distribution of words in the text have been quite successful at multiple tasks, especially since the rise of word embedding approaches. However, the representation of MWEs still remains an open problem in the field. In particular, it is unclear how one could predict from corpora whether a given MWE should be treated as an indivisible unit (e.g. nut case) or as some combination of the meaning of its parts (e.g. engine room). This thesis proposes a framework of MWE compositionality prediction based on representations of distributional semantics, which we instantiate under a variety of parameters. We present a thorough evaluation of the impact of these parameters on three new datasets of MWE compositionality, encompassing English, French and Portuguese MWEs. Finally, we present an extrinsic evaluation of the predicted levels of MWE compositionality on the task of MWE identification. Our results suggest that the proper choice of distributional model and corpus parameters can produce compositionality predictions that are comparable to the state of the art.
8

Automatic Supervised Thesauri Construction with Roget’s Thesaurus

Kennedy, Alistair H January 2012 (has links)
Thesauri are important tools for many Natural Language Processing applications. Roget's Thesaurus is particularly useful. It is of high quality and has been in development for over a century and a half. Yet its applications have been limited, largely because the only publicly available edition dates from 1911. This thesis proposes and tests methods of automatically updating the vocabulary of the 1911 Roget’s Thesaurus. I use the Thesaurus as a source of training data in order to learn from Roget’s for the purpose of updating Roget’s. The lexicon is updated in two stages. First, I develop a measure of semantic relatedness that enhances existing distributional techniques. I improve existing methods by using known sets of synonyms from Roget’s to train a distributional measure to better identify near synonyms. Second, I use the new measure of semantic relatedness to find where in Roget’s to place a new word. Existing words from Roget’s are used as training data to tune the parameters of three methods of inserting words. Over 5000 new words and word-senses were added using this process. I conduct two kinds of evaluation on the updated Thesaurus. One is on the procedure for updating Roget’s. This is accomplished by removing some words from the Thesaurus and testing my system's ability to reinsert them in the correct location. Human evaluation of the newly added words is also performed. Annotators must determine whether a newly added word is in the correct location. They found that in most cases the new words were almost indistinguishable from those already existing in Roget's Thesaurus. The second kind of evaluation is to establish the usefulness of the updated Roget’s Thesaurus on actual Natural Language Processing applications. These applications include determining semantic relatedness between word pairs or sentence pairs, identifying the best synonym from a set of candidates, solving SAT-style analogy problems, pseudo-word-sense disambiguation, and sentence ranking for text summarization. The updated Thesaurus consistently performed at least as well or better the original Thesaurus on all these applications.
9

Parameters, Interactions, and Model Selection in Distributional Semantics

Lapesa, Gabriella 22 December 2020 (has links)
Distributional Semantic Models are one of the possible answers produced in (computational) semantics to the question of what the meaning of a word is. The distributional semantic answer to this question is a usage-based one, as distributional semantics models (henceforth, DSMs) are employed to produce semantic representations of words from co-occurrence patterns in texts or documents. DSMs have proven to be useful in many applications in the domains of Natural Language Processing. Despite this progress, however, a full understanding of the different parameters governing a DSM and their influence on model performance (which, in fact, is also important for getting a better linguistic understanding of neural word embeddings) has not been achieved yet. This is precisely the goal of this dissertation. Taken together, the experiments presented in this thesis represent (to the best of our knowledge) the largest-scope study in which window and syntax-based DSMs have been tested in all parameter settings. As a further contribution, the thesis proposes a novel methodology for the interpretation of evaluation results: we employ linear regression as a statistical tool to understand the impact of different parameters on model performance. In this way, we achieve a solid understanding of the influence of specific parameters and parameter interactions on DSM performance, which can inform the selection of DSM settings that are robust to overfitting. This thesis has a strong focus on cognitive data, that is, on DSM parameters that lend themselves to a cognitive interpretation and on evaluation tasks in which DSMs are tested in their capability of mirroring speakers’ behavior in psychological tasks (semantic priming and free associations). One of the most important contributions of this thesis is the consistent finding that neighbor rank (i.e., the rank of a word among the distributional neighbors of a target) is a better indicator of semantic similarity/relatedness than the distance in the semantic space, which is commonly used in the literature. The cognitive interpretation of this result is straightforward: neighbor rank, which is evaluated systematically for the first time in this thesis, is able to capture asymmetry in the relation between two words, while distance metrics, commonly employed in distributional semantics, are symmetric.
10

An empirical study of semantic similarity in WordNet and Word2Vec

Handler, Abram 18 December 2014 (has links)
This thesis performs an empirical analysis of Word2Vec by comparing its output to WordNet, a well-known, human-curated lexical database. It finds that Word2Vec tends to uncover more of certain types of semantic relations than others -- with Word2Vec returning more hypernyms, synonomyns and hyponyms than hyponyms or holonyms. It also shows the probability that neighbors separated by a given cosine distance in Word2Vec are semantically related in WordNet. This result both adds to our understanding of the still-unknown Word2Vec and helps to benchmark new semantic tools built from word vectors.

Page generated in 0.0963 seconds