1

The automatic acquisition of knowledge about discourse connectives

Hutchinson, Ben January 2005 (has links)
This thesis considers the automatic acquisition of knowledge about discourse connectives. It focuses in particular on their semantic properties, and on the relationships that hold between them. There is a considerable body of theoretical and empirical work on discourse connectives. For example, Knott (1996) motivates a taxonomy of discourse connectives based on relationships between them, such as HYPONYMY and EXCLUSIVE, which are defined in terms of substitution tests. Such work requires either great theoretical insight or manual analysis of large quantities of data. As a result, to date no manual classification of English discourse connectives has achieved complete coverage. For example, Knott gives relationships between only about 18% of pairs obtained from a list of 350 discourse connectives. This thesis explores the possibility of classifying discourse connectives automatically, based on their distributions in texts. This thesis demonstrates that state-of-the-art techniques in lexical acquisition can successfully be applied to acquiring information about discourse connectives. Central to this thesis is the hypothesis that distributional similarity correlates positively with semantic similarity. Support for this hypothesis has previously been found for word classes such as nouns and verbs (Miller and Charles, 1991; Resnik and Diab, 2000, for example), but there has been little exploration of the degree to which it also holds for discourse connectives. We investigate the hypothesis through a number of machine learning experiments. These experiments all use unsupervised learning techniques, in the sense that they do not require any manually annotated data, although they do make use of an automatic parser. First, we show that a range of semantic properties of discourse connectives, such as polarity and veridicality (whether or not the semantics of a connective involves some underlying negation, and whether the connective implies the truth of its arguments, respectively), can be acquired automatically with a high degree of accuracy. Second, we consider the tasks of predicting the similarity and substitutability of pairs of discourse connectives. To assist in this, we introduce a novel information theoretic function based on variance that, in combination with distributional similarity, is useful for learning such relationships. Third, we attempt to automatically construct taxonomies of discourse connectives capturing substitutability relationships. We introduce a probability model of taxonomies, and show that this can improve accuracy on learning substitutability relationships. Finally, we develop an algorithm for automatically constructing or extending such taxonomies which uses beam search to help find the optimal taxonomy.
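The central hypothesis above — that distributional similarity tracks semantic similarity — amounts in practice to comparing the contexts in which two connectives occur. A minimal sketch of that comparison follows; the connectives, context words, and counts are invented, and cosine over raw context counts is just one of several similarity measures one might use, not necessarily the one adopted in the thesis.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy context counts: words co-occurring in the clauses a connective links.
contexts = {
    "because":  Counter({"rain": 4, "late": 3, "tired": 2}),
    "since":    Counter({"rain": 3, "late": 2, "moved": 1}),
    "although": Counter({"late": 1, "happy": 3, "cold": 2}),
}

# Under the hypothesis, "because"/"since" should score higher than "because"/"although".
print(cosine(contexts["because"], contexts["since"]))
print(cosine(contexts["because"], contexts["although"]))
```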
2

Semi-supervised lexical acquisition for wide-coverage parsing

Thomforde, Emily Jane January 2013 (has links)
State-of-the-art parsers suffer from incomplete lexicons, as evidenced by the fact that they all contain built-in methods for dealing with out-of-lexicon items at parse time. Since new labelled data is expensive to produce and no amount of it will conquer the long tail, we attempt to address this problem by leveraging the enormous amount of raw text available for free, and expanding the lexicon offline, with a semi-supervised word learner. We accomplish this with a method similar to self-training, where a fully trained parser is used to generate new parses with which the next generation of parser is trained. This thesis introduces Chart Inference (CI), a two-phase word-learning method with Combinatory Categorial Grammar (CCG), operating on the level of the partial parse as produced by a trained parser. CI uses the parsing model and lexicon to identify the CCG category type for one unknown word in a context of known words by inferring the type of the sentence using a model of end punctuation, then traversing the chart from the top down, filling in each empty cell as a function of its mother and its sister. We first specify the CI algorithm, and then compare it to two baseline word-learning systems over a battery of learning tasks. CI is shown to outperform the baselines in every task, and to function in a number of applications, including grammar acquisition and domain adaptation. This method performs consistently better than self-training, and improves upon the standard POS-backoff strategy employed by the baseline StatCCG parser by adding new entries to the lexicon. The first learning task establishes lexical convergence over a toy corpus, showing that CI’s ability to accurately model a target lexicon is more robust to initial conditions than either of the baseline methods. We then introduce a novel natural language corpus based on children’s educational materials, which is fully annotated with CCG derivations. We use this corpus as a testbed to establish that CI is capable in principle of recovering the whole range of category types necessary for a wide-coverage lexicon. The complexity of the learning task is then increased, using CCGbank, a CCG version of the Penn Treebank, and showing that CI improves as its initial seed corpus is increased. The next experiment uses CCGbank as the seed and attempts to recover missing question-type categories in the TREC question answering corpus. The final task extends the coverage of the CCGbank-trained parser by running CI over the raw text of the Gigaword corpus. Where appropriate, a fine-grained error analysis is also undertaken to supplement the quantitative evaluation of parser performance with deeper reasoning about the linguistic properties of the lexicon and parsing model.
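To make the chart-traversal idea concrete, here is a minimal sketch of the "fill an empty cell from its mother and its sister" step, restricted to forward and backward application and to string-encoded categories. The real Chart Inference algorithm also uses the parsing model, an end-punctuation model, and further combinators, so this is only an illustrative simplification, not the thesis's implementation.

```python
def infer_unknown(mother: str, sister: str, unknown_on_right: bool) -> list[str]:
    """Candidate CCG categories for an unknown daughter cell, assuming the mother
    cell was built by forward ('/') or backward ('\\') application.
    Categories are plain strings, e.g. 'S', 'NP', '(S\\NP)'."""
    candidates = []
    if unknown_on_right:
        # Case 1: the known sister is a forward functor mother/unknown.
        if sister.startswith(f"({mother}/") and sister.endswith(")"):
            candidates.append(sister[len(mother) + 2:-1])
        # Case 2: the unknown word is a backward functor taking the sister as argument.
        candidates.append(f"({mother}\\{sister})")
    else:
        # Mirror image: sister as backward functor, or unknown as forward functor.
        if sister.startswith(f"({mother}\\") and sister.endswith(")"):
            candidates.append(sister[len(mother) + 2:-1])
        candidates.append(f"({mother}/{sister})")
    return candidates

# Unknown word to the right of a known NP, under a mother cell of type S
# (e.g. "Kim snorfled"): one candidate is the intransitive-verb category (S\NP).
print(infer_unknown("S", "NP", unknown_on_right=True))        # ['(S\\NP)']
print(infer_unknown("S", "(S\\NP)", unknown_on_right=False))  # ['NP', '(S/(S\\NP))']
```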
3

The Representation of Newly Learned Words in the Mental Lexicon

Qiao, Xiaomei January 2009 (has links)
Most research in word recognition uses words that already exist in the reader's lexicon, and it is therefore of interest to see whether newly learned words are represented and processed in the same way as already known words. For example, are newly learned words immediately represented in a special form of lexical memory, or is there a gradual process of assimilation? As for L2 language learners, are newly learned words incorporated into the same processing system that serves L1, or are they represented quite independently? The current study examines this issue by testing for the existence of the Prime Lexicality Effect (PLE) observed in masked priming experiments (Forster & Veres, 1998). Strong form priming was found with nonword primes (e.g., contrapt-CONTRACT), but not with word primes (e.g., contrast-CONTRACT). This effect is generally assumed to result from competition between the prime and the target. So if readers had been trained to treat "contrapt" as a new word, would it now function like a word and produce much weaker priming? Elgort (2007) demonstrated such an effect with unmasked primes in L2 bilinguals. The current study investigates the PLE in both L1 and L2 speakers under different training conditions. When the training program involved mere familiarization (learning to type the words), a PLE was found with visible primes, but not with masked primes, which suggests that the unmasked PLE is not the best indicator of lexicalization. In the case of "real" acquisition, where the new word is given a definition and a picture of the object it refers to and learning is spread over two weeks, a clear PLE was obtained. However, when the same experiment was carried out on Chinese-English bilinguals using the same English materials, completely opposite results were obtained. Learning enhanced priming, rather than reducing it, suggesting that the L2 lexicon might differ qualitatively from the L1 lexicon. The implications of these results for competitive theories of lexical access are discussed, and alternative explanations are considered.
4

Statistical modeling of multiword expressions

Su, Kim Nam January 2008 (has links)
In natural languages, words can occur in single units called simplex words or in a group of simplex words that function as a single unit, called multiword expressions (MWEs). Although MWEs are similar to simplex words in their syntax and semantics, they pose their own sets of challenges (Sag et al. 2002). MWEs are arguably one of the biggest roadblocks in computational linguistics due to the bewildering range of syntactic, semantic, pragmatic and statistical idiomaticity they are associated with, and their high productivity. In addition, the large numbers in which they occur demand specialized handling. Moreover, dealing with MWEs has a broad range of applications, from syntactic disambiguation to semantic analysis in natural language processing (NLP) (Wacholder and Song 2003; Piao et al. 2003; Baldwin et al. 2004; Venkatapathy and Joshi 2006).

Our goals in this research are: to use computational techniques to shed light on the underlying linguistic processes giving rise to MWEs across constructions and languages; to generalize existing techniques by abstracting away from individual MWE types; and finally to exemplify the utility of MWE interpretation within general NLP tasks.

In this thesis, we target English MWEs due to resource availability. In particular, we focus on noun compounds (NCs) and verb-particle constructions (VPCs) due to their high productivity and frequency.

Challenges in processing noun compounds are: (1) interpreting the semantic relation (SR) that represents the underlying connection between the head noun and modifier(s); (2) resolving syntactic ambiguity in NCs comprising three or more terms; and (3) analyzing the impact of word sense on noun compound interpretation. Our basic approach to interpreting NCs relies on the semantic similarity of the NC components using firstly a nearest-neighbor method (Chapter 5), then verb semantics based on the observation that it is often an underlying verb that relates the nouns in NCs (Chapter 6), and finally semantic variation within NC sense collocations, in combination with bootstrapping (Chapter 7).

Challenges in dealing with verb-particle constructions are: (1) identifying VPCs in raw text data (Chapter 8); and (2) modeling the semantic compositionality of VPCs (Chapter 5). We place particular focus on identifying VPCs in context, and measuring the compositionality of unseen VPCs in order to predict their meaning. Our primary approach to the identification task is to adapt localized context information derived from linguistic features of VPCs to distinguish between VPCs and simple verb-PP combinations. To measure the compositionality of VPCs, we use semantic similarity among VPCs by testing the semantic contribution of each component.

Finally, we conclude the thesis with a chapter-by-chapter summary and outline of the findings of our work, suggestions of potential NLP applications, and a presentation of further research directions (Chapter 9).
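As a rough illustration of the compositionality measurement described above, one can compare a VPC's own context vector with a combination of its components' vectors. The vectors, weights, and the specific cosine-based score below are invented for illustration and are not the thesis's exact formulation.

```python
import numpy as np

def compositionality(vpc_vec, verb_vec, particle_vec, alpha=0.8):
    """Score how compositional a verb-particle construction is: cosine between the
    VPC's own context vector and a weighted sum of its components' vectors
    (a simplified stand-in for the thesis's component-contribution tests)."""
    composed = alpha * verb_vec + (1 - alpha) * particle_vec
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos(vpc_vec, composed)

# Toy context-count vectors over a tiny shared vocabulary; values are invented.
eat_up  = np.array([4.0, 1.0, 0.0, 3.0])   # "eat up"  (fairly compositional)
give_up = np.array([0.0, 5.0, 4.0, 0.0])   # "give up" (idiomatic)
eat     = np.array([5.0, 1.0, 0.0, 2.0])
give    = np.array([1.0, 1.0, 0.0, 4.0])
up      = np.array([1.0, 1.0, 1.0, 1.0])

print(compositionality(eat_up, eat, up))    # higher -> more compositional
print(compositionality(give_up, give, up))  # lower  -> more idiomatic
```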
5

The Unsupervised Acquisition of a Lexicon from Continuous Speech

Marcken, Carl de 18 January 1996 (has links)
We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency.
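The MDL idea can be illustrated with a toy description-length calculation: a lexicon of word-like units pays an up-front codebook cost but greatly shortens the encoding of the data. The coding scheme below (flat per-character codebook cost, greedy longest-match segmentation, frequency-based token costs) is a drastic simplification invented for illustration, not the coding scheme of the paper.

```python
from math import log2

def description_length(corpus: str, lexicon: list[str]) -> float:
    """Toy MDL score: bits to write the lexicon entries character by character,
    plus bits to encode the corpus greedily as a sequence of lexicon entries,
    each costing -log2 of its relative frequency."""
    # Cost of the codebook: a flat per-character cost for each entry.
    codebook_bits = sum(len(w) for w in lexicon) * log2(27)  # 26 letters + delimiter

    # Greedy longest-match segmentation of the corpus with the lexicon.
    tokens, i = [], 0
    while i < len(corpus):
        match = max((w for w in lexicon if corpus.startswith(w, i)), key=len)
        tokens.append(match)
        i += len(match)

    # Cost of the data given the codebook.
    counts = {w: tokens.count(w) for w in set(tokens)}
    data_bits = sum(-log2(counts[t] / len(tokens)) for t in tokens)
    return codebook_bits + data_bits

corpus = "thedogthedogthecat"
print(description_length(corpus, list("abcdefghijklmnopqrstuvwxyz")))  # characters only
print(description_length(corpus, ["the", "dog", "cat"]))               # word-like units win
```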
6

Exploiting Linguistic Knowledge to Infer Properties of Neologisms

Cook, C. Paul 14 February 2011 (has links)
Neologisms, or newly-coined words, pose problems for natural language processing (NLP) systems. Due to the recency of their coinage, neologisms are typically not listed in computational lexicons---dictionary-like resources that many NLP applications depend on. Therefore when a neologism is encountered in a text being processed, the performance of an NLP system will likely suffer due to the missing word-level information. Identifying and documenting the usage of neologisms is also a challenge in lexicography, the making of dictionaries. The traditional approach to these tasks has been to manually read a lot of text. However, due to the vast quantities of text being produced nowadays, particularly in electronic media such as blogs, it is no longer possible to manually analyze it all in search of neologisms. Methods for automatically identifying and inferring syntactic and semantic properties of neologisms would therefore address problems encountered in both natural language processing and lexicography. Because neologisms are typically infrequent due to their recent addition to the language, approaches to automatically learning word-level information relying on statistical distributional information are in many cases inappropriate. Moreover, neologisms occur in many domains and genres, and therefore approaches relying on domain-specific resources are also inappropriate. The hypothesis of this thesis is that knowledge about etymology---including word formation processes and types of semantic change---can be exploited for the acquisition of aspects of the syntax and semantics of neologisms. Evidence supporting this hypothesis is found in three case studies: lexical blends (e.g., "webisode" a blend of "web" and "episode"), text messaging forms (e.g., "any1" for "anyone"), and ameliorations and pejorations (e.g., the use of "sick" to mean `excellent', an amelioration). Moreover, this thesis presents the first computational work on lexical blends and ameliorations and pejorations, and the first unsupervised approach to text message normalization.
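As a rough sketch of how knowledge of the blend formation process can be exploited, the toy heuristic below proposes source-word pairs for a candidate blend by matching a prefix of one known word and a suffix of another. The wordlist and function are invented for illustration; the thesis's actual approach is statistical rather than this simple lookup.

```python
def candidate_sources(blend: str, wordlist: set[str], min_len: int = 2):
    """Find (prefix-word, suffix-word) pairs that could have formed a blend:
    the blend starts with a prefix of the first word and ends with a suffix
    of the second."""
    pairs = []
    for split in range(min_len, len(blend) - min_len + 1):
        head, tail = blend[:split], blend[split:]
        heads = {w for w in wordlist if w.startswith(head) and w != blend}
        tails = {w for w in wordlist if w.endswith(tail) and w != blend}
        pairs.extend((h, t) for h in heads for t in tails if h != t)
    return pairs

wordlist = {"web", "episode", "breakfast", "lunch", "smoke", "fog"}
print(candidate_sources("webisode", wordlist))  # expect ('web', 'episode')
print(candidate_sources("brunch", wordlist))    # expect ('breakfast', 'lunch')
```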
7

The basic vocabulary of Portuguese in the process of mother-tongue acquisition

Souza, Vanzorico Carlos de. January 2005 (has links)
Advisor: Claudia Maria Xatara. Committee: Waldenice Moreira Cano, Claudia Zavaglia. Degree: Master's.

Teachers of all subjects are unanimous in pointing out that one of the greatest difficulties in learning is precisely the limited vocabulary of their students. We know that solving the vocabulary problem alone is not enough for students to understand texts without major difficulty. On the one hand, it is undeniable that command of a varied vocabulary certainly helps the learning process, and lexicology can therefore make substantial contributions to mother-tongue teaching and to the development of a Basic Vocabulary of Fundamental Portuguese. On the other hand, we recognize that an individual's lexical competence is in fact a social product, that is, the result of his or her interactions within the society in which he or she lives. That society is not homogeneous but divided into social classes, so it is to be expected that different social groups within the same society use different vocabularies in their everyday interactions; the lexicon, consequently, is organized differently across social strata. This work presents the results of a study in which we tested the lexical competence of eighth-grade (Ensino Fundamental) students at two schools representative of different social groups. We examined whether socioeconomic condition interferes with the acquisition of the Basic Vocabulary of Brazilian Portuguese, whose lexical units were distributed across several semantic fields, and how the textbook, an important pedagogical instrument concerned above all with text interpretation, treats the lexicon and whether that treatment contributes to the acquisition and expansion of students' vocabulary.
8

Verb acquisition and morphological processing by children acquiring Brazilian Portuguese (BP)

Molina, Daniele de Souza Leite 31 March 2014 (has links)
This work investigates whether children acquiring Brazilian Portuguese (BP) recognize the verb root as the part of the verb that carries its permanent meaning, despite the inflectional variation introduced by affixes. Previous studies suggest that children in the initial period of lexical acquisition treat words with different phonological forms as entirely different words (JUSCZYK; ASLIN, 1995; BORTFELD et al., 2005; SHI; LEPAGE, 2008; JUSCZYK; HOUSTON; NEWSOME, 1999). In this sense, morphology could pose a problem for lexical acquisition, since morphological processes (derivation and, above all, inflection) create words that are phonologically distinct but related in meaning. As our theoretical foundation, we adopt the proposal (CORRÊA, 2006; 2009a; 2011) that reconciles the linguistic theory of the Minimalist Program (CHOMSKY, 1995 and later work) with the Phonological Bootstrapping model of psycholinguistic processing oriented to language acquisition (MORGAN; DEMUTH, 1996; CHRISTOPHE et al., 1997), with the aim of characterizing the passage from a phonological and distributional analysis of the input to the syntactic treatment of linguistic utterances. We also consider the Syntactic Bootstrapping hypothesis (GLEITMAN, 1990), according to which syntactic structure (the argument frame) guides the mapping of sentence meaning. We therefore seek to determine at what age children acquiring BP map variants of the same verb onto the same base concept. Our hypothesis is that it is by recognizing recurrent verbal affixes in the language being acquired that the child segments the verb internally into root and affixes, assigning the permanent concept to the root. Results obtained with a picture identification task suggest that three-year-olds tend to map an action onto a novel verb, although memory load appears to limit this mapping, and children in this age group seem uncertain about the meaning of the verb's inflectional variants. At four years of age, robust data from picture identification and act-out tasks show that children map an action onto a novel verb and treat its variants as sharing the same base meaning. In addition, an experiment using the split-screen preferential looking paradigm, a more sensitive technique, points to the same mapping and treatment of verbal variants by younger children, around two years of age. Within the theoretical framework assumed here, these results indicate that a pseudoword is treated as a verb on the basis of distributional cues. They can also be interpreted as evidence of the internal segmentation of the verb and the consequent recognition of the verbal root as the part of the word that carries its base meaning, acquired from observational cues.
9

The identification of nouns and adjectives by children acquiring Brazilian Portuguese (BP)

Almeida, Christiano Pereira de 29 August 2007 (has links)
This dissertation deals with the process by which elements of the lexical categories Noun and Adjective are identified in Brazilian Portuguese (BP), addressing specificities of BP such as the possible variation in the order of the constituents of a complex DP and the absence of morphophonological marks that distinguish adjectives from nouns. The working hypothesis is that, at an early stage of language acquisition, the structural information made available by the preferred order in which these elements occur in BP acts as a robust cue that the child exploits. The theoretical approach adopted seeks to reconcile a model of language that can address the phenomenon of acquisition (Minimalist Program: Chomsky, 1995) with processing models oriented to language acquisition (bootstrapping models). Experimental studies were carried out, and their results point to the importance of structural information at an early stage, and of semantic information at later stages, of the process of identifying nouns and adjectives.
