81 |
Colocações verbais em um corpus de aprendizes brasileiros de inglês / Verbal collocations in a corpus of Brazilian learners of English
Danilo Suzuki Murakami 22 March 2016
There is a growing body of literature that recognizes the importance of collocations in English language learning. However, few studies have investigated the use of collocations in the writing of Brazilian learners of English. This research examines the role of verbal collocations in a subcorpus of the EF-Cambridge Open Language Database (EFCAMDAT). The subcorpus comprises essays written by advanced learners of English from Brazil. The methodological approach taken in this study is based on Corpus Linguistics techniques. For this investigation, a semi-automatic classification of all verbs was carried out with the aid of a corpus annotation program. Overall, the results indicate that nearly one out of every five verb-noun combinations is a collocation and that learners are not completely successful in their use of verbal collocations despite their advanced level. The use of verbal collocations was found to be deviant in 25% of the cases. The main type of inadequacy was the use of an inappropriate verb caused by the influence of Portuguese. A small number of syntactic patterns may also have been responsible for collocational deviations. More research on this topic is needed before the factors that determine the success rate can be fully understood. The findings should contribute to the field of English learning by Brazilians.
|
82 |
[en] THE CORPUS NEVER LIES: ON THE IDENTIFICATION AND USE OF MULTIWORD EXPRESSIONS / [pt] O CÓRPUS NÃO MENTE JAMAIS: SOBRE A IDENTIFICAÇÃO E USO DE COMBINAÇÕES MULTIVOCABULARES DO TIPO VERBO MAIS SINTAGMA NOMINAL
MILENA DE UZEDA GARRAO 22 August 2006
Much recent research on the identification and use of multiword expressions (MWEs) assumes a representational view of word meaning. In this study we claim that it is more productive to view MWEs from a non-representational perspective. By choosing this path, we avoid relying on time-consuming and controversial human intuitions for MWE identification and definition. Our methodology was tested on Brazilian Portuguese (BP) verb phrases of the V+NP pattern, which is very frequent in BP. It is a statistically based corpus analysis that can be summed up in three sequential steps: 1) a robust BP corpus as the basis for analysis; 2) application of a statistical test to the corpus, namely the log-likelihood test (Banerjee and Pedersen, 2003), in order to detect the most frequent MWEs of the V+NP pattern (such as tomar café) and to exclude casual, not otherwise motivated syntactic co-occurrences of the same lexical items; 3) application of similarity measures (Baeza-Yates and Ribeiro-Neto, 1999) between all the paragraphs containing a certain MWE (for example, fazer campanha) and all the paragraphs containing the noun outside the MWE (campanha). This last step was used to assess the MWE's degree of compositionality. We conclude that the higher the similarity between the paragraphs containing the MWE and the paragraphs containing the noun outside the expression, the more compositional the MWE. Therefore, we believe that this work has both a practical and a theoretical impact on semantics.
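
Steps 2 and 3 of the abstract above can be illustrated with a minimal sketch. It assumes the log-likelihood test is the usual G² statistic over a 2x2 contingency table of verb and noun counts, and that the similarity measure is cosine similarity over bag-of-words vectors; the abstract does not say which variants were used, so both choices and all function names here are illustrative assumptions.

```python
import math
from collections import Counter


def log_likelihood(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: verb followed by the noun (the candidate pair)
    k12: verb followed by any other word
    k21: any other word followed by the noun
    k22: all remaining pairs
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (lists of tokens)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

A high G² value flags a verb-noun pair as a candidate MWE. Comparing the pooled tokens of paragraphs containing the MWE with those containing only the noun then gives a rough compositionality score in the sense the abstract describes.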
|
83 |
Supervised machine learning for email thread summarization
Ulrich, Jan 11 1900
Email has become a part of most people's lives, and the ever-increasing number of messages people receive can lead to email overload. We attempt to mitigate this problem using email thread summarization. Summaries can be used for more than just replacing an incoming email message. They can serve in the business world as a form of corporate memory, or give a new team member an easy way to catch up on an ongoing conversation. Email threads are of particular interest for summarization because their conversational nature gives them a great deal of structural redundancy.
Our email thread summarization approach uses machine learning to pick which sentences from the email thread to use in the summary. A machine learning summarizer must be trained using previously labeled data, i.e., manually created summaries. After being trained, our summarization algorithm can generate summaries that on average contain over 70% of the same sentences as those chosen by human annotators. We show that labeling some key features, such as speech acts, meta sentences, and subjectivity, can improve performance to over 80% weighted recall.
To create such email summarization software, an email dataset is needed for training and evaluation. Since email communication is a private matter, it is hard to get access to real emails for research. Furthermore, these emails must also be annotated with human-generated summaries. As such annotated datasets are rare, we have created one and made it publicly available. The BC3 corpus contains annotations for 40 email threads, which include extractive summaries, abstractive summaries with links, and labeled speech acts, meta sentences, and subjective sentences.
While previous research has shown that machine learning algorithms are a promising approach to email summarization, there has not been a study on the impact of the choice of algorithm. We explore new techniques in email thread summarization using several different kinds of regression, and the results show that the choice of classifier is critical. We also present a novel feature set for email summarization and analyze two email corpora: the BC3 corpus and the Enron corpus. / Science, Faculty of / Computer Science, Department of / Graduate
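
The abstract does not define weighted recall. As a rough sketch, the code below assumes one common formulation for extractive summaries: each sentence is weighted by the number of annotators who selected it, and a summary's score is divided by the best score achievable with the same number of sentences. The function names and the vote dictionary are hypothetical.

```python
def select_top_sentences(scores, k):
    """Pick the k sentence ids with the highest predicted scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def weighted_recall(selected, annotator_votes):
    """Score a summary against annotator votes.

    selected: iterable of sentence ids chosen by the summarizer
    annotator_votes: dict mapping sentence id -> number of annotators
                     who included that sentence in their summary
    """
    selected = list(selected)
    achieved = sum(annotator_votes.get(s, 0) for s in selected)
    # Best possible score for a summary with the same number of sentences.
    best = sum(sorted(annotator_votes.values(), reverse=True)[:len(selected)])
    return achieved / best if best else 0.0
```

Under this assumption, a regression model only needs to output one score per sentence; `select_top_sentences` turns those scores into a summary, and `weighted_recall` compares it with the annotators' choices.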
|
84 |
Alternative Complementation in Partially Schematic Constructions: a Quantitative Corpus-based Examination of COME to V2 and GET to V2
Lester, Nicholas A. 05 1900
This paper examines two English polyverbal constructions, COME to V2 and GET to V2, as exemplified in Examples 1 and 2, respectively. (1) The senator came to know thousands of his constituents. (2) Little Johnny got to eat ice cream after every little league game. Previous studies have treated constructions of this type (though come and get as used here have not been sufficiently studied) as belonging to a special class of complement constructions, in which the infinitive is regarded as instantiating a separate, subordinate predication from that of the “matrix” or leftward finite verb. These constructions, however, deviate systematically from the various criteria proposed in previous research. This study uses the American National Corpus to investigate the statistical propensities of the target phenomena via lexico-syntactic (collostructional analysis) and morpho-syntactic (binary logistic regression) features, as captured through the lens of construction grammar.
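
Collostructional analysis is often carried out by testing, for each verb, whether it occurs in a construction's open slot more often than chance would predict. The sketch below assumes the widely used one-sided Fisher exact test on a 2x2 table; the abstract does not state which association measure the study used, and the function name and cell layout are illustrative.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for attraction in a 2x2 table.

    a: verb in the V2 slot of the construction
    b: verb elsewhere in the corpus
    c: other verbs in the V2 slot
    d: other verbs elsewhere
    Returns P(X >= a) under the hypergeometric null hypothesis.
    """
    row1 = a + b            # all occurrences of the verb
    col1 = a + c            # all occurrences of the construction
    n = a + b + c + d       # all verb tokens considered
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(n - row1, col1 - k)
                    for k in range(a, upper + 1))
    return numerator / comb(n, col1)
```

A small p-value suggests that the verb is attracted to the construction. Ranking verbs by this value gives the kind of lexico-syntactic profile the abstract refers to.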
|
85 |
Corpus design for Setswana lexicography
Otlogetswe, Thapelo Joseph 01 July 2008
This PhD thesis is about the design of a Setswana corpus for lexicography. While various corpora have been compiled and a variety of corpus-based studies attempted in African languages, no effort has been made towards corpus design. Additionally, although missionaries, grammarians and linguists have analysed the Setswana language extensively since the 1800s, none of that research concerns corpus design; most of it has been grammatical study of the language. Recent corpus research in African languages in general has focused on the use of corpora for compiling dictionaries, and little of it addresses corpus design. Pioneers of this kind of corpus research in African languages are Prinsloo and De Schryver (1999), De Schryver and Prinsloo (2000 and 2001) and Gouws and Prinsloo (2005). Because of the lack of research in corpus design, particularly in African languages, this thesis attempts to fill that gap, especially for Setswana. It is hoped that the findings of this study will inspire similar designs in other languages comparable to Setswana. We explore corpus design by measuring a variety of text types for lexical richness at comparable token points. The study asks whether a corpus compiled for lexicography must comprise texts drawn from different text types, or whether a corpus with a single text type could yield lexicographic information of the same quality as a corpus comprising diverse text varieties. This study therefore determines whether linguistic variability is crucial in corpus design for lexicography. / Thesis (PhD (African Languages))--University of Pretoria, 2008. / African Languages / unrestricted
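
Measuring lexical richness at comparable token points can be sketched as counting distinct word types after a fixed number of running tokens, so that text types of different sizes are compared at the same checkpoints. This is a minimal illustration under that assumption; the thesis may use a different richness measure, and the function name is hypothetical.

```python
def types_at_token_points(tokens, points):
    """Count distinct word types seen after each requested token count.

    tokens: list of word tokens in running order
    points: token counts at which to record the number of types
    Returns a dict {token_count: type_count}; points beyond the end of
    the text are omitted because they cannot be measured.
    """
    targets = {p for p in points if p > 0}
    seen = set()
    result = {}
    for n, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        if n in targets:
            result[n] = len(seen)
    return result
```

Running this on samples of each text type and comparing the type counts at the same checkpoint (say, every 10,000 tokens) shows which text types contribute more vocabulary for the same amount of text.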
|
86 |
Gender Vs. Sex: Defining Meaning in a Modern World through use of Corpora and Semantic Surveys
Garceau, Mary Elizabeth 08 June 2020
Considerable resources in U.S. legal studies are devoted to determining the precise meaning of contested terms, specifically in statutory interpretation. Traditional judicial approaches have defined meaning using dictionaries. This reliance has led to Mouritsen’s (2010) observation that "the judicial conception of lexical meaning—i.e., what judges think about what words mean … is often [subjectively] outcome determinative." Beginning with Mouritsen’s (2010) article, a movement in U.S. legal scholarship offers corpus linguistics as a more objective method of resolving contested meaning (Lee and Mouritsen, 2018). However, I assert that weaknesses still exist in contemporary applications of corpus linguistics to legal interpretation. I first review methodological differences in two corpus-based projects that attempt to resolve the meaning of the contested term "emoluments," a high-profile, Supreme Court-bound contemporary issue related to the legitimacy of the Trump presidency (Phillips and White, 2018; Cunningham and Egbert, 2019). Unfortunately, the results of these two studies are in conflict. Based upon a critique of these projects, I advocate for a more objective method of interpreting the results of corpus analyses using multiple human coders, following rater reliability research models often used in sociolinguistics and second language acquisition research. In order to test my assumptions, I apply this approach to using corpus linguistics to define the meaning of "sex" in two highly charged cases pending in the U.S. Supreme Court within the context of Title VII of the Civil Rights Act of 1964, which prohibits discrimination "because of. . . sex" (42 U.S.C. § 2000e-2(a)(1)). The first case, Harris Funeral Home v. EEOC, questions whether "sex" encompasses "gender identity," while the second, Altitude Express v. Zarda, asks whether the meaning of "sex" includes "sexual orientation." I discuss the results of this research model and its implications for further applications of corpus linguistics to the law.
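
Rater reliability between two human coders is often summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The abstract does not name the statistic used, so the sketch below is only an assumed illustration, with hypothetical sense labels.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each coder assigned labels independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c]
                   for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:  # both coders used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical codings of four concordance lines by two coders.
coder_1 = ["sense_1", "sense_1", "sense_2", "sense_2"]
coder_2 = ["sense_1", "sense_2", "sense_2", "sense_2"]
kappa = cohens_kappa(coder_1, coder_2)
```

Values near 1 indicate that the coders' readings of the concordance lines agree well beyond chance, while values near 0 indicate agreement no better than chance.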
|
87 |
Effect of Double Ovulation on Peripheral Concentrations of Progesterone, Luteal Blood Perfusion and Hepatic Steroid Inactivating Enzymes
Voelz, Benjamin Eugene 17 May 2014
Progesterone is essential for the maintenance of pregnancy in cattle. Recent trends in decreased reproductive efficiency in dairy cattle have led researchers to believe that increased catabolism and decreased peripheral concentrations of progesterone are at fault. The objective of this study was to determine if the induction of an accessory corpus luteum (CL), via human chorionic gonadotropin (hCG), alters blood perfusion of CL, peripheral concentrations of progesterone, or hepatic steroid inactivating enzymes. We hypothesized that the induction of an accessory CL would decrease blood perfusion of the CL, decrease peripheral concentrations of progesterone, and increase clearance of progesterone in the liver. Total blood perfusion of the CL was increased in cows with 2 CL compared to cows with 1 CL, but concentrations of progesterone and hepatic enzymes did not differ. Overall, the increased blood perfusion in cows with 2 CL did not alter concentrations of progesterone or progesterone clearance.
|
88 |
The Language of Trauma: A Linguistic Analysis of Interviews with Holocaust Survivors
Altman, Emilie January 2023
We performed quantitative analysis on transcriptions of 784 interviews with Holocaust survivors. The interviews were collected by the University of Southern California Shoah Foundation, and the first 15 minutes of each interview had been transcribed using automatic speech recognition. The survivors were an aging population, as the interviews were conducted around fifty years after the end of the Holocaust. We used statistical methods and algorithms to analyze the data, including keyness analysis, topic modeling, and emotionality analysis. We used the Corpus of Contemporary American English (COCA) as a comparative corpus for these analyses. Overall, we found that survivors prioritized themes of the Holocaust and their families in the interviews. Specific words and themes recurred across the corpus, demonstrating a collective and consistent memory of trauma. Our emotionality analyses revealed that survivors used slightly more positive language and fewer words relating to anger, disgust, and fear than the speakers in our comparative corpus. / Thesis / Master of Science (MSc) / For this thesis, we analyzed 784 transcribed interviews with Holocaust survivors. The interviews were conducted by the Shoah Foundation and took place from 1994 to 2000, around 50 years after the end of World War II. We compared the language in the interviews to the spoken component of a large corpus (collection of texts) called the Corpus of Contemporary American English (COCA). In our analyses, we found the words that are most representative of the survivors' language across the corpus. We also found the topics that were discussed most frequently in the interviews. Words and topics relating to family, Judaism, and experiences of the Holocaust were the most common. We also analyzed the emotionality of the survivors' language and found that overall, they used slightly more positive words than the words in COCA. They also used fewer words associated with the emotions anger, fear, and disgust.
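
Keyness analysis compares a word's frequency in a target corpus with its frequency in a reference corpus. The sketch below assumes the log-likelihood keyness statistic that is common in corpus linguistics; the abstract does not say which keyness measure was used, and the function name and parameters are illustrative.

```python
import math


def keyness_log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of one word across two corpora.

    freq_target / freq_ref: occurrences of the word in each corpus
    size_target / size_ref: total number of tokens in each corpus
    Returns 0 when the word is equally frequent (relative to corpus
    size) in both corpora; larger values mean a bigger difference.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected counts if the word were spread evenly over both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * ll
```

Sorting the vocabulary by this score, and keeping the words that are relatively more frequent in the target corpus, yields a list of words characteristic of the interviews when measured against the reference corpus.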
|
89 |
An Investigation of the Role of the Corpus Callosum in the Lateralised Skilled Reaching Task / The Corpus Callosum in the Lateralised Reaching Task
Mpandare, Farirai 09 1900
Long-term potentiation (LTP), a long-lasting enhancement of synaptic efficacy, is believed to be the mechanism by which memory storage occurs in the brain. Several studies have shown that LTP can be induced in various neural sites, not only by electrical stimulation, but also as a result of behavioural modifications. It has previously been shown that LTP in the primary motor cortex accompanies motor skill learning. One study showed that potentiation occurred following training on a lateralised skilled reaching task. In this task, animals are trained to use only one paw to grasp a small food pellet. An interesting finding from these studies is that, although only one hemisphere actively participates in the task (the trained hemisphere), the other hemisphere (the untrained hemisphere) also shows potentiation. This has led to the hypothesis that the corpus callosum is involved in the transfer of information from one hemisphere to the other during training on the reaching task. The nature of this communication, however, is unknown. Two possibilities were considered. The first was that the callosum transfers information that allows the animal to maintain its balance while the reaching paw is elevated. Careful observation of video recordings made while animals performed the task, however, failed to reveal any deficits in balance in animals that had undergone a callosal transection. A second possibility is that the corpus callosum transfers information about the task from the trained to the untrained hemisphere such that, even though it does not actively participate in the task, the untrained paw may "know" how to perform above chance level. Analysis of the rate of successful reaching with the untrained paw revealed no advantage for normal animals over transected animals. Work is, however, currently underway to increase the number of animals in the study in order to obtain a more conclusive result. / Thesis / Master of Science (MS)
|
90 |
Categorización semi-supervisada de Documentos usando la Web como corpus / Semi-supervised document categorization using the Web as corpus
Guzmán Cabrera, Rafael 04 December 2009
Most methods for automatic document categorization are based on supervised learning techniques and consequently require a large number of training instances. To address this problem, this thesis proposes a new semi-supervised method for document categorization, which automatically extracts unlabeled examples from the Web and incorporates them into the training set. The unlabeled examples added to the training set are selected by a method based on machine learning. This incremental model allows only the best unlabeled examples to be selected at each iteration. However, in some domains this technique does not improve classification accuracy, mainly when the labeled data are scattered. That is, the more closely the labeled examples are related to the category they belong to, the better the results obtained with this method. The method is independent of domain and language, and it is best suited to scenarios in which there are not enough manually labeled training instances. The method was evaluated experimentally in three document categorization experiments, both thematic (using collections with different document characteristics, such as very few training examples and a high degree of overlap) and non-thematic (an authorship attribution task). A fourth experiment was carried out on the task of word sense disambiguation. The results obtained in each of these experiments show the effectiveness of incorporating unlabeled data downloaded from the Web into the training set. / Guzmán Cabrera, R. (2009). Categorización semi-supervisada de Documentos usando la Web como corpus [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/6562
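
The incremental selection described in the abstract above resembles a self-training loop: train on the labeled set, label the unlabeled pool, keep only the most confidently labeled examples, and repeat. The sketch below is an assumed illustration, not the thesis's implementation. It swaps in a simple nearest-centroid classifier with cosine similarity, uses a hard-coded list in place of documents downloaded from the Web, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def _centroids(labeled):
    """Sum the word counts of all labeled texts, per category."""
    cents = defaultdict(Counter)
    for text, label in labeled:
        cents[label].update(text.lower().split())
    return cents


def _predict(text, centroids):
    """Return (label, confidence); confidence is the margin between
    the best and the second-best category similarity."""
    vec = Counter(text.lower().split())
    scores = sorted(((_cosine(vec, c), label) for label, c in centroids.items()),
                    reverse=True)
    best_score, best_label = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - runner_up


def self_train(labeled, unlabeled, per_iteration=1, max_iterations=5):
    """Grow the training set with the most confidently labeled examples."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_iterations):
        if not pool:
            break
        centroids = _centroids(labeled)
        ranked = sorted(((_predict(t, centroids), t) for t in pool),
                        key=lambda item: item[0][1], reverse=True)
        # Keep only the best examples, and only if the classifier
        # actually prefers one category over the others.
        chosen = [item for item in ranked[:per_iteration] if item[0][1] > 0]
        if not chosen:
            break
        for (label, _), text in chosen:
            labeled.append((text, label))
            pool.remove(text)
    return labeled
```

Because only the top-ranked examples are added in each round, the loop stops accepting new documents once none of the remaining ones clearly fits a category. This also suggests why scattered labeled data would limit the gains, as the abstract notes.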
|