Global ETD Search

131	Gramaticalização e Preposições Complexas do Português: um estudo baseado em corpus / Grammaticalization and complex prepositions Camilla Canella Moraes Luzorio 31 March 2008 Este trabalho apresenta um estudo que aplica a teoria de gramaticalização a um corpus eletrônico diacrônico a fim de dar conta das mudanças ocorridas em estruturas da língua portuguesa normalmente denominadas Preposições Complexas. O estudo teve como objetivos: 1) investigar as preposições complexas em face de, em face a, face a, em vista de, em frente de, em frente a e frente a com vistas a compreender seu funcionamento em termos sintáticos e semânticos a fim de verificar se elas estão se gramaticalizando; 2) examinar textos de períodos históricos diferentes de modo que se compreenda a possível trajetória empreendida por tais formas entre os séculos XIV e XX; 3) averiguar se os itens frente a e face a podem ser considerados reduções das formas em frente a e em face a, respectivamente. A teoria da gramaticalização forneceu um arcabouço teórico para explicar os fenômenos de mudança que afetam os itens lingüísticos. O processo de gramaticalização consiste na passagem de uma construção de um status lexical para um status gramatical ou de um status menos gramatical para um mais gramatical. Um dos fatores desencadeantes desse processo é a freqüência de uso que leva o item a ser mais previsível e estável. A Lingüística de Corpus entra nesta pesquisa fornecendo a metodologia de compilação, extração e observação dos dados, pois à semelhança dos estudos de Hoffman (2005) foi realizada uma investigação baseada em corpora eletrônicos. O corpus base foi o Corpus do Português, composto por textos em língua portuguesa escritos a partir do século XIV até o século XX, disponível online em http://www.corpusdoportugues.org/.
Verificou-se que as preposições complexas analisadas ascenderam a escala de gramaticalidade, pois se expandiram suas possibilidades de uso através do desenvolvimento de polissemias de semântica abstrata. Constatou-se, ainda, que, em muitos sentidos, elas coexistem como camadas, mas que pode haver uma tendência que conduzirá a escolha de uma forma para expressar cada sentido evidenciado / The present dissertation introduces a study which applies the theory of Grammaticalization to a digital diachronic corpus, with a view to mapping some of the changes which have taken place in certain structures of Portuguese, the so-called complex prepositions. The objectives of the research were threefold. First, the study aimed at investigating the complex prepositions em face de, em face a, face a, em vista de, em frente de, em frente a and frente a, in order to understand their syntactic and semantic functioning and, in turn, to evaluate whether they are undergoing a process of grammaticalization. Secondly, the study sought to examine texts from a variety of historical periods, so as to map a possible trajectory taken by the aforementioned forms between the 14th and the 20th centuries. Thirdly, the study intended to verify whether the items frente a and face a may be considered reductions of em frente a and em face a, respectively. The theoretical framework for the study has been taken from Grammaticalization, a theory which explains the phenomena of change which affect linguistic items. The process of grammaticalization consists in a construction moving from lexical to grammatical status, or from a less grammatical to a more grammatical status. One of the triggering factors of this process is frequency of use, which makes the item more predictable and stable. Corpus Linguistics has provided a methodology for the compilation, extraction and observation of the textual data in this dissertation. Similarly to Hoffman (2005), the investigation here was based on electronic corpora.
The study corpus was the Corpus do Português, which consists of texts in Portuguese, written between the 14th and the 20th century, available at http://www.corpusdoportugues.org/. The study suggests that the complex prepositions analysed have become increasingly grammaticalised, because they have acquired additional abstract meanings. It has also been observed that, in many ways, these abstract meanings coexist as layers. However, there seems to be a tendency for one form to become the preferred way of expressing each of these new meanings Gramaticalização Preposição complexa Lingüística de Corpus Grammaticalization Complex prepositions Corpus Linguistics LINGUISTICA
132	Brasil brasileiro: o léxico e a identidade nacional / Brazilian Brazil : lexis and national identity Lúcia Deborah Ramos de Araújo 15 May 2010 Esta pesquisa dedica-se a realizar um trabalho com base no diálogo entre teorias semióticas e a Linguística de Córpus, estudando, especificamente, marcas linguísticas que possam caracterizar o perfil do brasileiro e suas características socioculturais plurais. Interessam-nos, sobretudo, os substantivos e adjetivos em função nomeadora e/ou qualificadora dos termos Brasil e brasileiro. Com isso, pretende-se oferecer um panorama bastante próximo da realidade linguística do brasileiro e de sua identidade. Para que os resultados sejam significativos, contamos com o concurso da Linguística de Córpus, servindo-nos de base a obra Linguística de Corpus (SARDINHA, 2004). Com a Linguística de Córpus, adotando a pesquisa direcionada pelo córpus (corpus-driven research) como metodologia, se pôde levantar, quantificar e tabular os signos em uso, identificando-lhes a frequência e a organização em feixes lexicais para avaliá-los quanto à significância no trato comunicativo. No desenvolvimento da análise e leitura crítica dos dados coletados, amparou-nos a Semiótica de extração peirceana, mais especificamente da Teoria da Iconicidade Verbal (SIMÕES, 2007), que permitiu delinear o potencial icônico das palavras de busca e de seus colocados. Com relação ao conceito de identidade em suas faces filosófica, social e antropológica, fornecem-nos suporte os pensamentos de NIETZSCHE (1991) acerca da necessidade do esquecimento para a construção de uma identidade e de HALL (1998), quanto aos eixos temporais que presidem o processamento discursivo dos fatos históricos e, por conseguinte, da construção identitária. O contraponto entre estes últimos autores contribui para a definição dos gêneros textuais interessantes à pesquisa, basicamente os textos argumentativos, publicados em jornais de grande circulação, no eixo Rio-São Paulo.
A respeito da identidade na sociedade em rede, característica da contemporaneidade, apoia-nos obra de CASTELLS (2006). Os estudos específicos sobre a identidade nacional amparam-se sobretudo em DAMATTA (1978 e 1989) e LEITE (2007). A pesquisa demonstrou que a iconicidade lexical vem a ser mais apropriadamente delineada a partir de um universo de dados amplo, ao qual se tem acesso a partir da Linguística de Córpus, sendo, portanto, correto afirmar que os traços componentes da identidade brasileira podem ser apreendidos em seu estágio atual com base na análise de um córpus construído a partir de textos publicados em jornais, representativos das vozes e do pensamento de um estrato social formador de opinião. No contexto de transformações sociais e políticas que ocorrem no Brasil entre os anos 2005 e 2010, a investigação da identidade nacional e a apuração do autoconceito do brasileiro pôde apontar que alguns paradigmas historicamente estabelecidos estão sendo alterados, enquanto outros ainda persistem. O perfil identitário apurado pela pesquisa favorece a construção, por parte do estudioso da linguagem e, mais especificamente, do docente de língua portuguesa, de uma visão atualizada da identidade nacional, no recorte analisado, permitindo um trabalho consciente com as habilidades e competências vinculadas ao desenvolvimento da identidade nacional, conforme orientam os Parâmetros Curriculares Nacionais / This research sets out to carry out a study based on the dialogue between semiotic theories and Corpus Linguistics, studying, specifically, the linguistic marks that may characterize the profile of Brazilians and their plural socio-cultural characteristics. Our special interest is in the nouns and adjectives that name and/or qualify the terms 'Brazil' and 'Brazilian'. Through this study, we intend to offer a panorama that is very close to the linguistic reality of the Brazilian people and their identity.
We worked within Corpus Linguistics, based on the book Corpus Linguistics (SARDINHA, 2004). We chose corpus-driven research as a method, which allowed us to collect, quantify and tabulate the signs in use, in order to identify their frequency and their organization in lexical bundles, so that they could be evaluated as to their significance in communication. The theories and works that supported this thesis were the Semiotics of Charles Sanders PEIRCE (2000), the works on semiotics by ECO (2007) and SANTAELLA (1996, 2000 and 2001), and the Theory of Verbal Iconicity (SIMÕES, 2007). The latter was used to establish the iconic potential of the search words in their context. Regarding the philosophical, social and anthropological readings of identity, this work is supported by the thoughts of NIETZSCHE (1991) on the need for forgetfulness in order to build an identity. Another work which supports our conclusions is HALL's paper (1998) on the timelines that govern the discursive processing of historical facts, which shows how they interfere in the construction of identity. The counterpoint between these latter authors contributes to the definition of the text genres relevant to this research: basically argumentative texts, published in major newspapers in Rio and São Paulo. Regarding identity in the network society as a contemporary issue, the work of CASTELLS (2006) was of great help. The studies on Brazilian identity by DAMATTA (1978 and 1989) and LEITE (2002) also provide a basis for the considerations of this thesis. The research showed that lexical iconicity is more appropriately outlined from a broad universe of data, which was provided by a large corpus (approximately 8 million words) handled with Corpus Linguistics methodology.
It is therefore correct to say that the components of Brazilian identity may be grasped in their current state based on the analysis of a corpus built from texts published in newspapers, representing the voices and thoughts of an opinion-forming social stratum. The investigation of national identity and of the self-concept of Brazilians, in the context of the social and political transformations that occurred in Brazil between 2005 and 2010, indicated that some historically established paradigms have been going through a process of change, while others have persisted. The National Curriculum Parameters in Brazil establish topics on national identity to be developed by teachers of Portuguese. The results of this work are meant to be helpful to these teachers. Iconicidade Ensino Língua Portuguesa Linguística de córpus Identity Corpus Linguistics Portuguese Language Education Iconicity LINGUA PORTUGUESA
133	Prosa argumentativa em língua inglesa: um estudo contrastivo sobre advérbios em corpora digitais / Argumentative prose in English language: a contrastive study about adverbs in digital corpora Maria Izabel de Andrade Almeida 30 March 2010 Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro / Esta pesquisa tem como objetivo principal investigar como aprendizes brasileiros de língua inglesa usam advérbios com terminação em -ly no inglês escrito, e comparar ao uso que deles fazem os falantes de inglês como língua materna. Para tanto, o trabalho encontra suporte teórico e metodológico na Linguística de Corpus e fundamenta-se na área chamada de pesquisa sobre corpora de aprendizes, que se ocupa da coleta e armazenagem de dados linguísticos de sujeitos aprendizes de uma língua estrangeira, para a formação de um corpus que possa ser utilizado para fins descritivos e pedagógicos. Esta área objetiva identificar em que aspectos os aprendizes diferem ou se assemelham aos falantes nativos. Os corpora empregados na pesquisa são o corpus de estudo (Br-ICLE), contendo inglês escrito por brasileiros, compilado de acordo com o projeto ICLE (International Corpus of Learner English) e dois corpora de referência (LOCNESS e BAWE), contendo inglês escrito por falantes de inglês como língua materna. Os resultados indicam que os alunos brasileiros usam, em demasia, as categorias de advérbios que indicam veracidade, realidade e intensidade, em relação ao uso que deles fazem os falantes nativos, além de usarem esses advérbios de forma distinta. Os resultados sugerem que, além das diferenças apresentadas em termos de frequência (seja pelo sobreuso ou subuso dos advérbios), os aprendizes apresentavam combinações errôneas, ou em termos de colocados ou em termos de prosódia semântica. E finalmente a pesquisa revela que a preferência dos aprendizes por advérbios que exprimem veracidade, realidade e intensidade cria a impressão de um discurso muito assertivo.
Conclui-se que as diferenças encontradas podem estar ligadas a fatores como o tamanho dos corpora, a influência da língua materna dos aprendizes, a internalização dos elementos linguísticos necessários para a produção de um texto em língua estrangeira, a falta de fluência dos aprendizes e o contexto de sala de aula nas universidades / This research investigates how Brazilian learners of English use adverbs ending in -ly in written English and compares their use to that of speakers of English as a mother tongue. To this end, the work draws on Corpus Linguistics for both theoretical and methodological support. The research is based on the area called Learner Corpora Research, which deals with the collection, storage and analysis of linguistic data produced by learners of a foreign language, which can then be used for descriptive and teaching purposes. This area aims to identify ways in which learners' use of the foreign language is different from or similar to that of native speakers. The data used in this research are the study corpus (Br-ICLE), containing written English produced by Brazilian learners, built according to the ICLE project (International Corpus of Learner English), as well as two reference corpora (LOCNESS and BAWE) containing written English produced by speakers of English as a mother tongue. The results indicate that Brazilian learners overuse the categories of adverbs that indicate truth, reality and intensity in comparison to the use made by native speakers; furthermore, they use these adverbs in different ways. The results also suggest that, in addition to the differences in frequency (either overuse or underuse of adverbs), the learners produce erroneous combinations in terms of collocates or in terms of semantic prosody. Finally, the research reveals that the learners' preference for adverbs expressing truth, reality and intensity creates the impression of a very assertive discourse.
We conclude that these differences may be related to factors such as the size of the corpora, the influence of the learners' mother tongue, the internalization of the linguistic elements needed to produce a text in a foreign language, the learners' lack of fluency and the classroom context in the universities. Linguística de Corpus Corpora de Aprendizes Análise Contrastiva Corpus Linguistics Learner Corpora Contrastive Analysis LINGUISTICA
134	Análise de quadrigramas na escrita em inglês como língua estrangeira: um estudo baseado em corpus / Analysis of quadrigrams in EFL writing: a corpus-based study Gustavo Estef Lino da Silveira 19 March 2014 O presente estudo tem como objetivo geral traçar um perfil das escolhas léxico-gramaticais da escrita em inglês de um grupo de aprendizes brasileiros na cidade do Rio de Janeiro, ao longo dos anos de 2009 a 2012, através da análise de sua produção de quadrigramas (ou blocos de quatro itens lexicais usados com frequência por vários aprendizes) em composições escritas como parte da avaliação final de curso. Como objetivo específico, a pesquisa pretendeu analisar se os quadrigramas produzidos estavam dentre aqueles que haviam sido previamente ensinados para a execução da redação ou se pertenceriam a alguma outra categoria, isto é, quadrigramas já incorporados ao uso da língua ou quadrigramas errôneos usados com abrangência pela população investigada. Para tal, foram coletadas composições escritas por aprendizes de mesmo nível de proficiência de várias filiais de um mesmo curso livre de inglês na cidade do Rio de Janeiro. Em seguida, essas composições foram digitadas e anotadas para constituírem um corpus digital facilmente identificável em termos do tipo e gênero textual, perfil do aprendiz, filial e área de origem do Rio de Janeiro. O estudo faz uso de preceitos e métodos da Linguística de Corpus, área da Linguística que compila grandes quantidades de textos e deles extrai dados com o auxílio de um programa de computador para mapear uso, frequência, distribuição e abrangência de determinados fenômenos linguísticos ou discursivos. O resultado demonstra que os aprendizes investigados usaram poucos quadrigramas ensinados e, coletivamente, preferiram usar outros que não haviam sido ensinados nas aulas específicas para o nível cursado.
O estudo também demonstrou que quando o gênero textual faz parte de seu mundo pessoal, os aprendizes parecem utilizar mais quadrigramas previamente ensinados. Isto pode querer dizer que o gênero pode influenciar nas escolhas léxico-gramaticais corretas. O estudo abre portas para se compreender a importância de blocos léxico-gramaticais em escrita em L2 como forma de assegurar fluência e acuracidade no idioma e sugere que é preciso proporcionar maiores oportunidades de prática e conscientização dos aprendizes quanto ao uso de tais blocos / This study seeks to trace a profile of the lexico-grammatical choices in the English writing of a group of Brazilian learners in the city of Rio de Janeiro, between 2009 and 2012. To this end it analyses the learners' production of 4-grams (that is, blocks of four lexical items used frequently by a number of learners) in written compositions produced as part of their final assessment. Specifically, the research aimed to analyse whether the 4-grams produced by the learners had been taught previously for the writing of the composition or whether they belonged to some other category, namely 4-grams already incorporated into their language use or erroneous 4-grams used widely by the subjects investigated. To this end, compositions written by learners at the same proficiency level were collected at various branches of a private English school in the city of Rio de Janeiro. Subsequently, these compositions were typed and tagged in order to compile a digital corpus easily identified in terms of text type and genre, learner profile, branch and area of the city of Rio de Janeiro. The study makes use of precepts and methods of Corpus Linguistics, an area of Linguistics that collects large quantities of texts and extracts data from them with the help of a computer programme in order to map the use, frequency, distribution and range of certain linguistic or discursive phenomena.
The results demonstrate that the learners studied made little use of the 4-grams that had been taught to them and, collectively, preferred to use other 4-grams that had not been taught in the specific lessons for their level. The study has also shown that when the textual genre is part of their personal world, the learners seem to make use of more previously taught 4-grams. This may suggest that genre can influence the choice of correct lexico-grammatical items. The study opens the way to understanding the importance of lexico-grammatical chunks in L2 writing as a means of ensuring fluency and accuracy in the target language. In addition, it suggests that learners should be offered more opportunities for practice and awareness-raising regarding the use of such chunks. Escrita em inglês Linguística de corpus Quadrigramas Applied linguistics Writing in English Corpus linguistics 4-grams LINGUISTICA APLICADA
135	Corpop : um corpus de referência do português popular escrito do Brasil Pasqualini, Bianca Franco January 2018 Esta tese propõe um corpus do Português popular brasileiro escrito, denominado CorPop, com textos selecionados com base no nível de letramento médio dos leitores do país. As bases teórico-metodológicas do CorPop são interdisciplinares e inserem-se no âmbito dos Estudos da Linguagem e disciplinas afins, como Estudos do Léxico e Linguística de Corpus, Linguística Textual e Psicolinguística, dialogando também com estudos de Processamento de Língua Natural. Desse modo, esta investigação abriga-se na Linha de Pesquisa Lexicografia, Terminologia e Tradução: Relações Textuais do PPG-Letras-UFRGS, e nosso recorte, por isso, tende ao destaque para o Léxico. O desenvolvimento do CorPop deu-se através da compilação de dados sobre o nível de letramento dos leitores brasileiros e das características que poderiam compor um padrão de simplicidade textual em um corpus de textos adequados a esses leitores. Tais dados foram coletados das pesquisas do Indicador de Alfabetismo Funcional (INAF) e Retratos da Leitura no Brasil, além de um questionário com leitores. Os textos selecionados para o CorPop são (1) textos do jornalismo popular do Projeto PorPopular (jornal Diário Gaúcho), consumido maciçamente pelas classes C e D, que é o leitor médio brasileiro; (2) textos e autores mais lidos pelos respondentes das últimas edições da pesquisa Retratos da Leitura no Brasil; (3) coleção “É Só o Começo” (adaptação de clássicos da literatura brasileira para leitores com baixo letramento, adaptação esta realizada por linguistas); (4) textos do jornal Boca de Rua, produzido por pessoas em situação de rua, com baixa escolaridade e baixo letramento; e (5) textos do Diário da Causa Operária, imprensa operária brasileira produzida também por pessoas dentro da faixa média de letramento do país.
Realizamos, após a coleta, preparação e processamento dos textos do corpus, uma série de experimentos com a lista bruta de frequências e com a lista de frequências lematizada do CorPop. Os resultados obtidos mostram aplicações promissoras do CorPop em diversas tarefas linguísticas, desde simplificação de textos até uso como vocabulário controlado para redação de paráfrases definitórias em dicionários e comprovam que um corpus pequeno pode ter a mesma validade que um corpus de grandes proporções. / This thesis proposes a corpus of written popular Brazilian Portuguese, called CorPop, with texts selected based on the average literacy level of the country's readers. CorPop's theoretical and methodological bases are interdisciplinary and fall within the scope of Language Studies and related disciplines, such as Lexical Studies and Corpus Linguistics, Textual Linguistics and Psycholinguistics, and also engage with Natural Language Processing studies. Thus, this research belongs to the research line Lexicography, Terminology and Translation: Textual Relations of PPG-Letras-UFRGS, and our approach therefore tends to emphasize the Lexicon. The development of CorPop took place through the compilation of data about the literacy level of Brazilian readers and the characteristics that could compose a standard of textual simplicity in a corpus of texts suitable for these readers. These data were collected from the surveys Indicador de Alfabetismo Funcional (INAF, Functional Literacy Indicator) and Retratos da Leitura no Brasil, as well as from a questionnaire answered by readers.
The texts selected for CorPop are (1) popular journalism texts from the PorPopular Project (newspaper Diário Gaúcho), massively consumed by classes C and D, which represent the average Brazilian reader; (2) the texts and authors most read by respondents to the latest editions of the survey Retratos da Leitura no Brasil; (3) the collection "É Só o Começo" (adaptations of classics of Brazilian literature for readers with low literacy, carried out by linguists); (4) texts from the newspaper Boca de Rua, produced by homeless people with low schooling and low literacy; and (5) texts from the Diário da Causa Operária, a Brazilian workers' press outlet also produced by people within the country's average literacy range. After collecting, preparing and processing the texts of the corpus, we carried out a series of experiments with CorPop's raw frequency list and lemmatized frequency list. The results obtained show promising applications of CorPop in several linguistic tasks, from text simplification to use as a controlled vocabulary for writing defining paraphrases in dictionaries. They also indicate that a small corpus can have the same validity as a large one. Língua portuguesa Leitura : Compreensão Lingüística de corpus Corpus of popular Brazilian Portuguese Corpus linguistics Text simplification
136	Verblexpor : um recurso léxico com anotação de papéis semânticos para o português Zilio, Leonardo January 2015 Esta tese propõe um recurso léxico de verbos com anotação de papéis semânticos, denominado VerbLexPor, baseado em recursos como VerbNet, PropBank e FrameNet. As bases teóricas da proposta são interdisciplinares e retiradas da Linguística de Corpus e do Processamento de Linguagem Natural (PLN), visando-se a contribuir para a Linguística e para a Computação. As hipóteses de pesquisa são: a) um mesmo conjunto de papéis semânticos pode ser aplicado a diferentes gêneros textuais; e b) as diferenças entre esses gêneros se destacam no ranqueamento dos papéis semânticos. O desenvolvimento do VerbLexPor se apoia em dois corpora: um especializado, com mais de 1,6 milhão de palavras, composto por artigos científicos de Cardiologia de três periódicos brasileiros; e um não especializado, com mais de 1 milhão de palavras composto por artigos do jornal popular Diário Gaúcho. Os corpora foram anotados com o parser PALAVRAS, e as informações de sentenças, verbos e argumentos foram extraídas e armazenadas em um banco de dados. O VerbLexPor tem 192 verbos e mais de 15 mil argumentos anotados distribuídos em mais de 6 mil sentenças. Observou-se que o corpus do Diário Gaúcho privilegia uma sintaxe direta e pouco uso de voz passiva e adjuntos, enquanto o corpus de Cardiologia apresenta mais voz passiva e um maior uso de INSTRUMENTOS na posição de sujeito, além de uma menor incidência de AGENTES. Foram realizados também alguns experimentos paralelos, como a anotação de papéis semânticos por vários anotadores e o agrupamento automático de verbos. Na tarefa de múltiplos anotadores, cada um anotou exatamente as mesmas 25 orações. Os anotadores receberam um manual de anotação e um treinamento básico (explicação sobre a tarefa e dois exemplos de anotação). Usou-se o cálculo de multi-π para avaliar a concordância entre os anotadores, e o resultado foi de π = 0,25.
Os motivos para essa concordância baixa podem estar na falta de um treinamento mais completo. A tarefa de agrupamento de verbos mostrou que a sintaxe e a semântica são igualmente importantes para o agrupamento. Este estudo contribui para a área de Linguística, com um léxico de verbos anotados semanticamente, e também para a Computação, com dados que podem ser consultados e processados para diversas aplicações do PLN, principalmente por estarem disponíveis nos formatos XML e SQL. / This dissertation aims at developing a lexical resource of verbs annotated with semantic roles, called VerbLexPor, based on other resources such as VerbNet, PropBank, and FrameNet. The theoretical bases of this study lie in Corpus Linguistics and Natural Language Processing (NLP), so that it aims at contributing to both Linguistics and Computer Science. The hypotheses are: a) one set of semantic roles can be applied to different genres; and b) the differences among genres are shown by the ranking of semantic roles. The development of VerbLexPor is based on two corpora: a specialized one, with more than 1.6 million words, composed of scientific papers in the field of Cardiology from three Brazilian journals; and a non-specialized one, with more than 1 million words, composed of articles from the popular newspaper Diário Gaúcho. The corpora were annotated with the parser PALAVRAS, and sentence, verb and argument information was extracted and stored in a database. VerbLexPor has 192 verbs and more than 15 thousand arguments annotated with semantic roles, distributed among more than 6 thousand sentences. We observed that Diário Gaúcho has a more direct syntax, with less use of passive voice and adjuncts, while Cardiology has more passive voice and more INSTRUMENTS in subject position, and fewer AGENTS. We also conducted some parallel experiments, such as semantic role labeling with multiple annotators and automatic verb clustering.
In the multiple annotators task, each annotator annotated exactly the same 25 sentences. They received an annotation manual and basic training (an explanation of the task and two annotation examples). We used multi-π to evaluate agreement among annotators, and the result was π = 0.25. This low agreement may be due to the lack of more thorough training. The verb clustering task showed that syntax and semantics are equally important for clustering. This study contributes to Linguistics, with a verb lexicon annotated with semantic roles, and also to Computer Science, with data that can be accessed and processed for various NLP applications, especially because the data are available in both XML and SQL formats. Língua portuguesa Linguística computacional Corpus Linguagem especializada Semantic role labeling Lexical resource NLP Corpus linguistics
137	Análise de um corpus de produção escrita em português por crianças e adultos indígenas bilíngues/monolíngues de Dourados/MS a partir da linguistíca de corpus Espindola, Sandra January 2014 Com a finalidade de entender a origem das dificuldades apresentadas por crianças e adultos indígenas na produção de textos em português, surgiu a presente investigação, a partir da Linguística de Corpus. Para tanto, foi construído um corpus de 483 textos de crianças e 349 textos de adultos escritos em língua portuguesa produzidos por crianças e adultos indígenas e não indígenas. A amostra do grupo das crianças contou um total de 175 crianças, sendo 111 indígenas (71 bilíngues Guarani/Kaiowá e 40 Terena monolíngues) e 64 não indígenas, falantes monolíngues de português, alunos do 4º e do 5º ano do Ensino Fundamental. O grupo de adultos foi formado por um total de 118 adultos, sendo 74 indígenas (36 bilíngues Guarani/Kaiowá e 38 Terena monolíngues) e 44 não indígenas, falantes monolíngues de português, do 1º e do último ano do Ensino Superior. Os objetivos específicos da pesquisa foram: (a) verificar se existem diferenças entre o tipo de dificuldades reveladas pelos indígenas monolíngues e bilíngues de diferentes etnias – Kaiowá/Guarani e Terena – em comparação com os monolíngues não indígenas na produção de textos narrativos em português; (b) na comparação entre os dois grupos etários, crianças e adultos, observar em que medida o caminho percorrido do ensino básico à formação acadêmica interferiu no desenvolvimento da habilidade de escrita de textos; e (c) no caso dos grupos de participantes adultos, investigar se o tempo de permanência no curso de graduação (alunos que estão no primeiro e no quarto ano de curso) interfere no nível de dificuldade na produção de textos. Os dados foram analisados através da ferramenta AntConc, a partir do viés teórico da Linguística de Corpus.
A partir dessa proposta de pesquisa espera-se contribuir para que os professores, tanto os que atendem os acadêmicos quanto os que atendem as crianças, compreendam como a escrita desses dois grupos indígenas se estrutura. Essas informações são essenciais para futuras orientações nos trabalhos de leitura e escritas propostos pela escola e pelos cursos universitários que recebem acadêmicos indígenas. / In order to understand the origin of the difficulties faced by indigenous children and adults in the production of texts in Portuguese, this research emerged, based on Corpus Linguistics. To that end, a corpus was built of 483 texts by children and 349 texts by adults, written in Portuguese by indigenous and non-indigenous children and adults. The children's sample comprised a total of 175 children: 111 indigenous (71 bilingual Guarani/Kaiowá and 40 monolingual Terena) and 64 non-indigenous, monolingual speakers of Portuguese, students in the 4th and 5th years of elementary school. The adult group consisted of a total of 118 adults: 74 indigenous (36 bilingual Guarani/Kaiowá and 38 monolingual Terena) and 44 non-indigenous, monolingual speakers of Portuguese, in the first and last years of higher education. The specific objectives of the research were: (a) to determine whether there are differences between the kinds of difficulties revealed by monolingual and bilingual indigenous participants of different ethnic groups (Kaiowá/Guarani and Terena) compared to non-indigenous monolinguals in the production of narrative texts in Portuguese; (b) in the comparison between the two age groups, children and adults, to observe to what extent the path from basic education to academic training affected the development of text-writing skills; and (c) in the case of the adult participant groups, to investigate whether the time spent in the undergraduate course (students in the first and fourth years of the course) affects the level of difficulty in producing texts. The data were analyzed with the AntConc tool, from the theoretical perspective of Corpus Linguistics.
This research is expected to help teachers, both those who teach the university students and those who teach the children, understand how the writing of these two indigenous groups is structured. This information is essential for future guidance in the reading and writing work proposed by schools and by university courses that receive indigenous students. Escrita Língua portuguesa Ensino e aprendizagem Educação indígena Produção textual Language education Teaching indigenous Corpus linguistics
138	Ordet grym i ny användning : En semantisk studie av ordet i tidningstexter 1965-2004 Ericsson, Anna January 2009 (has links) Syftet med denna studie är att se hur ordet grym används i icketraditionell bemärkelse. Undersökningen har skett genom studier av ordet i elva tidningskorpusar mellan åren 1965-2004, sammanställda av Språkbanken. Genom att studera faktorer såsom betydelse, genre, användare och ordklass har jag kommit fram till att ordet har gått från att innan 1970-talet endast använts för någonting negativt till att ordet används som en förstärkning eller för något som är snyggt, häftigt och positivt. Studien visar att ordet främst används inom sport- och musikgenren i tidningarna och majoriteten av användarna är män. I denna studie om bruket av ordet grym i tidningsskriftspråk har ordet inte uppkommit i annan ordklass än vad dagens ordböcker tar upp. / The aim of this study is to see how the Swedish word grym is used in a non-traditional sense. The research is based on eleven newspaper corpora from Språkbanken covering the years 1965-2004. By studying factors such as meaning, genre, user and part of speech, the study concludes that the word has gone from being used only for something negative before the 1970s to being used as an intensifier or for something nice, cool and positive. The research shows that the word is mainly used in the sports and music genres in the newspapers and that the majority of the users are men. In the results, the word is never used as a part of speech other than those given in dictionaries, namely adjective and adverb. semantic change corpus linguistics Språkbanken semantisk förändring korpuslingvistik grym Språkbanken Specific Languages Studier av enskilda språk
139	Neutral or not? : A study of gender (in)equality in the use of professional terms in English. Östman, Klara January 2017 (has links) Jenny Cheshire, current editor-in-chief of Language in Society, stated that there is a bias toward masculine terms and referents in the English language (1985, p. 22). This poses a problem, both linguistically and socially, and conscious language reforms need to be imposed in order for the bias to be drastically countered (1985, p. 22). In the past decades, gender-neutral terms such as chairperson have been gaining ground in English, particularly in business discourse, and are contributing to a more gender-neutral language. According to Cheshire (2008), media discourse is enormously influential (p. 9) in the way we communicate, and this study investigates patterns in the use of chairperson and salesperson, as well as the historically male professions priest and manager and the historically female professions nurse and secretary. The data for this study is taken from the TIME Magazine Corpus. The results show that masculine gender collocates commonly appear with the historically female professions and, conversely, that the historically male professions appear more often with feminine collocates. Furthermore, an analysis of 1,000 instances of the terms in the corpus shows that the professions also differ in how they are connected with other words: sexuality, nationality and physicality are ways in which the collocates of the terms differ. Over time, there have been both increases and decreases in how often gender collocates appear with the terms, and the frequency of the feminine, masculine and gender-neutral terms has varied over the past century in the selected discourse. language and gender sociolinguistics professional titles corpus linguistics collocate General Language Studies and Linguistics
140	Construction de corpus généraux et spécialisés à partir du Web / Ad hoc and general-purpose corpus construction from web sources Barbaresi, Adrien 19 June 2015 (has links) Le premier chapitre s'ouvre par une description du contexte interdisciplinaire. Ensuite, le concept de corpus est présenté en tenant compte de l'état de l'art. Le besoin de disposer de preuves certes de nature linguistique mais embrassant différentes disciplines est illustré par plusieurs scénarios de recherche. Plusieurs étapes clés de la construction de corpus sont retracées, des corpus précédant l'ère digitale à la fin des années 1950 aux corpus web des années 2000 et 2010. Les continuités et changements entre la tradition en linguistique et les corpus tirés du web sont exposés. Le second chapitre rassemble des considérations méthodologiques. L'état de l'art concernant l'estimation de la qualité de textes est décrit. Ensuite, les méthodes utilisées par les études de lisibilité ainsi que par la classification automatique de textes sont résumées. Des dénominateurs communs sont isolés. Enfin, la visualisation de textes démontre l'intérêt de l'analyse de corpus pour les humanités numériques. Les raisons de trouver un équilibre entre analyse quantitative et linguistique de corpus sont abordées. Le troisième chapitre résume l'apport de la thèse en ce qui concerne la recherche sur les corpus tirés d'internet. La question de la collection des données est examinée avec une attention particulière, tout spécialement le cas des URLs sources. La notion de prétraitement des corpus web est introduite, ses étapes majeures sont brossées. L'impact des prétraitements sur le résultat est évalué. 
La question de la simplicité et de la reproductibilité de la construction de corpus est mise en avant. La quatrième partie décrit l'apport de la thèse du point de vue de la construction de corpus proprement dite, à travers la question des sources et le problème des documents invalides ou indésirables. Une approche utilisant un éclaireur léger pour préparer le parcours du web est présentée. Ensuite, les travaux concernant la sélection de documents juste avant l'inclusion dans un corpus sont résumés : il est possible d'utiliser les apports des études de lisibilité ainsi que des techniques d'apprentissage artificiel au cours de la construction du corpus. Un ensemble de caractéristiques textuelles testées sur des échantillons annotés évalue l'efficacité du procédé. Enfin, les travaux sur la visualisation de corpus sont abordés : extraction de caractéristiques à l'échelle d'un corpus afin de donner des indications sur sa composition et sa qualité. / At the beginning of the first chapter, the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics is introduced. Then, the notion of corpus is put into focus. Existing corpus and text definitions are discussed. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web-native corpora are outlined. In the second chapter, methodological insights on automated text scrutiny in computer science, computational linguistics and natural language processing are presented. The state of the art on text quality assessment and web text filtering exemplifies current interdisciplinary research trends on web texts. Readability studies and automated text classification are used as examples of methods for finding salient features in order to grasp text characteristics. Text visualization exemplifies corpus processing in the digital humanities framework. 
In conclusion, guiding principles for research practice are listed, and reasons are given to find a balance between quantitative analysis and corpus linguistics, in an environment shaped by technological innovation and artificial intelligence techniques. Third, current research on web corpora is summarized. I distinguish two main approaches to web document retrieval: restricted retrieval and web crawling. The notion of web corpus preprocessing is introduced and salient steps are discussed. The impact of the preprocessing phase on research results is assessed. I explain why the importance of preprocessing should not be underestimated and why it is important for linguists to learn new skills in order to handle the whole data gathering and preprocessing phase. I present my work on web corpus construction in the fourth chapter. My analyses concern two main aspects: first, the question of corpus sources (or prequalification), and second, the problem of including valid, desirable documents in a corpus (or document qualification). Last, I present work on corpus visualization, which consists of extracting certain corpus characteristics in order to give indications of corpus contents and quality. Construction de corpus web Linguistique de corpus Web crawling Web corpus construction Corpus linguistics Web crawling
