Global ETD Search

81	A generic and open framework for multiword expressions treatment: from acquisition to applications
Ramisch, Carlos Eduardo, January 2012
The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent and integrated, and it comes with a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources.
The second main contribution of this thesis is the application-oriented evaluation of the proposed methodology in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment of English phrasal verbs when they are translated into Portuguese by a standard statistical MT system. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up the work and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way to integrate MWE treatment into other applications. We conclude the thesis with an overview of past, ongoing and future work.
Keywords: Natural language; Natural language processing; Computational linguistics; Multiword expressions; Lexical acquisition; Machine translation; Lexicography; Corpus linguistics
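
As a rough illustration of what automatic MWE acquisition from a monolingual corpus can involve, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI), one commonly used association measure. It is not the mwetoolkit API, and the abstract does not say which measures the framework uses; the function name, the `min_count` threshold and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter

def rank_bigrams_by_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    # Highest association first
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy corpus (hypothetical), reusing two of the example MWEs from the abstract
corpus = ("the bus stop is near the old bus stop and "
          "the plane will take off before we take off").split()
candidates = rank_bigrams_by_pmi(corpus)
for pair, score in candidates:
    print(" ".join(pair), round(score, 2))
```

On a real corpus the candidate list would normally be filtered further (for example by part-of-speech patterns) and then evaluated, which is where the corpus size, language and MWE type mentioned in the abstract start to matter.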
82	Aperfeiçoamento de um tradutor automático Português-Inglês: tempos verbais / Development of a Portuguese-to-English machine translation system: tenses
Lucia Helena Rozario da Silva, 03 August 2010
This dissertation presents the improvement of a Portuguese-to-English machine translation system. Our main objective is to create structural transfer rules between this pair of languages and to evaluate the performance of the system using the METEOR evaluation metric. To this end, we used a test corpus created specifically for this research. Because translating the verb tenses of a sentence correctly is so important, this work prioritised rules for transferring verb tenses from Brazilian Portuguese to American English. Since Portuguese verbs are distributed across three conjugations, we created one corpus for each conjugation. The objective was to verify the application of the structural transfer rules between verb tenses in all three conjugation classes. After creating these corpora, we mapped the Portuguese verb tenses in the indicative, subjunctive and imperative moods to English verb tenses, and then built structural transfer rules for the mapped tenses. Once the rules were built, we evaluated the corpora for the three conjugation classes with the METEOR automatic evaluation metric. The evaluation after the insertion of the rules showed no improvement, and in fact a regression, compared with the evaluation of the system at the initial stage of the research. Analysis of the results indicated that METEOR was not sensitive to the modifications made to the system, even though the rules follow traditional Portuguese grammar and were applied in all three conjugation classes.
We present in detail the set of transfer rules and the corpora used in this study, which we believe are general enough to be useful in any rule-based Portuguese-to-English machine translation system. Another contribution of this work lies in the discussion of the values reported by METEOR: we suggest adjustments to its parameters to make it more sensitive to sentence variations such as those introduced by our rules.
Keywords: Artificial intelligence; Computational linguistics; Linguistics; Machine translation; Natural language processing
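
To make the idea of a tense transfer rule more concrete, here is a minimal sketch that maps a Portuguese tense label to an English verb template and fills it from a tiny bilingual lexicon, with one verb per conjugation class. The tag names, the templates and the lexicon are all hypothetical; the dissertation's actual rule set and formalism are not reproduced here, and person agreement (for example third-person "-s") is deliberately ignored.

```python
# Hypothetical mapping from Portuguese tense labels to English templates.
TENSE_TRANSFER = {
    "presente": "{base}",
    "pretérito perfeito": "{past}",
    "futuro do presente": "will {base}",
    "futuro do pretérito": "would {base}",
}

# Tiny illustrative lexicon: Portuguese infinitive -> English forms.
LEXICON = {
    "falar": {"base": "speak", "past": "spoke"},   # 1st conjugation (-ar)
    "vender": {"base": "sell", "past": "sold"},    # 2nd conjugation (-er)
    "partir": {"base": "leave", "past": "left"},   # 3rd conjugation (-ir)
}

def transfer(lemma, tense):
    """Return the English verb group for a Portuguese lemma and tense label."""
    forms = LEXICON[lemma]
    return TENSE_TRANSFER[tense].format(**forms)

print(transfer("falar", "futuro do presente"))   # will speak
print(transfer("vender", "pretérito perfeito"))  # sold
```

Keeping the tense table separate from the lexicon mirrors the abstract's point that the same rules should apply across all three conjugation classes.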
83	eDictor: da plataforma para a nuvem / eDictor: from platform to the cloud
Veronesi, Luiz Henrique Lima, 04 February 2015
In this work, we present a new proposal for editing texts that belong to an electronic corpus. Starting from the development history of the Tycho Brahe corpus and the eDictor tool, we analyse the whole workflow of corpus creation in order to obtain a more concise and less redundant way of organising information, using a single repository for the textual and morphosyntactic data of a text. This was achieved by creating a data structure based on minimal units called tokens and grouping units called chunks. The relationship between tokens and chunks, as considered in this work, can store information about how the text is organised visually (pages, paragraphs, sentences) and how it is organised syntactically in trees. All files in the Tycho Brahe corpus text catalogue were used as the basis for the analysis.
This analysis led to generic, interrelated elements: the text is deconstructed, and start and end points are defined relative to words (tokens) rather than following the linear form of the text. Introducing object-oriented concepts made it possible to relate units even smaller than the token, the split tokens, which are also tokens because they inherit the characteristics of the most significant element, the token. The aim was to use as few attributes as possible, avoiding attributes that are either too specific or not generic enough. In seeking this balance, we found it necessary to create one specific attribute for the syntactic chunk: a level attribute that indicates the distance from a tree node to the root node. Once the information is organised, access to it becomes simpler and the focus turns to defining the user interface. Available web technology allows elements to be positioned on the screen so as to reproduce the way the text appears in a book, and also allows each element to be independent of the others. This independence is what allows information to travel between the user's computer and the central processing in the cloud without the user noticing. Processing occurs in the background, using asynchronous technologies. The similarity between HTML and XML made it necessary to adapt the information for presentation to the user. The solution presented in this work assigns to tokens the information that indicates which chunk they belong to. Thus, rather than words belonging to a sentence, each word carries a piece of information that makes it belong to the sentence. This way of thinking changes how the information is displayed.
Keywords: Annotated corpus; Computational linguistics; Corpus linguistics; Digital philological edition; Electronic corpus; Web architecture
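
The token and chunk model described above can be sketched with a few small classes. This is only an illustration under stated assumptions: the class and field names (`Token`, `SplitToken`, `Chunk`, `chunk_ids`, `parent_index`, `level`) and the example sentence are hypothetical and are not taken from the eDictor code base.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    text: str
    # Each token records the chunks it belongs to,
    # instead of chunks holding lists of tokens.
    chunk_ids: set = field(default_factory=set)

@dataclass
class SplitToken(Token):
    # A split token is still a token; it also remembers the position
    # of the original token it was split from (hypothetical field).
    parent_index: int = -1

@dataclass
class Chunk:
    chunk_id: str
    kind: str        # e.g. "page", "paragraph", "sentence", "syntactic"
    level: int = 0   # distance to the root node, meaningful for syntactic chunks

def tokens_in_chunk(tokens, chunk_id):
    """Collect the text of every token that declares membership in a chunk."""
    return [t.text for t in tokens if chunk_id in t.chunk_ids]

# "Saí da casa": the contraction "da" is split into "de" + "a".
chunks = [Chunk("s1", "sentence"),
          Chunk("pp1", "syntactic", level=1),
          Chunk("np1", "syntactic", level=2)]
tokens = [
    Token("Saí", {"s1"}),
    SplitToken("de", {"s1", "pp1"}, parent_index=1),
    SplitToken("a", {"s1", "pp1", "np1"}, parent_index=1),
    Token("casa", {"s1", "pp1", "np1"}),
]
print(tokens_in_chunk(tokens, "np1"))  # ['a', 'casa']
```

Because membership lives on the token, the same flat token list can answer both layout questions (which sentence or page?) and syntactic ones (which tree node, at which level?) without duplicating the text.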
84	Abordagem para o desenvolvimento de um etiquetador de alta acurácia para o Português do Brasil / An approach to developing a high-accuracy tagger for Brazilian Portuguese
DOMINGUES, Miriam Lúcia Campos Serra, 21 October 2011
Part-of-speech tagging is a basic task required by many natural language processing applications, such as parsing and machine translation, and by speech processing applications such as speech synthesis. The task consists of tagging the words in a sentence with their grammatical categories. Although these applications require taggers with greater precision, state-of-the-art taggers still achieve accuracy of 96 to 97%. This thesis investigates corpus and software resources for developing a tagger for Brazilian Portuguese with accuracy above the state of the art.
Centred on a hybrid solution that combines probabilistic tagging with rule-based tagging, the thesis is an exploratory study of the tagging method and of the size, quality, tag set and textual genre of the corpora available for training and testing; it also evaluates the disambiguation of new or out-of-vocabulary words found in the texts to be tagged. Four corpora were used in the experiments: CETENFolha, Bosque CF 7.4, Mac-Morpho and Selva Científica. The proposed tagging model starts from transformation-based learning (TBL), to which three strategies were added, combined in an architecture that integrates the outputs (tagged texts) of two free tools, TreeTagger and μ-TBL, with the modules added to the model. The tagger model trained on the Mac-Morpho corpus, of the journalistic genre, achieved accuracy rates of 98.05% on the Mac-Morpho test set and 98.27% on Bosque CF 7.4, both of the journalistic genre. The performance of the proposed hybrid tagger was also evaluated on texts from the Selva Científica corpus, of the scientific genre. Adjustments needed in the tagger and in the corpora were identified and, as a result, accuracy rates of 98.07% on Selva Científica, 98.06% on the Mac-Morpho test set and 98.30% on Bosque CF 7.4 were achieved. These results are significant because the accuracy rates are higher than those of the state of the art, validating the proposed model in the search for a more reliable part-of-speech tagger.
Keywords: Part-of-speech tagging; Computational linguistics; Corpus linguistics
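
For readers unfamiliar with transformation-based learning, the sketch below applies a single Brill-style transformation ("change tag X to Y when the previous tag is Z") to an initial tagging and measures accuracy before and after. It only illustrates rule application, not rule learning, and it is not the TreeTagger or μ-TBL interface; the tag names, the rule and the example sentence are assumptions.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding word carries prev_tag.

    Context is read from the input tagging, not from tags already rewritten.
    """
    out = []
    for i, (word, tag) in enumerate(tagged):
        if tag == from_tag and i > 0 and tagged[i - 1][1] == prev_tag:
            tag = to_tag
        out.append((word, tag))
    return out

def accuracy(tagged, gold_tags):
    """Fraction of words whose tag matches the gold standard."""
    hits = sum(tag == gold for (_, tag), gold in zip(tagged, gold_tags))
    return hits / len(gold_tags)

# "a casa é grande": "casa" can be a noun or a form of the verb "casar".
initial = [("a", "ART"), ("casa", "V"), ("é", "V"), ("grande", "ADJ")]
gold = ["ART", "N", "V", "ADJ"]

fixed = apply_rule(initial, "V", "N", "ART")
print(accuracy(initial, gold), accuracy(fixed, gold))  # 0.75 1.0
```

In a TBL learner, candidate rules like this one are scored by how many errors they remove on a training corpus, and the best rules are kept in order.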
86	Formação de gentílicos a partir de topônimos : proposta de geração automática / Formation of gentilics from toponyms: a proposal for automatic generation
Antunes, Roger Alfredo de Marci Rodrigues, 17 February 2017
Sponsor: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
It is common to use an adjective derived from a city's name to indicate where people come from, yet the rules for forming such adjectives have rarely been discussed in the literature. The main objective of this work is to describe gentilic adjectives (demonyms), which originate from place names, called toponyms. By using specific morphological combination rules and proposing a formal representation of their regularities, we lay the basis for a computational system that can automatically generate gentilics from place names. The system proposed here is founded on the methodological principles of Dias-da-Silva (1996), with respect to the three-phase methodology of natural language processing (NLP), and on the theoretical assumptions in the works of Borba (1998), Biderman (2001), Dick (2007), Jurafsky (2009) and Sandmann (1992, 1997).
The corpus consists of the names of 5,570 municipalities (toponyms) and their respective gentilics, extracted as a list from the database of the Instituto Brasileiro de Geografia e Estatística (IBGE). We observed that patterns can be extracted only from the smallest recurrent units, such as suffixes and word endings, and that these patterns can then be used to formulate combination rules for automatic processing. The problem of computational representation also highlights the complexity of natural languages: although they can in principle be processed automatically, they are opaque, and their inherent features may deviate from the formulated rules and make processing more intricate. Nonetheless, the results show that the generation of gentilics from municipal toponyms can be automated in 52% of cases, a reasonable figure given this inherent opacity.
Keywords: Gentilic; Toponymy; Lexical morphology; Word-formation processes; Computational linguistics; Natural language processing; LINGUISTICA, LETRAS E ARTES::LINGUISTICA
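
The kind of ending-based combination rule described above can be sketched as an ordered list of (ending, suffix) pairs, together with a coverage measure against a gold list. The three rules and the five-item gold list below are illustrative assumptions, not the rules or data of the dissertation; they only show why regular endings are easy to handle while opaque forms are not.

```python
# Ordered (toponym ending, gentilic suffix) pairs; the first match wins.
# These rules are hypothetical examples, not the dissertation's rule set.
SUFFIX_RULES = [
    ("a", "ano"),   # Curitiba -> curitibano
    ("e", "ense"),  # Recife -> recifense
    ("", "ense"),   # fallback: Natal -> natalense
]

def generate_gentilic(toponym):
    """Apply the first matching ending rule to a lower-cased toponym."""
    name = toponym.lower()
    for ending, suffix in SUFFIX_RULES:
        if name.endswith(ending):
            stem = name[: len(name) - len(ending)]
            return stem + suffix
    return None

def coverage(gold):
    """Share of toponyms whose generated gentilic matches the attested one."""
    hits = sum(generate_gentilic(t) == g for t, g in gold.items())
    return hits / len(gold)

# Small gold list for illustration; the last two gentilics are opaque.
GOLD = {
    "Curitiba": "curitibano",
    "Recife": "recifense",
    "Natal": "natalense",
    "Salvador": "soteropolitano",
    "Rio de Janeiro": "carioca",
}
print(coverage(GOLD))
```

Forms such as "soteropolitano" and "carioca" cannot be derived from the surface form of the toponym, which is the sort of opacity that keeps rule coverage well below 100%.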
87	Nova classe média: um estudo empírico sobre os enquadramentos da mídia / New middle class: an empirical study of media framing
Soares, Ana Angélica Rodrigues de Andrade, March 2015
In August 2012, the chief economist of the Centro de Políticas Sociais of Fundação Getulio Vargas (FGV), Marcelo Neri, was appointed president of the Instituto de Pesquisa Econômica Aplicada (Ipea). In March of the same year, still at FGV, Neri had launched the book A Nova Classe Média: o lado brilhante da base da pirâmide, which continues the series of studies on the new middle class that he had been carrying out at the Foundation since 2008. This work analyses changes in the framing of news stories about the new middle class in the newspaper O Globo during the periods in which Marcelo Neri worked at FGV and, later, in the federal government. It does so through a Textually Oriented Framing Analysis, a critical method for analysing media framing that is intended to help perceive and measure changes in news bias as a function of political variables.
This methodology combines the linguistic analysis of large volumes of text with the social theory of discourse. It was developed in partnership with the Escola de Matemática Aplicada (EMAp/FGV), based on computational tools from Corpus Linguistics and Natural Language Processing (NLP).
Keywords: New middle class; Framing; Media; Discourse analysis; Politics; Computational linguistics; Social sciences; Journalism - Objectivity; Middle class; Linguistics - Data processing
88	Avaliação automática de questões discursivas usando LSA [Automatic assessment of open-ended questions using LSA] SANTOS, João Carlos Alves dos 05 February 2016 (has links) Submitted by camilla martins (camillasmmartins@gmail.com) on 2017-01-27T15:50:37Z No. of bitstreams: 2 license_rdf: 0 bytes, checksum: d41d8cd98f00b204e9800998ecf8427e (MD5) Tese_AvaliacaoAutomaticaQuestoes.pdf: 5106074 bytes, checksum: c401d50ce5e666c52948ece7af20b2c3 (MD5) / Approved for entry into archive by Edisangela Bastos (edisangela@ufpa.br) on 2017-01-30T13:02:31Z (GMT) / Made available in DSpace on 2017-01-30T13:02:31Z (GMT) / Previous issue date: 2016-02-05 / This work investigates the use of a model based on Latent Semantic Analysis (LSA) for the automatic assessment of short answers, averaging 25 to 70 words, to open-ended questions. With the emergence of virtual learning environments, research on automatic grading has become more relevant, because it allows open questions to be graded mechanically at low cost. In addition, automatic grading provides instant feedback and eliminates the work of manual grading, which makes it possible to run virtual classes with large numbers of students (hundreds or thousands). Research on the automatic assessment of texts has been under way since the 1960s, but only in the current decade has it reached the accuracy needed for practical use in educational institutions. For end users to trust such systems, the research challenge is to develop assessment systems that are robust and whose accuracy is close to that of human graders. 
Although some studies point in this direction, many points remain to be explored. One of them is the use of bigrams with LSA: even if they contribute little to accuracy, they contribute to robustness, which we can define as reliability, because they take into account the order of words within the text. Seeking to improve an LSA model in terms of both accuracy and robustness, we worked in four directions: first, we included word bigrams in the LSA model; second, we combined unigram and bigram co-occurrence models using multiple linear regression; third, we added a step that adjusts the LSA model's score based on the number of words in the answers being assessed; fourth, we analyzed the distribution of the scores assigned by the LSA model against those of human graders. To evaluate the results, we compared the accuracy of the system with the accuracy of human graders, checking how closely the system approximates a human grader. We used an LSA model with five steps: 1) preprocessing, 2) weighting, 3) singular value decomposition, 4) classification and 5) model adjustments. For each step, alternative strategies that influenced the final accuracy were explored. In the experiments we obtained an accuracy of 84.94% in a comparative evaluation against human specialists, where the accuracy correlation among the human specialists was 84.93%. In the domain studied, automatic assessment technology achieved results close to those of human graders, showing that it is reaching a degree of maturity that allows its use in automatic assessment systems in virtual learning environments. CNPQ::ENGENHARIAS::ENGENHARIA ELETRICA Computational linguistics Teaching method Data processing Computer-assisted instruction Latent Semantic Analysis (LSA) Machine learning Educational technology Human-computer interaction
89	Tradução automática estatística baseada em sintaxe e linguagens de árvores [Syntax-based statistical machine translation and tree languages] Beck, Daniel Emilio 19 June 2012 (has links) Made available in DSpace on 2016-06-02T19:05:58Z (GMT). No. of bitstreams: 1 4541.pdf: 1339407 bytes, checksum: be0e2f3bb86e7d6b4c8d03f4f20214ef (MD5) Previous issue date: 2012-06-19 / Universidade Federal de Minas Gerais / Machine Translation (MT) is one of the classic Natural Language Processing (NLP) applications. The state of the art in MT is represented by statistical methods that aim to learn all the necessary linguistic knowledge automatically from large collections of texts (corpora). However, although the quality of statistical MT systems has improved considerably, recent advances have not been significant. For this reason, research in the area has sought ways to incorporate more explicit linguistic knowledge into these systems. One problem that purely statistical MT systems do not solve well is the correct treatment of syntactic phenomena. Thus, one research direction for incorporating linguistic knowledge into those systems is the addition of syntactic rules. To this end, many methods and formalisms have been, and continue to be, studied. This text presents an investigation of methods that use syntactic information in an attempt to advance the state of the art in statistical MT. The methods and formalisms studied are those that deal with tree languages, mainly Tree Substitution Grammars (TSGs) and Tree-to-String (TTS) Transducers. This investigation yielded a greater understanding of the formalisms studied and of their behavior in NLP applications. Linguistics - data processing Language - machine translation Natural language processing Computational linguistics Statistical machine translation Tree substitution grammars Tree-to-string transducers
