41

Implementation of an Acoustic Echo Canceller Using Matlab

Raghavendran, Srinivasaprasath 15 October 2003 (has links)
The rapid growth of technology in recent decades has changed the whole dimension of communications. Today people are more interested in hands-free communication. In such a situation, the use of a regular loudspeaker and a high-gain microphone, in place of a telephone receiver, might seem more appropriate. This would allow more than one person to participate in a conversation at the same time, such as in a teleconference environment. Another advantage is that it would allow the person to have both hands free and to move freely in the room. However, the presence of a large acoustic coupling between the loudspeaker and microphone would produce a loud echo that would make conversation difficult. Furthermore, the acoustic system could become unstable, which would cause a loud howling noise. The solution to these problems is the elimination of the echo with an echo suppression or echo cancellation algorithm. The echo suppressor offers a simple but effective method to counter the echo problem; however, it has a major disadvantage in that it supports only half-duplex communication, which permits only one speaker to talk at a time. This drawback led to the invention of echo cancellers. An important aspect of echo cancellers is that full-duplex communication can be maintained, which allows both speakers to talk at the same time. The objective of this research was to produce an improved echo cancellation algorithm capable of providing convincing results. The three basic components of an echo canceller are an adaptive filter, a doubletalk detector and a nonlinear processor. The adaptive filter creates a replica of the echo and subtracts it from the combination of the actual echo and the near-end signal. The doubletalk detector senses doubletalk, which occurs when both ends are talking, and stops adaptation of the filter in order to avoid divergence. Finally, the nonlinear processor removes the residual echo from the error signal. Usually, a certain amount of speech is clipped in the final stage of nonlinear processing. In order to avoid clipping, a noise gate was used as the nonlinear processor in this research: a threshold value is set and all signals below the threshold are removed, ensuring that only residual echoes are removed in the final stage. To date, real-time implementations of echo cancellation algorithms have relied on VLSI and DSP processors. Given the rapid advances in personal computers in recent years, this research implemented the acoustic echo canceller natively on a PC using MATLAB.
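The abstract describes an adaptive filter that estimates the echo path and a noise gate that suppresses residual echo. The thesis implements this in MATLAB; the sketch below is a minimal Python/NumPy illustration of the same idea, using a normalized LMS (NLMS) adaptive filter followed by a threshold noise gate. The doubletalk detector is omitted for brevity, and all names and parameter values (filter length, step size, gate threshold) are illustrative assumptions, not the thesis's actual settings.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filt_len=128, mu=0.5, eps=1e-6,
                        gate_threshold=1e-3):
    """Sketch of an acoustic echo canceller: NLMS adaptive filter + noise gate.

    far_end: signal played through the loudspeaker (reference).
    mic:     microphone signal = echo of far_end + near-end speech.
    Returns the error signal after echo subtraction and noise gating.
    """
    w = np.zeros(filt_len)               # adaptive filter taps (echo-path estimate)
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        # most recent far-end samples, newest first, zero-padded at the start
        x = far_end[max(0, n - filt_len + 1):n + 1][::-1]
        x = np.pad(x, (0, filt_len - len(x)))
        echo_hat = w @ x                  # replica of the echo
        e = mic[n] - echo_hat             # error = near-end speech + residual echo
        w += mu * e * x / (x @ x + eps)   # NLMS update (assumes no doubletalk)
        # noise gate as the nonlinear processor: zero out low-level residual
        out[n] = e if abs(e) > gate_threshold else 0.0
    return out

# usage sketch with synthetic signals
rng = np.random.default_rng(0)
far = rng.standard_normal(8000)
room = rng.standard_normal(64) * 0.1      # hypothetical room impulse response
mic = np.convolve(far, room)[:8000]       # pure echo, no near-end talker
cleaned = nlms_echo_canceller(far, mic)
```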
42

GeneTUC: Natural Language Understanding in Medical Text

Sætre, Rune January 2006 (has links)
Natural Language Understanding (NLU) is a 50-year-old research field, but its application to molecular biology literature (BioNLU) is less than 10 years old. After the complete human genome sequence was published by the Human Genome Project and Celera in 2001, there has been an explosion of research, shifting the NLU focus from domains like news articles to molecular biology and medical literature. BioNLU is needed because almost 2000 new articles are published and indexed every day, and biologists need to know about existing knowledge regarding their own research. So far, BioNLU results are not as good as in other NLU domains, so more research is needed to solve the challenges of creating useful NLU applications for biologists.

The work in this PhD thesis is a “proof of concept”. It is the first to show that an existing Question Answering (QA) system can be successfully applied in the hard BioNLU domain, once the essential challenge of unknown entities is solved. The core contribution is a system that automatically discovers and classifies unknown entities and the relations between them. The World Wide Web (through Google) is used as the main resource, and the performance is almost as good as other named entity extraction systems, but the advantage of this approach is that it is much simpler and requires less manual labor than any of the comparable systems.

The first paper in this collection gives an overview of the field of NLU and shows how the Information Extraction (IE) problem can be formulated with Local Grammars. The second paper uses Machine Learning to automatically recognize protein names based on features from the GSearch Engine. In the third paper, GSearch is substituted with Google, and the task is to extract all unknown names belonging to one of 273 biomedical entity classes, such as genes, proteins and processes. After getting promising results with Google, the fourth paper shows that this approach can also be used to retrieve interactions or relationships between the named entities. The fifth paper describes an online implementation of the system, and shows that the method scales well to a larger set of entities.

The final paper concludes the “proof of concept” research, and shows that the performance of the original GeneTUC NLU system has increased from handling 10% of the sentences in a large collection of abstracts in 2001 to 50% in 2006. This is still not good enough to create a commercial system, but it is believed that another 40% performance gain can be achieved by importing more verb templates into GeneTUC, just as nouns were imported during this work. Work on this has already begun, in the form of a local Master's thesis.
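The core idea in this abstract is classifying unknown entity names using the Web (through Google) as a resource. The sketch below shows one plausible form of hit-count-based classification. The `web_hit_count` function is a hypothetical stand-in for a search-engine query, and the class patterns and scoring are illustrative assumptions, not GeneTUC's actual method.

```python
# Hypothetical hit-count-based classification of an unknown biomedical name.
# `web_hit_count` is a stand-in for a search-engine API and is assumed here,
# not a real library call.

CLASS_PATTERNS = {
    "protein": ['"{name} protein"', '"the {name} protein"'],
    "gene":    ['"{name} gene"', '"expression of {name}"'],
    "process": ['"{name} pathway"', '"process of {name}"'],
}

def web_hit_count(query: str) -> int:
    """Placeholder: return the number of web pages matching `query`."""
    raise NotImplementedError("plug in a real search API here")

def classify_unknown_entity(name: str) -> str:
    """Score each candidate class by the summed hit counts of its query patterns."""
    scores = {}
    for cls, patterns in CLASS_PATTERNS.items():
        scores[cls] = sum(web_hit_count(p.format(name=name)) for p in patterns)
    return max(scores, key=scores.get)
```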
43

Corpus-Based Techniques for Word Sense Disambiguation

Levow, Gina-Anne 27 May 1998 (has links)
The need for robust and easily extensible systems for word sense disambiguation coupled with successes in training systems for a variety of tasks using large on-line corpora has led to extensive research into corpus-based statistical approaches to this problem. Promising results have been achieved by vector space representations of context, clustering combined with a semantic knowledge base, and decision lists based on collocational relations. We evaluate these techniques with respect to three important criteria: how their definition of context affects their ability to incorporate different types of disambiguating information, how they define similarity among senses, and how easily they can generalize to new senses. The strengths and weaknesses of these systems provide guidance for future systems which must capture and model a variety of disambiguating information, both syntactic and semantic.
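The abstract surveys corpus-based WSD approaches, including vector-space representations of context. As a concrete illustration, the sketch below shows a minimal vector-space disambiguator that assigns the sense whose training-context centroid is closest (by cosine similarity) to the ambiguous word's context. It is a generic textbook-style example, not any of the specific systems the thesis evaluates, and the vocabulary and training examples are toy assumptions.

```python
import numpy as np
from collections import Counter

def bow_vector(context_words, vocab):
    """Bag-of-words vector over a fixed vocabulary."""
    counts = Counter(w.lower() for w in context_words)
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return 0.0 if denom == 0 else float(u @ v) / denom

def disambiguate(context_words, sense_examples, vocab):
    """sense_examples: {sense: [training context word lists]}.
    Returns the sense whose centroid context vector is most similar."""
    target = bow_vector(context_words, vocab)
    best_sense, best_sim = None, -1.0
    for sense, examples in sense_examples.items():
        centroid = np.mean([bow_vector(ex, vocab) for ex in examples], axis=0)
        sim = cosine(target, centroid)
        if sim > best_sim:
            best_sense, best_sim = sense, sim
    return best_sense

# toy usage: disambiguating "bank"
vocab = ["river", "water", "loan", "money", "fish", "deposit"]
senses = {
    "bank/finance": [["loan", "money", "deposit"], ["money", "deposit"]],
    "bank/river":   [["river", "water", "fish"], ["river", "fish"]],
}
print(disambiguate(["water", "river"], senses, vocab))  # -> "bank/river"
```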
45

A Graph Approach to Measuring Text Distance

Tsang, Vivian 26 February 2009 (has links)
Text comparison is a key step in many natural language processing (NLP) applications in which texts can be classified on the basis of their semantic distance (how similar or different the texts are). For example, comparing the local context of an ambiguous word with that of a known word can help identify the sense of the ambiguous word. Typically, a distributional measure is used to capture the implicit semantic distance between two pieces of text. In this thesis, we introduce an alternative method of measuring the semantic distance between texts as a combination of distributional information and relational/ontological knowledge. In this work, we propose a novel distance measure within a network-flow formalism that combines these two distinct components in such a way that they are not treated as separate and orthogonal pieces of information. First, we represent each text as a collection of frequency-weighted concepts within a relational thesaurus. Then, we make use of a network-flow method which provides an efficient way of measuring the semantic distance between two texts by taking advantage of the inherently graphical structure in an ontology. We evaluate our method in a variety of NLP tasks. In our task-based evaluation, we find that our method performs well on two of three tasks. We introduce a novel measure which is intended to capture how well our network-flow method performs on a dataset (represented as a collection of frequency-weighted concepts). In our analysis, we find that an integrated approach, rather than a purely distributional or graphical analysis, is more effective in explaining the performance inconsistency. Finally, we address a complexity issue that arises from the overhead required to incorporate more sophisticated concept-to-concept distances into the network-flow framework. We propose a graph transformation method which generates a pared-down network that requires less time to process. The new method achieves a significant speed improvement, and does not seriously hamper performance as a result of the transformation, as indicated in our analysis.
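The abstract's central idea is measuring the distance between two texts, each represented as frequency-weighted concepts, by routing flow through an ontology graph. The sketch below renders that idea with networkx's minimum-cost-flow solver; the ontology, weights, and integer scaling are illustrative assumptions, and it is a simplified stand-in for the thesis's formulation, not its actual measure.

```python
# A minimal sketch of text distance as a minimum-cost flow over an ontology
# graph. Assumes every text concept appears as a node in the ontology.
import networkx as nx
from collections import Counter

def graph_distance(text_a_concepts, text_b_concepts, ontology_edges, total=100):
    """text_*_concepts: Counter of concept frequencies for each text.
    ontology_edges: iterable of (concept, concept) relations with unit cost."""
    def scaled(counts):
        s = sum(counts.values())
        m = {c: int(total * f // s) for c, f in counts.items()}
        # distribute the rounding remainder so masses sum exactly to `total`
        rem = total - sum(m.values())
        for c, _ in counts.most_common(rem):
            m[c] += 1
        return m

    mass_a, mass_b = scaled(text_a_concepts), scaled(text_b_concepts)

    G = nx.DiGraph()
    for u, v in ontology_edges:                  # ontology relations, both directions
        G.add_edge(u, v, weight=1)
        G.add_edge(v, u, weight=1)
    for node in G.nodes:                         # supply from text A, demand from text B
        G.nodes[node]["demand"] = mass_b.get(node, 0) - mass_a.get(node, 0)

    return nx.min_cost_flow_cost(G) / total      # average distance per unit of mass

# toy ontology and texts
edges = [("dog", "mammal"), ("cat", "mammal"), ("mammal", "animal"), ("bird", "animal")]
a = Counter({"dog": 3, "cat": 1})
b = Counter({"cat": 2, "bird": 2})
print(graph_distance(a, b, edges))  # 2.0 on this toy example
```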
47

Knowledge integration in machine reading

Kim, Doo Soon 04 November 2011 (has links)
Machine reading is the artificial-intelligence task of automatically reading a corpus of texts and, from the contents, building a knowledge base that supports automated reasoning and question answering. Success at this task could fundamentally solve the knowledge acquisition bottleneck – the widely recognized problem that knowledge-based AI systems are difficult and expensive to build because of the difficulty of acquiring knowledge from authoritative sources and building useful knowledge bases. One challenge inherent in machine reading is knowledge integration – the task of correctly and coherently combining knowledge snippets extracted from texts. This dissertation shows that knowledge integration can be automated and that it can significantly improve the performance of machine reading. We specifically focus on two contributions of knowledge integration. The first contribution is for improving the coherence of learned knowledge bases to better support automated reasoning and question answering. Knowledge integration achieves this benefit by aligning knowledge snippets that contain overlapping content. The alignment is difficult because the snippets can use significantly different surface forms. In one common type of variation, two snippets might contain overlapping content that is expressed at different levels of granularity or detail. Our matcher can “see past” this difference to align knowledge snippets drawn from a single document, from multiple documents, or from a document and a background knowledge base. The second contribution is for improving text interpretation. Our approach is to delay ambiguity resolution to enable a machine-reading system to maintain multiple candidate interpretations. This is useful because typically, as the system reads through texts, evidence accumulates to help the knowledge integration system resolve ambiguities correctly. To avoid a combinatorial explosion in the number of candidate interpretations, we propose the packed representation to compactly encode all the candidates. Also, we present an algorithm that prunes interpretations from the packed representation as evidence accumulates. We evaluate our work by building and testing two prototype machine reading systems and measuring the quality of the knowledge bases they construct. The evaluation shows that our knowledge integration algorithms improve the cohesiveness of the knowledge bases, indicating their improved ability to support automated reasoning and question answering. The evaluation also shows that our approach to postponing ambiguity resolution improves the system’s accuracy at text interpretation.
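The second contribution described above is delayed ambiguity resolution via a packed representation of candidate interpretations, pruned as evidence accumulates. The sketch below is a toy rendering of that idea; the data structure, scoring, and margin-based pruning rule are illustrative assumptions, not the dissertation's actual packed representation or pruning algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class PackedInterpretations:
    """Toy packed representation: each ambiguous span keeps all candidate
    interpretations with a running evidence score, pruned lazily."""
    candidates: dict = field(default_factory=dict)   # span -> {interpretation: score}

    def add(self, span, interpretations):
        # start all candidates with a neutral score instead of choosing one now
        self.candidates[span] = {interp: 0.0 for interp in interpretations}

    def add_evidence(self, span, interpretation, weight):
        if interpretation in self.candidates.get(span, {}):
            self.candidates[span][interpretation] += weight

    def prune(self, margin=1.0):
        """Drop candidates that fall `margin` or more below the current best."""
        for span, scores in self.candidates.items():
            best = max(scores.values())
            self.candidates[span] = {i: s for i, s in scores.items()
                                     if best - s < margin}

# usage sketch: "bank" stays ambiguous until later sentences add evidence
packed = PackedInterpretations()
packed.add("bank", ["FinancialInstitution", "RiverBank"])
packed.add_evidence("bank", "FinancialInstitution", 0.8)   # "...deposit money..."
packed.add_evidence("bank", "FinancialInstitution", 0.5)   # "...loan officer..."
packed.prune(margin=1.0)
print(packed.candidates["bank"])   # only FinancialInstitution survives
```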
48

Structured classification for multilingual natural language processing

Blunsom, Philip Unknown Date (has links) (PDF)
This thesis investigates the application of structured sequence classification models to multilingual natural language processing (NLP). Many tasks tackled by NLP can be framed as classification, where we seek to assign a label to a particular piece of text, be it a word, sentence or document. Yet often the labels which we’d like to assign exhibit complex internal structure, such as labelling a sentence with its parse tree, and there may be an exponential number of them to choose from. Structured classification seeks to exploit the structure of the labels in order to allow both generalisation across labels which differ by only a small amount, and tractable searches over all possible labels. In this thesis we focus on the application of conditional random field (CRF) models (Lafferty et al., 2001). These models assign an undirected graphical structure to the labels of the classification task and leverage dynamic programming algorithms to efficiently identify the optimal label for a given input. We develop a range of models for two multilingual NLP applications: word alignment for statistical machine translation (SMT), and multilingual supertagging for highly lexicalised grammars.
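The abstract notes that CRF models rely on dynamic programming to identify the optimal label sequence efficiently. As a concrete illustration, the following sketch implements Viterbi decoding for a linear-chain model given emission and transition score matrices; it is a generic textbook decoder with assumed inputs, not the thesis's word-alignment or supertagging models.

```python
import numpy as np

def viterbi_decode(emission, transition):
    """Find the highest-scoring label sequence for a linear-chain model.

    emission:   (T, L) array, score of label l at position t.
    transition: (L, L) array, score of moving from label i to label j.
    Returns the best label sequence as a list of label indices.
    """
    T, L = emission.shape
    score = np.zeros((T, L))
    backptr = np.zeros((T, L), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        # cand[i, j] = score of reaching label j at t via label i at t-1
        cand = score[t - 1][:, None] + transition + emission[t][None, :]
        backptr[t] = np.argmax(cand, axis=0)
        score[t] = np.max(cand, axis=0)
    # follow back-pointers from the best final label
    best = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

# toy usage: 4 positions, 3 labels, random scores
rng = np.random.default_rng(1)
print(viterbi_decode(rng.standard_normal((4, 3)), rng.standard_normal((3, 3))))
```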
49

Efficient techniques for streaming cross document coreference resolution

Shrimpton, Luke William January 2017 (has links)
Large text streams are commonplace; news organisations are constantly producing stories and people are constantly writing social media posts. These streams should be analysed in real-time so useful information can be extracted and acted upon instantly. When natural disasters occur people want to be informed, when companies announce new products financial institutions want to know, and when celebrities do things their legions of fans want to feel involved. In all these examples people care about getting information in real-time (low latency). These streams are massively varied, and people’s interests are typically classified by the entities they are interested in. Organising a stream by the entity being referred to would help people extract the information useful to them. This is a difficult task: fans of ‘Captain America’ films will not want to be incorrectly told that ‘Chris Evans’ (the main actor) was appointed to host ‘Top Gear’ when it was a different ‘Chris Evans’. People who use local idiosyncrasies, such as referring to their home county (‘Cornwall’) as ‘Kernow’ (the Cornish for ‘Cornwall’ that has entered the local lexicon), should not be forced to change their language when finding out information about their home. This thesis addresses a core problem for real-time entity-specific NLP: streaming cross document coreference resolution (CDC), how to automatically identify all the entities mentioned in a stream in real-time. This thesis addresses two significant problems for streaming CDC: there is no representative dataset, and existing systems consume more resources over time. A new technique to create datasets is introduced and applied to social media (Twitter) to create a large (6M mentions) and challenging new CDC dataset that contains a much more varied range of entities than typical newswire streams. Existing systems are not able to keep up with large data streams; this problem is addressed with a streaming CDC system that stores a constant-sized set of mentions. New techniques to maintain the sample are introduced, significantly out-performing existing ones while maintaining 95% of the performance of a non-streaming system and using only 20% of the memory.
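The abstract describes a streaming CDC system that keeps a constant-sized store of mentions. The sketch below is a simplified illustration of that style of system: each incoming mention is linked to the most similar stored mention above a threshold, and the oldest mention is evicted once the store is full. The similarity function, threshold, and first-in-first-out eviction are illustrative assumptions, not the thesis's actual sampling techniques.

```python
from collections import deque

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

class StreamingCDC:
    """Toy streaming coreference resolver with a bounded mention store."""

    def __init__(self, max_mentions=10_000, threshold=0.5):
        self.store = deque(maxlen=max_mentions)   # oldest mention evicted first
        self.threshold = threshold
        self.next_entity_id = 0

    def resolve(self, mention_text: str, context_words: set) -> int:
        """Return an entity id for the incoming mention."""
        features = {mention_text.lower()} | context_words
        best_id, best_sim = None, 0.0
        for stored_features, entity_id in self.store:
            sim = jaccard(features, stored_features)
            if sim > best_sim:
                best_id, best_sim = entity_id, sim
        if best_id is None or best_sim < self.threshold:
            best_id = self.next_entity_id          # start a new entity
            self.next_entity_id += 1
        self.store.append((features, best_id))     # constant-sized memory
        return best_id

# usage sketch: two different people named "Chris Evans"
cdc = StreamingCDC(max_mentions=1000, threshold=0.3)
print(cdc.resolve("Chris Evans", {"captain", "america", "actor"}))   # entity 0
print(cdc.resolve("Chris Evans", {"top", "gear", "radio"}))          # entity 1
print(cdc.resolve("Chris Evans", {"actor", "marvel", "america"}))    # entity 0 again
```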
50

Aprimorando o corretor gramatical CoGrOO / Refining the CoGrOO Grammar Checker

William Daniel Colen de Moura Silva 06 March 2013 (has links)
CoGrOO is an open source Brazilian Portuguese grammar checker currently used by thousands of users of a popular open source office suite. It is capable of identifying Brazilian Portuguese mistakes such as pronoun placement, noun agreement, subject-verb agreement, usage of the accent stress marker (crase), and other common errors of Brazilian Portuguese writing. To accomplish this, it performs a hybrid analysis: initially it annotates the text using statistical Natural Language Processing (NLP) techniques, and then a rule-based check is performed to identify possible grammar errors. The goal of this work is to reduce omissions and false alarms while improving true positives, without adding new error-detection rules. The last rigorous evaluation of the grammar checker was done in 2006 and since then there has been no detailed study on how it has been performing, even though the system's code has evolved substantially. This work will also contribute a detailed evaluation of the low-level statistical NLP annotators, and the results will be compared to the state of the art. Since the low-level NLP modules are available as open source software, improvements to their performance will make them robust, free and ready-to-use alternatives for other systems.
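The abstract describes CoGrOO's hybrid pipeline: statistical annotation of the text followed by a rule-based error check. The sketch below illustrates that architecture with a single toy agreement rule over pre-annotated tokens; the token attributes and the rule are illustrative assumptions, not CoGrOO's actual annotators or rule set.

```python
from dataclasses import dataclass

@dataclass
class Token:
    """A token as it might look after the statistical annotation stage."""
    text: str
    pos: str       # e.g. "DET", "NOUN"
    gender: str    # "m" or "f"
    number: str    # "s" or "p"

def check_noun_agreement(tokens):
    """Toy rule-based pass: flag determiner-noun pairs whose gender or
    number annotations disagree (a stand-in for one detection rule)."""
    errors = []
    for det, noun in zip(tokens, tokens[1:]):
        if det.pos == "DET" and noun.pos == "NOUN":
            if det.gender != noun.gender or det.number != noun.number:
                errors.append(f"agreement error: '{det.text} {noun.text}'")
    return errors

# "as menino" (feminine plural article + masculine singular noun) is flagged
sentence = [Token("as", "DET", "f", "p"), Token("menino", "NOUN", "m", "s")]
print(check_noun_agreement(sentence))
```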
