Spelling suggestions: "subject:"[een] NATURAL LANGUAGE"" "subject:"[enn] NATURAL LANGUAGE""
321 |
Désignations nominales des événements : étude et extraction automatique dans les textes / Nominal designation of events : study and automatic extraction in textsArnulphy, Béatrice 02 October 2012 (has links)
Ma thèse a pour but l'étude des désignations nominales des événements pour l'extraction automatique. Mes travaux s'inscrivent en traitement automatique des langues, soit dans une démarche pluridisciplinaire qui fait intervenir linguistique et informatique. L'extraction d'information a pour but d'analyser des documents en langage naturel et d'en extraire les informations utiles à une application particulière. Dans ce but général, de nombreuses campagnes d'extraction d'information ont été menées~: pour chaque événement considéré, il s'agit d'extraire certaines informations relatives (participants, dates, nombres, etc.). Dès le départ, ces challenges touchent de près aux entités nommées (éléments « notables » des textes, comme les noms de personnes ou de lieu). Toutes ces informations forment un ensemble autour de l'événement. Pourtant, ces travaux ne s'intéressent que peu aux mots utilisés pour décrire l'événement (particulièrement lorsqu'il s'agit d'un nom). L'événement est vu comme un tout englobant, comme la quantité et la qualité des informations qui le composent. Contrairement aux travaux en extraction d'informations générale, notre intérêt principal est porté uniquement sur la manière dont sont nommés les événements qui se produisent et particulièrement à la désignation nominale utilisée. Pour nous, l'événement est ce qui arrive, ce qui vaut la peine qu'on en parle. Les événements plus importants font l'objet d'articles de presse ou apparaissent dans les manuels d'Histoire. Un événement peut être évoqué par une description verbale ou nominale. Dans cette thèse, nous avons réfléchi à la notion d'événement. Nous avons observé et comparé les différents aspects présentés dans l'état de l'art jusqu'à construire une définition de l'événement et une typologie des événements en général, et qui conviennent dans le cadre de nos travaux et pour les désignations nominales des événements. Nous avons aussi dégagé de nos études sur corpus différents types de formation de ces noms d'événements, dont nous montrons que chacun peut être ambigu à des titres divers. Pour toutes ces études, la composition d'un corpus annoté est une étape indispensable, nous en avons donc profité pour élaborer un guide d'annotation dédié aux désignations nominales d'événements. Nous avons étudié l'importance et la qualité des lexiques existants pour une application dans notre tâche d'extraction automatique. Nous avons aussi, par des règles d'extraction, porté intérêt au cotexte d'apparition des noms pour en déterminer l'événementialité. À la suite de ces études, nous avons extrait un lexique pondéré en événementialité (dont la particularité est d'être dédié à l'extraction des événements nominaux), qui rend compte du fait que certains noms sont plus susceptibles que d'autres de représenter des événements. Utilisée comme indice pour l'extraction des noms d'événements, cette pondération permet d'extraire des noms qui ne sont pas présents dans les lexiques standards existants. Enfin, au moyen de l'apprentissage automatique, nous avons travaillé sur des traits d'apprentissage contextuels en partie fondés sur la syntaxe pour extraire de noms d'événements. / The aim of my PhD thesis is the study of nominal designations of events for automatic extraction. My work is part of natural language processing, or in a multidisciplinary approach that involves Linguistics and Computer Science. The aim of information extraction is to analyze natural language documents and extract information relevant to a particular application. In this general goal, many information extraction campaigns were conducted: for each event considered, the task of the campaign is to extract some information (participants, dates, numbers, etc..). From the outset these challenges relate closely to named entities (elements "significant" texts, such as names of people or places). All these information are set around the event and the work does not care about the words used to describe the event (especially when it comes to a name). The event is seen as an all-encompassing as the quantity and quality of information that compose it. Unlike work in general information retrieval, our main interest is focused only on the way are named events that occur particularly in the nominal designation used. For us, this is the event that happens that is worth talking about. The most important events are the subject of newspaper articles or appear in the history books. An event can be evoked by a verbal or nominal description. In this thesis, we reflected on the notion of event. We observed and compared the different aspects presented in the state of the art to construct a definition of the event and a typology of events generally agree that in the context of our work and designations nominal events. We also released our studies of different types of training corpus of the names of events, we show that each can be ambiguous in various ways. For these studies, the composition of an annotated corpus is an essential step, so we have the opportunity to develop an annotation guide dedicated to nominal designations events. We studied the importance and quality of existing lexicons for application in our extraction task automatically. We also focused on the context of appearance of names to determine the eventness, for this purpose, we used extraction rules. Following these studies, we extracted an eventive relative weighted lexicon (whose peculiarity is to be dedicated to the extraction of nominal events), which reflects the fact that some names are more likely than others to represent events. Used as a tip for the extraction of event names, this weight can extract names that are not present in the lexicons existing standards. Finally, using machine learning, we worked on learning contextual features based in part on the syntax to extract event names.
|
322 |
Génération automatique de questionnaires à choix multiples pédagogiques : évaluation de l'homogénéité des options / Automatic generation of educational multiple-choice questions : evaluation of option homogeneityPho, Van-Minh 24 September 2015 (has links)
Ces dernières années ont connu un renouveau des Environnements Informatiques pour l'Apprentissage Humain. Afin que ces environnements soient largement utilisés par les enseignants et les apprenants, ils doivent fournir des moyens pour assister les enseignants dans leur tâche de génération d'exercices. Parmi ces exercices, les Questionnaires à Choix Multiples (QCM) sont très présents. Cependant, la rédaction d'items à choix multiples évaluant correctement le niveau d'apprentissage des apprenants est une tâche complexe. Des consignes ont été développées pour rédiger manuellement des items, mais une évaluation automatique de la qualité des items constituerait un outil pratique pour les enseignants.Nous nous sommes intéressés à l'évaluation automatique de la qualité des distracteurs (mauvais choix de réponse). Pour cela, nous avons étudié les caractéristiques des distracteurs pertinents à partir de consignes de rédaction de QCM. Cette étude nous a conduits à considérer que l'homogénéité des distracteurs et de la réponse est un critère important pour valider les distracteurs. L'homogénéité est d'ordre syntaxique et sémantique. Nous avons validé la définition de l'homogénéité par une analyse de corpus de QCM, et nous avons proposé des méthodes de reconnaissance automatique de l'homogénéité syntaxique et sémantique à partir de cette analyse.Nous nous sommes ensuite focalisé sur l'homogénéité sémantique des distracteurs. Pour l'estimer automatiquement, nous avons proposé un modèle d'ordonnancement par apprentissage, combinant différentes mesures d'homogénéité sémantique. L'évaluation du modèle a montré que notre méthode est plus efficace que les travaux existants pour estimer l'homogénéité sémantique des distracteurs. / Recent years have seen a revival of Intelligent Tutoring Systems. In order to make these systems widely usable by teachers and learners, they have to provide means to assist teachers in their task of exercise generation. Among these exercises, multiple-choice tests are very common. However, writing Multiple-Choice Questions (MCQ) that correctly assess a learner's level is a complex task. Guidelines were developed to manually write MCQs, but an automatic evaluation of MCQ quality would be a useful tool for teachers.We are interested in automatic evaluation of distractor (wrong answer choice) quality. To do this, we studied characteristics of relevant distractors from multiple-choice test writing guidelines. This study led us to assume that homogeneity between distractors and answer is an important criterion to validate distractors. Homogeneity is both syntactic and semantic. We validated the definition of homogeneity by a MCQ corpus analysis, and we proposed methods for automatic recognition of syntactic and semantic homogeneity based on this analysis.Then, we focused our work on distractor semantic homogeneity. To automatically estimate it, we proposed a ranking model by machine learning, combining different semantic homogeneity measures. The evaluation of the model showed that our method is more efficient than existing work to estimate distractor semantic homogeneity
|
323 |
Drug Name Recognition in Reports on Concomitant MedicationGräns, Arvid January 2019 (has links)
This thesis evaluates if and how drug name recognition can be used to find drug names in verbatims from reports on concomitant medication in clinical trial studies. In clinical trials, reports on concomitant medication are written if a trial participant takes other drugs than the studied drug. This information needs to be coded to a drug reference dictionary. Coded verbatims were used to create the data needed to train the drug name recognition models in this thesis. Labels for where in each verbatim the coded drugs name was, were created using a Levensthein distance. The drug name recognition models were trained and tested on verbatims with labels. Drug name recognition was performed using a logistic regression model and a bidirectional long short-term memory model. The bidirectional long short-term memory model performed the best result with an F1 score of 82.5% on classifying which words in the verbatims that were drug names. When the results were studied from case to case, they showed that the bidirectional long short-term memory classifications sometimes outperformed labels it was trained on in single word verbatims. The model was also tested on manually labelled golden standard data where it performed an F1-score of 46.4%. The results indicate that a bidirectional long short-term memory model can be implemented for drug name recognition, but that label reliability is an issue in this thesis.
|
324 |
Domain Adaptation for Hypernym Discovery via Automatic Collection of Domain-Specific Training Data / Domänanpassning för identifiering av hypernymer via automatisk insamling av domänspecifikt träningsdataPalm Myllylä, Johannes January 2019 (has links)
Identifying semantic relations in natural language text is an important component of many knowledge extraction systems. This thesis studies the task of hypernym discovery, i.e discovering terms that are related by the hypernymy (is-a) relation. Specifically, this thesis explores how state-of-the-art methods for hypernym discovery perform when applied in specific language domains. In recent times, state-of-the-art methods for hypernym discovery are mostly made up by supervised machine learning models that leverage distributional word representations such as word embeddings. These models require labeled training data in the form of term pairs that are known to be related by hypernymy. Such labeled training data is often not available when working with a specific language domain. This thesis presents experiments with an automatic training data collection algorithm. The algorithm leverages a pre-defined domain-specific vocabulary, and the lexical resource WordNet, to extract training pairs automatically. This thesis contributes by presenting experimental results when attempting to leverage such automatically collected domain-specific training data for the purpose of domain adaptation. Experiments are conducted in two different domains: One domain where there is a large amount of text data, and another domain where there is a much smaller amount of text data. Results show that the automatically collected training data has a positive impact on performance in both domains. The performance boost is most significant in the domain with a large amount of text data, with mean average precision increasing by up to 8 points.
|
325 |
Matching events and activities by integrating behavioral aspects and label analysisBaier, Thomas, Di Ciccio, Claudio, Mendling, Jan, Weske, Mathias 05 1900 (has links) (PDF)
Nowadays, business processes are increasingly supported by IT services that produce massive amounts of event data during the execution of a process. These event data can be used to analyze the process using process mining techniques to discover the real process, measure conformance to a given process model, or to enhance existing models with performance information. Mapping the produced events to activities of a given process model is essential for conformance checking, annotation and understanding of process mining results. In order to accomplish this mapping with low manual effort, we developed a semi-automatic approach that maps events to activities using insights from behavioral analysis and label analysis. The approach extracts Declare constraints from both the log and the model to build matching constraints to efficiently reduce the number of possible mappings. These mappings are further reduced using techniques from natural language processing, which allow for a matching based on labels and external knowledge sources. The evaluation with synthetic and real-life data demonstrates the effectiveness of the approach and its robustness toward non-conforming execution logs.
|
326 |
Alguns aspectos de tratamento de dependências de contexto em linguagem natural empregando tecnologia adaptativa. / Some aspects on natural language context dependencies handling using adaptive technology.Moraes, Miryam de 14 December 2006 (has links)
O tratamento de Linguagens Naturais requer o emprego de formalismos mais complexos que aqueles normalmente empregados para Linguagens Livre de Contexto. A maioria de tais formalismos são difíceis de serem utilizados, não práticos e sobretudo, associados a um desempenho de elevado custo. Autômatos de pilha estruturados são excelentes para se representar linguagens regulares e aspectos livre de contexto encontrados em Linguagem Natural, uma vez que é possível decompo-los em uma camada reguar (implementada com máquina de estados finitos) e uma livre de contexto (representada por uma pilha). Tais dispositivos aceitam linguagens determinísticas e livre de contexto em tempo linear. Dessa forma, trata-se de um dispositivo adequado para ser empregado como mecanismo subjacente para os autômatos adaptativos, que permitem o tratamento - sem perda de simplicidade e eficiência - de linguagens mais complexas que aquelas livres de contexo Nesta tese, dependências de contexto são tratadas com tecnologia adaptativa. Este trabalho mostra como uma regra de Linguagem Natural descrita com uma metalinguagem pode ser convertida em um autômato de pilha adaptativo. Foi possível verificar que problemas complexos em análise de Linguagem Natural, tais como os não-determinismos e ambigüidades presentes em situações de concordância, subcategorização, coordenação podem ser resolvidos com eficiência. De fato, todos os mecanismos adaptativos para solucionar estes problemas apresentam desempenho O(n). Uma arquitetura para processamento em Linguagem Natural é apresentada. / Since low-complexity language formalisms are too weak to handle NL, stronger formalisms are required, most of them resource demanding, hard to use or unpractical. Structured pushdown automata are excellent to represent regular and context-free aspects on NLs by allowing them to be split into regular layer (implemented as finite-state machines) and a context-free one (represented by a pushdown store). Such devices accepts deterministic context-free languages in linear time, and is suitable as un underlying mechanism for adaptive automata, allowing handling - without loss of simplicity and efficiency - languages more complex than context-free ones. In this thesis context dependency is handled with adaptive technology. This work shows as a Natural Language rule described with a metalanguage can be converted into adaptive structured pushdown automata. It was possible to verify that complex problems in Natural Language parsing e.g., nondeterminisms and ambiguities present in agreement, subcategorization, coordination can be solved with efficiency. In fact, all adaptive mechanisms attached to these problems have O(n) performance. An adaptive architecture for NL Language processing is presented.
|
327 |
Extração de informações de narrativas clínicas / Clinical reports information retrievalOleynik, Michel 02 October 2013 (has links)
Narrativas clínicas são normalmente escritas em linguagem natural devido a seu poder descritivo e facilidade de comunicação entre os especialistas. Processar esses dados para fins de descoberta de conhecimento e coleta de estatísticas exige técnicas de extração de informações, com alguns resultados já apresentados na literatura para o domínio jornalístico, mas ainda raras no domínio médico. O presente trabalho visa desenvolver um classificador de laudos de anatomia patológica que seja capaz de inferir a topografia e a morfologia de um câncer na Classificação Internacional de Doenças para Oncologia (CID-O). Dados fornecidos pelo A.C. Camargo Cancer Center em São Paulo foram utilizados para treinamento e validação. Técnicas de processamento de linguagem natural (PLN) aliadas a classificadores bayesianos foram exploradas na busca de qualidade da recuperação da informação, avaliada por meio da medida-F2. Valores acima de 74% para o grupo topográfico e de 61% para o grupo morfológico são relatados, com pequena contribuição das técnicas de PLN e suavização. Os resultados corroboram trabalhos similares e demonstram a necessidade de retreinamento das ferramentas de PLN no domínio médico. / Clinical reports are usually written in natural language due to its descriptive power and ease of communication among specialists. Processing data for knowledge discovery and statistical analysis requires information retrieval techniques, already established for newswire texts, but still rare in the medical subdomain. The present work aims at developing an automated classifier of pathology reports, which should be able to infer the topography and the morphology classes of a cancer using codes of the International Classification of Diseases for Oncology (ICD-O). Data provided by the A.C. Camargo Cancer Center located in Sao Paulo was used for training and validation. Techniques of natural language processing (NLP) and Bayes classifiers were used in search for information retrieval quality, evaluated by F2-score. Measures upper than 74% in the topographic group and 61% in the morphologic group are reported, with small contribution from NLP or smoothing techniques. The results agree with similar studies and show that a retraining of NLP tools in the medical domain is necessary.
|
328 |
Development of new models for authorship recognition using complex networks / Desenvolvimento de novos modelos para reconhecimento de autoria com a utilização de redes complexasMarinho, Vanessa Queiroz 14 July 2017 (has links)
Complex networks have been successfully applied to different fields, being the subject of study in different areas that include, for example, physics and computer science. The finding that methods of complex networks can be used to analyze texts in their different complexity levels has implied in advances in natural language processing (NLP) tasks. Examples of applications analyzed with the methods of complex networks are keyword identification, development of automatic summarizers, and authorship attribution systems. The latter task has been studied with some success through the representation of co-occurrence (or adjacency) networks that connect only the closest words in the text. Despite this success, only a few works have attempted to extend this representation or employ different ones. Moreover, many approaches use a similar set of measurements to characterize the networks and do not combine their techniques with the ones traditionally used for the authorship attribution task. This Masters research proposes some extensions to the traditional co-occurrence model and investigates new attributes and other representations (such as mesoscopic and named entity networks) for the task. The connectivity information of function words is used to complement the characterization of authors writing styles, as these words are relevant for the task. Finally, the main contribution of this research is the development of hybrid classifiers, called labelled motifs, that combine traditional factors with properties obtained with the topological analysis of complex networks. The relevance of these classifiers is verified in the context of authorship attribution and translationese identification. With this hybrid approach, we show that it is possible to improve the performance of networkbased techniques when they are combined with traditional ones usually employed in NLP. By adapting, combining and improving the model, not only the performance of authorship attribution systems was improved, but also it was possible to better understand what are the textual quantitative factors (measured through networks) that can be used in stylometry studies. The advances obtained during this project may be useful to study related applications, such as the analysis of stylistic inconsistencies and plagiarism, and the analysis of text complexity. Furthermore, most of the methods proposed in this work can be easily applied to many natural languages. / Redes complexas vem sendo aplicadas com sucesso em diferentes domínios, sendo o tema de estudo de distintas áreas que incluem, por exemplo, a física e a computação. A descoberta de que métodos de redes complexas podem ser utilizados para analisar textos em seus distintos níveis de complexidade proporcionou avanços em tarefas de processamento de línguas naturais (PLN). Exemplos de aplicações analisadas com os métodos de redes complexas são a detecção de palavras-chave, a criação de sumarizadores automáticos e o reconhecimento de autoria. Esta última tarefa tem sido estudada com certo sucesso através da representação de redes de co-ocorrência (ou adjacência) de palavras que conectam apenas as palavras mais próximas no texto. Apesar deste sucesso, poucos trabalhos tentaram estender essas redes ou utilizar diferentes representações. Além disso, muitas das abordagens utilizam um conjunto semelhante de medidas de redes complexas e não combinam suas técnicas com as utilizadas tradicionalmente na tarefa de reconhecimento de autoria. Esta pesquisa de mestrado propõe extensões à modelagem tradicional de co-ocorrência e investiga a adequabilidade de novos atributos e de outras modelagens (como as redes mesoscópicas e de entidades nomeadas) para a tarefa. A informação de conectividade de palavras funcionais é utilizada para complementar a caracterização da escrita dos autores, uma vez que essas palavras são relevantes para a tarefa. Finalmente, a maior contribuição deste trabalho consiste no desenvolvimento de classificadores híbridos, denominados labelled motifs, que combinam fatores tradicionais com as propriedades fornecidas pela análise topológica de redes complexas. A relevância desses classificadores é verificada no contexto de reconhecimento de autoria e identificação de translationese. Com esta abordagem híbrida, mostra-se que é possível melhorar o desempenho de técnicas baseadas em rede ao combiná-las com técnicas tradicionais em PLN. Através da adaptação, combinação e aperfeiçoamento da modelagem, não apenas o desempenho dos sistemas de reconhecimento de autoria foi melhorado, mas também foi possível entender melhor quais são os fatores quantitativos textuais (medidos via redes) que podem ser utilizados na área de estilometria. Os avanços obtidos durante este projeto podem ser utilizados para estudar aplicações relacionadas, como é o caso da análise de inconsistências estilísticas e plagiarismos, e análise da complexidade textual. Além disso, muitos dos métodos propostos neste trabalho podem ser facilmente aplicados em diversas línguas naturais.
|
329 |
Exploração de métodos de sumarização automática multidocumento com base em conhecimento semântico-discursivo / Exploration of automatic methods for multi-document summarization using discourse modelsCardoso, Paula Christina Figueira 05 September 2014 (has links)
A sumarização automática multidocumento visa à produção de um sumário a partir de um conjunto de textos relacionados, para ser utilizado por um usuário particular e/ou para determinada tarefa. Com o crescimento exponencial das informações disponíveis e a necessidade das pessoas obterem a informação em um curto espaço de tempo, a tarefa de sumarização automática tem recebido muita atenção nos últimos tempos. Sabe-se que em um conjunto de textos relacionados existem informações redundantes, contraditórias e complementares, que representam os fenômenos multidocumento. Em cada texto-fonte, o assunto principal é descrito em uma sequência de subtópicos. Além disso, as sentenças de um texto-fonte possuem graus de relevância diferentes. Nesse contexto, espera-se que um sumário multidocumento consista das informações relevantes que representem o total de textos do conjunto. No entanto, as estratégias de sumarização automática multidocumento adotadas até o presente utilizam somente os relacionamentos entre textos e descartam a análise da estrutura textual de cada texto-fonte, resultando em sumários que são pouco representativos dos subtópicos textuais e menos informativos do que poderiam ser. A fim de tratar adequadamente a relevância das informações, os fenômenos multidocumento e a distribuição de subtópicos, neste trabalho de doutorado, investigou-se como modelar o processo de sumarização automática usando o conhecimento semântico-discursivo em métodos de seleção de conteúdo e o impacto disso para a produção de sumários mais informativos e representativos dos textos-fonte. Na formalização do conhecimento semântico-discursivo, foram utilizadas as teorias semântico-discursivas RST (Rhetorical Structure Theory) e CST (Cross-document Structure Theory). Para apoiar o trabalho, um córpus multidocumento foi anotado com RST e subtópicos, consistindo em um recurso disponível para outras pesquisas. A partir da análise de córpus, foram propostos 10 métodos de segmentação em subtópicos e 13 métodos inovadores de sumarização automática. A avaliação dos métodos de segmentação em subtópicos mostrou que existe uma forte relação entre a estrutura de subtópicos e a análise retórica de um texto. Quanto à avaliação dos métodos de sumarização automática, os resultados indicam que o uso do conhecimento semântico-discursivo em boas estratégias de seleção de conteúdo afeta positivamente a produção de sumários informativos. / The multi-document summarization aims at producing a summary from a set of related texts to be used for an individual or/and a particular task. Nowadays, with the exponential growth of available information and the peoples need to obtain information in a short time, the task of automatic summarization has received wide attention. It is known that in a set of related texts there are pieces of redundant, contradictory and complementary information that represent the multi-document phenomenon. In each source text, the main subject is described in a sequence of subtopics. Furthermore, some sentences in the same text are more relevant than others. Considering this context, it is expected that a multi-document summary consists of relevant information that represents a set of texts. However, strategies for automatic multi-document summarization adopted until now have used only the relationships between texts and dismissed the analysis of textual structure of each source text, resulting in summaries that are less representative of subtopics and less informative than they could be. In order to properly treat the relevance of information, multi-document phenomena and distribution of subtopics, in this thesis, we investigated how to model the summarization process using the semantic-discursive knowledge and its impact for producing more informative and representative summaries from source texts. In order to formalize the semantic-discursive knowledge, we adopted RST (Rhetorical Structure Theory) and CST (Cross-document Structure Theory) theories. To support the work, a multi-document corpus was annotated with RST and subtopics, consisting of a new resource available for other researchers. From the corpus analysis, 10 methods for subtopic segmentation and 13 orignal methods for automatic summarization were proposed. The assessment of methods for subtopic segmentation showed that there is a strong relationship between the subtopics structure and the rhetorical analysis of a text. In regards to the assessment of the methods for automatic summarization, the results indicate that the use of semantic-discursive knowledge in good strategies for content selection affects positively the production of informative summaries.
|
330 |
Um estudo sobre a Teoria da Predição aplicada à análise semântica de Linguagens Naturais. / A study on the Theory of Prediction applied to the semantical analysis of Natural Languages.Chaer, Iúri 18 February 2010 (has links)
Neste trabalho, estuda-se o aprendizado computacional como um problema de indução. A partir de uma proposta de arquitetura de um sistema de análise semântica de Linguagens Naturais, foram desenvolvidos e testados individualmente os dois módulos necessários para a sua construção: um pré-processador capaz de mapear o conteúdo de textos para uma representação onde a semântica de cada símbolo fique explícita e um módulo indutor capaz de gerar teorias para explicar sequências de eventos. O componente responsável pela indução de teorias implementa uma versão restrita do Preditor de Solomonoff, capaz de tecer hipóteses pertencentes ao conjunto das Linguagens Regulares. O dispositivo apresenta complexidade computacional elevada e tempo de processamento, mesmo para entradas simples, bastante alto. Apesar disso, são apresentados resultados novos interessantes que mostram seu desempenho funcional. O módulo pré-processador do sistema proposto consiste em uma implementação da Análise da Semântica Latente, um método que utiliza correlações estatísticas para obter uma representação capaz de aproximar relações semânticas similares às feitas por seres humanos. Ele foi utilizado para indexar os mais de 470 mil textos contidos no primeiro disco do corpus RCV1 da Reuters, produzindo, a partir de dezenas de variações de parâmetros, 71;5GB de dados que foram utilizados para diversas análises estatísticas. Foi construído também um sistema de recuperação de informações para análises qualitativas do método. Os resultados dos testes levam a crer que o uso desse módulo de pré-processamento leva a ganhos consideráveis no sistema proposto. A integração dos dois componentes em um analisador semântico de Linguagens Naturais se mostra, neste momento, inviável devido ao tempo de processamento exigido pelo módulo indutor e permanece como uma tarefa para um trabalho futuro. No entanto, concluiu-se que a Teoria da Predição de Solomonoff é adequada para tratar o problema da análise semântica de Linguagens Naturais, contanto que sejam concebidas formas de mitigar o problema do seu tempo de computação. / In this work, computer learning is studied as a problem of induction. Starting with the proposal of an architecture for a system of semantic analisys of Natural Languages, the two modules necessary for its construction were built and tested independently: a pre-processor, capable of mapping the contents of texts to a representation in which the semantics of each symbol is explicit, and an inductor module, capable of formulating theories to explain chains of events. The component responsible for the induction of theories implements a restricted version of the Solomonoff Predictor, capable of producing hypotheses pertaining to the set of Regular Languages. Such device presents elevated computational complexity and very high processing time even for very simple inputs. Nonetheless, this work presents new and interesting results showing its functional performance. The pre-processing module of the proposed system consists of an implementation of Latent Semantic Analisys, a method which draws from statistical correlation to build a representation capable of approximating semantical relations made by human beings. It was used to index the more than 470 thousand texts contained in the first disk of the Reuters RCV1 corpus, resulting, through dozens of parameter variations, 71:5GB of data that were used for various statistical analises. The test results are convincing that the use of that pre-processing module leads to considerable gains in the system proposed. The integration of the two components built into a full-fledged semantical analyser of Natural Languages presents itself, at this moment, unachievable due to the processing time required by the inductor module, and remains as a task for future work. Still, Solomonoffs Theory of Prediction shows itself adequate for the treatment of semantical analysis of Natural Languages, provided new ways of palliating its processing time are devised.
|
Page generated in 0.0676 seconds