31 |
Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning. Hathurusinghe, Rajitha (16 September 2020)
This thesis explores the training of a deep neural network-based named entity recognizer in
an end-to-end privacy-preserved setting, where dataset creation and model training happen
with minimal manual intervention. As deep learning models become accurate enough for
practical tasks, a rising concern is satisfying their demand for training data while respecting
data privacy. Several data protection schemes, and legal guidelines to enforce them, have been
proposed in recent years in response to public concern. A promising development is
decentralized model training on isolated datasets, which avoids the privacy compromise of
handing data to a centralized entity. Even in this federated setting, however, curating the data
source remains a privacy risk, especially for unstructured sources such as text.
We explore the feasibility of automatically annotating a dataset for a Named Entity Recognition
(NER) task and of training a deep learning model with it in two federated learning settings.
We examine whether a dataset created in this manner can be used to fine-tune a state-of-the-art
deep learning language model for the downstream task of named entity recognition.
We also study how this novel combination of a deep learning NLP model and federated learning
deviates from the classical centralized setting.
We created an automatically annotated dataset of around 80,000 sentences, a manually
annotated test set, and tools to extend the dataset with further manual annotations.
We observed that the noise introduced by automated annotation can be mitigated to a degree by
increasing the dataset size. We also contributed state-of-the-art NLP model support to the
federated learning framework. Overall, our NER model achieved an F1-score of around 0.80
for recognizing entities in sentences.
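As a rough illustration of the automated annotation step (not the pipeline used in the thesis), the following Python sketch labels PII-like entities with BIO tags by greedy gazetteer matching; the gazetteer entries and label names are invented for the example.

```python
# Illustrative sketch only: gazetteer-based automatic BIO annotation for a
# PII-style NER dataset. The gazetteer contents and label set are invented
# for this example and are not taken from the thesis.

GAZETTEER = {
    "PERSON": {"alice johnson", "bob smith"},
    "LOCATION": {"ottawa", "colombo"},
}

def auto_annotate(sentence):
    """Return (token, BIO-tag) pairs by greedy longest-match over the gazetteer."""
    tokens = sentence.split()
    tags = ["O"] * len(tokens)
    lowered = [t.lower().strip(".,") for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # try the longest span first so "alice johnson" beats "alice"
        for j in range(len(tokens), i, -1):
            span = " ".join(lowered[i:j])
            for label, names in GAZETTEER.items():
                if span in names:
                    tags[i] = f"B-{label}"
                    for k in range(i + 1, j):
                        tags[k] = f"I-{label}"
                    i = j
                    matched = True
                    break
            if matched:
                break
        if not matched:
            i += 1
    return list(zip(tokens, tags))

print(auto_annotate("Alice Johnson moved from Colombo to Ottawa last year."))
```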
|
32 |
Can ChatGPT Generate Code to Support a System Sciences Bachelor’s Thesis? / Kan ChatGPT generera kod för att stödja en kandidatuppsats i systemvetenskap? Amin, Solin; Hellström, Johan (January 2023)
Background: ChatGPT is a chatbot released in November 2022. Its usage has grown to include academia and scientific writing, with varying results. We investigate whether ChatGPT can be used for the technical part of a Bachelor’s thesis in System Sciences. Aim: We evaluate whether it is possible, in the form of a dialogue, to generate code for detecting potential gender bias in previous responses from ChatGPT. Method: We use an exploratory case study in which an iterative dialogue with ChatGPT is used to generate Python code for analysing previous responses made by ChatGPT. The development methods were chosen by the authors from suggestions made by ChatGPT. Results: Two separate dialogues resulted in a program that combined a fine-tuned Natural Language Processing model with sentiment analysis and word frequency analysis. The program successfully identified responses in the dataset as having a female or male gender bias or being gender neutral. Conclusions: ChatGPT serves as a powerful tool for coding, although it currently falls short of being a one-stop solution that can generate code sufficient for more complex tasks with a single prompt. Our experience suggests that ChatGPT accelerates one’s work when the user possesses some programming knowledge. With further development, ChatGPT could transform coding workflows and increase productivity in related fields. Implications: ChatGPT is very capable of supporting students in the technical aspect of a Bachelor’s thesis, and it is not unreasonable to assume that it works in other contexts as well. As such, one can achieve more with the tool than without, and consequently it would be better to integrate ChatGPT into thesis work. This stresses the point that we need better regulations for dealing with cheating and plagiarism.
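To make the kind of program described in the Results concrete, here is a minimal Python sketch that combines gendered-word frequency counts with a crude lexicon-based sentiment score; the word lists, labels and decision rule are assumptions for illustration and do not reproduce the fine-tuned model generated in the study.

```python
# Toy bias labeller: count gendered terms and a rough lexicon-based sentiment
# score per response. All word lists and the labelling rule are invented here.

FEMALE_TERMS = {"she", "her", "woman", "women", "female"}
MALE_TERMS = {"he", "him", "his", "man", "men", "male"}
POSITIVE = {"good", "great", "skilled", "capable"}
NEGATIVE = {"bad", "weak", "incapable", "poor"}

def classify_bias(text):
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    female = sum(t in FEMALE_TERMS for t in tokens)
    male = sum(t in MALE_TERMS for t in tokens)
    sentiment = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if female == male:
        return "gender neutral", sentiment
    return ("female bias" if female > male else "male bias"), sentiment

print(classify_bias("She is a great engineer and her designs are skilled work."))
print(classify_bias("The committee reviewed the proposal carefully."))
```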
|
33 |
Обнаружение русско-английских лексически родственных слов с использованием NLP машинного обучения и языка Питон : магистерская диссертация / English/Russian lexical cognates detection using NLP Machine Learning with Python. Бадр, Я. Э. К. А. (Badr, Y. E. K. A.) (January 2023)
Language learning is a remarkable endeavor that expands our horizons and allows us to connect with diverse cultures and people around the world. Traditionally, language education has relied on conventional methods such as textbooks, vocabulary drills, and language exchanges. With the advent of machine learning, however, a new era has dawned upon language instruction, offering innovative and efficient ways to accelerate language acquisition. One intriguing application of machine learning in language learning is the use of cognates, words that share similar meanings and spellings across different languages. To address this subject, this research proposes to facilitate the process of acquiring a second language with the help of artificial intelligence, particularly neural networks, which can identify and use words that are similar or identical in both the learner's first language and the target language. These words, known as lexical cognates, can facilitate language learning by providing a familiar point of reference for the learner and enabling them to associate new vocabulary with words they already know.
By leveraging the power of neural networks to detect and use these cognates, learners can accelerate their progress in acquiring a second language. Although the study of semantic similarity across languages is not a new topic, our objective is to adopt a different approach for identifying Russian-English lexical cognates and to present the results as a language learning tool, using a sample of lexical and semantic similarity data across languages to build a lexical cognate detection and word association model. Subsequently, depending on our analysis and results, we present a word association application that can be used by end users. Given that Russian and English are among the most widely spoken languages globally and that Russia is a popular destination for international students from around the world, this served as a significant motivation to develop an AI tool to assist people learning Russian as English speakers or learning English as Russian speakers.
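A surface-level baseline for cognate detection can be sketched as transliteration plus string similarity; the transliteration table, the 0.7 threshold and the example word pairs below are assumptions for illustration and stand in for the neural model described above.

```python
# Toy cognate check: transliterate a Russian word into Latin characters and
# compare it to an English candidate with a similarity ratio. This is only a
# surface-level baseline, not the model built in the thesis.
from difflib import SequenceMatcher

TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "з": "z",
    "и": "i", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "ts",
}

def transliterate(word):
    return "".join(TRANSLIT.get(ch, ch) for ch in word.lower())

def is_cognate(russian, english, threshold=0.7):
    score = SequenceMatcher(None, transliterate(russian), english.lower()).ratio()
    return score >= threshold, round(score, 2)

for ru, en in [("студент", "student"), ("спорт", "sport"), ("стол", "table")]:
    print(ru, en, is_cognate(ru, en))
```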
|
34 |
A Semantics-based User Interface Model for Content Annotation, Authoring and Exploration. Khalili, Ali (26 January 2015)
The Semantic Web and Linked Data movements, which aim to create, publish and interconnect machine-readable information, have gained traction in recent years.
However, the majority of information is still contained in, and exchanged using, unstructured documents such as Web pages, text documents, images and videos.
This cannot be expected to change, since text, images and videos are the natural way in which humans interact with information.
Semantic structuring of content, on the other hand, provides a wide range of advantages compared to unstructured information.
Semantically-enriched documents facilitate information search and retrieval, presentation, integration, reusability, interoperability and personalization.
Looking at the life-cycle of semantic content on the Web of Data, we see considerable progress on the backend side, in storing structured content and in linking data and schemata.
Nevertheless, the least developed aspect of the semantic content life-cycle is, from our point of view, the user-friendly manual and semi-automatic creation of rich semantic content.
In this thesis, we propose a semantics-based user interface model, which aims to reduce the complexity of underlying technologies for semantic enrichment of content by Web users.
By surveying existing tools and approaches for semantic content authoring, we extracted a set of guidelines for designing efficient and effective semantic authoring user interfaces.
We applied these guidelines to devise a semantics-based user interface model called WYSIWYM (What You See Is What You Mean) which enables integrated authoring, visualization and exploration of unstructured and (semi-)structured content.
To assess the applicability of our proposed WYSIWYM model, we incorporated the model into four real-world use cases comprising two general and two domain-specific applications.
These use cases address four aspects of the WYSIWYM implementation:
1) Its integration into existing user interfaces,
2) Utilizing it for lightweight text analytics to incentivize users,
3) Dealing with crowdsourcing of semi-structured e-learning content,
4) Incorporating it for authoring of semantic medical prescriptions.
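Behind such WYSIWYM-style authoring, the enrichment step ultimately produces machine-readable markup inside otherwise unstructured content. The sketch below, with an invented annotate() helper and schema.org/DBpedia URIs chosen for the example, shows the kind of RDFa output such an interface might emit; it is not the implementation described in the thesis.

```python
# Minimal sketch of semantic enrichment: wrap a recognised mention in RDFa
# markup so the text stays readable while carrying machine-readable structure.
# The vocabulary and helper are assumptions made for this illustration.

def annotate(text, mention, resource_uri, rdf_type):
    """Wrap the first occurrence of `mention` in an RDFa-annotated <span>."""
    markup = (f'<span about="{resource_uri}" typeof="{rdf_type}" '
              f'property="http://www.w3.org/2000/01/rdf-schema#label">{mention}</span>')
    return text.replace(mention, markup, 1)

html = annotate(
    "Berlin is the capital of Germany.",
    "Berlin",
    "http://dbpedia.org/resource/Berlin",
    "http://schema.org/City",
)
print(html)
```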
|
35 |
Statistical semantic processing using Markov logic. Meza-Ruiz, Ivan Vladimir (January 2009)
Markov Logic (ML) is a novel approach to Natural Language Processing tasks [Richardson and Domingos, 2006; Riedel, 2008]. It is a Statistical Relational Learning language based on First Order Logic (FOL) and Markov Networks (MN). It allows one to treat a task as structured classification. In this work, we investigate ML for the semantic processing tasks of Spoken Language Understanding (SLU) and Semantic Role Labelling (SRL). Both tasks consist of identifying a semantic representation for the meaning of a given utterance/sentence. However, they differ in nature: SLU is in the field of dialogue systems, where the domain is closed and language is spoken [He and Young, 2005], while SRL is for open domains and traditionally for written text [Márquez et al., 2008]. Robust SLU is a key component of spoken dialogue systems. This component consists of identifying the meaning of the user utterances addressed to the system. Recent statistical approaches to SLU depend on additional resources (e.g., gazetteers, grammars, syntactic treebanks) which are expensive and time-consuming to produce and maintain. On the other hand, simple datasets annotated only with slot-values are commonly used in dialogue system development, and are easy to collect, automatically annotate, and update. However, slot-values leave out some of the fine-grained long-distance dependencies present in other semantic representations. In this work we investigate the development of SLU modules with minimal resources, using slot-values as their semantic representation. We propose to use ML to capture long-distance dependencies which are not explicitly available in the slot-value semantic representation. We test the adequacy of the ML framework by comparing against a set of baselines using state-of-the-art approaches to semantic processing. The results of this research have been published in Meza-Ruiz et al. [2008a,b]. Furthermore, we address the question of the scalability of the ML approach to other NLP tasks involving the identification of semantic representations. In particular, we focus on SRL: the task of identifying predicates and arguments within sentences, together with their semantic roles. The semantic representation built during SRL is more complex than the slot-values used in dialogue systems, in the sense that it includes the notion of predicate/argument scope. SRL is defined in the context of open domains under the premise that several levels of extra resources (lemmas, POS tags, constituent or dependency parses) are available. In this work, we propose an ML model of SRL and experiment with the different architectures we can describe for the model, which gives us insight into the types of correlations that the ML model can express [Riedel and Meza-Ruiz, 2008; Meza-Ruiz and Riedel, 2009]. Additionally, we tested our minimal-resources setup in a state-of-the-art dialogue system: the TownInfo system. In this case, we were given a small dataset of gold-standard, system-dependent semantic representations, and we rapidly developed an SLU module used in the functioning dialogue system. No extra resources were necessary in order to reach state-of-the-art results.
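To give a flavour of the Markov Logic formulation (without any of the actual formulas used in this work), the toy Python sketch below grounds a few invented weighted rules over one sentence and scores candidate role assignments with an unnormalised log-linear score, as a Markov Network would.

```python
# Toy Markov Logic illustration: weighted rules are grounded over a sentence
# and a candidate role assignment is scored by summing the weights of the
# satisfied groundings. Rules, weights, and features are invented for the example.
import math

TOKENS = ["Mary", "gave", "John", "a", "book"]
POS = ["NNP", "VBD", "NNP", "DT", "NN"]

# weighted rules over a (predicate index, argument index, role) grounding
RULES = [
    (2.0, lambda p, a, role: POS[p] == "VBD" and role == "PRED" and a == p),
    (1.5, lambda p, a, role: POS[a] == "NNP" and a < p and role == "A0"),  # agent before verb
    (1.0, lambda p, a, role: POS[a] == "NNP" and a > p and role == "A2"),  # recipient after verb
    (0.5, lambda p, a, role: POS[a] == "NN" and a > p and role == "A1"),   # theme after verb
]

def score(assignment):
    """Sum rule weights over all groundings satisfied by the assignment."""
    total = 0.0
    for (p, a, role) in assignment:
        total += sum(w for w, rule in RULES if rule(p, a, role))
    return total

good = [(1, 1, "PRED"), (1, 0, "A0"), (1, 2, "A2"), (1, 4, "A1")]
bad = [(1, 1, "PRED"), (1, 0, "A2"), (1, 2, "A0"), (1, 4, "A1")]
print("good assignment score:", score(good))
print("bad assignment score: ", score(bad))
print("odds ratio:", math.exp(score(good) - score(bad)))
```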
|
36 |
Statistical Natural Language Processing Methods in Music Notation Analysis. Libovický, Jindřich (January 2013)
The thesis summarizes research on the application of statistical methods from computational linguistics to music processing and explains the theoretical background of these applications. In the second part, methods of symbolic melody extraction are explored. A corpus of approximately 400 hours of melodies of different music styles was created, and a melody model using language modeling techniques was trained on this corpus. In the third part of the thesis, the model is used in an attempt to develop an alternative method of audio melody extraction which uses the melody model instead of the commonly used heuristics and rules. The chosen approach works well only on simple input data and produces worse results than the commonly used methods on the MIREX contest data. On the other hand, the experiments help to understand the conceptual difference between the pitch frequency development (the physical melody) and the melody perceived on an abstract level in the symbolic notation (the symbolic melody).
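The language-modelling view of melody can be illustrated with a toy add-one-smoothed bigram model over pitch intervals; the two-melody "corpus" below is a placeholder for the 400-hour corpus and the real model.

```python
# Sketch of a melody language model: treat the sequence of pitch intervals
# (in semitones) as "words" and train a smoothed bigram model on them.
from collections import Counter
import math

def intervals(pitches):
    return [b - a for a, b in zip(pitches, pitches[1:])]

# toy "corpus" of MIDI pitch sequences
corpus = [
    [60, 62, 64, 65, 67, 65, 64, 62, 60],   # scale up and down
    [60, 64, 67, 64, 60],                   # arpeggio
]

bigrams = Counter()
unigrams = Counter()
for melody in corpus:
    seq = intervals(melody)
    unigrams.update(seq)
    bigrams.update(zip(seq, seq[1:]))

VOCAB = set(range(-12, 13))  # intervals within an octave

def log_prob(melody):
    """Add-one smoothed bigram log-probability of a melody's interval sequence."""
    seq = intervals(melody)
    lp = 0.0
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(VOCAB)))
    return lp

print("stepwise melody:", log_prob([60, 62, 64, 65, 67]))
print("jumpy melody:   ", log_prob([60, 71, 61, 70, 62]))
```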
|
37 |
Titrage automatique de documents textuels / Automatic titling of textual documents. Lopez, Cédric (01 October 2012)
During the first millennium BC, libraries, which appeared with the need to organize the preservation of texts, were immediately confronted with the difficulties of indexing. The title emerged as a first solution, enabling quick identification of every work and, in most cases, helping to discern works thematically close to a given one. While in Ancient Greece titles had little informative function, although they still served to identify a document, the invention of the movable-type printing press (Gutenberg, 15th century) dramatically increased the number of documents, which could now be distributed on a large scale. The title gradually acquired new functions, very often tied to issues of sociocultural or political influence (particularly in journalistic articles). Today, whether a document is electronic or on paper, one or several titles are very often present, creating a first link between the reader and the subject of the document. But how can a few words have such a great influence? What functions must titles perform at the beginning of the 21st century? How can one automatically generate titles that respect these functions? The automatic titling of textual documents is, first of all, a key element of Web page accessibility (W3C standards) as defined in the guidelines proposed by disability associations. For the reader, the goal is to increase the readability of pages obtained from a keyword search, whose relevance is often low, discouraging readers who must supply considerable cognitive effort. For the Website producer, the goal is to improve page indexing for more relevant search. Other interests motivate this study (titling commercial Web pages, titling for the automatic generation of tables of contents, titling to provide supporting elements for automatic summarization). To handle the automatic titling of textual documents on a large scale, this study employs NLP (Natural Language Processing) methods and systems. While numerous works have been published on indexing and automatic summarization, automatic titling has so far remained comparatively little studied, and its position within NLP has been unclear. We argue in this study that automatic titling should nevertheless be considered a task in its own right. Having defined the problems connected to automatic titling, and having positioned this task among existing tasks, we propose a series of methods for producing syntactically correct titles according to several objectives. In particular, we are interested in the production of informative titles and, for the first time in the history of automatic titling, we introduce the concept of catchiness. Our TIT' system, which consists of three methods (POSTIT, NOMIT, and CATIT), produces sets of informative titles in 81% of cases and catchy titles in 78% of cases.
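As a rough, purely frequency-based stand-in for an informative-title generator (the POSTIT, NOMIT and CATIT methods themselves rely on syntactic patterns and are not reproduced here), the following sketch builds a telegraphic title from the highest-scoring content words; the stopword list and word limit are assumptions for the example.

```python
# Toy informative-title heuristic: score content words by frequency and build
# a short, telegraphic title from the top-scoring ones in order of appearance.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are",
             "for", "with", "this", "that", "it", "as", "by", "be", "we"}

def informative_title(document, max_words=6):
    words = re.findall(r"[a-zA-Z']+", document.lower())
    content = [w for w in words if w not in STOPWORDS and len(w) > 2]
    top = {w for w, _ in Counter(content).most_common(max_words)}
    title_words, seen = [], set()
    for w in words:
        if w in top and w not in seen:
            title_words.append(w)
            seen.add(w)
        if len(title_words) == max_words:
            break
    return " ".join(title_words).title()

doc = ("Automatic titling of textual documents is studied. We generate informative "
       "titles for textual documents using statistical methods. Informative titles "
       "help readers find documents.")
print(informative_title(doc))  # telegraphic output, e.g. keyword-style title
```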
|
38 |
Généralisation de données textuelles adaptée à la classification automatique / Toward new features for text mining. Tisserant, Guillaume (14 April 2015)
The classification of text documents is a relatively old task. Very early on, documents of different kinds were grouped together in order to centralize knowledge, and classification and indexing systems were created to make it easy to find documents according to readers' needs. With the growing number of documents and the appearance of computers and then of the Internet, implementing text classification systems has become a crucial issue. Textual data, however, are complex and rich in nature and difficult to process automatically. In this context, this thesis proposes an original methodology for organizing textual information so as to make it easier to access. Our approaches to automatic text classification and to the extraction of semantic information make it possible to retrieve relevant information quickly. More precisely, this manuscript presents new forms of text representation that facilitate processing for automatic classification tasks. A method of partial generalization of textual data (the GenDesc approach), based on statistical and morphosyntactic criteria, is proposed. The thesis also addresses the construction of phrases and the use of semantic information to improve document representation. We demonstrate through numerous experiments the relevance and genericity of our proposals, which improve classification results. Finally, in the context of rapidly developing social networks, a method for the automatic generation of semantically meaningful hashtags is proposed. Our approach relies on statistical measures, semantic resources and the use of syntactic information. The generated hashtags can then be exploited for information retrieval tasks over large volumes of data.
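The partial-generalisation idea behind GenDesc can be illustrated with a toy criterion: keep words that occur in documents of only one class and replace the rest by their part-of-speech tag. The corpus, hand-assigned tags and the criterion itself are simplifications assumed for the example, not the statistical and morphosyntactic criteria actually used.

```python
# Toy partial generalisation: replace non-discriminative words by their POS tag
# so the classifier sees "DT NN VBD fantastic" rather than sparse surface forms.
from collections import defaultdict

# (label, [(word, POS), ...]) pairs for a tiny toy corpus
DOCS = [
    ("pos", [("the", "DT"), ("film", "NN"), ("was", "VBD"), ("fantastic", "JJ")]),
    ("neg", [("the", "DT"), ("film", "NN"), ("was", "VBD"), ("dreadful", "JJ")]),
    ("pos", [("a", "DT"), ("fantastic", "JJ"), ("soundtrack", "NN")]),
]

def generalise(docs):
    """Keep words seen in only one class; replace the rest by their POS tag."""
    classes = defaultdict(set)
    for label, doc in docs:
        for word, _ in doc:
            classes[word].add(label)
    return [[word if len(classes[word]) == 1 else pos for word, pos in doc]
            for _, doc in docs]

for (_, original), generalised in zip(DOCS, generalise(DOCS)):
    print([w for w, _ in original], "->", generalised)
```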
|
39 |
Avaliação automática da qualidade de escrita de resumos científicos em inglês / Automatic evaluation of the quality of English abstracts. Genoves Junior, Luiz Carlos (01 June 2007)
Poor writing may have serious implications for a professional's career. This is even more serious in the case of scientists and academics, whose work requires fluency and proficiency in their mother tongue as well as in English. This is why a number of writing tools have been developed to assist researchers in promoting their work. Here, we are particularly interested in tools, such as AMADEUS and SciPo, which focus on scientific writing. AMADEUS and SciPo are corpus-based tools and hence rely on corpus compilation, which is by no means an easy task: in addition to the difficult task of selecting well-written texts, it also requires segmenting these texts according to their schematic structure. The present dissertation investigates, evaluates and implements methods to automatically detect the schematic structure of English abstracts and to automatically evaluate their writing quality. These methods have been examined with a view to enabling the development of two types of tools, namely detection of well-written abstracts and a critique tool. For automatically detecting schematic structures, we have developed a tool, named AZEA, which adopts a corpus-based, supervised machine learning approach. AZEA reaches 80.4% accuracy and a Kappa of 0.73, which is above the highest rates reported in the literature so far (73% accuracy, Kappa of 0.65). We have tested a number of different combinations of algorithms, features and paper sections. AZEA has been used to implement two of the seven dimensions of a rubric for analyzing scientific papers, and a critique tool for evaluating the structure of abstracts has been developed and made available. In addition, our work includes the development of a classifier for identifying errors related to English article usage. This classifier reaches 83.7% accuracy (Kappa of 0.63) on the task of deciding whether or not a given English noun phrase requires an article; implemented in the grammatical-errors dimension of the above-mentioned rubric, it can be used to give users feedback on their errors. As regards the task of detecting well-written abstracts, we have resorted to methods successfully adopted to evaluate the quality of essays, and some preliminary tests have been carried out. However, our results are not yet satisfactory, since they are not much above the baseline. Despite this drawback, we believe this study is relevant since, in addition to offering some of the necessary tools, it provides fundamental guidelines towards the automatic evaluation of text quality.
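A minimal sketch of the corpus-based, supervised approach behind a tool like AZEA might look as follows, with a TF-IDF plus logistic-regression pipeline over a handful of invented training sentences; AZEA's real feature set and annotated corpus are, of course, far richer.

```python
# Toy rhetorical-move classifier for abstract sentences. The five training
# sentences and move labels are invented; a real system needs a full corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Writing quality is crucial for researchers publishing in English.",
    "However, existing tools give little feedback on abstract structure.",
    "This paper presents a classifier for the schematic structure of abstracts.",
    "We train a supervised model on a corpus of annotated abstracts.",
    "The system reaches high accuracy on a held-out test set.",
]
moves = ["BACKGROUND", "GAP", "PURPOSE", "METHOD", "RESULT"]

# TF-IDF unigrams and bigrams feeding a logistic-regression classifier
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(sentences, moves)

print(model.predict(["We evaluate a machine learning model trained on annotated data."]))
```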
|
40 |
Consulta a ontologias utilizando linguagem natural controlada / Querying ontologies using controlled natural language. Luz, Fabiano Ferreira (31 October 2013)
This research explores areas of Natural Language Processing (NLP), such as parsers, grammars and ontologies, in the development of a model for mapping queries in controlled Portuguese into SPARQL queries. SPARQL is a query language for retrieving and manipulating data stored as RDF, which forms the basis for building ontologies. This project investigates the use of the above techniques to mitigate the problem of querying ontologies using controlled natural language. The main motivation for this work is to research techniques and models that can provide better human-computer interaction: ease of interaction translates into productivity, efficiency and convenience, among other implicit benefits. We focus on measuring the effectiveness of the proposed model and on finding a good combination of all the techniques in question.
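One way to picture the mapping from a controlled query to SPARQL is a single pattern rule, sketched below in Python; the regex pattern, the ex: namespace and the naming conventions are assumptions for the example and stand in for the parser and grammar developed in the thesis.

```python
# Toy controlled-language-to-SPARQL mapper covering one query pattern:
#   list all <class> whose <property> is "<value>"
import re

PATTERN = re.compile(r'^list all (\w+) whose (\w+) is "([^"]+)"$', re.IGNORECASE)

def to_sparql(controlled_query):
    match = PATTERN.match(controlled_query.strip())
    if match is None:
        raise ValueError("query is outside the controlled language")
    cls, prop, value = match.groups()
    return (
        "PREFIX ex: <http://example.org/ontology#>\n"
        "SELECT ?item WHERE {\n"
        f"  ?item a ex:{cls.capitalize()} ;\n"
        f'        ex:{prop.lower()} "{value}" .\n'
        "}"
    )

print(to_sparql('list all professors whose department is "Computer Science"'))
```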
|