51 |
Efficient techniques for streaming cross document coreference resolution
Shrimpton, Luke William January 2017
Large text streams are commonplace: news organisations are constantly producing stories and people are constantly writing social media posts. These streams should be analysed in real time so that useful information can be extracted and acted upon instantly. When natural disasters occur, people want to be informed; when companies announce new products, financial institutions want to know; and when celebrities do things, their legions of fans want to feel involved. In all these examples people care about getting information in real time (with low latency). These streams are massively varied, and people’s interests are typically classified by the entities they are interested in. Organising a stream by the entity being referred to would help people extract the information useful to them. This is a difficult task: fans of ‘Captain America’ films will not want to be incorrectly told that ‘Chris Evans’ (the main actor) was appointed to host ‘Top Gear’ when it was a different ‘Chris Evans’. People who use local idiosyncrasies, such as referring to their home county (‘Cornwall’) as ‘Kernow’ (the Cornish for ‘Cornwall’, which has entered the local lexicon), should not be forced to change their language when finding out information about their home. This thesis addresses a core problem for real-time entity-specific NLP: streaming cross document coreference resolution (CDC), that is, how to automatically identify all the entities mentioned in a stream in real time. It addresses two significant problems for streaming CDC: there is no representative dataset, and existing systems consume more resources over time. A new technique for creating datasets is introduced and applied to social media (Twitter) to create a large (6M mentions) and challenging new CDC dataset that contains a much more varied range of entities than typical newswire streams. Existing systems are not able to keep up with large data streams. This problem is addressed with a streaming CDC system that stores a constant-sized set of mentions. New techniques for maintaining the sample are introduced; they significantly outperform existing ones, maintaining 95% of the performance of a non-streaming system while using only 20% of the memory.
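The idea of keeping a constant-sized set of mentions from an unbounded stream can be illustrated with classic uniform reservoir sampling. This is only a minimal sketch of the general idea: the abstract says the thesis introduces new sample-maintenance techniques but does not describe them, so the function `reservoir_sample`, its parameters and the `mention-…` strings below are illustrative assumptions, not the author's method.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], capacity: int,
                     seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of at most `capacity` items from a stream.

    Memory use is bounded by `capacity`, however long the stream is.
    (Standard reservoir sampling; NOT the technique from the thesis.)
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for seen, item in enumerate(stream, start=1):
        if len(sample) < capacity:
            # Fill the reservoir with the first `capacity` items.
            sample.append(item)
        else:
            # Keep the new item with probability capacity / seen,
            # overwriting a uniformly chosen stored item.
            slot = rng.randrange(seen)
            if slot < capacity:
                sample[slot] = item
    return sample


# Hypothetical stream of mention identifiers.
mentions = (f"mention-{i}" for i in range(10_000))
kept = reservoir_sample(mentions, capacity=100, seed=0)
print(len(kept))
```

A uniform sample treats every mention alike; a system tuned for coreference would presumably prefer to retain mentions that are more useful for future linking, which is where a specialised maintenance strategy would differ from this baseline.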
|
52 |
Aprimorando o corretor gramatical CoGrOO / Refining the CoGrOO Grammar Checker
William Daniel Colen de Moura Silva 06 March 2013
CoGrOO is an open source Brazilian Portuguese grammar checker currently used by thousands of users of a popular open source office suite. It can identify mistakes in pronoun placement, noun agreement, subject-verb agreement and use of the crase (the grave accent that marks a contraction), along with other common errors of Brazilian Portuguese writing. To accomplish this, it performs a hybrid analysis: it first annotates the text using statistical Natural Language Processing (NLP) techniques, and then a rule-based check identifies possible grammar errors. The goal of this work is to reduce omissions and false alarms while increasing true positives, without adding new error detection rules. The last rigorous evaluation of the grammar checker was done in 2006, and since then there has been no detailed study of its performance, even though the system's code has evolved substantially. This work will also contribute a detailed evaluation of the low-level NLP modules (the statistical annotators), and the results will be compared with the state of the art. Since these modules are available as open source software, improvements to their performance will make them robust, free and ready-to-use alternatives to proprietary systems.
|
53 |
Consulta a ontologias utilizando linguagem natural controlada / Querying ontologies using controlled natural language
Fabiano Ferreira Luz 31 October 2013
This research explores areas of Natural Language Processing (NLP), such as parsers, grammars and ontologies, in the development of a model for mapping queries in controlled Portuguese into SPARQL queries. SPARQL is a query language for retrieving and manipulating data stored as RDF, which forms the basis for building ontologies. This project investigates the use of the above techniques to mitigate the problem of querying ontologies using controlled natural language. The main motivation for this work is to research techniques and models that could provide better interaction between humans and computers. Ease of human-computer interaction translates into productivity, efficiency and convenience, among other implicit benefits. We focus on measuring the effectiveness of the proposed model and on finding a good combination of all the techniques in question.
|
54 |
Klasifikace obsahu právních dokumentů / Content classification in legal documents
Bečvarová, Lucia January 2017
This thesis presents applied research carried out for the company Datlowe, s.r.o., aimed at the automatic processing of legal documents. The goal of the work is to design, implement and evaluate a classification module that assigns categories to the paragraphs of the documents. Several classification algorithms are used, evaluated and compared with each other, and then combined to obtain the best models. The outcome is a prediction module that was successfully integrated into the complete document processing system. Other contributions, besides the classification module, are the measurement of inter-annotator agreement and the introduction of a new set of features for classification.
|
55 |
Robustness Analysis of Visual Question Answering Models by Basic Questions
Huang, Jia-Hong 11 1900
Visual Question Answering (VQA) models should have both high robustness and accuracy. Unfortunately, most current VQA research focuses only on accuracy because there is a lack of proper methods for measuring the robustness of VQA models. There are two main modules in our algorithm. Given a natural language question about an image, the first module takes the question as input and outputs ranked basic questions, with similarity scores, for the given main question. The second module takes the main question, the image and these basic questions as input and outputs the text-based answer to the main question about the given image. We claim that a robust VQA model is one whose performance does not change much when related basic questions are also made available to it as input. We formulate the basic question generation problem as a LASSO optimization, and also propose a large-scale Basic Question Dataset (BQD) and Rscore (a novel robustness measure) for analyzing the robustness of VQA models. We hope our BQD will be used as a benchmark for evaluating the robustness of VQA models, so as to help the community build more robust and accurate VQA models.
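For readers unfamiliar with the term, a LASSO optimization has the general form below. The abstract does not state how the variables are defined, so the interpretation here is an assumption made for illustration: $\mathbf{b}$ is taken to be an embedding of the main question, the columns of $A$ embeddings of the candidate basic questions, $\mathbf{x}$ the weights that would serve as similarity scores, and $\lambda > 0$ a regularisation strength.

```latex
\min_{\mathbf{x}} \; \frac{1}{2} \left\lVert A\mathbf{x} - \mathbf{b} \right\rVert_2^2
  + \lambda \left\lVert \mathbf{x} \right\rVert_1
```

The $\ell_1$ penalty drives most entries of $\mathbf{x}$ to zero, so under this reading only a few basic questions receive non-zero weight, and ranking them by weight would give the ranked list the first module outputs.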
|
56 |
Monitoring Tweets for Depression to Detect At-Risk Users
Jamil, Zunaira January 2017
According to the World Health Organization, mental health is an integral part of health and well-being. Mental illness can affect anyone, rich or poor, male or female. One example of mental illness is depression. In Canada, 5.3% of the population had experienced a depressive episode in the past 12 months. Depression is difficult to diagnose, resulting in high under-diagnosis. Diagnosis is often based on self-reported experiences, behaviors reported by relatives, and a mental status examination. Currently, authorities use surveys and questionnaires to identify individuals who may be at risk of depression. This process is time-consuming and costly.
We propose an automated system that can identify at-risk users from their public social media activity. More specifically, we identify at-risk users from Twitter. To achieve this goal, we trained a user-level classifier using a Support Vector Machine (SVM) that can detect at-risk users with a recall of 0.8750 and a precision of 0.7778.
We also trained a tweet-level classifier that predicts if a tweet indicates distress. This task was much more difficult due to the imbalanced data. In the dataset that we labeled, we came across 5% distress tweets and 95% non-distress tweets. To handle this class imbalance, we used undersampling methods. The resulting classifier uses SVM and performs with a recall of 0.8020 and a precision of 0.1237.
Our system can be used by authorities to find a focused group of at-risk users. It is not a platform for labeling an individual as a patient with depression, but only a platform for raising an alarm so that the relevant authorities can take the necessary steps to further analyze the predicted user and confirm his/her state of mental health. We respect the ethical boundaries relating to the use of social media data and therefore do not use any user identification information in our research.
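The abstract reports precision and recall but not a combined score. As a reading aid, the sketch below computes the F1 score (the harmonic mean of precision and recall) from the four reported figures. The F1 values are derived here by arithmetic and are not reported in the abstract; the helper name `f1_score` is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract above.
user_f1 = f1_score(precision=0.7778, recall=0.8750)   # user-level classifier
tweet_f1 = f1_score(precision=0.1237, recall=0.8020)  # tweet-level classifier

print(f"user-level F1:  {user_f1:.4f}")
print(f"tweet-level F1: {tweet_f1:.4f}")
```

The gap between the two scores (roughly 0.82 against 0.21) reflects the class imbalance described above: with only about 5% distress tweets, even a high-recall classifier flags many non-distress tweets, which pulls precision down.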
|
57 |
ANVÄNDARENS UPPLEVELSE HOS KUNDTJÄNST-CHATBOTAR / User Experience of Customer Service Chatbots
Omar, Nihad, Muhammed, Alwan January 2022
Chatbots are used today in various areas, such as education, customer service and healthcare, and have many advantages. In customer service, chatbots can help companies save time and costs. For chatbots to be widely adopted, they must give users an enjoyable and useful experience. The user experience of chatbots can be examined from different perspectives. In this work we examine the user experience of two chatbots that we implemented in different ways and with different techniques. One of the chatbots is rule-based and the other is AI-based. We used a questionnaire to carry out the study. 24 people tested both chatbots and then evaluated them from a pragmatic and a hedonic perspective. The results show that the AI chatbot provides a good experience, and participants judged it to be inventive and interesting. With the rule-based chatbot, on the other hand, participants had neither a good nor a bad experience. In terms of efficiency, the chatbot that uses AI techniques is much more efficient than the other. Since the two chatbots are identical in every property except the implementation approach, we conclude that the implementation affects the user experience.
|
58 |
NLIs over APIs : Evaluating Pattern Matching as a way of processing natural language for a simple API / NLIer över APIer : En utvärdering av mönstermatchning som en teknik för att bearbeta naturligt språk ovanpå ett simpelt API
Andrén, Samuel, Bolin, William January 2016
This report explores the feasibility of using pattern matching to implement a robust Natural Language Interface (NLI) over a limited Application Programming Interface (API). Because APIs are used so widely today, often in mobile applications, it is increasingly important to find simple ways of making them accessible to end users. A very intuitive way to access information via an API is to use natural language. This study therefore first explores the possibility of building a corpus of the most common phrases used for a particular API. It then explores how those phrases adhere to patterns, and how these patterns can be used to extract meaning from a phrase. Finally, it evaluates an implementation of an NLI that uses a pattern matching system based on those patterns. The corpus-building results show that although the number of unique phrases used with our API seems to increase quite steadily, the number of patterns those phrases follow converges quickly to a constant. This implies that it is possible to use these patterns to create an NLI that is robust enough to query an API effectively. The evaluation of the pattern matching system indicates that this technique can successfully extract information from a phrase if its pattern is known to the system.
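To make the pattern matching idea concrete, here is a minimal sketch of how phrases could be mapped to API actions with regular expressions. The abstract does not name the API or show its patterns, so everything below is hypothetical: the weather-style actions `get_weather` and `get_rain_forecast`, the two patterns, and the `parse` function are invented for illustration only.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical patterns for an imaginary weather API (not from the report).
# Each pattern is paired with the name of the API action it should trigger.
PATTERNS = [
    (re.compile(r"^(?:what is|what's) the weather in (?P<city>[\w\s]+?)\??$",
                re.IGNORECASE),
     "get_weather"),
    (re.compile(r"^will it rain in (?P<city>[\w\s]+?) (?P<day>today|tomorrow)\??$",
                re.IGNORECASE),
     "get_rain_forecast"),
]


def parse(phrase: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (action, arguments) for the first matching pattern, else None."""
    text = phrase.strip()
    for pattern, action in PATTERNS:
        match = pattern.match(text)
        if match:
            # Named groups become the arguments passed to the API action.
            return action, match.groupdict()
    return None


print(parse("What is the weather in New York?"))
print(parse("Will it rain in Stockholm tomorrow?"))
print(parse("Play some jazz"))  # no known pattern, so None
```

The last call shows the limitation the evaluation points to: a phrase is only understood if its pattern is already known to the system, so coverage depends on how quickly the set of patterns in the corpus converges.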
|
59 |
Exploring the Relationship Between Vocabulary Scaling and Algorithmic Performance in Text Classification for Large Datasets
Fearn, Wilson Murray 05 December 2019
Text analysis is a significant branch of natural language processing, and includes many different sub-fields such as topic modeling, document classification, and sentiment analysis. Unsurprisingly, those who do text analysis are concerned with the runtime of their algorithms. Some of these algorithms have runtimes that depend jointly on the size of the corpus being analyzed, as well as the size of that corpus's vocabulary. Trivially, a user may reduce the amount of data they feed into their model to speed it up, but we assume that users will be hesitant to do this as more data tends to lead to better model quality. On the other hand, when the runtime also depends on the vocabulary of the corpus, a user may instead modify the vocabulary to attain a faster runtime. Because elements of the vocabulary also add to model quality, this puts users into the position of needing to modify the corpus vocabulary in order to reduce the runtime of their algorithm while maintaining model quality. To this end, we look at the relationship between model quality and runtime for text analysis by looking at the effect that current techniques in vocabulary reduction have on algorithmic runtime and comparing that with their effect on model quality. Despite the fact that this is an important relationship to investigate, it appears little work has been done in this area. We find that most preprocessing methods do not have much of an effect on more modern algorithms, but proper rare word filtering gives the best results in the form of significant runtime reductions together with slight improvements in accuracy and a vocabulary size that scales efficiently as we increase the size of the data.
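Rare word filtering, the technique singled out above, can be sketched as dropping every word that appears in fewer than some minimum number of documents. The abstract does not say which criterion or threshold the thesis uses, so the document-frequency rule, the `min_doc_freq` parameter and the toy corpus below are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_vocabulary(docs: Iterable[List[str]], min_doc_freq: int) -> Set[str]:
    """Keep words that occur in at least `min_doc_freq` documents."""
    doc_freq: Counter = Counter()
    for tokens in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(tokens))
    return {word for word, df in doc_freq.items() if df >= min_doc_freq}


def filter_rare_words(docs: Iterable[List[str]], vocab: Set[str]) -> List[List[str]]:
    """Remove tokens that are not in the retained vocabulary."""
    return [[t for t in tokens if t in vocab] for tokens in docs]


# Toy corpus; a real corpus would have many more documents.
corpus = [
    "the model runs fast".split(),
    "the model is slow".split(),
    "the zyzzyva model".split(),
]
vocab = build_vocabulary(corpus, min_doc_freq=2)
filtered = filter_rare_words(corpus, vocab)
print(sorted(vocab))
print(filtered)
```

Because most distinct words in natural text are rare, a small threshold of this kind typically removes a large share of the vocabulary while touching few tokens, which is consistent with the runtime reductions the abstract reports.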
|
60 |
Accelerating Sustainability Report Assessment with Natural Language Processing
Välme, Emma, Renmarker, Lea January 2021
Corporations are expected to be transparent about their sustainability impact and to keep their stakeholders informed about how large their impact on the environment is, as well as about their work on reducing it. This transparency is accounted for in a sustainability report, usually voluntary, in addition to the already required financial report. With new regulations for mandatory sustainability reporting in Sweden, corporations still lack comprehensive and complete guidelines to follow, and the reports tend to be extensive. The reports are therefore hard to assess in terms of how well the reporting is actually done. The Sustainability Reporting Maturity Grid (SRMG) is an assessment tool introduced by Cöster et al. (2020) for assessing the quality of sustainability reporting. Today, the assessment is performed manually, which has proven to be time-consuming and to produce varying assessments, affected by individual interpretation of the content. This thesis explores how assessment time and grading with the SRMG can be improved by applying Natural Language Processing (NLP) to sustainability documents, resulting in a compressed assessment method: the Prototype. The Prototype is intended to facilitate and speed up the assessment process. The first step towards developing the Prototype was to decide which of three Machine Learning models, Naïve Bayes (NB), Support Vector Machines (SVM), or Bidirectional Encoder Representations from Transformers (BERT), is most suitable. This decision was supported by analyzing the accuracy of each model for each criterion in the SRMG, where BERT showed strong classification ability with an average accuracy of 96.8%. Results from the user evaluation of the Prototype indicated that the assessment time can be halved using the Prototype, from an initial average of 40 minutes to 20 minutes. However, the results also showed a lower average grade and increased variation between assessments. The results indicate that applying NLP could be successful, but to make the Prototype more competitive, a more nuanced dataset must be developed, giving the model more room to detect patterns in the data.
|