Spelling suggestions: "subject:"batural language aprocessing"" "subject:"batural language eprocessing""
361 |
Ontology learning from folksonomies.January 2010 (has links)
Chen, Wenhao. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (p. 63-70). / Abstracts in English and Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Ontologies and Folksonomies --- p.1 / Chapter 1.2 --- Motivation --- p.3 / Chapter 1.2.1 --- Semantics in Folksonomies --- p.3 / Chapter 1.2.2 --- Ontologies with basic level concepts --- p.5 / Chapter 1.2.3 --- Context and Context Effect --- p.6 / Chapter 1.3 --- Contributions --- p.6 / Chapter 1.4 --- Structure of the Thesis --- p.8 / Chapter 2 --- Background Study --- p.10 / Chapter 2.1 --- Semantic Web --- p.10 / Chapter 2.2 --- Ontology --- p.12 / Chapter 2.3 --- Folksonomy --- p.14 / Chapter 2.4 --- Cognitive Psychology --- p.17 / Chapter 2.4.1 --- Category (Concept) --- p.17 / Chapter 2.4.2 --- Basic Level Categories (Concepts) --- p.17 / Chapter 2.4.3 --- Context and Context Effect --- p.20 / Chapter 2.5 --- F1 Evaluation Metric --- p.21 / Chapter 2.6 --- State of the Art --- p.23 / Chapter 2.6.1 --- Ontology Learning --- p.23 / Chapter 2.6.2 --- Semantics in Folksonomy --- p.26 / Chapter 3 --- Ontology Learning from Folksonomies --- p.28 / Chapter 3.1 --- Generating Ontologies with Basic Level Concepts from Folksonomies --- p.29 / Chapter 3.1.1 --- Modeling Instances and Concepts in Folksonomies --- p.29 / Chapter 3.1.2 --- The Metric of Basic Level Categories (Concepts) --- p.30 / Chapter 3.1.3 --- Basic Level Concepts Detection Algorithm --- p.31 / Chapter 3.1.4 --- Ontology Generation Algorithm --- p.34 / Chapter 3.2 --- Evaluation --- p.35 / Chapter 3.2.1 --- Data Set and Experiment Setup --- p.35 / Chapter 3.2.2 --- Quantitative Analysis --- p.36 / Chapter 3.2.3 --- Qualitative Analysis --- p.39 / Chapter 4 --- Context Effect on Ontology Learning from Folksonomies --- p.43 / Chapter 4.1 --- Context-aware Basic Level Concepts Detection --- p.44 / Chapter 4.1.1 --- Modeling Context in Folksonomies --- p.44 / Chapter 4.1.2 --- Context Effect on Category Utility --- p.45 / Chapter 4.1.3 --- Context-aware Basic Level Concepts Detection Algorithm --- p.46 / Chapter 4.2 --- Evaluation --- p.47 / Chapter 4.2.1 --- Data Set and Experiment Setup --- p.47 / Chapter 4.2.2 --- Result Analysis --- p.49 / Chapter 5 --- Potential Applications --- p.54 / Chapter 5.1 --- Categorization of Web Resources --- p.54 / Chapter 5.2 --- Applications of Ontologies --- p.55 / Chapter 6 --- Conclusion and Future Work --- p.57 / Chapter 6.1 --- Conclusion --- p.57 / Chapter 6.2 --- Future Work --- p.59 / Bibliography --- p.63
|
362 |
Anotação automática de papéis semânticos de textos jornalísticos e de opinião sobre árvores sintáticas não revisadas / Automatic semantic role labeling on non-revised syntactic trees of journalistic and opinion textsHartmann, Nathan Siegle 25 June 2015 (has links)
Contexto: A Anotação de Papéis Semânticos (APS) é uma tarefa da área de Processamento de Línguas Naturais (PLN) que permite detectar os eventos descritos nas sentenças e os participantes destes eventos (Palmer et al., 2010). A APS responde perguntas como Quem?, Quando?, Onde?, O quê?, e Por quê?, dentre outras e, sendo assim, é importante para várias aplicações de PLN. Para anotar automaticamente um texto com papéis semânticos, a maioria dos sistemas atuais emprega técnicas de Aprendizagem de Máquina (AM). Porém, alguns papéis semânticos são previsíveis e, portanto, não necessitam ser tratados via AM. Além disso, a grande maioria das pesquisas desenvolvidas em APS tem dado foco ao inglês, considerando as particularidades gramaticais e semânticas dessa língua, o que impede que essas ferramentas e resultados sejam diretamente transportados para outras línguas. Revisão da Literatura: Para o português do Brasil, há três trabalhos finalizados recentemente que lidam com textos jornalísticos, porém com performance inferior ao estado da arte para o inglês. O primeiro (Alva- Manchego, 2013) obteve 79,6 de F1 na APS sobre o córpus PropBank.Br; o segundo (Fonseca, 2013), sem fazer uso de um treebank para treinamento, obteve 68,0 de F1 sobre o córpus PropBank.Br; o terceiro (Sequeira et al., 2012) realizou anotação apenas dos papéis Arg0 (sujeito prototípico) e Arg1 (paciente prototípico) no córpus CETEMPúblico, com performance de 31,3 pontos de F1 para o primeiro papel e de 19,0 de F1 para o segundo. Objetivos: O objetivo desse trabalho de mestrado é avançar o estado da arte na APS do português brasileiro no gênero jornalístico, avaliando o desempenho de um sistema de APS treinado com árvores sintáticas geradas por um parser automático (Bick, 2000), sem revisão humana, usando uma amostragem do córpus PLN-Br. Como objetivo adicional, foi avaliada a robustez da tarefa de APS frente a gêneros diferentes, testando o sistema de APS, treinado no gênero jornalístico, em uma amostra de revisões de produtos da web. Esse gênero não foi explorado até então na área de APS e poucas de suas características foram formalizadas. Resultados: Foi compilado o primeiro córpus de opiniões sobre produtos da web, o córpus Buscapé (Hartmann et al., 2014). A diferença de performance entre um sistema treinado sobre árvores revisadas e outro sobre árvores não revisadas ambos no gênero jornalístico foi de 10,48 pontos de F1. A troca de gênero entre as fases de treinamento e teste, em APS, é possível, com perda de performance de 3,78 pontos de F1 (córpus PLN-Br e Buscapé, respectivamente). Foi desenvolvido um sistema de inserção de sujeitos não expressos no texto, com precisão de 87,8% no córpus PLN-Br e de 94,5% no córpus Buscapé. Foi desenvolvido um sistema, baseado em regras, para anotar verbos auxiliares com papéis semânticos modificadores, com confiança de 96,76% no córpus PLN-Br. Conclusões: Foi mostrado que o sistema de Alva-Manchego (2013), baseado em árvores sintáticas, desempenha melhor APS do que o sistema de Fonseca (2013), independente de árvores sintáticas. Foi mostrado que sistemas de APS treinados sobre árvores sintáticas não revisadas desempenham melhor APS sobre árvores não revisadas do que um sistema treinado sobre dados gold-standard. Mostramos que a explicitação de sujeitos não expressos nos textos do Buscapé, um córpus do gênero de opinião de produtos na web, melhora a performance da sua APS. Também mostramos que é possível anotar verbos auxiliares com papéis semânticos modificadores, utilizando um sistema baseado em regras, com alta confiança. Por fim, mostramos que o uso do sentido do verbo, como feature de AM, para APS, não melhora a perfomance dos sistemas treinados sobre o PLN-Br e o Buscapé, por serem córpus pequenos. / Background: Semantic Role Labeling (SRL) is a Natural Language Processing (NLP) task that enables the detection of events described in sentences and the participants of these events (Palmer et al., 2010). SRL answers questions such as Who?, When?, Where?, What? and Why? (and others), that are important for several NLP applications. In order to automatically annotate a text with semantic roles, most current systems use Machine Learning (ML) techniques. However, some semantic roles are predictable, and therefore, do not need to be classified through ML. In spite of SRL being well advanced in English, there are grammatical and semantic particularities that prevents full reuse of tools and results in other languages. Related work: For Brazilian Portuguese, there are three studies recently concluded that performs SRL in journalistic texts. The first one (Alva-Manchego, 2013) obtained 79.6 of F1 on the SRL of the PropBank.Br corpus; the second one (Fonseca, 2013), without using a treebank for training, obtained 68.0 of F1 for the same corpus; and the third one (Sequeira et al., 2012) annotated only the Arg0 (prototypical agent) and Arg1 (prototypical patient) roles on the CETEMPúblico corpus, with a perfomance of 31.3 of F1 for the first semantic role and 19.0 for the second one. None of them, however, reached the state of the art of the English language. Purpose: The goal of this masters dissertation was to advance the state of the art of SRL in Brazilian Portuguese. The training corpus used is from the journalistic genre, as previous works, but the SRL annotation is performed on non-revised syntactic trees, i.e., generated by an automatic parser (Bick, 2000) without human revision, using a sampling of the corpus PLN-Br. To evaluate the resulting SRL classifier in another text genre, a sample of product reviews from web was used. Until now, product reviews was a genre not explored in SRL research, and few of its characteristics are formalized. Results: The first corpus of web product reviews, the Buscapé corpus (Hartmann et al., 2014), was compiled. It is shown that the difference in the performance of a system trained on revised syntactic trees and another trained on non-revised trees both from the journalistic genre was of 10.48 of F1. The change of genres during the training and testing steps in SRL is possible, with a performance loss of 3.78 of F1 (corpus PLN-Br and Buscapé, respectively). A system to insert unexpressed subjects reached 87.8% of precision on the PLN-Br corpus and a 94.5% of precision on the Buscapé corpus. A rule-based system was developed to annotated auxiliary verbs with semantic roles of modifiers (ArgMs), achieving 96.76% confidence on the PLN-Br corpus. Conclusions: First we have shown that Alva-Manchego (2013) SRL system, that is based on syntactic trees, performs better annotation than Fonseca (2013)s system, that is nondependent on syntactic trees. Second the SRL system trained on non-revised syntactic trees performs better over non-revised trees than a system trained on gold-standard data. Third, the explicitation of unexpressed subjects on the Buscapé texts improves their SRL performance. Additionally, we show it is possible to annotate auxiliary verbs with semantic roles of modifiers, using a rule-based system. Last, we have shown that the use of the verb sense as a feature of ML, for SRL, does not improve the performance of the systems trained over PLN-Br and Buscapé corpus, since they are small.
|
363 |
Agrupamento semântico de aspectos para mineração de opinião / Semantic clustering of aspects for opinion miningVargas, Francielle Alves 29 November 2017 (has links)
Com o rápido crescimento do volume de informações opinativas na web, extrair e sintetizar conteúdo subjetivo e relevante da rede é uma tarefa prioritária e que perpassa vários domínios da sociedade: político, social, econômico, etc. A organização semântica desse tipo de conteúdo, é uma tarefa importante no contexto atual, pois possibilita um melhor aproveitamento desses dados, além de benefícios diretos tanto para consumidores quanto para organizações privadas e governamentais. A área responsável pela extração, processamento e apresentação de conteúdo subjetivo é a mineração de opinião, também chamada de análise de sentimentos. A mineração de opinião é dividida em níveis de granularidade de análise: o nível do documento, o nível da sentença e o nível de aspectos. Neste trabalho, atuou-se no nível mais fino de granularidade, a mineração de opinião baseada em aspectos, que consiste de três principais tarefas: o reconhecimento e agrupamento de aspectos, a extração de polaridade e a sumarização. Aspectos são propriedades do alvo da opinião e podem ser implícitos e explícitos. Reconhecer e agrupar aspectos são tarefas críticas para mineração de opinião, no entanto, também são desafiadoras. Por exemplo, em textos opinativos, usuários utilizam termos distintos para se referir a uma mesma propriedade do objeto. Portanto, neste trabalho, atuamos no problema de agrupamento de aspectos para mineração de opinião. Para resolução deste problema, optamos por uma abordagem baseada em conhecimento linguístico. Investigou-se os principais fenômenos intrínsecos e extrínsecos em textos opinativos a fim de encontrar padrões linguísticos e insumos acionáveis para proposição de métodos automáticos de agrupamento de aspectos correlatos para mineração de opinião. Nós propomos, implementamos e comparamos seis métodos automáticos baseados em conhecimento linguístico para a tarefa de agrupamento de aspectos explícitos e implícitos. Um método inédito foi proposto para essa tarefa que superou os demais métodos implementados, especialmente o método baseado em léxico de sinônimos (baseline) e o modelo estatístico com base em word embeddings. O método proposto também não é dependente de uma língua ou de um domínio, no entanto, focamos no Português do Brasil e no domínio de produtos da web. / With the growing volume of opinion information on the web, extracting and synthesizing subjective and relevant content from the web has to be shown a priority task that passes through different society domains, such as political, social, economical, etc. The semantic organization of this type of content is very important nowadays since it allows a better use of those data, as well as it benefits customers and both private and governmental organizations. The area responsible for extracting, processing and presenting the subjective content is opinion mining, also known as sentiment analysis. Opinion mining is divided into granularity levels: document, sentence and aspect levels. In this research, the deepest level of granularity was studied, the opinion mining based on aspects, which consists of three main tasks: aspect recognition and clustering, polarity extracting, and summarization. Aspects are the properties and parts of the evaluated object and it may be implicit or explicit. Recognizing and clustering aspects are critical tasks for opinion mining; nonetheless, they are also challenging. For example, in reviews, users use distinct terms to refer to the same object property. Therefore, in this work, the aspect clustering task was the focus. To solve this problem, a linguistic approach was chosen. The main intrinsic and extrinsic phenomena in reviews were investigated in order to find linguistic standards and actionable inputs, so it was possible to propose automatic methods of aspect clustering for opinion mining. In addition, six automatic linguistic-based methods for explicit and implicit aspect clustering were proposed, implemented and compared. Besides that, a new method was suggested for this task, which surpassed the other implemented methods, specially the synonym lexicon-based method (baseline) and a word embeddings approach. This suggested method is also language and domain independent and, in this work, was tailored for Brazilian Portuguese and products domain.
|
364 |
Automated spatiotemporal and semantic information extraction for hazardsWang, Wei 01 July 2014 (has links)
This dissertation explores three research topics related to automated spatiotemporal and semantic information extraction about hazard events from Web news reports and other social media. The dissertation makes a unique contribution of bridging geographic information science, geographic information retrieval, and natural language processing. Geographic information retrieval and natural language processing techniques are applied to extract spatiotemporal and semantic information automatically from Web documents, to retrieve information about patterns of hazard events that are not explicitly described in the texts. Chapters 2, 3 and 4 can be regarded as three standalone journal papers. The research topics covered by the three chapters are related to each other, and are presented in a sequential way. Chapter 2 begins with an investigation of methods for automatically extracting spatial and temporal information about hazards from Web news reports. A set of rules is developed to combine the spatial and temporal information contained in the reports based on how this information is presented in text in order to capture the dynamics of hazard events (e.g., changes in event locations, new events occurring) as they occur over space and time. Chapter 3 presents an approach for retrieving semantic information about hazard events using ontologies and semantic gazetteers. With this work, information on the different kinds of events (e.g., impact, response, or recovery events) can be extracted as well as information about hazard events at different levels of detail. Using the methods presented in Chapter 2 and 3, an approach for automatically extracting spatial, temporal, and semantic information from tweets is discussed in Chapter 4. Four different elements of tweets are used for assigning appropriate spatial and temporal information to hazard events in tweets. Since tweets represent shorter, but more current information about hazards and how they are impacting a local area, key information about hazards can be retrieved through extracted spatiotemporal and semantic information from tweets.
|
365 |
Automating an Engine to Extract Educational Priorities for Workforce City InnovationHobbs, Madison 01 January 2019 (has links)
This thesis is grounded in my work done through the Harvey Mudd College Clinic Program as Project Manager of the PilotCity Clinic Team. PilotCity is a startup whose mission is to transform small to mid-sized cities into centers of innovation by introducing employer partnerships and work-based learning to high school classrooms. The team was tasked with developing software and algorithms to automate PilotCity's programming and to extract educational insights from unstructured data sources like websites, syllabi, resumes, and more. The team helped engineer a web application to expand and facilitate PilotCity's usership, designed a recommender system to automate the process of matching employers to high school classrooms, and packaged a topic modeling module to extract educational priorities from more complex data such as syllabi, course handbooks, or other educational text data. Finally, the team explored automatically generating supplementary course resources using insights from topic models. This thesis will detail the team's process from beginning to final deliverables including the methods, implementation, results, challenges, future directions, and impact of the project.
|
366 |
USING MODULAR ARCHITECTURES TO PREDICT CHANGE OF BELIEFS IN ONLINE DEBATESAldo Fabrizio Porco (7460849) 17 October 2019 (has links)
<div>
<div>
<div>
<p>Researchers studying persuasion have mostly focused on modeling arguments to
understand how people’s beliefs can change. However, in order to convince an audience the speakers usually adapt their speech. This can be seen often in political
campaigns when ideas are phrased - framed - in different ways according to the geo-graphical region the candidate is in. This practice suggests that, in order to change
people’s beliefs, it is important to take into account their previous perspectives and
topics of interest.
</p><p><br></p>
<p>In this work we propose ChangeMyStance, a novel task to predict if a user would
change their mind after being exposed to opposing views on a particular subject. This
setting takes into account users’ beliefs before a debate, thus modeling their preconceived notions about the topic. Moreover, we explore a new approach to solve the
problem, where the task is decomposed into ”simpler” problems. Breaking the main
objective into several tasks allows to build expert modules that combined produce
better results. This strategy significantly outperforms a BERT end-to-end model over
the same inputs.
</p>
</div>
</div>
</div>
|
367 |
NATURAL LANGUAGE PROCESSING BASED GENERATOR OF TESTING INSTRUMENTSWang, Qianqian 01 September 2017 (has links)
Natural Language Processing (NLP) is the field of study that focuses on the interactions between human language and computers. By “natural language” we mean a language that is used for everyday communication by humans. Different from programming languages, natural languages are hard to be defined with accurate rules. NLP is developing rapidly and it has been widely used in different industries. Technologies based on NLP are becoming increasingly widespread, for example, Siri or Alexa are intelligent personal assistants using NLP build in an algorithm to communicate with people. “Natural Language Processing Based Generator of Testing Instruments” is a stand-alone program that generates “plausible” multiple-choice selections by analyzing word sense disambiguation and calculating semantic similarity between two natural language entities. The core is Word Sense Disambiguation (WSD), WSD is identifying which sense of a word is used in a sentence when the word has multiple meanings. WSD is considered as an AI-hard problem. The project presents several algorithms to resolve WSD problem and compute semantic similarity, along with experimental results demonstrating their effectiveness.
|
368 |
A Machine Learning Approach to Predicting Alcohol Consumption in Adolescents From Historical Text Messaging DataBergh, Adrienne 28 May 2019 (has links)
Techniques based on artificial neural networks represent the current state-of-the-art in machine learning due to the availability of improved hardware and large data sets. Here we employ doc2vec, an unsupervised neural network, to capture the semantic content of text messages sent by adolescents during high school, and encode this semantic content as numeric vectors. These vectors effectively condense the text message data into highly leverageable inputs to a logistic regression classifier in a matter of hours, as compared to the tedious and often quite lengthy task of manually coding data. Using our machine learning approach, we are able to train a logistic regression model to predict adolescents' engagement in substance abuse during distinct life phases with accuracy ranging from 76.5% to 88.1%. We show the effects of grade level and text message aggregation strategy on the efficacy of document embedding generation with doc2vec. Additional examination of the vectorizations for specific terms extracted from the text message data adds quantitative depth to this analysis. We demonstrate the ability of the method used herein to overcome traditional natural language processing concerns related to unconventional orthography. These results suggest that the approach described in this thesis is a competitive and efficient alternative to existing methodologies for predicting substance abuse behaviors. This work reveals the potential for the application of machine learning-based manipulation of text messaging data to development of automatic intervention strategies against substance abuse and other adolescent challenges.
|
369 |
Cross-Lingual and Low-Resource Sentiment AnalysisFarra, Noura January 2019 (has links)
Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages.
This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language.
Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis.
To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments.
The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language.
In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
|
370 |
Adapting Automatic Summarization to New Sources of InformationOuyang, Jessica Jin January 2019 (has links)
English-language news articles are no longer necessarily the best source of information. The Web allows information to spread more quickly and travel farther: first-person accounts of breaking news events pop up on social media, and foreign-language news articles are accessible to, if not immediately understandable by, English-speaking users. This thesis focuses on developing automatic summarization techniques for these new sources of information.
We focus on summarizing two specific new sources of information: personal narratives, first-person accounts of exciting or unusual events that are readily found in blog entries and other social media posts, and non-English documents, which must first be translated into English, often introducing translation errors that complicate the summarization process. Personal narratives are a very new area of interest in natural language processing research, and they present two key challenges for summarization. First, unlike many news articles, whose lead sentences serve as summaries of the most important ideas in the articles, personal narratives provide no such shortcuts for determining where important information occurs in within them; second, personal narratives are written informally and colloquially, and unlike news articles, they are rarely edited, so they require heavier editing and rewriting during the summarization process. Non-English documents, whether news or narrative, present yet another source of difficulty on top of any challenges inherent to their genre: they must be translated into English, potentially introducing translation errors and disfluencies that must be identified and corrected during summarization.
The bulk of this thesis is dedicated to addressing the challenges of summarizing personal narratives found on the Web. We develop a two-stage summarization system for personal narrative that first extracts sentences containing important content and then rewrites those sentences into summary-appropriate forms. Our content extraction system is inspired by contextualist narrative theory, using changes in writing style throughout a narrative to detect sentences containing important information; it outperforms both graph-based and neural network approaches to sentence extraction for this genre. Our paraphrasing system rewrites the extracted sentences into shorter, standalone summary sentences, learning to mimic the paraphrasing choices of human summarizers more closely than can traditional lexicon- or translation-based paraphrasing approaches.
We conclude with a chapter dedicated to summarizing non-English documents written in low-resource languages – documents that would otherwise be unreadable for English-speaking users. We develop a cross-lingual summarization system that performs even heavier editing and rewriting than does our personal narrative paraphrasing system; we create and train on large amounts of synthetic errorful translations of foreign-language documents. Our approach produces fluent English summaries from disdisfluent translations of non-English documents, and it generalizes across languages.
|
Page generated in 0.1662 seconds