141 |
Predicting the programming language of questions and snippets of stack overflow using natural language processing. Alrashedy, Kamel, 11 September 2018 (has links)
Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted in Stack Overflow usually contain a code snippet. Stack Overflow relies on users to properly tag the programming language of a question and assumes that the programming language of the snippets inside a question is the same as the tag of the question itself. In this thesis, a classifier is proposed to predict the programming language of questions posted in Stack Overflow using Natural Language Processing (NLP) and Machine Learning (ML). The classifier achieves an accuracy of 91.1% in predicting the 24 most popular programming languages by combining features from the title, body and code snippets of the question. We also propose a classifier that only uses the title and body of the question and has an accuracy of 81.1%. Finally, we propose a classifier of code snippets only that achieves an accuracy of 77.7%. Thus, deploying ML techniques on the combination of text and code snippets of a question provides the best performance. These results demonstrate that it is possible to identify the programming language of a snippet of only a few lines of source code. We visualize the feature space of two programming languages, Java and SQL, in order to identify some properties of the information inside the questions corresponding to these languages. / Graduate
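A minimal sketch of the kind of setup such a classifier implies, assuming scikit-learn, TF-IDF features, and a toy two-question dataset; it only illustrates combining title, body, and code text into one classifier and is not the thesis' actual model or data:

```python
# Illustrative sketch (not the thesis' exact setup): predict a question's
# programming language from its title, body, and code snippet by
# concatenating the three fields and training a TF-IDF + linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the thesis uses Stack Overflow questions tagged
# with one of the 24 most popular languages.
questions = [
    {"title": "How to reverse a list?", "body": "I have a list of numbers...",
     "code": "xs[::-1]", "lang": "python"},
    {"title": "NullPointerException on startup", "body": "My app crashes...",
     "code": "public static void main(String[] args)", "lang": "java"},
]
X = [" ".join([q["title"], q["body"], q["code"]]) for q in questions]
y = [q["lang"] for q in questions]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),  # word + bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
print(model.predict(["How do I join two DataFrames? df.merge(other)"]))
```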
|
142 |
Advancing Biomedical Named Entity Recognition with Multivariate Feature Selection and Semantically Motivated Features. January 2013 (has links)
abstract: Automating aspects of biocuration through biomedical information extraction could significantly impact biomedical research by enabling greater biocuration throughput and improving the feasibility of a wider scope. An important step in biomedical information extraction systems is named entity recognition (NER), where mentions of entities such as proteins and diseases are located within natural-language text and their semantic type is determined. This step is critical for later tasks in an information extraction pipeline, including normalization and relationship extraction. BANNER is a benchmark biomedical NER system using linear-chain conditional random fields and the rich feature set approach. A case study with BANNER locating genes and proteins in biomedical literature is described. The first corpus for disease NER adequate for use as training data is introduced, and employed in a case study of disease NER. The first corpus locating adverse drug reactions (ADRs) in user posts to a health-related social website is also described, and a system to locate and identify ADRs in social media text is created and evaluated. The rich feature set approach to creating NER feature sets is argued to be subject to diminishing returns, implying that additional improvements may require more sophisticated methods for creating the feature set. This motivates the first application of multivariate feature selection with filters and false discovery rate analysis to biomedical NER, resulting in a feature set at least 3 orders of magnitude smaller than the set created by the rich feature set approach. Finally, two novel approaches to NER by modeling the semantics of token sequences are introduced. The first method focuses on the sequence content by using language models to determine whether a sequence resembles entries in a lexicon of entity names or text from an unlabeled corpus more closely. The second method models the distributional semantics of token sequences, determining the similarity between a potential mention and the token sequences from the training data by analyzing the contexts where each sequence appears in a large unlabeled corpus. The second method is shown to improve the performance of BANNER on multiple data sets. / Dissertation/Thesis / Ph.D. Computer Science 2013
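The linear-chain CRF with a rich feature set, as used by systems like BANNER, can be pictured with the sketch below. BANNER itself is a Java system; the sklearn-crfsuite library, the single labeled sentence, and the handful of features shown here are illustrative stand-ins, not the actual BANNER feature set:

```python
# Rough sketch of linear-chain CRF NER with a small, hand-picked feature set.
import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i]
    feats = {
        "word.lower": w.lower(),   # lexical identity
        "word.isupper": w.isupper(),
        "word.isdigit": w.isdigit(),
        "prefix3": w[:3],          # crude morphology
        "suffix3": w[-3:],
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True        # beginning-of-sentence marker
    return feats

# Tiny hypothetical example: one sentence with IOB tags for gene mentions.
sent = ["BRCA1", "mutations", "increase", "cancer", "risk", "."]
labels = ["B-GENE", "O", "O", "O", "O", "O"]

X = [[token_features(sent, i) for i in range(len(sent))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```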
|
143 |
Concept Maps Mining for Text Summarization. AGUIAR, C. Z., 31 March 2017 (has links)
Abstract

Concept maps are graphical tools for the representation and construction of knowledge. Concepts and relationships form the basis of learning and, therefore, concept maps have been widely used in different situations and for different purposes in education, one of them being the representation of written text. Even a grammatically complex text can be represented by a concept map containing only concepts and relationships that represent what was expressed in a more complicated way.

However, the manual construction of a concept map requires considerable time and effort to identify and structure knowledge, especially when the map should represent not the concepts of the author's own cognitive structure but the concepts expressed in a text. Thus, several technological approaches have been proposed to facilitate the process of building concept maps from texts.

This dissertation therefore proposes a new approach for the automatic construction of concept maps as a summarization of scientific texts. The summarization is intended to produce a concept map as a condensed representation of the text while preserving its diverse and most important characteristics.

Summarization can make texts easier to understand, since students are trying to cope with the cognitive overload caused by the growing amount of textual information available today, a growth that can itself hinder knowledge construction. We therefore consider the hypothesis that summarizing a text as a concept map can highlight the characteristics that matter for assimilating the knowledge in the text, while reducing its complexity and the time needed to process it.

In this context, we conducted a literature review, covering the years 1994 to 2016, of approaches aimed at the automatic construction of concept maps from texts. From this review, we built a categorization to better identify and analyze the resources and characteristics of these technological approaches. We also sought to identify their limitations and to gather the best features of the related work in order to propose our own approach.

In addition, we present a Concept Map Mining process organized along four dimensions: Data Source Description, Domain Definition, Element Identification, and Map Visualization.

With the aim of developing a computational architecture to automatically build concept maps as a summarization of academic texts, this research resulted in the public tool CMBuilder, an online tool for the automatic construction of concept maps from texts, as well as a Java API called ExtroutNLP, which contains libraries for information extraction and public services.

To achieve this goal, we directed our efforts towards the areas of natural language processing and information retrieval. The central task is to extract propositions of the form (concept, relation, concept) from the text. Under this premise, the research introduces a pipeline comprising: grammatical rules and depth-first search for extracting concepts and relations from text; preposition mapping, anaphora resolution, and named-entity exploitation for labeling concepts; concept ranking based on element frequency analysis and map topology; and proposition summarization based on graph topology. The approach also proposes the use of supervised clustering and classification techniques, combined with a thesaurus, to define the domain of the text and to build a conceptual vocabulary of domains.

Finally, an objective analysis to validate the accuracy of the ExtroutNLP library yields 0.65 precision over the corpus. A subjective analysis to validate the quality of the concept maps built by CMBuilder shows 0.75/0.45 precision/recall for concepts and 0.57/0.23 precision/recall for relations in English, and 0.68/0.38 precision/recall for concepts and 0.41/0.19 precision/recall for relations in Portuguese. In addition, an experiment to check whether the concept map summarized by CMBuilder helps readers understand the subject of a text achieves 60% correct answers for maps extracted from short texts with multiple-choice questions, and 77% correct answers for maps extracted from long texts with open-ended questions.
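A heavily simplified sketch of the proposition-extraction step described above, i.e. pulling (concept, relation, concept) triples out of a dependency parse, follows. It assumes spaCy and its small English model and only covers the basic subject-verb-object case; the actual CMBuilder/ExtroutNLP pipeline with grammar rules, depth-first search, anaphora resolution, and ranking is far richer:

```python
# Minimal (concept, relation, concept) triple extraction from a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_propositions(text):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_propositions("Concept maps represent knowledge. Students build maps."))
```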
|
144 |
25 Challenges of Semantic Process Modeling. Mendling, Jan, Leopold, Henrik, Pittke, Fabian, January 2014 (has links) (PDF)
Process modeling has become an essential part of many organizations for documenting, analyzing, and redesigning their business operations and for supporting them with suitable information systems. In order to serve this purpose, it is important for process models to be well grounded in formal and precise semantics. While the behavioural semantics of process models are well understood, there is a considerable research gap concerning the semantic aspects of their text labels and natural language descriptions. The aim of this paper is to make this research gap more transparent. To this end, we clarify the role of textual content in process models and the challenges associated with the interpretation, analysis, and improvement of their natural language parts. More specifically, we discuss particular use cases of semantic process modeling to identify 25 challenges. For each challenge, we identify prior research and discuss directions for addressing it.
|
145 |
Automatic Supervised Thesauri Construction with Roget’s Thesaurus. Kennedy, Alistair H, January 2012 (has links)
Thesauri are important tools for many Natural Language Processing applications. Roget's Thesaurus is particularly useful. It is of high quality and has been in development for over a century and a half. Yet its applications have been limited, largely because the only publicly available edition dates from 1911. This thesis proposes and tests methods of automatically updating the vocabulary of the 1911 Roget’s Thesaurus.
I use the Thesaurus itself as a source of training data: the system learns from Roget’s in order to update Roget’s. The lexicon is updated in two stages. First, I develop a measure of semantic relatedness that enhances existing distributional techniques. I improve existing methods by using known sets of synonyms from Roget’s to train a distributional measure to better identify near synonyms. Second, I use the new measure of semantic relatedness to find where in Roget’s to place a new word. Existing words from Roget’s are used as training data to tune the parameters of three methods of inserting words. Over 5000 new words and word-senses were added using this process.
I conduct two kinds of evaluation on the updated Thesaurus. One is on the procedure for updating Roget’s. This is accomplished by removing some words from the Thesaurus and testing my system's ability to reinsert them in the correct location. Human evaluation of the newly added words is also performed. Annotators must determine whether a newly added word is in the correct location. They found that in most cases the new words were almost indistinguishable from those already existing in Roget's Thesaurus.
The second kind of evaluation is to establish the usefulness of the updated Roget’s Thesaurus on actual Natural Language Processing applications. These applications include determining semantic relatedness between word pairs or sentence pairs, identifying the best synonym from a set of candidates, solving SAT-style analogy problems, pseudo-word-sense disambiguation, and sentence ranking for text summarization. The updated Thesaurus consistently performed at least as well as, or better than, the original Thesaurus on all these applications.
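The placement step can be pictured roughly as follows: score each Roget's grouping by the average distributional relatedness between the new word and the grouping's existing members, then insert the word into the best-scoring grouping. The vectors and groupings below are toy stand-ins; the thesis' trained relatedness measure and its three insertion methods are not reproduced here:

```python
# Toy illustration: place a new word in the most related Roget's grouping.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word vectors (in practice, learned from a large corpus).
vectors = {
    "happy":    np.array([0.90, 0.10, 0.00]),
    "joyful":   np.array([0.80, 0.20, 0.10]),
    "sad":      np.array([-0.70, 0.10, 0.20]),
    "gloomy":   np.array([-0.80, 0.00, 0.30]),
    "cheerful": np.array([0.85, 0.15, 0.05]),
}
roget_groups = {
    "Cheerfulness": ["happy", "joyful"],
    "Dejection": ["sad", "gloomy"],
}

def best_group(new_word):
    # Average relatedness between the candidate word and each group's members.
    scores = {
        name: np.mean([cosine(vectors[new_word], vectors[w]) for w in members])
        for name, members in roget_groups.items()
    }
    return max(scores, key=scores.get)

print(best_group("cheerful"))  # expected: "Cheerfulness"
```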
|
146 |
Evaluating Text Segmentation. Fournier, Christopher, January 2013 (has links)
This thesis investigates the evaluation of automatic and manual text segmentation. Text segmentation is the process of placing boundaries within text to create segments according to some task-dependent criterion. An example of text segmentation is topical segmentation, which aims to segment a text according to the subjective definition of what constitutes a topic. A number of automatic segmenters have been created to perform this task, and the question that this thesis answers is how to select the best automatic segmenter for such a task. This requires choosing an appropriate segmentation evaluation metric, confirming the reliability of a manual solution, and then finally employing an evaluation methodology that can select the automatic segmenter that best approximates human performance.
A variety of comparison methods and metrics exist for comparing segmentations (e.g., WindowDiff, Pk), and all save a few are able to award partial credit for nearly missing a boundary. Those comparison methods that can award partial credit unfortunately lack consistency, symmetry, intuitiveness, and a host of other desirable qualities. This work proposes a new comparison method named boundary similarity (B) which is based upon a new minimal boundary edit distance to compare two segmentations. Near misses are frequent, even among manual segmenters (as is exemplified by the low inter-coder agreement reported by many segmentation studies). This work adapts some inter-coder agreement coefficients to award partial credit for near misses using the new metric proposed herein, B.
The methodologies employed by many works introducing automatic segmenters evaluate them simply by comparing their output to one manual segmentation of a text, often presenting nothing other than a series of mean performance values (with no standard deviation, standard error, or little if any statistical hypothesis testing). This work asserts that one segmentation of a text cannot constitute a “true” segmentation; specifically, one manual segmentation is simply one sample of the population of all possible segmentations of a text and of that subset of desirable segmentations. This work further asserts that the adapted inter-coder agreement statistics proposed herein should be used to determine the reproducibility and reliability of a coding scheme and set of manual codings, and that statistical hypothesis testing using the specific comparison methods and methodologies demonstrated herein should then be used to select the best automatic segmenter.
This work proposes new segmentation evaluation metrics, adapted inter-coder agreement coefficients, and methodologies. Most importantly, this work experimentally compares the state-of-the-art comparison methods to those proposed herein upon artificial data that simulates a variety of scenarios, and chooses the best one (B). The ability of adapted inter-coder agreement coefficients, based upon B, to discern between various levels of agreement in artificial and natural data sets is then demonstrated. Finally, a contextual evaluation of three automatic segmenters is performed using the state-of-the-art comparison methods and B, following the methodology proposed herein, to demonstrate the benefits and versatility of B as opposed to its counterparts.
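For orientation, here is a minimal sketch of WindowDiff, one of the existing comparison methods discussed above; the boundary similarity (B) metric proposed by the thesis is not reproduced here, and the window-size convention below is only an approximation of the usual one:

```python
# Minimal WindowDiff: slide a window over the gaps between units and count
# windows where the reference and hypothesis disagree on the boundary count.
def boundaries(masses):
    """Binary boundary vector over the N-1 gaps between units, from segment masses."""
    bits = []
    for m in masses[:-1]:
        bits.extend([0] * (m - 1) + [1])
    bits.extend([0] * (masses[-1] - 1))
    return bits

def window_diff(ref_masses, hyp_masses):
    ref, hyp = boundaries(ref_masses), boundaries(hyp_masses)
    n = len(ref)
    # Window size: roughly half the mean reference segment length.
    k = max(1, round(n / (2 * len(ref_masses))))
    errors = sum(
        1 for i in range(n - k + 1) if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return errors / (n - k + 1)

# Segmentations given as segment masses: [3, 2, 4] = segments of 3, 2, 4 units.
print(window_diff([3, 2, 4], [3, 3, 3]))
```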
|
147 |
Information Density and Persuasiveness in Naturalistic Data. January 2020 (has links)
abstract: Attitudes play a fundamental role when making critical judgments, and the extremity of people’s attitudes can be influenced by their emotions, beliefs, or past experiences and behaviors. Human attitudes and preferences are susceptible to social influence, and attempts to influence or change another person’s attitudes are pervasive in all societies. Given the importance of attitudes and attitude change, the current project investigated linguistic aspects of conversations that lead to attitude change by analyzing a dataset mined from Reddit’s Change My View (Priniski & Horne, 2018). Analysis of the data was done using Natural Language Processing (NLP), specifically information density, to predict attitude change. Top posts from Reddit’s Change My View (N = 510,149) were imported and processed in Python, and information density measures were computed. The results indicate that comments with higher information density are more likely to be awarded a delta and are perceived to be more persuasive. / Dissertation/Thesis / Masters Thesis Psychology 2020
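The abstract does not spell out the exact information density measure used; purely as an illustration of the idea, the sketch below computes one common proxy, average per-word surprisal under a smoothed unigram model, using a hypothetical background sample. This is an assumption, not the thesis' metric:

```python
# Illustrative proxy for information density: average per-word surprisal (bits)
# under an add-alpha smoothed unigram model. Higher = less predictable content.
import math
from collections import Counter

background = "people change their minds when they hear good arguments and clear evidence".split()
counts = Counter(background)
total = sum(counts.values())

def avg_surprisal(text, alpha=1.0):
    words = text.lower().split()
    vocab = len(counts) + 1  # +1 bucket for unseen words
    bits = [-math.log2((counts[w] + alpha) / (total + alpha * vocab)) for w in words]
    return sum(bits) / len(bits)

print(avg_surprisal("clear evidence and statistics change minds"))
print(avg_surprisal("good good good good"))
```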
|
148 |
Examination of Gender Bias in News Articles. Damin Zhang (11814182), 19 December 2021 (has links)
Reading news articles from online sources has become a major way of obtaining information for many people. Authors of news articles can introduce their own biases, either unintentionally or intentionally, by using or choosing to use different words to describe otherwise neutral and factual information. Such intentional word choices can create conflicts among different social groups, exposing explicit and implicit biases. Any type of bias within the text can affect the reader’s view of the information. One type of bias in natural language is gender bias, which has been discovered in many Natural Language Processing (NLP) models and is largely attributed to implicit biases in the training text corpora. Analyzing gender bias or stereotypes in such large corpora is a hard task. Previous methods of bias detection were applied to short texts like tweets and to manually built datasets, but little work has been done on long texts like news articles in large corpora. Simply detecting bias on annotated text does not help to understand how it was generated and reproduced. Instead, we used structural topic modeling on a large unlabelled corpus of news articles, incorporating qualitative results and quantitative analysis to examine how gender bias was generated and reproduced. This research extends prior knowledge of bias detection and proposes a method for understanding gender bias in real-world settings. We found that author gender correlates with topic-gender prevalence and that the skewed media-gender distribution assists in understanding gender bias within news articles.
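The thesis relies on structural topic modeling, which ties topic prevalence to document metadata such as author gender; that is usually done with the R stm package. As a rough Python stand-in only, the sketch below fits a plain LDA model and compares average topic prevalence across hypothetical author-gender labels:

```python
# Simplified stand-in for the prevalence-by-metadata comparison: plain LDA
# topic proportions averaged per author-gender group. Data is hypothetical.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

articles = [
    "the senator discussed the defense budget and foreign policy",
    "the actress spoke about fashion week and her new film",
    "the coach praised the team after the championship game",
    "the designer unveiled a new collection at the show",
]
author_gender = np.array(["m", "f", "m", "f"])  # hypothetical labels

counts = CountVectorizer(stop_words="english").fit_transform(articles)
doc_topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

for g in ("m", "f"):
    prevalence = doc_topics[author_gender == g].mean(axis=0)
    print(g, prevalence.round(3))  # per-topic prevalence for this group
```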
|
149 |
Emergency Medical Service EMR-Driven Concept Extraction From Narrative Text. George, Susanna Serene, 08 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / In the midst of a pandemic, with cases ranging from patients whose minor symptoms quickly become fatal to patients in situations like a STEMI heart attack or a fatal accident injury, the importance of medical research to improve speed and efficiency in patient care has increased. As researchers in the computing domain work to apply automation in assisting first responders, decreasing the cognitive load on the field crew, reducing the time taken to document each patient case, and improving the accuracy of report details have been priorities.
This paper presents an information extraction algorithm that custom-engineers existing extraction techniques built on the principles of natural language processing, such as MetaMap, together with a syntactic dependency parser such as spaCy for analyzing sentence structure, and regular expressions for recurring patterns, to retrieve patient-specific information from medical narratives. These concept-value pairs automatically populate the fields of an EMR form, which can be reviewed and modified manually if needed. The report can then be reused for various medical and billing purposes related to the patient.
|
150 |
A Hybrid Approach to General Information Extraction. Grap, Marie Belen, 01 September 2015 (has links)
Information Extraction (IE) is the process of analyzing documents and identifying desired pieces of information within them. Many IE systems have been developed over the last couple of decades, but there is still room for improvement as IE remains an open problem for researchers. This work discusses the development of a hybrid IE system that attempts to combine the strengths of rule-based and statistical IE systems while avoiding their unique pitfalls in order to achieve high performance for any type of information on any type of document. Test results show that this system operates competitively in cases where target information belongs to a highly-structured data type and when critical contextual information is in close proximity to the target.
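A minimal sketch of the hybrid idea, with a high-precision rule tried first and a statistical classifier as a fallback; the regex, the toy classifier, and the data are illustrative stand-ins, not the system described in the thesis:

```python
# Hybrid extraction sketch: structured rule first, statistical fallback second.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

DATE_RULE = re.compile(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b")  # e.g. "11 September 2018"

# Hypothetical statistical fallback: flag sentences likely to contain a date.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(["meeting on the fifth of march", "no temporal info here"], [1, 0])

def extract_date(sentence):
    m = DATE_RULE.search(sentence)
    if m:                            # rule-based path: high precision
        return m.group(0)
    if clf.predict([sentence])[0]:   # statistical path: recall for looser phrasing
        return "date-like mention (needs review)"
    return None

print(extract_date("Defended on 11 September 2018."))
print(extract_date("submitted on the fifth of march"))
```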
|