311

A SENTIMENT BASED AUTOMATIC QUESTION-ANSWERING FRAMEWORK

Qiaofei Ye (6636317) 14 May 2019 (has links)
With the rapid growth and maturation of the Question-Answering (QA) field, non-factoid QA tasks are in high demand. However, existing QA systems are either fact-based or highly keyword-dependent and hard-coded. Moreover, if QA is to become more personable, the sentiment of the question and answer should be taken into account. Yet little research has been done on non-factoid QA systems grounded in sentiment analysis, which would enable a system to retrieve answers in a more emotionally intelligent way. This study investigates to what extent prediction of the best answer can be improved by adding an extended representation of sentiment information to non-factoid Question-Answering.
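A minimal, hypothetical sketch of the idea: compute sentiment scores for the question and each candidate answer and let the ranking step reward candidates whose sentiment agrees with the question's. The use of NLTK's VADER analyzer, the feature, and the 0.5 weight are illustrative assumptions, not the author's actual framework.

```python
# Hypothetical sketch: sentiment-aware re-ranking of candidate answers.
# VADER, the gap feature, and the 0.5 weight are illustrative choices.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-off lexicon fetch
sia = SentimentIntensityAnalyzer()

def sentiment_gap(question: str, answer: str) -> float:
    """Absolute difference of VADER compound scores (0 = same tone)."""
    q = sia.polarity_scores(question)["compound"]
    a = sia.polarity_scores(answer)["compound"]
    return abs(q - a)

def rerank(question, candidates, base_scores, weight=0.5):
    """Demote answers whose sentiment diverges from the question's."""
    scored = [(base - weight * sentiment_gap(question, ans), ans)
              for ans, base in zip(candidates, base_scores)]
    return [ans for _, ans in sorted(scored, reverse=True)]

print(rerank("Why is my flight always delayed? This is awful.",
             ["That is frustrating; delays often come from crew routing.",
              "Delays occur."],
             base_scores=[0.6, 0.6]))
```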
312

Using Machine Learning to Learn from Bug Reports : Towards Improved Testing Efficiency

Ingvarsson, Sanne January 2019 (has links)
The evolution of a software system originates in its changes, whether they come from changed user needs or adaptation to its current environment. These changes are as encouraged as they are inevitable, yet every change to a software system carries a risk of introducing an error or a bug. This thesis aimed to investigate the possibilities of using bug-report descriptions as a decision basis for detecting the provenance of a bug by means of machine learning. K-means and agglomerative clustering were applied to free-text documents, using Natural Language Processing, to initially divide the investigated software system into subparts. Topic labelling was then performed on the resulting clusters to find suitable names and gain an overall understanding of them. Finally, it was investigated whether it was possible to identify which clusters were more likely to cause a bug and should therefore be tested more thoroughly. By evaluating a subset of known causes, it was found that possible direct connections could be identified in 50% of the cases, a number that increased to 58% when the causes were attached to clusters.
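The following is a minimal sketch of such a clustering pipeline, assuming bug-report descriptions as plain strings; the cluster count, TF-IDF settings, and toy data are illustrative stand-ins for the thesis's setup.

```python
# Minimal sketch: TF-IDF over bug-report text, k-means and agglomerative
# clustering, then crude topic labels from top centroid terms.
# The reports and k=2 are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering

reports = [
    "crash when saving a large file",
    "login button unresponsive after timeout",
    "saving a file corrupts the header",
    "timeout error shown on the login page",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reports)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
agglo = AgglomerativeClustering(n_clusters=2).fit(X.toarray())

# Crude topic labelling: the heaviest TF-IDF terms per k-means centroid.
terms = vectorizer.get_feature_names_out()
for c, centroid in enumerate(kmeans.cluster_centers_):
    top = centroid.argsort()[-3:][::-1]
    print(f"cluster {c}: {[terms[i] for i in top]}")
print("agglomerative labels:", agglo.labels_)
```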
313

The Axiology of Necrologies: Using Natural Language Processing to Examine Values in Obituaries

Levernier, Jacob 01 May 2017 (has links)
This dissertation is centrally concerned with exploring obituaries as repositories of values. Obituaries are a publicly-available natural language source that are variably written for members of communities that are wide (nation-level) and narrow (city-level, or at the level of specific groups therein). Because they are explicitly summative, limited in size, and written for consumption by a public audience, obituaries may be expected to express concisely the aspects of their subjects' lives that the authors (often family members living in the same communities) found most salient or worthy of featuring. 140,599 obituaries nested in 832 newspapers from across the USA were scraped with permission from Legacy.com, an obituaries publisher. Obituaries were coded for the age at death and gender (female/male) of the deceased using automated algorithms. For each publishing newspaper, county-level median income, educational achievement (operationalized as percent of the population with a Bachelor's degree or higher), and race and ethnicity were averaged across counties, weighting by population size. A Neo4j graph database was constructed using WordNet and the University of South Florida Free Association Norms datasets. Each word in each obituary in the corpus was lemmatized. The shortest path through the WordNet graph from each lemma to 30 Schwartz value prototype words published by Bardi, Calogero, and Mullen (2008) was then recorded. From these path lengths, a new measure, "word-by-hop," was calculated for each Schwartz value to reflect the relative lexical distance between each obituary and that Schwartz value. Of the Schwartz values, Power, Conformity, and Security were most indicated in the corpus, while Universalism, Hedonism, and Stimulation were least indicated. A series of nine two-level regression models suggested that, across Schwartz values, newspaper community accounted for the greatest amount of word-by-hop variability in the corpus. The best-fitting model indicated a small, negative effect of female status across Schwartz values. Unexpectedly, Hedonism and Conformity, which had conceptually opposite prototype words, were highly correlated, possibly indicating that obituary authors "compensate" for describing the deceased in a hedonistic way by concurrently emphasizing restraint. Future research could usefully further expand word-by-hop and incorporate individual-level covariates that match the newspaper-level covariates used here.
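A rough sketch of a word-by-hop style measure appears below, simplified to NLTK's WordNet path distances (the dissertation traversed a Neo4j graph that also folded in the USF Free Association Norms); the prototype words and the inverse-distance aggregation are assumptions for illustration.

```python
# Rough sketch of a word-by-hop style measure using WordNet paths only.
# Prototype words and the aggregation formula are assumptions.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

POWER_PROTOTYPES = ["authority", "wealth"]  # illustrative prototype words

def min_hops(word, prototype):
    """Shortest WordNet path between any synsets of the two words."""
    dists = [s1.shortest_path_distance(s2)
             for s1 in wn.synsets(word)
             for s2 in wn.synsets(prototype)]
    dists = [d for d in dists if d is not None]
    return min(dists) if dists else None

def word_by_hop(lemmas, prototypes):
    """Mean inverse distance: closer lemmas indicate the value more."""
    hops = [min_hops(w, p) for w in lemmas for p in prototypes]
    hops = [h for h in hops if h is not None]
    return sum(1.0 / (1 + h) for h in hops) / len(hops) if hops else 0.0

print(word_by_hop(["executive", "devoted"], POWER_PROTOTYPES))
```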
314

Désignations nominales des événements : étude et extraction automatique dans les textes / Nominal designation of events : study and automatic extraction in texts

Arnulphy, Béatrice 02 October 2012 (has links)
The aim of my PhD thesis is the study of nominal designations of events for automatic extraction. My work belongs to natural language processing, in a multidisciplinary approach involving linguistics and computer science. The goal of information extraction is to analyse natural-language documents and extract from them the information useful to a particular application. Toward this general goal, many information extraction campaigns have been run: for each event considered, the task is to extract certain related information (participants, dates, numbers, etc.). From the outset, these challenges have been closely tied to named entities (the "notable" elements of texts, such as names of people or places). All of this information forms a whole around the event. Yet such work pays little attention to the words used to describe the event itself (particularly when the description is a noun). The event is seen as an all-encompassing whole, as the quantity and quality of the information that composes it. Unlike work in general information extraction, our main interest lies solely in how the events that occur are named, and particularly in the nominal designation used. For us, an event is what happens, what is worth talking about. The most important events become the subject of press articles or appear in history books. An event can be evoked by a verbal or a nominal description. In this thesis, we reflected on the notion of event. We observed and compared the different views presented in the state of the art, up to constructing a definition of the event and a general typology of events suited both to our work and to the nominal designation of events. From our corpus studies we also identified different formation types for these event nouns, and we show that each type can be ambiguous in various ways. For all of these studies, building an annotated corpus is an indispensable step, so we took the opportunity to write an annotation guide dedicated to nominal designations of events. We studied the relevance and quality of existing lexicons for use in our automatic extraction task. Using extraction rules, we also examined the cotext in which nouns appear in order to determine their eventness. Following these studies, we extracted a lexicon weighted by eventness (whose particularity is to be dedicated to the extraction of event nouns), which reflects the fact that some nouns are more likely than others to denote events. Used as a cue for extracting event nouns, this weighting makes it possible to extract nouns that are absent from existing standard lexicons. Finally, by means of machine learning, we worked on contextual learning features, partly based on syntax, for extracting event nouns.
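As an illustration of the weighted-lexicon idea, the sketch below scores candidate nouns by an assumed eventness weight plus a crude cotext cue; the weights, cue list, and threshold are invented for the example, whereas the thesis derived its lexicon and rules from corpus study.

```python
# Illustrative use of an eventness-weighted lexicon with a crude
# cotext cue. All values here are invented for the example.
EVENTNESS = {"explosion": 0.95, "meeting": 0.80, "table": 0.05}
CONTEXT_CUES = {"during", "after", "before"}  # prepositions that suggest events

def is_event_mention(noun, left_word, threshold=0.5):
    """Combine lexicon weight with a small bonus from the left cotext."""
    score = EVENTNESS.get(noun.lower(), 0.0)
    if left_word.lower() in CONTEXT_CUES:
        score += 0.2  # hedged cotext bonus
    return score >= threshold

print(is_event_mention("meeting", "during"))  # True
print(is_event_mention("table", "on"))        # False
```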
315

Génération automatique de questionnaires à choix multiples pédagogiques : évaluation de l'homogénéité des options / Automatic generation of educational multiple-choice questions : evaluation of option homogeneity

Pho, Van-Minh 24 September 2015 (has links)
Recent years have seen a revival of Intelligent Tutoring Systems. For these systems to be widely used by teachers and learners, they must provide means of assisting teachers in their task of exercise generation. Among these exercises, multiple-choice tests are very common. However, writing multiple-choice questions (MCQs) that correctly assess a learner's level is a complex task. Guidelines have been developed for writing MCQ items manually, but an automatic evaluation of item quality would be a useful tool for teachers. We are interested in the automatic evaluation of distractor (wrong answer choice) quality. To this end, we studied the characteristics of relevant distractors drawn from MCQ writing guidelines. This study led us to consider homogeneity between the distractors and the answer as an important criterion for validating distractors. Homogeneity is both syntactic and semantic. We validated this definition of homogeneity through an analysis of an MCQ corpus, and from this analysis we proposed methods for the automatic recognition of syntactic and semantic homogeneity. We then focused on the semantic homogeneity of distractors. To estimate it automatically, we proposed a machine-learned ranking model that combines different measures of semantic homogeneity. Evaluation of the model showed that our method estimates distractor semantic homogeneity more effectively than existing work.
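A simplified sketch of such a model follows: a pointwise logistic-regression scorer over several homogeneity measures, used to rank distractor candidates. The features and toy data are placeholders; the thesis combined real semantic homogeneity measures in a learned ranking model.

```python
# Simplified pointwise stand-in for the learned ranking model: logistic
# regression over homogeneity measures. Feature values are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: [embedding_cosine, wordnet_similarity, same_pos] per
# (answer, distractor) pair; label 1 = judged homogeneous/valid.
X_train = np.array([[0.8, 0.7, 1], [0.2, 0.1, 0],
                    [0.6, 0.5, 1], [0.1, 0.3, 0]], dtype=float)
y_train = np.array([1, 0, 1, 0])

ranker = LogisticRegression().fit(X_train, y_train)

candidates = {"distractor_a": [0.7, 0.6, 1], "distractor_b": [0.3, 0.2, 0]}
scores = {d: ranker.predict_proba([f])[0, 1] for d, f in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True))  # best-ranked first
```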
316

Drug Name Recognition in Reports on Concomitant Medication

Gräns, Arvid January 2019 (has links)
This thesis evaluates whether and how drug name recognition can be used to find drug names in verbatims from reports on concomitant medication in clinical trial studies. In clinical trials, reports on concomitant medication are written when a trial participant takes drugs other than the studied drug. This information needs to be coded against a drug reference dictionary. Coded verbatims were used to create the data needed to train the drug name recognition models in this thesis. Labels marking where in each verbatim the coded drug's name occurred were created using the Levenshtein distance. The drug name recognition models were trained and tested on the labelled verbatims. Drug name recognition was performed using a logistic regression model and a bidirectional long short-term memory (BiLSTM) model. The BiLSTM model achieved the best result, with an F1-score of 82.5% on classifying which words in the verbatims were drug names. When the results were examined case by case, the BiLSTM classifications sometimes outperformed the labels it was trained on in single-word verbatims. The model was also tested on manually labelled gold-standard data, where it achieved an F1-score of 46.4%. The results indicate that a bidirectional long short-term memory model can be implemented for drug name recognition, but that label reliability is an issue in this thesis.
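The label-creation step can be sketched as follows: tag each verbatim token as a drug name if its Levenshtein distance to the coded drug name falls under a cutoff. The whitespace tokenisation and the cutoff of 2 are assumptions for illustration.

```python
# Sketch of label creation via edit distance; cutoff is an assumption.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def label_tokens(verbatim: str, coded_drug: str, max_dist: int = 2):
    """Tag each token 1 (drug) / 0 (other) by distance to the coded name."""
    return [(tok, int(levenshtein(tok.lower(), coded_drug.lower()) <= max_dist))
            for tok in verbatim.split()]

print(label_tokens("patient took asprin daily", "aspirin"))
# [('patient', 0), ('took', 0), ('asprin', 1), ('daily', 0)]
```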
317

Domain Adaptation for Hypernym Discovery via Automatic Collection of Domain-Specific Training Data / Domänanpassning för identifiering av hypernymer via automatisk insamling av domänspecifikt träningsdata

Palm Myllylä, Johannes January 2019 (has links)
Identifying semantic relations in natural language text is an important component of many knowledge extraction systems. This thesis studies the task of hypernym discovery, i.e. discovering terms that are related by the hypernymy (is-a) relation. Specifically, it explores how state-of-the-art methods for hypernym discovery perform when applied in specific language domains. Current state-of-the-art methods are mostly supervised machine learning models that leverage distributional word representations such as word embeddings. These models require labelled training data in the form of term pairs that are known to be related by hypernymy. Such labelled training data is often not available when working with a specific language domain. This thesis presents experiments with an automatic training-data collection algorithm, which leverages a pre-defined domain-specific vocabulary and the lexical resource WordNet to extract training pairs automatically. The thesis contributes experimental results from attempting to leverage such automatically collected domain-specific training data for domain adaptation. Experiments are conducted in two domains: one with a large amount of text data and another with a much smaller amount. Results show that the automatically collected training data has a positive impact on performance in both domains. The performance boost is most significant in the domain with a large amount of text data, with mean average precision increasing by up to 8 points.
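The collection algorithm can be sketched as follows, assuming a given domain vocabulary and NLTK's WordNet interface; the vocabulary below is an invented stand-in.

```python
# Sketch of automatic pair collection: for every vocabulary term, pull
# hypernyms from WordNet to form (hyponym, hypernym) training pairs.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

domain_vocabulary = ["aspirin", "ibuprofen", "insulin"]  # assumed given

def collect_training_pairs(vocabulary):
    """Yield (hyponym, hypernym) term pairs via WordNet hypernym links."""
    for term in vocabulary:
        for synset in wn.synsets(term, pos=wn.NOUN):
            for hyper in synset.hypernyms():
                for lemma in hyper.lemma_names():
                    yield term, lemma.replace("_", " ")

for hypo, hyper in collect_training_pairs(domain_vocabulary):
    print(f"{hypo} -> {hyper}")
```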
318

Matching events and activities by integrating behavioral aspects and label analysis

Baier, Thomas, Di Ciccio, Claudio, Mendling, Jan, Weske, Mathias 05 1900 (has links) (PDF)
Nowadays, business processes are increasingly supported by IT services that produce massive amounts of event data during the execution of a process. These event data can be used to analyze the process with process mining techniques to discover the real process, measure conformance to a given process model, or enhance existing models with performance information. Mapping the produced events to activities of a given process model is essential for conformance checking, annotation and understanding of process mining results. In order to accomplish this mapping with low manual effort, we developed a semi-automatic approach that maps events to activities using insights from behavioral analysis and label analysis. The approach extracts Declare constraints from both the log and the model to build matching constraints that efficiently reduce the number of possible mappings. These mappings are further reduced using techniques from natural language processing, which allow for a matching based on labels and external knowledge sources. The evaluation with synthetic and real-life data demonstrates the effectiveness of the approach and its robustness to non-conforming execution logs.
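The label-analysis side of the approach can be illustrated with a toy pruning step that keeps only event-to-activity pairs whose labels are sufficiently similar; the names and the similarity cutoff are assumptions, and the real approach additionally applies Declare behavioural constraints.

```python
# Toy illustration of label-based pruning of candidate mappings.
# Cutoff and labels are assumptions for the example.
from difflib import SequenceMatcher

events = ["create order", "ship goods"]
activities = ["Order Creation", "Goods Shipment", "Invoice Payment"]

def candidate_mappings(events, activities, cutoff=0.4):
    """Yield (event, activity, similarity) pairs above the cutoff."""
    for e in events:
        for a in activities:
            sim = SequenceMatcher(None, e.lower(), a.lower()).ratio()
            if sim >= cutoff:
                yield e, a, round(sim, 2)

for mapping in candidate_mappings(events, activities):
    print(mapping)
```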
319

Extração de informações de narrativas clínicas / Clinical reports information retrieval

Oleynik, Michel 02 October 2013 (has links)
Clinical reports are usually written in natural language because of its descriptive power and ease of communication among specialists. Processing these data for knowledge discovery and statistics collection requires information extraction techniques, with some results already reported in the literature for the newswire domain but still rare in the medical domain. This work develops a classifier of pathology reports able to infer the topography and morphology of a cancer in the International Classification of Diseases for Oncology (ICD-O). Data provided by the A.C. Camargo Cancer Center in São Paulo were used for training and validation. Natural language processing (NLP) techniques combined with Bayesian classifiers were explored in pursuit of information retrieval quality, evaluated by the F2-score. Scores above 74% for the topography group and 61% for the morphology group are reported, with only a small contribution from the NLP and smoothing techniques. The results corroborate similar studies and demonstrate the need to retrain NLP tools on the medical domain.
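A small sketch in the spirit of the described setup: a naive Bayes classifier over report text, evaluated with the recall-weighted F2-score. The tiny dataset and class labels are invented placeholders, not the A.C. Camargo data.

```python
# Sketch: Bayesian text classification of pathology reports, scored
# with F2. Reports and classes are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import fbeta_score

reports = ["ulcerated gastric lesion", "infiltrating ductal tumour",
           "gastric mucosa fragment", "ductal carcinoma in situ"]
topography = ["stomach", "breast", "stomach", "breast"]  # ICD-O-like classes

X = CountVectorizer().fit_transform(reports)
clf = MultinomialNB(alpha=1.0).fit(X, topography)  # alpha = Laplace smoothing

pred = clf.predict(X)
print(fbeta_score(topography, pred, beta=2, average="macro"))
```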
320

Uma abordagem conexionista para anotação de papéis semânticos / A connectionist approach to semantic role labeling

Fonseca, Erick Rocha 10 April 2013 (has links)
Semantic role labeling (SRL) is a subfield of Natural Language Processing (NLP) that began to be explored for English in 2002. Its goal is to detect predicate-argument structures in written sentences, which correspond to descriptions of events (usually expressed by verbs); their participants, such as agent and patient; and circumstances, such as time, place, etc. Many NLP applications, such as machine translation and information retrieval, have improved their performance by employing SRL as a pre-processing step. For Portuguese, advances in SRL research are still at a very early stage. Given that the vast majority of work in this area employs supervised machine learning, a limiting factor has been the absence of labelled data in Portuguese, a problem only recently and partially solved with the creation of PropBank-Br. This resource follows the annotation model used in PropBank, the main labelled dataset employed for SRL in English. Even so, PropBank-Br contains less than one tenth of the data instances present in the original PropBank. Another point to note is that the most common approach to SRL relies on extracting a large amount of linguistic information from the input sentences for use by automatic classifiers. Such an approach is extremely dependent on other NLP tools, a particularly undesirable feature for Portuguese, which does not have many freely available resources. In contrast, another successful approach in the literature forgoes explicit linguistic knowledge and associates words with numeric vectors whose values are adjusted during the training of an artificial neural network. These vectors are then used by the network to perform SRL, and can also serve other NLP tasks. The present work followed the second method. Modifications to this method were implemented and yielded a performance gain over its original version when tested on PropBank-Br. The final version of the developed system is ready for use and can support NLP research in Portuguese.
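A toy sketch of the embedding-based approach: each word maps to a learned vector, and a small network tags tokens in a context window with role labels. The sizes, role set, and untrained forward pass are illustrative (SENNA-style), not the thesis's exact architecture.

```python
# Toy window-based role tagger over word embeddings; untrained, so the
# output roles are arbitrary. Sizes and labels are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"o": 0, "menino": 1, "comeu": 2, "bolo": 3}
ROLES = ["A0", "V", "A1", "O"]

E = rng.normal(size=(len(vocab), 50))      # word embeddings (trainable)
W = rng.normal(size=(3 * 50, len(ROLES)))  # window of 3 tokens -> roles

def tag(tokens):
    ids = [vocab[t] for t in tokens]
    padded = [ids[0]] + ids + [ids[-1]]    # crude edge padding
    out = []
    for i in range(1, len(padded) - 1):
        window = np.concatenate([E[padded[i - 1]], E[padded[i]], E[padded[i + 1]]])
        out.append(ROLES[int(np.argmax(window @ W))])
    return out

print(tag(["o", "menino", "comeu", "bolo"]))  # untrained -> arbitrary roles
```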
