About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.

Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1041

A study of the use of natural language processing for conversational agents

Wilkens, Rodrigo Souza January 2016
Language is a mark of humanity and consciousness, and conversation (or dialogue) is one of the most fundamental forms of communication that we learn as children. One way to make a computer more attractive for interaction with users is therefore through the use of natural language. Among the systems developed with some degree of language capability, the Eliza chatterbot is probably the first with a focus on dialogue. To make the interaction more interesting and useful to the user, there are other approaches besides chatterbots, such as conversational agents. These agents generally have, to some degree, properties such as: a body (with cognitive states, including beliefs, desires and intentions or goals); interactive embodiment in the real or virtual world (including perception of events, communication, and the ability to manipulate the world and communicate with other agents); and human-like behavior (including affective abilities). This type of agent has been called by several names, including animated agents or embodied conversational agents (ECA). A dialogue system has six basic components. (1) The speech recognition component is responsible for translating the user's speech into text. (2) The natural language understanding component produces a semantic representation suitable for dialogue, usually using grammars and ontologies. (3) The task manager chooses the concepts to be expressed to the user. (4) The natural language generation component defines how to express these concepts in words. (5) The dialogue manager controls the structure of the dialogue. (6) The speech synthesizer is responsible for translating the agent's answer into speech. However, there is no consensus about the resources needed to develop conversational agents or about the difficulty involved (especially for resource-poor languages). This work focuses on the influence of the natural language components (understanding and dialogue management) and analyses, in particular, the use of parsing systems as part of developing conversational agents with more flexible language capabilities. It analyses which parser resources contribute to conversational agents and discusses how to develop them, targeting Portuguese, a resource-poor language. To do so, we analyze approaches to natural language understanding and identify parsing approaches that offer good performance; based on this analysis, we develop a prototype to evaluate the impact of using a parser in a conversational agent.
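The six-component architecture described in this abstract can be illustrated with a minimal sketch. The class names, the toy intent rules, and the canned responses below are illustrative assumptions, not the thesis's implementation; speech recognition and synthesis are stubbed as plain text I/O.

```python
class NLUnderstanding:
    """(2) Maps the user's text to a crude semantic frame (intent + entities)."""
    def parse(self, text):
        lowered = set(text.lower().split())
        intent = "greet" if {"hello", "hi"} & lowered else "ask_weather"
        entities = [t for t in text.split()
                    if t.istitle() and t.lower() not in {"hello", "hi", "what"}]
        return {"intent": intent, "entities": entities}

class TaskManager:
    """(3) Chooses which concept should be expressed to the user."""
    def decide(self, frame):
        return "greeting" if frame["intent"] == "greet" else "weather_report"

class NLGeneration:
    """(4) Defines how to express the chosen concept in words."""
    def realize(self, concept, entities):
        if concept == "weather_report":
            place = " ".join(entities) or "your location"
            return f"I am afraid I have no weather data for {place} yet."
        return "Hello! How can I help you?"

class DialogueManager:
    """(5) Controls the dialogue flow; (1) ASR and (6) TTS are stubbed as text I/O."""
    def __init__(self):
        self.nlu, self.tm, self.nlg = NLUnderstanding(), TaskManager(), NLGeneration()
    def turn(self, user_text):
        frame = self.nlu.parse(user_text)
        concept = self.tm.decide(frame)
        return self.nlg.realize(concept, frame["entities"])

if __name__ == "__main__":
    dm = DialogueManager()
    print(dm.turn("Hello there"))
    print(dm.turn("What is the weather in Porto Alegre"))
```

A real parser-backed agent would replace the keyword intent rule with syntactic analysis, which is exactly the component whose impact the thesis evaluates.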
1042

Uma abordagem semiautomática para identificação de elementos de processo de negócio em texto de linguagem natural / A semi-automatic approach to identify business process elements in natural language text

Ferreira, Renato César Borges January 2017
To enable effective business process management, the first step is the design of process models suited to the organization's objectives. These models are used to describe the roles and responsibilities of employees in an organization. In addition, business process modeling is very important to document, understand and automate processes. However, the documentation about such processes that exists in organizations is mostly unstructured and difficult for analysts to understand. In this scenario, process modeling becomes time-consuming and expensive, and may produce process models that do not reflect the reality of the organization. Extracting process models or fragments from textual descriptions may help minimize the effort required in process modeling. In this context, this dissertation proposes a semi-automatic approach to identify business process elements in natural language text. Based on the study of natural language processing, a set of mapping rules was defined to identify process elements in textual descriptions. In addition, in order to evaluate the mapping rules and demonstrate the feasibility of the proposed approach, a prototype was developed that identifies process elements in text in a semi-automatic way. To measure the performance of the prototype, information retrieval metrics were used, namely precision, recall and F-measure. In addition, two questionnaires were applied to verify acceptance by users. The evaluations present promising results: the analysis of 70 texts yielded, on average, 73.61% precision, 70.15% recall and 71.82% F-measure, and the results of the first and second questionnaires showed, on average, 91.66% acceptance among the participants. The main contribution of this work is to propose mapping rules for identifying process elements in natural language text, in order to support process analysts and minimize the time required for process modeling.
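The mapping-rule idea and the reported evaluation metrics can be sketched briefly. The one-rule extractor and the tiny verb lexicon below are illustrative assumptions, not the dissertation's actual rule set; the precision/recall/F-measure computation is the standard one.

```python
# Toy "mapping rule": a sentence starting with an action verb yields an activity.
ACTION_VERBS = {"send", "check", "approve", "register", "notify", "create"}

def extract_activities(sentences):
    found = []
    for s in sentences:
        first = s.split()[0].lower()
        if first in ACTION_VERBS:
            found.append(s.rstrip("."))
    return found

def prf(predicted, gold):
    """Precision, recall and F-measure over sets of extracted elements."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

sentences = ["Check the customer order.",
             "The clerk is responsible for invoices.",
             "Notify the customer about the delivery."]
gold = ["Check the customer order", "Notify the customer about the delivery"]
print(prf(extract_activities(sentences), gold))   # (1.0, 1.0, 1.0) on this toy input
```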
1043

Academic Recommendation System Based on the Similarity Learning of the Citation Network Using Citation Impact

Alshareef, Abdulrhman M. 29 April 2019 (has links)
Given today's large and rapidly increasing volume of scientific publications, exploring recent studies in a given research area and building effective scientific collaborations have become more challenging than ever before. The growth of scientific production has made it harder to identify the most relevant papers to cite, or to find an appropriate conference or journal to which to submit a paper. As a result, authors and publishers rely on different analytical approaches to measure relationships within the citation network. Different parameters have been used, such as the impact factor, the number of citations, and co-citation, to assess the impact of a research publication. However, using a single assessment factor considers only one level of relationship exploration, since it does not reflect the effect of the other factors. In this thesis, we propose an approach to measure Academic Citation Impact that helps identify the impact of articles, authors, and venues in their extended nearby citation network. We combine content similarity with bibliometric indices to evaluate the citation impact of articles, authors, and venues in their surrounding citation network. Using article metadata, we calculate the semantic similarity between any two articles in the extended network. We then use the similarity score and bibliometric indices to evaluate the impact of the articles, authors, and venues within their extended nearby citation network. Furthermore, we propose an academic recommendation model that identifies the latent preferences in the citation network of a given article, in order to expose concealed connections between the academic objects (articles, authors, and venues) in that network. To reveal the degree of trust for collaboration between academic objects, we use similarity learning to estimate a collaborative confidence score that represents the anticipation of a prospective relationship between academic objects within a scientific community. We conducted an offline experiment on real-world datasets to measure the accuracy of delivering personalized recommendations based on the user's selection preferences. Our evaluation results show a potential improvement in recommendation quality compared to baseline recommendation algorithms that consider co-citation information.
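As a rough illustration of blending content similarity with a bibliometric signal, the sketch below ranks candidate articles by a weighted combination of abstract similarity and normalized citation count. The 0.7/0.3 weights, the bag-of-words cosine, and the toy records are assumptions made for this example, not the model proposed in the thesis.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two short texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def citation_impact(candidate, query_abstract, max_citations, w_sim=0.7, w_cit=0.3):
    """Blend abstract similarity with a normalized citation count."""
    sim = cosine(candidate["abstract"], query_abstract)
    cit = candidate["citations"] / max_citations if max_citations else 0.0
    return w_sim * sim + w_cit * cit

articles = [
    {"title": "Neural parsing", "abstract": "neural network parsing of text", "citations": 120},
    {"title": "Graph drawing", "abstract": "planar graph drawing algorithms", "citations": 300},
]
query = "parsing text with neural networks"
max_cit = max(a["citations"] for a in articles)
ranked = sorted(articles, key=lambda a: citation_impact(a, query, max_cit), reverse=True)
print([a["title"] for a in ranked])   # content similarity outweighs raw citations here
```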
1044

Sumarização multidocumento com base em aspectos informativos / Multidocument summarization based on information aspects

Garay, Alessandro Yovan Bokan 20 August 2015
Multi-document summarization is the task of automatically producing a single summary from a collection of texts on the same topic. Given the huge amount of information available on the Web, this task is highly relevant because it can ease reading for users. Informative aspects represent the basic information units present in texts and summaries; for example, news texts reporting an event typically convey the following information: what happened, where it happened, when it happened, how it happened, and why it happened. Knowing these aspects and the strategies for producing and organizing summaries, it is possible to automate aspect-based summarization. However, there was no research on aspect-based summarization for Brazilian Portuguese. This master's research therefore investigates multi-document summarization methods based on informative aspects, following the deep approach to summarization, which aims to interpret the texts in order to produce more informative summaries. In particular, two related stages were carried out: (i) the automatic identification of informative aspects and (ii) the development and evaluation of two summarization methods based on aspect patterns (or templates) for summaries. In stage (i), aspect classifiers were built using semantic role labeling, named entity recognition, handcrafted rules and machine learning techniques. The classifiers were evaluated on the annotated CSTNews corpus (Rassi et al., 2013; Felippo et al., 2014). The results were satisfactory, showing that some aspects can be automatically identified in news texts with reasonable performance. In stage (ii), two novel aspect-based multi-document summarization methods were developed. The results show that the methods proposed in this work are competitive with methods from the literature. It is worth noting that this approach to summarization has recently received considerable attention; moreover, it is unprecedented among the work developed in Brazil and may bring important contributions to the area.
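A highly simplified sketch of the aspect-and-template idea follows. The regular-expression cues and the fixed template order are assumptions made for illustration; the thesis relies on semantic role labeling, named entity recognition and machine-learned classifiers instead.

```python
import re

CUES = {
    "when":  re.compile(r"\bon\s+(January|February|March|April|May|June|July|August|"
                        r"September|October|November|December)\b", re.I),
    "where": re.compile(r"\bin\s+[A-Z][a-z]+"),
}

def tag_aspects(sentence):
    """Return the aspects a sentence seems to convey (any contentful sentence may carry 'what')."""
    aspects = {"what"} if sentence.strip() else set()
    for aspect, pattern in CUES.items():
        if pattern.search(sentence):
            aspects.add(aspect)
    return aspects

def fill_template(sentences, order=("what", "where", "when")):
    """Pick, for each template slot, the first sentence exposing that aspect."""
    slots = {}
    for s in sentences:
        for a in tag_aspects(s):
            slots.setdefault(a, s)
    chosen = []
    for a in order:
        s = slots.get(a)
        if s and s not in chosen:
            chosen.append(s)
    return " ".join(chosen)

docs = ["A dam collapsed in Mariana.", "The accident happened on November 5.",
        "Rescue teams were sent to the area."]
print(fill_template(docs))
# -> "A dam collapsed in Mariana. The accident happened on November 5."
```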
1045

Biomedical Concept Association and Clustering Using Word Embeddings

Setu Shah (5931128) 12 February 2019 (has links)
Biomedical data exists in the form of journal articles, research studies, electronic health records, care guidelines, etc. While text mining and natural language processing tools have been widely employed across various domains, they are just taking off in the healthcare space.

A primary hurdle that makes it difficult to build artificial intelligence models that use biomedical data is the limited amount of labelled data available. Since most models rely on supervised or semi-supervised methods, generating large amounts of pre-processed labelled data for training becomes extremely costly. Even for datasets that are labelled, the lack of normalization of biomedical concepts further affects the quality of the results and limits their application to a restricted dataset. This affects the reproducibility of results and techniques across datasets, making it difficult to deploy research solutions that improve healthcare services.

The research presented in this thesis focuses on reducing the need to create labels for biomedical text mining by using unsupervised recurrent neural networks. The proposed method utilizes word embeddings to generate vector representations of biomedical concepts based on semantics and context. Experiments with unsupervised clustering of these biomedical concepts show that concepts similar to each other are clustered together. While this clustering captures different synonyms of the same concept, it also captures the similarities between various diseases and the symptoms associated with them.

To test the performance of the concept vectors on corpora of documents, a document vector generation method that utilizes these concept vectors is also proposed. The document vectors thus generated are used as input to clustering algorithms, and the results show that, across multiple corpora, the proposed methods of concept and document vector generation outperform the baselines and provide more meaningful clustering. The applications of this document clustering are wide-ranging, especially in search and retrieval, providing clinicians, researchers and patients with more holistic and comprehensive results than relying only on the exact term they search for.

Finally, a framework is presented for extracting clinical information from preventive care guidelines that can be mapped to electronic health records. The extracted information can be integrated with the clinical decision support system of an electronic health record. A visualization tool to better understand and observe patient trajectories is also explored. Both of these methods have the potential to improve the preventive care services provided to patients.
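A toy sketch of the concept-vector-and-clustering idea described above is given below. The tiny hand-made 3-dimensional embedding table stands in for the embeddings learned by the unsupervised networks mentioned in the abstract, and exists only so the snippet runs end to end.

```python
import numpy as np
from sklearn.cluster import KMeans

word_vectors = {                      # pretend 3-d word embeddings
    "myocardial": np.array([0.9, 0.1, 0.0]),
    "infarction": np.array([0.8, 0.2, 0.1]),
    "heart":      np.array([0.9, 0.0, 0.1]),
    "attack":     np.array([0.7, 0.1, 0.2]),
    "type":       np.array([0.1, 0.9, 0.0]),
    "2":          np.array([0.0, 0.8, 0.1]),
    "diabetes":   np.array([0.1, 0.9, 0.2]),
}

def concept_vector(concept):
    """Average the embeddings of the words that make up a concept."""
    vecs = [word_vectors[w] for w in concept.split() if w in word_vectors]
    return np.mean(vecs, axis=0)

concepts = ["myocardial infarction", "heart attack", "type 2 diabetes"]
X = np.vstack([concept_vector(c) for c in concepts])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(concepts, labels)))    # the two heart-related concepts share a cluster
```

Document vectors can be built the same way, by averaging (or weighting) the vectors of the concepts a document mentions before clustering.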
1046

Investigating data quality in question and answer reports

Mohamed Zaki Ali, Mona January 2016 (has links)
Data Quality (DQ) has been a long-standing concern for a number of stakeholders in a variety of domains and has become a critically important factor for the effectiveness of organisations and individuals. Previous work on DQ methodologies has mainly focused either on the analysis of structured data or on the business-process level, rather than on analysing the data itself. Question and Answer Reports (QAR) are gaining momentum as a way to collect responses that can be used by data analysts, for instance in business, education or healthcare. Various stakeholders benefit from QAR, such as data brokers and data providers, and in order to effectively analyse and identify the common DQ problems in these reports, the perspectives of these stakeholders must be taken into account, which adds further complexity to the analysis. This thesis investigates DQ in QAR through an in-depth DQ analysis and provides solutions that can highlight potential sources and causes of the problems that result in "low-quality" collected data. The thesis proposes a DQ methodology appropriate for the context of QAR. The methodology consists of three modules: question analysis, medium analysis and answer analysis. In addition, a Question Design Support (QuDeS) framework is introduced to operationalise the proposed methodology through the automatic identification of DQ problems. The framework includes three components: question domain-independent profiling, question domain-dependent profiling and answer profiling. The proposed framework has been instantiated to address one example of a DQ issue, namely the Multi-Focal Question (MFQ). We introduce the MFQ as a question with multiple requirements: it asks for multiple answers. QuDeS-MFQ (the implemented instance of the QuDeS framework) implements two components of QuDeS for MFQ identification: question domain-independent profiling and question domain-dependent profiling. The proposed methodology and framework are designed, implemented and evaluated in the context of the Carbon Disclosure Project (CDP) case study. The experiments show that we can identify MFQs with 90% accuracy. The thesis also discusses the challenges encountered, including the lack of domain resources for domain knowledge representation (such as a domain ontology), the complexity and variability of QAR structure, the variability and ambiguity of terminology and language expressions, and the difficulty of understanding stakeholder or user needs.
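To make the notion of a multi-focal question concrete, here is a small heuristic sketch that flags questions containing more than one separately answerable request. The cue patterns and the threshold are my assumptions for illustration, not the QuDeS-MFQ profiling components.

```python
import re

WH = r"(?:what|how|why|when|where|which|who|describe|list|state|explain)"

def looks_multifocal(question):
    """Flag a question that appears to contain two or more separate requests."""
    q = question.lower()
    marks = question.count("?")                                  # several questions in one
    enumerated = len(re.findall(r"\([a-z0-9]\)\s", q))           # (a) ..., (b) ...
    conjoined = len(re.findall(r"\band\s+" + WH + r"\b", q))     # "... and how ..."
    requests = max(marks, 1) + enumerated + conjoined
    return requests >= 2

examples = [
    "What are your total emissions?",
    "What are your total emissions and how do you plan to reduce them?",
    "Describe (a) your targets and (b) the progress made against them.",
]
for q in examples:
    print(looks_multifocal(q), "-", q)   # False, True, True
```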
1047

Towards the French Biomedical Ontology Enrichment / Vers l'enrichissement d'ontologies biomédicales françaises

Lossio-Ventura, Juan Antonio 09 November 2015 (has links)
In biomedicine, Big Data raises a major issue: the analysis of large volumes of heterogeneous data (e.g. video, audio, text, image). Ontologies, as conceptual models of reality, can play a crucial role in biomedicine by automating data processing, querying, and the matching of heterogeneous data. Various resources exist in English, but considerably fewer are available for French, and there is a strong lack of related tools and services to exploit them. Initially, ontologies were built manually; in recent years, a few semi-automatic methodologies have been proposed. Semi-automatic construction/enrichment of ontologies is mostly induced from texts using natural language processing (NLP) techniques. NLP methods have to take into account the lexical and semantic complexity of biomedical data: (1) lexical, referring to the complex biomedical phrases to be considered, and (2) semantic, referring to the induction of the sense and context of the terminology. In this thesis, in order to tackle the challenges mentioned above, we propose methodologies for the enrichment/construction of biomedical ontologies based on two main contributions. The first contribution concerns the automatic extraction of specialized biomedical terms (lexical complexity) from corpora. New ranking measures for single- and multi-word term extraction have been proposed and evaluated, and the BioTex application implements the proposed measures. The second contribution concerns concept extraction and the semantic linkage of the extracted terminology (semantic complexity). This work seeks to induce semantic concepts for new candidate terms and to determine their semantic links, i.e. the most relevant positions within an existing biomedical ontology. We thus proposed a concept-extraction approach that integrates new terms into the MeSH ontology. Quantitative and qualitative evaluations, conducted by experts and non-experts on real data, highlight the relevance of these contributions.
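For illustration, the sketch below ranks multi-word term candidates with the classical C-value measure, shown here as a stand-in for the new ranking measures proposed in the thesis and implemented in BioTex. The toy frequency counts are assumptions.

```python
import math

def c_value(freqs):
    """freqs: {candidate term: frequency}. Longer terms that contain a candidate
    reduce its score; single-word terms are scored with a floor of log2(2) = 1."""
    scores = {}
    for term, f in freqs.items():
        longer = [t for t in freqs if t != term and f" {term} " in f" {t} "]
        if longer:
            f = f - sum(freqs[t] for t in longer) / len(longer)
        scores[term] = math.log2(max(len(term.split()), 2)) * f
    return scores

freqs = {"cellule souche": 30, "cellule souche embryonnaire": 12,
         "souche embryonnaire": 12, "protéine": 40}
for term, score in sorted(c_value(freqs).items(), key=lambda kv: -kv[1]):
    print(f"{score:6.2f}  {term}")
```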
1048

Unconscious processing at the subjective threshold : semantic comprehension?

Armstrong, Anna-Marie January 2014 (has links)
Our thoughts and behaviours can sometimes be influenced by stimuli that we are not consciously aware of having seen. For example, the presentation of a word that is blocked from entering conscious visual perception through masking can subsequently influence the cognitive processing of a further target word. However, the idea that unconscious cognition is sophisticated enough to process the semantic meaning of subliminal stimuli is controversial. This thesis explores the extent of subliminal priming. Empirical research centering on subjective methods of measuring conscious knowledge is presented in a series of three articles. The first article investigates the subliminal priming of negation. A series of experiments demonstrates that unconscious processing can discriminate between two nouns beyond chance performance when subliminally instructed to either pick or not pick a given noun. This article demonstrates not only semantic processing of the instructional word, but also unconscious cognitive control, by following a two-word subliminal instruction not to choose the primed noun. The second article investigates subliminal priming of active versus passive verb voice by presenting a prime sentence denoting one of two characters as either active or passive and asking which of two pictorial representations best matches the prime. The series of experiments demonstrates that, overall, participants were able to identify the correct image for both active and passive conditions beyond chance expectations. This suggests that individuals are able to process the meaning of word combinations that they are not aware of having seen. The third article attempts to determine whether subliminal processing is sophisticated enough to allow for the activation of specific anxieties relating to relationships. Whilst the findings reveal a small subliminal priming effect on generalised anxiety, the evidence regarding the subliminal priming of very specific anxieties is insensitive. In these experiments the unconscious is shown to be more powerful than previously supposed in terms of the fine-grained processing of the semantics of word combinations, though not yet in terms of the fine-grained resolution of emotional priming.
1049

Mining and modeling variability from natural language documents : two case studies / Extraction automatique de modèles de variabilité

Ben Nasr, Sana 05 April 2016
Domain analysis is the process of analyzing a family of products to identify their common and variable features. This process is generally carried out by experts on the basis of existing informal documentation. When performed manually, it is both time-consuming and error-prone, and the initial cost and level of manual effort involved are a significant barrier to adoption for many organizations that could otherwise benefit from it. In this thesis, our general contribution is to address mining and modeling variability from informal documentation, adopting Natural Language Processing (NLP) and data mining techniques to identify features, commonalities, differences and feature dependencies among related products, with the aim of reducing the operational cost of domain analysis. We investigate the applicability of this idea by instantiating it in two different contexts: (1) reverse engineering Feature Models (FMs) from regulatory safety requirements in the civil nuclear domain and (2) synthesizing Product Comparison Matrices (PCMs) from informal product descriptions. In the first case study, we adopt NLP and data mining techniques based on semantic analysis, requirements clustering and association rules to assist experts in constructing feature models from these regulations. The evaluation shows that our approach is able to retrieve 69% of correct clusters without any user intervention. Moreover, feature dependencies show a high predictive capacity: 95% of the mandatory relationships and 60% of the optional relationships are found, and all of the requires and excludes relationships are extracted. In the second case study, our approach relies on contrastive analysis to mine domain-specific terms from text, together with information extraction, term clustering and information clustering. Our empirical study shows that the resulting PCMs are compact and contain a large amount of quantitative, comparable information. The user study shows promising results: our automatic approach retrieves 43% of correct features and 68% of correct values in one step and without any user intervention. We show that there is potential to complement or even refine the technical characteristics of products. The main lesson learnt from the two case studies is that the extraction and exploitation of variability knowledge depend on the context, the nature of the variability and the nature of the text.
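As an illustration of the contrastive-analysis step used in the second case study, the sketch below scores each term by how much more frequent it is in the product descriptions than in a general reference corpus. The smoothing, the ratio-based score and the toy counts are assumptions made for this example.

```python
from collections import Counter

def domain_specificity(domain_counts, general_counts, smoothing=1.0):
    """Score terms by the ratio of their relative frequency in the domain
    corpus to their (smoothed) relative frequency in a general corpus."""
    total_d = sum(domain_counts.values())
    total_g = sum(general_counts.values())
    scores = {}
    for term, fd in domain_counts.items():
        p_domain = fd / total_d
        p_general = (general_counts.get(term, 0) + smoothing) / (total_g + smoothing)
        scores[term] = p_domain / p_general
    return scores

domain = Counter({"zoom": 40, "sensor": 35, "battery": 30, "the": 500})
general = Counter({"the": 100000, "battery": 300, "sensor": 80, "zoom": 20})
for term, s in sorted(domain_specificity(domain, general).items(), key=lambda kv: -kv[1]):
    print(f"{s:8.1f}  {term}")   # "zoom" and "sensor" rank high; "the" drops out
```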
1050

Uma solução efetiva para aprendizagem de relacionamentos não taxonômicos de ontologias / An effective solution for learning non taxonomic relationships of ontologies

SERRA, Ivo José da Cunha Serra 28 March 2014
Learning non-taxonomic relationships is a sub-field of ontology learning and an approach to automating the extraction of these relationships from textual information sources. Techniques for learning non-taxonomic relationships, like others in the area of ontology learning, are subject to a great amount of noise, since the information sources from which the relationships are extracted are unstructured. Therefore, customizable solutions are needed for these techniques to be applicable to the widest possible variety of situations. This thesis presents TARNT, a technique for learning non-taxonomic relationships of ontologies from texts in English that employs natural language processing and statistical techniques to tag the text and to select the relationships that should be recommended. Among its positive aspects are: control over the execution of its extraction rules, and consequently over recall and precision, in the "Extraction of candidate relationships" phase; the "apostrophe rule", which gives particular treatment to extractions that are more likely to be valid relationships; and "Bag of labels", a refinement technique with the potential to achieve greater effectiveness than those that operate on relationships consisting of a pair of concepts and a label. Experimental evaluations of TARNT were performed according to two procedures based on the principle of comparing the learned relationships with reference ones. These experiments measured, with recall and precision, the effectiveness of the technique in learning non-taxonomic relationships from two corpora in the domains of biology and family law. The results were also compared to those of another approach that uses an association-rule extraction algorithm in the refinement phase. This thesis further demonstrates the research hypothesis that refinement solutions operating on relationships composed of two ontology concepts and a label are less effective than those that refine relationships composed of only two concepts, since the former tend to obtain lower values for the evaluation measures on the same corpus and reference ontology. The demonstration was conducted through a theoretical exposition consisting of the generalization of observations made on the results obtained by two techniques that refine relationships of the two types considered.
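The general pattern behind extracting candidate non-taxonomic relationships can be sketched as follows: look for "concept - verb - concept" sequences and keep the verb as the relationship label. The concept list, the verb list and the window size below are simplifying assumptions made for illustration; this is not TARNT's rule set, and a refinement phase would still be needed to filter the noisy candidates.

```python
import re

CONCEPTS = {"enzyme", "reaction", "judge", "custody", "parent"}
VERBS = {"catalyses", "grants", "requires", "produces", "inhibits"}
STOPWORDS = {"the", "a", "an", "to"}

def is_concept(tok):
    return tok in CONCEPTS or (tok.endswith("s") and tok[:-1] in CONCEPTS)

def candidate_relationships(sentences):
    """Return noisy (concept, label, concept) triples found within a small window."""
    triples = []
    for s in sentences:
        tokens = re.findall(r"[a-z]+", s.lower())
        for i, tok in enumerate(tokens):
            if not is_concept(tok):
                continue
            for j in range(i + 1, min(i + 6, len(tokens))):
                if is_concept(tokens[j]):
                    label = [t for t in tokens[i + 1:j] if t not in STOPWORDS]
                    if label and label[0] in VERBS:
                        triples.append((tok, " ".join(label), tokens[j]))
                    break
    return triples

sentences = ["The enzyme catalyses the reaction.",
             "The judge grants custody to one parent."]
print(candidate_relationships(sentences))
# -> [('enzyme', 'catalyses', 'reaction'), ('judge', 'grants', 'custody')]
```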
