  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
61

Automatic Extraction and Assessment of Entities from the Web

Urbansky, David 23 October 2012
The search for information about entities, such as people or movies, plays an increasingly important role on the Web. This information is still scattered across many Web pages, making it more time-consuming for a user to find all relevant information about an entity. This thesis describes techniques to extract entities and information about these entities from the Web, such as facts, opinions, questions and answers, interactive multimedia objects, and events. The findings of this thesis are that it is possible to create a large knowledge base automatically using a manually crafted ontology. After applying assessment algorithms, the precision of the extracted information was found to range from 75 % (facts) to 90 % (entities). The algorithms from this thesis can be used to create such a knowledge base, which can be used in various research fields, such as question answering, named entity recognition, and information retrieval.
62

Extracting Clinical Findings from Swedish Health Record Text

Skeppstedt, Maria January 2014
Information contained in the free text of health records is useful for the immediate care of patients as well as for medical knowledge creation. Advances in clinical language processing have made it possible to automatically extract this information, but most research has, until recently, been conducted on clinical text written in English. In this thesis, however, information extraction from Swedish clinical corpora is explored, particularly focusing on the extraction of clinical findings. Unlike most previous studies, Clinical Finding was divided into the two more granular sub-categories Finding (symptom/result of a medical examination) and Disorder (condition with an underlying pathological process). For detecting clinical findings mentioned in Swedish health record text, a machine learning model, trained on a corpus of manually annotated text, achieved results in line with the obtained inter-annotator agreement figures. The machine learning approach clearly outperformed an approach based on vocabulary mapping, showing that Swedish medical vocabularies are not extensive enough for the purpose of high-quality information extraction from clinical text. An approach based on rules and cue vocabularies was, however, successful for negation and uncertainty classification of detected clinical findings. Methods for facilitating expansion of medical vocabulary resources are particularly important for Swedish and other languages with less extensive vocabulary resources. The possibility of using distributional semantics, in the form of Random indexing, for semi-automatic vocabulary expansion of medical vocabularies was, therefore, evaluated. Distributional semantics does not require that terms or abbreviations are explicitly defined in the text, and it is, thereby, a method suitable for clinical corpora. Random indexing was shown to be useful for extending vocabularies with medical terms, as well as for extracting medical synonyms and abbreviation dictionaries.
63

La structuration dans les entités nommées / Structuration in named entities

Dupont, Yoann 23 November 2017
Named entity recognition is a crucial discipline of NLP. It is used to extract relations between named entities, which allows the construction of knowledge bases (Surdeanu and Ji, 2014), automatic summarization (Nobata et al., 2002), and so on. This thesis focuses on the structuring phenomena that surround named entities. We distinguish two kinds of structural elements in named entities. The first kind consists of recurrent substrings, which we call the characteristic affixes of a named entity. The second kind consists of tokens with strong discriminative power, which we call the trigger tokens of named entities. We detail the algorithm we designed to extract such affixes and compare it to Morfessor (Creutz and Lagus, 2005b). We then apply the same method to extract trigger tokens, which we use for French named entity recognition and postal address extraction. Another form of structuring for named entities is syntactic in nature and generally follows a nested or tree structure. We propose a kind of linear tagger cascade that had not previously been used for structured named entity recognition, generalizing earlier methods that can only recognize named entities of a fixed depth or cannot model certain characteristics of the structure; ours can do both. Throughout this thesis, we compare two machine learning methods, CRFs and neural networks, and present the respective advantages and drawbacks of each.
64

Natural language processing in cross-media analysis

Woldemariam, Yonas Demeke January 2018
A cross-media analysis framework is an integrated multi-modal platform where a media resource containing different types of data such as text, images, audio and video is analyzed with metadata extractors, working jointly to contextualize the media resource. It generally provides cross-media analysis and automatic annotation, metadata publication and storage, and search and recommendation services. For on-line content providers, such services make it possible to semantically enhance a media resource with the extracted metadata representing its hidden meanings and to make it more efficiently searchable. Within the architecture of such frameworks, Natural Language Processing (NLP) infrastructures cover a substantial part. The NLP infrastructures include text analysis components such as a parser, named entity extraction and linking, sentiment analysis and automatic speech recognition. Since NLP tools and techniques were originally designed to operate in isolation, integrating them into cross-media frameworks and analyzing textual data extracted from multimedia sources is very challenging. In particular, the text extracted from audio-visual content lacks linguistic features that potentially provide important clues for text analysis components. Thus, there is a need to develop various techniques to meet the requirements and design principles of the frameworks. In our thesis, we explore developing various methods and models satisfying text and speech analysis requirements posed by cross-media analysis frameworks. The developed methods allow the frameworks to extract linguistic knowledge of various types and predict various kinds of information such as sentiment and competence. We also attempt to enhance the multilingualism of the frameworks by designing an analysis pipeline that includes speech recognition, transliteration and named entity recognition for Amharic, which also makes Amharic content on the web more efficiently accessible. The method can potentially be extended to support other under-resourced languages.
65

Automatically Detecting the Resonance of Terrorist Movement Frames on the Web

Etudo, Ugochukwu O 01 January 2017
The ever-increasing use of the internet by terrorist groups as a platform for the dissemination of radical, violent ideologies is well documented. The internet has, in this way, become a breeding ground for potential lone-wolf terrorists; that is, individuals who commit acts of terror inspired by the ideological rhetoric emitted by terrorist organizations. These individuals are characterized by their lack of formal affiliation with terror organizations, making them difficult to intercept with traditional intelligence techniques. The radicalization of individuals on the internet poses a considerable threat to law enforcement and national security officials. This new medium of radicalization, however, also presents new opportunities for the interdiction of lone wolf terrorism. This dissertation is an account of the development and evaluation of an information technology (IT) framework for detecting potentially radicalized individuals on social media sites and Web fora. Unifying Collective Action Framing Theory (CAFT) and a radicalization model of lone wolf terrorism, this dissertation analyzes a corpus of propaganda documents produced by several, radically different, terror organizations. This analysis provides the building blocks to define a knowledge model of terrorist ideological framing that is implemented as a Semantic Web Ontology. Using several techniques for ontology guided information extraction, the resultant ontology can be accurately processed from textual data sources. This dissertation subsequently defines several techniques that leverage the populated ontological representation for automatically identifying individuals who are potentially radicalized to one or more terrorist ideologies based on their postings on social media and other Web fora. The dissertation also discusses how the ontology can be queried using intuitive structured query languages to infer triggering events in the news. 
The prototype system is evaluated in the context of classification and is shown to provide state-of-the-art results. The main outputs of this research are (1) an ontological model of terrorist ideologies, (2) an information extraction framework capable of identifying and extracting terrorist ideologies from text, (3) a classification methodology for classifying Web content as resonating with the ideology of one or more terrorist groups, and (4) a methodology for rapidly identifying news content of relevance to one or more terrorist groups.
66

Reconhecimento de entidades mencionadas em português utilizando aprendizado de máquina / Portuguese named entity recognition using machine learning

Wesley Seidel Carvalho 24 February 2012
Named Entity Recognition (NER), a subtask of information extraction, aims to locate and classify textual elements according to predefined categories such as names of people, organizations, places, dates, and other classes of interest. This enables the execution of more advanced tasks. NER is a first step towards semantic textual analysis and is also a crucial subtask for document management, text mining, information extraction, and other types of systems. In this thesis, I analyze some Machine Learning methods applied to NER tasks, including two methods applied to the Portuguese language. I present three ways of evaluating these types of systems found in the literature. I also develop an NER system for the Portuguese language using Machine Learning, specifically a maximum entropy framework. The results are comparable to the best NER systems for the Portuguese language developed with other Machine Learning approaches.
67

Struktury trie pro zpracování rozsáhlých textových dat / Trie Structures for Large Text Data Processing

Rajčok, Andrej January 2016
This study analyzes natural language processing with emphasis on morphological analysis of inflective languages and systems for named entity recognition. It analyzes effective pattern matching in a dictionary using succinct structures and then analyzes the practical implementation of succinct structures. It describes the design and implementation of a named entity recognition system and a morphological analyzer, and compares and tests their speed and effectiveness.
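As a rough illustration of the dictionary pattern matching this abstract mentions, the sketch below builds a plain token-level trie and runs a greedy longest-match lookup over a sentence. It is an assumption-laden toy, not the thesis's method: the thesis works with succinct (compressed) structures, whereas this uses ordinary Python dictionaries, and the names `Trie`, `insert`, and `longest_matches` are hypothetical.

```python
class Trie:
    """Plain (non-succinct) trie keyed by tokens; illustrative only."""

    def __init__(self):
        self.children = {}      # token -> child Trie node
        self.is_entry = False   # True if a dictionary entry ends at this node

    def insert(self, tokens):
        """Add one dictionary entry, given as a sequence of tokens."""
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, Trie())
        node.is_entry = True


def longest_matches(trie, tokens):
    """Return (start, end) spans of the longest dictionary entries found,
    scanning left to right and skipping past each match (greedy)."""
    spans = []
    i = 0
    while i < len(tokens):
        node, end, j = trie, None, i
        # Walk down the trie as long as the next token has a child node.
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.is_entry:
                end = j  # remember the longest complete entry so far
        if end is None:
            i += 1
        else:
            spans.append((i, end))
            i = end
    return spans
```

A gazetteer of entity names would be loaded with `insert`, after which `longest_matches` returns token spans that a recognizer could label. A succinct trie would answer the same queries while storing the structure far more compactly.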
68

Automatic Extraction and Assessment of Entities from the Web

Urbansky, David 15 October 2012
The search for information about entities, such as people or movies, plays an increasingly important role on the Web. This information is still scattered across many Web pages, making it more time-consuming for a user to find all relevant information about an entity. This thesis describes techniques to extract entities and information about these entities from the Web, such as facts, opinions, questions and answers, interactive multimedia objects, and events. The findings of this thesis are that it is possible to create a large knowledge base automatically using a manually crafted ontology. After applying assessment algorithms, the precision of the extracted information was found to range from 75 % (facts) to 90 % (entities). The algorithms from this thesis can be used to create such a knowledge base, which can be used in various research fields, such as question answering, named entity recognition, and information retrieval.
69

Extracting Transaction Information from Financial Press Releases / Extrahering av Transaktionsdata från Finansiella Pressmeddelanden

Sjöberg, Agaton January 2021
Information Extraction (IE) has a wide range of use cases and often consists of a combination of Named Entity Recognition (NER) and Relation Extraction (RE). One use case of IE is the extraction of transaction information from Norwegian insider transaction Press Releases (PRs), where a transaction consists of at most four entities: the name of the owner performing the transaction, the number of shares transferred, the transaction date, and the price of the shares bought or sold. The relationships between the entities define which entity belongs to which transaction, and whether shares were bought or sold. This report investigates how a pair of supervised NER and RE models extract this information. Since these Norwegian PRs were not labeled, two different approaches to annotating the transaction entities and their associated relations were investigated, and it was found that it is better to annotate only entities that occur in a relation than to annotate all occurrences. Furthermore, the number of PRs needed to achieve a satisfactory result in the IE pipeline was investigated. The study shows that training with about 400 PRs is sufficient for the results to converge, at an F1-score of around 0.85. Finally, the report shows that there is not much difference between a complex RE model and a simple rule-based approach when applied to the studied corpus.
70

Annotating Job Titles in Job Ads using Swedish Language Models

Ridhagen, Markus January 2023
This thesis investigates automated annotation approaches to assist public authorities in Sweden in optimizing resource allocation and gaining valuable insights to enhance the preservation of high-quality welfare. The study uses pre-trained Swedish language models for the named entity recognition (NER) task of finding job titles in job advertisements from The Swedish Public Employment Service, Arbetsförmedlingen. Specifically, it evaluates the performance of the Swedish Bidirectional Encoder Representations from Transformers (BERT), developed by the National Library of Sweden (KB), referred to as KB-BERT. The thesis explores the impact of training data size on the models’ performance and examines whether active learning can enhance efficiency and accuracy compared to random sampling. The findings reveal that even with a small training dataset of 220 job advertisements, KB-BERT achieves a commendable F1-score of 0.770 in predicting job titles. The model’s performance improves further by augmenting the training data with an additional 500 annotated job advertisements, yielding an F1-score of 0.834. Notably, the highest F1-score of 0.856 is achieved by applying the active learning strategy of uncertainty sampling and the measure of mean entropy. The test data provided by Arbetsförmedlingen was re-annotated to evaluate the complexity of the task. The human annotator achieved an F1-score of 0.883. Based on these findings, it can be inferred that KB-BERT performs satisfactorily in classifying job titles from job ads.
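The active learning strategy named in this abstract, uncertainty sampling with mean entropy, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it assumes the tagger yields one label probability distribution per token and that each document has at least one token, and the function names and the `pool` layout are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(token_distributions):
    """Average per-token entropy over a document (assumed non-empty)."""
    total = sum(token_entropy(d) for d in token_distributions)
    return total / len(token_distributions)


def select_for_annotation(pool, k):
    """Pick the k unlabeled documents the model is least certain about.

    pool maps a document id to its list of per-token label distributions.
    """
    ranked = sorted(pool, key=lambda doc_id: mean_entropy(pool[doc_id]),
                    reverse=True)
    return ranked[:k]
```

In each round, the selected documents would be annotated, added to the training set, and the model retrained. Random sampling, the baseline in the abstract, would draw `k` documents from the pool without consulting the model.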
