• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 81
  • 22
  • 20
  • 11
  • 5
  • 3
  • 2
  • 2
  • 2
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • Tagged with
  • 173
  • 122
  • 105
  • 62
  • 54
  • 49
  • 48
  • 47
  • 37
  • 32
  • 30
  • 27
  • 27
  • 20
  • 19
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

Algoritmy pro rozpoznávání pojmenovaných entit / Algorithms for named entities recognition

Winter, Luca January 2017 (has links)
The aim of this work is to find out which algorithm is the best at recognizing named entities in e-mail messages. The theoretical part explains the existing tools in this field. The practical part describes the design of two tools specifically designed to create new models capable of recognizing named entities in e-mail messages. The first tool is based on a neural network and the second tool uses a CRF graph model. The existing and newly created tools and their ability to generalize are compared on a subset of e-mail messages provided by Kiwi.com.
52

Verbesserung einer Erkennungs- und Normalisierungsmaschine für natürlichsprachige Zeitausdrücke

Thomas, Stefan 27 February 2018 (has links)
Digital gespeicherte Daten erfreuen sich einer stetig steigenden Verwendung. Insbesondere die computerbasierte Kommunikation über E-Mail, SMS, Messenger usw. hat klassische Kommunikationsmittel nahezu vollständig verdrängt. Einen Mehrwert aus diesen Daten zu generieren, ist sowohl im geschäftlichen als auch im privaten Bereich von entscheidender Bedeutung. Eine Möglichkeit den Nutzer zu unterstützen ist es, seine textuellen Daten umfassend zu analysieren und bestimmte Elemente hervorzuheben und ihm die Erstellung von Einträgen für Kalender, Adressbuch und dergleichen abzunehmen bzw. zumindest vorzubereiten. Eine weitere Möglichkeit stellt die semantische Suche in den Daten des Nutzers dar. Selbst mit Volltextsuche muss man bisher den genauen Wortlaut kennen, wenn man eine bestimmte Information sucht. Durch ein tiefgreifendes Verständnis für Zeit ist es nun aber möglich, über einen Zeitstrahl alle mit einem bestimmten Zeitpunkt oder einer Zeitspanne verknüpften Daten zu finden. Es existieren bereits viele Ansätze um Named Entity Recognition voll- bzw. semi-automatisch durchzuführen, aber insbesondere Verfahren, welche weitgehend sprachunabhängig arbeiten und sich somit leicht auf viele Sprachen skalieren lassen, sind kaum publiziert. Um ein solches Verfahren für natürlichsprachige Zeitausdrücke zu verbessern, werden in dieser Arbeit, basierend auf umfangreichen Analysen, Möglichkeiten vorgestellt. Es wird speziell eine Strategie entwickelt, die auf einem Verfahren des maschinellen Lernens beruht und so den manuellen Aufwand für die Unterstützung neuer Sprachen reduziert. Diese und weitere Strategien wurden implementiert und in die bestehende Architektur der Zeiterkennungsmaschine der ExB-Gruppe integriert.
53

Convolutional Neural Networks for Named Entity Recognition in Images of Documents

van de Kerkhof, Jan January 2016 (has links)
This work researches named entity recognition (NER) with respect to images of documents with a domain-specific layout, by means of Convolutional Neural Networks (CNNs). Examples of such documents are receipts, invoices, forms and scientific papers, the latter of which are used in this work. An NER task is first performed statically, where a static number of entity classes is extracted per document. Networks based on the deep VGG-16 network are used for this task. Here, experimental evaluation shows that framing the task as a classification task, where the network classifies each bounding box coordinate separately, leads to the best network performance. Also, a multi-headed architecture is introduced, where the network has an independent fully-connected classification head per entity. VGG-16 achieves better performance with the multi-headed architecture than with its default, single-headed architecture. Additionally, it is shown that transfer learning does not improve performance of these networks. Analysis suggests that the networks trained for the static NER task learn to recognise document templates, rather than the entities themselves, and therefore do not generalize well to new, unseen templates. For a dynamic NER task, where the type and number of entity classes vary per document, experimental evaluation shows that, on large entities in the document, the Faster R-CNN object detection framework achieves comparable performance to the networks trained on the static task. Analysis suggests that Faster R-CNN generalizes better to new templates than the networks trained for the static task, as Faster R-CNN is trained on local features rather than the full document template. Finally, analysis shows that Faster R-CNN performs poorly on small entities in the image and suggestions are made to improve its performance.
54

Translation Memory System Optimization : How to effectively implement translation memory system optimization / Optimering av översättningsminnessystem : Hur man effektivt implementerar en optimering i översättningsminnessystem

Chau, Ting-Hey January 2015 (has links)
Translation of technical manuals is expensive, especially when a larger company needs to publish manuals for their whole product range in over 20 different languages. When a text segment (i.e. a phrase, sentence or paragraph) is manually translated, we would like to reuse these translated segments in future translation tasks. A translated segment is stored with its corresponding source language, often called a language pair in a Translation Memory System. A language pair in a Translation Memory represents a Translation Entry also known as a Translation Unit. During a translation, when a text segment in a source document matches a segment in the Translation Memory, available target languages in the Translation Unit will not require a human translation. The previously translated segment can be inserted into the target document. Such functionality is provided in the single source publishing software, Skribenta developed by Excosoft. Skribenta requires text segments in source documents to find an exact or a full match in the Translation Memory, in order to apply a translation to a target language. A full match can only be achieved if a source segment is stored in a standardized form, which requires manual tagging of entities, and often reoccurring words such as model names and product numbers. This thesis investigates different ways to improve and optimize a Translation Memory System. One way was to aid users with the work of manual tagging of entities, by developing Heuristic algorithms to approach the problem of Named Entity Recognition (NER). The evaluation results from the developed Heuristic algorithms were compared with the result from an off the shelf NER tool developed by Stanford. The results shows that the developed Heuristic algorithms is able to achieve a higher F-Measure compare to the Stanford NER, and may be a great initial step to aid Excosofts’ users to improve their Translation Memories. / Översättning av tekniska manualer är väldigt kostsamt, speciellt när större organisationer behöver publicera produktmanualer för hela deras utbud till över 20 olika språk. När en text (t.ex. en fras, mening, paragraf) har blivit översatt så vill vi kunna återanvända den översatta texten i framtida översättningsprojekt och dokument. De översatta texterna lagras i ett översättningsminne (Translation Memory). Varje text lagras i sitt källspråk tillsammans med dess översättning på ett annat språk, så kallat målspråk. Dessa utgör då ett språkpar i ett översättningsminnessystem (Translation Memory System). Ett språkpar som lagras i ett översättningsminne utgör en Translation Entry även kallat Translation Unit. Om man hittar en matchning när man söker på källspråket efter en given textsträng i översättningsminnet, får man upp översättningar på alla möjliga målspråk för den givna textsträngen. Dessa kan i sin tur sättas in i måldokumentet. En sådan funktionalitet erbjuds i publicerings programvaran Skribenta, som har utvecklats av Excosoft. För att utföra en översättning till ett målspråk kräver Skribenta att text i källspråket hittar en exakt matchning eller en s.k. full match i översättningsminnet. En full match kan bara uppnås om en text finns lagrad i standardform. Detta kräver manuell taggning av entiteter och ofta förekommande ord som modellnamn och produktnummer. I denna uppsats undersöker jag hur man effektivt implementerar en optimering i ett översättningsminnessystem, bland annat genom att underlätta den manuella taggningen av entitier. Detta har gjorts genom olika Heuristiker som angriper problemet med Named Entity Recognition (NER). Resultat från de utvecklade Heuristikerna har jämförts med resultatet från det NER-verktyg som har utvecklats av Stanford. Resultaten visar att de Heuristiker som jag utvecklat uppnår ett högre F-Measure jämfört med Stanford NER och kan därför vara ett bra inledande steg för att hjälpa Excosofts användare att förbättra deras översättningsminnen.
55

Efficient naming for Smart Home devices in Information Centric Networks

Rossland Lindvall, Caspar, Söderberg, Mikael January 2020 (has links)
The current network trends point towards a significant discrepancy between the data usage and the underlying architecture; a severely increasing amount of data is being sent from more devices while data usage is becoming more data-centric instead of the previously host-centric. Information Centric Network (ICN) is a new alternative network paradigm that is designed for a data-centric usage. ICN is based on uniquely naming data packages and making it location independent. This thesis researched how to implement an efficient naming for ICN in a Smart Home Scenario. The results are based on testing how the forwarding information base is populated for numerous different scenarios and how a node's duty cycle affects its power usage. The results indicate that a hierarchical naming is optimized for hierarchical-like network topology and a flat naming for interconnected network topologies. An optimized duty cycle is strongly dependent on the specific network and accordingto the results can a sub-optimal duty cycle lead to excessive powerusage.
56

Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning

Hathurusinghe, Rajitha 16 September 2020 (has links)
This thesis explores the training of a deep neural network based named entity recognizer in an end-to-end privacy preserved setting where dataset creation and model training happen in an environment with minimal manual interventions. With the improvement of accuracy in Deep Learning Models for practical tasks, a rising concern is satisfying the demand for training data for these models amidst the concerns on the data privacy. Several scenarios of data protection are suggested in the recent past due to public concerns hence the legal guidelines to enforce them. A promising new development is the decentralized model training on isolated datasets, which eliminates the compromises of privacy upon providing data to a centralized entity. However, in this federated setting curating the data source is still a privacy risk mostly in unstructured data sources such as text. We explore the feasibility of automatic dataset annotation for a Named Entity Recognition (NER) task and training a deep learning model with it in two federated learning settings. We explore the feasibility of utilizing a dataset created in this manner for fine-tuning a stateof- the-art deep learning language model for the downstream task of named entity recognition. We also explore this novel setting of deep learning NLP model and federated learning for its deviation from the classical centralized setting. We created an automatically annotated dataset containing around 80,000 sentences, a manual human annotated test set and tools to extend the dataset with more manual annotations. We observed the noise from automated annotation can be overcome to a level by increasing the dataset size. We also contributed to the federated learning framework with state-of-the-art NLP model developments. Overall, our NER model achieved around 0.80 F1-score for recognition of entities in sentences.
57

Lze to říci jinak aneb automatické hledání parafrází / Automatic Identification of Paraphrases

Otrusina, Lubomír January 2009 (has links)
Automatic paraphrase discovery is an important task in natural language processing. Many systems use paraphrases for improve performance e.g. systems for question answering, information retrieval or document summarization. In this thesis, we explain basic concepts e.g. paraphrase or paraphrase pattern. Next we propose some methods for paraphrase discovery from various resources. Subsequently we propose an unsupervised method for discovering paraphrase from large plain text based on context and keywords between NE pairs. In the end we explain evaluation metods in paraphrase discovery area and then we evaluate our system and compare it with similar systems.
58

Exploring Data Extraction and Relation Identification Using Machine Learning : Utilizing Machine-Learning Techniques to Extract Relevant Information from Climate Reports

Berger, William, Fooladi, Alex, Lindgren, Markus, Messo, Michel, Rosengren, Jonas, Rådmann, Lukas January 2023 (has links)
Ensuring the accessibility of data from Swedish municipal climate reports is necessary for examining climate work in Sweden. Manual data extraction is time-consuming and prone to errors, necessitating automation of the process. This project presents machine-learning techniques that can be used to extract data and information from Swedish municipal climate plans, to improve the accessibility of climate data. The proposed solution involves recognizing entities in plain text and extracting predefined relations between these using Named Entity Recognition and Relation Extraction, respectively. The result of the project is a functioning prototype in the medical domain due to the lack of annotated climate datasets in Swedish. Nevertheless, the problem remained the same: how to effectively perform data extraction from reports using machine learning techniques. The presented prototype demonstrates the potential of automating data extraction from reports. These findings imply that the system could be adapted to handle climate reports when a sufficient dataset becomes available. / Tillgängliggörande av information som sammanställs i svenska kommunala klimatplaner är avgörande för att utvärdera och ifrågasätta klimatarbetet i Sverige. Manuell dataextraktion är tidskrävande och komplicerat, vilket understryker behovet av att automatisera processen. Detta projekt utforskar maskininlärningstekniker som kan användas för att extrahera data och information från de kommunala klimatplanerna. Den föreslagna lösningen utnyttjar Named Entity Recognition för att identifiera entiteter i text och Relation Extraction för att extrahera fördefinierade relationer mellan entiteterna. I brist på svenska annoterade dataset inom klimatdomänen, är resultatet av projektet en fungerande prototyp inom den medicinska domänen. Frågeställningen är således densamma, hur maskininlärning kan användas för att utföra dataextraktion på rapporter. Prototypen som presenteras visar potentialen i att automatisera denna typ av dataextrahering. Denna framgång antyder att modellen kan anpassas för att hantera klimatrapporter när ett adekvat dataset blir tillgängligt.
59

A System for Automatic Information Extraction from Log Files

Chhabra, Anubhav 15 August 2022 (has links)
The development of technology, data-driven systems and applications are constantly revolutionizing our lives. We are surrounded by digitized systems/solutions that are transforming and making our lives easier. The criticality and complexity behind these systems are immense. So as to meet user satisfaction and keep up with the business needs, these digital systems should possess high availability, minimum downtime, and mitigate cyber attacks. Hence, system monitoring becomes an integral part of the lifecycle of a digital product/system. System monitoring often includes monitoring and analyzing logs outputted by the systems containing information about the events occurring within a system. The first step in log analysis generally includes understanding and segregating the various logical components within a log line, termed log parsing. Traditional log parsers use regular expressions and human-defined grammar to extract information from logs. Human experts are required to create, maintain and update the database containing these regular expressions and rules. They should keep up with the pace at which new products, applications and systems are being developed and deployed, as each unique application/system would have its own set of logs and logging standards. Logs from new sources tend to break the existing systems as none of the expressions match the signature of the incoming logs. The reasons mentioned above make the traditional log parsers time-consuming, hard to maintain, prone to errors, and not a scalable approach. On the other hand, machine learning based methodologies can help us develop solutions that automate the log parsing process without much intervention from human experts. NERLogParser is one such solution that uses a Bidirectional Long Short Term Memory (BiLSTM) architecture to frame the log parsing problem as a Named Entity Recognition (NER) problem. There have been recent advancements in the Natural Language Processing (NLP) domain with the introduction of architectures like Transformer and Bidirectional Encoder Representations from Transformers (BERT). However, these techniques have not been applied to tackle the problem of information extraction from log files. This gives us a clear research gap to experiment with the recent advanced deep learning architectures. This thesis extensively compares different machine learning based log parsing approaches that frame the log parsing problem as a NER problem. We compare 14 different approaches, including three traditional word-based methods: Naive Bayes, Perceptron and Stochastic Gradient Descent; a graphical model: Conditional Random Fields (CRF); a pre-trained sequence-to-sequence model for log parsing: NERLogParser; an attention-based sequence-to-sequence model: Transformer Neural Network; three different neural language models: BERT, RoBERTa and DistilBERT; two traditional ensembles and three different cascading classifiers formed using the individual classifiers mentioned above. We evaluate the NER approaches using an evaluation framework that offers four different evaluation schemes that not just help in comparing the NER approaches but also help us assess the quality of extracted information. The primary goal of this research is to evaluate the NER approaches on logs from new and unseen sources. To the best of our knowledge, no study in the literature evaluates the NER methodologies in such a context. Evaluating NER approaches on unseen logs helps us understand the robustness and the generalization capabilities of various methodologies. To carry out the experimentation, we use In-Scope and Out-of-Scope datasets. Both the datasets originate from entirely different sources and are entirely mutually exclusive. The In-Scope dataset is used for training, validation and testing purposes, whereas the Out-of-Scope dataset is purely used to evaluate the robustness and generalization capability of NER approaches. To better deal with logs from unknown sources, we propose Log Diversification Unit (LoDU), a unit of our system that enables us to carry out log augmentation and enrichment, which helps make the NER approaches more robust towards new and unseen logs. We segregate our final results on a use-case basis where different NER approaches may be suitable for various applications. Overall, traditional ensembles perform the best in parsing the Out-of-Scope log files, but they may not be the best option to consider for real-time applications. On the other hand, if we want to balance the trade-off between performance and throughput, cascading classifiers can be considered the go-to solution.
60

Multilingual Transformer Models for Maltese Named Entity Recognition

Farrugia, Kris January 2022 (has links)
The recently developed state-of-the-art models for Named Entity Recognition are heavily dependent upon huge amounts of available annotated data. Consequently, it is extremely challenging for data-scarce languages to obtain significant result. Several approaches have been proposed to circumvent this issue, including cross-lingual transfer learning, which is the leveraging of knowledge obtained by available resources in the source language and transfer it to a target low-resource language.        Maltese is one of the many majorly underresourced languages. The main purpose of this project is to research how recently developed transformer multilingual models (Multilingual BERT and XLM-RoBERTa) perform and to ultimately set up an evaluation benchmark in zero-shot cross-lingual transfer learning for Maltese Named Entity Recognition. The models are fine-tuned on Arabic, English, Italian, Spanish and Dutch. The experiments evaluated the efficacy of the source languages and the use of multilingual data in both the training and validation stages.         The experiments demonstrated that feeding multilingual data to both the training and the validation phases was mostly beneficial to the performance. However, adding it to the validation phase only was generally detrimental. Furthermore, XLM-R achieved overall better scores however, employing mBERT and English as the source language yielded the best performance.

Page generated in 0.0381 seconds