21 |
Translation Memory System Optimization : How to effectively implement translation memory system optimization / Optimering av översättningsminnessystem : Hur man effektivt implementerar en optimering i översättningsminnessystem
Chau, Ting-Hey January 2015
Translation of technical manuals is expensive, especially when a large company needs to publish manuals for its whole product range in over 20 different languages. When a text segment (i.e. a phrase, sentence or paragraph) has been manually translated, we would like to reuse the translated segment in future translation tasks. A translated segment is stored together with its corresponding source-language segment, and the two form what is often called a language pair in a Translation Memory System. A language pair in a Translation Memory represents a Translation Entry, also known as a Translation Unit. During translation, when a text segment in a source document matches a segment in the Translation Memory, the target languages available in the Translation Unit do not require a human translation; the previously translated segment can be inserted into the target document. Such functionality is provided by the single-source publishing software Skribenta, developed by Excosoft. Skribenta requires text segments in source documents to find an exact or a full match in the Translation Memory in order to apply a translation to a target language. A full match can only be achieved if a source segment is stored in a standardized form, which requires manual tagging of entities and of frequently recurring words such as model names and product numbers. This thesis investigates different ways to improve and optimize a Translation Memory System. One way was to aid users with the manual tagging of entities by developing heuristic algorithms that approach the problem of Named Entity Recognition (NER). The evaluation results of the developed heuristic algorithms were compared with the results of an off-the-shelf NER tool developed by Stanford. The results show that the developed heuristic algorithms achieve a higher F-measure than the Stanford NER tool, and may be a good initial step to help Excosoft's users improve their Translation Memories.
/ Översättning av tekniska manualer är väldigt kostsamt, speciellt när större organisationer behöver publicera produktmanualer för hela deras utbud till över 20 olika språk. När en text (t.ex. en fras, mening, paragraf) har blivit översatt så vill vi kunna återanvända den översatta texten i framtida översättningsprojekt och dokument. De översatta texterna lagras i ett översättningsminne (Translation Memory). Varje text lagras i sitt källspråk tillsammans med dess översättning på ett annat språk, så kallat målspråk. Dessa utgör då ett språkpar i ett översättningsminnessystem (Translation Memory System). Ett språkpar som lagras i ett översättningsminne utgör en Translation Entry även kallat Translation Unit. Om man hittar en matchning när man söker på källspråket efter en given textsträng i översättningsminnet, får man upp översättningar på alla möjliga målspråk för den givna textsträngen. Dessa kan i sin tur sättas in i måldokumentet. En sådan funktionalitet erbjuds i publicerings programvaran Skribenta, som har utvecklats av Excosoft. För att utföra en översättning till ett målspråk kräver Skribenta att text i källspråket hittar en exakt matchning eller en s.k. full match i översättningsminnet. En full match kan bara uppnås om en text finns lagrad i standardform. Detta kräver manuell taggning av entiteter och ofta förekommande ord som modellnamn och produktnummer. I denna uppsats undersöker jag hur man effektivt implementerar en optimering i ett översättningsminnessystem, bland annat genom att underlätta den manuella taggningen av entitier. Detta har gjorts genom olika Heuristiker som angriper problemet med Named Entity Recognition (NER). Resultat från de utvecklade Heuristikerna har jämförts med resultatet från det NER-verktyg som har utvecklats av Stanford. 
Resultaten visar att de Heuristiker som jag utvecklat uppnår ett högre F-Measure jämfört med Stanford NER och kan därför vara ett bra inledande steg för att hjälpa Excosofts användare att förbättra deras översättningsminnen.
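The full-match requirement described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's heuristics and not Skribenta's behaviour: the product-number pattern, the `<ENTITY>` placeholder and the `standardize` and `lookup` helpers are all hypothetical names chosen for illustration. The idea shown is only that replacing entities with a placeholder lets two segments that differ in a model number share one Translation Memory entry.

```python
import re
from typing import Optional

# Hypothetical pattern for product numbers such as "XK-2000".
# The heuristics actually developed in the thesis are not given in the abstract.
PRODUCT_NUMBER = re.compile(r"\b[A-Z]{2,}\d*-\d+\b")
PLACEHOLDER = "<ENTITY>"


def standardize(segment: str) -> tuple[str, list[str]]:
    """Replace each product number with a placeholder; return the new text and the removed values."""
    entities: list[str] = []

    def _tag(match: re.Match) -> str:
        entities.append(match.group(0))
        return PLACEHOLDER

    return PRODUCT_NUMBER.sub(_tag, segment), entities


def lookup(memory: dict[str, str], segment: str) -> Optional[str]:
    """Return a stored translation for the standardized segment, with entities restored in order."""
    key, entities = standardize(segment)
    target = memory.get(key)
    if target is None:
        return None  # no full match: the segment needs a human translation
    for value in entities:
        target = target.replace(PLACEHOLDER, value, 1)
    return target


# Toy translation memory holding one standardized English -> Swedish unit.
memory = {"Turn off the <ENTITY> before cleaning.": "Stäng av <ENTITY> före rengöring."}
print(lookup(memory, "Turn off the XK-3000 before cleaning."))
```

Without the standardization step, the segment for the XK-3000 would miss an entry stored for the XK-2000 and would be sent for manual translation again.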
|
22 |
Creating a Graph Database from a Set of Documents / Skapandet av en grafdatabas från ett set av dokument
Nikolic, Vladan January 2015
In the context of search, it may be advantageous in some use-cases to have documents saved in a graph database rather than a document-orientated database. Graph databases are able to model relationships between objects, in this case documents, in ways which allow for efficient retrieval, as well as search queries that are slightly more specific or complex. This report will attempt to explore the possibilities of storing an existing set of documents into a graph database. A Named Entity Recognizer was used on a set of news articles in order to extract entities from each news article’s body of text. News articles that contain the same entities are then connected to each other in the graph. Ideas to improve this entity extraction are also explored. The method of evaluation that was utilized in this report proved not to be ideal for this task in that only a relative measure was given, not an absolute one. As such, no absolute answer with regards to the quality of the method can be presented. It is clear that improvements can be made, and the result should be subject to further study. / I ett sökkontext kan det vara födelaktigt att i några användarscenarion utgå från dokument lagrade i en grafdatabas gentemot en dokument-orienterad databas. Grafdatabaser kan modellera förhållanden mellan objekt, som i detta fall är dokument, på ett sätt som ökar effektiviteten för vissa mer specifika eller komplexa sökfrågor. Denna rapport utforskar möjligheterna i att lagra existerande dokument i en grafdatabas. En Named Entity Recognizer används för att extrahera entiter från en stor samling nyhetsartiklar. Nyhetsartiklar som innehåller samma entiteter är sedan kopplade till varandra i grafen. Dessutom undersöks möjligheter till att förbättra extraheringen av entiteter. Evalueringsmetoden som användes visade sig mindre än ideal, då endast en relativ snarare än absolut bedömning kan göras av den slutgiltiga grafen. 
Därav kan inget slutgiltigt svar ges angående grafens och metodens kvalitet, men resultatet bör vara av intresse för framtida undersökningar.
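The linking step in this abstract, where articles that contain the same entities are connected in the graph, can be sketched in a few lines of Python. This is an illustration under assumptions, not the report's implementation: it uses plain dictionaries rather than a graph database, and the article ids, entity names and `build_article_graph` function are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_article_graph(article_entities: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Map each pair of article ids to the set of entities that both articles mention."""
    # Invert the input: entity -> ids of the articles that mention it.
    by_entity: dict[str, set[str]] = defaultdict(set)
    for article_id, entities in article_entities.items():
        for entity in entities:
            by_entity[entity].add(article_id)

    # Every pair of articles sharing an entity gets an edge labelled with that entity.
    edges: dict[tuple[str, str], set[str]] = defaultdict(set)
    for entity, articles in by_entity.items():
        for first, second in combinations(sorted(articles), 2):
            edges[(first, second)].add(entity)
    return dict(edges)


# Toy input standing in for the output of a Named Entity Recognizer.
articles = {
    "a1": {"Stockholm", "KTH"},
    "a2": {"KTH"},
    "a3": {"Oslo"},
}
print(build_article_graph(articles))
```

Labelling each edge with the shared entities keeps the reason for a connection available to later queries, for example to rank neighbours by the number of entities in common.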
|
23 |
Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning
Hathurusinghe, Rajitha 16 September 2020
This thesis explores the training of a deep neural network based named entity recognizer in
an end-to-end privacy preserved setting where dataset creation and model training happen
in an environment with minimal manual interventions. With the improvement of accuracy
in Deep Learning Models for practical tasks, a rising concern is satisfying the demand for
training data for these models amid concerns about data privacy. Several data protection
schemes have been suggested in the recent past in response to public concerns, and legal
guidelines have followed to enforce them. A promising new development is decentralized model
training on isolated datasets, which avoids the privacy compromises that come with providing
data to a centralized entity. However, in this federated setting, curating the data source is
still a privacy risk, mostly for unstructured data sources such as text.
We explore the feasibility of automatic dataset annotation for a Named Entity Recognition
(NER) task and training a deep learning model with it in two federated learning settings.
We explore the feasibility of utilizing a dataset created in this manner for fine-tuning a
state-of-the-art deep learning language model for the downstream task of named entity recognition.
We also explore this novel setting of deep learning NLP model and federated learning
for its deviation from the classical centralized setting.
We created an automatically annotated dataset containing around 80,000 sentences, a
manually annotated test set, and tools to extend the dataset with more manual annotations.
We observed that the noise from automated annotation can be overcome to some extent by
increasing the dataset size. We also contributed state-of-the-art NLP model developments
to the federated learning framework. Overall, our NER model achieved an F1-score of around
0.80 for recognition of entities in sentences.
|
24 |
Exploring Data Extraction and Relation Identification Using Machine Learning : Utilizing Machine-Learning Techniques to Extract Relevant Information from Climate Reports
Berger, William, Fooladi, Alex, Lindgren, Markus, Messo, Michel, Rosengren, Jonas, Rådmann, Lukas January 2023
Ensuring the accessibility of data from Swedish municipal climate reports is necessary for examining climate work in Sweden. Manual data extraction is time-consuming and prone to errors, necessitating automation of the process. This project presents machine-learning techniques that can be used to extract data and information from Swedish municipal climate plans, to improve the accessibility of climate data. The proposed solution involves recognizing entities in plain text and extracting predefined relations between these using Named Entity Recognition and Relation Extraction, respectively. The result of the project is a functioning prototype in the medical domain due to the lack of annotated climate datasets in Swedish. Nevertheless, the problem remained the same: how to effectively perform data extraction from reports using machine learning techniques. The presented prototype demonstrates the potential of automating data extraction from reports. These findings imply that the system could be adapted to handle climate reports when a sufficient dataset becomes available. / Tillgängliggörande av information som sammanställs i svenska kommunala klimatplaner är avgörande för att utvärdera och ifrågasätta klimatarbetet i Sverige. Manuell dataextraktion är tidskrävande och komplicerat, vilket understryker behovet av att automatisera processen. Detta projekt utforskar maskininlärningstekniker som kan användas för att extrahera data och information från de kommunala klimatplanerna. Den föreslagna lösningen utnyttjar Named Entity Recognition för att identifiera entiteter i text och Relation Extraction för att extrahera fördefinierade relationer mellan entiteterna. I brist på svenska annoterade dataset inom klimatdomänen, är resultatet av projektet en fungerande prototyp inom den medicinska domänen. Frågeställningen är således densamma, hur maskininlärning kan användas för att utföra dataextraktion på rapporter. 
Prototypen som presenteras visar potentialen i att automatisera denna typ av dataextrahering. Denna framgång antyder att modellen kan anpassas för att hantera klimatrapporter när ett adekvat dataset blir tillgängligt.
|
25 |
A System for Automatic Information Extraction from Log Files
Chhabra, Anubhav 15 August 2022
Technology, data-driven systems and applications are constantly revolutionizing our lives. We are surrounded by digitized systems and solutions that are transforming our lives and making them easier. The criticality and complexity behind these systems are immense. To satisfy users and keep up with business needs, these digital systems should offer high availability and minimal downtime, and should mitigate cyber attacks. Hence, system monitoring becomes an integral part of the lifecycle of a digital product or system. System monitoring often includes monitoring and analyzing the logs that systems output, which contain information about the events occurring within a system. The first step in log analysis generally involves understanding and segregating the various logical components within a log line, which is termed log parsing.
Traditional log parsers use regular expressions and human-defined grammar to extract information from logs. Human experts are required to create, maintain and update the database containing these regular expressions and rules. They should keep up with the pace at which new products, applications and systems are being developed and deployed, as each unique application/system would have its own set of logs and logging standards. Logs from new sources tend to break the existing systems as none of the expressions match the signature of the incoming logs. The reasons mentioned above make the traditional log parsers time-consuming, hard to maintain, prone to errors, and not a scalable approach. On the other hand, machine learning based methodologies can help us develop solutions that automate the log parsing process without much intervention from human experts. NERLogParser is one such solution that uses a Bidirectional Long Short Term Memory (BiLSTM) architecture to frame the log parsing problem as a Named Entity Recognition (NER) problem. There have been recent advancements in the Natural Language Processing (NLP) domain with the introduction of architectures like Transformer and Bidirectional Encoder Representations from Transformers (BERT). However, these techniques have not been applied to tackle the problem of information extraction from log files. This gives us a clear research gap to experiment with the recent advanced deep learning architectures.
This thesis extensively compares different machine learning based log parsing approaches that frame the log parsing problem as a NER problem. We compare 14 different approaches, including three traditional word-based methods: Naive Bayes, Perceptron and Stochastic Gradient Descent; a graphical model: Conditional Random Fields (CRF); a pre-trained sequence-to-sequence model for log parsing: NERLogParser; an attention-based sequence-to-sequence model: Transformer Neural Network; three different neural language models: BERT, RoBERTa and DistilBERT; two traditional ensembles and three different cascading classifiers formed using the individual classifiers mentioned above. We evaluate the NER approaches using an evaluation framework that offers four different evaluation schemes that not just help in comparing the NER approaches but also help us assess the quality of extracted information.
The primary goal of this research is to evaluate the NER approaches on logs from new and unseen sources. To the best of our knowledge, no study in the literature evaluates the NER methodologies in such a context. Evaluating NER approaches on unseen logs helps us understand the robustness and the generalization capabilities of various methodologies. To carry out the experimentation, we use In-Scope and Out-of-Scope datasets. Both the datasets originate from entirely different sources and are entirely mutually exclusive. The In-Scope dataset is used for training, validation and testing purposes, whereas the Out-of-Scope dataset is purely used to evaluate the robustness and generalization capability of NER approaches.
To better deal with logs from unknown sources, we propose Log Diversification Unit (LoDU), a unit of our system that enables us to carry out log augmentation and enrichment, which helps make the NER approaches more robust towards new and unseen logs. We segregate our final results on a use-case basis where different NER approaches may be suitable for various applications. Overall, traditional ensembles perform the best in parsing the Out-of-Scope log files, but they may not be the best option to consider for real-time applications. On the other hand, if we want to balance the trade-off between performance and throughput, cascading classifiers can be considered the go-to solution.
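The trade-off between performance and throughput that this abstract attributes to cascading classifiers can be sketched as follows. The abstract does not say how the thesis builds its cascades, so the confidence-threshold fall-through below is an assumption, and the `Tagger` type, the `cascade` function and the two toy taggers are hypothetical stand-ins for real NER models.

```python
from typing import Callable, Sequence

# A tagger maps the tokens of one log line to (labels, confidence in [0, 1]).
Tagger = Callable[[Sequence[str]], tuple[list[str], float]]


def cascade(taggers: Sequence[Tagger], tokens: Sequence[str], threshold: float = 0.9) -> list[str]:
    """Run taggers from cheapest to costliest; stop at the first confident one."""
    if not taggers:
        raise ValueError("at least one tagger is required")
    labels: list[str] = []
    for tagger in taggers:
        labels, confidence = tagger(tokens)
        if confidence >= threshold:
            break  # this stage was confident enough, so later stages are skipped
    # If no stage reached the threshold, the last stage's labels are returned.
    return labels


def fast_tagger(tokens: Sequence[str]) -> tuple[list[str], float]:
    """Toy cheap model: labels everything 'O' with low confidence."""
    return ["O"] * len(tokens), 0.5


def slow_tagger(tokens: Sequence[str]) -> tuple[list[str], float]:
    """Toy expensive model: labels the first token as a timestamp with high confidence."""
    return ["TIMESTAMP"] + ["O"] * (len(tokens) - 1), 0.95


print(cascade([fast_tagger, slow_tagger], ["12:00:01", "sshd", "failed"]))
```

Under this assumed design, most log lines would be handled by the cheap stage alone, and only uncertain lines pay the cost of the larger model.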
|
26 |
Information Extraction of Technical Details From Scholarly Articles
Kaushal, Kulendra Kumar 16 June 2021
Researchers have made significant progress in information extraction from short documents in the last few years, including social media interaction, news articles, and email excerpts. This research aims to extract technical entities like hardware resources, computing platforms, compute time, programming language, and libraries from scholarly research articles. Research articles are generally long documents having both salient as well as non-salient entities. Analyzing the cross-sectional relation, filtering the relevant information, measuring the saliency of mentioned entities, and extracting novel entities are some of the technical challenges involved in this research. This work presents a detailed study about the performance, effectiveness, and scalability of rule-based weakly supervised algorithms. We also develop an automated end-to-end Research Entity and Relationship Extractor (E2R Extractor). Additionally, we perform a comprehensive study about the effectiveness of existing deep learning-based information extraction tools like Dygie, Dygie++, SciREX. The research also contributes a dataset containing novel entities annotated in BILUO format and represents the baseline results using the E2R extractor on the proposed dataset. The results indicate that the E2R extractor successfully extracts salient entities from research articles. / Master of Science / Information extraction is a process of automatically extracting meaningful information from unstructured text such as articles, news feeds and presenting it in a structured format.
Researchers have made significant progress in this domain over the past few years.
However, their work primarily focuses on short documents such as social media interactions, news articles, email excerpts, and not on long documents such as scholarly articles and research papers. Long documents contain a lot of redundant data, so filtering and extracting meaningful information is quite challenging. This work focuses on extracting entities such as hardware resources, compute platforms, and programming languages used in scholarly articles.
We present a deep learning-based model to extract such entities from research articles and research papers. We evaluate the performance of our deep learning model against simple rule-based algorithms and other state-of-the-art models for extracting the desired entities.
Our work also contributes a labeled dataset containing the entities mentioned above and results obtained on this dataset using our deep learning model.
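The BILUO format mentioned in this abstract tags each token as Beginning, Inside or Last of a multi-token entity, as a Unit (single-token) entity, or as Outside any entity. The Python sketch below shows that encoding only; the example sentence, the `HARDWARE` and `LIBRARY` labels and the `biluo_tags` helper are hypothetical and are not taken from the thesis's dataset.

```python
def biluo_tags(n_tokens: int, spans: list[tuple[int, int, str]]) -> list[str]:
    """Encode entity spans (start, end exclusive, label) over n_tokens tokens as BILUO tags."""
    tags = ["O"] * n_tokens  # Outside by default
    for start, end, label in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid span: {(start, end, label)}")
        if end - start == 1:
            tags[start] = f"U-{label}"  # Unit: single-token entity
        else:
            tags[start] = f"B-{label}"  # Beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # Inside
            tags[end - 1] = f"L-{label}"  # Last
    return tags


tokens = ["Trained", "on", "NVIDIA", "Tesla", "V100", "using", "PyTorch"]
spans = [(2, 5, "HARDWARE"), (6, 7, "LIBRARY")]
for token, tag in zip(tokens, biluo_tags(len(tokens), spans)):
    print(f"{token}\t{tag}")
```

Compared with plain BIO tags, the explicit Last and Unit tags mark where each entity ends, which is convenient for the multi-word technical entities the abstract describes.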
|
27 |
Hierarchical Joint Entity Recognition and Relation Extraction of Contextual Entities in Family History Records
Segrera, Daniel 08 March 2023
Entity extraction is an important step in document understanding. Higher-accuracy extraction of fine-grained entities can be achieved by combining Named Entity Recognition (NER) and Relation Extraction (RE) models. In this paper, a cascading model is proposed that implements NER and RE. This model utilizes relations between entities to infer context-dependent fine-grained named entities in text corpora. The RE module runs independently of the NER module, which reduces error accumulation from sequential steps. This process improves the fine-grained NER F1-score over the existing state of the art from .4753 to .8563 on our data, albeit on a strictly limited domain. This provides the potential for further applications in historical document processing. These applications will enable automated searching of historical documents, such as those used in economics research and family history.
|
28 |
Improving Automatic Transcription Using Natural Language Processing
Kiefer, Anna 01 March 2024
Digital Democracy is a CalMatters and California Polytechnic State University initiative to promote transparency in state government by increasing access to the California legislature. While Digital Democracy is made up of many resources, one foundational step of the project is obtaining accurate, timely transcripts of California Senate and Assembly hearings. The information extracted from these transcripts provides crucial data for subsequent steps in the pipeline. In the context of Digital Democracy, upleveling is when humans verify, correct, and annotate the transcript results after the legislative hearings have been automatically transcribed. The upleveling process is done with the assistance of a software application called the Transcription Tool. The human upleveling process is the most costly and time-consuming step of the Digital Democracy pipeline. In this thesis, we hypothesize that we can make significant reductions to the time needed for upleveling by using Natural Language Processing (NLP) systems and techniques. The main contribution of this thesis is engineering a new automatic transcription pipeline. Specifically, this thesis integrates a new automatic speech recognition service, a new speaker diarization model, additional text post-processing changes, and a new process for speaker identification. To evaluate the system’s improvements, we measure the accuracy and speed of the newly integrated features and record editor upleveling time both before and after the additions.
|
29 |
Prerequisites for Extracting Entity Relations from Swedish Texts
Lenas, Erik January 2020
Natural language processing (NLP) is a vibrant area of research with many practical applications today, such as sentiment analysis, text labeling, question answering, machine translation and automatic text summarization. At the moment, research is mainly focused on the English language, although many other languages are trying to catch up. This work focuses on an area within NLP called information extraction, and more specifically on relation extraction, that is, extracting relations between entities in a text. The aim of this work is to use machine learning techniques to build a Swedish language processing pipeline with part-of-speech tagging, dependency parsing, named entity recognition and coreference resolution, to use as a base for later relation extraction from archival texts. The obvious difficulty lies in the scarcity of Swedish annotated datasets. For example, no sufficiently large Swedish dataset for coreference resolution exists today. An important part of this work, therefore, is to create a Swedish coreference solver using distantly supervised machine learning, which means creating a Swedish dataset by applying an English coreference solver to an unannotated bilingual corpus, using a word aligner to translate this machine-annotated English dataset to a Swedish dataset, and then training a Swedish model on the resulting dataset. Using AllenNLP's end-to-end coreference resolution model, both for creating the Swedish dataset and for training the Swedish model, this work achieves an F1-score of 0.5. For named entity recognition, this work uses the Swedish BERT models released by the Royal Library of Sweden in February 2020 and achieves an overall F1-score of 0.95. To put all of these NLP models within a single language processing pipeline, spaCy is used as a unifying framework.
/ Natural Language Processing (NLP) är ett stort och aktuellt forskningsområde idag med många praktiska tillämpningar som sentimentanalys, textkategorisering, maskinöversättning och automatisk textsummering. Forskningen är för närvarande mest inriktad på det engelska språket, men många andra språkområden försöker komma ikapp. Det här arbetet fokuserar på ett område inom NLP som kallas informationsextraktion, och mer specifikt relationsextrahering, det vill säga att extrahera relationer mellan namngivna entiteter i en text. Vad det här arbetet försöker göra är att använda olika maskininlärningstekniker för att skapa en svensk Language Processing Pipeline bestående av part-of-speech tagging, dependency parsing, named entity recognition och coreference resolution. Denna pipeline är sedan tänkt att användas som en bas for senare relationsextrahering från svenskt arkivmaterial. Den uppenbara svårigheten med detta ligger i att det är ont om stora, annoterade svenska dataset. Till exempel så finns det inget tillräckligt stort svenskt dataset för coreference resolution. En stor del av detta arbete går därför ut på att skapa en svensk coreference solver genom att implementera distantly supervised machine learning, med vilket menas att använda en engelsk coreference solver på ett oannoterat engelskt-svenskt corpus, och sen använda en word-aligner för att översätta detta maskinannoterade engelska dataset till ett svenskt, och sen träna en svensk coreference solver på detta dataset. Det här arbetet använder Allen NLP:s end-to-end coreference solver, både för att skapa det svenska datasetet, och för att träna den svenska modellen, och uppnår en F1-score på 0.5. Vad gäller named entity recognition så använder det här arbetet Kungliga Bibliotekets BERT-modeller som bas, och uppnår genom detta en F1-score på 0.95. Spacy används som ett enande ramverk för att samla alla dessa NLP-komponenter inom en enda pipeline.
|
30 |
Modèles graphiques discriminants pour l'étiquetage de séquences : application à la reconnaissance d'entités nommées radiophoniques / Discriminative graphical models for sequence labelling : application to named entity recognition in audio broadcast news
Zidouni, Azeddine 08 December 2010
Le traitement automatique des données complexes et variées est un processus fondamental dans les applications d'extraction d'information. L'explosion combinatoire dans la composition des textes journalistiques et l'évolution du vocabulaire rend la tâche d'extraction d'indicateurs sémantiques, tel que les entités nommées, plus complexe par les approches symboliques. Les modèles stochastiques structurels tel que les champs conditionnels aléatoires (CRF) permettent d'optimiser des systèmes d'extraction d'information avec une importante capacité de généralisation. La première contribution de cette thèse est consacrée à la définition du contexte optimal pour l'extraction des régularités entre les mots et les annotations dans la tâche de reconnaissance d'entités nommées. Nous allons intégrer diverses informations dans le but d'enrichir les observations et améliorer la qualité de prédiction du système. Dans la deuxième partie nous allons proposer une nouvelle approche d'adaptation d'annotations entre deux protocoles différents. Le principe de cette dernière est basé sur l'enrichissement d'observations par des données générées par d'autres systèmes. Ces travaux seront expérimentés et validés sur les données de la campagne ESTER. D'autre part, nous allons proposer une approche de couplage entre le niveau signal représenté par un indice de la qualité de voisement et le niveau sémantique. L'objectif de cette étude est de trouver le lien entre le degré d'articulation du locuteur et l'importance de son discours / Recent research in Information Extraction aims to extract fixed types of information from data. Sequence annotation systems are developed to associate structured annotations with input data presented in sequential form. The named entity recognition (NER) task consists of identifying and classifying every word in a document into predefined categories such as person names, locations, organizations, and dates.
The complexity of NER is largely related to the definition of the task and to the complexity of the relationships between words and their associated semantics. Our first contribution is devoted to solving the NER problem using discriminative graphical models. The proposed approach investigates the use of various word contexts to improve recognition. NER systems are built in accordance with a specific annotation protocol, so new applications have to be developed for new protocols. The challenge is how to adapt an annotation system designed for a specific application to another target application. In this work we propose an approach for adapting a sequence labelling task, based on annotation enrichment using conditional random fields (CRF). Experimental results show that the proposed approach outperforms a rule-based approach on the NER task. Finally, we propose a multimodal approach to NER that integrates low-level features as contextual information in radio broadcast news data. The objective of this study is to measure the correlation between the speaker's voicing quality and the importance of his speech.
|