31 |
Translation Memory System Optimization : How to effectively implement translation memory system optimization / Optimering av översättningsminnessystem : Hur man effektivt implementerar en optimering i översättningsminnessystem
Chau, Ting-Hey January 2015 (has links)
Translation of technical manuals is expensive, especially when a larger company needs to publish manuals for its whole product range in over 20 different languages. When a text segment (i.e. a phrase, sentence or paragraph) is manually translated, we would like to reuse these translated segments in future translation tasks. A translated segment is stored together with its corresponding source-language segment; such a pair is often called a language pair in a Translation Memory System. A language pair in a Translation Memory represents a Translation Entry, also known as a Translation Unit. During a translation, when a text segment in a source document matches a segment in the Translation Memory, the available target languages in the Translation Unit do not require a human translation; the previously translated segment can be inserted into the target document. Such functionality is provided by the single-source publishing software Skribenta, developed by Excosoft. Skribenta requires a text segment in a source document to find an exact or a full match in the Translation Memory in order to apply a translation to a target language. A full match can only be achieved if a source segment is stored in a standardized form, which requires manual tagging of entities and of frequently recurring words such as model names and product numbers. This thesis investigates different ways to improve and optimize a Translation Memory System. One way was to aid users with the manual tagging of entities by developing heuristic algorithms that approach the problem of Named Entity Recognition (NER). The evaluation results from the developed heuristic algorithms were compared with the results from an off-the-shelf NER tool developed by Stanford. The results show that the developed heuristic algorithms achieve a higher F-measure than the Stanford NER, and may be a good initial step in helping Excosoft's users improve their Translation Memories. / Translation of technical manuals is very costly, especially when larger organizations need to publish product manuals for their entire range in over 20 languages. When a text (e.g. a phrase, sentence or paragraph) has been translated, we want to be able to reuse the translated text in future translation projects and documents. The translated texts are stored in a Translation Memory. Each text is stored in its source language together with its translation into another language, the so-called target language. These constitute a language pair in a Translation Memory System. A language pair stored in a Translation Memory constitutes a Translation Entry, also known as a Translation Unit. If a match is found when searching the Translation Memory's source language for a given text string, translations into all available target languages for that string are returned. These can in turn be inserted into the target document. Such functionality is offered in the publishing software Skribenta, developed by Excosoft. To perform a translation to a target language, Skribenta requires that text in the source language finds an exact match or a so-called full match in the Translation Memory. A full match can only be achieved if a text is stored in standardized form. This requires manual tagging of entities and frequently occurring words such as model names and product numbers. In this thesis I investigate how to effectively implement an optimization in a Translation Memory System, in part by facilitating the manual tagging of entities. This has been done through various heuristics that attack the problem of Named Entity Recognition (NER). Results from the developed heuristics have been compared with the results from the NER tool developed by Stanford. The results show that the heuristics I developed achieve a higher F-measure than the Stanford NER and can therefore be a good initial step in helping Excosoft's users improve their translation memories.
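As a concrete illustration of the full-match mechanism described above, here is a minimal sketch, assuming a hypothetical in-memory Translation Memory and an invented placeholder scheme; the entity patterns stand in for the manual tagging (or heuristic NER) the thesis discusses.

```python
# A minimal sketch of full-match lookup in a Translation Memory. The entity
# patterns, placeholder names, and TM contents are all illustrative.
import re

# Hypothetical entity patterns of the kind a heuristic NER might tag
# (model names, product numbers) so segments reach a standardized form.
ENTITY_PATTERNS = [
    (re.compile(r"\b[A-Z]{2,}-\d+\b"), "<PRODUCT_NUMBER>"),
    (re.compile(r"\bModel [A-Z]\d+\b"), "<MODEL_NAME>"),
]

def standardize(segment: str) -> str:
    """Replace tagged entities with placeholders, yielding the standardized form."""
    for pattern, placeholder in ENTITY_PATTERNS:
        segment = pattern.sub(placeholder, segment)
    return segment

# Translation Memory: standardized source segment -> target-language segments.
tm = {
    "Connect <PRODUCT_NUMBER> to the power supply.": {
        "sv": "Anslut <PRODUCT_NUMBER> till strömförsörjningen.",
    },
}

def lookup(source_segment: str, target_lang: str):
    """Return a full-match translation, or None if no exact match exists."""
    entry = tm.get(standardize(source_segment))
    return entry.get(target_lang) if entry else None

print(lookup("Connect AB-1234 to the power supply.", "sv"))
# -> "Anslut <PRODUCT_NUMBER> till strömförsörjningen."
```

Standardizing both the stored segments and the query is what turns near-duplicates that differ only in product numbers into a single reusable Translation Unit.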
|
32 |
Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning
Hathurusinghe, Rajitha 16 September 2020 (has links)
This thesis explores the training of a deep neural network based named entity recognizer in an end-to-end privacy-preserved setting, where dataset creation and model training happen in an environment with minimal manual intervention. As the accuracy of deep learning models on practical tasks improves, a rising concern is satisfying their demand for training data amidst concerns about data privacy. Several data-protection scenarios have been suggested in the recent past in response to public concern, along with legal guidelines to enforce them. A promising new development is decentralized model training on isolated datasets, which eliminates the privacy compromises involved in providing data to a centralized entity. However, even in this federated setting, curating the data source is still a privacy risk, especially for unstructured data sources such as text.
We explore the feasibility of automatic dataset annotation for a Named Entity Recognition (NER) task and of training a deep learning model with it in two federated learning settings. We explore the feasibility of using a dataset created in this manner to fine-tune a state-of-the-art deep learning language model for the downstream task of named entity recognition. We also explore this novel combination of deep learning NLP models and federated learning for its deviation from the classical centralized setting.
We created an automatically annotated dataset containing around 80,000 sentences, a manually annotated test set, and tools to extend the dataset with more manual annotations. We observed that the noise from automated annotation can be overcome to a degree by increasing the dataset size. We also contributed state-of-the-art NLP model developments to the federated learning framework. Overall, our NER model achieved an F1-score of around 0.80 for recognizing entities in sentences.
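The decentralized training this thesis builds on can be sketched as federated averaging. Below is a minimal illustration under stated assumptions: a toy linear model stands in for the deep NER network, and the client datasets, dimensions, and round count are invented.

```python
# A minimal sketch of federated averaging: clients train locally on isolated
# data and only weight updates leave the client. All sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
NUM_CLIENTS, DIM = 3, 8

def local_update(weights, client_data, lr=0.1):
    """One local gradient step on a client's private data (never shared)."""
    X, y = client_data
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

# Each client holds its own isolated dataset.
clients = [(rng.normal(size=(20, DIM)), rng.normal(size=20)) for _ in range(NUM_CLIENTS)]
global_weights = np.zeros(DIM)

for _ in range(10):
    # Clients train locally in parallel; the server never sees raw data.
    local_weights = [local_update(global_weights, data) for data in clients]
    # Server aggregates by (here unweighted) averaging of the client models.
    global_weights = np.mean(local_weights, axis=0)

print(global_weights.round(3))
```

The same aggregation loop applies when the local model is a neural NER network; only the `local_update` step changes.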
|
33 |
Exploring Data Extraction and Relation Identification Using Machine Learning : Utilizing Machine-Learning Techniques to Extract Relevant Information from Climate Reports
Berger, William, Fooladi, Alex, Lindgren, Markus, Messo, Michel, Rosengren, Jonas, Rådmann, Lukas January 2023 (has links)
Ensuring the accessibility of data from Swedish municipal climate reports is necessary for examining climate work in Sweden. Manual data extraction is time-consuming and prone to errors, necessitating automation of the process. This project presents machine-learning techniques that can be used to extract data and information from Swedish municipal climate plans, in order to improve the accessibility of climate data. The proposed solution involves recognizing entities in plain text and extracting predefined relations between them, using Named Entity Recognition and Relation Extraction respectively. Due to the lack of annotated climate datasets in Swedish, the result of the project is a functioning prototype in the medical domain. Nevertheless, the problem remains the same: how to effectively perform data extraction from reports using machine-learning techniques. The presented prototype demonstrates the potential of automating data extraction from reports. These findings imply that the system could be adapted to handle climate reports once a sufficient dataset becomes available. / Making the information compiled in Swedish municipal climate plans accessible is crucial for evaluating and questioning climate work in Sweden. Manual data extraction is time-consuming and complicated, which underlines the need to automate the process. This project explores machine-learning techniques that can be used to extract data and information from the municipal climate plans. The proposed solution uses Named Entity Recognition to identify entities in text and Relation Extraction to extract predefined relations between the entities. In the absence of annotated Swedish datasets in the climate domain, the result of the project is a functioning prototype in the medical domain. The research question is thus the same: how machine learning can be used to perform data extraction on reports. The prototype presented shows the potential of automating this type of data extraction. This success suggests that the model can be adapted to handle climate reports once an adequate dataset becomes available.
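A minimal sketch of the two-step pipeline described above, with an invented gazetteer, relation trigger, and sentence standing in for the learned NER and RE components:

```python
# A rule-based stand-in for the NER -> RE pipeline: step 1 finds entities,
# step 2 checks a predefined relation between entity pairs. All names,
# labels, and trigger phrases are illustrative.
import re

GAZETTEER = {
    "MUNICIPALITY": ["Uppsala", "Lund"],
    "TARGET": ["net-zero emissions", "fossil-free transport"],
}

def recognize_entities(sentence):
    """Step 1 (NER): find gazetteer entities with their types and offsets."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(re.escape(name), sentence):
                found.append((label, name, m.start()))
    return sorted(found, key=lambda e: e[2])

def extract_relations(sentence, entities):
    """Step 2 (RE): link MUNICIPALITY -> TARGET when a trigger appears between them."""
    relations = []
    for (l1, n1, p1) in entities:
        for (l2, n2, p2) in entities:
            if l1 == "MUNICIPALITY" and l2 == "TARGET" and p1 < p2:
                between = sentence[p1 + len(n1):p2]
                if re.search(r"\b(commits to|aims for)\b", between):
                    relations.append((n1, "HAS_CLIMATE_TARGET", n2))
    return relations

text = "Uppsala commits to net-zero emissions by 2030."
ents = recognize_entities(text)
print(ents)
print(extract_relations(text, ents))
```

In the project itself both steps are learned models rather than rules, but the data flow from entity mentions to typed relations is the same.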
|
34 |
A System for Automatic Information Extraction from Log Files
Chhabra, Anubhav 15 August 2022 (links)
The development of technology and of data-driven systems and applications is constantly revolutionizing our lives. We are surrounded by digitized systems and solutions that are transforming our lives and making them easier. The criticality and complexity behind these systems are immense. To satisfy users and keep up with business needs, these digital systems must offer high availability and minimal downtime, and must mitigate cyber attacks. Hence, system monitoring becomes an integral part of the lifecycle of a digital product or system. System monitoring often involves analyzing the logs output by a system, which contain information about the events occurring within it. The first step in log analysis is generally to understand and segregate the various logical components within a log line, a task termed log parsing.
Traditional log parsers use regular expressions and human-defined grammars to extract information from logs. Human experts are required to create, maintain and update the database containing these regular expressions and rules, and they must keep up with the pace at which new products, applications and systems are developed and deployed, as each unique application or system has its own set of logs and logging standards. Logs from new sources tend to break existing systems, as none of the expressions match the signature of the incoming logs. The reasons mentioned above make traditional log parsers time-consuming, hard to maintain, error-prone, and unscalable. On the other hand, machine-learning-based methodologies can help us develop solutions that automate the log parsing process without much intervention from human experts. NERLogParser is one such solution, using a Bidirectional Long Short-Term Memory (BiLSTM) architecture to frame the log parsing problem as a Named Entity Recognition (NER) problem. There have been recent advancements in the Natural Language Processing (NLP) domain with the introduction of architectures like the Transformer and Bidirectional Encoder Representations from Transformers (BERT). However, these techniques have not yet been applied to the problem of information extraction from log files, which leaves a clear research gap in which to experiment with these recent deep learning architectures.
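To make the NER framing concrete, the sketch below assigns an entity label to every token of a log line; the label set and rules here are illustrative stand-ins for what NERLogParser or a transformer model would learn, not the thesis's actual schema.

```python
# Log parsing framed as NER: every token in a log line receives an entity
# label, exactly as words do in sentence-level NER. Labels are illustrative.
import re

LABEL_RULES = [
    (re.compile(r"^\d{4}-\d{2}-\d{2}$"), "DATE"),
    (re.compile(r"^\d{2}:\d{2}:\d{2}$"), "TIME"),
    (re.compile(r"^(INFO|WARN|ERROR|DEBUG)$"), "LEVEL"),
    (re.compile(r"^\d{1,3}(\.\d{1,3}){3}$"), "IP"),
]

def tag_log_line(line: str):
    """Assign an entity label to every token (O = not part of any entity)."""
    tagged = []
    for token in line.split():
        label = next((lab for pat, lab in LABEL_RULES if pat.match(token)), "O")
        tagged.append((token, label))
    return tagged

line = "2022-08-15 13:37:02 ERROR connection from 192.168.0.1 refused"
for token, label in tag_log_line(line):
    print(f"{token}\t{label}")
```

A learned model replaces the rule table with predictions, which is what lets it generalize to log formats the rules have never seen.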
This thesis extensively compares different machine-learning-based log parsing approaches that frame the log parsing problem as a NER problem. We compare 14 different approaches, including three traditional word-based methods (Naive Bayes, Perceptron and Stochastic Gradient Descent); a graphical model (Conditional Random Fields, CRF); a pre-trained sequence-to-sequence model for log parsing (NERLogParser); an attention-based sequence-to-sequence model (the Transformer neural network); three neural language models (BERT, RoBERTa and DistilBERT); two traditional ensembles; and three cascading classifiers formed from the individual classifiers mentioned above. We evaluate the NER approaches using an evaluation framework that offers four different evaluation schemes, which not only help compare the NER approaches but also help us assess the quality of the extracted information.
The primary goal of this research is to evaluate the NER approaches on logs from new and unseen sources. To the best of our knowledge, no study in the literature evaluates NER methodologies in such a context. Evaluating NER approaches on unseen logs helps us understand the robustness and generalization capabilities of the various methodologies. To carry out the experimentation, we use In-Scope and Out-of-Scope datasets. The two datasets originate from entirely different sources and are mutually exclusive. The In-Scope dataset is used for training, validation and testing, whereas the Out-of-Scope dataset is used purely to evaluate the robustness and generalization capability of the NER approaches.
To better deal with logs from unknown sources, we propose Log Diversification Unit (LoDU), a unit of our system that enables us to carry out log augmentation and enrichment, which helps make the NER approaches more robust towards new and unseen logs. We segregate our final results on a use-case basis where different NER approaches may be suitable for various applications. Overall, traditional ensembles perform the best in parsing the Out-of-Scope log files, but they may not be the best option to consider for real-time applications. On the other hand, if we want to balance the trade-off between performance and throughput, cascading classifiers can be considered the go-to solution.
|
35 |
Multilingual Transformer Models for Maltese Named Entity Recognition
Farrugia, Kris January 2022 (links)
The recently developed state-of-the-art models for Named Entity Recognition depend heavily on huge amounts of annotated data. Consequently, it is extremely challenging for data-scarce languages to obtain significant results. Several approaches have been proposed to circumvent this issue, including cross-lingual transfer learning, which leverages knowledge obtained from available resources in a source language and transfers it to a low-resource target language. Maltese is one of many severely under-resourced languages. The main purpose of this project is to investigate how recently developed multilingual transformer models (Multilingual BERT and XLM-RoBERTa) perform, and ultimately to establish an evaluation benchmark for zero-shot cross-lingual transfer learning for Maltese Named Entity Recognition. The models are fine-tuned on Arabic, English, Italian, Spanish and Dutch. The experiments evaluated the efficacy of the source languages and the use of multilingual data in the training and validation stages. They demonstrated that feeding multilingual data to both the training and the validation phases was mostly beneficial to performance, whereas adding it to the validation phase only was generally detrimental. Furthermore, XLM-R achieved better scores overall; however, employing mBERT with English as the source language yielded the best performance.
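A minimal sketch of the zero-shot setup, assuming the Hugging Face transformers library; the checkpoint name and label set are illustrative, and the source-language fine-tuning loop is omitted:

```python
# Zero-shot cross-lingual transfer, sketched: fine-tune a multilingual encoder
# on source-language NER data, then apply it directly to Maltese text for
# which no labelled training data exists. Label set is illustrative.
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS)
)

# Step 1 (omitted): fine-tune on source-language NER data, e.g. English,
# optionally mixing multilingual data into training and validation.
# Step 2: zero-shot evaluation on Maltese, never seen with labels.
sentence = "Marija tgħix il-Belt Valletta."
inputs = tokenizer(sentence, return_tensors="pt")
pred_ids = model(**inputs).logits.argmax(dim=-1)[0].tolist()
print([LABELS[i] for i in pred_ids])  # one label per subword token; random until step 1 runs
```

The transfer works because the encoder's multilingual pre-training places Maltese tokens in the same representation space as the source languages.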
|
36 |
Information Extraction of Technical Details From Scholarly Articles
Kaushal, Kulendra Kumar 16 June 2021 (links)
Researchers have made significant progress in information extraction from short documents in the last few years, including social media interactions, news articles, and email excerpts. This research aims to extract technical entities such as hardware resources, computing platforms, compute time, programming languages, and libraries from scholarly research articles. Research articles are generally long documents containing both salient and non-salient entities. Analyzing cross-sectional relations, filtering relevant information, measuring the saliency of mentioned entities, and extracting novel entities are some of the technical challenges involved in this research. This work presents a detailed study of the performance, effectiveness, and scalability of rule-based, weakly supervised algorithms. We also develop an automated end-to-end Research Entity and Relationship Extractor (E2R Extractor). Additionally, we perform a comprehensive study of the effectiveness of existing deep learning based information extraction tools such as Dygie, Dygie++, and SciREX. The research also contributes a dataset containing novel entities annotated in BILUO format and reports baseline results obtained with the E2R Extractor on the proposed dataset. The results indicate that the E2R Extractor successfully extracts salient entities from research articles. / Master of Science / Information extraction is the process of automatically extracting meaningful information from unstructured text, such as articles and news feeds, and presenting it in a structured format. Researchers have made significant progress in this domain over the past few years. However, their work primarily focuses on short documents such as social media interactions, news articles, and email excerpts, not on long documents such as scholarly articles and research papers. Long documents contain a lot of redundant data, so filtering and extracting meaningful information from them is quite challenging. This work focuses on extracting entities such as hardware resources, compute platforms, and programming languages used in scholarly articles. We present a deep learning based model to extract such entities from research articles and evaluate its performance against simple rule-based algorithms and other state-of-the-art models for extracting the desired entities. Our work also contributes a labeled dataset containing the entities mentioned above, together with results obtained on this dataset using our deep learning model.
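For readers unfamiliar with the BILUO format mentioned above, here is a minimal sketch with invented tokens and spans; these are not the contributed dataset's actual annotations.

```python
# BILUO tagging: each token is marked Beginning, Inside, or Last of a
# multi-token entity, Unit for a single-token entity, or Outside.
def biluo_tags(tokens, spans):
    """spans: list of (start_token, end_token_exclusive, label)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags

tokens = "We trained on 8 NVIDIA V100 GPUs using PyTorch .".split()
spans = [(4, 7, "HARDWARE"), (8, 9, "LIBRARY")]
print(list(zip(tokens, biluo_tags(tokens, spans))))
# NVIDIA/B-HARDWARE V100/I-HARDWARE GPUs/L-HARDWARE, PyTorch/U-LIBRARY
```

The explicit Last and Unit tags are what distinguish BILUO from plain BIO and make entity boundaries unambiguous for the learner.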
|
37 |
Identifying Single and Stacked News Triangles in Online News Articles - an Analysis of 31 Danish Online News Articles Annotated by 68 Journalists
Njor, Miklas January 2015 (links)
While news articles for print use a single News Triangle, in which the most important information is at the top of the article, online news articles are supposed to use a series of Stacked News Triangles, owing to online readers' text-skimming habits [1]. To identify the presence of Stacked News Triangles, we analyse how 68 Danish journalists annotate 31 articles, using keyword frequency as the measure of popularity. To explore whether Named Entities influence the presence of News Triangles, we analyse the Named Entities found in the articles and keywords. We find an overall News Triangle in 30 of the 31 articles, while 14 of the 31 articles exhibit Stacked News Triangles. For Named Entities in News Triangles, we cannot determine what their influence is. Nonetheless, we find differences in Named Entity types across categories (Culture, Domestic, Economy, Sports).
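A minimal sketch of the measurement idea, assuming keyword frequency as the popularity measure; the keywords, paragraphs, and monotone-decrease criterion are illustrative, not the study's exact method.

```python
# A news triangle, operationalized: keyword density is highest at the top of
# a block and falls off below it. Keywords and paragraphs are invented.
def keyword_density(paragraph, keywords):
    words = paragraph.lower().split()
    return sum(w.strip(".,") in keywords for w in words) / max(len(words), 1)

def has_triangle(densities):
    """True if densities are (weakly) decreasing, i.e. important info on top."""
    return all(a >= b for a, b in zip(densities, densities[1:]))

keywords = {"election", "minister", "vote"}
paragraphs = [
    "The minister won the election after a record vote.",
    "The vote was held on Tuesday.",
    "Observers noted the calm atmosphere.",
]
densities = [keyword_density(p, keywords) for p in paragraphs]
print(densities, has_triangle(densities))
```

Stacked triangles would then show up as several such decreasing runs, one per section, rather than a single decline over the whole article.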
|
38 |
Hierarchical Joint Entity Recognition and Relation Extraction of Contextual Entities in Family History Records
Segrera, Daniel 08 March 2023 (links) (PDF)
Entity extraction is an important step in document understanding. Higher-accuracy extraction of fine-grained entities can be achieved by combining the utility of Named Entity Recognition (NER) and Relation Extraction (RE) models. In this paper, a cascading model is proposed that implements NER and RE. The model uses relations between entities to infer context-dependent, fine-grained named entities in text corpora. The RE module runs independently of the NER module, which reduces error accumulation across sequential steps. This approach improves the fine-grained NER F1-score over the existing state of the art from 0.4753 to 0.8563 on our data, albeit in a strictly limited domain. This offers potential for further applications in historical document processing, enabling automated searching of historical documents such as those used in economics research and family history.
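A minimal sketch of the cascading idea, with toy NER and RE stand-ins; the labels and record text are invented, and the real modules are learned rather than rule-based.

```python
# Cascading NER + RE: a coarse NER pass finds PERSON mentions, an independent
# RE pass links entity pairs, and the relation then upgrades coarse labels to
# context-dependent fine-grained types. Everything here is illustrative.
import re

def coarse_ner(text):
    """Toy NER: capitalized token pairs are PERSON mentions."""
    return [(m.group(), "PERSON") for m in re.finditer(r"[A-Z][a-z]+ [A-Z][a-z]+", text)]

def relation_extraction(text, entities):
    """Toy RE, run independently of NER: 'wife of' links two PERSONs."""
    names = [name for name, _ in entities]
    if len(names) == 2 and re.search(r"\bwife of\b", text):
        return [(names[0], "SPOUSE_OF", names[1])]
    return []

def refine(entities, relations):
    """Use relations to upgrade coarse labels to fine-grained ones."""
    fine = dict(entities)
    for head, rel, tail in relations:
        if rel == "SPOUSE_OF":
            fine[head] = "PERSON-SPOUSE"
            fine[tail] = "PERSON-PRINCIPAL"
    return fine

record = "Mary Smith , wife of John Smith , was born in 1854."
ents = coarse_ner(record)
print(refine(ents, relation_extraction(record, ents)))
```

Because the RE step does not consume NER output, an NER mistake cannot corrupt the relation evidence, which is the error-accumulation point the abstract makes.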
|
39 |
Improving Automatic Transcription Using Natural Language Processing
Kiefer, Anna 01 March 2024 (links) (PDF)
Digital Democracy is a CalMatters and California Polytechnic State University initiative to promote transparency in state government by increasing access to the California legislature. While Digital Democracy is made up of many resources, one foundational step of the project is obtaining accurate, timely transcripts of California Senate and Assembly hearings. The information extracted from these transcripts provides crucial data for subsequent steps in the pipeline. In the context of Digital Democracy, upleveling is when humans verify, correct, and annotate the transcript results after the legislative hearings have been automatically transcribed. The upleveling process is done with the assistance of a software application called the Transcription Tool. The human upleveling process is the most costly and time-consuming step of the Digital Democracy pipeline. In this thesis, we hypothesize that we can make significant reductions to the time needed for upleveling by using Natural Language Processing (NLP) systems and techniques. The main contribution of this thesis is engineering a new automatic transcription pipeline. Specifically, this thesis integrates a new automatic speech recognition service, a new speaker diarization model, additional text post-processing changes, and a new process for speaker identification. To evaluate the system's improvements, we measure the accuracy and speed of the newly integrated features and record editor upleveling time both before and after the additions.
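The pipeline shape described above can be sketched as composed stages; every implementation below is a hypothetical stub (the real system wires in external ASR and diarization services, and the roster mapping is invented).

```python
# A sketch of the transcription pipeline: ASR -> diarization -> text
# post-processing -> speaker identification. All stages are stubs.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds into the hearing
    end: float
    speaker: str   # diarization label, later resolved to a name
    text: str

ROSTER = {"spk_0": "Chairperson"}  # hypothetical label -> name map

def asr(audio_path: str) -> list[Segment]:
    """Stub: would call the automatic speech recognition service."""
    return [Segment(0.0, 4.2, "spk_0", "the committee will come to order")]

def diarize(segments: list[Segment]) -> list[Segment]:
    """Stub: would assign or refine speaker labels per segment."""
    return segments

def post_process(segments: list[Segment]) -> list[Segment]:
    """Light text cleanup, e.g. capitalization, before human upleveling."""
    return [Segment(s.start, s.end, s.speaker, s.text.capitalize()) for s in segments]

def identify_speakers(segments: list[Segment]) -> list[Segment]:
    """Stub: map diarization labels to known legislators."""
    return [Segment(s.start, s.end, ROSTER.get(s.speaker, s.speaker), s.text) for s in segments]

transcript = identify_speakers(post_process(diarize(asr("hearing.wav"))))
print(transcript)
```

Keeping each stage behind its own function is what lets the thesis swap in a new ASR service or diarization model without touching the rest of the flow.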
|
40 |
Using Concept Maps as a Tool for Cross-Language Relevance Determination
Richardson, W. Ryan 02 August 2007 (links)
Concept maps, introduced by Novak, aid learners' understanding. I hypothesize that concept maps can also function as summaries of large documents, e.g., electronic theses and dissertations (ETDs). I have built a system that automatically generates concept maps from English-language ETDs in the computing field. The system will also provide Spanish translations of these concept maps for native Spanish speakers. Using machine translation techniques, my approach leads to concept maps that could allow researchers to discover pertinent dissertations in languages they cannot read, helping them decide whether they want a potentially relevant dissertation translated.
I am using a state-of-the-art natural language processing system, called Relex, to extract noun phrases and noun-verb-noun relations from ETDs, and then produce concept maps automatically. I also have incorporated information from the table of contents of ETDs to create novel styles of concept maps. I have conducted five user studies, to evaluate user perceptions about these different map styles.
I am using several methods to translate node and link text in concept maps from English to Spanish. Nodes labeled with single words from a given technical area can be translated using wordlists, but phrases in specific technical fields can be difficult to translate. Thus I have amassed a collection of about 580 Spanish-language ETDs from Scirus and two Mexican universities and I am using this corpus to mine phrase translations that I could not find otherwise.
The usefulness of the automatically-generated and translated concept maps has been assessed in an experiment at Universidad de las Americas (UDLA) in Puebla, Mexico. This experiment demonstrated that concept maps can augment abstracts (translated using a standard machine translation package) in helping Spanish speaking users find ETDs of interest. / Ph. D.
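A minimal sketch of turning noun-verb-noun relations into a concept map; the regex stands in for Relex's full parsing, and the verbs and sentences are invented.

```python
# Concept-map construction from noun-verb-noun triples: each triple becomes
# an edge (concept -> link label -> concept). The pattern is illustrative.
import re

TRIPLE = re.compile(r"(\w+(?: \w+)?) (uses|extends|evaluates) (\w+(?: \w+)?)")

def extract_triples(text):
    """Noun-verb-noun triples become (concept, link label, concept) edges."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TRIPLE.finditer(text)]

def to_concept_map(triples):
    """Adjacency-list concept map: node -> [(link label, node), ...]."""
    graph = {}
    for head, link, tail in triples:
        graph.setdefault(head, []).append((link, tail))
    return graph

abstract = "The system uses machine translation. Each map extends the abstract."
print(to_concept_map(extract_triples(abstract)))
```

Node and link labels in such a graph are exactly the strings that the dissertation's wordlist and mined phrase translations would then render into Spanish.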
|