Global ETD Search

61	A System for Automatic Information Extraction from Log Files Chhabra, Anubhav 15 August 2022 (has links) The development of technology, data-driven systems and applications are constantly revolutionizing our lives. We are surrounded by digitized systems/solutions that are transforming and making our lives easier. The criticality and complexity behind these systems are immense. So as to meet user satisfaction and keep up with the business needs, these digital systems should possess high availability, minimum downtime, and mitigate cyber attacks. Hence, system monitoring becomes an integral part of the lifecycle of a digital product/system. System monitoring often includes monitoring and analyzing logs outputted by the systems containing information about the events occurring within a system. The first step in log analysis generally includes understanding and segregating the various logical components within a log line, termed log parsing. Traditional log parsers use regular expressions and human-defined grammar to extract information from logs. Human experts are required to create, maintain and update the database containing these regular expressions and rules. They should keep up with the pace at which new products, applications and systems are being developed and deployed, as each unique application/system would have its own set of logs and logging standards. Logs from new sources tend to break the existing systems as none of the expressions match the signature of the incoming logs. The reasons mentioned above make the traditional log parsers time-consuming, hard to maintain, prone to errors, and not a scalable approach. On the other hand, machine learning based methodologies can help us develop solutions that automate the log parsing process without much intervention from human experts. NERLogParser is one such solution that uses a Bidirectional Long Short Term Memory (BiLSTM) architecture to frame the log parsing problem as a Named Entity Recognition (NER) problem. There have been recent advancements in the Natural Language Processing (NLP) domain with the introduction of architectures like Transformer and Bidirectional Encoder Representations from Transformers (BERT). However, these techniques have not been applied to tackle the problem of information extraction from log files. This gives us a clear research gap to experiment with the recent advanced deep learning architectures. This thesis extensively compares different machine learning based log parsing approaches that frame the log parsing problem as a NER problem. We compare 14 different approaches, including three traditional word-based methods: Naive Bayes, Perceptron and Stochastic Gradient Descent; a graphical model: Conditional Random Fields (CRF); a pre-trained sequence-to-sequence model for log parsing: NERLogParser; an attention-based sequence-to-sequence model: Transformer Neural Network; three different neural language models: BERT, RoBERTa and DistilBERT; two traditional ensembles and three different cascading classifiers formed using the individual classifiers mentioned above. We evaluate the NER approaches using an evaluation framework that offers four different evaluation schemes that not just help in comparing the NER approaches but also help us assess the quality of extracted information. The primary goal of this research is to evaluate the NER approaches on logs from new and unseen sources. To the best of our knowledge, no study in the literature evaluates the NER methodologies in such a context. Evaluating NER approaches on unseen logs helps us understand the robustness and the generalization capabilities of various methodologies. To carry out the experimentation, we use In-Scope and Out-of-Scope datasets. Both the datasets originate from entirely different sources and are entirely mutually exclusive. The In-Scope dataset is used for training, validation and testing purposes, whereas the Out-of-Scope dataset is purely used to evaluate the robustness and generalization capability of NER approaches. To better deal with logs from unknown sources, we propose Log Diversification Unit (LoDU), a unit of our system that enables us to carry out log augmentation and enrichment, which helps make the NER approaches more robust towards new and unseen logs. We segregate our final results on a use-case basis where different NER approaches may be suitable for various applications. Overall, traditional ensembles perform the best in parsing the Out-of-Scope log files, but they may not be the best option to consider for real-time applications. On the other hand, if we want to balance the trade-off between performance and throughput, cascading classifiers can be considered the go-to solution. Log Parsing System Monitoring Log Forensics Deep Learning Machine Learning Named Entity Recognition
62	Multilingual Transformer Models for Maltese Named Entity Recognition Farrugia, Kris January 2022 (has links) The recently developed state-of-the-art models for Named Entity Recognition are heavily dependent upon huge amounts of available annotated data. Consequently, it is extremely challenging for data-scarce languages to obtain significant result. Several approaches have been proposed to circumvent this issue, including cross-lingual transfer learning, which is the leveraging of knowledge obtained by available resources in the source language and transfer it to a target low-resource language. Maltese is one of the many majorly underresourced languages. The main purpose of this project is to research how recently developed transformer multilingual models (Multilingual BERT and XLM-RoBERTa) perform and to ultimately set up an evaluation benchmark in zero-shot cross-lingual transfer learning for Maltese Named Entity Recognition. The models are fine-tuned on Arabic, English, Italian, Spanish and Dutch. The experiments evaluated the efficacy of the source languages and the use of multilingual data in both the training and validation stages. The experiments demonstrated that feeding multilingual data to both the training and the validation phases was mostly beneficial to the performance. However, adding it to the validation phase only was generally detrimental. Furthermore, XLM-R achieved overall better scores however, employing mBERT and English as the source language yielded the best performance. low-resource named-entity information extraction Maltese
63	A Master's thesis consisting of I. Acting book for the role of Blanche Dubois in A Streetcar Named Desire; II. Production log book for the role of Madeleine in Victims of Duty Modyman, Linda January 1962 (has links) Thesis (M.F.A.)--Boston University A Streetcar Named Desire Victims of duty Dubois, Blanche Drama Acting books Production books
64	Information Extraction of Technical Details From Scholarly Articles Kaushal, Kulendra Kumar 16 June 2021 (has links) Researchers have made significant progress in information extraction from short documents in the last few years, including social media interaction, news articles, and email excerpts. This research aims to extract technical entities like hardware resources, computing platforms, compute time, programming language, and libraries from scholarly research articles. Research articles are generally long documents having both salient as well as non-salient entities. Analyzing the cross-sectional relation, filtering the relevant information, measuring the saliency of mentioned entities, and extracting novel entities are some of the technical challenges involved in this research. This work presents a detailed study about the performance, effectiveness, and scalability of rule-based weakly supervised algorithms. We also develop an automated end-to-end Research Entity and Relationship Extractor (E2R Extractor). Additionally, we perform a comprehensive study about the effectiveness of existing deep learning-based information extraction tools like Dygie, Dygie++, SciREX. The research also contributes a dataset containing novel entities annotated in BILUO format and represents the baseline results using the E2R extractor on the proposed dataset. The results indicate that the E2R extractor successfully extracts salient entities from research articles. / Master of Science / Information extraction is a process of automatically extracting meaningful information from unstructured text such as articles, news feeds and presenting it in a structured format. Researchers have made significant progress in this domain over the past few years. However, their work primarily focuses on short documents such as social media interactions, news articles, email excerpts, and not on long documents such as scholarly articles and research papers. Long documents contain a lot of redundant data, so filtering and extracting meaningful information is quite challenging. This work focuses on extracting entities such as hardware resources, compute platforms, and programming languages used in scholarly articles. We present a deep learning-based model to extract such entities from research articles and research papers. We evaluate the performance of our deep learning model against simple rule-based algorithms and other state-of-the-art models for extracting the desired entities. Our work also contributes a labeled dataset containing the entities mentioned above and results obtained on this dataset using our deep learning model. Information Extraction Long Documents Research Articles Named Entity Recognition Hardware Resources Compute Platform Programming Language and Libraries
65	Detection of Named Branch Origin for Git Commits Michaud, Heather M. 15 September 2015 (has links) No description available. Computer Science Git repository mining branching merging named branch commit origin software evolution commit
66	Identifying Single and Stacked News Triangles in Online News Articles - an Analysis of 31 Danish Online News Articles Annotated by 68 Journalists Njor, Miklas January 2015 (has links) While news articles for print use one News Triangle, where important information is at the top of the article, online news articles are supposed to use a series of Stacked News Triangles, due to online readers text- skimming habits[1]. To identify Stacked News Triangles presence, we analyse how 68 Danish journalists annotate 31 articles. We use keyword frequency as the measure of popularity. To explore if Named Entities influence News Triangle presence, we analyse Named Entities found in the articles and keywords.We find the presence of an overall News Triangle in 30 of 31 articles, while, for the presence of Stacked News Triangles, 14 of the 31 articles have Stacked News Triangles. For Named Entities in News Triangles we cannot see what their influences is. Nonetheless, we find difference in Named Entity Types in each category (Culture, Domestic, Economy, Sports). Online News Keyword popularity Keyword frequency Folksonomy Named Entity Engineering and Technology Teknik och teknologier
67	Hierarchical Joint Entity Recognition and Relation Extraction of Contextual Entities in Family History Records Segrera, Daniel 08 March 2023 (has links) (PDF) Entity extraction is an important step in document understanding. Higher accuracy entity extraction on fine-grained entities can be achieved by combining the utility of Named Entity Recognition (NER) and Relation Extraction (RE) models. In this paper, a cascading model is proposed that implements NER and Relation extraction. This model utilizes relations between entities to infer context-dependent fine-grain named entities in text corpora. The RE module runs independent of the NER module, which reduces error accumulation from sequential steps. This process improves on the fine-grained NER F1-score of existing state-of-the-art from .4753 to .8563 on our data, albeit on a strictly limited domain. This provides the potential for further applications in historical document processing. These applications will enable automated searching of historical documents, such as those used in economics research and family history. natural language processing named entity recognition relation extraction information extraction Physical Sciences and Mathematics
68	Direct Speech Translation Toward High-Quality, Inclusive, and Augmented Systems Gaido, Marco 28 April 2023 (has links) When this PhD started, the translation of speech into text in a different language was mainly tackled with a cascade of automatic speech recognition (ASR) and machine translation (MT) models, as the emerging direct speech translation (ST) models were not yet competitive. To close this gap, part of the PhD has been devoted to improving the quality of direct models, both in the simplified condition of test sets where the audio is split into well-formed sentences, and in the realistic condition in which the audio is automatically segmented. First, we investigated how to transfer knowledge from MT models trained on large corpora. Then, we defined encoder architectures that give different weights to the vectors in the input sequence, reflecting the variability of the amount of information over time in speech. Finally, we reduced the adverse effects caused by the suboptimal automatic audio segmentation in two ways: on one side, we created models robust to this condition; on the other, we enhanced the audio segmentation itself. The good results achieved in terms of overall translation quality allowed us to investigate specific behaviors of direct ST systems, which are crucial to satisfy real users’ needs. On one side, driven by the ethical goal of inclusive systems, we disclosed that established technical choices geared toward high general performance (statistical word segmentation of the target text, knowledge distillation from MT) cause an exacerbation of the gender representational disparities in the training data. Along this line of work, we proposed mitigation techniques that reduce the gender bias of ST models, and showed how gender-specific systems can be used to control the translation of gendered words related to the speakers, regardless of their vocal traits. On the other side, motivated by the practical needs of interpreters and translators, we evaluated the potential of direct ST systems in the “augmented translation” scenario, focusing on the translation and recognition of named entities (NEs). Along this line of work, we proposed solutions to cope with the major weakness of ST models (handling person names), and introduced direct models that jointly perform ST and NE recognition showing their superiority over a pipeline of dedicated tools for the two tasks. Overall, we believe that this thesis moves a step forward toward adopting direct ST systems in real applications, increasing the awareness of their strengths and weaknesses compared to the traditional cascade paradigm. direct speech translation audio segmentation gender bias named entities augmented translation
69	Improving Automatic Transcription Using Natural Language Processing Kiefer, Anna 01 March 2024 (has links) (PDF) Digital Democracy is a CalMatters and California Polytechnic State University initia-tive to promote transparency in state government by increasing access to the Califor-nia legislature. While Digital Democracy is made up of many resources, one founda-tional step of the project is obtaining accurate, timely transcripts of California Senateand Assembly hearings. The information extracted from these transcripts providescrucial data for subsequent steps in the pipeline. In the context of Digital Democracy,upleveling is when humans verify, correct, and annotate the transcript results afterthe legislative hearings have been automatically transcribed. The upleveling processis done with the assistance of a software application called the Transcription Tool.The human upleveling process is the most costly and time-consuming step of the Dig-ital Democracy pipeline. In this thesis, we hypothesize that we can make significantreductions to the time needed for upleveling by using Natural Language Processing(NLP) systems and techniques. The main contribution of this thesis is engineeringa new automatic transcription pipeline. Specifically, this thesis integrates a new au-tomatic speech recognition service, a new speaker diarization model, additional textpost-processing changes, and a new process for speaker identification. To evaluate the system’s improvements, we measure the accuracy and speed of the newly integrated features and record editor upleveling time both before and after the additions. Natural Language Processing Named Entity Recognition Transcription Automatic Speech Recognition Artificial Intelligence and Robotics Software Engineering
70	Using Concept Maps as a Tool for Cross-Language Relevance Determination Richardson, W. Ryan 02 August 2007 (has links) Concept maps, introduced by Novak, aid learners' understanding. I hypothesize that concept maps also can function as a summary of large documents, e.g., electronic theses and dissertations (ETDs). I have built a system that automatically generates concept maps from English-language ETDs in the computing field. The system also will provide Spanish translations of these concept maps for native Spanish speakers. Using machine translation techniques, my approach leads to concept maps that could allow researchers to discover pertinent dissertations in languages they cannot read, helping them to decide if they want a potentially relevant dissertation translated. I am using a state-of-the-art natural language processing system, called Relex, to extract noun phrases and noun-verb-noun relations from ETDs, and then produce concept maps automatically. I also have incorporated information from the table of contents of ETDs to create novel styles of concept maps. I have conducted five user studies, to evaluate user perceptions about these different map styles. I am using several methods to translate node and link text in concept maps from English to Spanish. Nodes labeled with single words from a given technical area can be translated using wordlists, but phrases in specific technical fields can be difficult to translate. Thus I have amassed a collection of about 580 Spanish-language ETDs from Scirus and two Mexican universities and I am using this corpus to mine phrase translations that I could not find otherwise. The usefulness of the automatically-generated and translated concept maps has been assessed in an experiment at Universidad de las Americas (UDLA) in Puebla, Mexico. This experiment demonstrated that concept maps can augment abstracts (translated using a standard machine translation package) in helping Spanish speaking users find ETDs of interest. / Ph. D. named entity extraction computing ontologies concept mapping cross-language information retrieval

Search results