Spelling suggestions: "subject:"[een] NAMED ENTITY"" "subject:"[enn] NAMED ENTITY""
71 |
Natural language processing in cross-media analysisWoldemariam, Yonas Demeke January 2018 (has links)
A cross-media analysis framework is an integrated multi-modal platform where a media resource containing different types of data such as text, images, audio and video is analyzed with metadata extractors, working jointly to contextualize the media resource. It generally provides cross-media analysis and automatic annotation, metadata publication and storage, searches and recommendation services. For on-line content providers, such services allow them to semantically enhance a media resource with the extracted metadata representing the hidden meanings and make it more efficiently searchable. Within the architecture of such frameworks, Natural Language Processing (NLP) infrastructures cover a substantial part. The NLP infrastructures include text analysis components such as a parser, named entity extraction and linking, sentiment analysis and automatic speech recognition. Since NLP tools and techniques are originally designed to operate in isolation, integrating them in cross-media frameworks and analyzing textual data extracted from multimedia sources is very challenging. Especially, the text extracted from audio-visual content lack linguistic features that potentially provide important clues for text analysis components. Thus, there is a need to develop various techniques to meet the requirements and design principles of the frameworks. In our thesis, we explore developing various methods and models satisfying text and speech analysis requirements posed by cross-media analysis frameworks. The developed methods allow the frameworks to extract linguistic knowledge of various types and predict various information such as sentiment and competence. We also attempt to enhance the multilingualism of the frameworks by designing an analysis pipeline that includes speech recognition, transliteration and named entity recognition for Amharic, that also enables the accessibility of Amharic contents on the web more efficiently. The method can potentially be extended to support other under-resourced languages.
|
72 |
Automatically Detecting the Resonance of Terrorist Movement Frames on the WebEtudo, Ugochukwu O 01 January 2017 (has links)
The ever-increasing use of the internet by terrorist groups as a platform for the dissemination of radical, violent ideologies is well documented. The internet has, in this way, become a breeding ground for potential lone-wolf terrorists; that is, individuals who commit acts of terror inspired by the ideological rhetoric emitted by terrorist organizations. These individuals are characterized by their lack of formal affiliation with terror organizations, making them difficult to intercept with traditional intelligence techniques. The radicalization of individuals on the internet poses a considerable threat to law enforcement and national security officials. This new medium of radicalization, however, also presents new opportunities for the interdiction of lone wolf terrorism. This dissertation is an account of the development and evaluation of an information technology (IT) framework for detecting potentially radicalized individuals on social media sites and Web fora. Unifying Collective Action Framing Theory (CAFT) and a radicalization model of lone wolf terrorism, this dissertation analyzes a corpus of propaganda documents produced by several, radically different, terror organizations. This analysis provides the building blocks to define a knowledge model of terrorist ideological framing that is implemented as a Semantic Web Ontology. Using several techniques for ontology guided information extraction, the resultant ontology can be accurately processed from textual data sources. This dissertation subsequently defines several techniques that leverage the populated ontological representation for automatically identifying individuals who are potentially radicalized to one or more terrorist ideologies based on their postings on social media and other Web fora. The dissertation also discusses how the ontology can be queried using intuitive structured query languages to infer triggering events in the news. The prototype system is evaluated in the context of classification and is shown to provide state of the art results. The main outputs of this research are (1) an ontological model of terrorist ideologies (2) an information extraction framework capable of identifying and extracting terrorist ideologies from text, (3) a classification methodology for classifying Web content as resonating the ideology of one or more terrorist groups and (4) a methodology for rapidly identifying news content of relevance to one or more terrorist groups.
|
73 |
Reconhecimento de entidades mencionadas em português utilizando aprendizado de máquina / Portuguese named entity recognition using machine learningWesley Seidel Carvalho 24 February 2012 (has links)
O Reconhecimento de Entidades Mencionadas (REM) é uma subtarefa da extração de informações e tem como objetivo localizar e classificar elementos do texto em categorias pré-definidas tais como nome de pessoas, organizações, lugares, datas e outras classes de interesse. Esse conhecimento obtido possibilita a execução de outras tarefas mais avançadas. O REM pode ser considerado um dos primeiros passos para a análise semântica de textos, além de ser uma subtarefa crucial para sistemas de gerenciamento de documentos, mineração de textos, extração da informação, entre outros. Neste trabalho, estudamos alguns métodos de Aprendizado de Máquina aplicados na tarefa de REM que estão relacionados ao atual estado da arte, dentre eles, dois métodos aplicados na tarefa de REM para a língua portuguesa. Apresentamos três diferentes formas de avaliação destes tipos de sistemas presentes na literatura da área. Além disso, desenvolvemos um sistema de REM para língua portuguesa utilizando Aprendizado de Máquina, mais especificamente, o arcabouço de máxima entropia. Os resultados obtidos com o nosso sistema alcançaram resultados equiparáveis aos melhores sistemas de REM para a língua portuguesa desenvolvidos utilizando outras abordagens de aprendizado de máquina. / Named Entity Recognition (NER), a task related to information extraction, aims to classify textual elements according to predefined categories such as names, places, dates etc. This enables the execution of more advanced tasks. NER is a first step towards semantic textual analysis and is also a crucial task for systems of information extraction and other types of systems. In this thesis, I analyze some Machine Learning methods applied to NER tasks, including two methods applied to Portuguese language. I present three ways of evaluating these types of systems found in the literature. I also develop an NER system for the Portuguese language utilizing Machine Learning that entails working with a maximum entropy framework. The results are comparable to the best NER systems for the Portuguese language developed with other Machine Learning alternatives.
|
74 |
Named Entity Recognition för Klassificering av Rubriker i Fakturor / Classification of Invoice Headers using Named Entity RecognitionKarlsson, Ludvig, Gyllström, Benjamin January 2021 (has links)
Fakturor är en viktig källa av information för företag. Två exempel på viktiga fält i en faktura kan vara, hur mycket pengar som ska betalas och faktura id. På grund av olika format och innehåll i fakturor som skiljer sig åt är extraktionen av information från dessa fakturor ofta en manuell process som kräver mycket tid. För att kunna spara viktig information från semi-strukturerade dokument som fakturor så måste vissa företag lägga ner mycket manuellt arbete. Detta arbete inkluderar att behöva förstå fakturan och därefter veta vilket innehåll som är av intresse för företaget. Detta arbete kan ta mycket tid och därför hade en automatisering av denna process varit av stort intresse. I denna forskningen används named entity recognition för att lösa problemet. De frågor som forskningen besvarar är: Hur effektiv named entity recognition är för klassificering av rubriker i fakturor, samt hur mycket effektiviteten kan öka vid komplettering av ytterligare komponenter. Named entity recognition används för att kategorisera entiteter som i detta fallet är rubriker för fält i fakturor. Modellen som skapas ska avgöra om rubriker i fakturan kan kategoriseras under någon av kategorierna: Invoice number, invoice date, due date, customer number, total amount, vat code, vat amount eller currency. Forskningen försöker endast göra en proof of concept för att se om denna algoritm kan användas för att minska tiden av manuellt arbete. Produktionsmodellen som skapas evalueras med måttet f1-score. Den får med denna metod resultatet 79 av 100. Detta resultatet antyder på att named entity recognition kan användas i ett verkligt scenario för att identifiera rubriker av intresse i en faktura. Men för att få så bra resultat som möjligt så bör modellen kombineras med en lösning som identifierar fält med hjälp av dess data. / Invoices are an important source of information for businesses. Two examples of important fields in an invoice could be the amount of money to be paid and the invoice Id. Due to the different formats and content of invoices, the extraction of information from these is often a manual and time consuming process. In order to save important information from semi-structured documents such as invoices, some companies have to put in a lot of manual work. This work includes understanding the invoice and then knowing what content is of interest to the company. This work can take a lot of time and therefore an automation of this process would be of great interest. In this research named entity recognition is used to solve the mentioned problem. The topics for this research are: How effective named entity recognition is for classification of headers in invoices, as well as how much the efficiency can be improved by complementing with further components. Named entity recognition is used to categorize entities. In this case the entities are the headings of the invoice. The model that is created must determine whether headings in the invoice can be categorized under one of the following categories: Invoice number, invoice date, due date, customer number, total amount, vat code, vat amount or currency. This research tries to make a proof of concept to discover if this algorithm can be used to reduce the time spent on manual work. The production model that is created is evaluated with the f1-score measurement. With this method, it gets a result of 79 out of 100. This result indicates that named entity recognition can be used by companies in real-world scenarios to identify headings in invoices. But to get the best results possible, the model should also be combined with a solution that identifies fields using its corresponding data.
|
75 |
Struktury trie pro zpracování rozsáhlých textových dat / Trie Structures for Large Text Data ProcessingRajčok, Andrej January 2016 (has links)
This study analyzes natural language processing with emphasis on morphological analysis of inflective languages and systems for named entity recognition. It analyzes effective pattern matching in dictionary by using succint structures and then analyzes practical implementation of succint structures. It describes design and implementation of named entity recognition system and morphological analyzer and compares and test their speed and effectiveness.
|
76 |
Serviceorientiertes Text Mining am Beispiel von Entitätsextrahierenden DienstenPfeifer, Katja 16 June 2014 (has links)
Der Großteil des geschäftsrelevanten Wissens liegt heute als unstrukturierte Information in Form von Textdaten auf Internetseiten, in Office-Dokumenten oder Foreneinträgen vor. Zur Extraktion und Verwertung dieser unstrukturierten Informationen wurde eine Vielzahl von Text-Mining-Lösungen entwickelt. Viele dieser Systeme wurden in der jüngeren Vergangenheit als Webdienste zugänglich gemacht, um die Verwertung und Integration zu vereinfachen.
Die Kombination verschiedener solcher Text-Mining-Dienste zur Lösung konkreter Extraktionsaufgaben erscheint vielversprechend, da so bestehende Stärken ausgenutzt, Schwächen der Systeme minimiert werden können und die Nutzung von Text-Mining-Lösungen vereinfacht werden kann. Die vorliegende Arbeit adressiert die flexible Kombination von Text-Mining-Diensten in einem serviceorientierten System und erweitert den Stand der Technik um gezielte Methoden zur Auswahl der Text-Mining-Dienste, zur Aggregation der Ergebnisse und zur Abbildung der eingesetzten Klassifikationsschemata.
Zunächst wird die derzeit existierende Dienstlandschaft analysiert und aufbauend darauf eine Ontologie zur funktionalen Beschreibung der Dienste bereitgestellt, so dass die funktionsgesteuerte Auswahl und Kombination der Text-Mining-Dienste ermöglicht wird. Des Weiteren werden am Beispiel entitätsextrahierender Dienste Algorithmen zur qualitätssteigernden Kombination von Extraktionsergebnissen erarbeitet und umfangreich evaluiert. Die Arbeit wird durch zusätzliche Abbildungs- und Integrationsprozesse ergänzt, die eine Anwendbarkeit auch in heterogenen Dienstlandschaften, bei denen unterschiedliche Klassifikationsschemata zum Einsatz kommen, gewährleisten. Zudem werden Möglichkeiten der Übertragbarkeit auf andere Text-Mining-Methoden erörtert.
|
77 |
Automatic Extraction and Assessment of Entities from the WebUrbansky, David 15 October 2012 (has links)
The search for information about entities, such as people or movies, plays an increasingly important role on the Web. This information is still scattered across many Web pages, making it more time consuming for a user to find all relevant information about an entity. This thesis describes techniques to extract entities and information about these entities from the Web, such as facts, opinions, questions and answers, interactive multimedia objects, and events. The findings of this thesis are that it is possible to create a large knowledge base automatically using a manually-crafted ontology. The precision of the extracted information was found to be between 75–90 % (facts and entities respectively) after using assessment algorithms. The algorithms from this thesis can be used to create such a knowledge base, which can be used in various research fields, such as question answering, named entity recognition, and information retrieval.
|
78 |
Extracting Transaction Information from Financial Press Releases / Extrahering av Transaktionsdata från Finansiella PressmeddelandenSjöberg, Agaton January 2021 (has links)
The use cases of Information Extraction (IE) are more or less endless, often consisting of a combination of Named Entity Recognition (NER) and Relation Extraction (RE). One use case of IE is the extraction of transaction information from Norwegian insider transaction Press Releases (PRs), where a transaction consists of at most four entities: the name of the owner performing the transaction, the number of shares transferred, the transaction date, and the price of the shares bought or sold. The relationships between the entities define which entity belongs to which transaction, and whether shares were bought or sold. This report has investigated how a pair of supervised NER and RE models extract this information. Since these Norwegian PRs were not labeled, two different approaches to annotating the transaction entities and their associated relations were investigated, and it was found that it is better to annotate only entities that occur in a relation than annotating all occurrences. Furthermore, the number of PRs needed to achieve a satisfactory result in the IE pipeline was investigated. The study shows that training with about 400 PRs is sufficient for the results to converge, at around 0.85 in F1-score. Finally, the report shows that there is not much difference between a complex RE model and a simple rule-based approach, when applied on the studied corpus.
|
79 |
Annotating Job Titles in Job Ads using Swedish Language ModelsRidhagen, Markus January 2023 (has links)
This thesis investigates automated annotation approaches to assist public authorities in Sweden in optimizing resource allocation and gaining valuable insights to enhance the preservation of high-quality welfare. The study uses pre-trained Swedish language models for the named entity recognition (NER) task of finding job titles in job advertisements from The Swedish Public Employment Service, Arbetsförmedlingen. Specifically, it evaluates the performance of the Swedish Bidirectional Encoder Representations from Transformers (BERT), developed by the National Library of Sweden (KB), referred to as KB-BERT. The thesis explores the impact of training data size on the models’ performance and examines whether active learning can enhance efficiency and accuracy compared to random sampling. The findings reveal that even with a small training dataset of 220 job advertisements, KB-BERT achieves a commendable F1-score of 0.770 in predicting job titles. The model’s performance improves further by augmenting the training data with an additional 500 annotated job advertisements, yielding an F1-score of 0.834. Notably, the highest F1-score of 0.856 is achieved by applying the active learning strategy of uncertainty sampling and the measure of mean entropy. The test data provided by Arbetsförmedlingen was re-annotated to evaluate the complexity of the task. The human annotator achieved an F1-score of 0.883. Based on these findings, it can be inferred that KB-BERT performs satisfactorily in classifying job titles from job ads.
|
80 |
Exploring Construction of a Company Domain-Specific Knowledge Graph from Financial Texts Using Hybrid Information ExtractionJen, Chun-Heng January 2021 (has links)
Companies do not exist in isolation. They are embedded in structural relationships with each other. Mapping a given company’s relationships with other companies in terms of competitors, subsidiaries, suppliers, and customers are key to understanding a company’s major risk factors and opportunities. Conventionally, obtaining and staying up to date with this key knowledge was achieved by reading financial news and reports by highly skilled manual labor like a financial analyst. However, with the development of Natural Language Processing (NLP) and graph databases, it is now possible to systematically extract and store structured information from unstructured data sources. The current go-to method to effectively extract information uses supervised machine learning models, which require a large amount of labeled training data. The data labeling process is usually time-consuming and hard to get in a domain-specific area. This project explores an approach to construct a company domain-specific Knowledge Graph (KG) that contains company-related entities and relationships from the U.S. Securities and Exchange Commission (SEC) 10-K filings by combining a pre-trained general NLP with rule-based patterns in Named Entity Recognition (NER) and Relation Extraction (RE). This approach eliminates the time-consuming data-labeling task in the statistical approach, and by evaluating ten 10-k filings, the model has the overall Recall of 53.6%, Precision of 75.7%, and the F1-score of 62.8%. The result shows it is possible to extract company information using the hybrid methods, which does not require a large amount of labeled training data. However, the project requires the time-consuming process of finding lexical patterns from sentences to extract company-related entities and relationships. / Företag existerar inte som isolerade organisationer. De är inbäddade i strukturella relationer med varandra. Att kartlägga ett visst företags relationer med andra företag när det gäller konkurrenter, dotterbolag, leverantörer och kunder är nyckeln till att förstå företagets huvudsakliga riskfaktorer och möjligheter. Det konventionella sättet att hålla sig uppdaterad med denna viktiga kunskap var genom att läsa ekonomiska nyheter och rapporter från högkvalificerad manuell arbetskraft som till exempel en finansanalytiker. Men med utvecklingen av ”Natural Language Processing” (NLP) och grafdatabaser är det nu möjligt att systematiskt extrahera och lagra strukturerad information från ostrukturerade datakällor. Den nuvarande metoden för att effektivt extrahera information använder övervakade maskininlärningsmodeller som kräver en stor mängd märkta träningsdata. Datamärkningsprocessen är vanligtvis tidskrävande och svår att få i ett domänspecifikt område. Detta projekt utforskar ett tillvägagångssätt för att konstruera en företagsdomänspecifikt ”Knowledge Graph” (KG) som innehåller företagsrelaterade enheter och relationer från SEC 10-K-arkivering genom att kombinera en i förväg tränad allmän NLP med regelbaserade mönster i ”Named Entity Recognition” (NER) och ”Relation Extraction” (RE). Detta tillvägagångssätt eliminerar den tidskrävande datamärkningsuppgiften i det statistiska tillvägagångssättet och genom att utvärdera tio SEC 10-K arkiv har modellen den totala återkallelsen på 53,6 %, precision på 75,7 % och F1-poängen på 62,8 %. Resultatet visar att det är möjligt att extrahera företagsinformation med hybridmetoderna, vilket inte kräver en stor mängd märkta träningsdata. Projektet kräver dock en tidskrävande process för att hitta lexikala mönster från meningar för att extrahera företagsrelaterade enheter och relationer.
|
Page generated in 0.0558 seconds