1

Using text mining to identify crime patterns from Arabic crime news report corpus

Alruily, Meshrif January 2012
Most text mining techniques have been proposed only for English text, and even here, most research has been conducted on specific texts related to special contexts within the English language, such as politics, medicine and crime. In contrast, although Arabic is a widely spoken language, few mining tools have been developed to process Arabic text, and some Arabic domains have not been studied at all. Arabic has a very complex morphology because it is highly inflectional, and processing Arabic text is therefore difficult. This research studies the crime domain in the Arabic language, exploiting unstructured text using text mining techniques. A system for extracting important information from crime reports would help police investigators accelerate the investigative process (instead of reading entire reports) and conduct further or wider analyses. We propose the Crime Profiling System (CPS) to extract crime-related information (crime type, crime location and nationality of persons involved in the event), automatically construct dictionaries for the extracted information, cluster crime documents based on certain attributes and utilize visualisation techniques to assist in crime data analysis. The proposed information extraction approach is novel: it relies on computational linguistic techniques to identify the abovementioned information, i.e. without using predefined dictionaries (e.g. lists of location names) or an annotated corpus. The language used in crime reporting is studied to identify patterns of interest using a corpus-based approach. Frequency analysis, collocation analysis and concordance analysis are used to perform the syntactic analysis in order to discover the local grammar. Moreover, the Self Organising Map (SOM) approach is adopted to perform the clustering and visualisation tasks for crime documents based on crime type, location or nationality.
This clustering technique is improved because only refined data containing meaningful keywords extracted through the information extraction process are fed into it, i.e. the data is cleaned by removing noise. As a result, the quantity of data fed into the SOM is greatly reduced, which saves memory, data loading time and the execution time needed to perform the clustering. Therefore, the computation of the SOM is accelerated. Finally, the quantization error is reduced, which leads to high quality clustering. The outcome of the clustering stage is also visualised, and the system is able to provide statistical information in the form of graphs and tables about crimes committed within certain periods of time and within a particular area.
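The abstract does not define the quantization error it reports. A common definition for a SOM is the mean distance between each input vector and its best-matching unit (the closest codebook vector). The sketch below assumes that definition and Euclidean distance; the function name and the toy vectors are hypothetical and are not taken from the thesis.

```python
import math


def quantization_error(vectors, codebook):
    """Mean Euclidean distance from each input vector to its best-matching unit.

    Assumption: `codebook` holds the weight vectors of the trained SOM units,
    and every vector has the same dimensionality.
    """
    total = 0.0
    for vector in vectors:
        # The best-matching unit is the codebook vector closest to the input.
        total += min(math.dist(vector, unit) for unit in codebook)
    return total / len(vectors)


# Toy data: the first input sits exactly on a unit, the second is 3.0 away.
inputs = [(0.0, 0.0), (3.0, 4.0)]
units = [(0.0, 0.0), (0.0, 4.0)]
print(quantization_error(inputs, units))
```

A lower value means the map units represent the inputs more closely, which is consistent with the abstract's claim that feeding only cleaned keywords into the SOM improves clustering quality.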
2

Benoemde-entiteitherkenning vir Afrikaans / G.D. Matthew

Matthew, Gordon Derrac January 2013
According to the Constitution of South Africa, the government is required to make all information available to the public in the ten indigenous languages of South Africa (excluding English). For this reason, the government made the information that already existed for these ten languages available to the public, and an effort is also being made to increase the amount of resources available in these languages (Groenewald & Du Plooy, 2010). This release of information further helps to implement Krauwer's (2003) idea that there is an inventory of the minimal number of language-related resources required for a language to be competitive at the level of research and teaching. This inventory is known as the "Basic Language Resource Kit" (BLARK). Since most of the languages in South Africa are resource scarce, it is in the best interest of the cultural growth of the country that each of the indigenous South African languages develops its own BLARK. In Chapter 1, the need for the development of an implementable named entity recogniser (NER) for Afrikaans is discussed by first referring to the language policy of the Constitution of South Africa (Republic of South Africa, 2003). Secondly, the guidelines of BLARK (Krauwer, 2003) are discussed, followed by a discussion of an audit that focuses on the number of resources and the distribution of human language technology for all eleven South African languages (Sharma Grover, Van Huyssteen & Pretorius, 2010). The audit conducted by Sharma Grover et al. (2010) established that there is a shortage of text-based tools for Afrikaans. This study addresses this need for text-based tools by developing a NER for Afrikaans. In Chapter 2 a description is given of what an entity and a named entity are.
Later in the chapter the process of technology recycling is explained by referring to other studies where the idea of technology recycling has been applied successfully (Rayner et al., 1997). Lastly, an analysis is done of the differences that may occur between Afrikaans and Dutch named entities. These differences are divided into three categories, namely: identical cognates, non-identical cognates and unrelated entities. Chapter 3 begins with a description of Frog (van den Bosch et al., 2007), the Dutch NER used in this study, and the functions and operation of its NER component. This is followed by a description of the Afrikaans-to-Dutch converter (A2DC) (Van Huyssteen & Pilon, 2009), and finally the various experiments that were completed are explained. The study consists of six experiments, the first of which was to determine the results of Frog on Dutch data. The second experiment evaluated the effectiveness of Frog on unchanged (raw) Afrikaans data. The following two experiments evaluated the results of Frog on "Dutched" Afrikaans data. The last two experiments evaluated the effectiveness of Frog on raw and "Dutched" Afrikaans data with the addition of gazetteers as part of the pre-processing step. In conclusion, a summary is given of the comparisons between the NER for Afrikaans that was developed in this study and the NER component that Puttkammer (2006) used in his tokeniser. Finally, a few suggestions for future research are proposed. / MA (Applied Language and Literary Studies), North-West University, Vaal Triangle Campus, 2013
4

Using Freebase, An Automatically Generated Dictionary, And A Classifier To Identify A Person's Profession In Tweets

Hall, Abraham 01 January 2013
Algorithms for classifying pre-tagged person entities in tweets into one of eight profession categories are presented. A classifier using a semi-supervised learning algorithm that takes into consideration the local context surrounding the entity in the tweet, hashtag information, and topic signature scores is described. In addition to the classifier, this research investigates two dictionaries containing the professions of persons. These two dictionaries are used in their own classification algorithms, which are independent of the classifier. The method for creating the first dictionary dynamically from the web and the algorithm that accesses this dictionary to classify a person into one of the eight profession categories are explained next. The second dictionary is Freebase, an openly available online database that is maintained by its online community. The algorithm that uses Freebase for classifying a person into one of the eight professions is described. The results show that classifications made using the automatically constructed dictionary, Freebase, or the classifier are all moderately successful. The results also show that classifications made with the automatically constructed person dictionary are slightly more accurate than classifications made using Freebase. Various hybrid methods combining the classifier and the two dictionaries are also explained. The results of those hybrid methods show significant improvement over any of the individual methods.
5

Undersökande studie inom Information Extraction : Konsten att Klassicera

Torstensson, Erik, Carls, Fredrik January 2016
This paper is an investigatory report about Information Extraction. The main purpose is to create and evaluate methods within Information Extraction and see how they can help improve the scientific result in classification of text elements. A subtask is to evaluate the existing market for Information Extraction in Sweden.
For this task a two-part computer program has been created. The first part is a baseline with a simple method, and the second is more advanced and uses tools from the field of Information Extraction. The question we investigate is how often men and women are mentioned in seven different newspapers in Sweden. The result compares these two methods and evaluates them using scientific measurements of information retrieval performance.
The results of the study show similar occurrences of men and women between the baseline and the more advanced method. The exception is that the more advanced method has a higher scientific value. The market for Information Extraction in Sweden is dominated by large corporations owned by the media, which also provide the data for these kinds of companies to analyze. This makes it hard to compete without a new innovative idea.
6

Named Entity Recognition för Klassificering av Rubriker i Fakturor / Classification of Invoice Headers using Named Entity Recognition

Karlsson, Ludvig, Gyllström, Benjamin January 2021
Invoices are an important source of information for businesses. Two examples of important fields in an invoice are the amount of money to be paid and the invoice ID.
Due to the different formats and content of invoices, the extraction of information from them is often a manual and time-consuming process. In order to save important information from semi-structured documents such as invoices, some companies have to put in a lot of manual work. This work includes understanding the invoice and then knowing what content is of interest to the company. It can take a lot of time, and therefore an automation of this process would be of great interest. In this research named entity recognition is used to solve this problem. The questions this research answers are: how effective named entity recognition is for classification of headers in invoices, and how much the effectiveness can be improved by adding further components. Named entity recognition is used to categorize entities. In this case the entities are the headings of fields in the invoice. The model that is created must determine whether headings in the invoice can be categorized under one of the following categories: invoice number, invoice date, due date, customer number, total amount, VAT code, VAT amount or currency. This research only attempts a proof of concept to discover whether this algorithm can be used to reduce the time spent on manual work. The production model that is created is evaluated with the F1 score. With this method, it gets a result of 79 out of 100. This result indicates that named entity recognition can be used by companies in real-world scenarios to identify headings of interest in invoices. But to get the best results possible, the model should also be combined with a solution that identifies fields using their corresponding data.
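For readers unfamiliar with the F1 score used in the evaluation above, it is the harmonic mean of precision and recall. The sketch below is a minimal illustration; the function name and the counts are hypothetical and are not the thesis's data. The reported "79 out of 100" would presumably correspond to about 0.79 on the 0 to 1 scale used here.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # With no correct predictions, precision or recall is zero (or undefined),
    # so F1 is conventionally reported as 0.
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 headers labelled correctly, 2 spurious, 2 missed.
print(round(f1_score(8, 2, 2), 2))
```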
7

Entity extraction, animal disease-related event recognition and classification from web

Volkova, Svitlana January 1900
Master of Science / Department of Computing and Information Sciences / William H. Hsu / Global epidemic surveillance is an essential task for national biosecurity management and bioterrorism prevention. The main goal is to protect the public from major health threats. To perform this task effectively one requires reliable, timely and accurate medical information from a wide range of sources. Towards this goal, we present a framework for epidemiological analytics that can be used to automatically extract and visualize infectious disease outbreaks from a variety of unstructured web sources. More precisely, in this thesis, we consider several research tasks including document relevance classification, entity extraction and animal disease-related event recognition in the veterinary epidemiology domain. First, we crawl web sources and classify collected documents by topical relevance using supervised learning algorithms. Next, we propose a novel approach for automated ontology construction in the veterinary medicine domain. Our approach is based on semantic relationship discovery using syntactic patterns. We then apply our automatically constructed ontology to the domain-specific entity extraction task. Moreover, we compare our ontology-based entity extraction results with an alternative sequence labeling approach. We introduce a sequence labeling method for entity tagging that relies on syntactic feature extraction using a sliding window. Finally, we present our novel sentence-based event recognition approach that includes three main steps: entity extraction of animal diseases, species, locations, dates and confirmation status n-grams; event-related sentence classification into two categories, suspected or confirmed; and automated event tuple generation and aggregation.
We show that our document relevance classification results as well as entity extraction and disease-related event recognition results are significantly better compared to the results reported by other animal disease surveillance systems.
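The sliding-window feature extraction mentioned above can be illustrated with a short sketch. The abstract does not list the thesis's actual features, so the ones below (lower-cased neighbouring word forms and a capitalization flag) are stand-in assumptions, as are the function name, the padding symbol and the example sentence.

```python
def window_features(tokens, index, size=2):
    """Build a feature dict for tokens[index] from a window of +/- `size` tokens.

    Assumption: word forms stand in for the syntactic features a real
    sequence labeler would use (e.g. part-of-speech tags).
    """
    features = {}
    for offset in range(-size, size + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            value = tokens[position].lower()
        else:
            value = "<PAD>"  # window runs past the sentence boundary
        features[f"word[{offset:+d}]"] = value
    features["is_capitalized"] = tokens[index][:1].isupper()
    return features


sentence = ["Anthrax", "confirmed", "in", "Kansas"]
print(window_features(sentence, 3, size=1))
```

Each token's feature dict would then be passed, together with its label, to whichever sequence labeling model is being trained.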
8

Logarithmic opinion pools for conditional random fields

Smith, Andrew January 2007
Since their recent introduction, conditional random fields (CRFs) have been successfully applied to a multitude of structured labelling tasks in many different domains. Examples include natural language processing (NLP), bioinformatics and computer vision. Within NLP itself we have seen many different application areas, like named entity recognition, shallow parsing, information extraction from research papers and language modelling. Most of this work has demonstrated the need, directly or indirectly, to employ some form of regularisation when applying CRFs in order to overcome the tendency for these models to overfit. To date a popular method for regularising CRFs has been to fit a Gaussian prior distribution over the model parameters. In this thesis we explore other methods of CRF regularisation, investigating their properties and comparing their effectiveness. We apply our ideas to sequence labelling problems in NLP, specifically part-of-speech tagging and named entity recognition. We start with an analysis of conventional approaches to CRF regularisation, and investigate possible extensions to such approaches. In particular, we consider choices of prior distribution other than the Gaussian, including the Laplacian and Hyperbolic; we look at the effect of regularising different features separately, to differing degrees, and explore how we may define an appropriate level of regularisation for each feature; we investigate the effect of allowing the mean of a prior distribution to take on non-zero values; and we look at the impact of relaxing the feature expectation constraints satisfied by a standard CRF, leading to a modified CRF model we call the inequality CRF. Our analysis leads to the general conclusion that although there is some capacity for improvement of conventional regularisation through modification and extension, this is quite limited. 
Conventional regularisation with a prior is in general hampered by the need to fit a hyperparameter or set of hyperparameters, which can be an expensive process. We then approach the CRF overfitting problem from a different perspective. Specifically, we introduce a form of CRF ensemble called a logarithmic opinion pool (LOP), where CRF distributions are combined under a weighted product. We show how a LOP has theoretical properties which provide a framework for designing new overfitting reduction schemes in terms of diverse models, and demonstrate how such diverse models may be constructed in a number of different ways. Specifically, we show that by constructing CRF models from manually crafted partitions of a feature set and combining them with equal weight under a LOP, we may obtain an ensemble that significantly outperforms a standard CRF trained on the entire feature set, and is competitive in performance with a standard CRF regularised with a Gaussian prior. The great advantage of the LOP approach is that, unlike the Gaussian prior method, it does not require us to search a hyperparameter space. Having demonstrated the success of LOPs in the simple case, we then move on to consider more complex uses of the framework. In particular, we investigate whether it is possible to further improve the LOP ensemble by allowing parameters in different models to interact during training in such a way that diversity between the models is encouraged. Lastly, we show how the LOP approach may be used as a remedy for a problem that standard CRFs can sometimes suffer. In certain situations, negative effects may be introduced to a CRF by the inclusion of highly discriminative features. An example of this is provided by gazetteer features, which encode a word's presence in a gazetteer. We show how LOPs may be used to reduce these negative effects, and so provide some insight into how gazetteer features may be more effectively handled in CRFs, and log-linear models in general.
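The "weighted product" that defines a logarithmic opinion pool can be sketched as follows: each model's probability is raised to its weight, the results are multiplied, and the product is renormalised. A real CRF LOP combines distributions over whole label sequences; for illustration only, the sketch below pools simple per-label distributions. The function name, labels and probabilities are hypothetical.

```python
import math


def logarithmic_opinion_pool(distributions, weights):
    """Combine distributions as a normalised weighted product.

    Assumptions: every distribution is a dict over the same labels, all
    probabilities are strictly positive, and the weights sum to 1.
    """
    labels = list(distributions[0])
    # A weighted product of probabilities is a weighted sum of log-probabilities.
    log_scores = {
        label: sum(w * math.log(dist[label]) for dist, w in zip(distributions, weights))
        for label in labels
    }
    normaliser = sum(math.exp(score) for score in log_scores.values())
    return {label: math.exp(score) / normaliser for label, score in log_scores.items()}


# Two hypothetical models combined with equal weight.
model_a = {"PER": 0.8, "O": 0.2}
model_b = {"PER": 0.5, "O": 0.5}
print(logarithmic_opinion_pool([model_a, model_b], [0.5, 0.5]))
```

With equal weights, as in the experiments the abstract describes, no weight needs to be tuned, which matches the stated advantage of avoiding a hyperparameter search.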
9

Identificação da cobertura espacial de documentos usando mineração de textos / Identification of spatial coverage documents with mining

Vargas, Rosa Nathalie Portugal 08 August 2012
Currently, users commonly take into account the geographical location of documents, that is, the geographical scope addressed in the context of the document, in the Information Retrieval process. However, conventional information retrieval systems based on keyword matching do not consider that words can represent geographical entities that are spatially related to other entities in the documents. To solve this problem, it is necessary to enable the geo-referencing of texts by identifying the geographical entities present in the text and associating them with their correct spatial location. The identification and disambiguation of geographical entities present major challenges, mainly from the linguistic point of view, since one toponym can have different types of associated ambiguity. This ambiguity problem causes noise in the information retrieval process, since the same term may have relevant or irrelevant information associated with it. Thus, the main strategy to overcome these problems is to identify evidence that assists in the identification and disambiguation of locations in the texts. This study proposes a methodology, named SpatialCIM, that identifies and determines the spatial coverage of documents. The SpatialCIM methodology aims to organize the toponym resolution process.
The main objective of this study is therefore to evaluate and select disambiguation techniques that resolve toponym ambiguity in texts. To this end, we proposed and developed two approaches: (1) Disambiguation by Points and (2) Textual and Structural Disambiguation. These approaches exploit two different techniques of toponym disambiguation, which generate and disambiguate the geographical paths associated with the toponyms recognized in each document. The hypothesis of this research is that the use of toponym disambiguation techniques enables a better spatial localization of documents. From the results it was possible to demonstrate that the disambiguation techniques improve precision and recall in the spatial classification of documents. The positive effect of using a linguistic tool in the process of recognizing geographical entities was also demonstrated. Thus, the usefulness of the disambiguation process for obtaining the spatial coverage of documents was demonstrated.
10

Authorship Attribution Through Words Surrounding Named Entities

Jacovino, Julia Maureen 03 April 2014
In text analysis, authorship attribution occurs in a variety of ways. The field of computational linguistics becomes more important as the need for authorship attribution and text analysis becomes more widespread. For this research, pre-existing authorship attribution software, the Java Graphical Authorship Attribution Program (JGAAP), implements a named entity recognizer, specifically the Stanford Named Entity Recognizer, to probe texts of similar genre and to aid in identifying the correct author. This research specifically examines the words authors use around named entities in order to test the ability of these words to attribute authorship. / McAnulty College and Graduate School of Liberal Arts / Computational Mathematics / MS / Thesis
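As a rough illustration of the idea of profiling authors by the words they place around named entities, the sketch below counts the words that fall within a fixed window on either side of each entity. It is not JGAAP or Stanford NER code: the entity spans are assumed to have been produced already by some recognizer, and the function name, window size and example sentence are hypothetical.

```python
from collections import Counter


def words_around_entities(tokens, entity_spans, window=2):
    """Count words within `window` tokens before or after each entity.

    Assumption: `entity_spans` is a list of (start, end) token indices with
    `end` exclusive, as a named entity recognizer might produce.
    """
    counts = Counter()
    for start, end in entity_spans:
        left = tokens[max(0, start - window):start]
        right = tokens[end:end + window]
        counts.update(word.lower() for word in left + right)
    return counts


tokens = ["the", "honourable", "John", "Smith", "said", "that", "Mary", "Jones", "left"]
spans = [(2, 4), (6, 8)]  # "John Smith" and "Mary Jones"
print(words_around_entities(tokens, spans, window=2).most_common(2))
```

The resulting frequency profile could then be compared across candidate authors by whatever distance or classification method the attribution software provides.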
