Global ETD Search

1	Using text mining to identify crime patterns from Arabic crime news report corpus
Alruily, Meshrif. January 2012
Most text mining techniques have been proposed only for English text, and even there, most research has focused on specific domains such as politics, medicine and crime. In contrast, although Arabic is a widely spoken language, few mining tools have been developed to process Arabic text, and some Arabic domains have not been studied at all. Arabic has a very complex morphology because it is highly inflectional, which makes processing Arabic text difficult. This research studies the crime domain in the Arabic language, exploiting unstructured text using text mining techniques. A system for extracting important information from crime reports would be useful for police investigators, both for accelerating the investigative process (instead of reading entire reports) and for conducting further or wider analyses. We propose the Crime Profiling System (CPS) to extract crime-related information (crime type, crime location and nationality of persons involved in the event), automatically construct dictionaries for the existing information, cluster crime documents based on certain attributes and utilize visualisation techniques to assist in crime data analysis. The proposed information extraction approach is novel: it relies on computational linguistic techniques to identify the abovementioned information, i.e. without using predefined dictionaries (e.g. lists of location names) or an annotated corpus. The language used in crime reporting is studied to identify patterns of interest using a corpus-based approach. Frequency analysis, collocation analysis and concordance analysis are used to perform the syntactic analysis in order to discover the local grammar.
Moreover, the Self Organising Map (SOM) approach is adopted to perform the clustering and visualisation tasks for crime documents based on crime type, location or nationality. This clustering technique is improved because only refined data containing meaningful keywords extracted through the information extraction process are input into it, i.e. the data is cleaned by removing noise. As a result, the quantity of data fed into the SOM is greatly reduced, which saves memory, data loading time and the execution time needed to perform the clustering, and so accelerates the computation of the SOM. Finally, the quantization error is reduced, which leads to high quality clustering. The outcome of the clustering stage is also visualised, and the system is able to provide statistical information in the form of graphs and tables about crimes committed within certain periods of time and within a particular area.
005.1
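The quantization error mentioned in this abstract is commonly defined as the average distance between each input vector and its best-matching SOM unit. A minimal Python sketch of that measure follows; it assumes Euclidean distance and a codebook given as a plain list of weight vectors, and the function name and toy data are illustrative rather than taken from the thesis.

```python
import math


def quantization_error(data, codebook):
    """Mean Euclidean distance from each input vector to its nearest codebook vector."""
    if not data:
        raise ValueError("data must contain at least one vector")
    total = 0.0
    for x in data:
        # The best-matching unit is the codebook vector closest to x.
        total += min(math.dist(x, w) for w in codebook)
    return total / len(data)


# Toy example (hypothetical values): three 2-D inputs, two map units.
data = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
codebook = [(0.0, 0.0), (1.0, 1.0)]
error = quantization_error(data, codebook)
```

A lower value means the map units sit closer to the inputs they represent, which is the sense in which the abstract links reduced quantization error to clustering quality.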
2	Using Freebase, An Automatically Generated Dictionary, And A Classifier To Identify A Person's Profession In Tweets
Hall, Abraham. 01 January 2013
Algorithms for classifying pre-tagged person entities in tweets into one of eight profession categories are presented. A classifier using a semi-supervised learning algorithm that takes into consideration the local context surrounding the entity in the tweet, hashtag information, and topic signature scores is described. In addition to the classifier, this research investigates two dictionaries containing the professions of persons. These two dictionaries are used in their own classification algorithms, which are independent of the classifier. The method for creating the first dictionary dynamically from the web and the algorithm that accesses this dictionary to classify a person into one of the eight profession categories are explained next. The second dictionary is Freebase, an openly available online database that is maintained by its online community. The algorithm that uses Freebase for classifying a person into one of the eight professions is described. The results show that classifications made using the automatically constructed dictionary, Freebase, or the classifier are all moderately successful, and that classifications made with the automatically constructed person dictionary are slightly more accurate than those made using Freebase. Various hybrid methods combining the classifier and the two dictionaries are also explained. The results of those hybrid methods show significant improvement over any of the individual methods.
Twitter named entity recognition classifier freebase Computer Sciences Engineering
3	Undersökande studie inom Information Extraction : Konsten att Klassicera
Torstensson, Erik; Carls, Fredrik. January 2016
This paper is an investigatory report about Information Extraction. The main purpose is to create and evaluate methods within Information Extraction and see how they can help improve the scientific result in classification of text elements. A subtask is to evaluate the existing market for Information Extraction in Sweden. For this task a two-part computer program has been created. The first part is a baseline that uses a simple method, and the second is more advanced and uses tools from the field of Information Extraction. The area we investigate is how often men and women are mentioned in seven different newspapers in Sweden.
The result compares these two methods and evaluates them using scientific measurements of information retrieval performance. The results of the study show similar occurrences of men and women between the baseline and the more advanced method. The exception is that the more advanced method has a higher scientific value. The market for Information Extraction in Sweden is dominated by large corporations owned by the media, which also provide the data for these kinds of companies to analyze. This makes it hard to compete without a new innovative idea.
Information Extraction Named Entity Recognition Java Industrial Management Computer Sciences
4	Artificial intelligence application for feature extraction in annual reports : AI-pipeline for feature extraction in Swedish balance sheets from scanned annual reports
Nilsson, Jesper. January 2024
The persistence of unstructured and physical document management in fields such as financial reporting presents notable inefficiencies.
This thesis addresses the challenge of extracting valuable data from unstructured financial documents, specifically balance sheets in Swedish annual reports, using an AI-driven pipeline. The objective is to develop a method to automate data extraction, enabling enhanced data analysis capabilities. The project focused on automating the extraction of financial items from balance sheets using a combination of Optical Character Recognition (OCR) and a Named Entity Recognition (NER) model. TesseractOCR was used to convert scanned documents into digital text, while a fine-tuned BERT-based NER model was trained to identify and classify relevant financial features. A Python script was employed to extract the numerical values associated with these features. The study found that the NER model achieved high performance, with an F1-score of 0.95, demonstrating its effectiveness in identifying financial entities. The full pipeline successfully extracted over 99% of features from balance sheets with an accuracy of about 90% for numerical data. The project concludes that combining OCR and NER technologies could be a promising solution for automating data extraction from unstructured documents with similar attributes to annual reports. Future work could explore enhancing OCR accuracy and extending the methodology to other sections of different types of unstructured documents.
Artificial intelligence Feature extraction Named Entity Recognition BERT Optical Character Recognition financial documents Software Engineering
5	Entity extraction, animal disease-related event recognition and classification from web
Volkova, Svitlana. January 1900
Master of Science / Department of Computing and Information Sciences / William H. Hsu
Global epidemic surveillance is an essential task for national biosecurity management and bioterrorism prevention. The main goal is to protect the public from major health threats. Performing this task effectively requires reliable, timely and accurate medical information from a wide range of sources. Towards this goal, we present a framework for epidemiological analytics that can be used to automatically extract and visualize infectious disease outbreaks from a variety of unstructured web sources. More precisely, in this thesis, we consider several research tasks including document relevance classification, entity extraction and animal disease-related event recognition in the veterinary epidemiology domain. First, we crawl web sources and classify collected documents by topical relevance using supervised learning algorithms. Next, we propose a novel approach for automated ontology construction in the veterinary medicine domain. Our approach is based on semantic relationship discovery using syntactic patterns. We then apply our automatically constructed ontology to the domain-specific entity extraction task. Moreover, we compare our ontology-based entity extraction results with an alternative sequence labeling approach. We introduce a sequence labeling method for entity tagging that relies on syntactic feature extraction using a sliding window. Finally, we present our novel sentence-based event recognition approach, which includes three main steps: entity extraction of animal diseases, species, locations, dates and the confirmation status n-grams; event-related sentence classification into two categories, suspected or confirmed; and automated event tuple generation and aggregation.
We show that our document relevance classification results, as well as our entity extraction and disease-related event recognition results, are significantly better than the results reported by other animal disease surveillance systems.
entity extraction event recognition and classification web mining document classification named entity recognition Computer Science (0984)
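The sliding-window feature extraction described in this abstract can be illustrated with a short Python sketch. It is only an assumption about the general shape of such features: it uses lowercased word forms as a stand-in for the syntactic features the thesis refers to, and the function name, feature keys, padding symbol and example sentence are all hypothetical.

```python
def window_features(tokens, index, size=2):
    """Collect the tokens in a window of +/- size positions around tokens[index]."""
    features = {}
    for offset in range(-size, size + 1):
        pos = index + offset
        if 0 <= pos < len(tokens):
            value = tokens[pos].lower()
        else:
            # Positions outside the sentence get a padding symbol.
            value = "<PAD>"
        # Keys record the signed offset, e.g. "w[-1]", "w[+0]", "w[+1]".
        features[f"w[{offset:+d}]"] = value
    return features


# Hypothetical sentence; each token would get its own feature dictionary
# before being passed to a sequence labeler.
tokens = ["Anthrax", "confirmed", "in", "Kansas"]
per_token = [window_features(tokens, i, size=1) for i in range(len(tokens))]
```

In a real tagger the dictionary values would typically include part-of-speech tags or other syntactic cues for each window position, not only the word forms shown here.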
6	Logarithmic opinion pools for conditional random fields
Smith, Andrew. January 2007
Since their recent introduction, conditional random fields (CRFs) have been successfully applied to a multitude of structured labelling tasks in many different domains. Examples include natural language processing (NLP), bioinformatics and computer vision. Within NLP itself we have seen many different application areas, like named entity recognition, shallow parsing, information extraction from research papers and language modelling. Most of this work has demonstrated the need, directly or indirectly, to employ some form of regularisation when applying CRFs in order to overcome the tendency for these models to overfit. To date a popular method for regularising CRFs has been to fit a Gaussian prior distribution over the model parameters. In this thesis we explore other methods of CRF regularisation, investigating their properties and comparing their effectiveness. We apply our ideas to sequence labelling problems in NLP, specifically part-of-speech tagging and named entity recognition. We start with an analysis of conventional approaches to CRF regularisation, and investigate possible extensions to such approaches. In particular, we consider choices of prior distribution other than the Gaussian, including the Laplacian and Hyperbolic; we look at the effect of regularising different features separately, to differing degrees, and explore how we may define an appropriate level of regularisation for each feature; we investigate the effect of allowing the mean of a prior distribution to take on non-zero values; and we look at the impact of relaxing the feature expectation constraints satisfied by a standard CRF, leading to a modified CRF model we call the inequality CRF. Our analysis leads to the general conclusion that although there is some capacity for improvement of conventional regularisation through modification and extension, this is quite limited.
Conventional regularisation with a prior is in general hampered by the need to fit a hyperparameter or set of hyperparameters, which can be an expensive process. We then approach the CRF overfitting problem from a different perspective. Specifically, we introduce a form of CRF ensemble called a logarithmic opinion pool (LOP), where CRF distributions are combined under a weighted product. We show how a LOP has theoretical properties which provide a framework for designing new overfitting reduction schemes in terms of diverse models, and demonstrate how such diverse models may be constructed in a number of different ways. Specifically, we show that by constructing CRF models from manually crafted partitions of a feature set and combining them with equal weight under a LOP, we may obtain an ensemble that significantly outperforms a standard CRF trained on the entire feature set, and is competitive in performance with a standard CRF regularised with a Gaussian prior. The great advantage of the LOP approach is that, unlike the Gaussian prior method, it does not require us to search a hyperparameter space. Having demonstrated the success of LOPs in the simple case, we then move on to consider more complex uses of the framework. In particular, we investigate whether it is possible to further improve the LOP ensemble by allowing parameters in different models to interact during training in such a way that diversity between the models is encouraged. Lastly, we show how the LOP approach may be used as a remedy for a problem that standard CRFs can sometimes suffer from. In certain situations, negative effects may be introduced to a CRF by the inclusion of highly discriminative features. An example of this is provided by gazetteer features, which encode a word's presence in a gazetteer. We show how LOPs may be used to reduce these negative effects, and so provide some insight into how gazetteer features may be more effectively handled in CRFs, and log-linear models in general.
005.3
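The weighted product that defines a logarithmic opinion pool can be written as p_LOP(y | x) = (1/Z) * prod_a p_a(y | x)^(w_a), where Z renormalises over all labels y. The Python sketch below illustrates this combination rule on a single labelling decision. It is a simplification under stated assumptions: each expert is a plain dictionary of strictly positive label probabilities rather than a CRF distribution over whole label sequences, and the function name and numbers are illustrative, not taken from the thesis.

```python
import math


def log_opinion_pool(distributions, weights):
    """Combine label distributions under a weighted product, then renormalise."""
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    labels = list(distributions[0])
    # Work in log space: a weighted product becomes a weighted sum of logs.
    # Assumes every probability is strictly positive.
    log_scores = {
        y: sum(w * math.log(p[y]) for p, w in zip(distributions, weights))
        for y in labels
    }
    z = sum(math.exp(s) for s in log_scores.values())
    return {y: math.exp(s) / z for y, s in log_scores.items()}


# Two hypothetical experts with equal weight, as in the equal-weight setting
# the abstract describes.
expert_a = {"PER": 0.8, "O": 0.2}
expert_b = {"PER": 0.5, "O": 0.5}
pooled = log_opinion_pool([expert_a, expert_b], [0.5, 0.5])
```

With equal weights the pooled distribution is the normalised geometric mean of the experts, so a label needs reasonable support from every expert to score highly.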
7	Identificação da cobertura espacial de documentos usando mineração de textos / Identification of spatial coverage documents with mining
Vargas, Rosa Nathalie Portugal. 08 August 2012
Currently, it is common for users to take into account the geographical location of documents in the Information Retrieval process. However, conventional information retrieval systems based on keyword matching do not consider that words can represent geographical entities that are spatially related to other entities in the documents. To solve this problem, it is necessary to enable the geo-referencing of texts by identifying the geographical entities present in the text and associating them with their correct spatial location. The identification and disambiguation of geographical entities present major challenges, mainly from the linguistic point of view, since one toponym can have different types of associated ambiguity. The ambiguity problem causes noise in the process of information retrieval, since the same term may have relevant or irrelevant information associated with it. Thus, the main strategy to overcome these problems includes the identification of evidence to assist in the identification and disambiguation of locations in the texts. This study proposes a methodology, called SpatialCIM, that allows the identification and spatial localization of documents.
The SpatialCIM methodology aims to organize the Toponym Resolution process. The main objective of this study is therefore to evaluate and select disambiguation techniques that resolve toponym ambiguity in texts. To this end, we proposed and developed the approaches of (1) Disambiguation for Points and (2) Textual and Structural Disambiguation. These approaches exploit two different techniques of toponym disambiguation, which generate and disambiguate the geographical paths associated with the toponyms recognized in each document. The hypothesis is that the use of toponym disambiguation techniques enables a better spatial localization of documents. From the results it was possible to demonstrate that the disambiguation techniques improve precision and recall in the spatial classification of documents. The positive effect of using a linguistic tool in the process of recognizing geographical entities was also demonstrated. Thus, the usefulness of the disambiguation process for obtaining the spatial coverage of documents was demonstrated.
Ambiguity problem Named entity recognition Toponym resolution
8	Authorship Attribution Through Words Surrounding Named Entities
Jacovino, Julia Maureen. 03 April 2014
In text analysis, authorship attribution occurs in a variety of ways. The field of computational linguistics becomes more important as the need for authorship attribution and text analysis becomes more widespread. For this research, pre-existing authorship attribution software, the Java Graphical Authorship Attribution Program (JGAAP), implements a named entity recognizer, specifically the Stanford Named Entity Recognizer, to probe texts of similar genre and to aid in identifying the correct author. This research specifically examines the words authors use around named entities in order to test the ability of these words to attribute authorship.
McAnulty College and Graduate School of Liberal Arts / Computational Mathematics / MS / Thesis
9	On Travel Article Classification Based on Consumer Information Search Process Model
Hsiao, Yung-Lin. 27 July 2011
The information overload problem has become pressing with the explosion of information, and people need agents to help them filter information to meet their personal needs. In this work, we study article classification in the tourism domain so as to identify articles that meet users' information needs. We propose an information need orientation model in tourism, which consists of four goals: Initiation, Attraction, Accommodation, and Route planning. These goals can be characterized by 13 features. Some of the identified features can be enhanced by WordNet and Named Entity Recognition as supplementary techniques. To test the effectiveness of using the 13 features for classification and the relevant methods, we collected 15,797 articles from TripAdvisor.com, the world's largest travel site, and randomly selected 600 articles as training data labeled by two labelers. The experimental results show that our approach generally has comparable or better classification performance than an approach using purely lexical features, namely TF-IDF, while using fewer features.
travel article classification text categorization ontology named entity recognition information search model
10	Feature identification framework and applications (FIFA)
Audenaert, Michael Neal. 12 April 2006
Large digital libraries typically contain large collections of heterogeneous resources intended to be delivered to a variety of user communities. One key challenge for these libraries is providing tight integration between resources, both within a single collection and across the several collections of the library, without requiring hand coding. One key tool in doing this is elucidating the internal structure of the digital resources and using that structure to form connections between the resources. The heterogeneous nature of the collections and the diversity of the needs in the user communities complicate this task. Accordingly, in this thesis, I describe an approach to implementing a feature identification system to support digital collections that provides a general framework for applications while allowing decisions about the details of document representation and feature identification to be deferred to domain-specific implementations of that framework. These deferred decisions include details of the semantics and syntax of markup, the types of metadata to be attached to documents, the types of features to be identified, the feature identification algorithms to be applied, and which features should be indexed. This approach results in strong support for the general aspects of developing a feature identification system, allowing future work to focus on the details of applying that system to the specific needs of individual collections and user communities.
humanities informatics humanities computing collection enhancement feature identification named entity recognition

Search results