  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
31

Test Data Extraction and Comparison with Test Data Generation

Raza, Ali 01 August 2011 (has links)
Testing an integrated information system that relies on data from multiple sources can be a challenge, particularly when the data is confidential. This thesis describes a novel test-data extraction approach, called semantic-based test data extraction for integrated systems (iSTDE), that solves many of the problems associated with creating realistic test data for integrated information systems containing confidential data. iSTDE reads a consistent cross-section of data from the production databases, manipulates that data to obscure individual identities while still preserving the overall semantic data characteristics that are critical to thorough system testing, and then moves that test data to an external test environment. The thesis also presents a theoretical study that compares test-data extraction with a competing technique, test-data generation. Specifically, it (a) describes a comparison method built on a comprehensive list of characteristics essential for testing database applications, organized into seven areas, (b) presents an analysis of the relative strengths and weaknesses of the different test-data creation techniques, and (c) reports a number of specific conclusions that will help testers make appropriate choices.
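A minimal sketch of the consistency idea at the heart of this kind of extraction: identifying values are replaced by deterministic surrogates so that joins across the extracted tables still hold. The table names, column names, and hashing scheme below are illustrative assumptions, not details taken from the thesis.

```python
import hashlib

def pseudonym(value: str, salt: str = "test-env") -> str:
    """Map an identifying value to a stable, non-reversible surrogate."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def obscure_rows(rows, identifying_columns):
    """Obscure identities in extracted rows while keeping them consistent:
    the same original value always maps to the same surrogate, so
    foreign-key relationships survive the move to the test environment."""
    for row in rows:
        yield {col: pseudonym(val) if col in identifying_columns else val
               for col, val in row.items()}

# Hypothetical production cross-section
patients = [{"patient_id": "P-1001", "ssn": "123-45-6789", "age": 57}]
visits = [{"visit_id": 9, "patient_id": "P-1001", "diagnosis": "J45"}]

test_patients = list(obscure_rows(patients, {"patient_id", "ssn"}))
test_visits = list(obscure_rows(visits, {"patient_id"}))
# test_visits[0]["patient_id"] == test_patients[0]["patient_id"], so joins still work.
```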
32

Easing information extraction on the web through automated rules discovery

Ortona, Stefano January 2016 (has links)
The advent of the era of big data on the Web has made automatic web information extraction an essential tool in data acquisition processes. Unfortunately, automated solutions are in most cases more error-prone than those created by humans, resulting in dirty and erroneous data. Automatic repair and cleaning of the extracted data is thus a necessary complement to information extraction on the Web. This thesis investigates the problem of inducing cleaning rules on web-extracted data in order to (i) repair and align the data w.r.t. an original target schema, and (ii) produce repairs that are as generic as possible, so that different instances can benefit from them. The problem is addressed from three different angles: replace cross-site redundancy with an ensemble of entity recognisers; produce general repairs that can be encoded in the extraction process; and exploit entity-wide relations to infer common knowledge on extracted data. First, we present ROSeAnn, an unsupervised approach to integrate semantic annotators and produce a unified and consistent annotation layer on top of them. Both the diversity in vocabulary and the widely varying accuracy justify the need for middleware that reconciles different annotator opinions. Treating annotators as "black boxes" that do not require per-domain supervision allows us to recognise semantically related content in web-extracted data in a scalable way. Second, we show in WADaR how annotators can be used to discover rules to repair web-extracted data. We study the problem of computing joint repairs for web data extraction programs and their extracted data, providing an approximate solution that requires no per-source supervision and proves effective across a wide variety of domains and sources. The proposed solution is effective not only in repairing the extracted data, but also in encoding such repairs in the original extraction process. Third, we investigate how relationships among entities can be exploited to discover inconsistencies and additional information. We present RuDiK, a disk-based, scalable solution for discovering first-order logic rules over RDF knowledge bases built from web sources. Our approach does not limit its search space to rules that rely on "positive" relationships between entities, as is the case with traditional constraint mining; on the contrary, it extends the search space to also discover negative rules, i.e., patterns that lead to contradictions in the data.
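A toy illustration of the kind of "negative rule" mentioned above: a first-order pattern whose every match signals a contradiction in the data. The predicates, entities, and the rule itself are invented for the example and are not taken from RuDiK's evaluation.

```python
# Toy RDF-style knowledge base as (subject, predicate, object) triples.
triples = {
    ("anna", "marriedTo", "ben"),
    ("anna", "bornIn", "1990"),
    ("ben", "diedIn", "1985"),   # inconsistent with the marriage above
}

def violations_married_after_death():
    """Negative rule: marriedTo(x, y) AND bornIn(x, bx) AND diedIn(y, dy)
    with bx > dy should never hold; every match is a likely data error."""
    found = []
    for (x, p, y) in triples:
        if p != "marriedTo":
            continue
        bx = next((o for s, q, o in triples if s == x and q == "bornIn"), None)
        dy = next((o for s, q, o in triples if s == y and q == "diedIn"), None)
        if bx is not None and dy is not None and int(bx) > int(dy):
            found.append((x, y))
    return found

print(violations_married_after_death())  # [('anna', 'ben')]
```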
33

Um modelo de pontuação na busca de competências acadêmicas de pesquisadores / A score-based model for assessing academic researchers' competences

Rech, Rodrigo Octavio January 2007 (has links)
This research describes a model for discovering and scoring the academic competences of researchers, based on a combination of quantitative indicators that measure a scientist's academic production. A distinguishing feature of the model is the inclusion of quantitative indicators related to the importance of the researchers' bibliographic production; these indicators allow the production to be evaluated in terms of its repercussion in the academic community and the level of the publication venues. The research also contributes a flexible and extensible architecture specification based on Web data extraction techniques and approximate data matching (through similarity functions). The architecture was implemented in a Web system whose main characteristic is the integration of several open-source technologies. The resulting system lets any researcher evaluate his or her scientific production quantitatively, automating several aspects of the evaluation task, such as obtaining the indicators and integrating the different information sources.
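A hedged sketch of how a score combining quantitative production indicators might look; the indicator names and weights below are illustrative assumptions, not the formula actually proposed in the thesis.

```python
# Illustrative weights: productivity indicators plus indicators for
# repercussion (citations) and venue level, as described in the abstract.
INDICATOR_WEIGHTS = {
    "journal_papers": 3.0,
    "conference_papers": 2.0,
    "citations": 0.5,
    "top_venue_papers": 4.0,
}

def competence_score(indicators: dict) -> float:
    """Combine a researcher's indicators into a single competence score."""
    return sum(INDICATOR_WEIGHTS.get(name, 0.0) * value
               for name, value in indicators.items())

print(competence_score({"journal_papers": 4, "conference_papers": 10,
                        "citations": 120, "top_venue_papers": 2}))  # 100.0
```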
36

Extração não supervisionada de dados da web utilizando abordagem independente de formato / Unsupervised web data extraction using a format-independent approach

Porto, André Luiz Lopes 17 November 2015 (has links)
In this master's thesis we propose a new method for extracting data from data-rich Web pages that uses only the textual content of those pages. Our method, called FIEX (Format Independent Web Data Extraction), is based on information extraction techniques for text segmentation, and can extract data from Web pages where state-of-the-art methods based on data-alignment techniques fail because of the mismatch between the logical structure of the pages and the conceptual structure of the data they represent. Unlike previously proposed methods, FIEX can extract data using only the textual content of a Web page in challenging scenarios such as severe cases of compound textual elements, in which several values of interest are represented by a single HTML element. To extract data from Web pages, FIEX relies on noise elimination through information redundancy and on a text-segmentation method known in the literature as ONDUX (On-Demand Unsupervised Learning for Information Extraction). In our experiments we used several collections of Web pages from different product domains and e-commerce stores, with the goal of extracting data from product descriptions; this type of page was chosen because a large share of its data occurs in severe cases of compound textual elements. The results obtained across these domains and stores support the hypothesis that extraction based only on textual features is possible and effective.
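An illustrative sketch of text-segmentation extraction in the ONDUX style that FIEX builds on: tokens of a compound textual element are labeled by matching them against vocabularies of attribute values learned elsewhere. The product domain, vocabularies, and matching rule are invented for the example.

```python
# Vocabularies of known attribute values (in practice learned from
# redundant occurrences across many pages, not hand-written).
VOCABULARY = {
    "brand": {"samsung", "lg", "apple"},
    "screen": {'50"', '55"', '65"'},
    "tech": {"led", "oled", "qled"},
}

def segment(description: str):
    """Assign each token of a compound description to the attribute whose
    vocabulary contains it; unmatched tokens stay unlabeled."""
    labeled = []
    for token in description.lower().split():
        label = next((attr for attr, values in VOCABULARY.items()
                      if token in values), None)
        labeled.append((token, label))
    return labeled

print(segment('Samsung 55" QLED Smart TV'))
# [('samsung', 'brand'), ('55"', 'screen'), ('qled', 'tech'), ('smart', None), ('tv', None)]
```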
37

Um método probabilístico para o preenchimento automático de formulários Web a partir de textos ricos em dados / A probabilistic method for automatically filling Web forms from data-rich texts

Toda, Guilherme Alves 26 March 2010 (has links)
On today's Web, the most common way for users to interact with data-intensive applications is through form-based interfaces composed of several input fields, such as text boxes, radio buttons, pull-down lists, and check boxes. Although these interfaces are popular and effective, in many cases free-text interfaces are preferred over form-based ones. In this work we present the proposal, implementation, and evaluation of a novel method for automatically filling form-based input interfaces using data-rich text. Our solution takes a data-rich free text as input (e.g., an ad), extracts implicit data values from it, and uses them to fill the appropriate fields. For this task we rely on knowledge obtained from the values of previous submissions to each field, which are freely obtained from normal use of the interface. Our approach, called iForm, exploits features related to the content and the style of these values, which are combined through a Bayesian framework. Extensive experiments show that the approach is feasible and effective, and that it works well even when only a few previous submissions to the input interface are available.
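A minimal sketch of the idea of scoring form fields for a candidate text segment using values from previous submissions. The data, the add-one smoothing, and the restriction to content features alone are simplifications made for the example, not iForm's actual model.

```python
from collections import Counter

# Values previously submitted to each field of a hypothetical car-ad form.
previous_submissions = {
    "brand": ["Ford", "Fiat", "Ford", "Honda"],
    "color": ["black", "red", "black"],
    "city": ["Manaus", "Manaus", "Belem"],
}

field_counts = {field: Counter(v.lower() for v in values)
                for field, values in previous_submissions.items()}

def field_scores(segment: str) -> dict:
    """Score each field for the segment (proportional to P(field | segment)),
    with add-one smoothing over the values previously submitted to it."""
    seg = segment.lower()
    scores = {}
    for field, counts in field_counts.items():
        total, vocab = sum(counts.values()), len(counts)
        scores[field] = (counts[seg] + 1) / (total + vocab)
    return scores

print(max(field_scores("Ford").items(), key=lambda kv: kv[1])[0])  # brand
```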
38

Tabular Information Extraction from Datasheets with Deep Learning for Semantic Modeling

Akkaya, Yakup 22 March 2022 (has links)
The growing popularity of artificial intelligence and machine learning has led many institutions and organizations to adopt an automation vision; many corporations have made it a primary objective to deliver goods and services and to manufacture more efficiently, with minimal human intervention. Automated document processing and analysis is a critical component of this cycle for many organizations in the supply chain, and the massive volume and diversity of data created in this rapidly evolving environment make it a highly desired step. Much of the important information in these documents is provided in tables, so extracting tabular data is a crucial aspect of document processing. This thesis applies deep learning methodologies to detect table structure elements for the extraction of data and preparation for semantic modelling. In order to find an optimal structure definition, we analyzed the performance of deep learning models on different formats, such as row/column and cell. Combined row-and-column detection models perform poorly compared with the other models because rows and columns overlap heavily; separate row and column detection models achieve the best average F1-scores, 78.5% and 79.1%, respectively. However, determining cell elements from the row and column detections for semantic modelling is complicated by spanning rows and columns. Considering these facts, a new way of setting the ground-truth information, called content-focused annotation, is proposed to define table elements better. Our content-focused method handles the ambiguities caused by large white spaces and missing boundary lines in table structures and hence provides higher accuracy. Prior work has addressed the table analysis problem as separate table detection and table structure detection tasks, but the impact of dataset structure on table structure detection has not been investigated. We compare table structure detection performance on cropped and uncropped datasets: the cropped set consists only of table images cropped from documents, assuming tables are detected perfectly, while the uncropped set consists of regular document images. Experiments show that deep learning models can improve detection performance by up to 9% in average precision and average recall on the cropped versions. Furthermore, the impact of cropping is negligible at Intersection over Union (IoU) thresholds of 50%-70%, but beyond 70% IoU the cropped datasets provide significantly higher detection performance.
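For reference, the Intersection over Union (IoU) measure that the 50%-70% thresholds above refer to, computed here for two axis-aligned boxes given as (x1, y1, x2, y2); the example boxes are invented.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) form."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter) if inter else 0.0

predicted_row = (10, 100, 600, 130)
ground_truth_row = (10, 105, 600, 135)
print(round(iou(predicted_row, ground_truth_row), 2))  # 0.71 -> a match at the 70% threshold
```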
39

Invoice Line Item Extraction using Machine Learning SaaS Models

Kadir, Avin January 2022 (has links)
Manual invoice processing is a time-consuming and error-prone task that can be done more efficiently by introducing automation software that minimizes the need for human input. Amazon Textract is a software-as-a-service offering from Amazon Web Services for that purpose: it uses machine learning models to extract data from both general and financial documents, such as receipts and invoices. The service is available in multiple widely spoken languages, but not in Swedish as of the time of writing this thesis. This thesis explores the potential and accuracy of Amazon Textract in extracting data from Swedish invoices using the English setting. Specifically, the accuracy of extracting line items, as well as Swedish letters, is examined, together with the potential for correcting incorrectly extracted data. This is achieved by testing defined categories on each invoice, comparing the Amazon Textract extractions with correctly labeled data. These categories include emptiness (no data was extracted), equality, missing and added line items, and characters added to or removed from otherwise correct line-item strings. The invoices themselves are divided into two categories, structured and semi-structured. The tests are mainly conducted on the service's dedicated API method for data extraction from financial documents, but a comparison with the table extraction API method is also made to gain more insight into Amazon Textract's capabilities. The results suggest that Amazon Textract is quite inaccurate when extracting line-item data from Swedish invoices, so manual post-processing of the data is generally needed to ensure its correctness. However, it showed better results on structured invoices, where it scored 70% in equality and 100% in 2 out of 6 invoice layouts. The Swedish character accuracy was 66%.
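A hedged sketch of the kind of call being evaluated: pulling line items from Amazon Textract's expense-analysis API via boto3. The response field names follow our reading of the AWS documentation and should be verified against the current API; the file name and region are hypothetical.

```python
import boto3

textract = boto3.client("textract", region_name="eu-west-1")

def extract_line_items(invoice_bytes: bytes):
    """Return each detected line item as a dict of field type -> detected text."""
    response = textract.analyze_expense(Document={"Bytes": invoice_bytes})
    items = []
    for doc in response.get("ExpenseDocuments", []):
        for group in doc.get("LineItemGroups", []):
            for line in group.get("LineItems", []):
                item = {}
                for field in line.get("LineItemExpenseFields", []):
                    field_type = field.get("Type", {}).get("Text", "UNKNOWN")
                    item[field_type] = field.get("ValueDetection", {}).get("Text", "")
                items.append(item)
    return items

with open("faktura.png", "rb") as f:       # hypothetical Swedish invoice image
    for item in extract_line_items(f.read()):
        print(item)                        # e.g. {'ITEM': 'Kaffe', 'PRICE': '49,00'}
```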
40

A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System

Wessman, Alan E. 26 January 2005 (has links) (PDF)
Extraction of information from semi-structured or unstructured documents, such as Web pages, is a useful yet complex task. Research has demonstrated that ontologies may be used to achieve a high degree of accuracy in data extraction while maintaining resiliency in the face of document changes. Ontologies do not, however, diminish the complexity of a data-extraction system. As research in the field progresses, the need for a modular data-extraction system that de-couples the various functional processes involved continues to grow. In this thesis we propose a framework for such a system. The nature of the framework allows new algorithms and ideas to be incorporated into a data-extraction system without requiring wholesale rewrites of a large part of the system's source code. It also allows researchers to focus their attention on the parts of the system relevant to their research without having to worry about introducing incompatibilities with the remaining components. We demonstrate the value of the framework by providing an implementation of it, and we show that our implementation achieves extraction accuracy comparable to that of the legacy BYU-Ontos data-extraction system. We also suggest alternate ways in which the framework may be extended and implemented, and we supply documentation on the framework for future use by data-extraction researchers.
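A schematic of what de-coupling the functional processes can look like in practice: each stage of the pipeline is a small interface that can be re-implemented independently. This is not the framework's actual API, only an illustration of the design principle under assumed stage names.

```python
from abc import ABC, abstractmethod

class DocumentRetriever(ABC):
    @abstractmethod
    def retrieve(self, source: str) -> str:
        """Fetch the raw document text (e.g. a Web page) to extract from."""

class ValueRecognizer(ABC):
    @abstractmethod
    def recognize(self, text: str, ontology: dict) -> list:
        """Find candidate values in the text using the ontology's recognizers."""

class OntologyMapper(ABC):
    @abstractmethod
    def map(self, candidates: list, ontology: dict) -> dict:
        """Resolve candidates into instances of the ontology's concepts."""

class ExtractionPipeline:
    """Runs the decoupled stages in order; any stage can be swapped out
    without touching the source code of the others."""
    def __init__(self, retriever, recognizer, mapper):
        self.retriever, self.recognizer, self.mapper = retriever, recognizer, mapper

    def extract(self, source: str, ontology: dict) -> dict:
        text = self.retriever.retrieve(source)
        candidates = self.recognizer.recognize(text, ontology)
        return self.mapper.map(candidates, ontology)
```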
