11

AMBER : a domain-aware template based system for data extraction

Cheng, Wang January 2015 (has links)
The web is the greatest information source in human history, yet finding all offers for flats with gardens in London, Paris, and Berlin, or all restaurants open after a screening of the latest blockbuster, remains a hard task, as that data is not easily amenable to processing. Extracting web data into databases for easier processing has been a resource-intensive process, requiring human supervision for every source from which to extract. This has been changing with approaches that replace human annotators with automated annotations. Such approaches have so far succeeded only in restricted settings such as single-attribute extraction, or in domains with significant redundancy among sources. Multi-attribute objects are typically presented on (i) result pages, where multiple objects appear on a single page as lists, tables, or grids, each with its most important attributes and a summary description, and (ii) detail pages, where each page provides a detailed list of attributes and a long description for a single entity, often in rich format. Both page types have their own advantages: extracting objects from result pages is orders of magnitude faster than from detail pages, and the links to detail pages are often only reachable through result pages, while detail pages carry a complete list of attributes and a full description of the entity. Early web data extraction approaches require manual annotations for each web site to reach high accuracy, while a number of domain-independent approaches focus only on unsupervised segmentation of repeated structure. The former are limited in scale and automation; the latter lack accuracy. Recent automated data extraction systems are often informed with an ontology and a set of object and attribute recognizers; however, they have focused on extracting simple objects with few attributes from single-entity pages and have avoided result pages.

We present AMBER, an automatic ontology-based multi-attribute object extraction system that deals with both result and detail pages, achieves very high accuracy (>96%) with zero site-specific supervision, and solves practical issues that arise in real-life data extraction tasks. AMBER is also an important component of DIADEM, the first automatic full-site extraction system able to extract structured data from different domains without site-specific supervision, which has been tested through a large-scale evaluation (>10,000 sites).

On the result-page side, AMBER achieves high accuracy through a novel domain-aware, path-based template discovery algorithm, and integrates annotations into all parts of the extraction, from identifying the primary list of objects, through segmenting the individual objects, to aligning the attributes. AMBER tolerates significant noise in the annotations by combining them with a novel algorithm for finding regular structures based on XPath expressions that capture regular tree structures. On the detail-page side, AMBER seamlessly integrates boilerplate removal, dynamic-list identification, and page-dissimilarity calculation to identify the data region, then employs a set of fairly simple and cheaply computable features for attribute extraction. Finally, AMBER is the first approach that combines result-page and detail-page extraction, integrating the attributes extracted from result pages with the attributes found on the corresponding detail pages.
AMBER identifies attributes of objects with near-perfect accuracy and extracts dozens of attributes with >96% accuracy across several domains, even in the presence of significant noise. Given an ontology, it outperforms uninformed, automated approaches by a wide margin; even without an ontology, AMBER outperforms most previous systems on record segmentation.
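To make the structural idea concrete, the following is a minimal sketch, in Python, of repeated-structure discovery on a result page. It is not AMBER's algorithm: the `signature` fingerprint, the sibling grouping, and the minimum of three records are all illustrative assumptions.

```python
# A minimal sketch of repeated-structure discovery on a result page:
# group sibling subtrees by a shallow structural signature and treat the
# largest group as the candidate record list. Signature depth and the
# threshold of three records are illustrative assumptions, not AMBER's.
from collections import defaultdict
from lxml import html


def signature(node, depth=2):
    """Shallow structural fingerprint of a subtree: tag names down to `depth`."""
    if depth == 0:
        return node.tag
    return (node.tag, tuple(signature(c, depth - 1) for c in node))


def find_record_list(page_source):
    """Return the sibling group most likely to hold the repeated records."""
    tree = html.fromstring(page_source)
    groups = defaultdict(list)
    for parent in tree.iter():
        for child in parent:
            if isinstance(child.tag, str):  # skip comments and PIs
                groups[(parent, signature(child))].append(child)
    # The largest group of structurally identical siblings is the best guess.
    best = max(groups.values(), key=len, default=[])
    return best if len(best) >= 3 else []
```

In a full pipeline, the returned sibling subtrees would then be segmented into individual objects and their attributes aligned against the domain ontology.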
12

A Green Form-Based Information Extraction System for Historical Documents

Kim, Tae Woo 01 May 2017 (has links)
Many historical documents are rich in genealogical facts. Extracting these facts by hand is tedious and, considering the hundreds of thousands of genealogically rich family-history books currently scanned and online, almost impossible. As one approach to making the extraction feasible, we propose GreenFIE, a "green" form-based information-extraction tool: "green" in the sense that it improves with use, toward the goal of minimizing the cost of human labor while maintaining high extraction accuracy. Given a page in a historical document, the user's task is to fill out the given forms with all the facts on the page that the forms call for (e.g., to collect the birth and death information, marriage information, and parent-child relationships for each person on the page). GreenFIE has a repository of extraction patterns that it applies to fill in forms. A user checks the correctness of GreenFIE's form filling, adds any missed facts, and fixes any mistakes. GreenFIE learns from this feedback, adding new extraction rules to its repository. Ideally, GreenFIE improves as it proceeds, so that it does most of the work, leaving little for the user to do other than confirm that its extraction is correct. We evaluate how well GreenFIE performs on family history books in terms of "greenness": how much human labor diminishes during form filling while accuracy remains high.
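A toy sketch of the "green" loop the abstract describes, under heavy simplifying assumptions: extraction rules are plain regular expressions, a filled form is a dict, and learning from a correction merely turns the text context of the corrected value into a new rule. The field names and the example sentence are invented for illustration; rule induction in the real system is more involved.

```python
# Toy "green" form filler: apply known rules, accept user corrections,
# and derive new rules from the context of each corrected value.
import re


class GreenFormFiller:
    def __init__(self):
        self.rules = {"birth_date": [re.compile(r"b\.\s*(\d{4})")]}

    def fill(self, text):
        """Apply every known rule; keep the first match per form field."""
        form = {}
        for field, patterns in self.rules.items():
            for pat in patterns:
                m = pat.search(text)
                if m:
                    form[field] = m.group(1)
                    break
        return form

    def learn(self, text, field, correct_value):
        """User feedback: turn the context before the value into a new rule."""
        idx = text.find(correct_value)
        if idx == -1:
            return
        prefix = re.escape(text[max(0, idx - 10):idx])
        self.rules.setdefault(field, []).append(re.compile(prefix + r"(\w+)"))


filler = GreenFormFiller()
page = "John Smith, b. 1851, d. 1910, married Mary Jones 1874."
print(filler.fill(page))                  # {'birth_date': '1851'}
filler.learn(page, "death_date", "1910")  # correction adds a new rule
print(filler.fill(page))                  # now also extracts '1910'
```

Each correction shrinks the user's future workload, which is exactly the "greenness" the evaluation measures.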
13

Interactive web crawling and data extraction

Fejfar, Petr January 2018 (has links)
Title: Interactive crawling and data extraction Author: Bc. Petr Fejfar Author's e-mail address: pfejfar@gmail.com Department: Department of Distributed and Dependable Systems Supervisor: Mgr. Pavel Ježek, Ph.D., Department of Distributed and Dependable Systems Abstract: The subject of this thesis is web crawling and data extraction from Rich Internet Applications (RIAs). The thesis starts with an analysis of modern web pages and of the techniques used for crawling and data extraction. Based on this analysis, we designed a tool that crawls RIAs according to instructions defined by the user via a graphical interface. In contrast with other currently popular tools for RIAs, our solution targets users with no programming experience, including business and analyst users. The solution is itself implemented as an RIA, using the WebDriver protocol to automate multiple browsers according to the user-defined instructions. Our tool allows the user to inspect browser sessions by displaying the pages being crawled simultaneously; this feature enables the user to troubleshoot the crawlers. The outcome of this thesis is a fully designed and implemented tool enabling business users to extract data from RIAs. This opens new opportunities for this type of user to collect data from web pages for use...
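The core mechanism, replaying user-defined instructions over the WebDriver protocol, can be sketched with Selenium's Python bindings. The instruction format and the CSS selectors below are assumptions for illustration, not the thesis's actual format.

```python
# Sketch: drive a real browser over the WebDriver protocol and replay
# user-defined instructions (a plain list of open/click/extract steps)
# instead of hand-written crawl code.
from selenium import webdriver
from selenium.webdriver.common.by import By

instructions = [
    {"action": "open",    "target": "https://example.org/catalog"},
    {"action": "click",   "target": "button.load-more"},
    {"action": "extract", "target": "div.product h2"},
]


def run(instructions):
    driver = webdriver.Firefox()  # any WebDriver-capable browser works
    results = []
    try:
        for step in instructions:
            if step["action"] == "open":
                driver.get(step["target"])
            elif step["action"] == "click":
                driver.find_element(By.CSS_SELECTOR, step["target"]).click()
            elif step["action"] == "extract":
                found = driver.find_elements(By.CSS_SELECTOR, step["target"])
                results.extend(el.text for el in found)
    finally:
        driver.quit()
    return results
```

Because the instructions are data rather than code, a graphical interface can build them for users with no programming experience, which is the design choice the thesis argues for.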
14

A supervised learning approach for noise discovery in web pages found in the hidden web

Lutz, João Adolfo Froede January 2013 (has links)
One of the problems of data extraction from web pages is the removal of noise. This task aims at identifying non-informative elements in pages, such as headers, menus, or advertisements. The presence of noise can seriously hinder the performance of search engines and web mining tasks. This work tackles the problem of discovering noise in pages of the hidden web, i.e., the part of the web that is accessible only by filling in web forms. In hidden-web processing, data extraction is usually preceded by a form-filling step, in which the query forms that give access to the hidden pages are automatically or semi-automatically filled. During form filling, relevant data about the queried domain are collected, such as field labels and field values. Our proposal combines this type of data with syntactic information about the nodes that compose the page. We show empirically that this combination achieves better results than an approach based solely on syntactic information.
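A rough sketch of the proposed feature combination: each DOM node is described by syntactic features plus a domain feature measuring overlap with the field labels and values collected during form filling, and a supervised classifier labels nodes as noise or content. The concrete features and the classifier choice below are assumptions, not the thesis's exact setup.

```python
# Sketch: syntactic node features + one domain-overlap feature, fed to a
# supervised classifier that separates noise from informative content.
from sklearn.ensemble import RandomForestClassifier


def node_features(node_text, tag, depth, link_density, domain_terms):
    words = set(node_text.lower().split())
    overlap = len(words & domain_terms) / max(len(words), 1)
    return [
        hash(tag) % 1000,   # crude categorical encoding of the tag name
        depth,              # depth of the node in the DOM tree
        len(node_text),     # amount of text under the node
        link_density,       # fraction of the text inside <a> elements
        overlap,            # share of words seen in form labels/values
    ]


# domain_terms would come from the form-filling step (labels and values).
domain_terms = {"price", "bedrooms", "london", "garden"}
X = [
    node_features("2 bedrooms, garden, London", "div", 5, 0.1, domain_terms),
    node_features("Home | About | Contact | Login", "nav", 2, 0.9, domain_terms),
]
y = [0, 1]  # 0 = informative content, 1 = noise

clf = RandomForestClassifier(n_estimators=50).fit(X, y)
```

The empirical claim of the thesis is precisely that adding the overlap-style domain feature beats a model trained on the syntactic features alone.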
15

A comparison of HTML-aware tools for Web Data extraction

Boronat, Xavier Azagra 20 October 2017 (has links)
Nowadays we live in a world where information is present everywhere in our daily life. In recent years, the amount of information we receive has grown, and the channels through which it is distributed have changed: from conventional newspapers or the radio to mobile phones, digital television, and the Web. In this document we focus on the information found on the Web, a vast and still-developing source of data.
16

Flexible RDF data extraction from Wiktionary - Leveraging the power of community-built linguistic wikis

Brekle, Jonas 26 February 2018 (has links)
We present a declarative approach, implemented in a comprehensive open-source framework (based on DBpedia), to extract lexical-semantic resources (an ontology about language use) from Wiktionary. The data currently includes language, part of speech, senses, definitions, synonyms, taxonomies (hyponyms, hyperonyms, synonyms, antonyms), and translations for each lexical word. The main focus is on flexibility towards the loose schema and configurability towards differing language editions of Wiktionary. This is achieved by a declarative mediator/wrapper approach. The goal is to allow the addition of languages purely by configuration, without the need for programming, thus enabling the swift and resource-conserving adaptation of wrappers by domain experts. The extracted data is as fine-grained as the source data in Wiktionary and additionally follows the lemon model. It enables use cases like disambiguation or machine translation. By offering a linked data service, we hope to extend DBpedia's central role in the LOD infrastructure to the world of Open Linguistics.
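The mediator/wrapper idea can be illustrated with a toy sketch in which a language edition is pure configuration: a mapping from section headings to RDF properties, interpreted by one generic extraction function. The configuration keys, property names, and the drastically simplified wikitext parsing are assumptions for illustration.

```python
# Toy declarative wrapper: per-language extraction is configuration
# (section heading -> RDF property), applied by one generic function.
import re

CONFIG = {
    "en": {"Noun": "lexinfo:noun", "Synonyms": "ontolex:synonym"},
    "de": {"Substantiv": "lexinfo:noun", "Synonyme": "ontolex:synonym"},
}


def extract(wikitext, lang):
    """Emit (property, value) pairs for configured sections of one entry."""
    triples = []
    mapping = CONFIG[lang]
    for heading, body in re.findall(r"==+\s*(\w+)\s*==+\n([^=]*)", wikitext):
        if heading in mapping:
            for line in body.strip().splitlines():
                triples.append((mapping[heading], line.strip("* ").strip()))
    return triples


entry = "===Noun===\nbook\n===Synonyms===\n* volume\n* tome\n"
print(extract(entry, "en"))
# [('lexinfo:noun', 'book'), ('ontolex:synonym', 'volume'), ('ontolex:synonym', 'tome')]
```

Adding a new language edition then means adding a CONFIG entry, not writing code, which is the adaptation-by-domain-experts property the abstract emphasizes.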
17

Expanding The NIF Ecosystem - Corpus Conversion, Parsing And Processing Using The NLP Interchange Format 2.0

Brümmer, Martin 26 February 2018 (has links)
This work presents a thorough examination and expansion of the NIF ecosystem.
18

Need for speed : A study of the speed of forensic disk imaging tools

Stewart, Dawid, Arvidsson, Alex January 2022 (has links)
As our society becomes increasingly digitalized, there is an ever-increasing need for forensic tools to become faster. This paper was written to help the police and other digital forensic investigators choose the fastest disk imaging tool that still maintains the integrity of the imaged disk. To answer this, an experiment comprising 162 disk imaging tests was conducted, with an active imaging and verification time of over 160 hours. The results were analyzed with the help of a scoring system and statistical significance tests. The paper also aimed to show whether there is any difference when imaging disks filled to 100% compared to disks filled to 50%, and which of the disk imaging tools handles each case best. The results of the experiment showed that Guymager was the fastest disk imaging tool among the tested alternatives. They also showed that speed was affected by whether the disks were filled to 50% or 100%. Guymager showed the best performance improvement using the EWF_E01 format, and OSForensics showed the biggest improvement when imaging in the DD format.
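For orientation, the following sketch shows what a single timed run in such an experiment measures: a raw, DD-style copy of a source device followed by a hash-verification pass. The device path, block size, and hash algorithm are assumptions; the study itself used dedicated tools such as Guymager and OSForensics and formats like EWF_E01.

```python
# Sketch of one timed imaging run: raw copy plus hash verification,
# reporting imaging time, verification time, and integrity status.
import hashlib
import time


def image_and_verify(source="/dev/sdb", dest="disk.dd", block=4 * 1024 * 1024):
    src_hash, dst_hash = hashlib.sha256(), hashlib.sha256()
    start = time.monotonic()
    with open(source, "rb") as src, open(dest, "wb") as dst:
        while chunk := src.read(block):
            src_hash.update(chunk)
            dst.write(chunk)
    imaging_time = time.monotonic() - start

    with open(dest, "rb") as f:          # verification pass over the image
        while chunk := f.read(block):
            dst_hash.update(chunk)
    verify_time = time.monotonic() - start - imaging_time

    ok = src_hash.hexdigest() == dst_hash.hexdigest()
    return imaging_time, verify_time, ok
```

Separating the imaging and verification timings mirrors the "active imaging and verification time" the experiment reports, and the hash comparison is what "maintaining integrity" amounts to in practice.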
