1

Reducing human effort in web data extraction

Guo, Jinsong. January 2017
The human effort in large-scale web data extraction significantly affects both the flexibility and the economic cost of extraction. Our work aims to reduce the human effort required by web data extraction tasks in three specific scenarios.

(I) The data demand is unclear, and the user has to guide wrapper induction through annotations. To save as much annotation effort as possible, wrappers should be robust, i.e., immune to changes in the webpage, so as to avoid wrapper re-generation, which requires re-annotation. Existing approaches primarily aim at generating accurate wrappers but rarely generate robust ones. We prove that the XPath wrapper induction problem is NP-hard, and propose an approximate solution that estimates a set of top-k robust wrappers in polynomial time. Our method also meets the additional requirement that the induction process be noise resistant, i.e., tolerant of slightly erroneous examples.

(II) The data demand is clear, and user guidance should be avoided, i.e., wrapper generation should be fully unsupervised. Existing unsupervised methods, which rely purely on repeated patterns in HTML structure or visual information, are far from practical. Partially supervised methods, such as the state-of-the-art system DIADEM, work well for tasks involving only a small number of domains, but the human effort spent preparing annotators becomes an increasingly heavy burden as the number of domains grows. We propose a new approach, called RED (an abbreviation for 'redundancy'), which automatically exploits content redundancy between a result page and its corresponding detail pages. RED requires no annotation (and thus no human effort), and its wrapper accuracy is significantly higher than that of previous unsupervised methods.

(III) The data quality is unknown, so the user's related decisions are made blind. Without knowing which types of errors occur in the extracted data, and how many errors of each type, extraction effort can be wasted on useless websites and, even worse, human effort can be wasted on an unnecessary or wrongly targeted data cleaning process. Despite the importance of error estimation, no existing methods address it sufficiently. We focus on two common types of errors in web data, namely duplicates and violations of integrity constraints, and propose a series of error estimation approaches by adapting, extending, and synthesizing recent innovations from diverse areas such as active learning, classifier calibration, F-measure estimation, and interactive training.
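As a rough illustration of the content-redundancy idea behind RED, the sketch below (a toy example of my own, not the thesis's algorithm; the page snippets and helper names are hypothetical) matches text values that occur both in a result-page record and in its linked detail page, treating the shared values as candidate attributes to extract:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)

def text_nodes(html):
    collector = TextCollector()
    collector.feed(html)
    return set(collector.texts)

def redundant_values(record_html, detail_html):
    """Text values shared by a result-page record and its detail page."""
    return text_nodes(record_html) & text_nodes(detail_html)

# Hypothetical snippets: the shared values are likely attributes of the
# record, while text unique to either page is filtered out.
record = "<li><a href='/t/1'>Jinsong Guo</a> <span>2017</span></li>"
detail = "<h1>Jinsong Guo</h1><p><span>Year:</span> <span>2017</span></p>"
print(redundant_values(record, detail))  # {'Jinsong Guo', '2017'}, in some order
```

In practice a system exploiting this redundancy would work over DOM structure and filter boilerplate shared across all pages of a site; exact string intersection is only the simplest possible instantiation of the idea.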
2

Easing information extraction on the web through automated rules discovery

Ortona, Stefano. January 2016
The advent of the era of big data on the Web has made automatic web information extraction an essential tool in data acquisition processes. Unfortunately, automated solutions are in most cases more error prone than those created by humans, resulting in dirty and erroneous data. Automatic repair and cleaning of the extracted data is thus a necessary complement to information extraction on the Web. This thesis investigates the problem of inducing cleaning rules on web-extracted data in order to (i) repair and align the data w.r.t. an original target schema, and (ii) produce repairs that are as generic as possible, so that different instances can benefit from them. The problem is addressed from three different angles: replace cross-site redundancy with an ensemble of entity recognisers; produce general repairs that can be encoded in the extraction process; and exploit entity-wide relations to infer common knowledge on extracted data.

First, we present ROSeAnn, an unsupervised approach to integrating semantic annotators and producing a unified and consistent annotation layer on top of them. Both the diversity in vocabulary and the widely varying accuracy of annotators justify the need for middleware that reconciles their differing opinions. Treating annotators as "black boxes" that require no per-domain supervision allows us to recognise semantically related content in web-extracted data in a scalable way.

Second, we show in WADaR how annotators can be used to discover rules that repair web-extracted data. We study the problem of computing joint repairs for web data extraction programs and their extracted data, providing an approximate solution that requires no per-source supervision and proves effective across a wide variety of domains and sources. The proposed solution is effective not only in repairing the extracted data, but also in encoding such repairs in the original extraction process.

Third, we investigate how relationships among entities can be exploited to discover inconsistencies and additional information. We present RuDiK, a disk-based, scalable solution for discovering first-order logic rules over RDF knowledge bases built from web sources. Our approach does not limit its search space to rules that rely on "positive" relationships between entities, as is the case with traditional mining of constraints; on the contrary, it extends the search space to also discover negative rules, i.e., patterns that lead to contradictions in the data.
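As a rough illustration of the negative-rule idea behind RuDiK, the sketch below (a toy example of my own, not RuDiK's actual algorithm; the triples are hypothetical) flags pairs of predicates whose entity pairs never overlap as candidates for mutual-exclusion rules of the form p1(x, y) => NOT p2(x, y):

```python
from collections import defaultdict
from itertools import combinations

# A hypothetical RDF-style knowledge base as (subject, predicate, object) triples.
triples = [
    ("alice", "spouse", "bob"),
    ("carol", "spouse", "dan"),
    ("alice", "sibling", "carol"),
    ("bob", "sibling", "dan"),
]

# Index the (unordered) entity pairs connected by each predicate.
pairs_by_predicate = defaultdict(set)
for s, p, o in triples:
    pairs_by_predicate[p].add(frozenset((s, o)))

# Predicate pairs whose entity-pair sets are disjoint suggest a negative rule;
# here spouse/sibling never co-occur, hinting at spouse(x, y) => NOT sibling(x, y).
for p1, p2 in combinations(pairs_by_predicate, 2):
    if pairs_by_predicate[p1].isdisjoint(pairs_by_predicate[p2]):
        print(f"candidate: {p1}(x, y) => NOT {p2}(x, y)")
```

A real rule-discovery system would score such candidates by support and confidence over a large knowledge base rather than requiring strict disjointness, since web-extracted data is itself noisy.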
