Return to search

Information discovery from semi-structured record sets on the Web.

万维网(World Wide Web ,简称Web) 从上世纪九十年代出现以来在深度和广度上都得到了巨大的发展,大量的Web应用前所未有地改变了人们的生活。Web的发展形成了个庞大而有价值的信息资源,然而由于Web 内容异质性给自动信息抽取所造成的困难,这个信息源并没有被充分地利用。因此, Web信息抽取是Web信息应用过程中非常关键的一环。一般情况下,一个网页用来描述一个单独的对象或者一组相似的对象。例如,关于某款数码相机的网页描述了该相机的各方面特征,而一个院系的教授列表则描述了一组教授的基本信息。相应地, Web信息抽取可以分为两大类,即面向单个对象细节的信息抽取和面向组对象记录的信息抽取。本文集中讨论后者,即从单的网页中抽取组半结构化的数据记录。 / 本文提出了两个框架来解决半结构化数据记录的抽取问题。首先介绍一个基于数据记录切分树的框架RST 。该框架中提出了个新的搜索结构即数据记录切分树。基于所设计的搜索策略,数据记录切分树可以有效地从网页中抽取数据记录。在数据记录切分树中,对应于可能的数据记录的DOM子树组是在搜索过程中动态生成的,这使得RST框架比已有的方法更具灵活性。比如在MDR和DEPTA 中, DOM子树组是根据预定义的方式静态生成的,未能考虑当前数据记录区域的特征。另外, RST框架中提出了一个基于"HTML Token" 单元的相似度计算方法。i衷方法可以综合MDR中基于字符串编辑距离的方法之优点和DEPTA 中基于树结构编辑距离的方法之优点。 / 很多解决数据记录抽取问题的已有方法(包括RST框架)都需要预定义若干硬性的条件,并且他们通过遍历DOM树结构来在一个网页中穷举搜索可能存在的数据记录区域。这些方法不能很好地处理大量的含有复杂数据记录结构的网页。因此,本文提出了第二个解决框架Skoga。 Skoga框架由一个DOM结构知识驱动的模型和一个记录切分树模型组成。Skoga框架可以对DOM结构进行全局的分析,进而实现更加有效的、鲁棒的记录识别。DOM结构知识包含DOM 背景知识和DOM统计知识。前者描述DOM结构中的一些逻辑关系,这些关系对DOM 的逻辑结构进行限制。而后者描述一个DOM节点或者一组DOM节点的特点,由一组经过巧妙设计的特征(Feature) 来表示。特征的权重是由参数估计算法在一个开发数据集上学习得到的。基于面向结构化输出的支持向量机( Structuredoutput Support Vector Machine) 模型,本参数估计算法可以很好地处理DOM节点之间的依赖关系。另外,本文提出了一个基于分治策略的优化方法来搜索一个网页的最优化记录识别。 / 最后,本文提出了一个利用半结构化数据记录来进行维基百科类目(Wikipedia Category) 扩充的框架。該框架首先从某个维基百科类目中获取几个已有的实体(Entity) 作为种子,然后利用这些种子及其信息框(Infobox) 中的属性来从Web上发掘更多的同一类目的实体及其属性信息。该框架的一个特点是它利用半结构化的数据记录来进行新实体和属性的抽取,而这些半结构化的数据记录是通过自动的方法从Web上获取的。该框架提出了一个基于条件随机场(Conditional Random Fields) 的半监督学习模型来利用有限的标注样本进行目标信息抽取。这个半监督学习模型定义了一个记录相似关系图来指导学习过程,从而利用大量非标注样本来获得更好的信息抽取效果。 / The World Wide Web has been extensively developed since its first appearance two decades ago. Various applications on theWeb have unprecedentedly changed humans' life. Although the explosive growth and spread of the Web have resulted in a huge information repository, yet it is still under-utilized due to the difficulty in automated information extraction (IE) caused by the heterogeneity of Web content. Thus, Web IE is an essential task in the utilization of Web information. Typically, a Web page may describe either a single object or a group of similar objects. For example, the description page of a digital camera describes different aspects of the camera. On the contrary, the faculty list page of a department presents the information of a group of professors. Corresponding to the above two types, Web IE methods can be broadly categorized into two classes, namely, description details oriented extraction and object records oriented extraction. In this thesis, we focus on the later task, namely semi-structured data record extraction from a single Web page. / In this thesis, we develop two frameworks to tackle the task of data record extraction. We first present a record segmentation search tree framework in which a new search structure, named Record Segmentation Tree (RST), is designed and several efficient search pruning strategies on the RST structure are proposed to identify the records in a given Web page. The subtree groups corresponding to possible data records are dynamically generated in the RST structure during the search process. Therefore, this framework is more exible compared with existing methods such as MDR and DEPTA that have a static manner of generating subtree groups. Furthermore, instead of using string edit distance or tree edit distance, we propose a token-based edit distance which takes each DOM node as a basic unit in the cost calculation. / Many existing methods, including the RST framework, for data record extraction from Web pages contain pre-coded hard criteria and adopt an exhaustive search strategy for traversing the DOM tree. They fail to handle many challenging pages containing complicated data records and record regions. In this thesis, we also present another framework Skoga which can perform robust detection of different kinds of data records and record regions. Skoga, composed of a DOM structure knowledge driven detection model and a record segmentation search tree model, can conduct a global analysis on the DOM structure to achieve effective detection. The DOM structure knowledge consists of background knowledge as well as statistical knowledge capturing different characteristics of data records and record regions as exhibited in the DOM structure. Specifically, the background knowledge encodes some logical relations governing certain structural constraints in the DOM structure. The statistical knowledge is represented by some carefully designed features that capture different characteristics of a single node or a node group in the DOM. The feature weights are determined using a development data set via a parameter estimation algorithm based on structured output Support Vector Machine model which can tackle the inter-dependency among the labels on the nodes of the DOM structure. An optimization method based on divide and conquer principle is developed making use of the DOM structure knowledge to quantitatively infer the best record and region recognition. / Finally, we present a framework that can make use of the detected data records to automatically populate existing Wikipedia categories. This framework takes a few existing entities that are automatically collected from a particular Wikipedia category as seed input and explores their attribute infoboxes to obtain clues for the discovery of more entities for this category and the attribute content of the newly discovered entities. One characteristic of this framework is to conduct discovery and extraction from desirable semi-structured data record sets which are automatically collected from the Web. A semi-supervised learning model with Conditional Random Fields is developed to deal with the issues of extraction learning and limited number of labeled examples derived from the seed entities. We make use of a proximate record graph to guide the semi-supervised leaning process. The graph captures alignment similarity among data records. Then the semisupervised learning process can leverage the benefit of the unlabeled data in the record set by controlling the label regularization under the guidance of the proximate record graph. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Bing, Lidong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 114-123). / Abstract also in Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Web Era and Web IE --- p.1 / Chapter 1.2 --- Semi-structured Record and Region Detection --- p.3 / Chapter 1.2.1 --- Problem Setting --- p.3 / Chapter 1.2.2 --- Observations and Challenges --- p.5 / Chapter 1.2.3 --- Our Proposed First Framework - Record Segmentation Tree --- p.9 / Chapter 1.2.4 --- Our Proposed Second Framework - DOM Structure Knowledge Oriented Global Analysis --- p.10 / Chapter 1.3 --- Entity Expansion and Attribute Acquisition with Semi-structured Data Records --- p.13 / Chapter 1.3.1 --- Problem Setting --- p.13 / Chapter 1.3.2 --- Our Proposed Framework - Semi-supervised CRF Regularized by Proximate Graph --- p.15 / Chapter 1.4 --- Outline of the Thesis --- p.17 / Chapter 2 --- Literature Survey --- p.19 / Chapter 2.1 --- Semi-structured Record Extraction --- p.19 / Chapter 2.2 --- Entity Expansion and Attribute Acquisition --- p.23 / Chapter 3 --- Record Segmentation Tree (RST) Framework --- p.27 / Chapter 3.1 --- Overview --- p.27 / Chapter 3.2 --- Record Segmentation Tree --- p.29 / Chapter 3.2.1 --- Basic Record Segmentation Tree --- p.29 / Chapter 3.2.2 --- Slimmed Segmentation Tree --- p.30 / Chapter 3.2.3 --- Utilize RST in Record Extraction --- p.31 / Chapter 3.3 --- Search Pruning Strategies --- p.33 / Chapter 3.3.1 --- Threshold-Based Top k Search --- p.33 / Chapter 3.3.2 --- Complexity Analysis --- p.35 / Chapter 3.3.3 --- Composite Node Pruning --- p.37 / Chapter 3.3.4 --- More Challenging Record Region Discussion --- p.37 / Chapter 3.4 --- Similarity Measure --- p.41 / Chapter 3.4.1 --- Encoding Subtree with Tokens --- p.42 / Chapter 3.4.2 --- Tandem Repeat Detection and Distance-based Measure --- p.42 / Chapter 4 --- DOM Structure Knowledge Oriented Global Analysis (Skoga) Framework --- p.45 / Chapter 4.1 --- Overview --- p.45 / Chapter 4.2 --- Design of DOM Structure Knowledge --- p.49 / Chapter 4.2.1 --- Background Knowledge --- p.49 / Chapter 4.2.2 --- Statistical Knowledge --- p.51 / Chapter 4.3 --- Finding Optimal Label Assignment --- p.54 / Chapter 4.3.1 --- Inference for Bottom Subtrees --- p.55 / Chapter 4.3.2 --- Recursive Inference for Higher Subtree --- p.57 / Chapter 4.3.3 --- Backtracking for the Optimal Label Assignment --- p.59 / Chapter 4.3.4 --- Second Optimal Label Assignment --- p.60 / Chapter 4.4 --- Statistical Knowledge Acquisition --- p.62 / Chapter 4.4.1 --- Finding Feature Weights via Structured Output SVM Learning --- p.62 / Chapter 4.4.2 --- Region-oriented Loss --- p.63 / Chapter 4.4.3 --- Cost Function Optimization --- p.65 / Chapter 4.5 --- Record Segmentation and Reassembling --- p.66 / Chapter 5 --- Experimental Results of Data Record Extraction --- p.68 / Chapter 5.1 --- Evaluation Data Set --- p.68 / Chapter 5.2 --- Experimental Setup --- p.70 / Chapter 5.3 --- Experimental Results on TBDW --- p.73 / Chapter 5.4 --- Experimental Results on Hybrid Data Set with Nested Region --- p.76 / Chapter 5.5 --- Experimental Results on Hybrid Data Set with Intertwined Region --- p.78 / Chapter 5.6 --- Empirical Case Studies --- p.79 / Chapter 5.6.1 --- Case Study One --- p.80 / Chapter 5.6.2 --- Case Study Two --- p.83 / Chapter 6 --- Semi-supervised CRF Regularized by Proximate Graph --- p.85 / Chapter 6.1 --- Overview --- p.85 / Chapter 6.2 --- Semi-structured Data Record Set Collection --- p.88 / Chapter 6.3 --- Semi-supervised Learning Model for Extraction --- p.89 / Chapter 6.3.1 --- Proximate Record Graph Construction --- p.91 / Chapter 6.3.2 --- Semi-Markov CRF and Features --- p.94 / Chapter 6.3.3 --- Posterior Regularization --- p.95 / Chapter 6.3.4 --- Inference with Regularized Posterior --- p.97 / Chapter 6.3.5 --- Semi-supervised Training --- p.97 / Chapter 6.3.6 --- Result Ranking --- p.98 / Chapter 6.4 --- Derived Training Example Generation --- p.99 / Chapter 6.5 --- Experiments --- p.100 / Chapter 6.5.1 --- Experiment Setting --- p.100 / Chapter 6.5.2 --- Entity Expansion --- p.103 / Chapter 6.5.3 --- Attribute Extraction --- p.107 / Chapter 7 --- Conclusions and Future Work --- p.110 / Chapter 7.1 --- Conclusions --- p.110 / Chapter 7.2 --- Future Work --- p.112 / Bibliography --- p.113

Identiferoai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_328411
Date January 2012
ContributorsBing, Lidong., Chinese University of Hong Kong Graduate School. Division of Systems Engineering and Engineering Management.
Source SetsThe Chinese University of Hong Kong
LanguageEnglish, Chinese
Detected LanguageEnglish
TypeText, bibliography
Formatelectronic resource, electronic resource, remote, 1 online resource (xiv, 123 leaves) : ill. (some col.)
RightsUse of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Page generated in 0.0031 seconds