601 |
An end-to-end adaptation algorithm for best effort video delivery over Internet. / January 1998
by Walter Chi-Woon Fung. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. / Includes bibliographical references (leaves 64-[67]). / Abstract also in Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Background --- p.1 / Chapter 1.2 --- Limitation of Existing Research --- p.3 / Chapter 1.3 --- Contributions of This Thesis --- p.3 / Chapter 1.4 --- Organization of the Thesis --- p.4 / Chapter 2 --- Related Work --- p.5 / Chapter 2.1 --- Ongoing Efforts For The Support of Real Time Applications on the Internet - RTP --- p.5 / Chapter 2.2 --- Using the Algorithm on top of RTP --- p.7 / Chapter 3 --- An Adaptive Video Retrieval Algorithm --- p.9 / Chapter 3.1 --- Lossless Environment --- p.9 / Chapter 3.1.1 --- Adapting the Request Rate to the Available Bandwidth --- p.12 / Chapter 3.2 --- Lossy Environment --- p.17 / Chapter 3.2.1 --- Adapting Ar in Lossy Environment --- p.20 / Chapter 3.3 --- Adjusting the Window Size --- p.24 / Chapter 3.4 --- Measurement Issues --- p.27 / Chapter 3.5 --- Mapping between Data Rate and Frame Rate --- p.28 / Chapter 4 --- Rate Measurement --- p.30 / Chapter 4.1 --- Arrival Rate Estimation --- p.30 / Chapter 4.2 --- Loss Rate Estimation --- p.32 / Chapter 5 --- Frame Skipping and Stuffing --- p.37 / Chapter 5.1 --- MPEG-1 Video Stream Basics --- p.37 / Chapter 5.2 --- Frame Skipping --- p.38 / Chapter 5.3 --- Frame Stuffing In Lossy Environment --- p.40 / Chapter 6 --- Experiment Result and Analysis --- p.43 / Chapter 6.1 --- Experiment --- p.43 / Chapter 6.2 --- Analysis --- p.54 / Chapter 6.2.1 --- Interacting With Streams With No Rate Control --- p.56 / Chapter 6.2.2 --- Multiple Streams Running The Algorithm --- p.58 / Chapter 6.2.3 --- Calculation of p --- p.59 / Chapter 7 --- Conclusions --- p.61 / Bibliography --- p.64
|
602 |
Selective Flooding for Better QoS Routing / Kannan, Gangadharan / 10 May 2000
Quality-of-service (QoS) requirements for the timely delivery of real-time multimedia raise new challenges for the networking world. A key component of QoS is QoS routing which allows the selection of network routes with sufficient resources for requested QoS parameters. Several techniques have been proposed in the literature to compute QoS routes, most of which require dynamic update of link-state information across the Internet. Given the growing size of the Internet, it is becoming increasingly difficult to gather up-to-date state information in a dynamic environment. We propose a new technique to compute QoS routes on the Internet in a fast and efficient manner without any need for dynamic updates. Our method, known as Selective Flooding, checks the state of the links on a set of pre-computed routes from the source to the destination in parallel and based on this information computes the best route and then reserves resources. We implemented Selective Flooding on a QoS routing simulator and evaluated the performance of Selective Flooding compared to source routing for a variety of network parameters. We find Selective Flooding consistently outperforms source routing in terms of call-blocking rate and outperforms source routing in terms of network overhead for some network conditions. The contributions of this thesis include the design of a new QoS routing algorithm, Selective Flooding, extensive evaluation of Selective Flooding under a variety of network conditions and a working simulation model for future research.
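The route-selection step described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `selective_flooding`, `routes`, `available`, and `demand` are hypothetical, "best route" is assumed here to mean the feasible route with the widest bottleneck bandwidth, and the parallel probing of routes is simulated with a simple loop.

```python
def bottleneck(route, available):
    """Smallest available bandwidth along a route (a list of (u, v) links)."""
    return min(available[link] for link in route)


def selective_flooding(routes, available, demand):
    """Probe every pre-computed route, pick the best feasible one, reserve it.

    routes    -- pre-computed candidate routes from source to destination
    available -- dict mapping each link to its currently available bandwidth
    demand    -- bandwidth requested by the call (assumed QoS parameter)
    Returns the chosen route, or None if the call is blocked.
    """
    # In a real network the probes would travel along all routes in parallel;
    # here each route's state is simply read from the shared dictionary.
    probes = [(bottleneck(route, available), route) for route in routes]
    feasible = [(bw, route) for bw, route in probes if bw >= demand]
    if not feasible:
        return None  # call blocked: no candidate route has enough resources
    _, best = max(feasible, key=lambda probe: probe[0])
    for link in best:
        available[link] -= demand  # reserve resources on the chosen route
    return best


available = {("A", "B"): 10, ("B", "D"): 4, ("A", "C"): 6, ("C", "D"): 8}
routes = [[("A", "B"), ("B", "D")], [("A", "C"), ("C", "D")]]
first = selective_flooding(routes, available, demand=5)
second = selective_flooding(routes, available, demand=5)
```

In this toy topology the first request is admitted on the route through C, and the second request is blocked because no remaining route can carry it. No link-state updates are exchanged between requests, which is the property the abstract emphasizes.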
|
603 |
Quantifying Resource Sharing, Resource Isolation and Agility for Web Applications with Virtual Machines / Miller, Elliot A / 27 August 2007
"Resource sharing between applications can significantly reduce the total resources required, which can lower cost and improve performance. Isolating resources, on the other hand, can also be beneficial, as the failure of or significant load on one application does not affect another. There is a delicate balance between resource sharing and resource isolation. Virtual machines may be a solution to this problem, with the added benefit of enabling more dynamic load balancing, but this solution may come at a significant cost in performance. This thesis compares three different configurations for machines running application servers. It looks at the speed at which a new application server can be started up, and at resource sharing and resource isolation between applications, in an attempt to quantify the tradeoffs for each type of configuration."
|
604 |
Multi-layer virtual transport network design and management / Wang, Yuefeng / 13 March 2017
There is an increasing need for a general paradigm that can simplify network management and further enable network innovations. Software Defined Networking (SDN) is an efficient way to make the network programmable and reduce management complexity; however, it is plagued with limitations inherited from the legacy Internet (TCP/IP) architecture. On the other hand, service overlay networks and virtual networks are widely used to overcome deficiencies of the Internet. However, most overlay/virtual networks are single-layered and lack dynamic scope management. Furthermore, how to solve the joint problem of designing and mapping the overlay/virtual network requests for better application and network performance remains an understudied area.
In this thesis, in response to limitations of current SDN management solutions and of the traditional single-layer overlay/virtual network design, we propose a recursive approach to enterprise network management, where network management is done through managing various Virtual Transport Networks (VTNs) over different scopes (i.e., regions of operation). Different from the traditional overlay/virtual network model which mainly focuses on routing/tunneling, our VTN approach provides communication service with explicit Quality-of-Service (QoS) support for applications via transport flows, i.e., it involves all mechanisms (e.g., addressing, routing, error and flow control, resource allocation) needed to meet application requirements. Our approach inherently provides a multi-layer solution for overlay/virtual network design.
The contributions of this thesis are threefold: (1) we propose a novel VTN-based management approach to enterprise network management; (2) we develop a framework for multi-layer VTN design and instantiate it to meet specific application and network goals; and (3) we design and prototype a VTN-based management architecture. Our simulation and experimental results demonstrate the flexibility of our VTN-based management approach and its performance advantages.
|
605 |
On SIP Server Clusters and the Migration to Cloud Computing Platforms / Kim, Jong Yul / January 2016
This thesis looks in depth at telephony server clusters, the modern switchboards at the core of a packet-based telephony service. The most widely used de facto standard protocols for telecommunications are the Session Initiation Protocol (SIP) and the Real-time Transport Protocol (RTP). SIP is a signaling protocol used to establish, maintain, and tear down a communication channel between two or more parties. RTP is a media delivery protocol that allows packets to carry digitized voice, video, or text.
SIP telephony server clusters that provide communications services, such as an emergency calling service, must be scalable and highly available. We evaluate existing commercial and open source telephony server clusters to see how they differ in scalability and high availability.
We also investigate how a scalable SIP server cluster can be built on a cloud computing platform. Elasticity of resources is an attractive property for SIP server clusters because it allows the cluster to grow or shrink organically based on traffic load. However, simply deploying existing clusters to cloud computing platforms is not enough to take full advantage of elasticity. We explore the design and implementation of clusters that scale in real time. The database tier of our cluster was modified to use a scalable key-value store so that both the SIP proxy tier and the database tier can scale separately. Load monitoring and reactive threshold-based scaling logic are presented and evaluated.
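As a rough illustration of what reactive threshold-based scaling can look like, the following Python sketch adds or removes one node when measured utilization crosses a threshold. The function name, the thresholds, and the per-node capacity figure are all assumptions made for this example; the abstract does not state the thesis's actual parameters or metrics.

```python
def scaling_decision(calls_per_second, nodes, capacity_per_node,
                     high=0.8, low=0.3, min_nodes=1):
    """Return the node count to run next, given the currently measured load.

    All parameters are illustrative assumptions:
    calls_per_second  -- load observed by the monitor
    capacity_per_node -- calls per second one proxy is assumed to handle
    high / low        -- utilization thresholds for scaling out / in
    """
    utilization = calls_per_second / (nodes * capacity_per_node)
    if utilization > high:
        return nodes + 1          # scale out: the cluster is running hot
    if utilization < low and nodes > min_nodes:
        return nodes - 1          # scale in: capacity is sitting idle
    return nodes                  # within the band: leave the cluster alone


grow = scaling_decision(900, nodes=2, capacity_per_node=500)
shrink = scaling_decision(100, nodes=2, capacity_per_node=500)
steady = scaling_decision(500, nodes=2, capacity_per_node=500)
floor = scaling_decision(10, nodes=1, capacity_per_node=500)
```

Keeping a gap between the two thresholds is a common way to avoid oscillating between scale-out and scale-in when load hovers near a single cutoff.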
Server clusters also need to reduce processing latency. Otherwise, subscribers experience low quality of service such as delayed call establishment, dropped calls, and inadequate media quality. Cloud computing platforms do not guarantee latency on virtual machines due to resource contention on the same physical host. These extra latencies from resource contention are temporary in nature. Therefore, we propose and evaluate a mechanism that temporarily distributes more incoming calls to responsive SIP proxies, based on measurements of the processing delay in proxies.
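One simple way to send more calls to responsive proxies is to weight each proxy by the inverse of its measured processing delay. The sketch below assumes that weighting rule and hypothetical proxy names; the abstract does not specify the exact distribution mechanism used in the thesis.

```python
def dispatch_weights(delays_ms):
    """Map each proxy to the share of incoming calls it should receive.

    delays_ms -- dict of proxy name to measured processing delay in
                 milliseconds (assumed to be strictly positive).
    Shares are proportional to 1/delay, so faster proxies get more calls.
    """
    inverse = {proxy: 1.0 / delay for proxy, delay in delays_ms.items()}
    total = sum(inverse.values())
    return {proxy: value / total for proxy, value in inverse.items()}


# A proxy that is three times slower receives one third of the share.
weights = dispatch_weights({"proxy-1": 10.0, "proxy-2": 30.0})
```

Because the weights are recomputed from fresh measurements, a proxy slowed by temporary resource contention regains its share once its delay drops again.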
Availability of SIP server clusters is also a challenge on platforms where a node may fail at any time. We investigated how single component failures in a cluster can lead to a complete system outage. We found that for single component failures, simply having redundant components of the same type is enough to mask those failures. However, for client-facing components, smarter clients and DNS resolvers are necessary.
Throughout the thesis, a prototype SIP proxy cluster is re-used, with variations in the architecture or configuration, to demonstrate and address the issues mentioned above. This allows us to tie all of our approaches for different issues into one coherent system that is dynamically scalable, is responsive despite latency variations of virtual machines, and is tolerant of single component failures in cloud platforms.
|
606 |
Improving Content Delivery and Service Discovery in Networks / Srinivasan, Suman Ramkumar / January 2016
Production and consumption of multimedia content on the Internet are rising, fueled by the demand for content from services such as YouTube, Netflix and Facebook video. The Internet is shifting from host-based to content-centric networking. At the same time, users are shifting away from a homogeneous desktop computing environment to using a heterogeneous mix of devices, such as smartphones, tablets and thin clients, all of which allow users to consume data on the move using wireless and cellular data networks.
The popularity of this new class of devices has, in turn, increased demand for multimedia content by mobile users. The emergence of rich Internet applications and the widespread adoption and use of High Definition (HD) video have also placed higher pressure on the service providers and the core Internet backbone, forcing service providers to respond to increased bandwidth use in such networks.
In my thesis, I aim to provide clarity and insight into the usage of core networking protocols and multimedia consumption on both mobile and wireless networks, as well as the network core. I also present research prototypes for potential solutions to some of the problems caused by the increased multimedia consumption on the Internet.
|
607 |
Protection architectures for multi-wavelength optical networks. / January 2004
by Lee Chi Man. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. / Includes bibliographical references (leaves 63-65). / Abstracts in English and Chinese. / Chapter CHAPTER 1 --- INTRODUCTION --- p.5 / Chapter 1.1 --- Background --- p.5 / Chapter 1.1.1 --- Backbone network - Long haul mesh network problem --- p.5 / Chapter 1.1.2 --- Access network - Last mile problems --- p.8 / Chapter 1.1.3 --- Network integration --- p.9 / Chapter 1.2 --- SUMMARY OF INSIGHTS --- p.10 / Chapter 1.3 --- Contribution of this thesis --- p.11 / Chapter 1.4 --- Structure of the thesis --- p.11 / Chapter CHAPTER 2 --- PREVIOUS PROTECTION ARCHITECTURES --- p.12 / Chapter 2.1 --- Introduction --- p.12 / Chapter 2.2 --- Traditional physical protection architectures in metro area --- p.13 / Chapter 2.2.1 --- Self healing ring --- p.17 / Chapter 2.2.2 --- Some terminology in ring protection --- p.13 / Chapter 2.2.3 --- Unidirectional path-switched rings (UPSR) [17] --- p.13 / Chapter 2.2.4 --- Bidirectional line-switched rings (BLSR) [17] --- p.14 / Chapter 2.2.5 --- Ring interconnection and dual homing [17] --- p.16 / Chapter 2.3 --- Traditional physical protection architectures in access networks --- p.17 / Chapter 2.3.1 --- Basic architecture in passive optical networks --- p.17 / Chapter 2.3.2 --- Fault management issue in access networks --- p.18 / Chapter 2.3.3 --- Some protection architectures --- p.18 / Chapter 2.4 --- Recent protection architectures on access networks --- p.21 / Chapter 2.4.1 --- Star-Ring-Bus architecture --- p.21 / Chapter 2.5 --- Concluding remarks --- p.22 / Chapter CHAPTER 3 --- GROUP PROTECTION ARCHITECTURE (GPA) FOR TRAFFIC RESTORATION IN MULTI-WAVELENGTH PASSIVE OPTICAL NETWORKS --- p.23 / Chapter 3.1 --- Background --- p.23 / Chapter 3.2 --- Organization of Chapter 3 --- p.24 / Chapter 3.3 --- Overview of Group Protection Architecture --- p.24 / Chapter 3.3.1 --- Network architecture --- p.24 / Chapter 3.3.2 --- Wavelength assignment --- p.25 /
Chapter 3.3.3 --- Normal operation of the scheme --- p.25 / Chapter 3.3.4 --- Protection mechanism --- p.26 / Chapter 3.4 --- Enhanced GPA architecture --- p.27 / Chapter 3.4.1 --- Network architecture --- p.27 / Chapter 3.4.2 --- Wavelength assignment --- p.28 / Chapter 3.4.3 --- Realization of network elements --- p.28 / Chapter 3.4.3.1 --- Optical line terminal (OLT) --- p.28 / Chapter 3.4.3.2 --- Remote node (RN) --- p.29 / Chapter 3.4.3.3 --- Realization of optical network unit (ONU) --- p.30 / Chapter 3.4.4 --- Protection switching and restoration --- p.31 / Chapter 3.4.5 --- Experimental demonstration --- p.31 / Chapter 3.5 --- Conclusion --- p.33 / Chapter CHAPTER 4 --- A NOVEL CONE PROTECTION ARCHITECTURE (CPA) SCHEME FOR WDM PASSIVE OPTICAL ACCESS NETWORKS --- p.35 / Chapter 4.1 --- Introduction --- p.35 / Chapter 4.2 --- Single-side Cone Protection Architecture (SS-CPA) --- p.36 / Chapter 4.2.1 --- Network topology of SS-CPA --- p.36 / Chapter 4.2.2 --- Wavelength assignment of SS-CPA --- p.36 / Chapter 4.2.3 --- Realization of remote node --- p.37 / Chapter 4.2.4 --- Realization of optical network unit --- p.39 / Chapter 4.2.5 --- Two types of failures --- p.40 / Chapter 4.2.6 --- Protection mechanism against failure --- p.40 / Chapter 4.2.6.1 --- Multi-failures of type I failure --- p.40 / Chapter 4.2.6.2 --- Type II failure --- p.40 / Chapter 4.2.7 --- Experimental demonstration --- p.41 / Chapter 4.2.8 --- Power budget --- p.42 / Chapter 4.2.9 --- Protection capability analysis --- p.42 / Chapter 4.2.10 --- Non-fully-connected case and its extensibility for addition --- p.42 / Chapter 4.2.11 --- Scalability --- p.43 / Chapter 4.2.12 --- Summary --- p.43 / Chapter 4.3 --- Comparison between GPA and SS-CPA scheme --- p.43 / Chapter 4.1 --- Resources comparison --- p.43 / Chapter 4.2 --- Protection capability comparison --- p.44 / Chapter 4.4 --- Concluding remarks --- p.45 / Chapter CHAPTER 5 --- MULTI-WAVELENGTH MULTICAST NETWORK IN PASSIVE OPTICAL NETWORK --- p.46 / Chapter 5.1 --- Introduction --- p.46 / Chapter 5.2 --- Organization of this chapter --- p.47 / Chapter 5.3 --- Simple Group Multicast Network (SGMN) scheme --- p.47 / Chapter 5.3.1 --- Network design principle --- p.47 / Chapter 5.3.2 --- Wavelength assignment of SGMN --- p.48 / Chapter 5.3.3 --- Realization of remote node --- p.49 / Chapter 5.3.3 --- Realization of optical network unit --- p.50 / Chapter 5.3.4 --- Power budget --- p.51 / Chapter 5.4 --- A multi-wavelength access network with reconfigurable multicast …… --- p.51 / Chapter 5.4.1 --- Motivation --- p.51 / Chapter 5.4.2 --- Background --- p.51 / Chapter 5.4.3 --- Network design principle --- p.52 / Chapter 5.4.4 --- Wavelength assignment --- p.52 / Chapter 5.4.5 --- Remote Node design --- p.53 / Chapter 5.4.6 --- Optical network unit design --- p.54 / Chapter 5.4.7 --- Multicast connection pattern --- p.55 / Chapter 5.4.8 --- Multicast group selection in OLT --- p.57 / Chapter 5.4.9 --- Scalability --- p.57 / Chapter 5.4.10 --- Experimental configuration --- p.58 / Chapter 5.4.11 --- Concluding remarks --- p.59 / Chapter CHAPTER 6 --- CONCLUSIONS --- p.60 / LIST OF PUBLICATIONS: --- p.62 / REFERENCES: --- p.63
|
608 |
Defending against low-rate TCP attack: dynamic detection and protection. / January 2005
Sun Haibin. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (leaves 89-96). / Abstracts in English and Chinese. / Abstract --- p.i / Chinese Abstract --- p.iii / Acknowledgement --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 2 --- Background Study and Related Work --- p.5 / Chapter 2.1 --- Victim Exhaustion DoS/DDoS Attacks --- p.6 / Chapter 2.1.1 --- Direct DoS/DDoS Attacks --- p.7 / Chapter 2.1.2 --- Reflector DoS/DDoS Attacks --- p.8 / Chapter 2.1.3 --- Spoofed Packet Filtering --- p.9 / Chapter 2.1.4 --- IP Traceback --- p.13 / Chapter 2.1.5 --- Location Hiding --- p.20 / Chapter 2.2 --- QoS Based DoS Attacks --- p.22 / Chapter 2.2.1 --- Introduction to the QoS Based DoS Attacks --- p.22 / Chapter 2.2.2 --- Countermeasures to the QoS Based DoS Attacks --- p.22 / Chapter 2.3 --- Worm based DoS Attacks --- p.24 / Chapter 2.3.1 --- Introduction to the Worm based DoS Attacks --- p.24 / Chapter 2.3.2 --- Countermeasures to the Worm Based DoS Attacks --- p.24 / Chapter 2.4 --- Low-rate TCP Attack and RoQ Attacks --- p.26 / Chapter 2.4.1 --- General Introduction of Low-rate Attack --- p.26 / Chapter 2.4.2 --- Introduction of RoQ Attack --- p.27 / Chapter 3 --- Formal Description of Low-rate TCP Attacks --- p.28 / Chapter 3.1 --- Mathematical Model of Low-rate TCP Attacks --- p.28 / Chapter 3.2 --- Other forms of Low-rate TCP Attacks --- p.31 / Chapter 4 --- Distributed Detection Mechanism --- p.34 / Chapter 4.1 --- General Consideration of Distributed Detection --- p.34 / Chapter 4.2 --- Design of Low-rate Attack Detection Algorithm --- p.36 / Chapter 4.3 --- Statistical Sampling of Incoming Traffic --- p.37 / Chapter 4.4 --- Noise Filtering --- p.38 / Chapter 4.5 --- Feature Extraction --- p.39 / Chapter 4.6 --- Pattern Matching via the Dynamic Time Warping (DTW) Method --- p.41 / Chapter 4.7 --- Robustness and Accuracy of DTW --- p.45 / Chapter 4.7.1 --- DTW values for low-rate attack: --- p.46 / Chapter 4.7.2 --- DTW values for legitimate traffic (Gaussian): --- p.47 / Chapter 4.7.3 --- DTW values for legitimate traffic (Self-similar): --- p.48 / Chapter 5 --- Low-Rate Attack Defense Mechanism --- p.52 / Chapter 5.1 --- Design of Defense Mechanism --- p.52 / Chapter 5.2 --- Analysis of Deficit Round Robin Algorithm --- p.54 / Chapter 6 --- Fluid Model of TCP Flows --- p.56 / Chapter 6.1 --- Fluid Math. Model of TCP under DRR --- p.56 / Chapter 6.1.1 --- Model of TCP on a Droptail Router --- p.56 / Chapter 6.1.2 --- Model of TCP on a DRR Router --- p.60 / Chapter 6.2 --- Simulation of TCP Fluid Model --- p.62 / Chapter 6.2.1 --- Simulation of Attack with Single TCP Flow --- p.62 / Chapter 6.2.2 --- Simulation of Attack with Multiple TCP flows --- p.64 / Chapter 7 --- Experiments --- p.69 / Chapter 7.1 --- Experiment 1 (Single TCP flow vs. single source attack) --- p.69 / Chapter 7.2 --- Experiment 2 (Multiple TCP flows vs. single source attack) --- p.72 / Chapter 7.3 --- Experiment 3 (Multiple TCP flows vs. synchronized distributed low-rate attack) --- p.74 / Chapter 7.4 --- Experiment 4 (Network model of low-rate attack vs. Multiple TCP flows) --- p.77 / Chapter 8 --- Conclusion --- p.83 / Chapter A --- Lemmas and Theorem Derivation --- p.85 / Bibliography --- p.89
|
609 |
Fairness index in communication networks. / January 2005
Li Fengjun. / Thesis submitted in: July 2004. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (leaves 83-84). / Abstracts in English and Chinese. / Abstract --- p.i / Acknowledgments --- p.v / Table of Contents --- p.vi / List of Figures --- p.viii / List of Tables --- p.ix / Chapter Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Motivations of this work --- p.1 / Chapter 1.2 --- Network Fairness Issue --- p.3 / Chapter 1.3 --- Our Contribution --- p.4 / Chapter 1.4 --- Organization of the Thesis --- p.5 / Chapter Chapter 2 --- Background of Fairness Index --- p.7 / Chapter 2.1 --- The Model --- p.7 / Chapter 2.2 --- Definitions of Fairness Index --- p.9 / Chapter 2.3 --- General Existence and Uniqueness Properties of Perfectly Fair Solution --- p.12 / Chapter 2.4 --- Properties in Specific Network Topologies --- p.16 / Chapter 2.4.1 --- Uniform Routing Networks --- p.16 / Chapter 2.4.2 --- Single Routing Node Networks --- p.20 / Chapter Chapter 3 --- Extension of the Fairness Index --- p.22 / Chapter 3.1 --- A Single Routing Node Network Example --- p.22 / Chapter 3.2 --- The Max-Min Fairness Index --- p.27 / Chapter 3.3 --- Von Neumann Equilibrium Index --- p.29 / Chapter Chapter 4 --- Distributed Low Bit Rate Algorithm --- p.36 / Chapter 4.1 --- Distributed Controller --- p.36 / Chapter 4.2 --- Convergence of the Low Bit Rate Distributed Algorithm --- p.39 / Chapter 4.3 --- Experiment Results --- p.49 / Chapter 4.4 --- Heuristic Iterative Algorithm --- p.53 / Chapter Chapter 5 --- Fairness Index Based Routing --- p.57 / Chapter 5.1 --- Routing Protocol Basics --- p.58 / Chapter 5.1.1 --- Static Routing and Dynamic Routing --- p.58 / Chapter 5.1.2 --- Routing Metrics --- p.59 / Chapter 5.1.3 --- Distance Vector and Link State --- p.60 / Chapter 5.1.4 --- Shortest Path Routing Algorithm --- p.62 / Chapter 5.2 --- Minimum Delay Routing --- p.63 / Chapter 5.3 --- Fairness Index Based Routing --- p.66 / Chapter 
5.3.1 --- Problem Formulation --- p.66 / Chapter 5.3.2 --- Cost Function --- p.69 / Chapter 5.3.3 --- Implementing Fairness Index Based Routing --- p.71 / Chapter 5.3.4 --- Experiment and Analysis --- p.73 / Bibliography --- p.82
|
610 |
Information discovery from semi-structured record sets on the Web. / January 2012
The World Wide Web has developed extensively since it first appeared two decades ago, and the many applications built on the Web have changed people's lives in unprecedented ways. Although the explosive growth and spread of the Web have produced a huge information repository, it remains under-utilized because the heterogeneity of Web content makes automated information extraction (IE) difficult. Thus, Web IE is an essential task in the utilization of Web information.
Typically, a Web page may describe either a single object or a group of similar objects. For example, the description page of a digital camera describes different aspects of the camera. In contrast, the faculty list page of a department presents the information of a group of professors. Corresponding to these two types, Web IE methods can be broadly categorized into two classes, namely, description details oriented extraction and object records oriented extraction. In this thesis, we focus on the latter task, namely semi-structured data record extraction from a single Web page. / In this thesis, we develop two frameworks to tackle the task of data record extraction. We first present a record segmentation search tree framework in which a new search structure, named Record Segmentation Tree (RST), is designed and several efficient search pruning strategies on the RST structure are proposed to identify the records in a given Web page. The subtree groups corresponding to possible data records are dynamically generated in the RST structure during the search process. Therefore, this framework is more flexible than existing methods such as MDR and DEPTA, which generate subtree groups in a static manner. Furthermore, instead of using string edit distance or tree edit distance, we propose a token-based edit distance which takes each DOM node as a basic unit in the cost calculation. / Many existing methods for data record extraction from Web pages, including the RST framework, contain pre-coded hard criteria and adopt an exhaustive search strategy for traversing the DOM tree. They fail to handle many challenging pages containing complicated data records and record regions. In this thesis, we also present another framework, Skoga, which can perform robust detection of different kinds of data records and record regions.
Skoga, composed of a DOM structure knowledge driven detection model and a record segmentation search tree model, can conduct a global analysis on the DOM structure to achieve effective detection. The DOM structure knowledge consists of background knowledge as well as statistical knowledge capturing different characteristics of data records and record regions as exhibited in the DOM structure. Specifically, the background knowledge encodes some logical relations governing certain structural constraints in the DOM structure. The statistical knowledge is represented by some carefully designed features that capture different characteristics of a single node or a node group in the DOM. The feature weights are determined using a development data set via a parameter estimation algorithm based on structured output Support Vector Machine model which can tackle the inter-dependency among the labels on the nodes of the DOM structure. An optimization method based on divide and conquer principle is developed making use of the DOM structure knowledge to quantitatively infer the best record and region recognition. / Finally, we present a framework that can make use of the detected data records to automatically populate existing Wikipedia categories. This framework takes a few existing entities that are automatically collected from a particular Wikipedia category as seed input and explores their attribute infoboxes to obtain clues for the discovery of more entities for this category and the attribute content of the newly discovered entities. One characteristic of this framework is to conduct discovery and extraction from desirable semi-structured data record sets which are automatically collected from the Web. A semi-supervised learning model with Conditional Random Fields is developed to deal with the issues of extraction learning and limited number of labeled examples derived from the seed entities. 
We make use of a proximate record graph to guide the semi-supervised learning process. The graph captures alignment similarity among data records. The semi-supervised learning process can then leverage the benefit of the unlabeled data in the record set by controlling the label regularization under the guidance of the proximate record graph. / Detailed summary in vernacular field only. / Bing, Lidong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 114-123). / Abstract also in Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Web Era and Web IE --- p.1 / Chapter 1.2 --- Semi-structured Record and Region Detection --- p.3 / Chapter 1.2.1 --- Problem Setting --- p.3 / Chapter 1.2.2 --- Observations and Challenges --- p.5 / Chapter 1.2.3 --- Our Proposed First Framework - Record Segmentation Tree --- p.9 / Chapter 1.2.4 --- Our Proposed Second Framework - DOM Structure Knowledge Oriented Global Analysis --- p.10 / Chapter 1.3 --- Entity Expansion and Attribute Acquisition with Semi-structured Data Records --- p.13 / Chapter 1.3.1 --- Problem Setting --- p.13 / Chapter 1.3.2 --- Our Proposed Framework - Semi-supervised CRF Regularized by Proximate Graph --- p.15 / Chapter 1.4 --- Outline of the Thesis --- p.17 / Chapter 2 --- Literature Survey --- p.19 / Chapter 2.1 --- Semi-structured Record Extraction --- p.19 / Chapter 2.2 --- Entity Expansion and Attribute Acquisition --- p.23 / Chapter 3 --- Record Segmentation Tree (RST) Framework --- p.27 / Chapter 3.1 --- Overview --- p.27 / Chapter 3.2 --- Record Segmentation Tree --- p.29 / Chapter 3.2.1 --- Basic Record Segmentation Tree --- p.29 / Chapter 3.2.2 --- Slimmed Segmentation Tree --- p.30 / Chapter 3.2.3 --- Utilize RST in Record Extraction --- p.31 / Chapter 3.3 --- Search Pruning Strategies --- p.33 / Chapter
3.3.1 --- Threshold-Based Top k Search --- p.33 / Chapter 3.3.2 --- Complexity Analysis --- p.35 / Chapter 3.3.3 --- Composite Node Pruning --- p.37 / Chapter 3.3.4 --- More Challenging Record Region Discussion --- p.37 / Chapter 3.4 --- Similarity Measure --- p.41 / Chapter 3.4.1 --- Encoding Subtree with Tokens --- p.42 / Chapter 3.4.2 --- Tandem Repeat Detection and Distance-based Measure --- p.42 / Chapter 4 --- DOM Structure Knowledge Oriented Global Analysis (Skoga) Framework --- p.45 / Chapter 4.1 --- Overview --- p.45 / Chapter 4.2 --- Design of DOM Structure Knowledge --- p.49 / Chapter 4.2.1 --- Background Knowledge --- p.49 / Chapter 4.2.2 --- Statistical Knowledge --- p.51 / Chapter 4.3 --- Finding Optimal Label Assignment --- p.54 / Chapter 4.3.1 --- Inference for Bottom Subtrees --- p.55 / Chapter 4.3.2 --- Recursive Inference for Higher Subtree --- p.57 / Chapter 4.3.3 --- Backtracking for the Optimal Label Assignment --- p.59 / Chapter 4.3.4 --- Second Optimal Label Assignment --- p.60 / Chapter 4.4 --- Statistical Knowledge Acquisition --- p.62 / Chapter 4.4.1 --- Finding Feature Weights via Structured Output SVM Learning --- p.62 / Chapter 4.4.2 --- Region-oriented Loss --- p.63 / Chapter 4.4.3 --- Cost Function Optimization --- p.65 / Chapter 4.5 --- Record Segmentation and Reassembling --- p.66 / Chapter 5 --- Experimental Results of Data Record Extraction --- p.68 / Chapter 5.1 --- Evaluation Data Set --- p.68 / Chapter 5.2 --- Experimental Setup --- p.70 / Chapter 5.3 --- Experimental Results on TBDW --- p.73 / Chapter 5.4 --- Experimental Results on Hybrid Data Set with Nested Region --- p.76 / Chapter 5.5 --- Experimental Results on Hybrid Data Set with Intertwined Region --- p.78 / Chapter 5.6 --- Empirical Case Studies --- p.79 / Chapter 5.6.1 --- Case Study One --- p.80 / Chapter 5.6.2 --- Case Study Two --- p.83 / Chapter 6 --- Semi-supervised CRF Regularized by Proximate Graph --- p.85 / Chapter 6.1 --- Overview --- p.85 / Chapter 6.2 
--- Semi-structured Data Record Set Collection --- p.88 / Chapter 6.3 --- Semi-supervised Learning Model for Extraction --- p.89 / Chapter 6.3.1 --- Proximate Record Graph Construction --- p.91 / Chapter 6.3.2 --- Semi-Markov CRF and Features --- p.94 / Chapter 6.3.3 --- Posterior Regularization --- p.95 / Chapter 6.3.4 --- Inference with Regularized Posterior --- p.97 / Chapter 6.3.5 --- Semi-supervised Training --- p.97 / Chapter 6.3.6 --- Result Ranking --- p.98 / Chapter 6.4 --- Derived Training Example Generation --- p.99 / Chapter 6.5 --- Experiments --- p.100 / Chapter 6.5.1 --- Experiment Setting --- p.100 / Chapter 6.5.2 --- Entity Expansion --- p.103 / Chapter 6.5.3 --- Attribute Extraction --- p.107 / Chapter 7 --- Conclusions and Future Work --- p.110 / Chapter 7.1 --- Conclusions --- p.110 / Chapter 7.2 --- Future Work --- p.112 / Bibliography --- p.113
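The abstract of this thesis mentions a token-based edit distance that treats each DOM node as the basic unit of cost. A minimal Python sketch of that idea follows; it assumes a pre-order flattening of each subtree into tag tokens and unit costs for insertion, deletion, and substitution, none of which are confirmed by the abstract. The nested-tuple tree representation and both function names are hypothetical.

```python
def tokens(node):
    """Flatten a subtree, given as (tag, [children]), into pre-order tag tokens."""
    tag, children = node
    out = [tag]
    for child in children:
        out.extend(tokens(child))
    return out


def token_edit_distance(a, b):
    """Levenshtein distance over token sequences, one DOM node per token."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, 1):
        current = [i]
        for j, token_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                          # delete token_a
                current[j - 1] + 1,                       # insert token_b
                previous[j - 1] + (token_a != token_b),   # substitute or match
            ))
        previous = current
    return previous[-1]


# Two candidate records that differ by a single extra <b> node.
record_1 = ("li", [("a", []), ("span", [])])
record_2 = ("li", [("a", []), ("b", []), ("span", [])])
distance = token_edit_distance(tokens(record_1), tokens(record_2))
same = token_edit_distance(tokens(record_1), tokens(record_1))
```

Working on node tokens rather than raw characters means a long tag name or attribute string costs the same as a short one, so the distance reflects structural differences between candidate records.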
|