Global ETD Search

Extrakce strukturovaných dat z českého webu s využitím extrakčních ontologií / Extracting Structured Data from Czech Web Using Extraction Ontologies

This thesis addresses automatic information extraction from HTML documents in two selected domains: laptop offers are extracted from e-shops, and freely published job offers are extracted from company sites. The extraction process outputs fine-grained structured data grouped into data records, in which a corresponding semantic label is assigned to each data item. The task was performed using the extraction system Ex, which combines two approaches: manually written rules and supervised machine learning algorithms. Expert knowledge in the form of extraction rules made it possible to overcome the lack of training data. The rules are independent of any specific formatting structure, so a single extraction model can be used for a heterogeneous set of documents. The results achieved for laptop offers showed that an extraction ontology describing one or a few product types could be combined with wrapper induction methods to automatically extract offers for all product types at web scale with minimal human effort.

http://www.nusl.cz/ntk/nusl-124547

Identifier	oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:124547
Date	January 2012
Creators	Pouzar, Aleš
Contributors	Svátek, Vojtěch, Labský, Martin
Publisher	Vysoká škola ekonomická v Praze
Source Sets	Czech ETDs
Language	Czech
Detected Language	English
Type	info:eu-repo/semantics/masterThesis
Rights	info:eu-repo/semantics/restrictedAccess
