The thesis treats automatic extraction of semantic data from Web pages. Within this broad problem, it focuses on finding values of data figures within the page presenting certain entity (e.g. price of a laptop). The main idea we wanted to evaluate is that a figure can be found using its context in the page: the words that surround it and values of the attributes of the containing HTML tags, class attribute in particular. Our research revealed there are two types of contemporary solutions of this problem: either the author of the Web page must inline semantic information inside the markup of the page or there are commercial tools that can be trained to parse a particular page format (targetting pages from a single Web domain). We examined the possibilities of developing a general solution that would - for given entity - find its properties across the Web domains using text analysis and machine learning. The naïve algorithm had about 30% accuracy, the lear- ning algorithms had the accuracy between 40 and 50% in finding the properties. Despite the accuracy is not acceptable for a final solution, we believe it confirms the potential of the idea. Keywords: Web pages data extraction 1
Identifer | oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:347948 |
Date | January 2016 |
Creators | Janata, Dominik |
Contributors | Vojtáš, Peter, Nečaský, Martin |
Source Sets | Czech ETDs |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/masterThesis |
Rights | info:eu-repo/semantics/restrictedAccess |
Page generated in 0.4324 seconds