Web Page Classification Using Features from Titles and Snippets

A search engine can return a large number of web pages for a single keyword, which makes it difficult for people to find the right information. Web page classification is a technology that helps users select relevant information quickly. It is also important for companies that provide marketing and analytics platforms, because it can help them build a healthy mix of listings on search engines and large directories. Classification gives these companies more insight into the distribution of the types of web pages on which their local business listings are found, and ultimately helps marketers make better-informed decisions about marketing campaigns and strategies.
In this thesis, we present a literature review that introduces web page classification, feature selection and feature extraction. The review also compares three commonly used classification algorithms and describes metrics for performance evaluation. The findings from the literature enable us to extend existing classification techniques, methods and algorithms to address a new web page classification problem faced by our industrial partner SweetIQ, a company that provides location-based marketing services and an analytics platform.
We develop a classification method based on SweetIQ's data and business needs. Our method includes typical feature selection and feature extraction methods, but the features we use differ substantially from the traditional ones used in the literature. We test the selected features and find that the text extracted from the title and snippet of a web page can help a classifier achieve good performance. Because our classification method does not require the full content of a web page, it is fast and saves considerable space.
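
As a rough illustration of classifying a page from its title and snippet alone, the following minimal sketch trains a bag-of-words multinomial Naive Bayes classifier with add-one smoothing. It is not the thesis's method: the abstract does not name the classifier or the exact features, so the algorithm, the class name `TitleSnippetNB`, the labels "directory" and "review", and the four training examples are all assumptions made up for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title, snippet):
    """Lowercase the title and snippet and split them into word tokens."""
    return re.findall(r"[a-z0-9]+", f"{title} {snippet}".lower())


class TitleSnippetNB:
    """Multinomial Naive Bayes over title+snippet tokens (illustrative only)."""

    def fit(self, examples):
        # examples: iterable of (title, snippet, label) triples
        self.class_docs = Counter()             # number of pages per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()
        for title, snippet, label in examples:
            tokens = tokenize(title, snippet)
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, title, snippet):
        tokens = tokenize(title, snippet)
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: (title, snippet, label). Not from the thesis.
train = [
    ("Joe's Pizza - Business Directory Listing",
     "Address, phone number and opening hours for Joe's Pizza.", "directory"),
    ("Local Plumbers Directory",
     "Find address and phone number listings for plumbers near you.", "directory"),
    ("Joe's Pizza Reviews",
     "Read customer reviews and ratings of Joe's Pizza.", "review"),
    ("Best Plumber Reviews",
     "Customers share ratings and reviews of local plumbers.", "review"),
]

model = TitleSnippetNB().fit(train)
print(model.predict("Dentist Directory",
                    "Phone number and address listings for dentists."))
```

Only the short title and snippet strings are tokenized and stored, which reflects the abstract's point that the full page content is not needed. A real system would need far more training data and a proper evaluation on held-out pages.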

Identifier: oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/33177
Date: January 2015
Creators: Lu, Zhengyang
Contributors: Benyoucef, Morad
Publisher: Université d'Ottawa / University of Ottawa
Source Sets: Université d’Ottawa
Language: English
Detected Language: English
Type: Thesis