Global ETD Search

1	Web Page Classification Using Features from Titles and Snippets Lu, Zhengyang January 2015 (has links) Nowadays, when a keyword is provided, a search engine can return a large number of web pages, which makes it difficult for people to find the right information. Web page classification is a technology that can help us to make a relevant and quick selection of information that we are looking for. Moreover, web page classification is important for companies that provide marketing and analytics platforms, because it can help them to build a healthy mix of listings on search engines and large directories. This will provide more insight into the distribution of the types of web pages their local business listings are found on, and finally will help marketers to make better-informed decisions about marketing campaigns and strategies. In this thesis we perform a literature review that introduces web page classification, feature selection and feature extraction. The literature review also includes a comparison of three commonly used classification algorithms and a description of metrics for performance evaluation. The findings in the literature enable us to extend existing classification techniques, methods and algorithms to address a new web page classification problem faced by our industrial partner SweetIQ (a company that provides location-based marketing services and an analytics platform). We develop a classification method based on SweetIQ's data and business needs. Our method includes typical feature selection and feature extraction methods, but the features we use in this thesis are largely different from traditional ones used in the literature. We test selected features and find that the text extracted from the title and snippet of a web page can help a classifier to achieve good performance. Our classification method does not require the full content of a web page. Thus, it is fast and saves a lot of space. Web Page Classification Feature Selection
2	Large-Scale Web Page Classification Marath, Sathi 09 November 2010 (has links) Web page classification is the process of assigning predefined categories to web pages. Empirical evaluations of classifiers such as Support Vector Machines (SVMs), k-Nearest Neighbor (k-NN), and Naïve Bayes (NB), have shown that these algorithms are effective in classifying small segments of web directories. The effectiveness of these algorithms, however, has not been thoroughly investigated on large-scale web page classification of such popular web directories as Yahoo! and LookSmart. Such web directories have hundreds of thousands of categories, deep hierarchies, spindle category and document distributions over the hierarchies, and skewed category distribution over the documents. These statistical properties indicate class imbalance and rarity within the dataset. In hierarchical datasets similar to web directories, expanding the content of each category using the web pages of the child categories helps to decrease the degree of rarity. This process, however, results in the localized overabundance of positive instances especially in the upper level categories of the hierarchy. The class imbalance, rarity and the localized overabundance of positive instances make applying classification algorithms to web directories very difficult and the problem has not been thoroughly studied. To our knowledge, the maximum number of categories ever previously classified on web taxonomies is 246,279 categories of Yahoo! directory using hierarchical SVMs leading to a Macro-F1 of 12% only. We designed a unified framework for the content based classification of imbalanced hierarchical datasets. The complete Yahoo! web directory of 639,671 categories and 4,140,629 web pages is used to setup the experiments. In a hierarchical dataset, the prior probability distribution of the subcategories indicates the presence or absence of class imbalance, rarity and the overabundance of positive instances within the dataset. Based on the prior probability distribution and associated machine learning issues, we partitioned the subcategories of Yahoo! web directory into five mutually exclusive groups. The effectiveness of different data level, algorithmic and architectural solutions to the associated machine learning issues is explored. Later, the best performing classification technologies for a particular prior probability distribution have been identified and integrated into the Yahoo! Web directory classification model. The methodology is evaluated using a DMOZ subset of 17,217 categories and 130,594 web pages and we statistically proved that the methodology of this research works equally well on large and small dataset. The average classifier performance in terms of macro-averaged F1-Measure achieved in this research for Yahoo! web directory and DMOZ subset is 81.02% and 84.85% respectively.
3	An n-gram Based Approach to the Automatic Classification of Web Pages by Genre Mason, Jane E. 10 December 2009 (has links) The extraordinary growth in both the size and popularity of the World Wide Web has generated a growing interest in the identification of Web page genres, and in the use of these genres to classify Web pages. Web page genre classification is a potentially powerful tool for filtering the results of online searches. Although most information retrieval searches are topic-based, users are typically looking for a specific type of information with regard to a particular query, and genre can provide a complementary dimension along which to categorize Web pages. Web page genre classification could also aid in the automated summarization and indexing of Web pages, and in improving the automatic extraction of metadata. The hypothesis of this thesis is that a byte n-gram representation of a Web page can be used effectively to classify the Web page by its genre(s). The goal of this thesis was to develop an approach to the problem of Web page genre classification that is effective not only on balanced, single-label corpora, but also on unbalanced and multi-label corpora, which better represent a real world environment. This thesis research develops n-gram representations for Web pages and Web page genres, and based on these representations, a new approach to the classification of Web pages by genre is developed. The research includes an exhaustive examination of the questions associated with developing the new classification model, including the length, number, and type of the n-grams with which each Web page and Web page genre is represented, the method of computing the distance (dissimilarity) between two n-gram representations, and the feature selection method with which to choose these n-grams. The effect of preprocessing the data is also studied. Techniques for setting genre thresholds in order to allow a Web page to belong to more than one genre, or to no genre at all are also investigated, and a comparison of the classification performance of the new classification model with that of the popular support vector machine approach is made. Experiments are also conducted on highly unbalanced corpora, both with and without the inclusion of noise Web pages.
4	An n-gram Based Approach to the Automatic Classification of Web Pages by Genre Mason, Jane E. 10 December 2009 (has links) The extraordinary growth in both the size and popularity of the World Wide Web has generated a growing interest in the identification of Web page genres, and in the use of these genres to classify Web pages. Web page genre classification is a potentially powerful tool for filtering the results of online searches. Although most information retrieval searches are topic-based, users are typically looking for a specific type of information with regard to a particular query, and genre can provide a complementary dimension along which to categorize Web pages. Web page genre classification could also aid in the automated summarization and indexing of Web pages, and in improving the automatic extraction of metadata. The hypothesis of this thesis is that a byte n-gram representation of a Web page can be used effectively to classify the Web page by its genre(s). The goal of this thesis was to develop an approach to the problem of Web page genre classification that is effective not only on balanced, single-label corpora, but also on unbalanced and multi-label corpora, which better represent a real world environment. This thesis research develops n-gram representations for Web pages and Web page genres, and based on these representations, a new approach to the classification of Web pages by genre is developed. The research includes an exhaustive examination of the questions associated with developing the new classification model, including the length, number, and type of the n-grams with which each Web page and Web page genre is represented, the method of computing the distance (dissimilarity) between two n-gram representations, and the feature selection method with which to choose these n-grams. The effect of preprocessing the data is also studied. Techniques for setting genre thresholds in order to allow a Web page to belong to more than one genre, or to no genre at all are also investigated, and a comparison of the classification performance of the new classification model with that of the popular support vector machine approach is made. Experiments are also conducted on highly unbalanced corpora, both with and without the inclusion of noise Web pages.
5	Grid-Enabled Automatic Web Page Classification Metikurke, Seema Sreenivasamurthy 12 June 2006 (has links) Much research has been conducted on the retrieval and classification of web-based information. A big challenge is the performance issue, especially for a classification algorithm returning results for a large set of data that is typical when accessing the Web. This thesis describes a grid-enabled approach for automatic web page classification. The basic approach is first described that uses a vector space model (VSM). An enhancement of the approach through the use of a genetic algorithm (GA) is then described. The enhanced approach can efficiently process candidate web pages from a number of web sites and classify them. A prototype is implemented and empirical studies are conducted. The contributions of this thesis are: 1) Application of grid computing to improve performance of both VSM and GA using VSM based web page classification; 2) Improvement of the VSM classification algorithm by applying GA that uniquely discovers a set of training web pages while also generating a near optimal parameter values set for VSM. Automatic Web Page Classification Vector Space Model Genetic Algorithm Grid Computing Computer Sciences
6	OPIS : um método para identificação e busca de páginas-objeto / OPIS : a method for object page identifying and searching Colpo, Miriam Pizzatto January 2014 (has links) Páginas-objeto são páginas que representam exatamente um objeto inerente do mundo real na web, considerando um domínio específico, e a busca por essas páginas é chamada de busca-objeto. Os motores de busca convencionais (do Inglês, General Search Engine - GSE) conseguem responder, de forma satisfatória, à maioria das consultas realizadas na web atualmente, porém, isso dificilmente ocorre no caso de buscas-objeto, uma vez que, em geral, a quantidade de páginas-objeto recuperadas é bastante limitada. Essa dissertação propõe um novo método para a identificação e a busca de páginas-objeto, denominado OPIS (acrônimo para Object Page Identifying and Searching). O cerne do OPIS está na adoção de técnicas de realimentação de relevância e aprendizagem de máquina na tarefa de classificação, baseada em conteúdo, de páginas-objeto. O OPIS não descarta o uso de GSEs e, ao invés disso, em sua etapa de busca, propõe a integração de um classificador a um GSE, adicionando uma etapa de filtragem ao processo de busca tradicional. Essa abordagem permite que somente páginas identificadas como páginas-objeto sejam recuperadas pelas consultas dos usuários, melhorando, assim, os resultados de buscas-objeto. Experimentos, considerando conjuntos de dados reais, mostram que o OPIS supera o baseline com ganho médio de 47% de precisão média. / Object pages are pages that represent exactly one inherent real-world object on the web, regarding a specific domain, and the search for these pages is named as object search. General Search Engines (GSE) can satisfactorily answer most of the searches performed in the web nowadays, however, this hardly occurs with object search, since, in general, the amount of retrieved object pages is limited. This work proposes a method for both identifying and searching object pages, named OPIS (acronyms to Object Page Identifying and Searching). The kernel of OPIS is to adopt relevance feedback and machine learning techniques in the task of content-based classification of object pages. OPIS does not discard the use of GSEs and, instead, in his search step, proposes the integration of a classifier to a GSE, adding a filtering step to the traditional search process. This simple approach allows that only pages identified as object pages are retrieved by user queries, improving the results for object search. Experiments with real datasets show that OPIS outperforms the baseline with average boost of 47% considering the average precision. Banco : Dados Recuperacao : Informacao Object page Object search Relevance feedback Web page classification
7	OPIS : um método para identificação e busca de páginas-objeto / OPIS : a method for object page identifying and searching Colpo, Miriam Pizzatto January 2014 (has links) Páginas-objeto são páginas que representam exatamente um objeto inerente do mundo real na web, considerando um domínio específico, e a busca por essas páginas é chamada de busca-objeto. Os motores de busca convencionais (do Inglês, General Search Engine - GSE) conseguem responder, de forma satisfatória, à maioria das consultas realizadas na web atualmente, porém, isso dificilmente ocorre no caso de buscas-objeto, uma vez que, em geral, a quantidade de páginas-objeto recuperadas é bastante limitada. Essa dissertação propõe um novo método para a identificação e a busca de páginas-objeto, denominado OPIS (acrônimo para Object Page Identifying and Searching). O cerne do OPIS está na adoção de técnicas de realimentação de relevância e aprendizagem de máquina na tarefa de classificação, baseada em conteúdo, de páginas-objeto. O OPIS não descarta o uso de GSEs e, ao invés disso, em sua etapa de busca, propõe a integração de um classificador a um GSE, adicionando uma etapa de filtragem ao processo de busca tradicional. Essa abordagem permite que somente páginas identificadas como páginas-objeto sejam recuperadas pelas consultas dos usuários, melhorando, assim, os resultados de buscas-objeto. Experimentos, considerando conjuntos de dados reais, mostram que o OPIS supera o baseline com ganho médio de 47% de precisão média. / Object pages are pages that represent exactly one inherent real-world object on the web, regarding a specific domain, and the search for these pages is named as object search. General Search Engines (GSE) can satisfactorily answer most of the searches performed in the web nowadays, however, this hardly occurs with object search, since, in general, the amount of retrieved object pages is limited. This work proposes a method for both identifying and searching object pages, named OPIS (acronyms to Object Page Identifying and Searching). The kernel of OPIS is to adopt relevance feedback and machine learning techniques in the task of content-based classification of object pages. OPIS does not discard the use of GSEs and, instead, in his search step, proposes the integration of a classifier to a GSE, adding a filtering step to the traditional search process. This simple approach allows that only pages identified as object pages are retrieved by user queries, improving the results for object search. Experiments with real datasets show that OPIS outperforms the baseline with average boost of 47% considering the average precision. Banco : Dados Recuperacao : Informacao Object page Object search Relevance feedback Web page classification
8	OPIS : um método para identificação e busca de páginas-objeto / OPIS : a method for object page identifying and searching Colpo, Miriam Pizzatto January 2014 (has links) Páginas-objeto são páginas que representam exatamente um objeto inerente do mundo real na web, considerando um domínio específico, e a busca por essas páginas é chamada de busca-objeto. Os motores de busca convencionais (do Inglês, General Search Engine - GSE) conseguem responder, de forma satisfatória, à maioria das consultas realizadas na web atualmente, porém, isso dificilmente ocorre no caso de buscas-objeto, uma vez que, em geral, a quantidade de páginas-objeto recuperadas é bastante limitada. Essa dissertação propõe um novo método para a identificação e a busca de páginas-objeto, denominado OPIS (acrônimo para Object Page Identifying and Searching). O cerne do OPIS está na adoção de técnicas de realimentação de relevância e aprendizagem de máquina na tarefa de classificação, baseada em conteúdo, de páginas-objeto. O OPIS não descarta o uso de GSEs e, ao invés disso, em sua etapa de busca, propõe a integração de um classificador a um GSE, adicionando uma etapa de filtragem ao processo de busca tradicional. Essa abordagem permite que somente páginas identificadas como páginas-objeto sejam recuperadas pelas consultas dos usuários, melhorando, assim, os resultados de buscas-objeto. Experimentos, considerando conjuntos de dados reais, mostram que o OPIS supera o baseline com ganho médio de 47% de precisão média. / Object pages are pages that represent exactly one inherent real-world object on the web, regarding a specific domain, and the search for these pages is named as object search. General Search Engines (GSE) can satisfactorily answer most of the searches performed in the web nowadays, however, this hardly occurs with object search, since, in general, the amount of retrieved object pages is limited. This work proposes a method for both identifying and searching object pages, named OPIS (acronyms to Object Page Identifying and Searching). The kernel of OPIS is to adopt relevance feedback and machine learning techniques in the task of content-based classification of object pages. OPIS does not discard the use of GSEs and, instead, in his search step, proposes the integration of a classifier to a GSE, adding a filtering step to the traditional search process. This simple approach allows that only pages identified as object pages are retrieved by user queries, improving the results for object search. Experiments with real datasets show that OPIS outperforms the baseline with average boost of 47% considering the average precision. Banco : Dados Recuperacao : Informacao Object page Object search Relevance feedback Web page classification
9	Metody klasifikace webových stránek / Methods of Web Page Classification Nachtnebl, Viktor January 2012 (has links) This work deals with methods of web page classification. It explains the concept of classification and different features of web pages used for their classification. Further it analyses representation of a page and in detail describes classification method that deals with hierarchical category model and is able to dynamically create new categories. In the second half it shows implementation of chosen method and describes the results.
10	Pokročilé metody blokování nevhodného obsahu v mobilním webovém prohlížeči / Advanced Methods for Blocking of Inappropriate Content in a Mobile Web Browser Svoboda, Vladimír January 2016 (has links) This work describes actual state of open-source browsers on the Android platform and compares their features. It describes webpage classification problem and methods how to detect porn websites. It also shows a design of a system for blocking and detection of webpages with adult content and its implementation. For pornography detection there were used text based methods but also methods based on detection in images from web page with deep learning. Implemented solution was tested with experiments and described. The last chapter contains summarization of this work and proposal of improvements.

Search results