1.
An n-gram Based Approach to the Automatic Classification of Web Pages by Genre
Mason, Jane E. 10 December 2009
The extraordinary growth in both the size and popularity of the World Wide Web has generated a growing interest in the identification of Web page genres, and in the use of these genres to classify Web pages. Web page genre classification is a potentially powerful tool for filtering the results of online searches. Although most information retrieval searches are topic-based, users are typically looking for a specific type of information with regard to a particular query, and genre can provide a complementary dimension along which to categorize Web pages. Web page genre classification could also aid in the automated summarization and indexing of Web pages, and in improving the automatic extraction of metadata.
The hypothesis of this thesis is that a byte n-gram representation of a Web page can be used effectively to classify the Web page by its genre(s). The goal of this thesis was to develop an approach to the problem of Web page genre classification that is effective not only on balanced, single-label corpora, but also on unbalanced and multi-label corpora, which better represent a real-world environment. This thesis research develops n-gram representations for Web pages and Web page genres, and based on these representations, a new approach to the classification of Web pages by genre is developed.
The research includes an exhaustive examination of the questions associated with developing the new classification model, including the length, number, and type of the n-grams with which each Web page and Web page genre is represented, the method of computing the distance (dissimilarity) between two n-gram representations, and the feature selection method with which to choose these n-grams. The effect of preprocessing the data is also studied. Techniques for setting genre thresholds, which allow a Web page to belong to more than one genre or to no genre at all, are also investigated, and the classification performance of the new model is compared with that of the popular support vector machine approach. Experiments are also conducted on highly unbalanced corpora, both with and without the inclusion of noise Web pages.
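None of these abstracts includes code, but the model described here is concrete enough to sketch. The following is a minimal illustration only: the thesis investigates the n-gram length, profile size, distance measure, and feature selection as open questions, whereas this sketch fixes them to one plausible choice (frequency-ranked byte 4-gram profiles compared with an out-of-place rank distance in the style of Cavnar and Trenkle). All function names and parameters are assumptions, not the thesis's settings.

```python
from collections import Counter

def byte_ngram_profile(data: bytes, n: int = 4, top_k: int = 500) -> list:
    """Return the top_k most frequent byte n-grams, most frequent first."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place_distance(page_profile: list, genre_profile: list) -> int:
    """Sum of rank differences between two profiles; an n-gram missing
    from the genre profile incurs the maximum penalty."""
    genre_rank = {gram: r for r, gram in enumerate(genre_profile)}
    max_penalty = len(genre_profile)
    return sum(abs(r - genre_rank.get(gram, max_penalty))
               for r, gram in enumerate(page_profile))

def classify(page: bytes, genre_profiles: dict) -> str:
    """Assign the page to the genre whose profile is least dissimilar."""
    profile = byte_ngram_profile(page)
    return min(genre_profiles,
               key=lambda g: out_of_place_distance(profile, genre_profiles[g]))
```

A genre profile would be built the same way from the concatenated pages of that genre's training set; multi-label output, as studied in the thesis, would replace the `min` with a per-genre distance threshold.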
3.
Web Page Classification Using Features from Titles and Snippets
Lu, Zhengyang, January 2015
Nowadays, when a keyword is provided, a search engine can return a large number of web pages, which makes it difficult for people to find the right information. Web page classification is a technology that can help us make a relevant and quick selection of the information we are looking for. Moreover, web page classification is important for companies that provide marketing and analytics platforms, because it can help them build a healthy mix of listings on search engines and large directories. This provides more insight into the distribution of the types of web pages on which their local business listings are found, and ultimately helps marketers make better-informed decisions about marketing campaigns and strategies.
In this thesis we perform a literature review that introduces web page classification, feature selection and feature extraction. The literature review also includes a comparison of three commonly used classification algorithms and a description of metrics for performance evaluation. The findings in the literature enable us to extend existing classification techniques, methods and algorithms to address a new web page classification problem faced by our industrial partner SweetIQ (a company that provides location-based marketing services and an analytics platform).
We develop a classification method based on SweetIQ's data and business needs. Our method includes typical feature selection and feature extraction methods, but the features we use in this thesis are largely different from traditional ones used in the literature. We test selected features and find that the text extracted from the title and snippet of a web page can help a classifier to achieve good performance. Our classification method does not require the full content of a web page. Thus, it is fast and saves a lot of space.
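The abstract describes the features (title and snippet text) but not the classifier or labels, which follow SweetIQ's business needs. Purely as a sketch of the general idea — classifying from title and snippet alone, never the full page — here is a tiny multinomial Naive Bayes with add-one smoothing; the labels and training examples are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(title: str, snippet: str) -> list:
    # The classifier sees only the title and snippet, never the full page.
    return (title + " " + snippet).lower().split()

class SnippetNB:
    """Multinomial Naive Bayes over title+snippet tokens (add-one smoothing)."""

    def fit(self, examples):  # examples: [(title, snippet, label), ...]
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        for title, snippet, label in examples:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokenize(title, snippet))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, title, snippet):
        total = sum(self.class_counts.values())

        def log_posterior(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total)
            for w in tokenize(title, snippet):
                score += math.log((counts[w] + 1) / denom)
            return score

        return max(self.class_counts, key=log_posterior)
```

Because the input is a few dozen tokens rather than a full page, both training and prediction stay fast and the stored model stays small, which matches the abstract's point about speed and space.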
4.
Large-Scale Web Page Classification
Marath, Sathi 09 November 2010
Web page classification is the process of assigning predefined categories to web pages. Empirical evaluations of classifiers such as Support Vector Machines (SVMs), k-Nearest Neighbor (k-NN), and Naïve Bayes (NB) have shown that these algorithms are effective in classifying small segments of web directories. The effectiveness of these algorithms, however, has not been thoroughly investigated for large-scale classification of such popular web directories as Yahoo! and LookSmart. Such web directories have hundreds of thousands of categories, deep hierarchies, spindle-shaped category and document distributions over the hierarchies, and a skewed category distribution over the documents. These statistical properties indicate class imbalance and rarity within the dataset.
In hierarchical datasets similar to web directories, expanding the content of each category with the web pages of its child categories helps to decrease the degree of rarity. This process, however, results in a localized overabundance of positive instances, especially in the upper-level categories of the hierarchy. The class imbalance, rarity, and localized overabundance of positive instances make applying classification algorithms to web directories very difficult, and the problem has not been thoroughly studied. To our knowledge, the largest number of categories previously classified on web taxonomies is 246,279 categories of the Yahoo! directory, using hierarchical SVMs, achieving a Macro-F1 of only 12%.
We designed a unified framework for the content-based classification of imbalanced hierarchical datasets. The complete Yahoo! web directory of 639,671 categories and 4,140,629 web pages is used to set up the experiments. In a hierarchical dataset, the prior probability distribution of the subcategories indicates the presence or absence of class imbalance, rarity, and the overabundance of positive instances within the dataset. Based on the prior probability distribution and the associated machine learning issues, we partitioned the subcategories of the Yahoo! web directory into five mutually exclusive groups. The effectiveness of different data-level, algorithmic, and architectural solutions to the associated machine learning issues is explored. The best-performing classification technologies for each prior probability distribution were then identified and integrated into the Yahoo! web directory classification model. The methodology is evaluated on a DMOZ subset of 17,217 categories and 130,594 web pages, and we show statistically that it works equally well on large and small datasets.
The average classifier performance in terms of macro-averaged F1-measure achieved in this research is 81.02% for the Yahoo! web directory and 84.85% for the DMOZ subset.
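The Macro-F1 figures quoted in this abstract average F1 over categories without weighting by category size, which is exactly why rare categories dominate the score in an imbalanced directory. A short stdlib sketch of the metric (for the single-label case, as an illustration):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the unweighted
    mean, so a rare category counts as much as a huge one."""
    classes = set(y_true) | set(y_pred)
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

For example, predicting the majority class for every page scores well on a micro-average but collapses under this metric, because every missed rare class contributes an F1 of zero to the mean.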
5.
The Influence of Different Types of Web Page Design on Attitude and Visit Intention of Browsers with Different Information Processing Styles
Lin, Yu-Shan 20 June 2007
The purpose of this study is to investigate the influence of different types of web page design on browsers' attitudes, to determine whether information processing styles play a moderating role, and to examine the relationship between attitude towards the web page and visit intention. Three web pages are specially designed for this research, in three formats: words only, pictures only, and a combination of words and pictures. Respondents are undergraduate students who answer questionnaires online, and SPSS 14 is used to perform the statistical analyses. The principal findings are summarized as follows. First, there is no significant difference in attitude towards the web page between high- and low-NFC individuals when the tourism web page uses a words-only (all-verbal) design. However, when the web page is composed only of pictures, without any written description, individuals show a much lower level of attitude towards the web page regardless of whether their NFC is high or low, with no difference between the two groups. Second, the statistical analysis shows that a higher level of attitude towards the web page is associated with high-PFA individuals, compared with low-PFA individuals, when the tourism web page uses a pictures-only (all-visual) design. When the tourism web page uses a words-only design, both high- and low-PFA individuals show a much lower level of attitude towards the web page, with no difference between them. Third, we find that individuals with both high NFC and high PFA differ significantly from the other groups when the tourism web page uses a combined (words and pictures) design: they show a higher level of attitude towards the web page than the other processor groups. Lastly, the results show a positive correlation between attitude towards the web page and visit intention; attitude towards the web page has a significant impact on visit intention, that is, the higher the attitude towards the web page (Awp), the higher the visit intention (VI).
6.
Content Analysis of the Performance of Online Auction Web Pages: Taking Yahoo! Kimo Bid Website as Example
Huang, Chiao-Chu 23 January 2008
Following the rapid development of the electronic commerce (EC) market, online purchasing of products and services in Taiwan has shown substantial growth and become more mature. According to a marketing research report, more retailers and consumers were expected to enter the online-auction market in 2007. In terms of consumer behavior, the habits of consumers in searching for and purchasing products are gradually taking shape, and a flood of individual sellers are joining the online-auction competition, seeking opportunities to sell their products.
Past research focusing on online auction sellers is limited. This research combines the FCB Grid with four website features to preliminarily analyze and explore how professional retailers present their product pages in the Yahoo! Kimo online auction interface, in order to examine whether their product presentation differs from that of other EC websites. In addition, we also discuss whether products' characteristics lead to any discrepancy in product presentation and ERP.
The research uses the content analysis method, drawing a total of 392 samples from 18 categories on the Yahoo! Kimo online auction platform. Each sampled seller's accumulated evaluation score had to exceed 1000 and rank in the top 25 of its category, and for each sample four web pages were examined: the virtual storefront, about-the-seller, product, and evaluation pages.
As a result of this research, we find that products with high emotional involvement have the highest average accumulated positive/negative evaluation scores; however, there is no difference in the trading problems revealed in the negative evaluations. These sellers put more emphasis on rational appeals such as stating facts, solving problems, and using recommendations from consumers, while some also use emotional appeals that tend to create an overall image of the storefront. In addition, the sellers of high-emotional-involvement products are more than happy to mention ego gratification and social acceptance. On the other hand, appeals often exercised in traditional media advertising seldom appear on the online auction website. Compared with other research, the auction sellers provide a huge amount of product information, and the categories of information are affected by the online auction platform.
From the websites of professional sellers, four concrete features are also concluded. First, sellers use pictures to develop the value of products. Second, well-known brand awareness of a product is not the main factor in success on the online auction platform; instead, sellers create their own brand images. Third, sellers provide sufficient social clues, but seldom refer to the protection of consumers' private information. Fourth, influenced by the Internet, they make good use of sales promotions such as e-papers, discounts, and product warranties.
Finally, the research also describes four kinds of product webpage features and makes concrete recommendations on building product web pages for sellers who want to develop a long-term business in online auctions.
7.
Malicious Web Page Detection Based on Anomaly Behavior
Tsai, Wan-yi 04 February 2009
Because of the convenience of the Internet, we rely heavily on it for information searching and sharing, forum discussion, and online services. However, most of the websites we visit are developed by people with limited security knowledge, and this results in many vulnerabilities in web applications. Unfortunately, hackers have successfully taken advantage of these vulnerabilities to inject malicious JavaScript into compromised web pages and trigger drive-by download attacks.
Based on our long-term observation of malicious web pages, they exhibit unusual behavior intended to evade detection, which makes them different from normal pages. We therefore propose a client-side malicious web page detection mechanism named Web Page Checker (WPC), which traces and analyzes anomalous behavior to identify malicious web pages. The experimental results show that our method can identify malicious web pages and alert website visitors efficiently.
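WPC itself traces runtime behavior on the client, which this abstract does not detail. Purely as an illustration of the weaker, static end of the same idea — flagging page characteristics commonly associated with drive-by download pages — here is a small pattern scanner; the pattern names and regexes are illustrative assumptions, not WPC's actual rules.

```python
import re

# Illustrative heuristics only: obfuscated eval chains, zero-size iframes,
# and long percent-escaped blobs are patterns often cited for drive-by pages.
SUSPICIOUS_PATTERNS = {
    "obfuscated eval": re.compile(
        r"eval\s*\(\s*(unescape|atob|String\.fromCharCode)", re.I),
    "hidden iframe": re.compile(
        r"<iframe[^>]*(width|height)\s*=\s*['\"]?0", re.I),
    "long escape blob": re.compile(r"(%[0-9a-fA-F]{2}){40,}"),
}

def scan_page(html: str) -> list:
    """Return the names of all suspicious patterns found in the page source."""
    return [name for name, pattern in SUSPICIOUS_PATTERNS.items()
            if pattern.search(html)]
```

A behavior-based detector like the one described in the thesis would instead observe what the page does when rendered, which is precisely what lets it catch pages whose source reveals nothing suspicious.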
8.
Automatic Multi-word Term Extraction and its Application to Web-page Summarization
Huo, Weiwei 20 December 2012
In this thesis we propose three new word association measures for multi-word term extraction. We combine these association measures with LocalMaxs algorithm in our extraction model and compare the results of different multi-word term extraction methods. Our approach is language and domain independent and requires no training data. It can be applied to such tasks as text summarization, information retrieval, and document classification.
We further explore the potential of using multi-word terms as an effective representation for general web-page summarization. We extract multi-word terms from human-written summaries in a large collection of web pages, and generate summaries by aligning document words with these multi-word terms. Our system applies machine translation technology to learn the alignment from a training set and focuses on selecting high-quality multi-word terms from human-written summaries to generate suitable results for web-page summarization.
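The thesis's three new association measures are not specified in this abstract. As a simplified stand-in, the sketch below uses one classical glue measure, Symmetric Conditional Probability (SCP, Silva and Lopes), together with a LocalMaxs-style test: an n-gram is kept only when its glue is at least that of its substrings and strictly greater than that of any surrounding (n+1)-gram. Thresholds, tokenization, and the n-gram range are assumptions.

```python
from collections import Counter

def scp_glue(gram, counts, total):
    """SCP glue of an n-gram: p(gram)^2 divided by the average over all
    ways of splitting the gram into a left and right part."""
    p = counts[gram] / total
    splits = [(counts[gram[:i]] / total) * (counts[gram[i:]] / total)
              for i in range(1, len(gram))]
    return p * p / (sum(splits) / len(splits))

def extract_terms(tokens, max_n=3):
    """Keep n-grams whose glue is a local maximum: at least as glued as
    their substrings, more glued than any surrounding (n+1)-gram."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    total = len(tokens)
    glue = {g: scp_glue(g, counts, total) for g in counts if len(g) >= 2}
    terms = set()
    for g, gl in glue.items():
        subs = [glue[g[:-1]], glue[g[1:]]] if len(g) > 2 else []
        supers = [v for h, v in glue.items()
                  if len(h) == len(g) + 1 and (h[:-1] == g or h[1:] == g)]
        if all(gl >= s for s in subs) and all(gl > s for s in supers):
            terms.add(" ".join(g))
    return terms
```

Like the approach described above, this needs no training data and no language-specific resources; only co-occurrence statistics of the corpus itself decide which word sequences count as terms.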
9.
Examining the Complexity of Popular Websites
Tian, Ran 18 August 2015
A significant fraction of today's Internet traffic is associated with popular websites such as YouTube, Netflix, or Facebook. In recent years, major websites have become more complex as they incorporate a larger number and more diverse types of objects (e.g., video, audio, code), delivered in more elaborate ways from multiple servers. These factors not only affect the loading time of pages but also determine the pattern of the resulting traffic on the Internet.
In this thesis, we characterize the complexity of major websites through large-scale measurement and analysis. We identify thousands of the most popular websites from multiple locations and characterize their complexity. We examine the effect of relative popularity ranking and business type on the complexity of websites. Finally, we compare and contrast our results with a similar study conducted four years earlier and report on the observed changes in different aspects.
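Complexity measurements of this kind typically count the objects a page references and the distinct servers they are fetched from. As a minimal stdlib sketch of those two measures (the tag-to-attribute mapping and the choice of metrics are assumptions, not the thesis's methodology, which measures actual downloads):

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

class ResourceCounter(HTMLParser):
    """Tally referenced objects by type and collect distinct server names,
    two simple proxies for page complexity."""

    # Which attribute names the resource URL for each tag of interest.
    TAGS = {"img": "src", "script": "src", "link": "href",
            "iframe": "src", "video": "src", "audio": "src"}

    def __init__(self):
        super().__init__()
        self.by_type = Counter()
        self.servers = set()

    def handle_starttag(self, tag, attrs):
        attr = self.TAGS.get(tag)
        url = dict(attrs).get(attr) if attr else None
        if url:
            self.by_type[tag] += 1
            host = urlparse(url).netloc
            if host:                      # relative URLs stay on the same server
                self.servers.add(host)
```

A static parse like this undercounts objects added by scripts at load time, which is one reason large-scale studies instrument a real browser instead.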
10.
A study of best practice design guidelines and the development of a usability analysis tool for the evaluation of Australian academic library web sites
Raward, Roslyn, January 2002
The library profession is now heavily involved in providing access to information through library web sites, and it is a challenge to design a web site that has reliable content and a user interface that is intuitive to those who use it. As web accessibility and usability are major issues in the design of library web sites, this paper suggests that the design will be most successful when a usability analysis tool is used throughout the design and redesign of academic library web sites.
The research drew on the literature of human-computer interaction and usability engineering, examining best-practice usability and accessibility design guidelines, and identified those guidelines that were relevant to academic library web sites. To establish the extent to which Australian academic library web sites met usability guidelines, a usability analysis tool was developed and used to evaluate a randomly selected sample of web sites. The web sites were categorised under the higher education institutional archetypes suggested by DETYA (1998), and the results were discussed in light of these groups. The research found no correlation between the usability of the web sites and the archetypes; in fact, the pattern of usability was randomly distributed across all institutions, with the best and worst results appearing in each archetypal category. The study concluded that the web has provided a whole new start for all institutions and, after examining the results, suggested that the design of early web sites was not based on the size or past history of the institution, but rather reflected factors, already established in the literature, that faced library web managers at the time of designing the library web page.