1 |
Identifying Search Engine Spam Using DNS. Mathiharan, Siddhartha Sankaran, December 2011
Web crawlers encounter both finite and infinite elements during a crawl. Pages and hosts can be generated endlessly using automated scripts and DNS wildcard entries. Ranking such resources is a challenge, as an entire web of pages and hosts can be created to manipulate the rank of a target resource. It is therefore crucial to differentiate genuine content from spam in real time in order to allocate crawl budgets. In this study, ranking algorithms for hosts are designed that use the finite sets of Pay-Level Domains (PLDs) and IPv4 addresses. Heterogeneous graphs derived from the webgraph of IRLbot are used to achieve this. The first algorithm studied is PLD Supporters (PSUPP), the number of level-2 PLD supporters of each host on the host-host-PLD graph. This is further improved by True PLD Supporters (TSUPP), which uses true egalitarian level-2 PLD supporters on the host-IP-PLD graph together with DNS blacklists. We found that support from content farms and stolen links can be eliminated by computing TSUPP. When TSUPP was applied to the host graph of IRLbot, there was less than 1% spam in the top 100,000 hosts.
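To make the supporter idea concrete, the following is a minimal sketch (not the thesis's implementation) of counting level-2 PLD supporters over a host in-link graph, assuming a simple adjacency-list representation and a host-to-PLD mapping supplied by the caller; excluding a host's own PLD is an assumption of the sketch.

```python
from collections import defaultdict

def psupp_scores(in_links, host_to_pld):
    """Hypothetical PSUPP-style score: for each host, count the distinct
    Pay-Level Domains (PLDs) found among hosts two in-link hops away.
    `in_links[h]` lists hosts that link to h; `host_to_pld[h]` maps a host
    to its PLD."""
    scores = {}
    for host in in_links:
        supporters = set()
        for level1 in in_links.get(host, ()):        # hosts linking to `host`
            for level2 in in_links.get(level1, ()):  # hosts linking to those
                pld = host_to_pld.get(level2)
                if pld and pld != host_to_pld.get(host):
                    supporters.add(pld)
        scores[host] = len(supporters)
    return scores

# Toy usage with made-up host names
in_links = {"a.spam.com": ["b.spam.com"], "b.spam.com": ["news.example.org"],
            "news.example.org": []}
host_to_pld = {"a.spam.com": "spam.com", "b.spam.com": "spam.com",
               "news.example.org": "example.org"}
print(psupp_scores(in_links, host_to_pld))
```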
|
2 |
A Scalable P2P RIA Crawling System with Fault Tolerance. Ben Hafaiedh, Khaled, January 2016
Rich Internet Applications (RIAs) have been widely used on the web over the last decade, as they are more responsive and user-friendly than traditional web applications. RIAs use client-side scripting such as JavaScript, which allows for asynchronous updates from the server using AJAX (Asynchronous JavaScript and XML).
Due to the large size of RIAs and the correspondingly long time required to crawl them, distributed RIA crawling has been introduced with the aim of decreasing the crawling time. However, current RIA crawling systems are not scalable, i.e., they are limited to a relatively low number of crawlers. Furthermore, they do not tolerate faults when a failure occurs in one of their components. In this research, we address the scalability and resilience problems of crawling RIAs in a distributed environment and explore the possibility of designing an efficient RIA crawling system that is both scalable and fault-tolerant. Our approach is to partition the search space among several storage devices (distributed databases) over a peer-to-peer (P2P) network, where each database is responsible for storing only a portion of the RIA graph. This makes the distributed data structure invulnerable to a single point of failure. However, accessing the distributed data required by the crawlers makes the crawling task challenging when the number of crawlers becomes high. We show by simulation results and analytical reasoning that our system is scalable and fault-tolerant. Furthermore, simulation results show that crawling with the P2P system is significantly faster than with either the non-distributed crawling system or a distributed crawling system that uses a single database.
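As an illustration of the partitioning idea (a sketch under assumed names, not the system described in the thesis), application states can be assigned to storage peers by hashing a canonical state identifier, so that each peer stores only its share of the RIA graph and no single node becomes a point of failure.

```python
import hashlib

def peer_for_state(state_id: str, peers: list[str]) -> str:
    """Map a RIA state (e.g., a hash of its DOM) to the peer responsible for
    storing it. Plain modulo hashing is used for brevity; a real P2P system
    would use consistent hashing or a DHT so that peer churn only remaps a
    fraction of the states."""
    digest = hashlib.sha1(state_id.encode("utf-8")).hexdigest()
    return peers[int(digest, 16) % len(peers)]

peers = ["db-node-1", "db-node-2", "db-node-3"]   # hypothetical storage peers
for state in ["dom-hash-001", "dom-hash-002", "dom-hash-003"]:
    print(state, "->", peer_for_state(state, peers))
```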
|
3 |
Scraping Dynamic Websites for Economical Data: A Framework Approach. Legaspi Ramos, Xurxo, January 2016
The Internet is a source of live data that is constantly updated with information from almost any field we can imagine. Tools that can automatically detect these updates and select the information we are interested in are becoming of the utmost importance. For this reason, this thesis focuses on several economic websites, studying their structures and identifying a common type of website in this field: dynamic websites. Although many tools exist for extracting information from the Internet, few of them tackle this kind of website. We therefore study and implement tools that allow developers to address these pages from a different perspective.
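One common way to handle such dynamic pages, shown here as a hedged sketch in which the endpoint URL and JSON field names are invented for illustration, is to bypass the rendered HTML and query the JSON endpoint that the page's own JavaScript calls.

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint: many dynamic sites load their figures from a JSON
# API via AJAX; inspecting the browser's network tab reveals such URLs.
API_URL = "https://example.com/api/quotes?symbol=ACME"

def fetch_quotes(url: str) -> list[dict]:
    """Fetch and decode the JSON payload that the dynamic page would render."""
    with urlopen(url) as response:        # no JavaScript engine needed
        payload = json.load(response)
    return payload.get("quotes", [])      # field name is an assumption

if __name__ == "__main__":
    for quote in fetch_quotes(API_URL):
        print(quote.get("timestamp"), quote.get("price"))
```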
|
4 |
Σημασιολογικός παγκόσμιος ιστός και τεχνικές εξατομίκευσης στις διαδικασίες αναζήτησης/διαπέρασης (Semantic web and personalization in searching and crawling). Καϊτανίδης, Χρήστος, 01 October 2008
The presented master's thesis examines the interaction between two parallel tasks aimed at better utilization of the World Wide Web: (a) the evolution of the World Wide Web into the Semantic Web, and (b) the improvement of crawling and searching methods on the Web.
In the early days of the World Wide Web, the most pressing problem for users searching for information was the lack of sufficient useful sources. Gradually, though at a very fast pace, the World Wide Web turned into one of the largest sources of information in use, as more and more people contribute data about every kind of activity and topic. The problem for users searching for information thus became the quick extraction of useful information from the enormous volume on offer. Terms and techniques such as Data Mining, Information Retrieval, and Knowledge Management were extended to cover the newly emerged medium.
Moreover, in the effort to return better-quality results to the user, an important role is played by exploiting the particular interests that can be extracted for each user, both at the crawling stage, where pages on a specific subject are gathered (topic-focused crawling), and at the searching stage, where the pages most relevant to the individual user are selected (personalization).
At the same time, as the World Wide Web gradually evolves into the Semantic Web, new models and standards (XML, RDF, OWL) are being developed to advance this process. Expressing, transmitting, and searching information with these standards opens new horizons in the use of the Web.
The main objective of this thesis is to exploit the models and standards offered by the Semantic Web in combination with ideas and algorithms already applied on the plain World Wide Web, so that faster and more accurate retrieval and processing of information becomes feasible. Effort was also devoted to techniques that exploit the particular preferences of each user, and to investigating how the new models and standards of the Semantic Web can support this process.
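As a small illustration of how topic-focused crawling and personalization can work together (a sketch with an invented interest profile, not the thesis's algorithm), frontier pages can be prioritized by the similarity between their terms and a profile of the user's interests.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def crawl_priority(page_terms: list[str], user_profile: Counter) -> float:
    """Score a candidate page for a personalized, topic-focused crawler:
    higher similarity to the user's interest profile means earlier download."""
    return cosine(Counter(page_terms), user_profile)

# Hypothetical user profile weighted towards Semantic Web topics
profile = Counter({"semantic": 3, "rdf": 2, "ontology": 2})
print(crawl_priority(["rdf", "ontology", "owl", "semantic"], profile))
print(crawl_priority(["football", "scores", "league"], profile))
```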
|
5 |
Crawling, Collecting, and Condensing News Comments. Gobaan, Raveendran, January 2013
Traditionally, public opinion is measured and policy decided by issuing surveys and performing censuses designed to gauge what the public thinks about a certain topic. Within the past five years, social networks such as Facebook and Twitter have gained traction for collecting public opinion about current events. Academic research on Facebook data proves difficult since the platform is generally closed. Twitter, on the other hand, restricts the conversations of its users, making it difficult to extract large-scale concepts from the microblogging infrastructure.
News comments provide a rich source of discourse from individuals who are passionate about an issue. Because of the overhead of commenting, the population of commenters is necessarily biased towards individuals who have either strong opinions on a topic or in-depth knowledge of the given issue, and their comments are often a collection of insights derived from reading multiple articles on the topic. Unfortunately, the commenting systems employed by news companies are not implemented by a single entity and are often stored and generated using AJAX, which causes traditional crawlers to ignore them. To make matters worse, the comments are often noisy, containing spam, poor grammar, and excessive typos. Finally, due to the anonymity of comment systems, conversations can be derailed by malicious users or by the inherent biases of the commenters.
In this thesis we discuss the design and creation of a crawler built to extract comments from domains across the internet. For practical purposes we create a semi-automatic parser generator and describe how our system employs user feedback to predict which remote procedure calls are used to load comments. By reducing comment systems to remote procedure calls, we reduce the web to a much simpler space, where we can focus on the data almost independently of its presentation. This allows us to quickly create high-fidelity parsers to extract comments from a web page.
Once we have our system, we demonstrate its usefulness by extracting meaningful opinions from the large collections we gather. Doing so in real time, however, defeats traditional summarization systems, which are designed to handle dozens of well-formed documents. To solve this problem we create a new algorithm, KLSum+, which outperforms its competitors in efficiency while generally scoring well on the ROUGE SU4 metric. The algorithm factors in background models to boost accuracy, yet runs over 50 times faster than the alternatives. The resulting summaries show that the collected data can provide useful insight into public opinion and even surface the key points of discourse.
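To illustrate the family of methods that KLSum+ builds on (this is a sketch of plain greedy KL-divergence summarization, not the KLSum+ algorithm itself), sentences are added greedily so that the summary's unigram distribution stays close to that of the full comment collection.

```python
import math
from collections import Counter

def kl_divergence(p: Counter, q: Counter, vocab: set, eps: float = 1e-6) -> float:
    """KL(P || Q) over a shared vocabulary, with simple epsilon smoothing."""
    p_total = sum(p.values()) + eps * len(vocab)
    q_total = sum(q.values()) + eps * len(vocab)
    div = 0.0
    for w in vocab:
        pw = (p[w] + eps) / p_total
        qw = (q[w] + eps) / q_total
        div += pw * math.log(pw / qw)
    return div

def greedy_kl_summary(sentences: list[str], max_words: int = 25) -> list[str]:
    """Greedily pick sentences whose addition keeps the summary's word
    distribution closest (lowest KL divergence) to the whole collection's."""
    collection = Counter(w for s in sentences for w in s.lower().split())
    vocab = set(collection)
    summary, summary_counts = [], Counter()
    while sum(summary_counts.values()) < max_words:
        best, best_score = None, float("inf")
        for s in sentences:
            if s in summary:
                continue
            candidate = summary_counts + Counter(s.lower().split())
            score = kl_divergence(collection, candidate, vocab)
            if score < best_score:
                best, best_score = s, score
        if best is None:
            break
        summary.append(best)
        summary_counts += Counter(best.lower().split())
    return summary

comments = ["The new policy hurts small businesses.",
            "Small businesses cannot absorb the new policy costs.",
            "I love cat pictures.",
            "Policy costs will be passed on to customers."]
print(greedy_kl_summary(comments, max_words=15))
```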
|
6 |
Máquinas de búsqueda para lecturas y escrituras concurrentes (Search engines for concurrent reads and writes). Bonacic Castro, Carolina Alejandra, January 2007
No description available.
|
7 |
Model-based Crawling - An Approach to Design Efficient Crawling Strategies for Rich Internet Applications. Dincturk, Mustafa Emre, 02 August 2013
Rich Internet Applications (RIAs) are a new generation of web applications that break away from the concepts on which traditional web applications are based. RIAs are more interactive and responsive than traditional web applications since they allow client-side scripting (such as JavaScript) and asynchronous communication with the server (using AJAX). Although these are improvements in terms of user-friendliness, they have a big impact on our ability to automatically explore (crawl) these applications. Traditional crawling algorithms are not sufficient for crawling RIAs. We need to be able to crawl RIAs in order to search their content and build their models for various purposes, such as reverse engineering, detecting security vulnerabilities, assessing usability, and applying model-based testing techniques. One important problem is designing efficient crawling strategies for RIAs. It seems possible to design crawling strategies more efficient than the standard strategies, Breadth-First and Depth-First. In this thesis, we explore the possibilities of designing efficient crawling strategies. We use a general approach that we call Model-based Crawling and present two crawling strategies designed using this approach. We show by experimental results that model-based crawling strategies are more efficient than the standard strategies.
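The sketch below conveys the flavor of a greedy, model-guided choice compared with plain level-by-level exploration (an illustration only, with an invented state/event interface; it is not one of the strategies developed in the thesis): instead of exhausting one level at a time, the crawler always moves to the nearest known state that still has unexecuted events.

```python
from collections import deque

def next_target(current, transitions, has_unexecuted_events):
    """Greedy model-guided choice: breadth-first search over the *known* part
    of the application model to find the closest state that still has events
    left to execute. `transitions[s]` maps a state to its known successor
    states; `has_unexecuted_events` is the set of such states."""
    seen, queue = {current}, deque([(current, [])])
    while queue:
        state, path = queue.popleft()
        if state in has_unexecuted_events:
            return state, path           # states to traverse before executing
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None, []                      # model fully explored

# Toy model: s0 -> s1 -> s2, and s0 -> s3; only s2 and s3 still have events.
transitions = {"s0": ["s1", "s3"], "s1": ["s2"], "s2": [], "s3": []}
print(next_target("s1", transitions, {"s2", "s3"}))   # ('s2', ['s2'])
```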
|
8 |
Construction de corpus généraux et spécialisés à partir du Web (Ad hoc and general-purpose corpus construction from web sources). Barbaresi, Adrien, 19 June 2015
At the beginning of the first chapter, the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics is introduced. Then the notion of corpus is put into focus and existing corpus and text definitions are discussed. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s, and the continuities and changes between the linguistic tradition and web-native corpora are laid out. In the second chapter, methodological insights on automated text scrutiny in computer science, computational linguistics, and natural language processing are presented. The state of the art in text quality assessment and web text filtering exemplifies current interdisciplinary research trends on web texts. Readability studies and automated text classification serve as exemplars of methods for finding salient features with which to grasp text characteristics, and text visualization illustrates corpus processing in the digital humanities framework.
As a conclusion, guiding principles for research practice are listed, and reasons are given for finding a balance between quantitative analysis and corpus linguistics in an environment shaped by technological innovation and artificial intelligence techniques. In the third chapter, current research on web corpora is summarized. I distinguish two main approaches to web document retrieval: restricted retrieval and web crawling. The notion of web corpus preprocessing is introduced and its salient steps are discussed; the impact of the preprocessing phase on research results is assessed. I explain why the importance of preprocessing should not be underestimated and why it is important for linguists to learn new skills in order to handle the whole data-gathering and preprocessing phase. In the fourth chapter, I present my work on web corpus construction. My analyses concern two main aspects: first, the question of corpus sources (prequalification), and second, the problem of including valid, desirable documents in a corpus (document qualification). Last, I present work on corpus visualization, which consists of extracting certain corpus characteristics in order to give indications of corpus contents and quality.
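As a toy illustration of document qualification (a sketch with invented thresholds, not the classifiers evaluated in the thesis), a candidate web document can be screened with a few cheap, readability-style text features before it is admitted into the corpus.

```python
import re

def qualifies_for_corpus(text: str,
                         min_tokens: int = 100,
                         max_avg_word_len: float = 12.0,
                         min_alpha_ratio: float = 0.7) -> bool:
    """Cheap pre-inclusion filter: reject documents that are too short, made
    of implausibly long tokens (markup debris, URLs), or dominated by
    non-alphabetic characters. All thresholds are illustrative assumptions."""
    tokens = re.findall(r"\S+", text)
    if len(tokens) < min_tokens:
        return False
    avg_word_len = sum(len(t) for t in tokens) / len(tokens)
    if avg_word_len > max_avg_word_len:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / max(len(text), 1) >= min_alpha_ratio

sample = "word " * 150                          # toy text that passes
print(qualifies_for_corpus(sample))             # True
print(qualifies_for_corpus("404 not found"))    # False: too short
```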
|
10 |
M-crawler: Crawling Rich Internet Applications Using Menu Meta-model. Choudhary, Suryakant, 27 July 2012
Web applications have come a long way, both in terms of adoption for providing information and services and in terms of the technologies used to develop them. With the emergence of richer and more advanced technologies such as AJAX, web applications have become more interactive, responsive, and user-friendly. These applications, often called Rich Internet Applications (RIAs), changed traditional web applications in two primary ways: dynamic manipulation of client-side state and asynchronous communication with the server. At the same time, such techniques introduce new challenges. Among them, an important one is the difficulty of automatically crawling these new applications. Crawling is not only important for indexing content but also critical for web application assessment, such as testing for security vulnerabilities or accessibility. Traditional crawlers are no longer sufficient for these newer technologies, and support for crawling RIAs is either nonexistent or far from perfect. There is a need for an efficient crawler for web applications developed using these new technologies, and as more and more enterprises use them to provide their services, the requirement for a better crawler becomes inevitable. This thesis studies the problems associated with crawling RIAs, which is fundamentally more difficult than crawling traditional multi-page web applications. The thesis also presents an efficient RIA crawling strategy and compares it with existing methods.
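To convey the intuition behind prioritizing events during RIA crawling (a simplified illustration with an invented interface, not the Menu meta-model itself), a crawler can record where each event led from different source states and deprioritize events that keep producing the same, already-known state.

```python
from collections import defaultdict

class EventObservations:
    """Track, per event, the set of resulting states seen so far. Events that
    always lead to one already-known state (menu-like behavior) can be given
    low priority; events with unknown or varied outcomes are explored first."""
    def __init__(self):
        self.results = defaultdict(set)   # event id -> set of resulting states

    def record(self, event: str, resulting_state: str) -> None:
        self.results[event].add(resulting_state)

    def priority(self, event: str) -> int:
        seen = self.results[event]
        if not seen:
            return 0        # never executed: highest priority
        if len(seen) == 1:
            return 2        # looks menu-like: lowest priority
        return 1            # mixed outcomes: medium priority

obs = EventObservations()
obs.record("click:#home", "state_home")
obs.record("click:#home", "state_home")      # same result from another state
obs.record("click:#next", "state_2")
obs.record("click:#next", "state_3")
events = ["click:#new", "click:#home", "click:#next"]
print(sorted(events, key=obs.priority))      # explore unseen events first
```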
|