1 |
Identifying Search Engine Spam Using DNS
Mathiharan, Siddhartha Sankaran 2011 December 1900
Web crawlers encounter both finite and infinite elements during a crawl. Pages and hosts can be generated without limit using automated scripts and DNS wildcard entries. Ranking such resources is a challenge, because an entire web of pages and hosts can be created to manipulate the rank of a target resource. Differentiating genuine content from spam in real time is crucial for allocating crawl budgets. In this study, host-ranking algorithms are designed that use the finite Pay Level Domains (PLDs) and IPv4 addresses. Heterogeneous graphs derived from the webgraph of IRLbot are used to achieve this. The first algorithm studied is PLD Supporters (PSUPP), the number of level-2 PLD supporters for each host on the host-host-PLD graph. It is further improved by True PLD Supporters (TSUPP), which uses true egalitarian level-2 PLD supporters on the host-IP-PLD graph together with DNS blacklists. Support from content farms and stolen links was found to be eliminated by computing TSUPP. When TSUPP was applied to the host graph of IRLbot, the top 100,000 hosts contained less than 1% spam.
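The abstract does not spell out how PSUPP is computed, so the following is only a rough sketch under stated assumptions: "level-2 supporters" is read here as the hosts reached by following two in-links back from a host, and the score is the number of distinct PLDs among them, excluding the host's own PLD. The function names, the naive `pld` helper, and the example host names are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def pld(host):
    """Naive pay-level domain: the last two labels of the host name.

    Assumption for illustration only; a real system would consult a
    public-suffix list to handle names such as example.co.uk.
    """
    return ".".join(host.split(".")[-2:])


def level2_pld_supporters(edges):
    """For each linked-to host, count distinct PLDs two in-link hops away."""
    in_links = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # ignore self-links
            in_links[dst].add(src)

    scores = {}
    for host in list(in_links):
        supporters = set()
        for mid in in_links[host]:
            for far in in_links.get(mid, ()):
                # Support from the host's own PLD is not counted.
                if pld(far) != pld(host):
                    supporters.add(pld(far))
        scores[host] = len(supporters)
    return scores


# Hypothetical host graph: a link farm under one PLD versus two independent PLDs.
edges = [
    ("a.farm.com", "hub.farm.com"),
    ("b.farm.com", "hub.farm.com"),
    ("c.farm.com", "hub.farm.com"),
    ("hub.farm.com", "promo.example.org"),
    ("news.alpha.net", "blog.beta.org"),
    ("wiki.gamma.edu", "blog.beta.org"),
    ("blog.beta.org", "good.delta.com"),
]
scores = level2_pld_supporters(edges)
```

In this toy graph the three farm hosts collapse into a single PLD, so `promo.example.org` gets a score of 1, while `good.delta.com` gets 2 from two independent PLDs. Counting PLDs rather than hosts is what keeps wildcard-generated hosts from inflating the score; the IP-based and blacklist refinements of TSUPP are not modeled here.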
|
2 |
Scraping Dynamic Websites for Economical Data : A Framework Approach
Legaspi Ramos, Xurxo January 2016
The Internet is a source of live data that is constantly updated with data from almost any field we can imagine. Tools that can automatically detect these updates and select the information we are interested in are becoming of the utmost importance. For that reason, this thesis focuses on some economic websites, studying their structures and identifying a type of website common in this field: dynamic websites. Although many tools allow information to be extracted from the Internet, not many tackle this kind of website. We therefore study and implement some tools that allow developers to address these pages from a different perspective.
|
3 |
Construction de corpus généraux et spécialisés à partir du Web / Ad hoc and general-purpose corpus construction from web sources
Barbaresi, Adrien 19 June 2015
The first chapter opens by introducing the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics. The notion of corpus is then put into focus, and existing corpus and text definitions are discussed. The need for linguistic evidence that spans several disciplines is illustrated by a number of research scenarios. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web-native corpora are set out. The second chapter presents methodological insights on automated text scrutiny in computer science, computational linguistics, and natural language processing. The state of the art in text quality assessment and web text filtering exemplifies current interdisciplinary research trends on web texts. Readability studies and automated text classification serve as model methods for finding salient features that capture text characteristics, and common denominators are isolated. Text visualization exemplifies corpus processing in the digital humanities framework. In conclusion, guiding principles for research practice are listed, and reasons are given for balancing quantitative analysis and corpus linguistics in an environment shaped by technological innovation and artificial intelligence techniques. The third chapter summarizes current research on web corpora. I distinguish two main approaches to web document retrieval: restricted retrieval and web crawling. Data collection is examined with particular attention, especially the case of source URLs. The notion of web corpus preprocessing is introduced and its salient steps are discussed. The impact of the preprocessing phase on research results is assessed. I explain why the importance of preprocessing should not be underestimated and why linguists need to learn new skills in order to handle the whole data gathering and preprocessing phase; the simplicity and reproducibility of corpus construction are emphasized. The fourth chapter presents my work on web corpus construction. My analyses concern two main aspects: first, the question of corpus sources (or prequalification), for which an approach using a light scout to prepare the web crawl is presented; and second, the problem of including valid, desirable documents in a corpus (or document qualification), where insights from readability studies and machine learning techniques can be used during corpus construction, with a set of textual features tested on annotated samples to evaluate the method. Last, I present work on corpus visualization, which consists of extracting certain corpus characteristics in order to give indications of corpus contents and quality.
|
4 |
M-crawler: Crawling Rich Internet Applications Using Menu Meta-model
Choudhary, Suryakant 27 July 2012
Web applications have come a long way, both in their adoption as a means of providing information and services and in the technologies used to develop them. With the emergence of richer and more advanced technologies such as Ajax, web applications have become more interactive, responsive, and user-friendly. These applications, often called Rich Internet Applications (RIAs), changed traditional web applications in two primary ways: dynamic manipulation of client-side state and asynchronous communication with the server. At the same time, such techniques introduce new challenges. An important one is the difficulty of automatically crawling these new applications. Crawling is important not only for indexing content but also for web application assessment, such as testing for security vulnerabilities or accessibility. Traditional crawlers are no longer sufficient for these newer technologies, and crawling of RIAs is either nonexistent or far from perfect. There is a need for an efficient crawler for web applications developed using these new technologies. Further, as more and more enterprises use these technologies to provide their services, the need for a better crawler becomes pressing. This thesis studies the problems associated with crawling RIAs, which is fundamentally more difficult than crawling traditional multi-page web applications. It also presents an efficient RIA crawling strategy and compares it with existing methods.
|
6 |
Topic-Oriented Collaborative Web Crawling
Chung, Chiasen January 2001
A web crawler is a program that "walks" the Web to gather web resources. To scale to the ever-growing Web, multiple crawling agents may be deployed in a distributed fashion to retrieve web data cooperatively. A common approach is to divide the Web into many partitions, with an agent assigned to crawl within each one. If an agent obtains a web resource that is not from its partition, the resource is transferred to its rightful owner. This thesis proposes a novel approach to distributed web data gathering that partitions the Web by topic. The approach employs multiple focused crawlers to retrieve pages on various topics. When a crawler retrieves a page belonging to another topic, it transfers the page to the appropriate crawler. This approach is known as topic-oriented collaborative web crawling. An implementation of the system was built and experimentally evaluated. To identify the topic of a web page, a topic classifier was incorporated into the crawling system. As the classifier categorizes only English pages, a language identifier was also introduced to distinguish English pages from non-English ones. The experimental results showed that redundant retrieval was low and that a resource retrieved by an agent is six times more likely to be retained than in a system that uses a conventional hashing approach. These numbers were viewed as strong indications that topic-oriented collaborative web crawling is a viable approach to web data gathering.
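The contrast between topic-based and hash-based ownership can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the keyword-overlap classifier stands in for the thesis's topic classifier, the topics and keywords are invented, and hashing the URL modulo the number of agents is only one common reading of a "conventional hashing approach".

```python
import hashlib

# Hypothetical topics and keywords; the thesis's real classifier is not described here.
TOPIC_KEYWORDS = {
    "sports": {"match", "league", "goal"},
    "finance": {"stock", "market", "bond"},
}


def classify_topic(text):
    """Toy classifier: pick the topic whose keywords overlap the page most."""
    words = set(text.lower().split())
    best = max(TOPIC_KEYWORDS, key=lambda t: len(words & TOPIC_KEYWORDS[t]))
    return best if words & TOPIC_KEYWORDS[best] else "other"


def owner_by_hash(url, n_agents):
    """Baseline partitioning: hash the URL and take it modulo the agent count."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_agents


def route(page_text, fetched_by):
    """Keep the page if the fetching agent owns its topic, else transfer it."""
    owner = classify_topic(page_text)
    action = "keep" if owner == fetched_by else "transfer"
    return action, owner


kept = route("The league match ended with a late goal", fetched_by="sports")
moved = route("Stock market rallies as bond yields fall", fetched_by="sports")
```

Here the sports agent keeps the first page and hands the second to the finance agent. A focused crawler tends to follow links that stay on its own topic, which is one plausible reason retention would be higher than under hashing, where ownership is unrelated to the link structure an agent follows.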
|
8 |
M-crawler: Crawling Rich Internet Applications Using Menu Meta-model
Choudhary, Suryakant January 2012
Web applications have come a long way, both in their adoption as a means of providing information and services and in the technologies used to develop them. With the emergence of richer and more advanced technologies such as Ajax, web applications have become more interactive, responsive, and user-friendly. These applications, often called Rich Internet Applications (RIAs), changed traditional web applications in two primary ways: dynamic manipulation of client-side state and asynchronous communication with the server. At the same time, such techniques introduce new challenges. An important one is the difficulty of automatically crawling these new applications. Crawling is important not only for indexing content but also for web application assessment, such as testing for security vulnerabilities or accessibility. Traditional crawlers are no longer sufficient for these newer technologies, and crawling of RIAs is either nonexistent or far from perfect. There is a need for an efficient crawler for web applications developed using these new technologies. Further, as more and more enterprises use these technologies to provide their services, the need for a better crawler becomes pressing. This thesis studies the problems associated with crawling RIAs, which is fundamentally more difficult than crawling traditional multi-page web applications. It also presents an efficient RIA crawling strategy and compares it with existing methods.
|
9 |
Enhancing Document Accessibility and User Interaction through Large Language Model: A Comparative Study for Educational Content : A Comparative Analysis of LLM and Traditional Site Search
Umar, Fatima January 2024
This research integrates LLMs with RAG (Retrieval-Augmented Generation) to develop a conversational interface allowing users to post queries and ask questions from a website. It compares the LLM RAG method with traditional site search functionality to determine which method users perceive as better, specifically regarding response quality and response time. The perceived results for response quality and response time were evaluated under the null hypothesis that there is no difference between the two methods. The study showed that the LLM RAG method was perceived as better in terms of response quality, and those results were significant. However, for response time, the traditional site search method was perceived as better, but the results were not significant, so the null hypothesis could not be rejected. Overall, the integration of LLMs with RAG frameworks promises to enhance information retrieval systems on digital platforms.
|
10 |
[en] TEXT MINING AT THE INTELLIGENT WEB CRAWLING PROCESS / [pt] MINERAÇÃO DE TEXTOS NA COLETA INTELIGENTE DE DADOS NA WEB
FABIO DE AZEVEDO SOARES 31 March 2009
This dissertation presents a study of the application of Text Mining to the intelligent web crawling process. The most common way of gathering data on the Web is to use web crawlers. Web crawlers are programs that, once provided with an initial set of URLs (seeds), methodically visit a site, store it on disk, and extract its hyperlinks, which are used for the next visits. Seeking content on the Web in this way, however, is an exhausting and expensive task. An intelligent web crawling process, rather than collecting and storing every accessible web document, analyzes the available crawling options to find links that will probably provide content highly relevant to a topic defined a priori. In the approach proposed in this work, topics are defined not by keywords but by text documents used as examples. Pre-processing techniques from Text Mining, including the use of a thesaurus, then semantically analyze the document submitted as an example. Based on this analysis, the resulting web crawler is guided toward its objective: retrieving information relevant to the document. Starting from seeds or from an automatic query to the available search engines, the crawler analyzes every document retrieved from the Web exactly as in the previous step. Each retrieved document is then compared with the example document. Once the similarity level between the two is obtained, the retrieved document's hyperlinks are analyzed, queued, and later dequeued according to their respective probable degrees of importance. At the end of the data gathering process, another Text Mining technique, Document Clustering, is applied in order to select the most representative documents among the collected texts. Implementing a tool that incorporates the researched heuristics yielded practical results, making it possible to evaluate the performance of the developed techniques and to compare the results obtained with other means of retrieving data on the Web. This work shows that the use of Text Mining is a path worth exploring in the process of retrieving relevant information on the Web.
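The crawl loop described in this abstract (score each retrieved document against the example document, then queue its hyperlinks by probable importance) can be sketched as a priority-driven crawl. This is a minimal illustration under several assumptions, not the dissertation's tool: plain bag-of-words cosine similarity replaces the thesaurus-based semantic analysis, a link's probable importance is taken to be the similarity of the page it was found on, and the "web" is an in-memory dictionary with invented page names so that the example runs without network access.

```python
import heapq
import math
from collections import Counter


def bag(text):
    """Bag-of-words term counts (stand-in for the real pre-processing)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def focused_crawl(web, seeds, example_text, max_pages=10):
    """Visit pages in order of the similarity of the page that linked to them."""
    example = bag(example_text)
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue
        text, links = web[url]
        score = cosine(bag(text), example)
        visited.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


# Hypothetical in-memory web: url -> (page text, outgoing links).
web = {
    "seed": ("web portal index", ["offtopic", "ontopic"]),
    "offtopic": ("cooking recipes pasta", ["recipes2"]),
    "ontopic": ("text mining for web crawling", ["mining2"]),
    "recipes2": ("more pasta recipes", []),
    "mining2": ("web crawling and text mining tools", []),
}
visited = focused_crawl(web, ["seed"], "text mining web crawling", max_pages=4)
order = [url for url, _ in visited]
```

With a budget of four pages, the crawl visits `seed`, `offtopic`, `ontopic`, and `mining2`: the link found on the relevant page overtakes `recipes2`, which was queued earlier from an irrelevant page and is never fetched. The final document clustering step from the abstract is not modeled here.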
|