1

Identifying Search Engine Spam Using DNS

Mathiharan, Siddhartha Sankaran December 2011
Web crawlers encounter both finite and infinite elements during a crawl. Pages and hosts can be generated infinitely using automated scripts and DNS wildcard entries. Ranking such resources is a challenge, as an entire web of pages and hosts can be created to manipulate the rank of a target resource. It is crucial to differentiate genuine content from spam in real time in order to allocate crawl budgets. In this study, ranking algorithms for hosts are designed which use the finite Pay-Level Domains (PLDs) and IPv4 addresses. Heterogeneous graphs derived from the webgraph of IRLbot are used to achieve this. PLD Supporters (PSUPP), the number of level-2 PLD supporters for each host on the host-host-PLD graph, is the first algorithm studied. It is further improved by True PLD Supporters (TSUPP), which uses true egalitarian level-2 PLD supporters on the host-IP-PLD graph together with DNS blacklists. It was found that support from content farms and stolen links could be eliminated by computing TSUPP. When TSUPP was applied to the host graph of IRLbot, there was less than 1% spam in the top 100,000 hosts.
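To make the supporter-counting idea above concrete, the sketch below counts, for a single host, the number of distinct PLDs that reach it within two in-link hops, excluding its own PLD. This is only a minimal illustration of level-2 PLD support on a toy in-memory graph; the thesis' PSUPP/TSUPP algorithms operate on host-host-PLD and host-IP-PLD graphs derived from the IRLbot webgraph and additionally use DNS blacklists, none of which is reproduced here.

```python
from collections import defaultdict

# Illustrative sketch of counting level-2 PLD supporters for one host.
# The tiny in-memory graph and helper names are assumptions for illustration.

def pld_supporters(in_links, host_to_pld, target):
    """Count distinct PLDs that reach `target` within two in-link hops,
    ignoring the target's own PLD (self-support)."""
    own_pld = host_to_pld[target]
    supporters = set()
    for h1 in in_links.get(target, set()):
        supporters.add(host_to_pld[h1])
        for h2 in in_links.get(h1, set()):
            supporters.add(host_to_pld[h2])
    supporters.discard(own_pld)
    return len(supporters)

if __name__ == "__main__":
    # toy host graph: edges point from linking host to linked host
    edges = [("a.x.com", "spam.biz"), ("b.y.org", "a.x.com"),
             ("c.y.org", "spam.biz"), ("d.x.com", "spam.biz")]
    in_links = defaultdict(set)
    for src, dst in edges:
        in_links[dst].add(src)
    host_to_pld = {"a.x.com": "x.com", "d.x.com": "x.com",
                   "b.y.org": "y.org", "c.y.org": "y.org",
                   "spam.biz": "spam.biz"}
    print(pld_supporters(in_links, host_to_pld, "spam.biz"))  # -> 2
```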
2

Scraping Dynamic Websites for Economical Data : A Framework Approach

Legaspi Ramos, Xurxo January 2016
The Internet is a source of live data that is constantly updating with data from almost any field we can imagine. Having tools that can automatically detect these updates and select the information we are interested in is becoming of utmost importance nowadays. That is the reason why, throughout this thesis, we will focus on some economic websites, studying their structures and identifying a common type of website in this field: dynamic websites. Even though there are many tools that allow information to be extracted from the internet, not many tackle this kind of website. For this reason we will study and implement some tools that allow developers to address these pages from a different perspective.
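The abstract does not name the toolchain used in the framework; a common way to scrape such dynamic, JavaScript-rendered pages is to drive a headless browser so that client-side scripts run before extraction. The snippet below is a minimal sketch of that idea using Selenium; the URL and CSS selector are placeholders, not sites or structures from the thesis.

```python
# Minimal sketch of fetching a JavaScript-rendered page with a headless
# browser. The URL and selector are placeholders; the thesis framework
# itself is not reproduced here.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run without a visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/markets")   # placeholder economic-data page
    driver.implicitly_wait(10)                  # allow client-side scripts to render
    rows = driver.find_elements(By.CSS_SELECTOR, "table.quotes tr")  # placeholder selector
    for row in rows:
        print(row.text)
finally:
    driver.quit()
```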
3

Construction de corpus généraux et spécialisés à partir du Web / Ad hoc and general-purpose corpus construction from web sources

Barbaresi, Adrien 19 June 2015
At the beginning of the first chapter the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics is introduced. Then, the notion of corpus is put into focus. Existing corpus and text definitions are discussed, and the need for evidence that is linguistic in nature yet embraces several disciplines is illustrated by a number of research scenarios. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web-native corpora are exposed.
In the second chapter, methodological insights on automated text scrutiny in computer science, computational linguistics and natural language processing are presented. The state of the art on text quality assessment and web text filtering exemplifies current interdisciplinary research trends on web texts. Readability studies and automated text classification are used as a paragon of methods to find salient features in order to grasp text characteristics. Text visualization exemplifies corpus processing in the digital humanities framework. As a conclusion, guiding principles for research practice are listed, and reasons are given to find a balance between quantitative analysis and corpus linguistics, in an environment shaped by technological innovation and artificial intelligence techniques.
Third, current research on web corpora is summarized. I distinguish two main approaches to web document retrieval: restricted retrieval and web crawling. The notion of web corpus preprocessing is introduced and its salient steps are discussed. The impact of the preprocessing phase on research results is assessed. I explain why the importance of preprocessing should not be underestimated and why it is an important task for linguists to learn new skills in order to confront the whole data gathering and preprocessing phase.
I present my work on web corpus construction in the fourth chapter. My analyses concern two main aspects: first, the question of corpus sources (or prequalification), and second, the problem of including valid, desirable documents in a corpus (or document qualification). An approach using a lightweight scout to prepare the crawl is presented. Work on selecting documents just before their inclusion in a corpus is then summarized: insights from readability studies as well as machine-learning techniques can be used during corpus construction, and a set of textual features tested on annotated samples assesses the efficiency of the procedure. Last, I present work on corpus visualization, consisting of extracting certain corpus characteristics in order to give indications on corpus contents and quality.
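As a rough illustration of the document-qualification step mentioned above (readability-inspired textual features used to decide whether a downloaded text belongs in the corpus), the sketch below computes a few surface features and applies simple thresholds. Both the feature set and the thresholds are assumptions made for illustration; they are not the features evaluated on annotated samples in the thesis.

```python
import re

# Illustrative document-qualification filter: compute a few surface features
# (in the spirit of readability studies) and keep texts inside plausible
# bounds. Features and thresholds are assumptions for illustration only.

def text_features(text):
    tokens = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not tokens or not sentences:
        return None
    return {
        "n_tokens": len(tokens),
        "avg_sentence_len": len(tokens) / len(sentences),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }

def qualifies(text, min_tokens=200, max_avg_sent=60, min_ttr=0.2):
    f = text_features(text)
    return (f is not None
            and f["n_tokens"] >= min_tokens
            and f["avg_sentence_len"] <= max_avg_sent
            and f["type_token_ratio"] >= min_ttr)

if __name__ == "__main__":
    print(qualifies("Buy now! " * 300))   # repetitive boilerplate -> False
```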
4

M-crawler: Crawling Rich Internet Applications Using Menu Meta-model

Choudhary, Suryakant 27 July 2012
Web applications have come a long way, both in terms of their adoption to provide information and services and in terms of the technologies used to develop them. With the emergence of richer and more advanced technologies such as Ajax, web applications have become more interactive, responsive and user-friendly. These applications, often called Rich Internet Applications (RIAs), changed traditional web applications in two primary ways: dynamic manipulation of client-side state and asynchronous communication with the server. At the same time, such techniques also introduce new challenges. Among these challenges, an important one is the difficulty of automatically crawling these new applications. Crawling is not only important for indexing content but also critical to web application assessment, such as testing for security vulnerabilities or accessibility. Traditional crawlers are no longer sufficient for these newer technologies, and crawling of RIAs is either nonexistent or far from perfect. There is a need for an efficient crawler for web applications developed using these new technologies. Further, as more and more enterprises use these technologies to provide their services, the requirement for a better crawler becomes inevitable. This thesis studies the problems associated with crawling RIAs. Crawling RIAs is fundamentally more difficult than crawling traditional multi-page web applications. The thesis also presents an efficient RIA crawling strategy and compares it with existing methods.
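Because RIAs change client-side state through events rather than URLs, their crawling is often framed as exploring a graph of application states reached by firing events. The sketch below shows that general framing with a breadth-first strategy over a toy application object; the `get_state`, `enabled_events`, `reset_to` and `fire` methods stand in for real browser automation, and the strategy shown is an assumption, not the menu meta-model strategy proposed in the thesis.

```python
from collections import deque

# Schematic view of RIA crawling as exploration of client-side states
# reached by firing events (clicks, Ajax callbacks) rather than URLs.

def crawl_ria(app, max_states=1000):
    start = app.get_state()                  # hash/serialize current DOM
    seen = {start}
    frontier = deque([start])
    edges = []                               # (state, event, next_state) transitions
    while frontier and len(seen) < max_states:
        state = frontier.popleft()
        for event in app.enabled_events(state):
            app.reset_to(state)              # replay path back to this state
            next_state = app.fire(event)     # execute the event, observe new state
            edges.append((state, event, next_state))
            if next_state not in seen:
                seen.add(next_state)
                frontier.append(next_state)
    return seen, edges

class MockApp:
    """Toy stand-in for a browser-automation layer over a small RIA."""
    TRANSITIONS = {("home", "open_menu"): "menu",
                   ("menu", "click_reports"): "reports",
                   ("menu", "click_settings"): "settings"}
    def __init__(self):
        self.state = "home"
    def get_state(self):
        return self.state
    def reset_to(self, state):
        self.state = state
    def enabled_events(self, state):
        return [e for (s, e) in self.TRANSITIONS if s == state]
    def fire(self, event):
        self.state = self.TRANSITIONS[(self.state, event)]
        return self.state

if __name__ == "__main__":
    states, edges = crawl_ria(MockApp())
    print(sorted(states))   # ['home', 'menu', 'reports', 'settings']
```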
5

Topic-Oriented Collaborative Web Crawling

Chung, Chiasen January 2001
A web crawler is a program that "walks" the Web to gather web resources. In order to scale to the ever-increasing Web, multiple crawling agents may be deployed in a distributed fashion to retrieve web data co-operatively. A common approach is to divide the Web into many partitions, with an agent assigned to crawl within each one. If an agent obtains a web resource that is not from its partition, the resource is transferred to its rightful owner. This thesis proposes a novel approach to distributed web data gathering by partitioning the Web into topics. The proposed approach employs multiple focused crawlers to retrieve pages from various topics. When a crawler retrieves a page of another topic, it transfers the page to the appropriate crawler. This approach is known as topic-oriented collaborative web crawling. An implementation of the system was built and experimentally evaluated. In order to identify the topic of a web page, a topic classifier was incorporated into the crawling system. As the classifier categorizes only English pages, a language identifier was also introduced to distinguish English pages from non-English ones. From the experimental results, we found that redundant retrieval was low and that a resource retrieved by an agent is six times more likely to be retained than in a system that uses a conventional hashing approach. These numbers were viewed as strong indications that a topic-oriented collaborative web crawling system is a viable approach to web data gathering.
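The core routing idea can be sketched as follows: each agent owns one topic, classifies every page it fetches, and forwards pages whose topic it does not own to the responsible agent. The keyword-count classifier below is a deliberately trivial stand-in for the topic classifier and language identifier used in the thesis; the topic names and keywords are illustrative assumptions.

```python
# Sketch of page routing in topic-oriented collaborative web crawling:
# each agent owns one topic; a fetched page is classified and handed to
# the owning agent when the topics differ.

TOPIC_KEYWORDS = {
    "sports": {"match", "league", "score", "team"},
    "finance": {"stock", "market", "earnings", "bond"},
}

def classify(text):
    """Trivial keyword-overlap classifier (illustrative stand-in only)."""
    words = set(text.lower().split())
    scores = {t: len(words & kw) for t, kw in TOPIC_KEYWORDS.items()}
    return max(scores, key=scores.get)

class Agent:
    def __init__(self, topic):
        self.topic = topic
        self.pages = []                        # pages this agent is responsible for
    def receive(self, url, text, registry):
        topic = classify(text)
        if topic == self.topic:
            self.pages.append(url)             # keep: page belongs to our partition
        else:
            registry[topic].pages.append(url)  # forward to the rightful owner

if __name__ == "__main__":
    registry = {t: Agent(t) for t in TOPIC_KEYWORDS}
    registry["sports"].receive("http://example.com/q3-earnings",
                               "stock market earnings beat estimates", registry)
    print(registry["finance"].pages)   # page was forwarded to the finance agent
```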
6

[en] Text Mining in the Intelligent Web Crawling Process / [pt] Mineração de Textos na Coleta Inteligente de Dados na Web

Soares, Fabio de Azevedo 31 March 2009
[en] This dissertation presents a study on the use of Text Mining as part of an intelligent web crawling process. The most usual way of gathering data on the Web is to use web crawlers: programs that, once provided with an initial set of URLs (seeds), start the methodical procedure of visiting a site, storing it on disk and extracting the hyperlinks that will be used for the next visits. Seeking content in this way, however, is an expensive and exhausting task. An intelligent web crawling process, rather than collecting and storing every accessible web document, analyses the available crawling options to find links that will probably provide content highly relevant to a topic defined a priori. In the approach proposed in this work, topics are defined not by keywords but by the use of text documents as examples. Next, pre-processing techniques used in Text Mining, including the use of a thesaurus, semantically analyse the document submitted as an example. Based on this analysis, the web crawler is guided toward its objective: retrieving information relevant to the document. Starting from seeds or from automatic queries to the available search engines, the crawler analyses, exactly as in the previous step, every document retrieved from the Web. A comparison is then carried out between each retrieved document and the example document. Once the similarity level between them is obtained, the retrieved document's hyperlinks are analysed, queued and later dequeued according to their probable degree of importance. At the end of the data-gathering process, another Text Mining technique is applied in order to select the most representative documents of the collection: Document Clustering. The implementation of a tool incorporating the researched heuristics made it possible to obtain practical results, to evaluate the performance of the developed techniques and to compare the results with other means of retrieving data from the Web. This work shows that the use of Text Mining is a path worth exploring in the process of retrieving relevant information from the Web.
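A minimal sketch of the similarity step that guides such a focused crawler is shown below: the example document and each fetched page are turned into TF-IDF vectors, and the page's outgoing links are queued with a priority derived from their cosine similarity. The use of scikit-learn and the specific scoring scheme are assumptions for illustration; the thesaurus-based preprocessing and the final Document Clustering stage described in the dissertation are not reproduced.

```python
# Sketch of similarity-guided link prioritization for a focused crawler:
# score a fetched page against the example document and queue its links
# with that score as priority.
import heapq
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity(example_doc, page_text):
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([example_doc, page_text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def enqueue_links(frontier, page_text, links, example_doc):
    """Push a page's outgoing links, prioritized by the page's similarity."""
    score = similarity(example_doc, page_text)
    for url in links:
        heapq.heappush(frontier, (-score, url))   # max-priority via negated score

if __name__ == "__main__":
    example = "web crawler retrieval of relevant documents using text mining"
    frontier = []
    enqueue_links(frontier, "a crawler that mines text to retrieve documents",
                  ["http://example.com/a"], example)
    enqueue_links(frontier, "recipe for chocolate cake with strawberries",
                  ["http://example.com/b"], example)
    print(heapq.heappop(frontier)[1])   # the more similar page's link comes first
```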
7

Seleção de valores para preenchimento de formulários web / Selection of values for form filling

Moraes, Tiago Guimarães January 2013
Traditional search engines crawl Web pages through HTML links. However, the largest part of the Web is invisible to these crawlers; the portion of the Web that is not reached is called the hidden Web. An enormous quantity of structured data, of higher quality than that of the traditional Web, is available behind search interfaces: the forms that are the entry points to the hidden Web. This part of the Web is difficult for search engines to access, because filling in the forms correctly is a major challenge: they are built for human manipulation and show great variability and diversity of domains and languages. The challenge is to select the correct values for the form fields with a small number of submissions that still covers most of the database behind the form. Several works have proposed methods to search the hidden Web, but most of them have severe limitations for an application that surfaces the entire Web in a horizontal and automatic way. The main limitations are the dependence on prior information about the form domains, the failure to handle all the field types a form may present, and the correct selection of a subgroup of the set of all form-filling possibilities. The present work introduces a generic architecture for automatic form filling. The main contribution of this architecture is the selection of values for form submission through the ITP (Instance Template Pruning) method. Many forms present an infeasible number of filling possibilities when the values of their fields are combined; the ITP method can drastically reduce this number, since many candidate queries can be pruned as submissions are made and knowledge about the form is acquired. The experiments performed show that the proposed method is superior to the baseline, the method that represents the state of the art. The proposed method can also be combined with other methods to obtain an effective search of the hidden Web: experiments combining ITP with the baseline likewise produced good results.
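As a rough illustration of the pruning idea behind ITP, the sketch below enumerates field-value combinations, submits them, and stops expanding values that have ceased to yield new records. The pruning rule is deliberately crude and `submit_form` is a placeholder for real HTTP submission; this is only the general template-pruning intuition under stated assumptions, not the ITP method itself.

```python
from itertools import product

# Illustrative sketch of pruning form-filling possibilities: enumerate
# candidate field-value combinations, submit them, and skip combinations
# containing values whose submissions stopped yielding new records.

def crawl_form(fields, submit_form, max_submissions=100):
    seen_records = set()
    dead_values = set()                       # (field, value) pairs pruned out
    submissions = 0
    names = sorted(fields)
    for combo in product(*(fields[n] for n in names)):
        assignment = dict(zip(names, combo))
        if any((n, v) in dead_values for n, v in assignment.items()):
            continue                          # pruned: skip this combination
        if submissions >= max_submissions:
            break
        submissions += 1
        records = set(submit_form(assignment))
        new = records - seen_records
        seen_records |= records
        if not new:                           # no new coverage: prune these values
            dead_values.update(assignment.items())
    return seen_records, submissions

if __name__ == "__main__":
    # toy "database" behind the form, keyed by (make, year)
    db = {("ford", "2012"): {"r1", "r2"}, ("ford", "2013"): {"r3"},
          ("bmw", "2012"): set(), ("bmw", "2013"): set()}
    fields = {"make": ["ford", "bmw"], "year": ["2012", "2013"]}
    records, n = crawl_form(fields, lambda a: db[(a["make"], a["year"])])
    print(len(records), "records in", n, "submissions")   # 3 records in 3 submissions
```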
