Global ETD Search

11	Link Extraction for Crawling Flash on the Web Antelius, Daniel January 2015 (has links) The set of web pages not reachable using conventional web search engines is usually called the hidden or deep web. One client-side hurdle for crawling the hidden web is Flash files. This thesis presents a tool for extracting links from Flash files up to version 8 to enable web crawling. The files are both parsed and selectively interpreted to extract links. The purpose of the interpretation is to simulate the normal execution of Flash in the Flash runtime of a web browser. The interpretation is a low-level approach that allows the extraction to occur offline and without automating web browsers. A virtual machine is implemented, and a set of limitations is chosen to reduce development time and maximize the coverage of interpreted byte code. Out of a test set of about 3500 randomly sampled Flash files, the link extractor found links in 34% of the files. The resulting estimated improvement in web search engine coverage is almost 10%. Flash crawling spidering deep web hidden web virtual machine interpretation Computer Sciences Datavetenskap (datalogi)
12	Exploring the Hidden Web Papsdorf, Christian 14 June 2017 (has links) The research project “Exploring the Hidden Web. Use, features and specific character of anonymous communication on the Internet”, part of the VolkswagenStiftung funding initiative “Off the beaten track” (“Offen - für Außergewöhnliches”), started from four central questions. First, what is communicated about on the Hidden Web? Second, which media are used for this communication? Third, how is the trust necessary for interaction established under conditions of anonymity? Fourth, for each of these three aspects, which differences, commonalities and interfaces exist with freely accessible media, commonly referred to as the Internet (“Clearnet”)? These questions were examined using an explorative, qualitative approach.:1 Introduction 2 Methodological approach 3 Results 4 Discussion info:eu-repo/classification/ddc/300 ddc:300 info:eu-repo/classification/ddc/301 ddc:301 Internet; Trust; Communication
13	What is the Hidden Web?: The development, characteristics and social significance of anonymous communication on the hidden web Papsdorf, Christian 27 April 2016 (has links) More than two-and-a-half million people currently use the Tor network to communicate anonymously via the Internet and to gain access to online media that are not accessible using standard Internet technology. This sphere of communication can be described as the hidden web. In part because this phenomenon is very recent, the subject has scarcely been studied in the social sciences. The purpose of this paper is therefore to answer four fundamental questions: What is the hidden web? What characterises the communication sphere of the hidden web in contrast to the “normal Internet”? Which reasons can be identified to explain the development of the hidden web as a new communication sphere? And, finally, what is the social significance of the hidden web?:1 Introduction 2 Linguistic differentiation of the hidden web and an overview of the literature 3 Characteristics of communication via the hidden web 4 The creation of the hidden web as a response to the development of the visible web 5 The social significance of the hidden web 6 Summary and prospects info:eu-repo/classification/ddc/300 ddc:300 info:eu-repo/classification/ddc/301 ddc:301
14	A Distributed Approach to Crawl Domain Specific Hidden Web Desai, Lovekeshkumar 03 August 2007 (has links) A large amount of online information resides on the invisible web: web pages generated dynamically from databases and other data sources hidden from current crawlers, which retrieve content only from the publicly indexable Web. Specifically, these crawlers ignore the tremendous amount of high-quality content "hidden" behind search forms in large searchable electronic databases, as well as pages that require authorization or prior registration. To extract data from the hidden web, it is necessary to find the search forms and fill them with appropriate information so as to retrieve as much relevant information as possible. Searching the hidden web poses complex challenges, including extensive analysis of both the search forms and the retrieved information, so it becomes essential to design and implement a distributed web crawler that runs on a network of workstations. We describe the software architecture of this distributed, scalable system and present a number of novel techniques that went into its design and implementation to extract the maximum relevant data from the hidden web while achieving high performance. Deep Web Breadth-first crawler Search spider Distributed Web crawler task-specific and Domain Specific Hidden Web Content Extraction Computer Sciences
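The first step named in entry 14, finding the search forms on a page, might be sketched as follows. This is only an illustration using Python's standard `html.parser`: the heuristic (a form with at least one text-like input and no password field) and the names `SearchFormFinder` and `find_search_forms` are assumptions made here, not the technique described in the thesis.

```python
from html.parser import HTMLParser


class SearchFormFinder(HTMLParser):
    """Collects forms that look like search interfaces (hypothetical heuristic)."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per candidate search form
        self._current = None   # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "text_fields": [],
                "has_password": False,
            }
        elif tag == "input" and self._current is not None:
            # An <input> without a type attribute defaults to a text field.
            input_type = (attrs.get("type") or "text").lower()
            if input_type in ("text", "search"):
                self._current["text_fields"].append(attrs.get("name") or "")
            elif input_type == "password":
                self._current["has_password"] = True

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            form, self._current = self._current, None
            # Assumed heuristic: keep forms with free-text input and no
            # password field, which filters out login forms.
            if form["text_fields"] and not form["has_password"]:
                self.forms.append(form)


def find_search_forms(html):
    """Return the candidate search forms found in an HTML string."""
    finder = SearchFormFinder()
    finder.feed(html)
    return finder.forms
```

A real crawler would also need to handle `select` and `textarea` elements and forms built by scripts; the sketch covers only static `input` fields.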
15	Towards completely automatized HTML form discovery on the web Moraes, Maurício Coutinho January 2013 (has links) The forms discovered by our proposal can be directly used as training data by some form classifiers. Our experimental validation used thousands of real Web forms, divided into six domains, including a representative subset of the publicly available DeepPeep form base (DEEPPEEP, 2010; DEEPPEEP REPOSITORY, 2011). Our results show that it is feasible to mitigate the demanding manual work required by two cutting-edge form classifiers (i.e., GFC and DSFC (BARBOSA; FREIRE, 2007a)), at the cost of a relatively small loss in effectiveness. Information retrieval; HTML (markup language); Web services; Databases; Deep web; Hidden web; Crawling; Domain-specific search; Query form discovery
16	Seleção de valores para preenchimento de formulários web / Selection of values for form filling Moraes, Tiago Guimarães January 2013 (has links) Traditional search engines crawl Web pages by following HTML links. However, most of the Web is not reached by these techniques; this unreached portion is called the hidden Web. An enormous quantity of structured data, of higher quality than that found on the traditional Web, is available behind search interfaces: the forms that are the entry points to the hidden Web. This part of the Web is difficult for search engines to access because filling in forms correctly is a major challenge, since forms are built for human use and vary widely in language and domain. The challenge is to select the correct values for the form fields so that a small number of submissions covers most of the database behind the form. Several works have proposed methods for searching the hidden Web, but most have significant limitations for an application that surfaces the entire Web in a horizontal and automatic way. The main limitations are the dependence on prior information about the form domain, the failure to handle all form field types, and the problem of correctly selecting a subset of all possible form fillings. This work presents a generic architecture for automatic form filling. Its main contribution is the selection of values for form submission through the ITP (Instance Template Pruning) method. Many forms have an infeasible number of possible fillings when the values of all fields are combined. The ITP method can drastically reduce this number: many possible queries can be pruned as submissions are made and knowledge about the form is obtained. The experiments indicate that the ITP method is superior to the baseline, which represents the state of the art. The proposed method can also be used together with other methods to search the hidden Web effectively; accordingly, experiments combining ITP with the baseline also produced good results. Databases; Software development; Web services; Hidden web crawling; Deep web crawling; Automatic filling forms; Automatic query selection
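The abstract of entry 16 gives no algorithmic detail for ITP, so the following is only a sketch of the general idea it states: pruning candidate queries as submissions reveal information about the form. It assumes that form fields act as conjunctive filters, so that any query extending one that returned no results can be skipped. The function name `prune_and_enumerate`, the `submit` callback and the bottom-up enumeration order are hypothetical and are not taken from the thesis.

```python
from itertools import combinations, product


def prune_and_enumerate(fields, submit):
    """Submit value assignments over growing subsets of fields ("templates"),
    skipping any assignment ("instance") that extends one known to be empty.

    fields: dict mapping field name -> list of candidate values
    submit: callable taking a dict assignment and returning a result count
    Returns (productive assignments, number of submissions actually made).
    """
    empty = []        # assignments that returned zero results
    productive = []   # assignments that returned at least one result
    submissions = 0
    names = sorted(fields)
    for size in range(1, len(names) + 1):
        for template in combinations(names, size):
            for values in product(*(fields[name] for name in template)):
                instance = dict(zip(template, values))
                # Assumption: fields are conjunctive filters, so adding a
                # constraint to an empty query cannot produce results.
                if any(e.items() <= instance.items() for e in empty):
                    continue
                submissions += 1
                if submit(instance) == 0:
                    empty.append(instance)
                else:
                    productive.append(instance)
    return productive, submissions
```

With two fields of two values each, exhaustive enumeration would issue eight submissions; in a toy database where one value of each field matches nothing, the sketch issues five, because the three two-field combinations containing an empty value are never submitted.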

Search results