11

Pokročilý robot na procházení webu / Advanced Web Crawler

Činčera, Jaroslav January 2010 (has links)
This Master's thesis describes the design and implementation of an advanced web crawler. The crawler can be configured by the user and is designed to browse the web according to specified parameters, acquiring and evaluating the content of web pages. It is configured by creating projects, which consist of different types of steps: the user can define simple actions such as downloading a page or submitting a form, or build larger and more complex projects.
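The step-based project model described in the abstract could look roughly like the following sketch, where the step classes, their names, and the shared context dictionary are illustrative assumptions rather than the thesis's actual design.

```python
# Hypothetical sketch of a "project of steps" crawler configuration.
import requests

class DownloadStep:
    def __init__(self, url):
        self.url = url

    def run(self, session, context):
        # Fetch the page and store its HTML in the shared project context.
        response = session.get(self.url, timeout=10)
        context["html"] = response.text
        return context

class FormSubmitStep:
    def __init__(self, url, form_data):
        self.url = url
        self.form_data = form_data

    def run(self, session, context):
        # Submit a form and store the resulting page.
        response = session.post(self.url, data=self.form_data, timeout=10)
        context["html"] = response.text
        return context

class Project:
    """A crawling project is an ordered list of steps sharing one session."""

    def __init__(self, steps):
        self.steps = steps

    def execute(self):
        context = {}
        with requests.Session() as session:
            for step in self.steps:
                context = step.run(session, context)
        return context

# Example: download a page, then submit a search form on the same site.
project = Project([
    DownloadStep("https://example.org"),
    FormSubmitStep("https://example.org/search", {"q": "web crawler"}),
])
result = project.execute()
```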
12

Automatizované zhromažďovanie a štrukturalizácia dát z webových zdrojov / Automated Collection and Structuring of Data from Web Sources

Zahradník, Roman January 2018 (has links)
This diploma thesis deals with the creation of a solution for continuous data acquisition from web sources. The application is in charge of automatically navigating web pages, extracting data using dedicated selectors, and subsequently standardizing the data for further processing and data mining.
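A minimal sketch of selector-driven extraction with standardization of the captured values might look like this; the field names, CSS selectors, and normalizers are invented for illustration and are not taken from the thesis.

```python
# Illustrative selector-driven extraction and normalization.
from datetime import datetime
from bs4 import BeautifulSoup
import requests

# Each field is described by a dedicated CSS selector plus a normalizer that
# converts the raw text into a standard representation.
FIELDS = {
    "title": ("h1.product-title", str.strip),
    "price": ("span.price",
              lambda s: float(s.replace("€", "").replace(",", ".").strip())),
    "date": ("time.published",
             lambda s: datetime.fromisoformat(s.strip()).date().isoformat()),
}

def extract_record(url):
    """Fetch a page and return one standardized record."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    record = {}
    for name, (selector, normalize) in FIELDS.items():
        node = soup.select_one(selector)
        record[name] = normalize(node.get_text()) if node else None
    return record
```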
13

A Distributed Approach to Crawl Domain Specific Hidden Web

Desai, Lovekeshkumar 03 August 2007 (has links)
A large amount of on-line information resides on the invisible web: web pages generated dynamically from databases and other data sources that are hidden from current crawlers, which retrieve content only from the publicly indexable Web. In particular, such crawlers ignore the tremendous amount of high-quality content "hidden" behind search forms and behind pages that require authorization or prior registration in large searchable electronic databases. To extract data from the hidden web, it is necessary to find the search forms and fill them with appropriate information to retrieve the maximum amount of relevant information. To meet the complex challenges that arise when searching the hidden web, which requires substantial analysis of both the search forms and the retrieved information, it becomes essential to design and implement a distributed web crawler that runs on a network of workstations. We describe the software architecture of this distributed and scalable system and present a number of novel techniques that went into its design and implementation to extract the maximum amount of relevant data from the hidden web while achieving high performance.
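The core step of querying the hidden web, finding a search form and filling it with a keyword, could be sketched as follows; the heuristics and helper names are assumptions, and a production crawler would also handle POST forms, authentication, and result pagination.

```python
# Minimal sketch of discovering and filling a search form with lxml.
import requests
from lxml import html

def find_search_forms(page_url):
    """Return (form_action, text_input_names) pairs for forms on a page."""
    tree = html.fromstring(requests.get(page_url, timeout=10).content)
    forms = []
    for form in tree.xpath("//form"):
        action = form.get("action") or page_url
        text_inputs = form.xpath(".//input[@type='text' or @type='search']/@name")
        if text_inputs:
            forms.append((requests.compat.urljoin(page_url, action), text_inputs))
    return forms

def query_hidden_database(page_url, keyword):
    """Fill the first discovered search form with a keyword and fetch results."""
    for action, inputs in find_search_forms(page_url):
        response = requests.get(action, params={inputs[0]: keyword}, timeout=10)
        return response.text
    return None
```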
14

Deal Organizer : personalized alerts for shoppers

Reyes, Ulises Uriel 27 November 2012 (has links)
Deal Organizer is a web-based application that scans multiple websites for online bargains. It allows users to specify their preferences so that they receive notifications based on personalized content. The application obtains deals from other websites through data-extraction techniques that include reading RSS feeds and web scraping. To better support content personalization, the application tracks each user's activity by recording clicks on both links to deals and rating buttons, e.g., the Facebook Like button. Because of the dynamic nature of the source websites offering these deals and the ever-evolving web technologies available to software developers, Deal Organizer was built around an interface-based design using the Spring Framework. This yielded an extensible, pluggable, and flexible system that accommodates maintenance and future work gracefully. The application's performance was evaluated by executing resource-intensive tests in a constrained environment, and the results show that the application responded positively.
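The RSS-based part of the deal collection might be sketched as below, assuming the feedparser library; the feed URLs and keyword matching are hypothetical and stand in for the application's actual personalization logic.

```python
# Hedged sketch of collecting deals from RSS feeds and matching user keywords.
import feedparser

DEAL_FEEDS = [
    "https://example-deals.com/rss",      # hypothetical source site
    "https://another-bargains.org/feed",  # hypothetical source site
]

def collect_deals(user_keywords):
    """Return feed entries whose titles match any of the user's keywords."""
    matches = []
    for feed_url in DEAL_FEEDS:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            title = entry.get("title", "")
            if any(kw.lower() in title.lower() for kw in user_keywords):
                matches.append({"title": title, "link": entry.get("link")})
    return matches

# A user who asked to be alerted about laptops and headphones:
alerts = collect_deals(["laptop", "headphones"])
```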
15

Veiksmų ontologijos formavimas panaudojant internetinį tekstyną / Building action ontology using internet corpus

Matulionis, Paulius 20 June 2012 (has links)
The goal of this Master's thesis is to investigate the problem of automated action-ontology design using a corpus harvested from the internet. Corpus markup standards were analysed, and a software package, including tools for internet corpus harvesting, network service access, markup, ontology design, and representation, was developed and tested in the carried-out experiments. A process management system was realized covering both the front-end and back-end design levels, and detailed system and component models are presented that reflect the complete operation of the system. The thesis presents the results of experiments on building ontologies for several selected action verbs: the ontology-building process is described, problems in recognizing separate elements of the action environment are analysed, and suggestions for additional rules leading to more accurate results are given. The rules used to obtain the required data from the collected corpus were summarized and integrated into the designed software package in a form the developed tools can process.
16

An Indexation and Discovery Architecture for Semantic Web Services and its Application in Bioinformatics

Yu, Liyang 09 June 2006 (has links)
Recently, much research effort has been devoted to the discovery of relevant Web services. It is widely recognized that adding semantics to service descriptions is the solution to this challenge; Web services with explicit semantic annotation are called Semantic Web Services (SWS). This research proposes an indexation and discovery architecture for SWS, together with a prototype application in the area of bioinformatics. In this approach, an SWS repository is created and maintained by crawling both ontology-oriented UDDI registries and Web sites that host SWS. For a given service request, the proposed system invokes the matching algorithm and returns a candidate set in which different degrees of matching are considered. This approach adds flexibility to the current industry standards by offering more choices to both service requesters and publishers. The prototype developed in this research also shows the value that can be added by using SWS in application areas such as bioinformatics.
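A toy sketch of returning candidates with different degrees of matching is shown below; the concept hierarchy, the degree names (exact, plugin, subsumes, fail), and the repository format are assumptions in the spirit of common SWS matchmaking, not the thesis's actual algorithm.

```python
# Hedged sketch of degree-of-match ranking over a small concept hierarchy.
SUBCLASS_OF = {
    "ProteinSequence": "Sequence",
    "DNASequence": "Sequence",
    "Sequence": "BioData",
}

def is_subconcept(child, parent):
    # Walk up the (assumed) subclass chain from child towards the root.
    while child is not None:
        if child == parent:
            return True
        child = SUBCLASS_OF.get(child)
    return False

def match_degree(requested_output, advertised_output):
    if requested_output == advertised_output:
        return "exact"
    if is_subconcept(advertised_output, requested_output):
        return "plugin"      # service output is more specific than requested
    if is_subconcept(requested_output, advertised_output):
        return "subsumes"    # service output is more general than requested
    return "fail"

def discover(requested_output, repository):
    """Rank repository entries (name, output concept) by matching degree."""
    order = {"exact": 0, "plugin": 1, "subsumes": 2}
    candidates = [(name, match_degree(requested_output, out))
                  for name, out in repository]
    return sorted((c for c in candidates if c[1] != "fail"),
                  key=lambda c: order[c[1]])

repo = [("BlastService", "ProteinSequence"), ("GenericFetch", "BioData")]
print(discover("Sequence", repo))  # [('BlastService', 'plugin'), ('GenericFetch', 'subsumes')]
```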
17

Uma abordagem para captura automatizada de dados abertos governamentais / An Approach for Automated Capture of Open Government Data

Ferreira, Juliana Sabino 07 November 2017 (has links)
Open government data currently play a fundamental role in public transparency, besides being a legal obligation. However, most of these data are published in diverse formats, isolated and independent, which makes their reuse by third-party systems difficult. This work proposes an approach for capturing open government data in an automated way, allowing their reuse in other applications. To that end, a Web Crawler was built to capture and store open government data (Dados Abertos Governamentais, DAG), together with the DAG Prefeituras API, which makes these data available in JSON format so that other developers can easily use them in their applications. An evaluation of the API was also performed with developers of different levels of experience.
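A minimal sketch of how crawled records could be exposed in JSON is given below, assuming Flask and a SQLite store filled by the crawler; the route, table, and columns are illustrative and do not reproduce the actual DAG Prefeituras API.

```python
# Minimal sketch of a JSON endpoint over a crawler-populated SQLite database.
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)
DB_PATH = "open_data.db"  # hypothetical database written by the crawler

@app.route("/api/expenses/<city>")
def expenses(city):
    # Return the crawled expense records for a city as JSON.
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT date, description, amount FROM expenses WHERE city = ?",
        (city,),
    ).fetchall()
    conn.close()
    return jsonify([dict(row) for row in rows])

if __name__ == "__main__":
    app.run(port=5000)
```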
18

Characterizing the Third-Party Authentication Landscape : A Longitudinal Study of how Identity Providers are Used in Modern Websites / Longitudinella mätningar av användandet av tredjepartsautentisering på moderna hemsidor

Josefsson Ågren, Fredrik, Järpehult, Oscar January 2021 (has links)
Third-party authentication services are becoming more common since they ease the login procedure by not forcing users to create a new login for every website that uses authentication. Even though this simplifies the login procedure, users still have to be conscious of what data is being shared between the identity provider (IDP) and the relying party (RP). This thesis presents a tool for collecting data about third-party authentication that outperforms previously developed tools with regard to accuracy, precision, and recall. The developed tool was used to collect information about third-party authentication on a set of websites. The collected data revealed that the third-party login services offered by Facebook and Google are the most common and that Twitter's login service is significantly less common. Twitter's login service shares the most data about users with the RPs and often gives the RPs permission to perform write actions on the user's Twitter account. In addition to our large-scale automatic data collection, three manual data collections were performed and compared with previously made manual data collections spanning a nine-year period. The longitudinal comparison showed that over this period the login services offered by Facebook and Google have been dominant. It is clear that less information about users is being shared today compared to earlier years for Apple, Facebook, and Google. The Twitter login service is the only IDP that has not changed its permission policies, which could be the reason why the use of the Twitter login service on websites has decreased. The results presented in this thesis help provide a better understanding of what personal information is exchanged by IDPs, which can guide users to make well-educated decisions on the web.
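A hedged sketch of how a crawler might flag which identity providers a page offers is shown below; the OAuth URL patterns are simplified assumptions, and the thesis tool's detection logic is considerably more thorough.

```python
# Simplified IDP detection by scanning a page's HTML for known OAuth endpoints.
import re
import requests

IDP_PATTERNS = {
    "Google": re.compile(r"accounts\.google\.com/o/oauth2"),
    "Facebook": re.compile(r"facebook\.com/(v[\d.]+/)?dialog/oauth"),
    "Twitter": re.compile(r"(api\.)?twitter\.com/oauth"),
    "Apple": re.compile(r"appleid\.apple\.com/auth/authorize"),
}

def detect_idps(url):
    """Return the set of known IDPs referenced in a page's HTML."""
    html_text = requests.get(url, timeout=10).text
    return {name for name, pattern in IDP_PATTERNS.items()
            if pattern.search(html_text)}

print(detect_idps("https://example.com/login"))
```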
19

Generic Data Harvester

Asp, William, Valck, Johannes January 2022 (has links)
This report describes the process of developing a generic article scraper that extracts relevant information from an arbitrary web article. The extraction is implemented by searching and examining the HTML of the article using Python and XPath. The data to be extracted are the title, summary, publication date, and body text of the article. Since there is no standard way in which websites, and news articles in particular, are built, the extraction needs to be adapted to each different article structure and language. The resulting program provides a proof-of-concept method of extracting the data and shows that further development is possible. The thesis host company, Acuminor, works with financial crime intelligence and collects information through articles and reports. To scale up the data collection and minimize the maintenance of the scraping programs, a general article scraper is needed. An open-source alternative called Newspaper exists, but since it is no longer maintained and is arguably not well designed, an internal implementation could be beneficial for the company. The program consists of a main class that imports extractor classes exposing an API for extracting the data. Each extractor is decoupled from the rest in order to keep the program as modular as possible. The extraction of the title, summary, and date is similar, with the extractors looking for specific HTML tags that carry some common attribute that most websites implement. The text extraction is implemented using a tree built from the existing text on the page, which is then searched for the node most likely to contain only the body text, using attributes such as the amount of text, depth, and number of text nodes. The resulting program does not match the performance of Newspaper but shows promising results in every part of the extraction. The text extraction is very slow and often takes too much text from the article, but it provides a good blueprint for further improvement at the company, giving Acuminor an in-house article extractor that suits its wants and needs.
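The body-text heuristic described above, scoring candidate nodes by the amount of text they hold, the number of text-bearing descendants, and their depth, might be sketched as follows; the weights are invented for illustration and differ from the thesis's actual scoring.

```python
# Simplified body-text extraction: score container nodes and keep the best one.
from lxml import html

def text_len(node):
    return len(" ".join(node.itertext()).strip())

def depth(node):
    d = 0
    while node.getparent() is not None:
        node = node.getparent()
        d += 1
    return d

def extract_body_text(raw_html):
    tree = html.fromstring(raw_html)
    best_node, best_score = None, float("-inf")
    for node in tree.iter("div", "article", "section"):
        paragraphs = node.findall(".//p")
        if not paragraphs:
            continue
        score = (text_len(node)            # favour text-rich containers
                 + 50 * len(paragraphs)    # favour many paragraph nodes
                 - 20 * depth(node))       # penalise deeply nested nodes
        if score > best_score:
            best_node, best_score = node, score
    return " ".join(best_node.itertext()).strip() if best_node is not None else ""
```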
