About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Effective web crawlers

Ali, Halil, hali@cs.rmit.edu.au January 2008 (has links)
Web crawlers are the component of a search engine that must traverse the Web, gathering documents in a local repository for indexing by a search engine so that they can be ranked by their relevance to user queries. Whenever data is replicated in an autonomously updated environment, there are issues with maintaining up-to-date copies of documents. When documents are retrieved by a crawler and have subsequently been altered on the Web, the effect is an inconsistency in user search results. While the impact depends on the type and volume of change, many existing algorithms do not take the degree of change into consideration, instead using simple measures that consider any change as significant. Furthermore, many crawler evaluation metrics do not consider index freshness or the amount of impact that crawling algorithms have on user results. Most of the existing work makes assumptions about the change rate of documents on the Web, or relies on the availability of a long history of change. Our work investigates approaches to improving index consistency: detecting meaningful change, measuring the impact of a crawl on collection freshness from a user perspective, developing a framework for evaluating crawler performance, determining the effectiveness of stateless crawl ordering schemes, and proposing and evaluating the effectiveness of a dynamic crawl approach. Our work is concerned specifically with cases where there are few or no past change statistics with which predictions can be made. Our work analyses different measures of change and introduces a novel approach to measuring the impact of recrawl schemes on search engine users. Our schemes detect important changes that affect user results. Other well-known and widely used schemes have to retrieve around twice the data to achieve the same effectiveness as our schemes. Furthermore, while many studies have assumed that the Web changes according to a model, our experimental results are based on real web documents. We analyse various stateless crawl ordering schemes that have no past change statistics with which to predict which documents will change, none of which, to our knowledge, has been tested to determine effectiveness in crawling changed documents. We empirically show that the effectiveness of these schemes depends on the topology and dynamics of the domain crawled and that no one static crawl ordering scheme can effectively maintain freshness, motivating our work on dynamic approaches. We present our novel approach to maintaining freshness, which uses the anchor text linking documents to determine the likelihood of a document changing, based on statistics gathered during the current crawl. We show that this scheme is highly effective when combined with existing stateless schemes. When we combine our scheme with PageRank, our approach allows the crawler to improve both freshness and quality of a collection. Our scheme improves freshness regardless of which stateless scheme it is used in conjunction with, since it uses both positive and negative reinforcement to determine which document to retrieve. Finally, we present the design and implementation of Lara, our own distributed crawler, which we used to develop our testbed.
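The abstract above summarises the approach in prose only. Below is a minimal, hypothetical Python sketch of the general idea: a crawl frontier that combines a stateless prior (for example a PageRank-style score) with positive and negative reinforcement from anchor-text changes observed during the current crawl. The class names, weights, and change heuristic are illustrative assumptions, not the thesis's actual design.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class FrontierEntry:
    priority: float                      # heapq orders entries by this field only
    url: str = field(compare=False)

class DynamicFrontier:
    """Crawl ordering sketch: stateless prior plus anchor-text change signal."""

    def __init__(self, anchor_weight=0.5):
        self.anchor_weight = anchor_weight
        self.seen_anchors = {}           # url -> anchor text seen earlier in the crawl
        self.heap = []

    def add(self, url, static_score, anchor_text):
        previous = self.seen_anchors.get(url)
        if previous is None:
            change_signal = 0.0          # no evidence either way
        elif previous != anchor_text:
            change_signal = 1.0          # positive reinforcement: target likely changed
        else:
            change_signal = -1.0         # negative reinforcement: target likely unchanged
        self.seen_anchors[url] = anchor_text
        priority = static_score + self.anchor_weight * change_signal
        # heapq is a min-heap, so negate to pop the highest-priority URL first.
        heapq.heappush(self.heap, FrontierEntry(-priority, url))

    def next_url(self):
        return heapq.heappop(self.heap).url if self.heap else None
```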
2

Development of an online reputation monitor / Gerhardus Jacobus Christiaan Venter

Venter, Gerhardus Jacobus Christiaan January 2015 (has links)
Customers' opinions about companies are very important, as they can influence a company's profit. Companies often get customer feedback via surveys or other official methods in order to improve their services. However, some customers feel threatened when their opinions are asked for publicly and thus prefer to voice their opinion on the internet, where they take comfort in anonymity. This form of customer feedback is difficult to monitor, as the information can be found anywhere on the internet and new information is generated at an astonishing rate. Currently there are companies, such as Brandseye and Brand.Com, that provide online reputation management services. These services have various shortcomings, such as cost and the inability to access historical data. Companies also cannot purchase the software outright and can only use it on a subscription basis. The design proposed in this document will be able to scan any number of user-defined websites and save all the information found on them in a series of index files, which can be queried for occurrences of user-defined keywords at any time. Additionally, the software will be able to scan Twitter and Facebook for any number of user-defined keywords and save any occurrences of the keywords to a database. After scanning the internet, the results will be passed through a similarity filter, which will filter out insignificant results as well as any duplicates that might be present. The remaining results will then be analysed by a sentiment analysis tool, which will determine whether the sentence in which the keyword occurs is positive or negative. The analysed results determine the overall reputation of the keyword that was used. The proposed design has several advantages over current systems:
- By using the modular design, several tasks can execute at the same time without influencing each other. For example, information can be extracted from the internet while existing results are being analysed.
- By providing the keywords and websites that the system will use, the user has full control over the online reputation management process.
- By saving all the information contained in a website, the user is able to take historical information into account to determine how a keyword's reputation changes over time. Saving the information also allows the user to search for any keyword without rescanning the internet.
The proposed system was tested and successfully used to determine the online reputation of many user-defined keywords. / MIng (Computer and Electronic Engineering), North-West University, Potchefstroom Campus, 2015
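As a rough illustration of the pipeline this abstract describes (similarity filtering, sentence-level sentiment, aggregation into a reputation score), here is a minimal Python sketch. The word lists, the 0.9 similarity threshold, and the toy sentiment function are stand-ins chosen for illustration, not the thesis's implementation.

```python
from difflib import SequenceMatcher

POSITIVE = {"good", "great", "excellent", "reliable", "recommend"}
NEGATIVE = {"bad", "poor", "terrible", "unreliable", "avoid"}

def similarity_filter(sentences, threshold=0.9):
    """Drop sentences that are near-duplicates of one already kept."""
    kept = []
    for s in sentences:
        if all(SequenceMatcher(None, s.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(s)
    return kept

def sentiment(sentence):
    """Stand-in for a real sentiment analysis tool: +1, -1 or 0 per sentence."""
    words = {w.strip(".,!?") for w in sentence.lower().split()}
    return (len(words & POSITIVE) > len(words & NEGATIVE)) - \
           (len(words & NEGATIVE) > len(words & POSITIVE))

def reputation(keyword_mentions):
    """Aggregate sentence sentiments into an overall reputation score."""
    unique = similarity_filter(keyword_mentions)
    scores = [sentiment(s) for s in unique]
    return sum(scores) / len(scores) if scores else 0.0

mentions = [
    "AcmeCo support was excellent and very reliable.",
    "AcmeCo support was excellent and very reliable!",   # near-duplicate, filtered out
    "Terrible experience with AcmeCo, avoid them.",
]
print(reputation(mentions))   # 0.0: one positive and one negative mention remain
```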
3

Enhancing a Web Crawler with Arabic Search.

Nguyen, Qui V. 25 July 2012
Many advantages of the Internet (ease of access, limited regulation, vast potential audience, and fast flow of information) have turned it into the most popular way to communicate and exchange ideas. Criminal and terrorist groups also use these advantages to turn the Internet into their new play/battle fields to conduct their illegal/terror activities. There are millions of Web sites in different languages on the Internet, but the lack of foreign language search engines makes it impossible to analyze foreign language Web sites efficiently. This thesis will enhance an open source Web crawler with Arabic search capability, thus improving an existing social networking tool to perform page correlation and analysis of Arabic Web sites. A social networking tool with Arabic search capabilities could become a valuable tool for the intelligence community. Its page correlation and analysis results could be used to collect open source intelligence and build a network of Web sites that are related to terrorist or criminal activities.
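The thesis itself does not list its code here, but one typical ingredient of adding Arabic search to a crawler's indexing pipeline is normalising Arabic text before tokenisation so that common spelling variants map to the same index term. The sketch below shows a conventional set of normalisation rules in Python; treat the specific rules as assumptions rather than a description of the enhanced crawler.

```python
import re

ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")   # fathatan .. sukun
TATWEEL = "\u0640"                                    # elongation character

def normalize_arabic(text):
    text = ARABIC_DIACRITICS.sub("", text)    # drop short-vowel marks
    text = text.replace(TATWEEL, "")          # drop tatweel
    text = re.sub("[إأآ]", "ا", text)          # unify alef variants
    text = text.replace("ى", "ي")              # alef maqsura -> ya
    text = text.replace("ة", "ه")              # ta marbuta -> ha
    return text

def tokenize_arabic(text):
    # Keep runs of Arabic letters; everything else acts as a separator.
    return re.findall(r"[\u0621-\u064A]+", normalize_arabic(text))
```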
4

Monitoring internetu a jeho přínosy pro podnikání nástroji firmy SAS Institute / Monitoring of Internet and its benefits to business tools from SAS Institute

Moravec, Petr January 2011 (has links)
This thesis focuses on ways of obtaining information from the World Wide Web. The introduction covers theoretical approaches to data collection; its main part examines the Web Crawler program as one way of collecting data from the internet, followed by alternative collection methods such as the Google Search API. The next part of the thesis is dedicated to SAS products and their role in reporting and internet monitoring. The SAS Intelligence Platform is presented as the company's crucial platform, within which concrete SAS solutions can be found; SAS Web Crawler and Semantic Server are described as part of the SAS Content Categorization solution. While the first two parts of the thesis focus on theory, the third and closing part presents practical examples of internet data collection, realized mainly in SAS. The practical part builds on the theoretical one and cannot be detached from it.
5

Developing a Semantic Web Crawler to Locate OWL Documents

Koron, Ronald Dean 18 September 2012 (has links)
No description available.
6

A Domain Based Approach to Crawl the Hidden Web

Pandya, Milan 04 December 2006 (has links)
There is a lot of research work being performed on indexing the Web. More and more sophisticated Web crawlers are being designed to search and index the Web faster. But all these traditional crawlers crawl only the part of the Web we call the “Surface Web”. They are unable to crawl the hidden portion of the Web: they retrieve content only from surface Web pages, which are simply sets of pages connected by hyperlinks, and ignore the hidden information. Hence, they miss the tremendous amount of information hidden behind search forms in Web pages. Most of the published research has been aimed at detecting such searchable forms and making a systematic search over them. Our approach here is based on a Web crawler that analyzes search forms and fills them with appropriate content to retrieve the maximum amount of relevant information from the underlying database.
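A minimal sketch of the form-filling step described above, assuming the widely used requests and BeautifulSoup libraries: locate searchable forms on a page and submit domain keywords through them. The keyword list and the heuristic for choosing the query field are illustrative assumptions, not the thesis's method.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

DOMAIN_KEYWORDS = ["laptop", "camera", "headphones"]   # hypothetical domain terms

def submit_search_forms(page_url):
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for form in soup.find_all("form"):
        action = urljoin(page_url, form.get("action") or page_url)
        method = (form.get("method") or "get").lower()
        # Heuristic: the first text/search input is assumed to be the query field.
        text_inputs = [i for i in form.find_all("input")
                       if (i.get("type") or "text").lower() in ("text", "search")]
        if not text_inputs or not text_inputs[0].get("name"):
            continue                       # no usable searchable form on this page
        field = text_inputs[0]["name"]
        for keyword in DOMAIN_KEYWORDS:
            if method == "post":
                resp = requests.post(action, data={field: keyword}, timeout=10)
            else:
                resp = requests.get(action, params={field: keyword}, timeout=10)
            results.append((action, keyword, resp.status_code, len(resp.text)))
    return results
```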
7

Lokman: A Medical Ontology Based Topical Web Crawler

Kayisoglu, Altug 01 September 2005 (has links) (PDF)
Use of an ontology is one approach to overcoming the "search-on-the-net" problem. An ontology-based web information retrieval system requires a topical web crawler to construct a high-quality document collection. This thesis focuses on implementing a topical web crawler with a medical domain ontology in order to find out the advantages of ontological information in web crawling. The crawler is implemented with a Best-First search algorithm, and its design is optimized for the UMLS ontology. The crawler is tested with Harvest Rate and Target Recall metrics and compared to a non-ontology-based Best-First crawler. The test results showed that using the ontology in the crawler's URL selection algorithm improved crawler performance by 76%.
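To make the ontology-guided Best-First ordering concrete, here is a small Python sketch in which frontier URLs are ranked by how strongly their context matches a set of ontology terms, and harvest rate is computed as the fraction of fetched pages that are relevant. The tiny term set stands in for a real medical ontology such as UMLS, and the scoring rule is an assumption, not Lokman's actual algorithm.

```python
import heapq

ONTOLOGY_TERMS = {"diabetes", "insulin", "glucose", "pancreas", "therapy"}

def relevance(text):
    """Fraction of words that are ontology concept terms (toy scoring rule)."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    return sum(w in ONTOLOGY_TERMS for w in words) / len(words) if words else 0.0

def best_first_crawl(seeds, fetch, extract_links, budget=100, threshold=0.05):
    """fetch(url) -> page text; extract_links(url, text) -> [(url, anchor_text)]."""
    frontier = [(-1.0, url) for url in seeds]          # seeds get maximum priority
    heapq.heapify(frontier)
    visited, relevant_fetched = set(), 0
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text = fetch(url)
        if relevance(text) >= threshold:
            relevant_fetched += 1
        for link, anchor in extract_links(url, text):
            if link not in visited:
                # Score the link by its anchor text plus the source page's text.
                heapq.heappush(frontier, (-relevance(anchor + " " + text), link))
    harvest_rate = relevant_fetched / len(visited) if visited else 0.0
    return visited, harvest_rate
```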
8

Mot effektiv identifiering och insamling av brutna länkar med hjälp av en spindel / Towards effective identification and collection of broken links using a web crawler

Anttila, Pontus January 2018 (has links)
Today, the customer has no automated method for finding and collecting broken links on their website; this is done manually or not at all. This project has resulted in a practical product that can be applied to the customer's website. The aim of the product is to ease the work of collecting and maintaining broken links on the website. This is achieved by effectively gathering all potentially broken links and placing them in a separate list that can be exported at will by an administrator, who can then fix the broken links that were found. The customer will benefit from this product, since a website without broken links is of higher quality and gives visitors a better experience.
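A small sketch in Python of the kind of spider the thesis describes, assuming the requests and BeautifulSoup libraries: crawl pages within one site, test every link found, and export the broken ones to a CSV file that an administrator can work through. The page limit and the definition of "broken" (any 4xx/5xx status or request failure) are assumptions; a production spider would also add politeness delays.

```python
import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def find_broken_links(start_url, max_pages=50):
    site = urlparse(start_url).netloc
    to_visit, visited, broken = [start_url], set(), []
    while to_visit and len(visited) < max_pages:
        page = to_visit.pop()
        if page in visited:
            continue
        visited.add(page)
        try:
            resp = requests.get(page, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(page, a["href"])
            if not link.startswith(("http://", "https://")):
                continue                              # skip mailto:, javascript:, etc.
            try:
                status = requests.head(link, timeout=10, allow_redirects=True).status_code
            except requests.RequestException:
                status = None
            if status is None or status >= 400:
                broken.append((page, link, status))   # record where the bad link lives
            elif urlparse(link).netloc == site:
                to_visit.append(link)                 # only follow links within the site
    with open("broken_links.csv", "w", newline="") as f:
        csv.writer(f).writerows([("found_on", "link", "status")] + broken)
    return broken
```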
9

設計與實作一個臉書粉絲頁資料抓取器 / Design and Implementation of a Facebook Fan Page Data Crawler

鄭博元, Cheng, Po Yuan Unknown Date (has links)
With the popularity of social networking services in recent years, Facebook has become a major social tool. Many celebrities and companies have gone with the tide and established fan pages on Facebook to interact with their fans, and the mutual influence between the virtual world and the real world drives many emerging research agendas. Using information technology to collect data from the virtual world can help humanities scholars and social scientists explore new phenomena arising between digital technology and society. In this thesis, we focus on Facebook fan page data and design and construct a Facebook fan page crawler to help scholars obtain data for analysis. The crawler helps researchers find relevant fan pages, presented in order of the number of likes, so that popular pages are easy to select. It retrieves the data of specified fan pages, parses it into post messages, comment messages, and like messages, and stores the results in a database. For fan pages that have already been crawled, it can also update the stored data to the latest state automatically on a timer.
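A hedged Python sketch of the crawler's general shape: fetch a fan page's posts and comments through the Facebook Graph API and store them in a local SQLite database. The Graph API version, endpoint, field names, and required permissions change over time and should be treated as assumptions here, as should the database schema.

```python
import sqlite3
import requests

GRAPH = "https://graph.facebook.com/v2.12"        # hypothetical API version

def fetch_page_posts(page_id, access_token):
    """Yield posts (with nested comments) from a fan page, following pagination."""
    url = f"{GRAPH}/{page_id}/posts"
    params = {"access_token": access_token,
              "fields": "id,message,created_time,comments{id,message}"}
    while url:
        data = requests.get(url, params=params, timeout=30).json()
        yield from data.get("data", [])
        url = data.get("paging", {}).get("next")   # next page of results, if any
        params = {}                                # the 'next' URL already carries the query

def store_posts(db_path, posts):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS posts(id TEXT PRIMARY KEY, "
                "created_time TEXT, message TEXT)")
    con.execute("CREATE TABLE IF NOT EXISTS comments(id TEXT PRIMARY KEY, "
                "post_id TEXT, message TEXT)")
    for p in posts:
        con.execute("INSERT OR REPLACE INTO posts VALUES (?, ?, ?)",
                    (p["id"], p.get("created_time"), p.get("message")))
        for c in p.get("comments", {}).get("data", []):
            con.execute("INSERT OR REPLACE INTO comments VALUES (?, ?, ?)",
                        (c["id"], p["id"], c.get("message")))
    con.commit()
    con.close()
```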
