Global ETD Search

1	Autonomous Cooperating Web Crawlers McLearn, Greg January 2002 (has links) A web crawler provides an automated way to discover web events ? creation, deletion, or updates of web pages. Competition among web crawlers results in redundant crawling, wasted resources, and less-than-timely discovery of such events. This thesis presents a cooperative sharing crawler algorithm and sharing protocol. Without resorting to altruistic practices, competing (yet cooperative) web crawlers can mutually share discovered web events with one another to maintain a more accurate representation of the web than is currently achieved by traditional polling crawlers. The choice to share or merge is entirely up to an individual crawler: sharing is the act of allowing a crawler M to access another crawler's web-event data (call this crawler S), and merging occurs when crawler M requests web-event data from crawler S. Crawlers can choose to share with competing crawlers if it can help reduce contention between peers for resources associated with the act of crawling. Crawlers can choose to merge from competing peers if it helps them to maintain a more accurate representation of the web at less cost than directly polling web pages. Crawlers can control how often they choose to merge through the use of a parameter ρ, which dictates the percentage of time spent either polling or merging with a peer. Depending on certain conditions, pathological behaviour can arise if polling or merging is the only form of data collection. Simulations of communities of simple cooperating web crawlers successfully show that a combination of polling and merging (0 < ρ < 1) can allow an individual member of the cooperating community a higher degree of accuracy in their representation of the web as compared to a traditional polling crawler. Furthermore, if web crawlers are allowed to evaluate their own performance, they can dynamically switch between periods of polling and merging to still perform better than traditional crawlers. The mutual performance gain increases as more crawlers are added to the community. Computer Science autonomous web crawlers WWW simulation
2	Autonomous Cooperating Web Crawlers McLearn, Greg January 2002 (has links) A web crawler provides an automated way to discover web events ? creation, deletion, or updates of web pages. Competition among web crawlers results in redundant crawling, wasted resources, and less-than-timely discovery of such events. This thesis presents a cooperative sharing crawler algorithm and sharing protocol. Without resorting to altruistic practices, competing (yet cooperative) web crawlers can mutually share discovered web events with one another to maintain a more accurate representation of the web than is currently achieved by traditional polling crawlers. The choice to share or merge is entirely up to an individual crawler: sharing is the act of allowing a crawler M to access another crawler's web-event data (call this crawler S), and merging occurs when crawler M requests web-event data from crawler S. Crawlers can choose to share with competing crawlers if it can help reduce contention between peers for resources associated with the act of crawling. Crawlers can choose to merge from competing peers if it helps them to maintain a more accurate representation of the web at less cost than directly polling web pages. Crawlers can control how often they choose to merge through the use of a parameter ρ, which dictates the percentage of time spent either polling or merging with a peer. Depending on certain conditions, pathological behaviour can arise if polling or merging is the only form of data collection. Simulations of communities of simple cooperating web crawlers successfully show that a combination of polling and merging (0 < ρ < 1) can allow an individual member of the cooperating community a higher degree of accuracy in their representation of the web as compared to a traditional polling crawler. Furthermore, if web crawlers are allowed to evaluate their own performance, they can dynamically switch between periods of polling and merging to still perform better than traditional crawlers. The mutual performance gain increases as more crawlers are added to the community. Computer Science autonomous web crawlers WWW simulation
3	Distributed Crawling of Rich Internet Applications Mir Taheri, Seyed Mohammad January 2015 (has links) Web crawlers visit internet applications, collect data, and learn about new web pages from visited pages. Web crawlers have a long and interesting history. Quick expansion of the web, and the complexity added to web applications have made the process of crawling a very challenging one. Different solutions have been proposed to reduce the time and cost of crawling. New generation of web applications, known as Rich Internet Applications (RIAs), pose major challenges to the web crawlers. RIAs shift a portion of the computation to the client side. Shifting a portion of the application to the client browser influences the web crawler in two ways: First, the one-to-one correlation between the URL and the state of the application, that exists in traditional web applications, is broken. Second, reaching a state of the application is no longer a simple operation of navigating to the target URL, but often means navigating to a seed URL and executing a chain of events from it. Due to these challenges, crawling a RIA can take a prohibitively long time. This thesis studies applying distributed computing and parallel processing principles to the field of RIA crawling to reduce the time. We propose different algorithms to concurrently crawl a RIA over several nodes. The proposed algorithms are used as a building block to construct a distributed crawler of RIAs. The different algorithms proposed represent different trade-offs between communication and computation. This thesis explores the effect of making different trade-offs and their effect on the time it takes to crawl RIAs. We study the cost of running a distributed RIA crawl with client-server architecture and compare it with a peer-to-peer architecture. We further study distribution of different crawling strategies, namely: Breath-First search, Depth-First search, Greedy algorithm, and Probabilistic algorithm. To measure the effect of different design decisions in practice, a prototype of each algorithm is implemented. The implemented prototypes are used to obtain empirical performance measurements and to refine the algorithms. The ultimate refined algorithm is used for experimentation with a wide range of applications under different circumstances. This thesis finally includes two theoretical studies of load balancing algorithms and distributed component-based crawling and sets the stage for future studies. Distributed Systems Web Crawlers Rich Internet Applications
4	Intelligent Caching to Mitigate the Impact of Web Robots on Web Servers Rude, Howard Nathan January 2016 (has links) No description available. Computer Science web cache web robots crawlers prefetching prediction LSTM
5	CLASSIFICAÇÃO DE NÓDULOS PULMONARES UTILIZANDO VIDAS ARTIFICIAIS, MVS E MEDIDAS DIRECIONAIS DE TEXTURA / CLASSIFICATION OF PULMONARY NODULES USING ARTIFICIAL LIFE, MVS AND TEXTURE DIRECTIONAL MEASURES Froz, Bruno Rodrigues 02 February 2015 (has links) Made available in DSpace on 2016-08-17T14:52:36Z (GMT). No. of bitstreams: 1 dissertacao Bruno Rodrigues Froz.pdf: 1583465 bytes, checksum: f53ff1f85d91788fc7d52925b16f6794 (MD5) Previous issue date: 2015-02-02 / Conselho Nacional de Desenvolvimento Científico e Tecnológico / The lung cancer is known for presenting the highest mortality rate and one of the lowest survival rate after diagnosis, which is mainly caused by the late detection and treatment. With the goal of assist the lung cancer specialists, computed aided diagnosis systems are developed to automate the detection and diagnosis of this disease. This work proposes a methodology to classify, with computed tomography images, lung nodules candidates and non-nodules candidates. The Lung Image Database Consortium (LIDC) image database is used to create an image database with nodules candidates and an image database with non-nodule candidates. Three techniques are utilized to extract texture measurements. The first one is the artificial life algorithm Artificial Crawlers. The second one is the use of Rose Diagram to extract directional measurements. The third and last one is an hybrid model to join the Artificial Crawlers and Rose Diagram texture measurements. In the classification, que Support Vector Machine classifier is used, with its radial basis kernel. The archived results are very promising. With 833 LIDC exams, divided in 60% for train and 40% for test, we reached na accuracy mean of 94,30%, sensitivity mean of 91,86%, specificity mean of 94,78%, variance coefficient of accuracy of 1,61% and ROC curves mean área of 0,922. / O câncer de pulmão é conhecido por apresentar a maior taxa de mortalidade e uma das menores taxas de sobrevida após o diagnóstico, o que é causado principalmente pela detecção e tratamento tardios. Para o auxílio dos especialistas em câncer pulmonar, são desenvolvidos sistemas de diagnósticos auxiliados por computador com o objetivo de automatizar a detecção e diagnóstico dessa doença. Este trabalho propõe uma metodologia para a classificação, através de imagens de tomografias computadorizadas, de candidatos a nódulos pulmonares e candidatos a não-nódulos. O banco de imagens Lung Image Database Consortium (LIDC) é utilizado para a criação de uma base de imagens de candidatos a nódulos e uma base de imagens de candidatos a não-nódulos. Três técnicas são utilizadas para a extração de medidas de textura. A primeira delas é o algoritmo de vidas artificiais Artificial Crawlers. A segunda técnica é a utilização do Rose Diagram para a extração de medidas direcionais. A terceira e última técnica é um modelo híbrido que une as medidas do Artificial Crawlers e do Rose Diagram. Para a classificação é utilizado o classificador Máquina de Vetor de Suporte (MVS), com o kernel de base radial. Os resultados alcançados são muito promissores. Utilizando 833 exames do LIDC divididos em 60% para treino e 40% para teste, alcançou-se uma média de acurácia de 94,30%, média de sensibilidade de 91,86%, média de especificidade de 94,78%, coeficiente de variância da acurácia de 1,61% e área média das curvas ROC de 0,922. Processamento de Imagens Reconhecimento de Padrões Câncer Classificação de Nódulo Pulmonar Vidas Artificiais Artificial Crawlers Rose Diagram Image Processing Pattern Recognition Cancer Lung Nodule Classification Artificial Life Artificial Crawlers Rose Diagram.
6	The One Spider To Rule Them All : Web Scraping Simplified: Improving Analyst Productivity and Reducing Development Time with A Generalized Spider / Spindeln som härskar över dom alla : Webbskrapning förenklat: förbättra analytikerproduktiviteten och minska utvecklingstiden med generaliserade spindlar Johansson, Rikard January 2023 (has links) This thesis addresses the process of developing a generalized spider for web scraping, which can be applied to multiple sources, thereby reducing the time and cost involved in creating and maintaining individual spiders for each website or URL. The project aims to improve analyst productivity, reduce development time for developers, and ensure high-quality and accurate data extraction. The research involves investigating web scraping techniques and developing a more efficient and scalable approach to report retrieval. The problem statement emphasizes the inefficiency of the current method with one customized spider per source and the need for a more streamlined approach to web scraping. The research question focuses on identifying patterns in the web scraping process and functions required for specific publication websites to create a more generalized web scraper. The objective is to reduce manual effort, improve scalability, and maintain high-quality data extraction. The problem is resolved using a quantitative approach that involves the analysis and implementation of spiders for each data source. This enables a comprehensive understanding of all potential scenarios and provides the necessary knowledge to develop a general spider. These spiders are then grouped based on their similarity, and through the application of simple logic, they are consolidated into a single general spider capable of handling all the sources. To construct the general spider, a utility library is created, equipped with the essential tools for extracting relevant information such as title, description, date, and PDF links. Subsequently, all the individual information is transferred to configuration files, enabling the execution of the general spider. The findings demonstrate the successful integration of multiple sources and spiders into a unified general spider. However, due to the limited time frame of the project, there is potential for further improvement. Enhancements could include better structuring of the configuration files, expansion of the utility library, or even the integration of AI capabilities to enhance the performance of the general spider. Nevertheless, the current solution is deemed suitable for automated article retrieval and ready to be used. / Denna rapport tar upp processen att utveckla en generaliserad spindel för webbskrapning, som kan appliceras på flera källor, och därigenom minska tiden och kostnaderna för att skapa och underhålla individuella spindlar för varje webbplats eller URL. Projektet syftar till att förbättra analytikers produktivitet, minska utvecklingstiden för utvecklare och säkerställa högkvalitativ och korrekt dataextraktion. Forskningen går ut på att undersöka webbskrapningstekniker och utveckla ett mer effektivt och skalbart tillvägagångssätt för att hämta rapporter. Problemformuleringen betonar ineffektiviteten hos den nuvarande metoden med en anpassad spindel per källa och behovet av ett mer effektiviserad tillvägagångssätt för webbskrapning. Forskningsfrågan fokuserar på att identifiera mönster i webbskrapningsprocessen och funktioner som krävs för specifika publikationswebbplatser för att skapa en mer generaliserad webbskrapa. Målet är att minska den manuella ansträngningen, förbättra skalbarheten och upprätthålla datautvinning av hög kvalitet. Problemet löses med hjälp av en kvantitativ metod som involverar analys och implementering av spindlar för varje datakälla. Detta möjliggör en omfattande förståelse av alla potentiella scenarier och ger den nödvändiga kunskapen för att utveckla en allmän spindel. Dessa spindlar grupperas sedan baserat på deras likhet, och genom tillämpning av enkel logik konsolideras de till en enda allmän spindel som kan hantera alla källor. För att konstruera den allmänna spindeln skapas ett verktygsbibliotek, utrustat med de väsentliga verktygen för att extrahera relevant information som titel, beskrivning, datum och PDF-länkar. Därefter överförs all individuell information till konfigurationsfiler, vilket möjliggör exekvering av den allmänna spindeln. Resultaten visar den framgångsrika integrationen av flera källor och spindlar till en enhetlig allmän spindel. Men på grund av projektets begränsade tidsram finns det potential för ytterligare förbättringar. Förbättringar kan inkludera bättre strukturering av konfigurationsfilerna, utökning av verktygsbiblioteket eller till och med integrering av AI-funktioner för att förbättra den allmänna spindelns prestanda. Ändå bedöms den nuvarande lösningen vara lämplig för automatisk artikelhämtning och redo att användas. Web scraping Web crawlers HTML Scrapy Optimization Web data extraction Webbskrapning Webbsökrobotar HTML Scrapy Optimering Webbdataextraktion Computer and Information Sciences Data- och informationsvetenskap
7	Σχεδιασμός και υλοποίηση ενός συστήματος αποκομιδής ορισμένης πληροφορίας από τον παγκόσμιο ιστό, με τη χρήση σημασιολογικών δικτύων λημμάτων / Design and implementation of a topical-focused web crawler through the use of semantic networks Κοζανίδης, Ελευθέριος 28 February 2013 (has links) Η συγκεκριμένη διατριβή στοχεύει στον σχεδιασμό της μεθοδολογίας που θα εφαρμοστεί για την υλοποίηση ενός προσκομιστή πληροφορίας από τον Παγκόσμιο Ιστό, ο οποίος θα λειτουργεί λαμβάνοντας υπόψη θεματικά κριτήρια. Τέτοιου είδους προγράμματα ανίχνευσης πληροφορίας, είναι ευρέως γνωστά ως θεματικά εστιασμένοι προσκομιστές ιστοσελίδων. Κατά τη διάρκεια της μελέτης μας, σχεδιάσαμε και υλοποιήσαμε ένα καινοτόμο σύστημα θεματικής κατηγοριοποίησης ιστοσελίδων που κάνει εκτεταμένη χρήση των σημασιολογικών δεδομένων τα οποία περιέχονται στο σημασιολογικό δίκτυο WordNet. Η απόφαση για την αξιοποίηση του WordNet ελήφθη με τη φιλοδοξία να αντιμετωπιστούν αποτελεσματικά φαινόμενα ασάφειας εννοιών που μειώνουν τις επιδόσεις των διαθέσιμων θεματικών κατηγοριοποιητών. Η καταλληλότητα του WordNet για την επίλυση της σημασιολογικής ασάφειας έχει αποδειχθεί στο παρελθόν, αλλά ποτέ δεν εξετάστηκε σε ένα σύστημα εστιασμένης προσκόμισης ιστοσελίδων με τον συγκεκριμένο τρόπο, ενώ ποτέ δεν έχει αξιοποιηθεί στην κατηγοριοποίηση ιστοσελίδων για την ελληνική γλώσσα. Ως εκ τούτου, ο θεματικός κατηγοριοποιητής που υλοποιήσαμε, και κατά συνέπεια, και ο εστιασμένος προσκομιστής στον οποίο ενσωματώνεται ο κατηγοριοποιητής, είναι καινοτόμοι όσο αφορά τον τρόπο με τον οποίο αποσαφηνίζουν έννοιες λέξεων με στόχο την αποτελεσματική ανίχνευση του θεματικού προσανατολισμού μίας ιστοσελίδας . Ένας προσκομιστής ιστοσελίδων είναι ένα πρόγραμμα που με αφετηρία μία λίστα διευθύνσεων ιστοσελίδων (URLs) αρχικοποίησης προσκομίζει το περιεχόμενο των ιστοσελίδων που συναντά και συνεχίζει ακολουθώντας τους εσωτερικούς τους συνδέσμους με απώτερο σκοπό την προσκόμιση όσο το δυνατό μεγαλύτερου υποσυνόλου δεδομένων του Παγκόσμιου Ιστού (ανάλογα με τους διαθέσιμους πόρους, την χωρητικότητα του δικτύου, κλπ.). Δεδομένου ότι ο όγκος των δεδομένων που είναι διαθέσιμα στον Παγκόσμιο Ιστό αυξάνεται με εκθετικό ρυθμό, είναι πρακτικά αδύνατο να προσκομιστούν όλες οι ζητούμενες πηγές πληροφορίας ανά πάσα στιγμή. Ένας τρόπος για να αντιμετωπίσουμε το συγκεκριμένο πρόβλημα είναι η εκμετάλλευση συστημάτων εστιασμένης προσκόμισης ιστοσελίδων που στοχεύουν στη λήψη ιστοσελίδων συγκεκριμένης θεματολογίας που εκφράζουν κάθε φορά το θεματικό προφίλ του χρήστη, σε αντίθεση με τους προσκομιστές ιστοσελίδων γενικού σκοπού που καταναλώνουν πόρους άσκοπα προσπαθώντας να προσκομίσουν κάθε πιθανή πηγή πληροφορίας που συναντούν. Οι εστιασμένοι προσκομιστές χρησιμοποιούνται εκτενώς, για την κατασκευή θεματικά προσανατολισμένων ευρετηρίων ιστοσελίδων, κάθε ένα από τα οποία έχει την δυνατότητα να εξυπηρετήσει αιτήσεις χρηστών με συγκεκριμένο θεματικό προσανατολισμό. Με αυτό τον τρόπο είναι δυνατόν να αντιμετωπιστεί το πρόβλημα της υπερφόρτωσης πληροφοριών. Προκειμένου να επιτελέσουμε την συγκεκριμένη εργασία μελετήσαμε εκτενώς υπάρχουσες τεχνικές εστιασμένης προσκόμισης, στις οποίες στηριχθήκαμε ώστε να ορίσουμε την μεθοδολογία που θα ακολουθήσουμε. Το αποτέλεσμα είναι η υλοποίηση ενός θεματικά εστιασμένου πολυνηματικού προσκομιστή, ο οποίος ενσωματώνει τις εξής καινοτομίες: είναι ρυθμισμένος προκειμένου να εκτελεί εστιασμένες προσκομίσεις σε ιστοσελίδες ελληνικού ενδιαφέροντος, αποσαφηνίζει το κείμενο που αντιστοιχεί σε ιστοσελίδες προκειμένου να ανακαλύψει τον θεματικό τους προσανατολισμό. Επιπλέον προτείνουμε μία σειρά υποσυστημάτων τα οποία θα μπορούσαν να ενσωματωθούν στο σύστημα εστιασμένης προσκόμισης προκειμένου να ενισχύσουμε την απόδοσή του. Τέτοια συστήματα είναι το υποσύστημα ανίχνευσης όψεων που αντιστοιχίζονται σε επώνυμες οντότητες καθώς και το υποσύστημα εξαγωγής λέξεων κλειδιών που μπορούν να χρησιμοποιηθούν ως χαρακτηριστικά κατηγοριοποίσης από το αλφαριθμητικό των διευθύνσεων (URL) ιστοσελίδων. Για να παρουσιάσουμε την αποτελεσματικότητα της προτεινόμενης μεθόδου, διενεργήσαμε μία σειρά πειραματικών μετρήσεων. Συγκεκριμένα αξιολογήσαμε πειραματικά τα ακόλουθα: την αποτελεσματικότητα του αλγορίθμου αποσαφήνισης που ενσωματώσαμε στον προσκομιστή, την απόδοση του θεματικού κατηγοριοποιητή ο οποίος καθορίζει την συμπεριφορά του εστιασμένου προσκομιστή σχετικά με το αν μια σελίδα θα πρέπει να κατέβει ως θεματικά σχετική με το θέμα ενδιαφέροντος ή όχι, την απόδοση του εστιασμένου προσκομιστή καταγράφοντας τον ρυθμό απόκτησης που επιτυγχάνει κατά την διάρκεια της εστιασμένης προσκόμισης χρησιμοποιώντας κάθε φορά διαφορετικά χαρακτηριστικά κατηγοριοποίησης, την καταλληλότητα του υποσυστήματος εξαγωγής λέξεων-κλειδιών από το αλφαριθμητικό URL για την περιγραφή του θεματικού προσανατολισμού της ιστοσελίδας και τέλος τη χρησιμότητα του συστήματος αναγνώρισης επώνυμων οντοτήτων στην οργάνωση ιστοσελίδων των οποίων η σημασιολογία δεν αναπαρίσταται ικανοποιητικά σε σημασιολογικούς πόρους γενικού σκοπού συμπεριλαμβανομένου του σημασιολογικού δικτύου WordNet. Τα πειραματικά αποτελέσματα επιβεβαιώνουν τη συμβολή του θεματικά εστιασμένου προσκομιστή που προτείνουμε στην προσκόμιση περιεχομένου ειδικού ενδιαφέροντος από τον Παγκόσμιο Ιστό. Παράλληλα αποδεικνύουμε ότι όλες οι μέθοδοι που ενσωματώσαμε στο σύστημα εστιασμένης προσκόμισης είναι δυνατό να συνεργαστούν κατά τρόπο που να βελτιώνει την απόδοση του προσκομιστή . Τέλος από τα πειραματικά αποτελέσματα αποδεικνύεται ότι η προτεινόμενη τεχνική είναι εξίσου αποτελεσματική για ιστοσελίδες στα αγγλικά και στα ελληνικά. Επιπλέον πιστεύουμε ότι μπορεί να εφαρμοστεί με επιτυχία και σε ιστοσελίδες που περιέχουν κείμενα άλλων φυσικών γλωσσών, με προϋπόθεση την ύπαρξη σημασιολογικών πόρων, αντίστοιχων με το WordNet και διαθέσιμων εργαλείων που θα επιτρέπουν την ανάλυση των δεδομένων κειμένου τους. / This dissertation aims at the specification of an algorithmic methodology that will be applied towards the implementation of a web crawler, which will operate upon thematic criteria. Such crawlers are widely known as topical focused web crawlers. To realize our objective, the utilization of a web page thematic classification system (either existing or newly developed one) is imperative. In the course of our study, we designed and implemented a novel thematic classifier that makes extensive use of the semantic data encoded in WordNet semantic network and such decision was taken with the aspiration of tackling effectively sense ambiguity phenomena that degrade the performance of available classifiers. The suitability of WordNet towards resolving semantic ambiguity has been previously proven but never examined in a focused web crawling application and has never been exploited for the Greek language. Therefore, our thematic classifier and consequently our focused crawler that integrates it are innovative in the way in which they perform word sense disambiguation for achieving the effective detection of the web page topics (themes). In a broad sense, a web crawler is a program that based on a seed list of URLs it downloads the contents of the web pages it comes across and continues following their internal links with the utmost objective of fetching as much as web data as possible (depending on available resources, network capacity, etc.). Given that the web data grows at exponential rates, it is practically impossible to download all the web sources at any given time. One way to tackle such difficulty is to implement and employ topical focused crawlers that aim at downloading content of specific topics (potentially of interest to the user) rather than waste resources trying to download every single data source that is available on the web. Topically focused crawlers are extensively used for building topical focused indices, each of which can serve specialized user search requests, therefore dealing partially with the information overload problem. To carry out our work, we have extensively reviewed existing approaches with respect to topically focused crawling techniques upon which we relied for defining our own focused crawling methodology, which resulted into the implementation of a topical focused crawler that incorporates the following innovate features: it is tailored to operate on the Greek web, it disambiguates the web pages in order to uncover their topic and it incorporates numerous features, such as a named entities recognizer, a URL keyword extractor, personalization techniques, etc., in order to maximize its performance. To demonstrate the effectiveness of our method, we have applied our topical focused crawler on several datasets and experimentally evaluated the following issues: the efficiency of the sense resolution algorithm incorporated into our crawler, the performance of the topical classifier that the crawler consults prior to making a final decision as to whether a page should be downloaded as topically relevant to a subject of interest or not, the suitability of the URL keyword extractor module for judging the subject of a web page based entirely on the analysis of its URL, the usefulness of the named entities recognizer in organizing pages whose semantics are poorly represented within the contents of general-purpose semantic resources (including WordNet semantic network). Experimental results confirm the contribution of our topically focused crawler in downloading web content of specific interest and show that all the methods and techniques that we have successfully integrated into the crawler can interoperate with its other in a manner that improves the crawling performance while allowing for flexibility in the downloading process at the same time. Last but not least, experimental results showcase that our crawling methodology is equally effective for both English and Greek and we believe that it can be fruitfully applied to other natural languages provided that there the respective semantic resources and tools are available for analyzing their textual data. 025.042 7 Focused crawlers Semantic networks Semantic similarity Word sense disambiguation Web page categorization WordNet GeoWordNet Greek WordNet

1

Page generated in 0.1196 seconds