Global ETD Search

Return to search

How to Build a Web Scraper for Social Media

In recent years, the act of scraping websites for information has become increasingly relevant. However, along with this increase in interest, the internet has also grown substantially and advances and improvements to websites over the years have in fact made it more difficult to scrape. One key reason for this is that scrapers simply account for a significant portion of the traffic to many websites, and so developers often implement anti-scraping measures along with the Robots Exclusion Protocol (robots.txt) to try to stymie this traffic. The popular use of dynamically loaded content – content which loads after user interaction – poses another problem for scrapers. In this paper, we have researched what kinds of issues commonly occur when scraping and crawling websites – more specifically when scraping social media – and how to solve them. In order to understand these issues better and to test solutions, a literature review was performed and design and creation methods were used to develop a prototype scraper using the frameworks Scrapy and Selenium. We found that automating interaction with dynamic elements worked best to solve the problem of dynamically loaded content. We also theorize that having an artificial random delay when scraping and randomizing intervals between each visit to a website would counteract some of the anti-scraping measures. Another, smaller aspect of our research was the legality and ethicality of scraping. Further thoughts and comments on potential solutions to other issues have also been included.

Engineering and Technology

Teknik och teknologier

Identifer	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:mau-20594
Date	January 2019
Creators	Lloyd, Oskar, Nilsson, Christoffer
Publisher	Malmö universitet, Fakulteten för teknik och samhälle (TS), Malmö universitet, Fakulteten för teknik och samhälle (TS), Malmö universitet/Teknik och samhälle
Source Sets	DiVA Archive at Upsalla University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess

Page generated in 0.0029 seconds

How to Build a Web Scraper for Social Media

Description

Links & Downloads

Tags

Additional Fields