Sunday, March 20, 2016

Fourth Week of LITG Program

After laying the foundation for programming with Python, this week I was introduced to the world of web scraping. The main objective of my project is to retrieve information such as top questions, followers, etc. from Quora, given a user's name as input. To develop a REST API for this, a Python package first has to be developed to extract information from Quora web pages; this package can then be pushed to the Python Package Index. Learning the significance and the techniques of web scraping is therefore of great importance.

The term web scraping refers to extracting useful information from a web page. Web scraping is useful in a number of circumstances. Not every website provides an API for extracting its data, and even when a site does have one, it is often not exposed to the public. Sometimes, although these APIs are publicly accessible, they are not comprehensive. Developing your own API that scrapes the web therefore gives you more control over the data-extraction logic: you can scrape only what you want and avoid having to handle the mass of data returned by general-purpose APIs. Data scraped this way can be quite useful in various analysis tasks.

As instructed by my mentor, I learned two technologies that can be used for web scraping. The first is scraping the web with the Scrapy library; the example I referred to is available at https://github.com/tapasweni-pathak/Talks-and-Workshops/tree/master/PyLadies%20Remote/Code%20Examples, and the video tutorial at https://www.youtube.com/watch?v=ic8ygbac5lo made it much easier for me to understand the project. The second is pyhoroscope, a project developed by my mentor and available at https://github.com/tapasweni-pathak/pyhoroscope. It uses Beautiful Soup + lxml, two other libraries that can be used for web scraping instead of Scrapy. In the rest of this article I will summarize what I learned from these two projects.

Scrapy is a portable, open-source library written entirely in Python. It is extensively documented, which makes it quite easy to learn, and the output of a scraping run can be saved to a JSON, CSV or XML file, which is convenient for a developer. The sample project uses the Cloud9 IDE for developing the scraper. The directory structure of a Scrapy project is shown in the following image.

Scrapy Directory Structure
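For reference, running the scrapy startproject command generates a layout roughly like the following (the project name here is just a placeholder):

myproject/
    scrapy.cfg            -- deploy configuration file
    myproject/
        __init__.py
        items.py          -- item (field) definitions
        pipelines.py      -- post-processing pipelines
        settings.py       -- project settings
        spiders/
            __init__.py   -- your spider files go in this folder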

The auto-generated files items.py, settings.py and pipelines.py each serve a purpose. items.py defines the fields you want to scrape from a web page, such as the heading. settings.py holds settings such as whether to follow redirects and what to do when a 404 page is encountered. pipelines.py does post-processing, such as eliminating duplicates. The spiders folder holds the spider files, which contain the logic of how you scrape the web.

The tutorial develops two spiders (crawlers): a simple spider that scrapes http://scrapy.org/ once, and a recursive spider for http://www.cse.iitd.ac.in/~naveen/. The simple crawler introduces several important concepts: selectors, the Base Spider and the parse function. Selectors pick out parts of the HTML source specified by an XPath expression or a CSS selector (cssselect); the XPath is the path to the data we need to extract, and since XPath can be confusing at times, people often use CSS selectors instead. The Base Spider is the simplest spider, from which every other spider inherits; its basic task is to request the given start URL and call the spider's parse method on each resulting response. The parse method processes the response from the website, and it is the default callback Scrapy uses whenever a request does not specify a custom one. Using these concepts I was able to successfully scrape the official Scrapy website.
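To make these pieces concrete, here is a minimal sketch of a simple spider in the spirit of the tutorial; the item fields, the spider name and the exact selectors are my own illustration, not the tutorial's code. In recent Scrapy versions the Base Spider is exposed as scrapy.Spider.

import scrapy

class PageItem(scrapy.Item):
    # Fields to scrape, normally declared in items.py
    title = scrapy.Field()
    url = scrapy.Field()

class SimpleSpider(scrapy.Spider):
    # scrapy.Spider plays the role of the Base Spider described above
    name = "simple"
    start_urls = ["http://scrapy.org/"]

    def parse(self, response):
        # Default callback, invoked once for each response to the start URLs.
        # An XPath selector picks the page title; the CSS equivalent would be
        # response.css("title::text").
        item = PageItem()
        item["title"] = response.xpath("//title/text()").extract_first()
        item["url"] = response.url
        yield item

Running scrapy crawl simple -o output.json then dumps the scraped items into a JSON file, which is the export feature mentioned earlier.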

On the other hand, the recursive crawler uses the Crawl Spider instead of the Base Spider. This spider lets you define rules for recursive crawling, and the rules involve regular expressions. The crawler looks for the given pattern in each URL it encounters; whenever a URL matches, it follows that link, applies the same rules again, and passes the response to the callback function. The rule defined for recursive crawling in the example project allows URLs matching the following regular expression.

"cse\.iitd\.ernet\.in/\~naveen/.*\.html"

In the regex above, '.' matches any character and '*' means the preceding pattern may be repeated any number of times, so '.*\.html' matches any page ending in .html. This is the format of the URLs that will be recursively scraped on the domain "cse.iitd.ernet.in". This example also uses a custom callback function to parse the response, instead of the default parse callback used so far. Using these concepts I was able to perform a sample recursive crawl of http://www.cse.iitd.ac.in/~naveen/.
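As a rough sketch of how such a rule looks in code (the spider name and the callback are my own illustration; the regular expression is the one from the example project):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RecursiveSpider(CrawlSpider):
    name = "recursive"
    start_urls = ["http://www.cse.iitd.ac.in/~naveen/"]

    # Follow only links whose URL matches the allowed pattern and hand each
    # matching response to the custom callback defined below.
    rules = (
        Rule(LinkExtractor(allow=(r"cse\.iitd\.ernet\.in/\~naveen/.*\.html",)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Custom callback: the Crawl Spider reserves parse() for its own
        # link-following logic, so a different name is required here.
        yield {
            "url": response.url,
            "title": response.xpath("//title/text()").extract_first(),
        }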

The other project, pyhoroscope, is a Python module built with Beautiful Soup + lxml that fetches and parses data from GaneshaSpeaks (http://www.ganeshaspeaks.com/). It provides methods to get today's, weekly, monthly or yearly horoscopes, or to find out all about a given sunsign. The response is parsed using the HTML parser from Python's standard library together with etree from the lxml library, and the parsed data is stored in a dictionary that is returned when a method is invoked.
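The following is a minimal sketch of that Beautiful Soup + lxml approach rather than pyhoroscope's actual code; the URL pattern and the tags being selected are assumptions made for illustration.

import requests
from bs4 import BeautifulSoup

def get_todays_horoscope(sunsign):
    # Illustrative URL pattern; the real module builds its own
    # GaneshaSpeaks URLs and picks different elements from the page.
    url = "http://www.ganeshaspeaks.com/" + sunsign + "-daily-horoscope/"
    response = requests.get(url)
    # Passing "lxml" tells Beautiful Soup to use lxml as its parser.
    soup = BeautifulSoup(response.text, "lxml")
    heading = soup.find("h1")
    paragraph = soup.find("p")
    # The scraped data is returned as a dictionary, as in pyhoroscope.
    return {
        "sunsign": sunsign,
        "heading": heading.get_text(strip=True) if heading else None,
        "horoscope": paragraph.get_text(strip=True) if paragraph else None,
    }

print(get_todays_horoscope("aries"))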

With this knowledge of web scraping, I will head towards developing a Python module to scrape Quora during the next week. At the end of the week the module will be pushed to the Python Package Index as a package so that any API can use it thereafter.
