Friday, March 25, 2016

Fifth Week of LITG Program - Part 1

The fifth week of the LITG program was a bit hectic. To cover the backlog from the previous weeks, I had to work a bit harder on a tight schedule. Before moving on to developing the Python package to scrape Quora user profiles, I was instructed by my mentor to explore and learn about several Python modules such as urllib2, lxml, etree, xpath and re, which are commonly used in Python code to extract data from websites. I also learned about dictionaries in Python. This article is dedicated to summarizing my learning related to these topics. The simple commands that I tried for each of the aforementioned modules are available in the github repository at https://github.com/hansika/LITG_Practice


urllib2 Module:


urllib2 is a Python module which defines functions and classes that help in opening URLs (Uniform Resource Locators). I referred to the tutorial available at https://docs.python.org/2/howto/urllib2.html on the official Python website to try out some common commands in the urllib2 module and get a basic understanding. This module is most commonly used with HTTP URLs, but it can also be used with other protocols such as HTTPS, FTP, Gopher etc. Apart from opening URLs, the module also offers a slightly more complex interface for handling common situations such as basic and digest authentication, redirections and cookies. Such features are provided by objects called handlers and openers.


Get Request

urlopen() is the function used to fetch the content of a particular URL. The response returned is a file-like object with several functions defined on it. The read() and readlines() methods give access to the data of the remote resource. The geturl() method returns the real URL of the page fetched, which is particularly useful when urlopen() has followed a redirect; in such circumstances, the URL of the page fetched may not be the same as the URL requested. The info() method returns a dictionary-like object that describes the page fetched (the headers sent by the server). It is important to call the close() method on this response after use, just as with any other file-like object.
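The following is a minimal sketch of fetching a page with urlopen(); the URL is just an illustrative example and any reachable address would do.

import urllib2

response = urllib2.urlopen('http://www.example.com/')
print response.geturl()   # real URL of the page fetched (after any redirects)
print response.info()     # headers sent by the server
html = response.read()    # the page content as a string
print len(html)
response.close()          # close the response like any other file-like object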


Request – Response

The HTTP protocol works in a request-response manner where the client sends a request to the server and the server returns a response. urllib2 provides a Request object for this: a Request is constructed for a particular URL and can then be passed to the urlopen() method. The Request constructor can also accept an additional data argument to send extra information to the server (for example, form data) and a headers argument to identify the client.
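The following is a minimal sketch of building a Request object; the URL and the form fields are illustrative assumptions, not a real endpoint.

import urllib
import urllib2

values = {'name': 'Hansika', 'location': 'LK'}
data = urllib.urlencode(values)            # encode the extra data to send
headers = {'User-Agent': 'Mozilla/5.0'}    # identify the client to the server

req = urllib2.Request('http://www.example.com/form', data, headers)
response = urllib2.urlopen(req)            # with a data argument, the request is sent as a POST
print response.read()
response.close()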


Exception Handling

The urlopen() method raises URLError when it cannot handle a response. URLError is often raised because there is no network connection (no route to the specified server) or because the specified server doesn't exist. In this case, the exception raised has a 'reason' attribute, which is a tuple containing an error code and a text error message.

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs. If an error occurs when trying to access an HTTP URL, the server responds by returning an HTTP error code and an error page.
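The sketch below shows how both exceptions can be handled; HTTPError is checked first because it is a subclass of URLError, and the URL is a placeholder.

import urllib2

try:
    response = urllib2.urlopen('http://www.example.com/nonexistent-page')
except urllib2.HTTPError as e:
    print 'The server returned an error code:', e.code
except urllib2.URLError as e:
    print 'Failed to reach the server. Reason:', e.reason
else:
    print response.read()
    response.close()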


Basic Authentication

Accessing certain URLs requires authentication. When such authentication is required, the server sends a header (along with the 401 error code) requesting authentication. This header specifies the authentication scheme and a 'realm'. An example header looks as follows.
WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate username and password for the realm included as a header in the request. This is 'basic authentication'. To perform basic authentication with urllib2, a password manager is first created to handle the mapping of URLs and realms to usernames and passwords; HTTPPasswordMgr or HTTPPasswordMgrWithDefaultRealm is used for this purpose. We pass the username, password and the requested URL to the password manager using its add_password() method. Then we create an instance of HTTPBasicAuthHandler from the password manager, and an opener object that uses this handler. The open() method of this opener can be used to fetch the URLs that request authentication. Finally, the opener object can be installed so that all calls to the urlopen() method of the urllib2 module use this custom opener with the authentication details embedded.
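The following is a minimal sketch of these steps; the URL, username and password are placeholder assumptions.

import urllib2

# Create a password manager and register the credentials for the top-level URL
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://www.example.com/', 'username', 'password')

# Build a handler from the password manager and an opener that uses the handler
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)

# Either use the opener directly ...
print opener.open('http://www.example.com/protected/').read()

# ... or install it so that all later calls to urllib2.urlopen() use it
urllib2.install_opener(opener)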


etree Module:


etree is a module from the lxml library that can be used to process XML documents. I referred to the tutorial at http://lxml.de/tutorial.html to learn about etree. etree is also commonly used to scrape content from web pages. The following statement imports the module in Python.
from lxml import etree

An Element is the main container object for the ElementTree API. Most of the XML tree functionality is accessed through this class. I tried the following commands to play with the etree module and learn about it.


Creating XML Tree Structure

Elements can be created using the Element() method, which accepts the tag of the element as an argument. The SubElement() method can be used to create child elements and attach them to parent elements; it takes as arguments the parent node and the tag of the new child node as a string. A created XML tree can be printed using the tostring() method, which takes as an argument the root of the XML tree. tostring() can accept an additional argument of the form pretty_print=True in order to print the XML tree in a nicely indented, hierarchical manner as in the following example.

<root>
      <child1/>
      <child2/>
      <child3/>
</root>

Furthermore, the iselement() method takes an argument and checks whether it is an element.

Elements can also carry attributes. To assign attributes, the Element() method can accept additional keyword arguments of the form attribute_name="attribute_value". Elements can also carry text, which is set via the text attribute of an element.
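The following is a minimal sketch of building the tree shown above; the attribute and text values are illustrative.

from lxml import etree

root = etree.Element('root')                          # create the root element
etree.SubElement(root, 'child1')                      # attach child elements
etree.SubElement(root, 'child2')
child3 = etree.SubElement(root, 'child3', interesting='totally')  # with an attribute
child3.text = 'some text'                             # elements can also carry text

print etree.iselement(root)                           # True
print etree.tostring(root, pretty_print=True)         # nicely indented XML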


Using Xpath to Read the Text of Elements 

The etree.XPath() class can be called with the expression "string()" or "//text()" to create a function that extracts the text content of elements. With the "string()" expression, the text content of all the elements is returned as one concatenated string. With the "//text()" expression, the text is returned as a list.
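The following is a minimal sketch of the two forms applied to a small hand-built tree.

from lxml import etree

root = etree.XML('<root>Hello<child>World</child></root>')

build_text = etree.XPath('string()')    # concatenates all the text into one string
find_text = etree.XPath('//text()')     # collects the text nodes into a list

print build_text(root)                  # 'HelloWorld'
print find_text(root)                   # ['Hello', 'World']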


Parsing XML Files

There are three methods that can be used for this, namely the fromstring() method, the XML() method and the parse() method. The fromstring() and XML() methods take a string of XML data as an argument, parse it and return the root element of the XML tree as an Element object. On the other hand, the parse() method takes its input from files and file-like objects and returns a complete document from that file; what it returns is an ElementTree object rather than an Element object.

Furthermore, we can create custom parsers. For example, I created a custom XMLParser and an HTMLParser. The custom XMLParser can be passed to the XML() method as an additional argument to instruct Python to use this parser when parsing the given string instead of the default parser. Similarly, the custom HTMLParser can be passed to the parse() method as an additional argument.
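The following is a minimal sketch of the three parsing entry points and the custom parsers; the sample strings are illustrative, and parse() would normally be given a real file.

from lxml import etree
from StringIO import StringIO

xml = '<root><child>data</child></root>'

root1 = etree.fromstring(xml)                 # returns the root Element
root2 = etree.XML(xml)                        # same, intended for XML literals
tree = etree.parse(StringIO(xml))             # returns an ElementTree object
print tree.getroot().tag                      # 'root'

# Custom parsers are passed in as an extra argument
xml_parser = etree.XMLParser(remove_blank_text=True)
root3 = etree.XML('<root>  <child/>  </root>', xml_parser)

html_parser = etree.HTMLParser()
html_tree = etree.parse(StringIO('<p>Hello</p>'), html_parser)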


re Module:


This module provides regular expression matching operations in Python. Regular expressions are essentially a small, highly specialized programming language embedded inside Python and made available through the re module. I referred to two tutorials for this, at https://docs.python.org/2/howto/regex.html and https://developers.google.com/edu/python/regular-expressions#basic-patterns

The basic idea behind the re module is to specify the rules for the set of possible strings that we want to match and then ask questions such as "Does this string match the pattern?" or "Is there a match for the pattern anywhere in this string?". As mentioned in the first of the two tutorials above, regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.


Metacharacters

There are a number of metacharacters in the re module that define special patterns to be matched. The following is a list of the metacharacters that re supports; a few illustrative matches follow the list.

. - matches anything except a newline character

^ - matches at the start of a string (or of each line in multiline mode); inside a character class, [^...] matches the complement of the set

* - previous character can be matched zero or more times, instead of exactly once.

+ - matches one or more times

? - matches either once or zero times

{ } - {m,n}, where m and n are decimal integers means there must be at least m repetitions, and at most n. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity

[ ] - character class. For example, [acd] matches either a, c or d

\ - escaping metacharacters (\d, \D, \s, \S, \w, \W)
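The commands below illustrate a few of these metacharacters; the sample strings are my own.

import re

print re.findall(r'c.t', 'cat cot c\nt')       # ['cat', 'cot'] ('.' does not match a newline)
print re.findall(r'ab*', 'a ab abbb')          # ['a', 'ab', 'abbb']
print re.findall(r'ab+', 'a ab abbb')          # ['ab', 'abbb']
print re.findall(r'ab?', 'a ab abbb')          # ['a', 'ab', 'ab']
print re.findall(r'a{1,3}', 'aaaa')            # ['aaa', 'a']
print re.findall(r'[acd]', 'abcd')             # ['a', 'c', 'd']
print re.findall(r'\d+', 'room 42, floor 3')   # ['42', '3']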


Compiling Regular Expressions

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. Shown below is a list of methods that can be called on any pattern object, followed by a short example.

match() - Determine if the RE matches at the beginning of the string. 
search() - Scan through a string, looking for any location where this RE matches. 
findall() - Find all substrings where the RE matches, and return them as a list. 
finditer() - Find all substrings where the RE matches, and return them as an iterator. 
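The following is a minimal sketch of compiling a pattern and calling the four methods above; the pattern and strings are illustrative.

import re

p = re.compile(r'\d+')                  # pattern object matching runs of digits

print p.match('12 drummers')            # matches at the beginning -> a match object
print p.search('counting 12 drummers')  # first match anywhere in the string
print p.findall('12 drummers, 11 pipers, 10 lords')   # ['12', '11', '10']
for m in p.finditer('12 drummers, 11 pipers'):
    print m.group()                     # '12', then '11'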


Methods on Match Object Instances

The match object itself can be printed, and it prints None when there is no match. In addition, the following methods can be called on the objects returned from the match() or search() methods. A short example follows the list.

group() - Return the string matched by the RE 
start() - Return the starting position of the match 
end() - Return the ending position of the match 
span() - Return a tuple containing the (start, end) positions of the match 
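The following is a minimal sketch of these match-object methods; the string is illustrative.

import re

p = re.compile(r'\d+')
m = p.search('order number 12345 confirmed')

if m is not None:
    print m.group()    # '12345'
    print m.start()    # 13
    print m.end()      # 18
    print m.span()     # (13, 18)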


Technique Alternative to Compiling re Objects

The second tutorial mentioned above describes an alternative style of pattern matching that does not explicitly compile regular expressions into pattern objects. It takes the following form.
match = re.search(pat, str)

The search() function takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None.

An 'r' at the start of the pattern string denotes a Python "raw" string. A raw string literal does not process escape sequences the way a normal Python string does: a backslash, \, is taken as meaning "just a backslash", except when it comes right before a quote that would otherwise terminate the literal. This is extremely useful since the syntax of regular expression patterns is already heavy with backslashes.
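The following is a minimal sketch of this style, in the spirit of the Google tutorial.

import re

text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)   # r'...' keeps the backslashes as-is

if match:
    print 'found', match.group()          # found word:cat
else:
    print 'did not find'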


xpath Method:


xpath() is the method most commonly used to extract text data out of web pages. It has already been discussed under the lxml library, but there the discussion was limited to extracting text content from XML tree structures. This explanation addresses extracting text data from HTML tree structures using the xpath() method.

Every bit of content on an HTML web page has its own unique XPath. In simple terms, it is the path to the data that we need to extract. As mentioned before, the urlopen() method returns a file-like object containing the content of the requested web page. A custom HTML parser can parse this file-like response into an HTML tree structure. The xpath() method can then be called on this tree with the XPath of the required data as an argument, and it returns the text content of the specified data element (as a list of matching results).

Getting the XPath in Google Chrome is not that hard. All we have to do is go to the required web page and open the developer tools in the browser. Next, move to the Elements tab. Then we can simply right-click on the data that we want and select Copy XPath from the menu. This is the XPath that we pass to the xpath() method as an argument.
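The following is a minimal sketch combining urllib2 and lxml to pull one piece of text out of a web page. The URL and the XPath expression are placeholder assumptions; the real XPath would be whatever Chrome's Copy XPath gives for the element you want.

import urllib2
from lxml import etree

response = urllib2.urlopen('http://www.example.com/')
parser = etree.HTMLParser()
tree = etree.parse(response, parser)    # parse the file-like response into an HTML tree

# xpath() returns a list of matching results
headings = tree.xpath('//h1/text()')
print headings
response.close()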


Python Dictionaries:


Python dictionaries are heavily used in web scraping, so it is quite important to have a basic understanding of dictionaries in Python. In Python syntax, each key in a dictionary is separated from its value by a colon (:), items are separated from each other by commas (,), and the whole thing is enclosed in curly braces. This is shown in the following example.

dict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

The values can be of any type and do not have to be unique, but the keys need to be unique and must be of an immutable data type such as a string, number or tuple.


Accessing Values

Values of the above dictionary can be accessed using the usual indexing syntax: dict['Name'] returns 'Zara'.


Updating Dictionary

An existing dictionary can be updated either by adding a new entry or modifying the value of an existing entry. 

dict['Age'] = 8                 # update existing entry
dict['school'] = "DPS School"   # add new entry


Deleting Dictionary Elements

Deletion can happen in three ways: we can remove individual elements, clear the entire content of a dictionary, or delete the dictionary itself.

del dict['Name']   # remove entry with key 'Name'
dict.clear()       # remove all entries in dict
del dict           # delete the entire dictionary


Dictionary Functions & Methods

There are several functions and methods defined on dictionaries. Some of them are listed below, followed by a short example.

str(dict) – returns a printable string representation of a dictionary.
len(dict) – gives the number of items in the dictionary.
dict.clear() - removes all the elements of the dictionary.
dict.get(key) – returns the value of the passed key or None if the key is not in the dictionary.
dict.has_key(key) – returns True if the key is in the dictionary, or False otherwise.
dict.values() - returns a list of values in the dictionary. 
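The commands below illustrate these functions and methods on the example dictionary (Python 2 syntax, since has_key() no longer exists in Python 3).

dict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

print str(dict)               # printable string representation (item order is arbitrary)
print len(dict)               # 3
print dict.get('Age')         # 7
print dict.get('School')      # None (key not present)
print dict.has_key('Name')    # True
print dict.values()           # list of values, e.g. [7, 'Zara', 'First'] in some order
dict.clear()
print dict                    # {}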

With the knowledge acquired in these different areas, I will move on to coding the package for scraping Quora user profiles within this week itself.

Sunday, March 20, 2016

Fourth Week of LITG Program

After laying the foundation for programming with Python, this week I was introduced to the world of web scraping. The main objective of my project is to retrieve information such as top questions, followers etc. from the Quora website, given the name of the user as an input. To develop a REST API for this, a Python package first has to be developed to extract information from Quora web pages. This package can then be pushed to the Python Package Index. Therefore, learning about the significance and techniques of web scraping is of extreme importance.

The term web scraping basically refers to extracting useful information from a web page. Web scraping becomes useful in a number of circumstances. Not every website has an API for extracting its data. Even when a website does have such an API, it is often not exposed to the public, and even publicly accessible APIs are sometimes not comprehensive. Therefore, developing an API of your own that scrapes the web gives you more control over the logic of data extraction: you can scrape only what you want and avoid handling the whole lot of data returned by the general-purpose APIs available out there. Data scraped in this way can be quite useful in various analysis tasks.

As instructed by my mentor, I studied two technologies that can be used for web scraping. The first is scraping the web using the Scrapy library. The example I referred to is available at https://github.com/tapasweni-pathak/Talks-and-Workshops/tree/master/PyLadies%20Remote/Code%20Examples, and the video tutorial available at https://www.youtube.com/watch?v=ic8ygbac5lo made it much easier for me to understand this project further. The other project studied is pyhoroscope, developed by my mentor and available at https://github.com/tapasweni-pathak/pyhoroscope. This project uses Beautiful Soup + lxml, two other libraries that can be used for web scraping instead of Scrapy. In the rest of this article I will summarize what I learned through these two projects.

Scrapy is a portable, open source, 100% Python library. It is extensively documented and therefore quite easy to learn. The output of web scraping can also be saved into a JSON, CSV or XML file, which is convenient for a developer. This sample project uses the Cloud9 IDE for developing the scraper. The directory structure of Scrapy projects is shown in the following image.

Scrapy Directory Structure

The auto-generated files such as items.py, settings.py and pipelines.py serve various purposes. The items.py file defines the fields that you want to scrape from the web page, such as the heading. The settings.py file lets you define various settings, such as whether to allow redirection and what to do when 404 pages are encountered. The pipelines.py file does post-processing such as eliminating duplicates. The spiders folder contains the spider files, which hold the logic of how you scrape the web.

The tutorial develops two spiders (crawlers): a simple spider that scrapes http://scrapy.org/ once, and a recursive spider on http://www.cse.iitd.ac.in/~naveen/. For the simple crawler, the spider involves a few important concepts: selectors, the base spider and the parse function. Selectors select certain parts of the HTML source, specified by an XPath expression or a CSS selector (cssselect). XPath is the path to the data that we need to extract; since XPath can be confusing at times, people sometimes use CSS selectors instead. The base spider is the simplest spider, from which every other spider inherits. Its basic task is to request the given start URLs and call the parse method of the spider on each resulting response. The parse method is called to process the response from the website; when the request does not specify a custom callback function, the default callback used by Scrapy is this parse method. Using these concepts I could successfully scrape the Scrapy official website.
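The following is a minimal sketch of a simple (non-recursive) spider in the style described above. The item field and the XPath are illustrative assumptions, not the exact ones from the tutorial, and the code targets the scrapy.Spider base class (the successor of the older BaseSpider in recent Scrapy versions).

import scrapy


class ScrapyOrgSpider(scrapy.Spider):
    name = 'scrapy_org'
    start_urls = ['http://scrapy.org/']

    def parse(self, response):
        # parse() is the default callback; it receives the response for each
        # start URL and extracts the parts selected by the XPath expression
        for heading in response.xpath('//h1/text()').extract():
            yield {'heading': heading}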

On the other hand, the recursive crawler uses CrawlSpider as opposed to the base spider. This crawler lets you define rules for recursive crawling, and those rules involve regular expressions. The crawler searches for the given pattern in the URL, and whenever it finds a URL matching that pattern it follows it and applies the same rules, calling the given callback function on each page. The rule defined for recursive crawling in the example project has the following regular expression for the allowed URLs.

"cse\.iitd\.ernet\.in/\~naveen/.*\.html"

The '.' in the above regex denotes any character and the '*' denotes any number of repetitions. This is the format of the URLs that will be recursively scraped on the domain cse.iitd.ernet.in. In this example a custom callback function was used to parse the response, instead of the parse function which is the default callback used so far. Using these concepts I could successfully perform a sample recursive crawl of the website http://www.cse.iitd.ac.in/~naveen/
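The following is a minimal sketch of a recursive crawler using CrawlSpider and a Rule built around the regular expression quoted above; the custom callback name and the extracted field are illustrative assumptions, and the import paths correspond to recent Scrapy versions.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NaveenSpider(CrawlSpider):
    name = 'naveen'
    start_urls = ['http://www.cse.iitd.ac.in/~naveen/']

    # Follow only the links whose URL matches the pattern, and hand each
    # resulting page to parse_page() instead of the default parse() callback
    rules = (
        Rule(LinkExtractor(allow=(r'cse\.iitd\.ernet\.in/\~naveen/.*\.html',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        yield {'title': response.xpath('//title/text()').extract_first()}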

The other project, pyhoroscope, developed using Beautiful Soup + lxml, is a Python module that can be used to fetch and parse data from GaneshaSpeaks (http://www.ganeshaspeaks.com/). This project uses the HTML parser included in Python's standard library, along with etree from the lxml library, to parse the response. It has different methods to get today's horoscope, the weekly horoscope, the monthly horoscope, the yearly horoscope, or to learn all about a given sunsign. The parsed data is stored as a dictionary, and this dictionary is returned on method invocation.

With this knowledge of web scraping, I will head towards developing a Python module to scrape Quora during the next week. At the end of the week this module will be pushed to the Python Package Index as a package so that it can be used by any API thereafter.

Thursday, March 17, 2016

Third Week of LITG Program

The first three weeks in the schedule of the LITG program were dedicated to learning Python, the new programming language that I am about to explore throughout the program. Grasping the basics was not hard with the proper guidance of my mentor. Also, as a developer who has coded in several other languages such as Java, C and C# before, I realized that the fundamental concepts and structures do not differ much among these programming languages. Still, laying a firm foundation in Python is of utmost importance to excel in the LITG program. As instructed by my mentor, I referred to the book 'Learn Python the Hard Way', available at http://learnpythonthehardway.org/book/.

This book is an excellent guide for any newbie who wants to start learning to code in Python. It consists of a series of exercises ranging from beginner level to more advanced programming concepts such as inheritance and composition. The book starts with very basic ideas such as adding comments and working with mathematical operators. The way the author has structured the book from easy to difficult tasks keeps the learner enthusiastic to proceed. The author stresses not to copy and paste the code samples from the book, but to type them out yourself to enhance the learning experience. This indeed helped me, since typing is the best way to get familiar with the syntax of any new programming language. Each exercise also includes a set of study drills which helped me further improve my knowledge of that particular exercise; these study drills are extensions of the main exercise that instruct the learner to play more with the facts learned in each exercise. The quizzes evaluated my learning, showing what I still needed to find out and also what areas I was already proficient in. As a whole, this book can be recommended both to beginner-level programmers and to more advanced programmers who are new to Python.

Now I will summarize the Python learning experience that I gained through this book. All Python files are saved with the extension .py. Python is an object-oriented programming language similar to Java, meaning that it has all the common OOP concepts such as classes, objects, functions, inheritance and composition. Apart from that, it is also a scripting language. Another important fact that I learned about Python is that it has modules which can be imported into any Python script. A module is a Python file with some functions or variables in it; once imported, the functions and variables of the module can be used within the Python script. A class in Python is also like a mini module. Every class also inherits from a class named object, which felt quite strange and different from other programming languages such as Java. Inheritance (is-a) and composition (has-a) are two prominent relationships within any Python class. However, the author of the book thoroughly discourages the use of multiple inheritance due to its complexity.

The exercises of this book that I completed are available in the github repository https://github.com/hansika/LearnPython. My usual practice is to add comments wherever necessary to make the concepts clearer and also to make it convenient for me when I come back and look at these exercises after some time. I did as many study drills as possible to grasp the concepts more thoroughly. Whenever I found new keywords or new built-in functions, I searched online for more information about them. I recorded these extra facts in a separate file named information.txt, which can also be found inside the same github repository; I can refer to this file whenever I need to brush up my knowledge.

The path ahead in the LITG program will be challenging. Yet, with a firm foundation in the language basics, coping with those challenges will be much easier. I am looking forward to exploring more of Python and enlarging my knowledge base with the new facts I come across every day.

Wednesday, March 16, 2016

Second Week of LITG Program

As the second week of the LITG program started off, my main task was to identify the significance of the various open source licenses available and to add an appropriate license to my project. As instructed by my mentor, I referred to the website http://choosealicense.com/ to learn about three types of open source licenses, namely the MIT License, the Apache License 2.0 and the GNU GPLv3. My learning about these licenses is summarized in this article.

The first license studied was the MIT License. It is a free software license originating at the Massachusetts Institute of Technology (MIT). It is a comparatively small and straightforward license which lets a person do anything with your code as long as they provide attribution back to you, the original owner of the code, and do not hold you liable for any misdeed. This includes the freedom for all activities including copying, modifying, merging, publishing or even selling copies of the software, on the condition that the original license is included somewhere in the derivative.

The second license studied was the Apache License 2.0. This license is similar to the MIT License except for a few additional restrictions. If another person changes any Apache-licensed code, they must state so. It also has restrictions on using the project name. Another prominent feature of the Apache License is that it includes terms for dealing with patents. Apart from these exceptions, the Apache License is almost the same as the MIT License, allowing users to redistribute derivatives of the original code under any license of their preference.

The next license studied was the GNU General Public License (GPL) v3. This adds considerably more restrictions than the Apache License 2.0. It is a copyleft license, in that it requires any user who redistributes your code or its derivatives to distribute it under the same license; therefore the derivatives too need to be free and open source. Apart from that, this license also includes terms for patent rights.

Based on the understanding gained of the aforementioned three licenses, I decided to use the Apache License 2.0 for my project. The GNU GPLv3 was rejected in the first place because it requires the users of my code to redistribute their work under the same license. This restricts the freedom of developers to use my code and build on it as they wish. My purpose is to develop some new functionality and make it available for use by all kinds of enthusiastic developers out there, regardless of whether their work is free or not; software needs to be reusable beyond such boundaries. The choice was then between the MIT License and the Apache License 2.0. These two licenses are almost similar, yet I decided to choose the Apache License over the MIT License because it includes additional terms for patent rights, which helps protect the originality of my work. With this license, the users of my code can use it at their will (to modify, copy, redistribute) and apply a license of their choice to their derivatives, while still providing attribution for the original work back to me.

With the learning gained during the second week of the LITG program, I can now easily select the software license that best fits my needs. This will indeed be helpful for me ahead in my career as a software developer.