The fifth week of the LITG program was a hectic one. To clear the backlog from the previous weeks, I had to work harder on a tight schedule. Before moving on to developing the Python package to scrape Quora user profiles, my mentor instructed me to explore several Python tools that are commonly used to extract data from websites: the urllib2 module, the lxml library with its etree module and xpath() method, and the re module. I also learned about dictionaries in Python. This article summarizes what I learned on these topics. The simple commands that I tried for each of these modules are available in the GitHub repository at https://github.com/hansika/LITG_Practice.
urllib2 Module:
urllib2 is a Python module that defines functions and classes for opening URLs (Uniform Resource Locators). I referred to the tutorial at https://docs.python.org/2/howto/urllib2.html on the official Python website and tried out some common urllib2 commands to get a basic understanding. The module is most commonly used with HTTP URLs, but it can also be used with other protocols such as HTTPS, FTP and Gopher. Apart from opening URLs, the module offers a slightly more complex interface for handling common situations such as basic and digest authentication, redirections and cookies. These features are provided by objects called handlers and openers.
Get Request
urlopen() is the function used to fetch the content of a particular URL. The response it returns is a file-like object with several methods defined on it. The read() and readlines() methods give access to the data of the remote resource. The geturl() method returns the real URL of the page fetched. This is particularly useful when urlopen() has followed a redirect, because in that case the URL of the page fetched may not be the same as the URL requested. The info() method returns a dictionary-like object that describes the page fetched (the headers sent by the server). As with any other file-like object, it is important to call the close() method on the response after use.
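The methods above can be tried without a network connection. The following is a minimal sketch, assuming a temporary local file opened through a file: URL stands in for a web page. The try/except import is my own addition so that the same code also runs on Python 3, where the urllib2 API lives in urllib.request.

```python
import os
import tempfile

try:
    import urllib2                          # Python 2, as used in this article
    from urllib import pathname2url
except ImportError:                         # Python 3: same API in urllib.request
    import urllib.request as urllib2
    from urllib.request import pathname2url

# Create a small local file so the example needs no network access.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"hello from a local file\n")
url = "file:" + pathname2url(tmp.name)

response = urllib2.urlopen(url)
try:
    body = response.read()          # the raw bytes of the resource
    final_url = response.geturl()   # the URL actually fetched
    headers = response.info()       # dictionary-like object holding the headers
finally:
    response.close()                # release the handle, as with any file
os.remove(tmp.name)

print(body)
print(final_url)
print(headers["Content-Length"])
```

With an HTTP URL the calls are the same; only the headers returned by info() are richer.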
Request – Response
The HTTP protocol works in a request-response manner: clients send requests to the server and the server returns responses. urllib2 mirrors this with the Request class. A Request object represents the HTTP request for a particular URL and can be passed to urlopen() in place of a plain URL string. The Request constructor accepts an additional data argument for sending more information to the server, such as the values of a form (a name, for example). This data has to be URL-encoded first, and when it is supplied the request is sent as a POST instead of a GET.
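A minimal sketch of building such requests is shown below. The URL and the form fields are hypothetical, and the request is only constructed, not sent, so that the example stays offline. The import fallback is again an assumption added for Python 3.

```python
try:
    import urllib2                          # Python 2
    from urllib import urlencode
except ImportError:                         # Python 3 equivalents
    import urllib.request as urllib2
    from urllib.parse import urlencode

url = "http://www.example.com/register"     # hypothetical endpoint
values = {"name": "Jane"}                   # hypothetical form field

# The data argument must be URL-encoded (and must be bytes on Python 3).
data = urlencode(values).encode("ascii")

get_request = urllib2.Request(url)          # no data: a GET request
post_request = urllib2.Request(url, data)   # with data: a POST request

print(get_request.get_method())
print(post_request.get_method())
print(post_request.get_full_url())

# response = urllib2.urlopen(post_request) would actually send the request.
```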
Exception Handling
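urlopen() raises URLError when it cannot handle a response. Often URLError is raised because there is no network connection (no route to the specified server), or because the specified server doesn’t exist. In this case, the exception raised will have a ‘reason’ attribute, which is a tuple containing an error code and a text error message.
HTTPError is the subclass of URLError raised in the specific case of HTTP URLs. Every HTTP response from the server carries a status code. When that code indicates an error that urlopen() cannot handle by itself, it raises an HTTPError, which has an integer ‘code’ attribute holding the HTTP error code. The exception object can also be read like a response to get the error page returned by the server. Because HTTPError is a subclass of URLError, it has to be caught first when both are handled.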
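The handling order described above could look roughly like the sketch below. The fetch() helper and its return values are my own illustration, not part of urllib2, and a file: URL that points to a missing file is used to trigger URLError without touching the network.

```python
try:
    import urllib2                      # Python 2
except ImportError:                     # Python 3: same names in urllib.request
    import urllib.request as urllib2


def fetch(url):
    """Hypothetical helper: return a (status, detail) pair for the URL."""
    try:
        response = urllib2.urlopen(url)
    except urllib2.HTTPError as err:    # subclass, so it must be caught first
        return ("http error", err.code)
    except urllib2.URLError as err:     # no connection, unknown server, ...
        return ("url error", err.reason)
    try:
        return ("ok", response.read())
    finally:
        response.close()


# A file that does not exist makes urlopen() raise URLError.
result = fetch("file:///no/such/file/for/this/example.txt")
print(result[0])
```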
Basic Authentication
To access certain URLs authentication is required. When such authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a ‘realm’. An example header looks as follows.
WWW-Authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is ‘basic authentication’. To use basic authentication with urllib2, we first create a password manager, which handles the mapping of URLs and realms to usernames and passwords. HTTPPasswordMgr or HTTPPasswordMgrWithDefaultRealm is used for this purpose. We pass the username, the password and the requested URL to the password manager using its add_password() method. Then we create an instance of HTTPBasicAuthHandler from the password manager, and use build_opener() to create an opener object that uses this handler. The open() method of this opener can be used to fetch URLs that require authentication. The opener can also be installed with install_opener(), so that all later calls to urllib2.urlopen() use this custom opener with the authentication details embedded.
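The steps above can be sketched as follows. The URL, username and password are placeholders, and nothing is fetched, so the code only shows how the objects are wired together.

```python
try:
    import urllib2                      # Python 2
except ImportError:                     # Python 3: same names in urllib.request
    import urllib.request as urllib2

top_level_url = "http://example.com/foo/"   # placeholder URL
username = "user"                           # placeholder credentials
password = "secret"

# The password manager maps URLs (and realms) to credentials.
# Passing None as the realm makes these credentials the default for the URL.
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, top_level_url, username, password)

# The handler answers 401 challenges using the password manager.
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)

# opener.open(some_url) would fetch a protected page directly.
# Installing the opener makes plain urllib2.urlopen() use it as well.
urllib2.install_opener(opener)

# Pages below the registered URL resolve to the same credentials.
credentials = password_mgr.find_user_password(
    None, "http://example.com/foo/page.html")
print(credentials)
```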
etree Module:
etree is a module from the lxml library that can be used to process XML and HTML documents. I referred to the tutorial at http://lxml.de/tutorial.html to learn about etree. It is often used to scrape content from web pages. The following statement imports the module in Python.
from lxml import etree
An Element is the main container object for the ElementTree API. Most of the XML tree functionality is accessed through this class. I tried the following commands to play with the etree module and learn about it.
Creating XML Tree Structure
Elements can be created using the Element() function, which accepts the tag of the element as an argument. The SubElement() function creates a child element and attaches it to a parent element. It takes as arguments the parent node and the tag of the new child node as a string. An XML tree can be serialized with the tostring() function, which takes the root of the XML tree as an argument. tostring() accepts an additional argument of the form “pretty_print=True” to produce the XML tree in a nicely indented hierarchical form, as in the following example.
<root>
  <child1/>
  <child2/>
  <child3/>
</root>
Furthermore, the iselement() function checks whether its argument is an element.
Elements can also carry attributes. To assign attributes to an element, the Element() function accepts additional keyword arguments of the form attribute_name=“attribute_value”. Elements can carry text as well, which is set through the text attribute of the element.
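The commands described in this section can be put together in a short sketch like the one below. It assumes lxml is installed, and the attribute name and text value are made up for illustration.

```python
from lxml import etree

# Create the root element with one attribute passed as a keyword argument.
root = etree.Element("root", interesting="totally")

# Attach three child elements to the root.
for tag in ("child1", "child2", "child3"):
    etree.SubElement(root, tag)

# Children can be reached by index; text is set through the text attribute.
root[0].text = "some text"

print(etree.iselement(root))        # is root an element?
print(root.get("interesting"))      # read the attribute back
print(len(root))                    # number of children

# tostring() returns bytes on Python 3, so decode before printing.
print(etree.tostring(root, pretty_print=True).decode("ascii"))
```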
Using Xpath to Read the Text of Elements
The XPath class of the etree module can be called with the argument “string()” or “//text()” to create a function that extracts the text content of elements. When created with the “string()” argument, the function returns the text content of all the elements as one concatenated string. When created with the “//text()” argument, it returns the text as a list of separate chunks.
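The difference between the two expressions is easiest to see on a tiny tree. The sketch below follows the example from the lxml tutorial, with variable names of my own choosing.

```python
from lxml import etree

# Build <html><body>TEXT<br/>TAIL</body></html> by hand.
html = etree.Element("html")
body = etree.SubElement(html, "body")
body.text = "TEXT"
br = etree.SubElement(body, "br")
br.tail = "TAIL"                 # text that follows the <br/> element

# etree.XPath compiles an expression into a reusable function.
as_string = etree.XPath("string()")
as_list = etree.XPath("//text()")

whole_text = as_string(html)     # 'TEXTTAIL'
text_chunks = as_list(html)      # ['TEXT', 'TAIL']

print(whole_text)
print(text_chunks)
```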
Parsing XML Files
Three functions can be used for this, namely fromstring(), XML() and parse(). fromstring() and XML() take a string of XML data as an argument, parse it and return the root element of the XML tree as an Element object. parse(), on the other hand, takes its input from files and file-like objects and returns the complete document. What it returns is an ElementTree object rather than an Element object.
Furthermore, we can create custom parsers. For example, I created a custom XMLParser and an HTMLParser. The custom XMLParser can be passed to the XML() function as an additional argument, so that the given string is parsed with this parser instead of the default one. Similarly, the custom HTMLParser can be passed to the parse() function as an additional argument.
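A minimal sketch of the three parsing functions and the two custom parsers follows. The XML and HTML snippets are made up, BytesIO objects stand in for real files, and remove_blank_text is just one example of an option a custom XMLParser can carry.

```python
from io import BytesIO

from lxml import etree

xml_data = b"<root>  <a/>  <b>text</b>  </root>"

# fromstring() and XML() both return the root Element.
root_from_string = etree.fromstring(xml_data)
root_from_xml = etree.XML(xml_data)

# parse() reads from a file or file-like object and returns an ElementTree.
tree = etree.parse(BytesIO(xml_data))
root_from_tree = tree.getroot()

# A custom XML parser that drops whitespace-only text between tags.
xml_parser = etree.XMLParser(remove_blank_text=True)
compact_root = etree.XML(xml_data, xml_parser)
print(etree.tostring(compact_root))

# A custom HTML parser passed to parse() as the second argument.
html_parser = etree.HTMLParser()
html_file = BytesIO(b"<html><body><p>Hello</p></body></html>")
html_tree = etree.parse(html_file, html_parser)
print(html_tree.getroot().tag)
```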
re Module:
This module provides regular expression support in Python. Regular expressions are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. I referred to two tutorials for this, at https://docs.python.org/2/howto/regex.html and https://developers.google.com/edu/python/regular-expressions#basic-patterns.
The basic idea behind the re module is to specify the rules for the set of possible strings that we want to match, and then ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?”. As mentioned in the first of the two tutorials, regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.
Metacharacters
There are a number of metacharacters in the re module that have a special meaning inside a pattern. The following is a list of metacharacters that re supports.
. - matches anything except a newline character
^ - matches at the beginning of the string; as the first character inside a character class, as in [^abc], it matches the complement of the class
* - previous character can be matched zero or more times, instead of exactly once.
+ - matches one or more times
? - matches either once or zero times
{ } - {m,n}, where m and n are decimal integers means there must be at least m repetitions, and at most n. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity
[ ] - character class. [abc] matches any one of the characters a, b or c
\ - escapes metacharacters so that they match literally, and introduces special sequences (\d, \D, \s, \S, \w, \W)
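To see the metacharacters in action, here is a small sketch that runs each one against a made-up test string with re.findall(). The patterns and strings are only illustrations.

```python
import re

# . matches any character except a newline
dot = re.findall(r"a.c", "abc a-c a\nc")              # ['abc', 'a-c']

# *, + and ? control how often the previous character may repeat
star = re.findall(r"ab*", "a ab abbb")                # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")                # ['ab', 'abbb']
optional = re.findall(r"ab?", "a ab abbb")            # ['a', 'ab', 'ab']

# {m,n} asks for at least m and at most n repetitions
counted = re.findall(r"ab{2,3}", "ab abb abbbb")      # ['abb', 'abbb']

# [ ] is a character class; a leading ^ inside it complements the class
in_class = re.findall(r"[abc]", "a1b2d")              # ['a', 'b']
not_in_class = re.findall(r"[^abc]", "a1b2d")         # ['1', '2', 'd']

# ^ outside a class anchors the match to the start of the string
at_start = re.findall(r"^\d+", "42 apples 7")         # ['42']

# \d is a special sequence for a digit; \. is an escaped, literal dot
digits = re.findall(r"\d+", "42 apples 7")            # ['42', '7']
literal_dot = re.findall(r"\d\.\d", "3.5 and 3x5")    # ['3.5']

print(dot, star, plus, optional, counted)
print(in_class, not_in_class, at_start, digits, literal_dot)
```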
Compiling Regular Expressions
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. Shown below is a list of methods that can be called on any pattern object.
match() - Determine if the RE matches at the beginning of the string.
search() - Scan through a string, looking for any location where this RE matches.
findall() - Find all substrings where the RE matches, and return them as a list.
finditer() - Find all substrings where the RE matches, and return them as an iterator.
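A short sketch of the four methods on one compiled pattern is given below. The sample sentence is borrowed from the Python regex HOWTO and the variable names are mine.

```python
import re

number = re.compile(r"\d+")     # one or more digits
text = "12 drummers drumming, 11 pipers piping, 10 lords a-leaping"

# match() only succeeds if the pattern matches at the very beginning.
start_match = number.match(text)            # matches '12'
no_match = number.match("lords 10")         # None: the string starts with a letter

# search() scans the whole string for the first match.
first_found = number.search("lords 10")     # matches '10'

# findall() returns all matching substrings as a list.
all_numbers = number.findall(text)          # ['12', '11', '10']

# finditer() yields one match object per match.
positions = [m.span() for m in number.finditer(text)]

print(start_match.group(), no_match, first_found.group())
print(all_numbers)
print(positions)
```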
Methods on Match Object Instances
The match object itself can be printed. When there is no match, match() and search() return None instead of a match object, so None is printed. In addition, the following methods can be called on the objects returned from the match() or search() method.
group() - Return the string matched by the RE
start() - Return the starting position of the match
end() - Return the ending position of the match
span() - Return a tuple containing the (start, end) positions of the match
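The four methods can be tried on a single match object, as in this sketch based on the example in the Python regex HOWTO. Checking for None first avoids calling a method on a failed match.

```python
import re

word = re.compile(r"[a-z]+")        # one or more lowercase letters

# match() fails here because the string starts with ':', not a letter.
failed = word.match("::: message")
print(failed)                       # None

# search() finds the word further along in the string.
m = word.search("::: message")
if m is not None:
    print(m.group())                # 'message'
    print(m.start())                # 4
    print(m.end())                  # 11
    print(m.span())                 # (4, 11)
```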
Technique Alternative to Compiling re Objects
The second tutorial mentioned above describes an alternative method for pattern matching that does not compile regular expressions into pattern objects. This method takes the following form.
match = re.search(pat, str)
The search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object or None otherwise.
In that tutorial the pattern is written with an 'r' prefix, as in r'\d+'. The 'r' at the start of the pattern string denotes that it is a Python "raw" string. A raw string literal does not process escape sequences the way a normal Python string does. A backslash, \, is taken as meaning "just a backslash", except when it comes right before a quote that would otherwise terminate the literal. This is extremely useful since the syntax of regular expression patterns is already heavy with backslashes.
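A minimal sketch of this module-level form, together with the effect of the raw string prefix, is shown below. The sample text follows the Google tutorial, and the variable is called text here to avoid shadowing the built-in str.

```python
import re

text = "an example word:cat!!"

# re.search() compiles the pattern behind the scenes and searches the string.
match = re.search(r"word:\w\w\w", text)
if match:
    found = match.group()       # 'word:cat'
else:
    found = None
print(found)

# In a normal string "\n" is one newline character.
# In a raw string r"\n" is two characters: a backslash and the letter n.
normal_length = len("\n")
raw_length = len(r"\n")
print(normal_length, raw_length)
```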
xpath Method:
xpath() is the method most commonly used to extract text data out of web pages. It was touched on in the etree section above, but that discussion was limited to extracting text content from XML tree structures. This section covers extracting text data from HTML tree structures using the xpath() method.
Every piece of content on an HTML web page can be addressed by an XPath expression. In simple terms, it is the path to the data that we need to extract. As mentioned before, urlopen() returns a file-like object that holds the content of the requested web page. A custom HTMLParser can parse this file-like response into an HTML tree structure. The xpath() method can then be called on this tree with the XPath of the required data as an argument. It returns a list of the matching elements, or a list of strings when the expression selects text, for example when it ends in text().
Getting the XPath in Google Chrome is not hard. All we have to do is go to the required web page and open the developer tools in the browser. Next, move to the Elements tab. Then we can simply right-click on the data that we want and select Copy XPath from the menu. This is the XPath that we pass to the xpath() method as an argument.
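The flow described above can be sketched offline as follows. The HTML snippet and the XPath expressions are made up, and a BytesIO object stands in for the file-like response that urlopen() would return for a real page.

```python
from io import BytesIO

from lxml import etree

# Stand-in for the response of urllib2.urlopen(url); the markup is invented.
page = BytesIO(
    b"<html><body>"
    b"<h1 class='name'>Jane Doe</h1>"
    b"<ul><li>Python</li><li>XML</li></ul>"
    b"</body></html>"
)

# Parse the file-like object into an HTML tree with a custom HTMLParser.
parser = etree.HTMLParser()
tree = etree.parse(page, parser)

# xpath() returns a list; text() at the end selects the text content.
names = tree.xpath("//h1[@class='name']/text()")    # ['Jane Doe']
topics = tree.xpath("//li/text()")                  # ['Python', 'XML']

# Without text() the result is a list of elements.
items = tree.xpath("//li")

print(names)
print(topics)
print(len(items))
```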
Python Dictionaries:
Python dictionaries are heavily used in web scraping, so it is important to have a basic understanding of them. In Python syntax, each key in the dictionary is separated from its value by a colon (:), items are separated from each other by commas (,), and the whole thing is enclosed in curly braces. This is shown in the following example. (In real code it is better to pick a name other than dict, since dict is also the name of the built-in type.)
dict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
The values can be of any type and do not have to be unique. But the keys need to be unique and should be of some immutable data type such as strings, numbers or tuples.
Accessing Values
Values of the above dictionary can be accessed using the usual square-bracket syntax.
dict['Name'] returns 'Zara'
Updating Dictionary
An existing dictionary can be updated either by adding a new entry or modifying the value of an existing entry.
dict['Age'] = 8  # update existing entry
dict['school'] = "DPS School"  # add new entry
Deleting Dictionary Elements
Deleting happens in three ways. We can remove individual elements, clear the entire content of a dictionary, or delete the entire dictionary.
del dict['Name']  # remove entry with key 'Name'
dict.clear()  # remove all entries in dict
del dict  # delete entire dictionary
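The access, update and delete operations can be run end to end as in the sketch below. It uses the name profile instead of dict, which is my own choice to avoid hiding the built-in type.

```python
profile = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

# Accessing a value by its key.
name = profile['Name']              # 'Zara'

# Updating an existing entry and adding a new one.
profile['Age'] = 8
profile['School'] = "DPS School"

# Removing a single entry.
del profile['Name']
remaining_keys = sorted(profile)    # ['Age', 'Class', 'School']
print(name, remaining_keys)

# Removing all entries leaves an empty dictionary behind.
profile.clear()
print(profile)                      # {}

# Deleting the dictionary removes the name itself.
del profile
```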
Dictionary Functions & Methods
There are several functions and methods defined on dictionaries. Some of them are shown below.
str(dict) - returns a printable string representation of the dictionary.
len(dict) - returns the number of items (key-value pairs) in the dictionary.
dict.clear() - removes all the elements of the dictionary.
dict.get(key) - returns the value for the given key, or None if the key is not in the dictionary.
dict.has_key(key) - returns True if the key is in the dictionary, False otherwise (Python 2 only; Python 3 uses "key in dict" instead).
dict.values() - returns a list of the values in the dictionary (a view object in Python 3).
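A brief sketch of these functions and methods follows. It uses the name profile again and the "in" operator in place of has_key(), so that it runs on both Python 2 and Python 3; both of these are my own choices.

```python
profile = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

as_text = str(profile)              # printable representation
size = len(profile)                 # number of key-value pairs: 3

age = profile.get('Age')            # 7
height = profile.get('Height')      # None, because the key is missing

has_name = 'Name' in profile        # portable replacement for has_key()

# list() makes values() behave the same on Python 2 and Python 3.
values = list(profile.values())

print(as_text)
print(size, age, height, has_name)
print(values)

profile.clear()
print(len(profile))                 # 0
```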
With the knowledge acquired from these different areas, I will move on to coding the package for scraping Quora user profiles this week itself.