Sunday, April 3, 2016

Fifth Week of LITG Program - Part 2

As mentioned in the previous blog posts, my main objective during the rest of the fifth week and the beginning of the sixth week was to finish coding the Python module to fetch and parse data from Quora user profiles. This turned out to be quite a hard task, since I ran into several difficulties, but with proper guidance from my mentor I was able to overcome them. The complete package is now available in the GitHub repository at https://github.com/hansika/pyquora.

Since the package is ultimately expected to be uploaded to the Python Package Index, there is a particular directory structure it has to follow. This structure is shown in the following diagram, captured using the PyCharm IDE.

[Figure: Python package directory structure]
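In outline, the layout looks like the following. This is reconstructed from the files discussed below, so the exact nesting in the diagram may differ slightly:

pyquora/
    scrape_quora/
        __init__.py
        pyquora.py
        test.py
    setup.py
    README
    README.md
    LICENSE
    CHANGES.txt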

Each file in this hierarchy plays an important role in a Python package. Let me explain what I learned about these files one by one. __init__.py is used to initialize the Python package. The inclusion of the __init__.py file in a directory indicates to the Python interpreter that the directory should be treated as a Python package. In this case, the directory is named scrape_quora. One common thing to do in the __init__.py file is to import selected classes, functions, etc. at the package level so that they can be conveniently imported from the package. In my package, the __init__.py file imports the Scrape_Quora class from the pyquora.py file, so the Scrape_Quora class becomes accessible at the package level. There are two other files inside the package directory, namely test.py and pyquora.py. The pyquora.py file has the code to scrape the web; in my package, this file has an individual function to scrape each data item. The test.py file has the code to test the features developed inside pyquora.py.
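As a concrete illustration, the __init__.py for this layout needs only one line to lift the class to the package level; this is a minimal sketch, using the explicit relative-import spelling:

# scrape_quora/__init__.py -- expose the main class at the package level
from .pyquora import Scrape_Quora

With this in place, client code can simply write from scrape_quora import Scrape_Quora instead of reaching into the pyquora module directly.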

The setup.py file stores the metadata about the library. This is the most important file in any Python package, and it lives at the root of the package. It serves two primary functions.
  1. It is the file where various aspects of the project are configured. The primary feature of setup.py is that it contains a global setup() function; the keyword arguments to this function are how specific details of the project are defined. These arguments typically include the name of the package, the version, the author, the author's email, the GitHub URL of the package, etc. (see the sketch after this list). 
  2. It is the command-line interface for running various commands that relate to packaging tasks. 
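To make this concrete, here is a minimal setup.py sketch for a package like this one; the field values below are illustrative assumptions rather than the published metadata:

# setup.py -- minimal packaging metadata (illustrative values)
from setuptools import setup

setup(
    name='pyquora',
    version='0.1.0',                        # assumed initial version
    description='Fetch and parse data from Quora user profiles',
    author='<author name>',                 # placeholder author details
    author_email='<author email>',
    url='https://github.com/hansika/pyquora',
    packages=['scrape_quora'],              # the package directory shown above
)

With this file in place, packaging commands such as python setup.py sdist read their configuration from the setup() call, which is what makes setup.py the command-line entry point mentioned above.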
The README and README.md files serve the common purposes of providing a general description of the package, its features, and how to use it. The LICENSE file contains the license under which the code is released; for example, the scrape_quora package carries the Apache License 2.0. The CHANGES.txt file contains the release information for the various versions of the package, including the release date, the version number, and of course the changes made. In my package, the CHANGES.txt file only has details of the initial release. Although not present in the above diagram, Python packages typically have another file named requirements.txt. The purpose of this file is to declare the packages that must be installed for the newly developed package to work.
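For a scraping package like this one, requirements.txt would simply list the HTTP and HTML-parsing libraries it builds on, one per line; the entries below are an assumed example rather than the file's actual contents:

# requirements.txt -- assumed dependencies for a Quora scraper
requests
lxml

Running pip install -r requirements.txt then installs everything the package needs.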

After creating this basic structure for the Python package, I started coding. According to the instructions given by my mentor, I had to extract the following information, given the username of a Quora user profile. 
  • Name of the user 
  • URL 
  • Profile picture link 
  • Follower count 
  • Following count 
  • Count of edits 
  • Number of answers 
  • Number of questions 
I followed a structure similar to that of the horoscope package (https://github.com/tapasweni-pathak/pyhoroscope) when developing mine. This made troubleshooting much easier for me.
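As an illustration of this per-item structure, a first cut at one of these scraping functions might look roughly like the sketch below; the fetch-and-parse steps and the profile URL pattern are my assumptions, and the hard-coded xpath is the naive form that, as described next, breaks across profiles:

# pyquora.py -- naive first attempt at one scraping function (illustrative sketch)
import requests
from lxml import html

def get_name(user_name):
    # Fetch the profile page and parse it into an lxml element tree
    page = requests.get('https://www.quora.com/profile/' + user_name)
    tree = html.fromstring(page.text)
    # Hard-coded xpath copied from one profile's page source; the id portion
    # changes from profile to profile, which is the problem described below
    return tree.xpath('//*[@id="ld_cnuttt_69662"]/div/h1/span/text()')

Each of the other data items (URL, follower count, and so on) would get its own function following the same pattern.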

Yet there was another problem that I faced when scraping the information. Although the code for scraping was more or less similar to that in the horoscope package, I ran into a small issue: the xpath for the same data element differs across user profiles. For example, the xpath for the name of the user turned out to be as follows for two different user profiles. 

//*[@id="ld_cnuttt_69662"]/div/h1/span

//*[@id="ld_txiquk_13340"]/div/h1/span

Initially I had no idea about this, and my code kept giving me an empty list when I scraped the profile name from the profile page. But with the guidance of my mentor, I found a way to generalize this xpath across all profiles using regular expressions. The generalized xpath for retrieving the profile name is as follows. 

//*[re:test(@id, "ld_[a-z]+_\d+", "g")]/div/h1/span/text()

I read the documentation available at http://exslt.org/regexp/index.html and at http://lxml.de/xpathxslt.html#regular-expressions-in-xpath. According to these, xpath supports regular expressions from the EXSLT namespace. There are three regular-expression functions in the EXSLT namespace: regexp:test, regexp:match, and regexp:replace. Of these, the function I used is regexp:test(string, string, string?). It returns true if the string given as the first argument matches the regular expression given as the second argument; the optional third argument is a string of flags to be used by the test. The following statement shows how such a generalized xpath is used in the code to extract the profile name from the Quora profile page.

name = tree.xpath('//*[re:test(@id, "ld_[a-z]+_\\d+", "g")]/div/h1/span/text()', namespaces={'re': Scrape_Quora.regexpNS})[0]
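Putting the pieces together, a self-contained version of this lookup might look as follows; the class and helper structure are assumptions for illustration, but the xpath and the EXSLT namespace URI are the ones described above:

# pyquora.py -- generalised xpath via lxml's EXSLT regular-expression support
import requests
from lxml import html

class Scrape_Quora:
    # Namespace URI for the EXSLT regexp extension, registered under the 're' prefix
    regexpNS = 'http://exslt.org/regular-expressions'

    @staticmethod
    def get_name(user_name):
        # Fetch and parse the profile page (the URL pattern is an assumption)
        page = requests.get('https://www.quora.com/profile/' + user_name)
        tree = html.fromstring(page.text)
        # re:test matches any element whose id fits the ld_<letters>_<digits> pattern,
        # regardless of the profile-specific letters and digits
        return tree.xpath('//*[re:test(@id, "ld_[a-z]+_\\d+", "g")]/div/h1/span/text()',
                          namespaces={'re': Scrape_Quora.regexpNS})[0]

The namespaces argument maps the re prefix used inside the xpath to the EXSLT namespace URI, which is what lets lxml resolve re:test.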

Likewise, I had to use regular expressions to generalize the xpaths for all the data elements to be extracted. While doing this, I was also able to deepen the understanding of xpaths I had acquired over the past week. Finally, I coded the test.py file to test the functionality of all the functions created. 
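A minimal test.py in this spirit might exercise each scraping function against a known profile; the profile username and the assertion below are placeholders rather than the actual test data:

# test.py -- smoke tests for the scraping functions (illustrative sketch)
import unittest
from scrape_quora import Scrape_Quora

class TestScrapeQuora(unittest.TestCase):

    def test_get_name(self):
        # 'Some-User' is a placeholder; a real test would target a known account
        name = Scrape_Quora.get_name('Some-User')
        self.assertTrue(len(name) > 0)

if __name__ == '__main__':
    unittest.main()

Similar test methods would cover the follower count, following count, and the other data items.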

After coding the package, I pushed it to the aforementioned GitHub repository. During the rest of the sixth week, I will be working on pushing this package to the Python Package Index. 
