Saturday, April 23, 2016

Ninth Week of LITG Program

So far in the LITG program I have coded the pyquora package and pushed it to the Python Package Index. I have also coded QuserAPI, which uses the pyquora package to return Quora user profile information in json format via REST calls. Information related to all these steps is included in the blog posts written thus far. The final bit left is to deploy this API onto the Heroku platform and test its functionality. Therefore this week is dedicated to gaining a basic understanding of Heroku, and during the next week I will move on to deploying the API on the Heroku platform. My learning about the Heroku platform is summarized in this blog post. 


The Heroku Platform

Simply put, Heroku provides a cloud-based platform (Platform as a Service - PaaS) for deploying and running modern apps. It is free to get started with and only charges developers as they grow. It is based on a managed container system. A smart container, also known as a dyno, is an instance of the application running and responding to requests. Heroku provides one dyno for free. Heroku also has integrated data services. Developers do not have to discover how to optimally provision a database through trial and error; they have immediate access to a scalable, highly available database with rollback, one that supports their apps and development style. Heroku also comes with a powerful ecosystem. 

The platform relieves developers from infrastructure headaches and lets them focus on developing great apps. The objective of the Heroku platform is to make the process of deploying, configuring, scaling, tuning, and managing apps as simple and straightforward as possible. This makes the Heroku developer experience an app-centric one for software delivery, integrated with today's most popular developer tools and workflows. There are three key aspects of the Heroku platform, described below. 

Heroku Runtime

As mentioned before, Heroku runs all apps inside dynos, which are smart containers on a reliable, fully managed runtime environment. Developers can deploy code written in Node, Ruby, Java, PHP, Python, Go, Scala, or Clojure. The runtime keeps apps running without any manual intervention.

Heroku Developer Experience

The Heroku Developer Experience refers to an app-centric approach to software delivery. Developers can focus solely on creating and continuously delivering applications, without worrying about servers or the underlying infrastructure. They can deploy directly from popular tools like Git, GitHub or Continuous Integration (CI) systems. There is also a web-based Heroku Dashboard which makes it much easier to manage an app and gain insight into its performance.

Data Services and Ecosystem

Heroku Elements lets developers extend their apps with Add-ons and customize their application stack with Buildpacks. Add-ons are third-party cloud services that developers can use to immediately extend their apps with a range of functionality such as data stores, logging, monitoring and much more. Heroku provides two fully managed data service Add-ons, namely Heroku Postgres and Heroku Redis.

With this basic understanding gained, I am looking forward to deploying QuserAPI on Heroku during the next week. When deploying, there will be several other requirements such as the Procfile and the requirements.txt file, which will be discussed in detail in the blog post of the coming week. 

Friday, April 15, 2016

Eighth Week of LITG Program

The task of the 8th week is to code the API (QuserAPI) to return information from Quora user profiles. This API uses the pyquora package developed during the fifth week of the LITG program. I followed the Horoscope-API (https://github.com/tapasweni-pathak/Horoscope-API) developed by my mentor as a reference when coding QuserAPI. The final API is available in the github repository at https://github.com/hansika/QuserAPI. This API consists of the following files. 
  • License.md 
  • Procfile 
  • README.md 
  • requirements.txt 
  • server.py 
Out of these, server.py is the file coded during this week. The License.md and README.md files were added to the project back in the first and second weeks of the program. Of course, the README.md file was modified during this week to include the features of the API. The Procfile and requirements.txt are two files required by the Heroku file structure, where we are expecting to deploy the API over the next weeks. These files will be explained in detail in a blog post in an upcoming week. 

The server.py file was coded similar to the server.py file of Horoscope-API, which was studied in depth during the last week. It uses the Flask Python framework. The file has one method for each feature of the pyquora package. These features and their corresponding methods are as follows. 
  • Quora Profile Name - profile_name_route(user_name)
  • Quora Profile Picture Link - profile_picture_link_route(user_name)
  • Quora Profile URL - url_route(user_name)
  • Number of Questions - no_of_questions_route(user_name)
  • Number of Answers - no_of_answers_route(user_name)
  • Number of Followers - no_of_followers_route(user_name)
  • Number of Following - no_of_following_route(user_name)
  • Number of Edits - no_of_edits_route(user_name)
All the above methods call methods of the pyquora package in order to scrape Quora user profiles. There is also another method named index_route() which returns additional details about the API such as the author, project name, project URL, project issues, base URL and endpoints. Each of these methods has its own route decorator, and all of them use GET requests. Furthermore, all the methods return their results as a json object using the jsonify method of Flask, which was discussed in detail in the blog post of the seventh week. A few example route decorators and their corresponding methods are shown below, followed by a minimal sketch of how one such route could look in Flask. 
  • index_route() - @app.route('/', methods=['GET'])
  • profile_name_route(user_name) - @app.route('/profile/name/<user_name>', methods=['GET'])
  • profile_picture_link_route(user_name) - @app.route('/profile/picture_link/<user_name>', methods=['GET'])
  • url_route(user_name) - @app.route('/url/<user_name>', methods=['GET'])
  • no_of_answers_route(user_name) - @app.route('/profile/number/answers/<user_name>', methods=['GET'])
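To give an idea of the overall pattern, a minimal sketch of one such route is shown below. This is not the exact code of server.py, and scrape_profile_name is a hypothetical method name standing in for whatever pyquora actually exposes for fetching the profile name.

from flask import Flask, jsonify
from scrape_quora import Scrape_Quora  # class exposed at the package level by pyquora

app = Flask(__name__)

@app.route('/profile/name/<user_name>', methods=['GET'])
def profile_name_route(user_name):
    # scrape_profile_name is a hypothetical name; the real pyquora method may differ
    name = Scrape_Quora.scrape_profile_name(user_name)
    return jsonify(name=name)

if __name__ == '__main__':
    app.run()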
The tasks for the next weeks are to read and find out about Heroku, and finally deploy this API on it. Therefore, the next blog post will basically be about what Heroku is. 

Thursday, April 14, 2016

Sixth Week of LITG Program

The main task to be completed during the sixth week is to push the developed scrape_quora package to the Python Package Index, commonly shortened to PyPI. The Python Package Index is a repository of software for the Python programming language. At the time of writing this article, there are 78557 packages in the index. Once you have created some awesome piece of software using Python, you can simply push it to the Python Package Index and let people install it using pip install. You can also use PyPI's test server to test the developed package. Pushing a package to the Python Package Index requires a special directory structure. This was explained in detail in the blog post titled 'Fifth Week of LITG Program - Part 2'. Nevertheless, I re-post the required directory structure in this article as well.

Python Package Directory Structure

Once the package is ready, we need a few other things before pushing the package to the Python Package Index. 

  • User accounts on the PyPI test and live servers:
Before pushing the package directly to the live server, we need to push it to the test server and test it using the pip install command. Therefore, first of all, we need user accounts on both of these servers. 

  • .pypirc configuration file:
This file basically contains the information to authenticate the user with the PyPI test and live servers. On a Linux machine, this configuration file should be in the home directory. 

Once we are done with all these steps, all we have to do is work through a sequence of terminal commands to push the package to the Python Package Index.


1.   Register the package against PyPI's test server. 
python setup.py register -r test 

At this step I ran into several errors. The first was the following.

Traceback (most recent call last):
  File "setup.py", line 10, in <module>
    packages = ['scrape_quora']
  File "/usr/lib/python2.7/distutils/...
    dist.run_commands()
  File "/usr/lib/python2.7/distutils/...
    self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/...
    cmd_obj.run()
  File "/usr/lib/python2.7/dist-...
    _register.run(self)
  File "/usr/lib/python2.7/distutils/...
    self._set_config()
  File "/usr/lib/python2.7/distutils/...
    config = self._read_pypirc()
  File "/usr/lib/python2.7/distutils/...
    current['username'] = config.get(server, 'username')
  File "/usr/lib/python2.7/...
    raise NoSectionError(section)
ConfigParser.NoSectionError: No section: 'pypi # the live PyPI'

I had to do some Google searching and play around with the content of the .pypirc configuration file to resolve this error. My .pypirc file looked as follows at the time of the error.


[distutils] # this tells distutils what package indexes you can push to
index-servers =
    pypi # the live PyPI
    test # test PyPI

[test] # authentication details for test PyPI
repository = 'https://testpypi.python.org/pypi
username = <your_user_name>
password = <your_password>

[pypi] # authentication details for live PyPI
repository = https://pypi.python.org/pypi
username = <your_user_name>
password = <your_password>

I had to remove the two comments next to the pypi and test entries under index-servers to resolve the aforementioned error. After that I got a different error. 

Registering scrape_quora to 'https://testpypi.python.org/pypi 
Server response (500): <urlopen error unknown url type: 'https> 

For this too, I did some Google searching but could not find a workaround. Later it turned out that I had inserted an unwanted single quote at the beginning of the test server URL in the repository line of the [test] section. After removing it, the error was easily resolved. The next error encountered was related to the authentication details; it was basically a 401 authentication failed error. The PyPI live and test servers provide the option for users to log in with their gmail accounts. Therefore I first used my gmail address as the username and its password as the password to log in to the PyPI live and test servers, and included these details in the .pypirc file. Later my mentor instructed me to create user accounts on both websites instead of using the default gmail credentials. Once this was done, I changed the credentials in the .pypirc file to the new login details. After that, the 401 error was resolved and I could successfully register the package against PyPI's test server.
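For reference, after removing the two comments and the stray single quote, my .pypirc file took roughly the following form (the username and password placeholders are of course replaced with the actual account credentials).

[distutils] # this tells distutils what package indexes you can push to
index-servers =
    pypi
    test

[test]
repository = https://testpypi.python.org/pypi
username = <your_user_name>
password = <your_password>

[pypi]
repository = https://pypi.python.org/pypi
username = <your_user_name>
password = <your_password>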

2.   Upload the package to PyPI's test server to test. 

python setup.py sdist upload -r test 

This command did not give any errors, and I could successfully upload the package to PyPI's test server. But after uploading, there was a small issue with the format of the README file. I had also previously been asked to add more test cases to the package. So I deleted the already uploaded package, made these two modifications, and tried to re-upload the package with the same version number (0.1.0). Then I got the following error.

Submitting dist/scrape_quora-0.1.0.tar.gz to https://testpypi.python.org/pypi 
Upload failed (400): This filename has previously been used, you should use a different version. 

The version number of the package takes the following format. 

<major>.<minor>.<patch>

So I had to change the patch number once for every modification. After adding one more test case, the patch number was changed to 1. Then, after the change to the README file, the patch number was changed to 2, making the final version number 0.1.2. The sequence of version changes was recorded in the CHANGES.txt file, and the setup.py file was updated accordingly. Once this was done, uploading the package to the test server went fine. The package is accessible on the test server at the URL https://testpypi.python.org/pypi/scrape_quora/0.1.2
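Concretely, the version lives in the setup() call of setup.py, so each bump amounts to editing that one keyword argument. A rough sketch (not the exact file) is shown below.

from distutils.core import setup

setup(
    name='scrape_quora',
    version='0.1.2',  # <major>.<minor>.<patch>; bumped 0.1.0 -> 0.1.1 -> 0.1.2 for the two fixes
    packages=['scrape_quora'],
)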

3.   Install package from test PyPI server. 

pip install -i test scrape_quora 

I could successfully install the package from the test server using the above command. 

After testing with PyPI's test server, I went ahead and uploaded the package to PyPI's live server. The following three commands were used for the process. 

1.   Register the package against PyPI's live server. 

python setup.py register -r pypi 


2.   Upload the package to the live server of PyPI. 

python setup.py sdist upload -r pypi 


3.   Install the package on the machine. 

pip install scrape_quora 


The uploaded Python package is available at the URL https://pypi.python.org/pypi/scrape_quora/0.1.2. I also created a small Python file to test some of the features of the package, and all of those features worked fine. 

During the next weeks I will be acquiring a basic understanding of how to develop an API using the Flask framework, and I will then create an API that uses this package to retrieve Quora user account information as and when needed. 

Sunday, April 10, 2016

Seventh Week of LITG Program

The remaining tasks to be completed over the next few weeks are to code the API for returning Quora user profile information and to deploy it to Heroku. Prior to this, I was instructed by my mentor to acquire a fundamental understanding of how to work with Flask, which is a Python web framework. It provides the developer with tools, libraries and technologies for building a web application such as a blog, a wiki page or even a commercial website. Flask is a micro-framework. Micro-frameworks are normally frameworks with little to no dependency on external libraries. The framework is light, meaning that there are few dependencies to update, but it also means that sometimes the developer has to do more work himself. Flask has basically two dependencies, namely Werkzeug, a WSGI utility library, and Jinja2, its template engine. 

I followed the documentation available at http://flask.pocoo.org/docs/0.10/quickstart/#http-methods to learn about Flask and experimented with several of its features. These exercises were pushed to the github repository at https://github.com/hansika/Flask_Learning

Using Flask in our Applications

First of all, if we want to use Flask in our web applications, we need to import Flask using the following statement.

from flask import Flask

The next task is to create an instance of the class Flask. The first argument is the name of the application's module or package. If it is a single module, the argument should be __name__. 

app = Flask(__name__)

To run the local server with our application we use the run() function as in the following statement. 

if __name__ == '__main__':
            app.run()

if __name__ == '__main__': makes sure the server only runs if the script is executed directly by the Python interpreter and not when it is imported as a module. Once these initial statements are in place, we are good to go ahead and explore the other features of Flask.
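Putting these pieces together, a minimal runnable example looks roughly like this: a single-module app with one route, in the style of the quickstart.

from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    # a single view function bound to the root URL
    return 'Hello, Flask!'

if __name__ == '__main__':
    app.run()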

Debug Mode

As changes are made to the code, we need to restart the server. But when debug mode is on, the server reloads itself whenever the code changes. This can be done in either of the following two ways.

app.debug = True
app.run()

or

app.run(debug=True)

Routing

The route() decorator is used to bind a function to a URL. Following are a few examples.

@app.route('/')
When running on localhost port 5000, this function can be called by using the URL http://127.0.0.1:5000/ 

@app.route('/hello')
This function can be called by using the URL http://127.0.0.1:5000/hello
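For completeness, these two rules could be bound to view functions like so:

@app.route('/')
def index():
    return 'Index Page'

@app.route('/hello')
def hello():
    return 'Hello, World!'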

Variable Rules

Variable rules are used to make certain parts of the URL dynamic. These special parts are marked in the route as <variable_name> and are passed to the function as keyword arguments. An example is shown below. 

@app.route('/user/<username>')

Optionally, a converter can be used by specifying a rule with <converter:variable_name>, as in the following example.

@app.route('/post/<int:post_id>')
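A sketch with handler functions for both kinds of rules, following the quickstart examples:

@app.route('/user/<username>')
def show_user_profile(username):
    # the value captured from the URL is passed in as the keyword argument 'username'
    return 'User %s' % username

@app.route('/post/<int:post_id>')
def show_post(post_id):
    # the int converter ensures post_id arrives as an integer
    return 'Post %d' % post_id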

Redirection Behavior

Consider the following example.

@app.route('/projects/')

Here, the URL with a trailing slash is similar to a folder on a file system. When we try to access it without the trailing slash, Flask redirects to the canonical URL with the trailing slash. 

@app.route('/about')

In this example there is no trailing slash, which makes the URL similar to the pathname of a file on UNIX-like systems. Trying to access this URL with a trailing slash produces a 404 “Not Found” error.

URL Building

To build a URL to a specific function we can use the url_for() function. It accepts the name of the function as its first argument and a number of keyword arguments, each corresponding to a variable part of the URL rule. Unknown variable parts are appended to the URL as query parameters. Building URLs in this manner is more beneficial than hard coding them, especially when they need to change: they can then be changed in one go, without having to remember to update URLs all over the place. Two examples are shown below.

url_for('login', next='/') - 'next' appended as a query parameter.
url_for('profile', username='John Doe') – username sent to the dynamic part of the URL.
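A small sketch of how these calls behave, assuming 'login' and 'profile' view functions exist; url_for() needs a request or application context, hence the test_request_context() helper:

from flask import Flask, url_for

app = Flask(__name__)

@app.route('/login')
def login():
    return 'login page'

@app.route('/user/<username>')
def profile(username):
    return 'profile of %s' % username

with app.test_request_context():
    print(url_for('login', next='/'))               # -> /login?next=/
    print(url_for('profile', username='John Doe'))  # -> /user/John%20Doe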

Static Files

Dynamic web applications also need static files such as JavaScript and CSS files. During development, Flask can serve these static files as well. All we have to do is create a folder called static in the package or next to the module, and it will be available at '/static' on the application.

To generate URLs for static files, we need to use the special 'static' endpoint name.

url_for('static', filename='style.css')

The file has to be stored on the filesystem as static/style.css. When running on localhost this CSS file can be accessed via the URL http://127.0.0.1:5000/static/style.css

Rendering Templates

To render an HTML template, the render_template() method can be used. We need to provide the name of the template and the variables to pass to the template engine as keyword arguments. Shown below is one example of this. 

@app.route('/hello/<name>')
def hello(name=None):
    return render_template('hello.html', name=name)

Flask looks for templates in the templates folder. Therefore, this templates folder should either be next to your module or, if your application is a package, inside the package.

The Request Object

A route that handles both GET and POST requests, and therefore works with the request object, uses a route decorator like the one shown below. 

@app.route('/login', methods=['POST', 'GET'])

The current request method is available through the method attribute of the request object. For example, we can check whether the method is POST using the following statement.

if request.method == 'POST':

Furthermore, form data (data transmitted in a POST or PUT request) can be accessed using the form attribute. An example is shown below. 

if valid_login(request.form['username'], request.form['password']):
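Putting these pieces together, a minimal, self-contained sketch of such a login view is shown below. The hard-coded credential check stands in for a real valid_login() helper and is purely illustrative.

from flask import Flask, request

app = Flask(__name__)

@app.route('/login', methods=['GET', 'POST'])
def login():
    if request.method == 'POST':
        # form data sent with the POST request
        username = request.form['username']
        password = request.form['password']
        if username == 'admin' and password == 'secret':  # stand-in for a real valid_login() check
            return 'Logged in as %s' % username
        return 'Invalid username/password'
    # a GET request just shows a very minimal login form
    return '<form method="post"><input name="username"><input name="password" type="password"><input type="submit" value="Login"></form>'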

Horoscope-API 

After obtaining a basic knowledge of Flask, I studied a sample REST API developed by my mentor. This API, namely the Horoscope-API deployed on Heroku, has been developed using Flask. It uses the horoscope package, which was developed to fetch and parse data from GaneshaSpeaks. The API has methods to return today's, weekly, monthly and yearly horoscopes, and there is also a method to find out more about a given sunsign.

GET requests have been used for all the methods in the application. Shown below is the route decorator for the index_route function, which returns basic details of the project and the author such as the author name, author URL, project name, project URL, etc. 

@app.route('/', methods=['GET']) 

These details are returned as a json object. For that, the jsonify() method of Flask has been used. The jsonify method creates a response with the JSON representation of the given arguments, which can take any one of the following three forms. 

jsonify(**kwarg)

jsonify(mapping, **kwarg)

jsonify(iterable, **kwarg)

mapping is a positional argument. It actually takes the form of a dictionary having key-value pairs. The key becomes the key in the JSON object and the value becomes the corresponding value. **kwarg denotes a set of keyword arguments. When used with keyword arguments, the argument name becomes the key in the JSON object and the argument value becomes the corresponding value. This method can also accept a positional argument which is an iterable object. Each item in the iterable must itself be an iterable with exactly two objects. The first object of each item becomes a key in the JSON object, and the second object the corresponding value. If a key occurs more than once, the last value for that key becomes the corresponding value in the JSON object. Following is an example JSON response returned when using keyword arguments. 

return jsonify(username=g.user.username, email=g.user.email, id=g.user.id)

This returns the following JSON response. 

{
"username": "admin",
"email": "admin@localhost",
"id": 42
}

Horoscope-API uses the first two forms of arguments with the jsonify method: the index_route method uses a mapping-type positional argument, and all the other methods use keyword arguments.

Another special function used by the Horoscope-API is the built-in dict() function. It accepts the same kinds of arguments as described above for the jsonify method. Its task is to return a new dictionary built from the arguments passed to it; if no positional argument is given, an empty dictionary is created. The methods of the horoscope package return their results embedded in Python dictionary objects, and Horoscope-API creates new dictionaries from these objects by passing them as mapping arguments to dict(). Shown below are two equivalent ways of creating the same dictionary; both produce {"one": 1, "two": 2, "three": 3}.

a = dict(one=1, two=2, three=3)
b = {'one': 1, 'two': 2, 'three': 3}

With the knowledge gained about the Flask web framework and the Horoscope-API, I will start coding the API to return Quora user profile information over the next week.

Sunday, April 3, 2016

Fifth Week of LITG Program - Part 2

As mentioned in the previous blog posts, during the rest of the fifth week and the beginning of the sixth week, my main objective was to finish coding the Python module to fetch and parse data from Quora user profiles. This was a somewhat hard task since I ran into several difficulties, but with proper guidance from my mentor I could successfully overcome them. The complete package is now available in the github repository at https://github.com/hansika/pyquora.

Since the package is finally expected to be uploaded to the Python Package Index, there is a special directory structure required of the package to be developed. This directory structure is shown in the following diagram, captured from the PyCharm IDE.

Python Package Directory Structure 
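In text form, the layout looks roughly as follows (reconstructed from the file descriptions below; the exact contents of the repository may differ slightly):

pyquora/
├── scrape_quora/
│   ├── __init__.py
│   ├── pyquora.py
│   └── test.py
├── setup.py
├── README
├── README.md
├── LICENSE
└── CHANGES.txt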

Each and every file in this hierarchy plays an important role in a Python package. Let me explain my learning related to these files one by one. __init__.py is used to initialize the Python package. The inclusion of an __init__.py file in a directory indicates to the Python interpreter that the directory should be treated as a Python package. In this case, the directory is named scrape_quora. One common thing to do in an __init__.py file is to import selected classes, functions, etc. into the package level so that they can be conveniently imported from the package. In my package, the __init__.py file imports the Scrape_Quora class from the pyquora.py file; therefore the Scrape_Quora class becomes accessible at the package level. There are two other files inside the package directory, namely test.py and pyquora.py. The pyquora.py file has the code to scrape the web; in my package, this file has individual functions to scrape each data item. The test.py file has the code to test the features developed inside pyquora.py. 
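In other words, the __init__.py file boils down to a single import along these lines (the exact statement in the repository may differ slightly):

# scrape_quora/__init__.py
# expose the Scrape_Quora class at the package level
from .pyquora import Scrape_Quora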

The setup.py file stores the metadata about the package. This is the most important file in any Python package and it exists at the root of the package. This file serves two primary functions.
  1. It is the file where various aspects of the project are configured. The primary feature of setup.py is that it contains a global setup() function. The keyword arguments to this function are how specific details of the project are defined; they typically include the name of the package, the version, the author, the author's email, the github URL of the package, etc. A minimal sketch of such a setup() call is shown after this list. 
  2. It is the command line interface for running various commands that relate to packaging tasks. 
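The sketch below gives an idea of what such a setup() call could look like for this package. The package name, version, license and github URL come from details mentioned in these posts; the remaining values are placeholders.

from distutils.core import setup

setup(
    name='scrape_quora',
    version='0.1.0',
    packages=['scrape_quora'],
    description='Fetch and parse data from Quora user profiles',
    author='<author name>',
    author_email='<author email>',
    url='https://github.com/hansika/pyquora',
    license='Apache License 2.0',
)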
The README and README.md files serve the common purpose of providing a general description of the package, its features and how to use it. The LICENSE file contains the license under which the code is released; for example, the scrape_quora package has the Apache License 2.0. The CHANGES.txt file contains the release information for the various versions of the package, including the release date, the version and, of course, the changes made. In my package, the CHANGES.txt file only has details of the initial release. Although not present in the above diagram, Python packages typically have another file named requirements.txt, whose purpose is to declare the packages that must be installed for the newly developed package to work. 

After creating this basic structure of the Python package, I started coding. According to the instructions given by my mentor, I was asked to extract the following information, given the user name of a certain Quora user profile. 
  • Name of the user 
  • URL 
  • Profile picture link 
  • Follower count 
  • Following count 
  • Count of edits 
  • Number of answers 
  • Number of questions 
I followed a structure similar to that of the horoscope package (https://github.com/tapasweni-pathak/pyhoroscope) when developing my package. This made it much easier for me during troubleshooting.

Yet, there was another problem that I faced when scraping the information. Although the code for scraping was more or less similar to that of the horoscope package, I ran into a small issue because the xpath for the same data element differs across user profiles. For example, the xpath for the name of the user turned out to be as follows for two different user profiles. 


//*[@id="ld_cnuttt_69662"]/div/h1/span

//*[@id="ld_txiquk_13340"]/div/h1/span


Initially I had no idea about this, and my code kept giving me an empty list when I scraped the profile name from the profile page. But with the guidance of my mentor, I found a way to generalize this xpath across all profiles using regular expressions. The generalized xpath used to retrieve the profile name is as follows. 

//*[re:test(@id, "ld_[a-z]+_\\d+", g)]/div/h1/span/text()

I read the documentation available at http://exslt.org/regexp/index.html and at http://lxml.de/xpathxslt.html#regular-expressions-in-xpath. According to these, xpath supports regular expressions in the EXSLT namespace, which provides three regular expression functions: regexp:test, regexp:match and regexp:replace. The one I used is regexp:test(string, string, string?), which returns true if the string given as the first argument matches the regular expression given as the second argument; the third argument is a string of flags to be used by the test. The following statement shows how such a generalized xpath is used in the code to extract the profile name from the Quora profile page.

name = tree.xpath('//*[re:test(@id, "ld_[a-z]+_\\d+", g)]/div/h1/span/text()', namespaces={'re':Scrape_Quora.regexpNS})[0]
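For context, a self-contained sketch of this kind of call is shown below. Here regexpNS would be the EXSLT regular expressions namespace URI, and fetching the page with the requests library is my assumption about the surrounding code rather than the package's exact implementation.

import requests
from lxml import html

# EXSLT regular expressions namespace, registered under the 're' prefix below
REGEXP_NS = 'http://exslt.org/regular-expressions'

def scrape_profile_name(profile_url):
    # Return the profile name scraped from a Quora profile page (illustrative sketch)
    page = requests.get(profile_url)
    tree = html.fromstring(page.content)
    # match any element whose id looks like ld_<letters>_<digits>, then take the name text
    names = tree.xpath('//*[re:test(@id, "ld_[a-z]+_\\d+", "g")]/div/h1/span/text()',
                       namespaces={'re': REGEXP_NS})
    return names[0] if names else None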

Likewise, I had to use regular expressions to generalize the xpaths for all the data elements to be extracted. While doing this, I could also deepen the knowledge of xpaths I had acquired over the past week. Finally, I coded the test.py file to test the functionality of all the functions created. 

After coding the package, I pushed it to the aforementioned github repository. During the rest of the sixth week, I will be working on pushing this package to the Python Package Index.