Monday, May 2, 2016

Twelfth Week of LITG Program

At the request of my mentor Tapasweni Pathak, I am writing this blog post to sum up the overall experience I gained through the LITG program. It is written as a set of questions and answers about the program.

How was the experience with Learn IT Girl? 

It was a wonderful, educational and rich experience, filled with many lessons that will shape my career path.

Was the time plan unmanageable? Did the project make you spend more hours than expected?

No. The timeline was quite feasible thanks to the support and guidance of my mentor. I spent nearly ten hours per week on the project work (about three hours spread over the five weekdays and the other seven during the weekend). Depending on the workload for the week, I sometimes had to work a few extra hours (especially while coding the Python package). Nevertheless, I could successfully balance the project work with my other academic work at the university.

What all did you learn? 

Python programming language 
Selecting the most appropriate type of license for a given project 
Working with GitHub
Web scraping with Scrapy library 
Functionality of Python packages such as etree, re, urllib2, xpath 
Developing a Python package 
Pushing a Python package into the Python Package Index 
Flask Python Framework 
Developing an API using Python and Flask Framework 
Deploying a Python application on Heroku 

Apart from these, I also learned the following, which will be very useful to me in the future.

Working with tight schedules on strict deadlines 
Keeping record of the work by writing weekly blog posts 
Communicating with international mentors on project related professional matters 

Why did you choose to learn Python and why this project?

Python is a widely used language today, but I had zero knowledge of it. It is also an easy-to-learn, high-level language with many libraries and built-in support, which attracts many developers. Python knowledge would therefore be a real plus for my career as a software engineer. That is why I chose to learn Python.

The project is about developing a Python API that extracts the following Quora user information, given the user name of the profile as input.

Name of the user 
URL of the user profile 
Profile picture link 
Follower count 
Following count 
Count of edits 
Number of answers 
Number of questions 

Future improvements include extracting the Facebook and LinkedIn profile links, which requires logging in to Quora. The API is based on a Python package that performs the web scraping part of the project. Web scraping was a whole new concept for me; I had never worked on it before. However, plenty of learning material was available to help me grasp the basic concepts. The Horoscope API (https://github.com/tapasweni-pathak/Horoscope-API) and the pyhoroscope package (https://github.com/tapasweni-pathak/pyhoroscope) developed by my mentor follow the same concepts. My mentor has also written many blog posts about developing Python packages and APIs. It was therefore very convenient for me to learn the fundamentals of Python and the other required technologies within a very short period of time. This was the major reason why I chose this project.

How can your project help others? 

Both the API and the package can be used by developers in their projects. Also developers can volunteer to add more features to the package and improve it. 

The GitHub repository for the QUserAPI - https://github.com/hansika/QUserAPI
(Instructions to use the API are available at this link) 

The GitHub repository for the scrape_quora Python package - https://github.com/hansika/pyquora

The package can be installed on its own (without using the API) with the command 'pip install scrape_quora'.

This API, along with the Python package, can be used by developers who work on more complicated tasks such as data mining using the information extracted from Quora user profiles. The API has been designed to be intuitive for developers to understand and use.

The things that you would like to change in the next round of Learn IT Girl to make it better? 

I think the mid evaluation should be much more structured. Currently the mid evaluation is more like a self-evaluation in which the mentee evaluates her own learning. Not even the mentor is involved in this process. I feel that if the mentee were asked to produce a working piece of software during the mid evaluation, on which the evaluators (including the mentor) could give constructive feedback, she would be even more motivated to do better throughout the rest of the program. Personally I would find this very helpful, since it is always feedback and comments from people that keep me going and push me to produce even better work.

Would you like to mentor in next round? 

Yes. I would like to mentor in the next round. I would love to give something in return with the knowledge I gained and help a girl learn a new programming language. 

What advice would you like to give next year's mentees?

Choosing the language that best fits you is critical. It should be something that you have zero knowledge of, and also something valuable enough to spend three months of your time learning. Go for a language that you think will remain in the industry for at least a few years.

Choose a project that fits your potential. It should not be too complex and also not something too simple that you can complete within a few weeks. Choose something which will teach you enough new concepts to digest within a period of three months. 

Have a proper time plan. Get the help of your mentor to create it. Your work breakdown structure should be such that it does not clash with or disturb your other personal or academic work. Allocate time so that you can complete the work due for each week. You know your work schedule better than anyone else.

Above all, have the passion to learn. Try things out by yourself. Bother your mentor only when you really cannot resolve an issue for several days. Search Google thoroughly before you ask your mentor. It is your time to learn, so make the most of it.

What were the things that you didn’t like about your mentor? 

My mentor was Tapasweni Pathak. She was a great mentor who helped me immensely, from creating the timeline to writing this final blog post. She was a bit strict at times, especially when I was lagging behind due to my exams, but she was a motivating character who drove me to complete my tasks on time by setting deadlines. She is an awesome supervisor who can make anything work.

Sunday, May 1, 2016

Eleventh Week of LITG Program

It is the eleventh week of the LITG program. During this week I did some final testing and cleaning of the project as instructed by my mentor. These tasks are explained first in this blog post. At the end I also summarize the overall project and what I learned through it.

First I was instructed by my mentor to add my name as the owner of the project in the license files of both the Python package and the API. The license included is the Apache License Version 2.0. This license has the following line in it. 

Copyright [yyyy] [name of copyright owner]

This was changed as follows. 

Copyright 2016 Hansika Hewamalage 

After that I added more test cases to the scrape_quora Python package. I also created a list of Quora user names and sent them one by one to the routes of the QUserAPI. Once all this was done, both the package and the API had to be pushed to GitHub again. I also had to push the package to the Python Package Index once more with a new version number (0.1.3). The CHANGES.txt and setup.py files of the package were updated accordingly before pushing the package to PyPI.

Then I redeployed the API on Heroku. Before redeploying, the requirements.txt file was updated to include the latest version of the scrape_quora package. Redeployment was then done according to the tutorial at https://devcenter.heroku.com/articles/getting-started-with-python#push-local-changes. Redeployment comes down to pushing the local changes to the git remote named heroku. The following sequence of commands was used to achieve this.
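The route test described above can be sketched roughly as follows. This is only an illustration and not the actual test script: the user names are placeholders, the base URL is the deployed address given in this blog, and the route patterns are the ones listed in the eighth week's post. The sketch only builds the URLs to request; actually fetching them is left out so that it runs without network access.

```python
from urllib.parse import quote

# Base URL of the deployed API, as given in this blog.
BASE_URL = "http://quser-api.herokuapp.com"

# Route patterns taken from the eighth week's post.
ROUTES = [
    "/profile/name/{user_name}",
    "/profile/picture_link/{user_name}",
    "/url/{user_name}",
    "/profile/number/answers/{user_name}",
]

# Placeholder user names, for illustration only.
USER_NAMES = ["Example-User-1", "Example-User-2"]


def build_test_urls(user_names, routes=ROUTES, base_url=BASE_URL):
    """Return one URL per (user name, route) pair."""
    urls = []
    for user_name in user_names:
        for route in routes:
            # quote() escapes characters that are not safe in a URL path.
            urls.append(base_url + route.format(user_name=quote(user_name)))
    return urls


for url in build_test_urls(USER_NAMES):
    print(url)
```

Each printed URL could then be requested in turn and the JSON response inspected by hand or compared against expected values.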

First add the modified files to the local git repository. 
git add . 

Next commit the changes to the repository. 
git commit -m "license file and the requirements.txt file modified"

Deploy the source code to Heroku. 
git push heroku master 

Check whether it is deployed properly.
heroku open 

Once all the steps were followed, the API was successfully redeployed on Heroku. 

Next I will be summarizing all the work that I did throughout the last ten weeks. My project was to develop an API that retrieves Quora(https://www.quora.com/) user profile information given the user name(available at https://github.com/hansika/QUserAPI). This was based on a Python package namely the scrape_quora Python package(available at https://github.com/hansika/pyquora) which was also developed by me. The chosen programming language was Python. First of all license files were added to the projects QUserAPI and pyquora(the Python package). After much exploration the chosen license was Apache License Version 2.0. 

The first three weeks were then dedicated to learning Python by following the exercises available at http://learnpythonthehardway.org/book/preface.html. All the completed exercises were pushed into a git repository (available at https://github.com/hansika/LearnPython). It was a bit of a struggle during the first few weeks since my university exams clashed with the program, but I managed to clear the backlog and get back on track. The fourth week was a very important one since it exposed me to the world of web scraping. I completed the video tutorial available at https://www.youtube.com/watch?v=ic8ygbac5lo, which teaches the fundamentals of web scraping using the Scrapy library. It uses the example code available at https://github.com/tapasweni-pathak/Talks-and-Workshops/tree/master/PyLadies%20Remote/Code%20Examples, which made it much easier for me to grasp the concepts. I also referred to pyhoroscope, the Python package developed by my mentor (available at https://github.com/tapasweni-pathak/pyhoroscope). The purpose of this package is to fetch and parse data from GaneshaSpeaks. This is the base on which I developed my own package, scrape_quora.

Next, during the first half of the fifth week, I was introduced to a number of Python packages and tools that were needed to develop scrape_quora, namely etree, re, urllib2 and xpath. I learned about them by trying out different commands related to each of them. My learning is available at the GitHub repository https://github.com/hansika/LITG_Practice. I also learned about Python dictionaries, which were needed in a later week of the program. During the rest of the fifth week and the beginning of the sixth week I finished coding the Python package scrape_quora. Given the user name of a profile as input, this package scrapes the name of the user, the URL of the user profile, the profile picture link, the follower count, the following count, the count of edits, the number of answers and the number of questions. As a future improvement we can also scrape the Facebook and LinkedIn profile links, which requires logging in to Quora.
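To give a feel for how XPath queries and regular expressions work together when scraping, here is a minimal sketch. It makes several assumptions: the markup is a made-up, well-formed snippet and not Quora's real page structure, the function and field names are hypothetical, and it uses only the Python 3 standard library (xml.etree.ElementTree with its limited XPath support) rather than the lxml etree and Python 2 urllib2 that the package itself relies on. Downloading the page is left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup for illustration only.
# This is NOT Quora's real page structure.
SAMPLE_PAGE = """
<html>
  <body>
    <h1 class="user">Example User</h1>
    <ul>
      <li class="followers">1,234 Followers</li>
      <li class="answers">56 Answers</li>
    </ul>
  </body>
</html>
"""


def first_number(text):
    """Extract the first integer from text such as '1,234 Followers'."""
    match = re.search(r"\d[\d,]*", text or "")
    return int(match.group().replace(",", "")) if match else None


def scrape_profile(page_source):
    """Pull a few fields out of the page using XPath-style queries."""
    root = ET.fromstring(page_source.strip())
    return {
        "name": root.findtext(".//h1[@class='user']"),
        "followers": first_number(root.findtext(".//li[@class='followers']")),
        "answers": first_number(root.findtext(".//li[@class='answers']")),
    }


print(scrape_profile(SAMPLE_PAGE))
```

Real web pages are rarely well-formed XML, which is one reason a more forgiving HTML parser such as the one in lxml is usually preferred for actual scraping.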

The task during the rest of the sixth week was to push the package into the Python Package Index. This was first tested by pushing the package to the test server. I ran into many issues while doing this, but with the help of my mentor I could overcome all of them. The package pushed to the test server is available at https://testpypi.python.org/pypi?:action=display&name=scrape_quora&version=0.1.3. The package pushed to the live server is available at https://pypi.python.org/pypi?:action=display&name=scrape_quora&version=0.1.3. The next task was to code the API using this Python package. Before that, during the seventh week, I learned about the Flask Python framework, which was needed to code the API. I pushed all my learning related to Flask into the GitHub repository at https://github.com/hansika/Flask_Learning. I tried out very simple commands using Flask to get a basic understanding. During this week I also studied the Horoscope API (https://github.com/tapasweni-pathak/Horoscope-API), which my mentor developed using the aforementioned pyhoroscope package. This was used as the basis for developing the QUserAPI.

The eighth week was dedicated to coding the API, and I successfully finished coding it during that week. During the ninth week it was time to start learning about the Heroku platform. The blog post of the ninth week was dedicated to summarizing what I learned about Heroku. Then, during the tenth week, which was the final week of the project work, I deployed the API onto the Heroku platform. I ran into many difficulties while doing this as well, but with enough searching on Google I could overcome all of them. The official Heroku documentation available at https://devcenter.heroku.com/articles/getting-started-with-python#introduction was very useful in resolving these issues. In the end I successfully deployed the API, which is now accessible via http://quser-api.herokuapp.com/

This API is a useful piece of software for developers who work on more complex tasks such as data mining using the information extracted from Quora user profiles. All the instructions for using the API are available at the GitHub repository at https://github.com/hansika/QuserAPI. The API has been designed to be intuitive both to use and to understand.

As a whole, the LITG program has brought many important lessons into my life. First of all, this was my first time working in such an international program under the mentorship of someone from another country, and it brought me many new experiences. Apart from the new technologies and programming languages learned, I gathered many good experiences that will help me climb the career ladder. I should emphasize the support given to me by my mentor Tapasweni Pathak throughout this project to successfully complete everything. If it were not for the feasible, end-to-end schedule she created, I would not have been able to complete the tasks on time. Because of that I gained wonderful experience in working according to prescheduled timelines. Especially during the first few weeks, when I had my university exams, we often had to refine the timeline to cover the backlog. Furthermore, writing blog posts every week improved my writing skills. It was also a good way of keeping a note of everything I learned during the week. Since we tend to easily forget what we learn, writing blog posts is a good way to go back and revise it. I hope to continue this habit throughout the rest of my work as well. Another good thing I learned is to try out and actually build something with whatever new technology I learn. When such work is pushed into a GitHub repository, the knowledge remains available for the future as well. This is another good habit that I hope to continue.

All in all, the LITG program was a great influence that added many good habits to my career path. The new experiences and the learning gained throughout this program will be very helpful to me in achieving my career goals.

Tenth Week of LITG Program

It is the tenth week of the LITG program and we have reached almost the end of the project. This week is dedicated to pushing the developed API to the Heroku platform with the basic understanding gained during the last week.

I ran into a number of difficulties while pushing the API to Heroku. The tutorial that I was referring to in the first place was outdated, which resulted in many errors when I followed it. After some effort and a bit of searching on Google I found the official documentation of the Heroku platform, available at https://devcenter.heroku.com/articles/getting-started-with-python#introduction. This tutorial gives intuitive, step-by-step guidance for successfully deploying a Python application on the Heroku platform. It also gives instructions for running an app locally so that we can test it using localhost. Nevertheless, I deployed the app directly on Heroku.

According to this tutorial there are a number of steps which should be followed in order to deploy a Python application to the Heroku platform. For these steps to be successful, two more files should be added to the API, namely the requirements.txt file and the Procfile. These will be explained next.

requirements.txt File


As mentioned in the tutorial at https://devcenter.heroku.com/articles/getting-started-with-python#declare-app-dependencies the purpose of this file is to declare the app dependencies. Heroku recognizes an app as a Python app by the existence of this file in the root directory. For example the QUserAPI that I developed contains the following set of Python packages along with their versions in the requirements.txt file as dependencies.

Flask==0.10.1
Jinja2==2.8
Werkzeug==0.11.3
gunicorn==19.4.5
itsdangerous==0.24
MarkupSafe==0.23
newrelic==2.60.0.46
scrape_quora==0.1.3
wsgiref==0.1.2
lxml==3.5.0

When an app is deployed, Heroku reads this file and installs the appropriate Python dependencies using the pip install -r requirements.txt command.
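As a small illustration of the name==version format used in this file, the sketch below reads a few of the pins listed above into a dictionary. It is a deliberately simplified parser written for this post (the function name is hypothetical) and handles only exact == pins; pip's own requirements parser supports far more than this.

```python
# A few of the pinned dependencies listed above.
REQUIREMENTS_TXT = """\
Flask==0.10.1
gunicorn==19.4.5
scrape_quora==0.1.3
lxml==3.5.0
"""


def parse_pins(text):
    """Map each package name to its pinned version (exact '==' pins only)."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


pins = parse_pins(REQUIREMENTS_TXT)
print(pins["scrape_quora"])
```

A check like this can be handy before redeploying, to confirm that the file really pins the package version that was just released.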

Procfile


As mentioned at https://devcenter.heroku.com/articles/getting-started-with-python#define-a-procfile this file included within the root directory of the app explicitly defines the command that should be executed to start the app. For example QUserAPI contains the following command in the Procfile.

web: newrelic-admin run-program gunicorn -b 0.0.0.0:$PORT server:app

This file declares a single process type, web, and the command needed to run it. The name web declares that this process type will be attached to the HTTP routing stack of Heroku, and receive web traffic once deployed.

There are two other Python packages listed in this command, namely newrelic and gunicorn. newrelic is a package that instruments our applications for performance monitoring and advanced performance analytics with New Relic. It helps to trace performance issues of applications even while they are running in production environments. gunicorn, on the other hand, is short for 'Green Unicorn'. It is a Python WSGI HTTP server for UNIX that is broadly compatible with various web frameworks. It receives HTTP requests and passes them to the Python application (here, the app object in the server module) through the WSGI interface.

The $PORT piece of the command is an environment variable that Heroku sets when it starts the process. Binding gunicorn to 0.0.0.0:$PORT makes the app listen on whatever port Heroku has assigned to it.
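To make the server:app and $PORT parts of the Procfile a little more concrete, here is a minimal sketch of what gunicorn expects to find. It is not the code of QUserAPI, where app is a Flask application object (which is itself a WSGI callable). Here a tiny hand-written WSGI callable stands in for it so that the example needs only the standard library. The helper names and the fallback port 5000 are assumptions made for illustration.

```python
import os


def app(environ, start_response):
    """A tiny WSGI callable. If this file were saved as server.py,
    gunicorn could load it with the target 'server:app'."""
    body = b'{"status": "ok"}'
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def listen_port(default=5000):
    """Read the port assigned by the platform via the PORT environment
    variable, falling back to a local default (an assumed value)."""
    return int(os.environ.get("PORT", default))


def call_app(wsgi_app, path="/"):
    """Call a WSGI app by hand, the way a WSGI server would, and
    return the status line and the response body."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(wsgi_app(environ, start_response))
    return captured["status"], body


print(call_app(app))
print(listen_port())
```

In the Procfile it is gunicorn, not the application code, that reads $PORT. The listen_port helper only shows how the same variable would be read when starting a server from Python directly.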

Once these files and the code for the API are ready we can move forward to follow the steps to deploy the app on Heroku.

Deploying the Application


The steps to deploy the app fall into a number of sub activities. These steps will be discussed under those sub activities.

Initial Steps


1. Create a free Heroku account
This is required since authentication is needed for heroku and git commands to work in an upcoming step.

2. Install virtualenv locally using the command pip install virtualenv on the terminal. 
(In addition to this, it is required to have a Python version installed on the system. In my case I already had Python installed.)


Set up


3. Install the Heroku Toolbelt which provides access to the Heroku Command Line Interface(CLI). The documentation at https://devcenter.heroku.com/articles/getting-started-with-python#set-up provides the facility to download the version of Toolbelt compatible with the OS used. Once it is installed we can use the heroku command from the terminal.

4. The next step is to log in to Heroku using the heroku login command on the terminal.
The email address and the password of the created free Heroku account can be used for this login.


Prepare the app


5. First go to the project folder(root directory) using the cd command on the terminal.


Deploy the app


6. Create an app on Heroku using the command heroku create which prepares Heroku to receive the source code.
By default Heroku gives a randomly generated name to the app, which also appears in the URL used to access the application once deployed. I ran into a small issue at this point. At first I did not know that Heroku assigns such a default name, so I ran the command with no arguments and it gave the app the name floating-taiga-50750. I wanted the name to be quser-api, so I ran the command again as heroku create quser-api. This created an app with the name quser-api.

As mentioned in the documentation, when a new app is created in this manner, a git remote called heroku is also created and associated with this local git repository.

7. Next deploy the source code using the command git push heroku master.

At this step I got the following error.

remote: Compressing source files... done.
remote: Building source:
remote:
remote:

remote: ! Push rejected, no Cedar-supported app detected
remote: HINT: This occurs when Heroku cannot detect the buildpack
remote: to use for this application automatically.
remote: See https://devcenter.heroku.com/articles/buildpacks
remote:
remote: Verifying deploy...
remote:
remote: ! Push rejected to floating-taiga-50750.
remote:
To https://git.heroku.com/floating-taiga-50750.git
! [remote rejected] master -> master (pre-receive hook declined)
error: failed to push some refs to 'https://git.heroku.com/floating-taiga-50750.git'

I noticed that the name of the app in the error is the random name (floating-taiga-50750) first generated by Heroku and not the name that I created afterwards. After some searching on Google I found a fix at http://stackoverflow.com/questions/31330587/heroku-error-message-no-cedar-supported-app-detected. According to this discussion, reinitializing the .git directory can fix the issue. So I ran the following commands in the given order to resolve the error.

rm -rf .git
git init
git add .
git commit -am "Reinitialize"
heroku create quser-api

I changed the name to quser-api since I need the application to have that name on Heroku. But then I got another error.

Creating ⬢ quser-api... !!!
▸ Name is already taken

To solve this problem I removed all the existing apps under my user account on the web dashboard of Heroku. Then I ran the heroku create quser-api command again and the issue was resolved. Next I ran the git push heroku master command to deploy the project on Heroku.

8. Visit the app at the URL generated by the app name(http://quser-api.herokuapp.com/). 
We can also use the command heroku open on the terminal as a shortcut to open the website. In the same way I tested all the routes of the API at their respective URLs.

The app was successfully deployed on Heroku. With this, I have completed all the work of my project about creating an API to scrape Quora user profiles. The final touch ups and my overall learning throughout the project will be summarized in the next week's blog post.

Saturday, April 23, 2016

Ninth Week of LITG Program

So far in the LITG program I have coded the pyquora package and pushed it into the Python Package Index. I have also coded QuserAPI, which uses the pyquora package to return Quora user profile information in JSON format via REST calls. Information related to all these steps is included in the blog posts written thus far. The final bit left is to deploy this API onto the Heroku platform and test its functionality. Therefore this week is dedicated to finding out about Heroku and acquiring a basic understanding of it. During the next week I will move on to deploying the API on the Heroku platform. My learning about the Heroku platform is summarized in this blog post.


The Heroku Platform

Simply put, Heroku provides a cloud-based platform (Platform as a Service, PaaS) for deploying and running modern apps. It is free to start with and only charges developers as their apps grow. It is based on a managed container system. A smart container, also known as a dyno, is an instance of the application running and responding to requests. Heroku provides one dyno for free. Heroku also has integrated data services. Developers do not have to worry about discovering how to optimally provision a database through trial and error. They have immediate access to a scalable, highly available database with rollback, one that supports their apps and development style. Heroku also comes with a powerful ecosystem.

The platform relieves developers of infrastructure headaches and lets them focus on developing great apps. The objective of the Heroku platform is to make the process of deploying, configuring, scaling, tuning, and managing apps as simple and straightforward as possible. This makes the Heroku developer experience an app-centric one for software delivery, integrated with the most popular developer tools and workflows today. There are three key aspects of the Heroku platform. They are described below.

Heroku Runtime

As mentioned before Heroku runs all apps inside dynos which are smart containers on a reliable, fully managed runtime environment. Developers can deploy their code written in Node, Ruby, Java, PHP, Python, Go, Scala, or Clojure. This runtime keeps apps running without any manual intervention.

Heroku Developer Experience

The Heroku Developer Experience refers to an app-centric approach to software delivery. Therefore developers can focus only on creating and continuously delivering applications, without worrying about servers or the underlying infrastructure. Developers can deploy directly from popular tools like Git, GitHub or Continuous Integration (CI) systems. There is also a web-based Heroku Dashboard which makes it much easier to manage the app and gain insight into the performance of the app.

Data Services and Ecosystem

Heroku Elements provide the facility for the developers to extend their apps with Add-ons and customize their application stack with Buildpacks. Add-ons are 3rd party cloud services that developers can use to immediately extend their apps with a range of functionality such as data stores, logging, monitoring and much more. Heroku provides two fully-managed data service Add-ons namely Heroku Postgres and Heroku Redis.

With this basic understanding gained, I am looking forward to deploying QuserAPI on Heroku during the next week. Deployment has several other requirements, such as the Procfile and the requirements.txt file, which will be discussed in detail in the blog post of the coming week.

Friday, April 15, 2016

Eighth Week of LITG Program

The task of the 8th week is to code the API (QUserAPI) that returns information from Quora user profiles. This API uses the pyquora package developed throughout the fifth week of the LITG program. I used the Horoscope-API (https://github.com/tapasweni-pathak/Horoscope-API) developed by my mentor as a reference when coding QuserAPI. The final API is available on the GitHub repository at https://github.com/hansika/QuserAPI. This API consists of the following files.
  • License.md 
  • Procfile 
  • README.md 
  • requirements.txt 
  • server.py 
Out of these, server.py is the file coded within this week. The License.md and README.md files were added to the project back during the first and second weeks of the program. Of course, the README.md file was modified during this week to include the features of the API. The Procfile and requirements.txt are two files required by Heroku, where we are expecting to deploy the API over the next weeks. These files will be explained in detail in a blog post in an upcoming week. 

The server.py file was coded similarly to the server.py file of Horoscope-API, which was studied in depth during the last week. It uses the Flask Python framework. This file has one method for each of the features of the pyquora package. These features and their corresponding methods are as follows. 
  • Quora Profile Name - profile_name_route(user_name)
  • Quora Profile Picture Link - profile_picture_link_route(user_name)
  • Quora Profile URL - url_route(user_name)
  • Number of Questions - no_of_questions_route(user_name)
  • Number of Answers - no_of_answers_route(user_name)
  • Number of Followers - no_of_followers_route(user_name)
  • Number of Following - no_of_questions_following(user_name)
  • Number of Edits - no_of_edits_route(user_name)
All the above methods call the methods of the pyquora package in order to scrape Quora user profiles. There is also another method named index_route() which returns additional details related to the API such as the author, project name, project URL, project issues, base URL and end points. All these methods have their own route decorators and all of them handle GET requests. Furthermore, all the methods return their results in the form of a JSON object using the jsonify method of Flask, which was discussed in detail in the blog post of the seventh week. A few example route decorators and their corresponding methods are shown below. 
  • index_route() - @app.route('/', methods=['GET'])
  • profile_name_route(user_name) - @app.route('/profile/name/<user_name>', methods=['GET'])
  • profile_picture_link_route(user_name) - @app.route('/profile/picture_link/<user_name>', methods=['GET'])
  • url_route(user_name) - @app.route('/url/<user_name>', methods=['GET'])
  • no_of_answers_route(user_name) - @app.route('/profile/number/answers/<user_name>', methods=['GET'])
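Put together, a route of this kind can be sketched as below. This is a simplified illustration and not the actual server.py: the function fetch_profile_name is a hypothetical stand-in for the call into the scraping package, and the JSON keys are assumptions. Only the route patterns, the method names and the use of Flask's jsonify come from the description above.

```python
from flask import Flask, jsonify

app = Flask(__name__)


def fetch_profile_name(user_name):
    """Hypothetical stand-in for the scraping call into the package."""
    return user_name.replace("-", " ")


@app.route('/', methods=['GET'])
def index_route():
    # General information about the API (keys are illustrative).
    return jsonify({
        'project_name': 'QUserAPI',
        'endpoints': ['/profile/name/<user_name>'],
    })


@app.route('/profile/name/<user_name>', methods=['GET'])
def profile_name_route(user_name):
    # The <user_name> part of the URL is passed in as an argument.
    return jsonify({'name': fetch_profile_name(user_name)})


# Exercise the route with Flask's test client, without starting a server.
with app.test_client() as client:
    response = client.get('/profile/name/Example-User')
    print(response.get_data(as_text=True))
```

The test client makes it possible to check each route locally before the API is deployed anywhere.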
The tasks to be completed during the next weeks are to read up on Heroku and finally deploy this API on it. Therefore, the next blog post will be about what Heroku is. 

Thursday, April 14, 2016

Sixth Week of LITG Program

The main task expected to be completed during the sixth week is to push the developed scrape_quora package into the Python Package Index which is also known as PyPI as a shortened form. Python Package Index is a repository of software for the Python programming language. At the moment of writing this article, there are 78557 packages in the package index. Once you have created some awesome piece of software using Python, you can simply push it to the Python package index and let people install it using pip install. You can also use PyPI's test server to test the developed package. Pushing a package to the Python package index requires a special directory structure. This was explained in detail in the blog post titled 'Fifth Week of LITG Program - Part 2'. Nevertheless I re-post the required directory structure in this article as well.

Python Package Directory Structure

Once the package is ready, we need a few other things before moving on to pushing the package into the Python Package Index. 

Before pushing the package directly to the live server we need to push it to the test server and test the package using the pip install command. Therefore first of all we need to have user accounts on both these servers. 

  • .pypirc configuration file:
This file basically contains the information to authenticate the user with PyPI test and live servers. On a Linux machine, this configuration file should be in the home directory. 

Once we are done with all these steps, all we have to do is to work through a sequence of terminal commands in order to push the package into the Python Package Index.


1.   Register the package against PyPI's test server. 
python setup.py register -r test 

At this step I got several errors. First I got the following error.

Traceback (most recent call last):
 File "setup.py", line 10, in <module>
 packages = ['scrape_quora']
 File "/usr/lib/python2.7/distutils/
 dist.run_commands()
 File "/usr/lib/python2.7/distutils/
 self.run_command(cmd)
 File "/usr/lib/python2.7/distutils/
 cmd_obj.run()
 File "/usr/lib/python2.7/dist-
 _register.run(self)
 File "/usr/lib/python2.7/distutils/
 self._set_config()
 File "/usr/lib/python2.7/distutils/
 config = self._read_pypirc()
 File "/usr/lib/python2.7/distutils/
 current['username'] = config.get(server, 'username')
 File "/usr/lib/python2.7/
 raise NoSectionError(section)
ConfigParser.NoSectionError: No section: 'pypi # the live PyPI'

I had to search on Google and experiment with the content of the .pypirc configuration file to resolve this error. My .pypirc file looked as follows when the error occurred.


[distutils] # this tells distutils what package indexes you can push to
index-servers =
    pypi # the live PyPI
    test # test PyPI

[test] # authentication details for test PyPI
repository = 'https://testpypi.python.org/pypi
username = <your_user_name>
password = <your_password>

[pypi] # authentication details for live PyPI
repository = https://pypi.python.org/pypi
username = <your_user_name>
password = <your_password>

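I had to remove the two inline comments in the index-servers list ('# the live PyPI' and '# test PyPI') to resolve the aforementioned error, because the parser was reading each comment as part of the server name. After that I got a different error.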

Registering scrape_quora to 'https://testpypi.python.org/pypi 
Server response (500): <urlopen error unknown url type: 'https> 

For this too, I searched on Google but could not find a workaround. Later I found that I had inserted an unwanted single quote at the beginning of the test server URL in the .pypirc file. After removing it, the error was resolved.

The next error was about the authentication details. It was basically a 401 authentication failed error. The PyPI live and test servers allow users to log in with their Gmail accounts. Therefore I first used my Gmail address as the username and my Gmail password as the password to log in to the PyPI live and test servers, and included these details in the .pypirc file. Later my mentor instructed me to create user accounts on both websites instead of using the Gmail credentials. Once this was done, I changed the credentials in the .pypirc file to the new login details. After that, the 401 error was resolved and I could successfully register the package against PyPI's test server.
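The NoSectionError described above can be reproduced without distutils. The following is a minimal sketch, assuming Python 3's configparser module (the successor of the Python 2 ConfigParser module seen in the traceback) with its default settings, which do not strip inline comments. The file content is a shortened stand-in for the faulty .pypirc, and listed_servers is a hypothetical helper written only for this illustration.

```python
import configparser

# Shortened stand-in for the faulty .pypirc (credentials left out).
FAULTY_PYPIRC = """\
[distutils]
index-servers =
    pypi # the live PyPI
    test

[pypi]
repository = https://pypi.python.org/pypi

[test]
repository = https://testpypi.python.org/pypi
"""


def listed_servers(text):
    """Return the parser and the server names listed under [distutils]."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    raw = parser.get("distutils", "index-servers")
    return parser, [line.strip() for line in raw.split("\n") if line.strip()]


parser, servers = listed_servers(FAULTY_PYPIRC)
for name in servers:
    # The inline comment stays part of the name, so no matching section exists.
    print(repr(name), parser.has_section(name))
```

The first server name comes back with the comment attached, which is why a lookup of the matching section fails, while the clean name 'test' is found.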

2.   Upload the package to PyPI's test server to test. 

python setup.py sdist upload -r test 

This command did not give any errors. I could successfully upload the package to PyPI's test server. But after uploading, there was a small issue with the format of the README file. I had also been asked earlier to add more test cases to the package. So I deleted the uploaded package, made these two modifications and tried to re-upload the package with the same version number (0.1.0). Then I got the following error.

Submitting dist/scrape_quora-0.1.0.tar.gz to https://testpypi.python.org/pypi 
Upload failed (400): This filename has previously been used, you should use a different version. 

The version number of the package takes the following format. 

<major>.<minor>.<patch>

So I had to change the patch number once for every modification. After adding one more test case the patch number was changed to 1. Then after the change of the README file, the patch number was changed to 2. So the final version number was 0.1.2. The sequence of version changes was recorded under the CHANGES.txt file. The setup.py file was also updated accordingly. Once this was done, uploading the package to the test server went fine. The package is accessible at the test server on the URL https://testpypi.python.org/pypi/scrape_quora/0.1.2
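The version changes described above can be sketched in a few lines. This is only an illustration of the major.minor.patch scheme; bump_patch is a hypothetical helper and is not part of the scrape_quora package.

```python
def bump_patch(version):
    """Return `version` with its patch component increased by one."""
    major, minor, patch = (int(part) for part in version.split("."))
    return "{}.{}.{}".format(major, minor, patch + 1)


version = "0.1.0"
for change in ("added one more test case", "fixed the README format"):
    version = bump_patch(version)
    print(version, "-", change)
```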

3.   Install package from test PyPI server. 

pip install -i https://testpypi.python.org/pypi scrape_quora

I could successfully install the package from the test server using the above command. 

After testing with PyPI's test server I went ahead to upload the package to PyPI's live server. The following three commands were used for the process. 

1.   Register the package against PyPI's live server. 

python setup.py register -r pypi 


2.   Upload the package to the live server of PyPI. 

python setup.py sdist upload -r pypi 


3.   Install the package to the machine. 

pip install scrape_quora 


The uploaded Python package is available at the URL https://pypi.python.org/pypi/scrape_quora/0.1.2. I also created a small Python file to test some of the features of the package. All of them worked as expected.

During the next weeks I will be acquiring a basic understanding of how to develop an API using the Flask framework. I will indeed create an API that uses this package to retrieve Quora user account information as and when needed. 

Sunday, April 10, 2016

Seventh Week of LITG Program

The remaining tasks for the next few weeks are to code the API that returns the Quora user profile information and to deploy it to Heroku. Before that, my mentor instructed me to acquire a fundamental understanding of Flask, which is a Python web framework. It provides the developer with tools, libraries and technologies for building a web application such as a blog, a wiki page or even a commercial website. Flask is a micro-framework. Micro-frameworks are normally frameworks with few or no dependencies on external libraries. The framework is light, meaning that there are few dependencies to update. This also means that sometimes the developer has to do more of the work. Flask basically has two dependencies: Werkzeug, a WSGI utility library, and Jinja2, which is its template engine.

I followed the documentation available at http://flask.pocoo.org/docs/0.10/quickstart/#http-methods to learn about Flask. I experimented with several special commands available with Flask. These exercises were pushed to the github repository at https://github.com/hansika/Flask_Learning

Using Flask in our Applications

First of all, if we want to use Flask in our web applications, we need to import Flask using the following statement.

from flask import Flask

The next task is to create an instance of the class Flask. The first argument is the name of the application's module or package. If it is a single module, the argument should be __name__. 

app = Flask(__name__)

To run the local server with our application we use the run() function as in the following statement. 

if __name__ == '__main__':
    app.run()

if __name__ == '__main__': makes sure the server only runs if the script is executed directly from the Python interpreter and not used as an imported module. Once these initial statements are included, we are good to go ahead and explore other functions of Flask.

Debug Mode

Normally we need to restart the server every time the code is changed. But when debug mode is on, the server reloads itself whenever the code changes. Debug mode can be turned on in two ways.

app.debug = True
app.run()

or

app.run(debug=True)

Routing

The route() decorator is used to bind a function to a URL. Following are a few examples.

@app.route('/')
When running on localhost port 5000, this function can be called by using the URL http://127.0.0.1:5000/ 

@app.route('/hello')
This function can be called by using the URL http://127.0.0.1:5000/hello
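Putting the pieces so far together, a minimal sketch of an application with these two routes could look as follows. The view function names and return strings are chosen only for illustration, and Flask's test client is used instead of app.run() so that the example finishes immediately without starting a server.

```python
from flask import Flask

app = Flask(__name__)


@app.route('/')
def index():
    return 'Index Page'


@app.route('/hello')
def hello():
    return 'Hello World'


# The test client sends requests to the app without starting a server.
client = app.test_client()
index_body = client.get('/').data
hello_body = client.get('/hello').data
print(index_body, hello_body)
```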

Variable Rules

Variable rules are used to make certain parts of the URL dynamic. These special parts are denoted in the route as <variable_name>. These parts are passed to the function as keyword arguments. An example is shown below. 

@app.route('/user/<username>')

Optionally a converter can be used by specifying a rule with <converter:variable_name> as in the following example.

@app.route('/post/<int:post_id>')
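A short sketch of both kinds of variable rules is shown here. The view functions and the sample values are assumptions made for illustration; the requests go through Flask's test client.

```python
from flask import Flask

app = Flask(__name__)


@app.route('/user/<username>')
def show_user_profile(username):
    # username arrives as a string keyword argument
    return 'User %s' % username


@app.route('/post/<int:post_id>')
def show_post(post_id):
    # the int converter passes post_id as an integer
    return 'Post %d' % post_id


client = app.test_client()
user_body = client.get('/user/hansika').data
post_body = client.get('/post/42').data
bad_status = client.get('/post/abc').status_code  # 'abc' does not satisfy the int converter
print(user_body, post_body, bad_status)
```

A value that the converter cannot handle simply does not match the rule, so the last request ends in a 404 response.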

Redirection Behavior

Consider the following example.

@app.route('/projects/')

Here, the URL with a trailing slash is similar to a folder on a file system. When we try to access it without the trailing slash, Flask will redirect it to the URL with the trailing slash. 

@app.route('/about')

In this example, there is no trailing slash. This is similar to the pathname of a file on UNIX-like systems. When you try to access this URL with the trailing slash, it produces a 404 “Not Found” error.
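The two behaviours can be observed with the test client, as in the following sketch. The view functions are illustrative, and the exact redirect status code depends on the installed Flask/Werkzeug version (301 in older releases, 308 in newer ones), so the sketch only prints it.

```python
from flask import Flask

app = Flask(__name__)


@app.route('/projects/')
def projects():
    return 'The project page'


@app.route('/about')
def about():
    return 'The about page'


client = app.test_client()
no_slash = client.get('/projects')   # redirected to /projects/
with_slash = client.get('/about/')   # no rule matches this URL
print(no_slash.status_code, no_slash.headers.get('Location'))
print(with_slash.status_code)
```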

URL Building

To build a URL to a specific function we can use the url_for() function. It accepts the name of the function as first argument and a number of keyword arguments, each corresponding to the variable part of the URL rule. Unknown variable parts are appended to the URL as query parameters. Building a URL in this manner would be more beneficial than hard coding it especially when we want to change these URLs. We can then change these URLs in one go without having to remember to change URLs all over the place. Two examples are shown below.

url_for('login', next='/') - 'next' appended as a query parameter.
url_for('profile', username='John Doe') – username sent to the dynamic part of the URL.
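The two calls above can be tried out as follows. This is a sketch that assumes view functions named login and profile exist; url_for() needs a request context, which test_request_context() provides here. How the query value is escaped may differ between Werkzeug versions, so the login URL is only printed.

```python
from flask import Flask, url_for

app = Flask(__name__)


@app.route('/login')
def login():
    return 'login'


@app.route('/user/<username>')
def profile(username):
    return username


with app.test_request_context():
    login_url = url_for('login', next='/')                 # unknown part becomes a query parameter
    profile_url = url_for('profile', username='John Doe')  # fills the <username> part
    print(login_url)
    print(profile_url)
```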

Static Files

Dynamic web applications also need static files such as JavaScript files and CSS files. During development Flask can serve these static files as well. All we have to do is to create a folder called static in the package or next to the module and it will be available at '/static' on the application.

To generate URLs for static files, we need to use the special 'static' endpoint name.

url_for('static', filename='style.css')

The file has to be stored on the filesystem as static/style.css. When running on localhost this CSS file can be accessed via the URL http://127.0.0.1:5000/static/style.css

Rendering Templates

To render an HTML template, the render_template() method can be used. We need to provide the name of the template and, as keyword arguments, the variables that we want to pass to the template engine. Shown below is one example of this.

def hello(name=None):
        return render_template('hello.html', name=name)

Flask looks for templates in the templates folder. Therefore this templates folder should either be next to your module or else if it is a package, this folder should be inside the package.

The Request Object

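To handle more than one HTTP method in a view function, the methods are listed in the route decorator as shown below. The incoming request is then available through the global request object.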

@app.route('/login', methods=['POST', 'GET'])

The current request method is available through the 'method' attribute. For example, we can check whether the method is POST by using the following statement.

if request.method == 'POST':

Furthermore, form data (data transmitted in a POST or PUT request) can be accessed using the form attribute. An example is shown below.

if valid_login(request.form['username'], request.form['password']):
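The fragments above fit together roughly as in the following sketch. valid_login is a hypothetical helper with hard-coded credentials, used only for illustration, and the response strings are made up.

```python
from flask import Flask, request

app = Flask(__name__)


def valid_login(username, password):
    # Hypothetical check with hard-coded credentials, for illustration only.
    return username == 'admin' and password == 'secret'


@app.route('/login', methods=['POST', 'GET'])
def login():
    if request.method == 'POST':
        if valid_login(request.form['username'], request.form['password']):
            return 'Logged in'
        return 'Invalid username/password'
    return 'Please log in'  # plain GET request


client = app.test_client()
get_body = client.get('/login').data
ok_body = client.post('/login', data={'username': 'admin', 'password': 'secret'}).data
bad_body = client.post('/login', data={'username': 'admin', 'password': 'x'}).data
print(get_body, ok_body, bad_body)
```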

Horoscope-API 

After obtaining a basic knowledge about Flask, I learned about a sample REST API developed by my mentor. This API namely the Horoscope-API deployed at Heroku has been developed using Flask. It uses the horoscope package developed to fetch and parse data from GaneshaSpeaks. This API has methods to return Today's horoscope, Weekly horoscope, Monthly horoscope and Yearly horoscope. Also there is a method to know more about a given sunsign.

GET requests have been used for all the methods in the application. Shown below is the route decorator for the index_route function which returns basic details of the project and the author such as the author name, author URL, project name, project URL etc. 

@app.route('/', methods=['GET']) 

These details are returned as a json object. For that the jsonify() method of Flask has been used. jsonify method creates a response with the JSON representation of the given arguments. The arguments to this method can be any one of the following three forms. 

jsonify(**kwarg)

jsonify(mapping, **kwarg)

jsonify(iterable, **kwarg)

mapping is a positional argument. It actually takes the form of a dictionary having key-value pairs. The key becomes the key in the JSON object and the value becomes the corresponding value. **kwarg denotes a set of keyword arguments. When used with keyword arguments, the argument name becomes the key in the JSON object and the argument value becomes the corresponding value. This method can also accept a positional argument which is an iterable object. Each item in the iterable must itself be an iterable with exactly two objects. The first object of each item becomes a key in the JSON object, and the second object the corresponding value. If a key occurs more than once, the last value for that key becomes the corresponding value in the JSON object. Following is an example JSON response returned when using keyword arguments. 

return jsonify(username=g.user.username, email=g.user.email, id=g.user.id)

This returns the following JSON response. 

{
"username": "admin",
"email": "admin@localhost",
"id": 42
}

Horoscope-API uses the first two forms of arguments for the jsonify method. index_route method uses mapping object type positional argument and all the other methods use keyword arguments to the jsonify method.
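The two forms of jsonify arguments can be compared in a small sketch. The routes, field names and values here are invented for illustration and are not taken from the Horoscope-API code.

```python
import json

from flask import Flask, jsonify

app = Flask(__name__)


@app.route('/', methods=['GET'])
def index_route():
    # mapping form: a dictionary passed as the positional argument
    return jsonify({'author': 'example-author', 'project': 'example-project'})


@app.route('/user', methods=['GET'])
def user_route():
    # keyword form: argument names become the JSON keys
    return jsonify(username='admin', email='admin@localhost', id=42)


client = app.test_client()
index_json = json.loads(client.get('/').data)
user_json = json.loads(client.get('/user').data)
print(index_json)
print(user_json)
```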

Another function used by the Horoscope-API is Python's built-in dict() constructor. It accepts the same set of arguments as mentioned above for the jsonify method and returns a new dictionary built from them. If no argument is given, an empty dictionary is created. The methods of the horoscope package return their results as Python dictionary objects. Horoscope-API creates new dictionaries from these objects by passing them as mapping arguments to dict(). Shown below are two ways of creating the same dictionary, the first one using dict() with keyword arguments. Both return the dictionary {"one": 1, "two": 2, "three": 3}.

a = dict(one=1, two=2, three=3)
b = {'one': 1, 'two': 2, 'three': 3}

With the knowledge gained regarding the Flask web framework and the Horoscope-API I will be starting to code the API to return the Quora user profile information over the next week.

Sunday, April 3, 2016

Fifth Week of LITG Program - Part 2

As mentioned in the previous blog posts, during the rest of the fifth week and the beginning of the sixth week, my main objective was to finish coding the Python module to fetch and parse data from the Quora user profiles. This was a bit of a hard task since I ran into several difficulties. But with the proper guidance from my mentor I could successfully overcome them. The complete package is now available in the github repository at https://github.com/hansika/pyquora.

Since the package is eventually going to be uploaded to the Python Package Index, it has to follow a special directory structure. This directory structure is shown in the following diagram, which was captured using the PyCharm IDE.

Python Package Directory Structure 

Each and every file in this hierarchy plays an important role in a Python package. Let me explain my learning related to these files one by one. __init__.py is used to initialize the Python package. The inclusion of the __init__.py file in a directory indicates to the Python interpreter that the directory should be treated like a Python package. In this case, this directory is named as scrape_quora. One common thing to do in __init__.py file is to import selected classes, functions, etc into the package level so that they can be conveniently imported from the package. In my package, the __init__.py file imports the Scrape_Quora class from the pyquora.py file. Therefore the Scrape_Quora class becomes accessible at the package level. There are two other files inside the package directory, namely test.py and pyquora.py. The pyquora.py file has the code to scrape the web. In my package, this file has individual functions to scrape each and every data item. The test.py file has the code to test the features developed inside the pyquora.py file. 

setup.py file stores the metadata about the library. This is the most important file in any Python package and it exists at the root of the Python package. This file serves two primary functions.
  1. It is the file where various aspects of the project are configured. The primary feature of setup.py is that it contains a global setup() function. The keyword arguments to this function are how specific details of the project are defined. These arguments typically contain the name of the package, version, author, author email, github URL of the package etc. 
  2. It is the command line interface for running various commands that relate to packaging tasks. 
The README and README.md files serve the common purpose of providing a general description of the package, its features and how to use it. The LICENSE file contains the license under which the code is released. For example, the scrape_quora package has the Apache License 2.0. The CHANGES.txt file contains the release information related to the various versions of the package. This includes the release date, the version, and of course the changes made. In my package, the CHANGES.txt file only has details of the initial release. Although not present in the above diagram, Python packages typically have another file named requirements.txt. The purpose of this file is to declare the packages that must be installed to make the newly developed package work.

After creating this basic structure of the Python package, next I started coding. According to the instructions given by my mentor, I was asked to extract the following information given the user name of a certain Quora user profile. 
  • Name of the user 
  • URL 
  • Profile picture link 
  • Follower count 
  • Following count 
  • Count of edits 
  • Number of answers 
  • Number of questions 
I followed a similar structure as in the horoscope package(https://github.com/tapasweni-pathak/pyhoroscope) when developing my package. This made it much easier for me during troubleshooting.

Yet, there was another problem that I faced when scraping the information. Although the code for scraping was more or less similar to that in the horoscope package, I ran into a small issue because the xpath of the same data element differs across user profiles. For example, the xpath of the user's name turned out to be as follows for two different user profiles.


//*[@id="ld_cnuttt_69662"]/div/h1/span

//*[@id="ld_txiquk_13340"]/div/h1/span


Initially I had no idea about this and my code continued to give me an empty list when I scraped the profile name from the profile page. But with the guidance of my mentor, I could find a way to generalize this xpath across all profiles using regular expressions. The generalized xpath obtained to retrieve the profile name is as follows. 

//*[re:test(@id, "ld_[a-z]+_\\d+", g)]/div/h1/span/text()

I read the documentation available at http://exslt.org/regexp/index.html and at http://lxml.de/xpathxslt.html#regular-expressions-in-xpath. According to them, xpath supports regular expressions in the EXSLT namespace. There are three functions for regular expressions in the EXSLT namespace: regexp:test, regexp:match and regexp:replace. The function that I used out of these is regexp:test(string, string, string?). It returns true if the string given as the first argument matches the regular expression given as the second argument. The optional third argument is a string consisting of flags to be used by the test. The following statement shows how such a generalized xpath is used in the code to extract the profile name from the Quora profile page.

name = tree.xpath('//*[re:test(@id, "ld_[a-z]+_\\d+", g)]/div/h1/span/text()', namespaces={'re':Scrape_Quora.regexpNS})[0]

Likewise I had to use regular expressions to generalize the xpaths related to all the data elements to be extracted. While doing this, I could also enhance my learning related to xpaths acquired over the past week. Finally I coded the test.py file to test the functionality of all the functions created. 
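The idea can be tried on two tiny stand-in pages. This is a sketch, not the code of the scrape_quora package: the HTML snippets, the names in them and the profile_name helper are invented, the namespace URI is the EXSLT regular expressions namespace that lxml documents, and the optional flags argument is left out.

```python
from lxml import etree

# EXSLT regular expressions namespace, as documented by lxml.
REGEXP_NS = 'http://exslt.org/regular-expressions'

# Invented stand-ins for two profile pages whose container ids differ.
PAGE_A = ('<html><body><div id="ld_cnuttt_69662"><div><h1>'
          '<span>Alice</span></h1></div></div></body></html>')
PAGE_B = ('<html><body><div id="ld_txiquk_13340"><div><h1>'
          '<span>Bob</span></h1></div></div></body></html>')

# One expression that matches both ids through re:test().
NAME_XPATH = '//*[re:test(@id, "^ld_[a-z]+_\\d+$")]/div/h1/span/text()'


def profile_name(page_source):
    """Return the first text node selected by the generalized xpath."""
    tree = etree.HTML(page_source)
    return tree.xpath(NAME_XPATH, namespaces={'re': REGEXP_NS})[0]


print(profile_name(PAGE_A), profile_name(PAGE_B))
```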

After coding the package, I pushed it to the aforementioned github repository. During the rest of the sixth week, I will be working on pushing this package to the Python package index. 

Friday, March 25, 2016

Fifth Week of LITG Program - Part 1

The fifth week of the LITG program was a bit hectic. To clear the backlog of the previous weeks, I had to work harder on a tight schedule. Before moving on to developing the Python package to scrape Quora user profiles, I was instructed by my mentor to explore and learn about several Python modules such as urllib2, lxml, etree, xpath and re, which are commonly used in Python code to extract data from websites. I also learned about dictionaries in Python. This article summarizes what I learned about these topics. The simple commands that I tried for each of these modules are available in the github repository at https://github.com/hansika/LITG_Practice


urllib2 Module:


urllib2 is a Python module which defines functions and classes that help in opening URLs (Uniform Resource Locators). I referred to the tutorial available at https://docs.python.org/2/howto/urllib2.html on the official Python website and tried out some common commands of the urllib2 module to get a basic understanding. This module is commonly used with HTTP URLs but it can also be used with other protocols such as HTTPS, FTP, Gopher etc. Apart from opening URLs, the module also offers a slightly more complex interface for handling common situations such as basic and digest authentication, redirections and cookies. Such features are provided by objects called handlers and openers.


Get Request

urlopen() is the function used to fetch the content of a particular URL. The response returned is a file-like object with several methods defined on it. The read() and readlines() methods can be used to access the data of the remote resource. The geturl() method returns the real URL of the page fetched. This is particularly useful when urlopen() has followed a redirect. In such circumstances, the URL of the page fetched may not be the same as the URL requested. The info() method returns a dictionary-like object that describes the page fetched (the headers sent by the server). As with any other file-like object, it is important to call the close() method on the response after use.
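A minimal sketch of these calls follows. It assumes Python 3, where urllib2's urlopen() lives in urllib.request, and it uses a data: URL (an assumption made so that the example runs without network access) instead of a real web page.

```python
from urllib.request import urlopen

# A data: URL keeps the sketch runnable without network access.
url = 'data:text/plain;charset=utf-8,Hello%20Quora'

response = urlopen(url)
try:
    body = response.read()                           # raw bytes of the resource
    final_url = response.geturl()                    # URL that was actually fetched
    content_type = response.info()['Content-Type']   # header lookup on the info() object
finally:
    response.close()                                 # always release the response

print(body, final_url, content_type)
```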


Request – Response

The HTTP protocol works in a request-response manner, where clients send requests to the server and the server returns responses. urllib2 has a Request class for this. Calling Request() creates a request object for a particular URL, which can then be passed to urlopen(). Request() can accept an additional data argument to send more information to the URL, such as the name of the client.


Exception Handling

The urlopen() method raises URLError when it cannot handle a response. URLError is usually raised because there is no network connection (no route to the specified server), or because the specified server doesn’t exist. In this case, the exception raised will have a ‘reason’ attribute, which is a tuple containing an error code and a text error message.

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs. If an error is raised when trying to access a HTTP URL, the server responds by returning an HTTP error code and an error page. 
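The usual pattern is to catch the more specific HTTPError before URLError. The following sketch assumes Python 3, where these classes live in urllib.error; fetch is a hypothetical helper, and the URL with a stray leading quote is chosen because it fails before any network traffic takes place.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url):
    """Return (body, problem); exactly one of the two is None."""
    try:
        response = urlopen(url)
    except HTTPError as error:   # subclass first: the server answered with an error code
        return None, 'HTTP error %d' % error.code
    except URLError as error:    # no usable connection or unsupported URL
        return None, 'URL error: %s' % (error.reason,)
    try:
        return response.read(), None
    finally:
        response.close()


# The stray quote makes the scheme unknown, so URLError is raised immediately.
body, problem = fetch("'https://testpypi.python.org/pypi")
print(body, problem)
```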


Basic Authentication

To access certain URLs authentication is required. When such authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a ‘realm’. An example header looks as follows.
WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is ‘basic authentication’. To facilitate basic authentication with urllib2, first of all a password manager is created which handles the mapping of URLs and realms to passwords and usernames. HTTPPasswordMgr or HTTPPasswordMgrWithDefaultRealm is used for this purpose. We pass the username, password and the requested URL to the password manager using its add_password() method. Then we can create an instance of HTTPBasicAuthHandler from the created password manager and an opener object to use this handler. The open() method of this opener can be used to fetch the URLs that request authentication. Later this opener object can be installed such that all calls to urlopen() method of the urllib2 module use this custom opener with the authentication details embedded. 
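The steps above can be sketched as follows, assuming Python 3 (urllib.request instead of urllib2). The URL and the credentials are hypothetical, and no request is actually sent; the last line only checks what the password manager has stored.

```python
from urllib.request import (HTTPBasicAuthHandler,
                            HTTPPasswordMgrWithDefaultRealm, build_opener)

# Hypothetical URL and credentials, for illustration only.
top_level_url = 'http://example.com/private/'
username, password = 'alice', 'secret'

password_mgr = HTTPPasswordMgrWithDefaultRealm()
# None as the realm means: use these credentials for any realm under this URL.
password_mgr.add_password(None, top_level_url, username, password)

handler = HTTPBasicAuthHandler(password_mgr)
opener = build_opener(handler)

# opener.open(url) would answer a 401 challenge with the stored credentials.
# urllib.request.install_opener(opener) would make urlopen() use this opener too.
stored = password_mgr.find_user_password(None, top_level_url + 'index.html')
print(stored)
```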


etree Module:


etree is a module from the lxml library that can be used to process XML documents. I referred to the tutorial at http://lxml.de/tutorial.html to learn about etree. etree is commonly used to scrape content from web pages. The following statement is used in Python to import this module.
from lxml import etree

An Element is the main container object for the ElementTree API. Most of the XML tree functionality is accessed through this class. I tried the following commands to play with the etree module and learn about it.


Creating XML Tree Structure

Elements can be created using the Element() method which accepts the tag of the element as an argument. The SubElement() method can be used to create child elements and assign them to parent elements. This method takes as arguments, the parent node and the tag of the newly created child node as a string. A created XML tree can be printed using the tostring() method which takes as an argument the root of the XML tree. This tostring() method can accept an additional argument in the form “pretty_print=True” in order to print the XML tree in a nice hierarchical manner as in the following example. 

<root>
      <child1/>
      <child2/>
      <child3/>
</root>

Furthermore, iselement() method can take an argument and check whether it is an element.

Elements can also carry attributes. To assign attributes to elements the Element() method can accept an additional argument in the form attribute_name = “attribute_value”. Elements also carry text. Such text can be set via the text attribute of elements. 
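The calls described in this section can be sketched as follows. To keep the example free of third-party dependencies it uses the standard library's xml.etree.ElementTree, which shares Element(), SubElement(), tostring() and iselement() with lxml.etree; the pretty_print option is specific to lxml and is therefore left out. Tag names, the attribute and the text are made up.

```python
import xml.etree.ElementTree as etree

root = etree.Element('root', interesting='totally')  # attribute as keyword argument
child1 = etree.SubElement(root, 'child1')
child2 = etree.SubElement(root, 'child2')
child3 = etree.SubElement(root, 'child3')
child1.text = 'some text'                             # text is set via the text attribute

xml_bytes = etree.tostring(root)                      # serialized tree as bytes
tags = [child.tag for child in root]

print(xml_bytes)
print(etree.iselement(root), etree.iselement('root'))
print(root.get('interesting'), tags)
```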


Using Xpath to Read the Text of Elements 

The XPath() class of the etree module can be called with the argument “string()” or “//text()” to create a function that returns the text content of the elements. When used with the “string()” argument, the text content of all the elements is returned as one concatenated string. When used with the “//text()” argument, the text is returned as a list.


Parsing XML Files

There are three methods that can be used for this, namely fromstring(), XML() and parse(). The fromstring() and XML() methods take a string of XML data as the argument, parse it and return the root element of the XML tree as an element object. On the other hand, the parse() method takes input from files and file-like objects and returns the complete document. What it returns is an ElementTree object rather than an element object.

Furthermore, we can create custom parsers. For example, I created a custom XMLParser and a custom HTMLParser. The custom XMLParser can be passed to the XML() method as an additional argument to instruct Python to use it instead of the default parser when parsing the given string. Similarly, the custom HTMLParser can be passed to the parse() method as an additional argument.
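A short sketch of the three parsing functions follows. It again uses the standard library's xml.etree.ElementTree as a dependency-free stand-in for lxml.etree (HTMLParser exists only in lxml and is not shown); the XML text is made up and the parser is created with default options.

```python
import io
import xml.etree.ElementTree as etree

xml_text = '<root><a>first</a><b>second</b></root>'

root_from_string = etree.fromstring(xml_text)  # returns the root element
root_from_xml = etree.XML(xml_text)            # same result as fromstring()
tree = etree.parse(io.StringIO(xml_text))      # file-like input, returns an ElementTree

custom_parser = etree.XMLParser()              # stand-in for a customised parser
root_custom = etree.XML(xml_text, parser=custom_parser)

print(root_from_string.tag, root_from_xml.tag, type(tree).__name__)
print(tree.getroot().find('b').text, root_custom.find('a').text)
```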


re Module:


This module provides regular expression support in Python. Regular expressions are essentially a small, highly specialized programming language embedded inside Python and made available through the re module. I referred to two tutorials for this, at https://docs.python.org/2/howto/regex.html and https://developers.google.com/edu/python/regular-expressions#basic-patterns

The basic concept behind the re module includes specifying the rules for the set of possible strings that we want to match and asking questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. As mentioned in the first one of the above two tutorials, regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. 


Metacharacters

There are a number of metacharacters in the re module that define special patterns to be matched. Following is a list of metacharacters that re supports.

. - matches anything except a newline character

^ - matches at the beginning of the string; as the first character inside [ ], it complements the character class

* - previous character can be matched zero or more times, instead of exactly once.

+ - matches one or more times

? - matches either once or zero times

{ } - {m,n}, where m and n are decimal integers means there must be at least m repetitions, and at most n. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity

[ ] - character class. [abc] matches any one of the characters a, b or c

\ - escapes metacharacters and introduces special sequences (\d, \D, \s, \S, \w, \W)
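The metacharacters listed above can be checked with a few small patterns. The patterns and sample strings in this sketch are made up for illustration; each entry states whether a match is expected.

```python
import re

# (pattern, text, is a match expected?)
checks = [
    (r'c.t', 'cat', True),       # . matches any single character except a newline
    (r'^cat', 'concat', False),  # ^ anchors the match at the start of the string
    (r'[^0-9]', '7', False),     # ^ inside [ ] complements the character class
    (r'ca*t', 'ct', True),       # * allows zero or more repetitions
    (r'ca+t', 'ct', False),      # + needs at least one repetition
    (r'ca?t', 'caat', False),    # ? allows zero or one repetition
    (r'a{2,3}', 'caat', True),   # {m,n} needs between m and n repetitions
    (r'[abc]', 'dog', False),    # character class: one of a, b or c
    (r'\d+', 'week 5', True),    # \d is the special sequence for a digit
]

results = [bool(re.search(pattern, text)) for pattern, text, _ in checks]
for (pattern, text, expected), found in zip(checks, results):
    print(pattern, repr(text), found)
```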


Compiling Regular Expressions

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. Shown below is a list of methods that can be called on any pattern object.

match() - Determine if the RE matches at the beginning of the string. 
search() - Scan through a string, looking for any location where this RE matches. 
findall() - Find all substrings where the RE matches, and returns them as a list. 
finditer() - Find all substrings where the RE matches, and returns them as an iterator. 


Methods on Match Object Instances

The match object itself can be printed. None is printed when there is no match. In addition, the following methods can be called on the objects returned by the match() or search() methods.

group() - Return the string matched by the RE 
start() - Return the starting position of the match 
end() - Return the ending position of the match 
span() - Return a tuple containing the (start, end) positions of the match 
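The pattern and match object methods can be tried together, as in the following sketch. The pattern and the sample strings are chosen only for illustration.

```python
import re

pattern = re.compile(r'[a-z]+')

no_match = pattern.match('::: message')  # match() only looks at the start of the string
m = pattern.search('::: message')        # search() scans the whole string
word, start, end, span = m.group(), m.start(), m.end(), m.span()

all_words = pattern.findall('fifth week of litg')
spans = [found.span() for found in pattern.finditer('ab cd')]

print(no_match)
print(word, start, end, span)
print(all_words, spans)
```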


Technique Alternative to Compiling re Objects

The second tutorial mentioned above describes an alternative method for pattern matching that does not compile regular expressions into pattern objects. This method takes the following form.
match = re.search(pat, str)

The search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object or None otherwise.

The 'r' at the start of the pattern string denotes that it is a python "raw" string. A raw string literal does not have escape sequences as in a normal Python string. A backslash, \, is taken as meaning "just a backslash" except when it comes right before a quote that would otherwise terminate the literal. This is extremely useful since the syntax of regular expression patterns is already heavy with backslashes. 
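The difference a raw string makes can be seen in a short sketch using the module-level re.search(). The sample text is made up, reusing the id format from the Quora pages mentioned in a later post.

```python
import re

text = 'profile id: ld_cnuttt_69662'

match = re.search(r'ld_[a-z]+_\d+', text)  # raw string: \d reaches re unchanged
same = re.search('ld_[a-z]+_\\d+', text)   # normal string: the backslash must be doubled
matched = match.group() if match else None

print(matched, same.group())
print(len(r'\n'), len('\n'))               # two characters versus one newline character
```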


xpath Method:


xpath() is the method most commonly used to extract text data out of web pages. This has been discussed under the lxml library as well. But that was rather limited to extracting text content from XML tree structures. This explanation addresses extracting text data from html tree structures using the xpath() method. 

Every bit of content on an html webpage has its own xpath. In simple terms, it is the path to the data that we need to extract. As mentioned before, the urlopen() method returns a file-like object which has the content of the requested webpage. A custom HTMLParser can parse this file-like response into an html tree structure. The xpath() method can be called on this tree structure with the xpath of the required data as an argument. It returns a list of the matching results; when the expression ends in text(), these are the text strings of the selected elements.

Getting the xpath on Google chrome is not that hard. All we have to do is to go to the required webpage and then go to the developer tools on the browser. Next, move to the Elements tab. Then we can simply right click on the data that we want and select copy xpath from the menu. This is the xpath that we pass to the xpath() method as an argument.


Python Dictionaries:


Python dictionaries are heavily used in web scraping, so it is important to have a basic understanding of them. In Python syntax, each key in a dictionary is separated from its value by a colon (:), items are separated from each other by commas (,), and the whole thing is enclosed in curly braces. This is shown in the following example.

dict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

The values can be of any type and do not have to be unique. The keys, however, must be unique and of an immutable (hashable) data type such as strings, numbers or tuples.
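These rules can be checked with a short sketch; the dictionary contents here are hypothetical and serve only to illustrate them:

```python
# Hypothetical data, used only to illustrate the rules for keys and values.
scores = {
    "Zara": 7,          # string key
    42: "answer",       # number key
    (6.9, 79.8): 7,     # tuple key; the value 7 repeats, which is allowed
}

try:
    scores[["a", "list"]] = 1   # a list is mutable, so it cannot be a key
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

# Repeating a key in a literal keeps only the last value.
repeated = {"Age": 7, "Age": 8}
```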


Accessing Values

Values of the above dictionary can be accessed using the usual square-bracket syntax: dict['Name'] returns 'Zara'. Accessing a key that is not in the dictionary raises a KeyError.


Updating Dictionary

An existing dictionary can be updated either by adding a new entry or modifying the value of an existing entry. 

dict['Age'] = 8                # update existing entry
dict['school'] = "DPS School"  # add new entry


Deleting Dictionary Elements

Deleting happens in three ways. We can remove individual elements or clear the entire content of a dictionary or delete the entire dictionary.

del dict['Name']  # remove entry with key 'Name'
dict.clear()      # remove all entries in dict
del dict          # delete entire dictionary


Dictionary Functions & Methods

There are several functions and methods defined on dictionaries. Some of them are shown below.

str(dict) – returns a printable string representation of the dictionary.
len(dict) – returns the number of items (key-value pairs) in the dictionary.
dict.clear() – removes all the elements of the dictionary.
dict.get(key) – returns the value of the passed key, or None if the key is not in the dictionary.
dict.has_key(key) – returns True if key is in the dictionary or False otherwise. This method exists only in Python 2; the expression key in dict does the same and also works in Python 3.
dict.values() – returns a list of the values in the dictionary (in Python 3, a view object that can be passed to list()).
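A short sketch of these calls follows. It assumes Python 3, where has_key() is no longer available and the in operator is used instead, and it uses the name student rather than dict so that the built-in dict type is not shadowed.

```python
student = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

as_text = str(student)                   # printable representation
size = len(student)                      # number of key-value pairs
age = student.get('Age')                 # value for an existing key
missing = student.get('School')          # None, since the key is absent
fallback = student.get('School', 'n/a')  # optional default instead of None
has_name = 'Name' in student             # replaces has_key('Name')
values = list(student.values())          # values() is a view in Python 3

student.clear()                          # the dictionary is now empty
```

Using get() instead of square brackets is handy in scraping code, where a field may be missing from a page and a KeyError would otherwise stop the run.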

With the knowledge acquired from these different areas, I will move on to coding the package for scraping the Quora user profile within this week itself.