Wednesday, April 18, 2012

Weekend project

Quick weekend project: a website to search for the 'halal' status of products in the local market. The data was scraped from the JAKIM website. My primary motivation was to be able to look up halal status from my mobile phone, a small feature phone with a browser (not Android). The JAKIM website can't even be displayed on my phone.

The site uses Django at the backend and the well-known Twitter Bootstrap for the frontend. This is my first use of Bootstrap, and it simply works out of the box on mobile browsers (mine is the old Opera Mini, 3.0-something I guess). I try to avoid Django for weekend projects, but what django-haystack offers as a quick search tool was too tempting, so I decided to use Django for this one anyway. The search backend is Whoosh, with the integration mostly handled by haystack.

Scraping the JAKIM website is not easy: the data all sits in heavily nested HTML tables with no id or class to identify them. I use the Python lxml library to parse the HTML.

When a user searches for a keyword, the site looks first in the Whoosh index. If nothing is found, it queries the JAKIM site directly and then redirects the user to the same page with their query parameter. To make the results immediately available, I used haystack's real-time index feature, which automatically updates the index once a new item is inserted into the db.

This site has one advantage over the JAKIM site. The search on the JAKIM site is naively implemented, so you can't even search for multiple keywords. For example, searching for "shokubutsu original" yields no results from JAKIM, while my site returns exactly what I want.
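The index-first, fall-back-to-JAKIM flow can be sketched roughly like this. It is not the site's actual code: InMemoryIndex is a hypothetical plain-Python stand-in for the haystack/Whoosh index, fetch_remote is a hypothetical callable standing in for the JAKIM scraper, and the product name in the demo is made up.

```python
# Sketch only: plain-Python stand-ins for the Whoosh index and the JAKIM scraper.


class InMemoryIndex(object):
    """Hypothetical stand-in for the haystack/Whoosh index."""

    def __init__(self):
        self.products = []

    def add(self, name):
        # Mimics the real-time index: an item is searchable as soon as it is stored.
        if name not in self.products:
            self.products.append(name)

    def search(self, query):
        # A product matches only if its name contains every keyword in the query.
        keywords = query.lower().split()
        if not keywords:
            return []
        return [p for p in self.products
                if all(k in p.lower() for k in keywords)]


def search_with_fallback(query, index, fetch_remote):
    """Look in the local index first; on a miss, ask the remote source.

    fetch_remote is a hypothetical callable that takes the query string and
    returns a list of product names (on the real site, a scrape of JAKIM).
    """
    results = index.search(query)
    if results:
        return results
    for name in fetch_remote(query):
        index.add(name)
    # Search again, the in-process equivalent of redirecting to the same page.
    return index.search(query)


if __name__ == "__main__":
    def fake_fetch(query):
        # Made-up data for the demo, not a real JAKIM record.
        return ["Shokubutsu Shower Cream Original"]

    index = InMemoryIndex()
    print(search_with_fallback("shokubutsu original", index, fake_fetch))
```

The second lookup for the same keywords is served from the index, so the remote site is only hit on a miss.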

Sunday, April 15, 2012

Django lxml encode error

A Python script using the lxml library that works fine on the console suddenly started throwing an error when imported from a Django views module.
File "lxml.etree.pyx", line 123, in init lxml.etree (src/lxml/lxml.etree.c:160385)
TypeError: encode() argument 1 must be string without null bytes, not unicode
It's unlikely to be a problem with the encoding of the content I want to parse, because the error occurs just from importing the module; I'm not calling any function that does the parsing yet. I had almost given up searching when I found this answer[1] on Stack Overflow. It turns out that on the console I'm using Python 2.6, while mod_wsgi, which runs the Django app, is compiled against Python 2.7, so the lxml build made for 2.6 was presumably being loaded by the 2.7 interpreter.
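A quick way to spot this kind of mismatch is to report the interpreter details from both places and compare them. The helper names below are my own for illustration, and this is only a rough diagnostic, not something from the linked answer.

```python
import sys


def interpreter_info():
    """Return details that identify which Python is actually running."""
    return {
        "version": "%d.%d.%d" % sys.version_info[:3],
        "executable": sys.executable,
        "prefix": sys.prefix,
    }


def same_minor_version(version_a, version_b):
    """True if two 'X.Y.Z' version strings share major and minor numbers."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


if __name__ == "__main__":
    # Run this on the console, then log interpreter_info() from a view
    # served by mod_wsgi, and compare the two results.
    print(interpreter_info())
```

If the two "version" values differ in their minor number (2.6 vs 2.7), compiled extensions such as lxml need to be installed for the interpreter that mod_wsgi uses. Under mod_wsgi the "executable" value may not be meaningful, so "version" and "prefix" are the ones to compare.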
[1]:http://stackoverflow.com/questions/9465248/runtimewarning-compiletime-version-2-6-of-module-lxml-etree-does-not-match-ru