I have been using a Perl-based indexer for some time now for the search facility on UKlug, but just recently I have noticed that it's taking a bit too long to run. This is not the first time I have optimized the indexer, but this time I decided to bite the bullet and write it in C.
I have also changed the way the search engine works to tf/idf term ranking. This has added some overhead to the indexing, so the Perl indexer would only have become slower had I left it as it was.
Of course, interfacing with PostgreSQL meant that I had to blow the cobwebs off my libpq skills. Having used libpq a while back I was reasonably familiar with it, but it still took me a while to get back into it. It does not help that the docs are a bit spartan and the examples are of little use. It was a case of fudge it and see what works.
The indexing takes place as follows:
1. get job text
2. parse out terms
3. count terms in text
4. insert into reverse index: data_id, term, term_count
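To make the parse and count steps above a bit more concrete, here is a minimal sketch in C. It is not the actual indexer code: the names count_terms, add_term and struct term_count, the two size limits, and the rule that a term is a run of ASCII letters and digits folded to lower case are all assumptions made for illustration. Each entry it produces corresponds to one (term, term_count) pair that would be inserted alongside the job's data_id.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TERMS 1024    /* hypothetical cap on distinct terms per job */
#define MAX_TERM_LEN 64   /* hypothetical cap; longer terms are truncated */

struct term_count {
    char term[MAX_TERM_LEN];
    int count;
};

/* Bump the count for term, or append it with a count of 1.
   Linear search keeps the sketch short; a hash table would be the
   obvious choice for long texts. Returns the new number of terms. */
static size_t add_term(struct term_count *terms, size_t n_terms,
                       const char *term)
{
    for (size_t i = 0; i < n_terms; i++) {
        if (strcmp(terms[i].term, term) == 0) {
            terms[i].count++;
            return n_terms;
        }
    }
    if (n_terms < MAX_TERMS) {
        strcpy(terms[n_terms].term, term);
        terms[n_terms].count = 1;
        n_terms++;
    }
    return n_terms;
}

/* Split text into lower-cased ASCII alphanumeric terms and count them.
   terms must have room for MAX_TERMS entries.
   Returns the number of distinct terms found. */
size_t count_terms(const char *text, struct term_count *terms)
{
    size_t n_terms = 0;
    char buf[MAX_TERM_LEN];
    size_t len = 0;

    for (const char *p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        if (c != '\0' && c < 0x80 && isalnum(c)) {
            if (len < MAX_TERM_LEN - 1)
                buf[len++] = (char)tolower(c);
        } else {
            if (len > 0) {            /* a term just ended */
                buf[len] = '\0';
                n_terms = add_term(terms, n_terms, buf);
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return n_terms;
}
```

Called on the text "Perl developer, senior Perl developer (C and perl)", this yields five distinct terms, with "perl" counted three times and "developer" twice.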
The actual weighting calculation takes place at runtime.
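I have not spelled out the exact weighting formula here, so take the following as the common textbook form of tf/idf rather than exactly what the search engine computes. In it, tf is the stored term_count for a term t in a job d, df is the number of jobs containing that term, and N is the total number of jobs; the symbols and the sample numbers are assumptions for illustration.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \ln\frac{N}{\mathrm{df}_t}
\qquad\text{e.g.}\qquad
3 \cdot \ln\frac{1\,000\,000}{1\,000} = 3 \cdot \ln 1000 \approx 3 \cdot 6.908 \approx 20.72
```

Since df and N change every time jobs are added or removed, computing the weight at query time from the stored counts avoids having to rewrite the index whenever they change.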
Of course there are problems with what I am doing, the biggest of which is character encoding. Until now I have not really had to worry too much about this because most of the jobs in the database have been from either the States or the UK, so for all intents and purposes treating the text as ASCII was sufficient for my needs. I have just recently added both Dutch and German feeds to the database, so it's getting to the point where I can no longer ignore the encoding issue.
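As a rough illustration of why the ASCII assumption stops holding, here is a small C sketch. It assumes the feeds arrive as UTF-8, which I have not actually confirmed, and the function names has_non_ascii and utf8_char_count are made up for the example. In UTF-8 a character such as "ä" takes two bytes, both with the high bit set, so code that walks the text byte by byte and only accepts ASCII letters will cut a word in two at that point.

```c
#include <stddef.h>

/* Returns 1 if the string contains any byte outside the ASCII range,
   0 otherwise. Useful for flagging jobs that need proper handling. */
int has_non_ascii(const char *s)
{
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s >= 0x80)
            return 1;
    }
    return 0;
}

/* Counts characters (code points) rather than bytes, ASSUMING the
   input is well-formed UTF-8. Continuation bytes have the bit
   pattern 10xxxxxx, so every byte that does not match that pattern
   starts a new character. No validation is done. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the German word "Geschäftsführer" encoded as UTF-8, strlen reports 17 bytes while utf8_char_count reports 15 characters, and has_non_ascii returns 1.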
Another problem is that a reverse index does not scale as well as other methods, although at the moment it is handling several million entries comfortably.