Tuning Perl for a Postgres-based search engine

by paddyjackson

I have a database of just over 11,000 jobs, and I need to run an indexer against it for the search engine to work. Recently this has been getting slower and slower due to other load on the server, so I decided to have a look at it tonight. The following is what I did and what I found:
Preliminaries.
All regexes are pre-compiled prior to comparison using something like:
$dict{qr/\b\Q$keyword\E\b/} = $keyword;
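A minimal, self-contained sketch of that pre-compilation step (the keyword list here is invented; in the real indexer it comes from the key_word_search table):

```perl
use strict;
use warnings;

# Hypothetical keyword list standing in for the key_word_search table.
my @keywords = ('perl', 'postgres', 'search engine');

# Pre-compile one word-boundary pattern per keyword; the qr// object
# stringifies when used as a hash key, so the value keeps the raw
# keyword for reporting matches later.
my %dict;
$dict{ qr/\b\Q$_\E\b/ } = $_ for @keywords;

# Matching a feed entry is then one pass over the compiled patterns.
my $entry = 'senior perl developer wanted, postgres experience a plus';
my @hits;
for my $pat (keys %dict) {
    push @hits, $dict{$pat} if $entry =~ /$pat/;
}
print join(', ', sort @hits), "\n";    # perl, postgres
```

Note that the hash key is the stringified form of the compiled pattern (e.g. `(?^:\bperl\b)`), which Perl recompiles and caches on first use, so the scheme still avoids rebuilding the pattern from the raw keyword on every comparison.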
Table             Rows
key_word_search   51641
rss_jobs          179 (last nine hours' worth)

Total checks == 9243739 (approx, 51641 x 179)
Methods:
0. Normal index with no tuning applied. This is what has been running
for the last few months.
1. For each job entry we are indexing, first check whether the job_id
and keyword pair is already in the index. If yes, go to the next record.
2. Use Perl’s “index” function to pre-process the candidates and only try a
full regex if the string appears somewhere in one of the 3 RSS entries.
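Method 2 can be sketched like this (data invented for illustration); the point is that the cheap substring scan runs on every pair, and the anchored regex only runs on the few pairs that survive it:

```perl
use strict;
use warnings;

# Invented sample data; the real loop walks rss_jobs x key_word_search.
my %dict = ( perl => qr/\bperl\b/, postgres => qr/\bpostgres\b/ );
my @entries = (
    'experienced perl programmer required',  # substring and regex both hit
    'java team lead',                        # no substring hit, regex skipped
    'superlative salary',                    # substring "perl" hits, regex rejects
);

my ($in, $match) = (0, 0);
for my $entry (@entries) {
    for my $kw (keys %dict) {
        # Cheap C-level substring scan first...
        next if index($entry, $kw) < 0;
        $in++;
        # ...full word-boundary regex only on the survivors.
        $match++ if $entry =~ $dict{$kw};
    }
}
print "In == $in MATCH == $match\n";    # In == 2 MATCH == 1
```

"superlative" shows why the regex is still needed after the substring test: index() finds "perl" inside it, but the \b anchors reject it as a keyword match.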
Results.
I was not going to try and second-guess the result, but I had a feeling
that M2 would be quicker. What surprised me was just how much
quicker. I imagine each method would see an improvement if more RAM was
given to Postgres, especially M1 and M0, but I doubt either of them would
catch M2.
Also, the trigger that inserts the job already carries out quite a
few checks to ensure the entry does not exist, so M1 is being
duplicated somewhat anyway, and I am not about to relax the database checks/integrity to satisfy performance. Performance can normally be
cured by some other method, as can be seen here.
Outer == No. of total operations applied.
In == No. left after filtering by the method.
MATCH == No. we matched and will be entered into the database.
The “In” filter is what each method applies before I try a full regex
match; the original indexer had no filter.
Method 0:
Outer == 9239317 In == 9239317 MATCH == 3009
real 8m23.868s
user 8m9.510s
sys 0m0.720s
Method 2 (the “index” pre-filter):
Outer == 9239317 In == 14546 MATCH == 3009
real 1m30.897s
user 1m25.840s
sys 0m0.520s
As you can see, by using Perl’s inbuilt “index” function I have
managed to narrow the actual operations considerably. We are relying on
the speed of “index” compared to a full regex match to gain the
speed here. I imagine it is implemented almost literally with C’s
char *strstr(const char *s1, const char *s2);
or something similar.
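A quick synthetic Benchmark run makes the gap visible. The common case while filtering nine million pairs is a keyword that is not in the entry at all, which is exactly where index() shines; the absolute numbers will vary by machine:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Synthetic worst case for the regex: the keyword never appears.
my $entry = 'java developer, banking sector, city of london ' x 20;
my $re    = qr/\bpostgres\b/;

timethese(100_000, {
    'index' => sub { index($entry, 'postgres') >= 0 },
    'regex' => sub { $entry =~ $re },
});
```

Both tests answer the same question (“could this keyword possibly be here?”); only the cost differs.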
Method 1 (the database check):
Outer == 105084 In == 99293 MATCH == 23
real 2m9.680s
user 0m16.840s
sys 0m5.090s
We can see here that this method is a lot slower. I actually stopped
this one early: it had completed just over 1% of the total
operations required and had already taken over 2 minutes. This was always
going to be slow due to the amount of IO required, i.e. 9 million
possible calls to the database, and a binary lookup on an index of just
over 800k rows is not going to be that fast at the best of times.
As an exercise, and to satisfy my own curiosity, I tried putting M2 first
and using M1 after it to see what would happen, and the following was the
result.
Outer == 9239317 In == 14540 MATCH == 3009
real 1m42.974s
user 1m22.980s
sys 0m1.430s
We can see from this that calling out to the database is adding
overhead to the process.
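The ordering lesson generalises: run the cheapest filter first. A sketch of that combined pipeline, with a plain hash standing in for the database round trip (in the real indexer this is a query against the index table, and the data here is invented):

```perl
use strict;
use warnings;

# %indexed stands in for the "is this job_id/keyword pair already
# indexed?" database query.
my %indexed = ( '42:perl' => 1 );

sub worth_a_regex {
    my ($job_id, $entry, $kw) = @_;
    return 0 if index($entry, $kw) < 0;    # cheap substring scan first
    return 0 if $indexed{"$job_id:$kw"};   # dearer existence check second
    return 1;                              # candidate for the full regex
}

print worth_a_regex(42, 'perl and postgres role', 'perl'),     "\n"; # 0: already indexed
print worth_a_regex(42, 'perl and postgres role', 'postgres'), "\n"; # 1: candidate
print worth_a_regex(42, 'java role', 'perl'),                  "\n"; # 0: no substring
```

Because index() rejects the vast majority of pairs, the expensive check only ever sees the handful of survivors, which is why the combined run stays close to the index-only time.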
Conclusion:
When running heavy, regex-intensive operations it pays to
pre-process using Perl’s inbuilt “index” rather than relying on the
speed of the regex itself.
