Almost a Blog

Month: December, 2003

Should I add a comment feature 28 Dec 03

I suppose I should add a tool where people can leave comments, but that might be asking for trouble, and it is a little more involved than writing my own RSS generator. It’s still fairly straightforward, though, if you have any experience with a database and some free time, something I seem to get in fits and starts. I have a tool that allows people to comment on recruitment agencies, but it is fairly basic and, compared to some of the forums around today, it’s not even in the same league.

Some RSS 27 Dec 03

I have been playing with RSS for a few days, and since most of the RSS I have seen has been blogs, I decided to RSS-enable my plain old XHTML diary and turn it into a whizzy, RSS-compliant, newfangled jobby. I have no reason for doing this other than possible self-promotion via my massively increased site traffic... “NOT”.
I can hear people scream “use X or Y, do not write your own”. But what would be the fun in using someone else’s RSS generator? I had a look at some of the more noteworthy blogs and noticed that there is an awful lot of commented-out text in the source of their pages. This seems to me to be a bit ignorant because I am paying for bandwidth and every bit counts ;-). I know that’s a lame excuse, but I could not help it, nor could I think of a better one. To cut a long story short, I used a very crude method to do it.
Using a couple of extra “span” tags I was able to come up with some compliant RSS from my blog. The joy of Perl.
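To make the span idea concrete, here is a minimal sketch of the kind of markup involved and of the three fields that end up in each RSS item. The class names blogtitle and blogtext are the ones the script looks for; the exact shape of the sample entry is an assumption, and the regular expression is used only to keep the illustration free of CPAN modules (the script itself uses HTML::Parser).

```perl
use strict;
use warnings;

# Hypothetical diary entry. Only the class names are taken from the
# script; the rest of the markup is assumed for illustration.
my $html = <<'HTML';
<span class="blogtitle"><a href="http://www.hjackson.org/cgi-bin/blog/december.html">Some RSS</a></span>
<span class="blogtext">I have been playing with RSS for a few days.</span>
HTML

# Return one hash (link, title, description) per title/text pair found.
sub extract_items {
    my ($page) = @_;
    my @items;
    while ($page =~ m{
        <span\s+class="blogtitle">\s*
        <a\s+href="([^"]*)">([^<]*)</a>\s*
        </span>\s*
        <span\s+class="blogtext">(.*?)</span>
    }gsxi) {
        push @items, { link => $1, title => $2, description => $3 };
    }
    return @items;
}

for my $item (extract_items($html)) {
    print "title:       $item->{title}\n";
    print "link:        $item->{link}\n";
    print "description: $item->{description}\n";
}
```

A regular expression like this only copes with markup that looks exactly like the sample, which is one reason to prefer a real parser for anything beyond a demonstration.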
The Script I used
The following script is quite rough around the edges, but it gets the job done. If you have any questions about the Perl, or about why I just had to write my own, feel free to ask.
use strict;
use warnings;
use HTML::Parser;
use URI::URL;
use XML::RSS;
use LWP::Simple;

my $base     = "/hjackson";
my $base_url = "http://www.hjackson.org";

# Map each HTML diary page to the RSS file generated from it.
my $PAGES = {
    "$base_url/cgi-bin/blog/december.html"  => 'htdocs/blog/december.xml',
    "$base_url/cgi-bin/blog/november.html"  => 'htdocs/blog/november.xml',
    "$base_url/cgi-bin/blog/october.html"   => 'htdocs/blog/october.xml',
    "$base_url/cgi-bin/blog/september.html" => 'htdocs/blog/september.xml',
};

# Parser state: 0 = not seen, 1 = inside the element, 2 = content captured.
my $STATE = { 'intext'  => 0,
              'intitle' => 0,
              'inlink'  => 0,
              'inspan'  => 0, };

# Fields of the RSS item currently being assembled.
my $RSS = { 'link'        => "",
            'title'       => "",
            'description' => "", };

# Called for each opening <span> or <a> tag.
sub start_tag {
    my ($self, $tag_name, $attr) = @_;
    if ( lc($tag_name) eq 'span' ) {
        # Not every span carries a class, so default to an empty string.
        my $class = lc( $attr->{class} || '' );
        if ( $class eq 'blogtitle' ) {
            #print "In Span $tag_name\n";
            $STATE->{intitle} = 1;
        }
        if ( $class eq 'blogtext' ) {
            #print "In Span $tag_name\n";
            $STATE->{intext} = 1;
        }
    }
    if ( lc($tag_name) eq 'a' and $STATE->{intitle} == 2 ) {
        #print "href = $attr->{href}\n";
        $STATE->{inlink} = 1;
        $RSS->{link}     = $attr->{href};
    }
}
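
# Called for each run of text between the reported tags.
sub text {
    my ($self, $text) = @_;
    if ( $STATE->{intitle} == 1 ) {
        #print "Title = $text\n";
        $RSS->{title}     = $text;
        $STATE->{intitle} = 2;
    }
    if ( $STATE->{intitle} == 2 and $STATE->{inlink} == 1 ) {
        $RSS->{title}    = $text;
        $STATE->{inlink} = 2;
    }
    if ( $STATE->{intext} == 1 ) {
        #print "$text\n";
        $RSS->{description} = $text;
        $STATE->{intext}    = 2;
    }
    if (    ( $STATE->{intitle} == 2 )
        and ( $STATE->{intext} == 2 )
        and ( $STATE->{inlink} == 2 ) )
    {
        # Title, link and text are all captured. The body of this branch
        # is missing from the listing; calling create_rss() is assumed,
        # since nothing else in the script calls it.
        create_rss();
    }
}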
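
# Called for each closing tag. The branch bodies are empty in the
# listing as published, so they are left empty here.
sub end_tag {
    my ($self, $tag_name, $attr) = @_;
    if ( lc($tag_name) eq 'span' ) {
        if ( $STATE->{intitle} ) {
        }
        if ( $STATE->{intext} ) {
        }
    }
}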
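
my $rss;    # XML::RSS object for the page currently being processed

# Add the assembled item to the feed, then reset for the next entry.
sub create_rss {
    $rss->add_item(
        title       => $RSS->{title},
        link        => $RSS->{link},
        description => $RSS->{description},
    );
    $RSS->{title}       = "";
    $RSS->{link}        = "";
    $RSS->{description} = "";
    $STATE->{intext}    = 0;
    $STATE->{intitle}   = 0;
}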

# Fetch each diary page, parse it, and write the matching RSS file.
my ($html_page, $xml_page);
while ( ($html_page, $xml_page) = each %{ $PAGES } ) {
    my $content = get($html_page);
    defined $content or die "Cannot fetch $html_page\n";
    #print "$html_page \n$content\n";
    $rss = XML::RSS->new( version => '1.0' );
    $rss->channel(
        title       => "Harry Jacksons Blog",
        link        => $base_url,
        description => "Just my Blog",
        dc => {
            date      => '2000-08-23T07:00+00:00',
            subject   => "Harrys Blog",
            creator   => 'harry@hjackson.org',
            publisher => 'harry@hjackson.org',
            rights    => 'Copyright 2003, Harry Jackson',
            language  => 'en-us',
        },
        syn => {
            updatePeriod    => "hourly",
            updateFrequency => "1",
            updateBase      => "1901-01-01T00:00+00:00",
        },
    );
    my @tags = ('span', 'a');
    my $p = HTML::Parser->new( api_version => 3 );
    $p->report_tags(@tags);
    $p->handler( start => \&start_tag, "self,tagname,attr" );
    $p->handler( text  => \&text,      "self,text" );
    $p->handler( end   => \&end_tag,   "self,tagname,attr" );
    $p->parse($content) || die $!;
    $p->eof;    # flush any text the parser is still buffering
    open( my $fh, '>', "$base/$xml_page" )
        or die "Cannot open file $!\n";
    print {$fh} $rss->as_string;
    close($fh) or die "Cannot close file $!\n";
}

RSS Job Database 23 Dec 03

I have been playing with RSS for a few days and have now got an RSS Job database. I spent ages trying to find RSS feeds for this and so far have not found very many. The database can be found here. An example URL which can be used to search and create RSS feeds from the database is as follows:
This link creates an RSS Version 1.0 feed based on a search from the database. You can see from the URL that we are searching for the terms “perl” and “london”. For more information on how to use the database please see the help page.
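As a rough sketch of how search terms turn into a feed URL of this kind, the snippet below builds a query string from a list of terms. The endpoint (example.org) and the parameter names terms and format are hypothetical placeholders, not the real address or parameters of the job database, so check the help page for the actual ones.

```perl
use strict;
use warnings;

# Percent-encode one query-string value (plain ASCII input assumed).
sub encode_value {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $value;
}

# Build a search URL from an endpoint and a list of terms.
# The "terms" and "format" parameter names are hypothetical.
sub search_url {
    my ($endpoint, @terms) = @_;
    my $query = join '&',
        'terms=' . encode_value( join ' ', @terms ),
        'format=rss1';
    return "$endpoint?$query";
}

# Hypothetical endpoint, used for illustration only.
print search_url( 'http://example.org/jobs/search', 'perl', 'london' ), "\n";
```

Encoding the terms matters as soon as a search contains spaces or punctuation, for example a term such as “c++”.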

Another Google Find

I found a reference to my brother, Lee Jackson, while searching for information on our family name etc. There is not really much there, just that he won a darts match. It’s weird when you just happen to come across it.

Spidering the Internet 07 Dec 03

I have started to document what I have been doing to construct the spiders. It is not really a tutorial; it’s more about what I did and how I did it. I doubt it is even close to how it should be done, but I am enjoying doing it, and I get to research some interesting areas of information retrieval and processing while doing it, so what the hell.

Finishing the Robots 03 Dec 03

I have been quite busy lately dipping my toes in various waters, hence the lack of entries. I have actually finished with the robots for the short term and have now moved on to the search engine part of the project.
I am enjoying building the search engine because I get to work with C++ again, which is another language I enjoy using. I like it because I feel as if I am as close to the hardware as I am when using C, but have various high-level tools at hand when I need them. I picked C++ over C because it has the STL, which I have used before. I imagine that most commercial search engines are using either C or C++ for