There appears to be some misunderstanding surrounding the usage of the robots.txt file.
The following is just a fraction of the stuff I have found while spidering websites.
The “noarchive” statement should be part of a meta tag it should not be in the robots.txt file. Its not part of the standard.
I believe that the following or something similar should be in the standard but it isn’t yet ie “Crawl-delay”.
It is implemented by a few crawlers but people insist on doing the following
The proper way should be as follows,
I know Yahoo’s crawler (Slurp) adheres to the Crawl-delay directive
but here we are endorsing a non-standard method, whether this is a good or bad thing is left up to the reader to decide. I think there needs to be a delay type option in the robots.txt file having been hammered once by msn’s bot.
Then we have the people who think that they need to authorize a spider to spider their website.
The reason for not having an Allow directive is simple. Hardly any of the internet would be indexed becasue only a fraction of the websites online actually uses the robots.txt file. By implementing an Allow directive it would mean that websites are closed for business to the spiders. For instance, take the following directive
is the spider then to assume that only that directory is available to the spider on the entire website, what can the spider assume about the above directive. To me it reads that only the “index_me” directory is to be indexed. What then is the point in the Disallow directive.
The Disallow directive was chosen because the internet is for all intents and purposes a public medium so we all opt in when we put our websites up then we opt out of the things we don’t want indexed.
my favorite though are the following. The honest mistakes