Thursday, March 01, 2007

Government OA sites that limit crawling

From Matt Knoll at RevolutionHealth (thanks to Gary Price):

We recently found a robots.txt file on an NLM site that blocks all spiders except Google.  Is the government allowed to do that? Does anyone know if this is common?
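For readers unfamiliar with the mechanism, a robots.txt that admits one crawler and turns away the rest might look like the sketch below. The file contents, the example.gov URL, and the helper name `allowed_agents` are hypothetical illustrations, not the actual NLM file; only the standard-library `urllib.robotparser` module is used to interpret the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (NOT the real NLM file): Googlebot may fetch
# everything, every other crawler is told to stay out entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Map each user-agent name to whether the rules let it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# example.gov is a placeholder host; the agent names are common crawler tokens.
result = allowed_agents(
    ROBOTS_TXT,
    ["Googlebot", "ia_archiver", "Slurp"],
    "https://example.gov/page.html",
)
print(result)
```

Note that robots.txt is advisory: it only affects crawlers that choose to honor it.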

Last July, Susan Nevelow discovered that the National Science Foundation (NSF) blocked the Wayback Machine from copying its web pages, and she raised much the same questions. Bill Hooker at Open Reading Frame wrote to the NSF webmaster and got a direct answer:

NSF blocks all indexing of the site between 7AM and 7PM ET, our peak traffic hours, for the convenience of our users. However, there is no block on the site from 7PM to 7AM ET. This is standard policy for most high traffic sites. The owner of [the Wayback Machine] need only comply with our policy in order to index our pages.
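A crawler that wanted to comply with a time-of-day policy like the one NSF describes could check the clock before fetching. The following is a minimal sketch under stated assumptions: the function name `crawling_permitted` is hypothetical, the caller is assumed to supply a clock reading already converted to Eastern Time, and treating exactly 7PM as open is a guess, since the quoted policy does not say how the boundaries are handled. It says nothing about how NSF actually enforces the block.

```python
from datetime import time

# Blocked window taken from the quoted policy: 7AM to 7PM ET.
BLOCK_START = time(7, 0)
BLOCK_END = time(19, 0)


def crawling_permitted(eastern_time: time) -> bool:
    """Return True when the Eastern Time reading falls outside the blocked window.

    Assumption: the window is half-open, so 7AM is blocked and 7PM is open.
    """
    return not (BLOCK_START <= eastern_time < BLOCK_END)


print(crawling_permitted(time(12, 0)))   # midday, inside the blocked window
print(crawling_permitted(time(23, 30)))  # late evening, outside the window
```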

Could there be a similar explanation at the NLM?