Thursday, September 20, 2007

Defining open access for robots

Peter Murray-Rust, The laws of robotics; request for drafting, A Scientist and the Web, September 20, 2007.  Excerpt:

I have been asked about what we need for robotic access to publishers’ sites. Several publishers are starting to allow robotic access to their Open material. (Of course the full BBB declarations logically require this, but in practice many publishers haven’t made the connection.) So let’s assume a publisher who espouses Open Access and allows robotic access to their site. Is, say, a CC licence enough?

There are no moral problems with CC, but the use of robots has additional technical problems, even when everyone agrees they want it to happen....

I can see roughly two types of robotic behaviour:

  1. systematic download for mining or indexing....It would be highly desirable to minimise repetitious indexing and an enthusiastic publisher could put their XML material in a proper repository framework with a RESTful API (rather than requiring HTML screen-scraping of PDF-hack-and-swear). In return there could be a list of acknowledged robots so that these could act as “proxies” or caches.
  2. Random access from links in abstracts or citations. This is likely to happen when the bot is in PMC/UKPMC, or crystaleye, and discovers an interesting abstract and goes to the full-text on a publisher’s site. The bot may have been created by an individual researcher for a single one-time purpose.

So I’d like to come up with (three?) laws of mining robotics. Here’s a first shot:

  • A publisher should display clear protocols for robots, with explanations of any restrictions and lists of any regular mining bots.
  • A data-miner should use software that is capable of honouring machine-understandable guidance from servers. The robots should be prepared to use secondary sites.
  • Mining software should be Open Source and should honour a common set of public protocols.
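
One existing form of "machine-understandable guidance" is a robots.txt file, and the sketch below shows roughly how a mining robot might honour one using Python's standard urllib.robotparser module. The excerpt does not name any particular protocol, so this is only an illustration under assumptions: the robots.txt content, the agent name "example-mining-bot", the publisher.example URLs, and the helper functions are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a publisher might serve.
# The paths and the delay value are illustrative assumptions.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /subscription/
Allow: /openaccess/
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object the robot can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(policy: RobotFileParser, agent: str, url: str) -> bool:
    """Return True if the publisher's guidance permits this agent to fetch url."""
    return policy.can_fetch(agent, url)


def wait_seconds(policy: RobotFileParser, agent: str, default: float = 1.0) -> float:
    """Return the delay the publisher requests between requests, or a default."""
    delay = policy.crawl_delay(agent)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    policy = build_policy(EXAMPLE_ROBOTS_TXT)
    agent = "example-mining-bot"  # hypothetical robot name
    for url in (
        "https://publisher.example/openaccess/article1.xml",
        "https://publisher.example/subscription/article2.xml",
    ):
        print(url, may_fetch(policy, agent, url), wait_seconds(policy, agent))
```

A real robot would retrieve the publisher's actual robots.txt (for example with RobotFileParser.set_url and read) and sleep for the requested delay between requests. Pointing the robot at an acknowledged proxy or cache, as the second law suggests, would need some further convention that robots.txt does not provide.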

But I would like suggestions from people who have been through this…