Sunday, April 06, 2008

No data- or text-mining at PMC

Peter Murray-Rust, Can I Data- and Text-mine Pubmed Central?, A Scientist and the Web, April 5, 2008.  Excerpt:

Until last week I had assumed that the NIH policy on access to publicly funded research grants full Open Access rights to anyone in the world....

Last week I learned at Dagstuhl that data- and text-mining of Pubmed Central was blocked by the site itself - delegates had found that there is a maximum of two papers that can be downloaded before the IP address is blocked.

I’d very much like clarification (as I have found the NIH sites and elsewhere extremely difficult to navigate on a consistent basis). There is no explicit mention of the right to download material for data-mining and a lot of verbiage about “consistency with publishers’ policies” which is no help to scientists like me.

So - simply - when the flood of public depositions comes on stream after April 7 (obviously with some delay) can I text-mine them?

This is important....

So - simply - can I run my robots over the material deposited by mandate?

  1. Yes - without question or fear of reprisal.
  2. No - not at all.
  3. Well - um - err - it depends on each individual paper and each individual publisher and nobody can give a clear answer

The current answer appears to be 2 (I will be cut off mechanically). I suspect the real answer is 3....

If the NIH aren’t prepared to do this then the “victory” is only the first step in a long struggle for liberating data.

Also see his follow-up post, No-One May Data- Or Text-Mine Pubmed Central, April 6, 2008.  Excerpt:

I realised with considerable disappointment (Can I Data- and Text-mine Pubmed Central?) that I might not be able to text- and data-mine the material that the NIH has required to be deposited in Pubmed Central in its mandate. Now I have got confirmation by email from an authoritative source (who asks not to be named in case the information is not quite precise). But in general terms the answer is simple:


In short Pubmed Central is “free access” (no price barriers), not “open access” (no permission barriers). You may not download material from it (except to expose it to your own eyeballs), and certainly not redistribute it. You may not data-mine it.

I am aware of the struggle that was required to get George Bush to sign the mandate and it certainly wasn’t the time to break ranks. But now that the mandate is passed (and starts tomorrow) we must press ahead immediately to campaign for full access to the text....

So we have to argue to the NIH that bioscience is desperately impoverished by the unreasonable permission barriers that are now in place.... 


  • Peter MR is right.  PMC removes price barriers and leaves permission barriers in place.  Users may not exceed fair use, which is not enough for redistribution or most kinds of text- and data-mining.  For detail, and official confirmation, see Question F2 in the NIH FAQ:

What is the difference between the NIH Public Access Policy and Open Access?

The Public Access Policy ensures that the public has access to the peer reviewed and published results of all NIH funded research through PubMed Central (PMC).  United States and/or foreign copyright laws protect most of the articles in PMC; PMC provides access to them at no cost, much like a library does, under the principles of Fair Use.

Generally, Open Access involves the use of a copyrighted document under a Creative Commons or similar license-type agreement that allows more liberal use (including redistribution) than the traditional principles of Fair Use.  Only a subset of the articles in PMC are available under such Open Access provisions.  See the PMC Copyright page for more information.

  • Removing price barriers from NIH-funded research was a major victory, and one we couldn't have achieved if we had demanded the removal of permission barriers at the same time.  But Peter is right that researchers need more and that we have to keep working for further goals.  In time, I hope we can shorten the permissible 12-month embargo and remove permission barriers from the copies covered by the NIH policy.