Sunday, July 22, 2007

Which publishers allow harvesting of public-domain data?

Peter Murray-Rust, Request to Elsevier for robotic extraction of data from their journals, A Scientist and the Web, July 22, 2007. Excerpt:

In previous posts I have written on the value of robotic extraction of data in scientific articles. By default Elsevier do not allow robotic extraction:

…You may not engage in systematic retrieval of Content from the Site to create or compile, directly or indirectly, a collection, compilation, database or directory without prior written permission from Elsevier.

The Site may contain robot exclusion headers, and you agree that you will not use any robots, spiders, crawlers or other automated downloading programs or devices to access, search, index, monitor or copy any Content

PMR: So I have written the following letter:

Subject: Permission to extract crystallographic data robotically from Elsevier publications

…I and colleagues have built a repository of crystallographic information published in scientific journals. This data is factual, and not copyrighted by the original authors. Major publishers such as the International Union of Crystallography and the Royal Society of Chemistry encourage (and often demand) the publication of such data as part of the scientific record and mount it on their sites as “supporting information” or “supplemental data”. It is of extremely high quality and over the last 30 years the crystallographic and chemical community have shown that it is an essential resource for data-driven science….

We have built robots which have analysed over 50,000 papers on publishers’ sites and extracted the crystallography. Note that the major publishers I have referred to do NOT require a subscription to access this information. We have agreed protocols whereby our robots run at times and frequencies that do not cause denial of service (DOS) - i.e. we try to be responsible.

Elsevier journals do not expose this as public supplemental information but I believe it is available to toll-access subscribers. I would like permission to extract crystallographic data from any Elsevier journals using robotic techniques and to make the TRANSFORMED extracted data public under a CC-BY licence (Creative Commons) or an OpenData license from the Open Knowledge Foundation. All data so extracted would be referenced through the DOI of the article thus allowing any user (human or robot) to give full citation and therefore credit to the authors and the journal….We need not store the actual documents….

I am guessing that Elsevier journals (e.g. Tetrahedron, Polyhedron, etc.) contain a total of ca 20,000 relevant papers - until we are able to examine them robotically I can’t be more precise. Obviously I cannot write for permission for each paper individually so I am asking for general permission to carry out robotic extraction of crystallographic data from all Elsevier journals to which I have access through my institution. And I would obviously agree to devising a robotic protocol that was friendly to your web server….

If you and colleagues wish to be convinced of the value and quality of this cyberscience please have a look at [CrystalEye] where you can see the aggregated material from the other publishers. Although we haven’t published the results formally yet, two graduate students have carried out thousands of days’ work of theoretical calculations on the data which we believe have led to new insights into crystal and molecular structure….

Note that this is a public request - I have explained the reasons on my blog in which this letter is contained. Since this is a matter of considerable current public interest I request permission to post your replies - if there is material that you wish to remain confidential please send a separate mail to me indicating confidentiality which I will honour.
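
For readers wondering what a server-friendly "robotic protocol" of the kind the letter offers might look like, here is a minimal Python sketch. It is not the CrystalEye code, and the letter does not describe the actual protocols agreed with publishers. The class name PoliteRobot, the user-agent string, and the 10-second default delay are all assumptions made for illustration. The sketch only decides whether and when a URL may be fetched: it honours a site's robot exclusion rules and waits a minimum interval between requests. It does no downloading itself.

```python
import time
import urllib.robotparser


class PoliteRobot:
    """Decides whether and when a URL may be fetched; performs no network I/O."""

    def __init__(self, robots_txt, user_agent="example-crystal-robot",
                 min_delay=10.0, clock=time.monotonic):
        # user_agent and min_delay are illustrative assumptions, not values
        # taken from the letter or from any publisher agreement.
        self.user_agent = user_agent
        self.clock = clock
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        # Use the site's Crawl-delay when it is stricter than our own floor.
        site_delay = self.parser.crawl_delay(user_agent)
        self.delay = max(float(min_delay), float(site_delay or 0))
        self.last_request = None

    def allowed(self, url):
        """True if the site's robots.txt rules permit this robot to fetch url."""
        return self.parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self):
        """Seconds the robot should still sleep before its next request."""
        if self.last_request is None:
            return 0.0
        elapsed = self.clock() - self.last_request
        return max(0.0, self.delay - elapsed)

    def record_request(self):
        """Call immediately after each request so the delay is enforced."""
        self.last_request = self.clock()
```

A real harvester would call allowed() before each download, sleep for seconds_to_wait(), fetch the page, and then call record_request(). Restricting runs to off-peak hours, which the letter also mentions, would be an additional check on top of this.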