Saturday, May 24, 2008

Harvesting chemical data from published articles

Peter Murray-Rust has blogged some notes on his talk at the Royal Society of Chemistry meeting, Open Access Publishing in the Chemical Sciences (London, May 22, 2008). He first endorses Christoph Steinbeck's summary of the talk, then adds some notes of his own:

The main thing we took away was the importance of factual data. No-one disputed that facts could not be copyrighted (though not all realised that copyright was only one of the methods used by publishers to control access and re-use - server-side beheading is completely effective). I asked the audience - more than 30 people, composed of publishers, librarians, software companies, etc. - no actual chemists of course - whether anyone would object to our robots reading the literature and extracting the data from the papers whether as text, images of tables. Half the audience thought I should; the rest didn’t vote against.

So, publishers, I’m going to start mining data from your sites. I hope you welcome this as a way forward to a new exciting era of data-rich science publishing. I hope that if you don’t agree you’ll let me know. I wouldn’t like to start and then get the lawyers sent. So please comment - it’s very important. I shan’t attack anyone who sends a reply. And you can send it by confidential email if you like.

There are a million new compounds each year in the scholarly literature. Our robots can produce huge amounts of good information from it. In some cases we get over 90% recall and precision - it depends on the type. This must be good for science. So please, publishers, let us know we can do it and we’ll publicly thank you. And if you don’t like the idea, please let us know why....
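
For readers unfamiliar with the two figures Murray-Rust cites: precision is the fraction of extracted items that are correct, and recall is the fraction of the correct items that were actually extracted. The short sketch below shows the arithmetic only. The compound names and the `precision_recall` helper are hypothetical illustrations, not data or code from Murray-Rust's robots.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) for a set of extracted items
    compared against a hand-checked reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    # Precision: share of what the robot reported that is correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of what is in the paper that the robot found.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example: compounds a curator found in one paper
# versus compounds an extraction robot reported for the same paper.
reference = {"benzene", "toluene", "phenol", "aniline", "pyridine"}
extracted = {"benzene", "toluene", "phenol", "aniline", "ethanol"}

precision, recall = precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case the robot reports five compounds, four of them correct, and misses one of the five real ones, so both figures come out at 80%.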