Tuesday, April 24, 2007

Data-driven science requires open data

Peter Murray-Rust, "Data-driven science - a scientist's view", a position paper for the NSF/JISC Repositories Workshop (Phoenix, April 17-19, 2007). Excerpt:

...Our thesis is that the current scientific literature, were it to be presented in semantically accessible form, contains huge amounts of undiscovered science. However the apathy of the academic, scientific and information communities coupled with the indifference or even active hostility and greed of many publishers renders literature-data-driven science still inaccessible....

I use the neologism "hypopublication" ("hypo-" = "below", "low", or "insufficient") to emphasize the inadequacy of current publication protocols and the lack of hyperlinking or aggregatability. For example about 2 million chemical compounds are published each year (about half in patents) with insufficient semantics, metadata or hyperstructure. Vast effort is required to create useful data from these, and the current commercial processes seriously disadvantage the whole of science. It should now be possible to publish a fairly complete scientific record of an experiment, yet the current publication process continues to emphasize the "article" at the expense of the data. The article summarises the experiment and gives the essential impact factor (market indicator for tenure and funding) - the data are often missing or so emasculated as to be useless. It is the film review without access to the film.

Many scientific disciplines require publication - in textual form - of sufficient data for the experiment to be evaluated (though frequently not enough to allow replication). Some communities laudably insist on machine-parsable data including much bioscience (genomes, protein sequences and structures) and crystallography. Over the years they have managed to coerce the publishers to require authors to provide this information. If all communities did this, for all major kinds of data, then literature-driven science would become a reality. Note, however, that some publishers (such as ACS and Wiley) see such data as their property. Although "facts cannot be copyrighted", these publishers continue to insist on this and one senior representative recently told me that this was so they could "sell the data". To try to counter this I am promoting the concept of Open Data - including a mailing list offered by SPARC. The STM publishers have agreed that factual data is not copyrightable, but there is generally indifference in the academic and information communities to the importance of insisting on this.

It is important to stress that "Open Access" - as currently practised - does not promote Open Data. The Budapest and other declarations make it clear that Open Access involves free, unrestricted access to all the data for whatever legal purpose. In practice, however, publishers ban robotic indexing of sites, cut off subscribers whom they opine are downloading too much content, and continue to copyright facts. The politicisation and complexity of the Open Access struggle means that Open Data currently has little community recognition and support. Yet Open Data is the single most important problem in data-driven science....