Open Access News

News from the open access movement

Sunday, November 18, 2007

What counts as open data?

Peter Murray-Rust, Open Data, A Scientist and the Web, November 17, 2007.  Excerpt:

There are several reasons why I’m currently thinking about Open Data (see Open Data at WP for some collected wisdom and links). We’re currently collecting more chemistry data that we intend to make Openly available (see CrystalEye knowledge base as an example). I’ve been asked to write an article for Serials Review (Elsevier) on the subject and am putting my ideas in order. Chemspider announced Something New and Exciting Coming Soon… which contained an image with “Open Data” (no details). And Peter Suber announced New OA database on material properties, originally from the Chemistry Central blog which announced “The database is yet another of the free, on-line chemical services to have emerged in recent years.” The use of “OA” was, I think, Peter’s.

I didn’t agree with Peter in his description of Material Properties as an “Open Access” database, and I’m worried that we shall see the same imprecision in the use of “Open Data”. So I wrote to Peter and am amplifying the arguments here. As a baseline Peter and I are both on the advisory board of the Open Knowledge Foundation (initiated by Rufus Pollock) which has developed the Open Knowledge Definition. I think it’s important to take this as a starting point for this analysis, though there are aspects of databases which make the system much more complex.

It’s good that the principle is simple to summarise:

In the simplest form the definition can be summed up in the statement that A piece of knowledge is open if you are free to use, reuse, and redistribute it. For details read the latest version of the full definition (with explanatory annotations).

I’m going to look at the most important clauses for science/chemistry....

There are significantly different types of Open Data in science....

It is critical to distinguish between “Free” and Open. “Free”, in this context, simply means that the provider has mounted the data (not necessarily the whole data) on a web page. There is often no licence, no copyright, no guarantee of availability, no commitment to archival, no explicit freedom of re-use. The materials database is in this category - and to be fair it didn’t call itself Open....

Open Access for scholarly publications implicitly guarantees certain aspects which are not guaranteed by default for Open Data:

  • The whole of the work is available. This is almost always trivial for articles (but as we have seen is a problem for some sorts of data).
  • There will be continued access to the work. This is based on (Gold) the permanence of Open Access publishers and the copying to inter/national repositories and (Green) the permanence of institutional repositories and in some cases inter/national repositories (self-archival on personal webpages does not guarantee permanent access). Repositories in general do not archive data.
  • The work can be re-used. This is clear if a licence is embedded in the work or provided by the repository. Note that many repositories do not make the licence position clear.
  • The work is in a convenient and modifiable form. Trivially readable for sighted humans. The rest is not always true.

Almost all these are major problems for Open Data.

So I very much hope that we can use Open Data in a strict form which adheres to the Open Knowledge Foundation guidelines. This is a good time to cement or challenge them. But it would be a serious problem if we allow “Freely accessible” to become synonymous with “Open Data”.


  • Just a quick note on my offline talk with PMR about Material Properties, which I called "OA" in a blog post.  Neither of us could find its licensing terms, so we couldn't tell just how open it was.  I needed (I still need, we all need) a generic term for such resources when we do know they are free of charge but don't know any details about their licensing terms.  For better or worse "OA" has become that generic term, even while it has a narrower, earlier, more technical and more proper sense through the BBB definition.  I readily and often acknowledge that I use the term "OA" both ways --widely and narrowly, as a generic term and as the technical term for the BBB level of openness.  I also readily and often acknowledge that this ambiguity causes problems --see for example the Poynder interview at pp. 30-31.  I can add that I resisted this dual sense as long as I could and only acquiesced when it became an undeniable fact of actual usage.  For perspective, I've also argued that this kind of semantic spread is not a special calamity for our technical term, but affects most technical terms in wide use and needn't prevent precise communication.
  • One tempting solution is to come up with a new generic term so that "OA" can be limited to its strict BBB sense.  That's desirable but difficult, since coining terms is not the same thing as assuring their use, let alone their intended use.  BTW, "free" would not make a better generic term, at least not yet, since it suggests to many people that a work is merely free of charge and does not also remove permission barriers.  A good generic term would cover all kinds of free online content, including those that are BBB OA.
  • I share PMR's hope that the term "open data" can stay fairly well tethered to its technical definition.  But the data world needs a generic term for the same reason that the publication world does.  If we had a good generic term for free online content, perhaps it could allow "open data" to remain univocal.