Open Access News

News from the open access movement


Tuesday, June 24, 2008

Open data, flawed data, and science

Peter Murray-Rust, Data-driven science and repositories: consideration of errors, petermr’s blog, June 24, 2008.
... As we have blogged earlier (CrystalEye - an example of a data repository), CrystalEye was developed by Nick Day as part of his PhD work. The primary aim is to see if large amounts of data - larger than a human can inspect - can be reliably used for scientific work. Before describing this I shall briefly review “errors” and indicate the implications for data repositories.

... [W]e all know that the scientific literature contains “errors”. ... My discussion will be very superficial and is not intended to be a systematic or authoritative coverage; it’s more an indication to data-driven scientists and data repositarians of issues they should address.

“Errors” can include:
  • variance in the original experiments ...
  • systematic errors (bias) in the measurements ...
  • misunderstanding or misreporting of the physical quantity or measurement ...
  • omission of relevant independent variables. ...
  • omission of units of measurement ...
  • transcription and typographical errors ...
  • our inability to describe effects comprehensively ...
We therefore need to know which of these are important. If typographical errors are very rare (e.g. less than 1% probability in a data set) we can concentrate on effects which occur more frequently (say 20% of the time). If there is a typo in every data set we may have to use statistical methods to detect them or even abandon the effort. If we estimate a quantity by two different methods and the variance between them is low, then this gives confidence in the precision of each (though it says nothing about the accuracy). ...
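
To make the precision-versus-accuracy point concrete, here is a minimal sketch in Python. It is not taken from Murray-Rust's post or from CrystalEye: the "true" value, the shared bias, the noise level, and the helper `simulate_method` are all hypothetical, chosen only to illustrate the idea. Two simulated methods share the same systematic error, so they agree closely with each other while both sit well away from the true value.

```python
import random
import statistics


def simulate_method(true_value, bias, noise_sd, n, rng):
    """Return n simulated measurements with a fixed bias and Gaussian noise."""
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# All numbers below are illustrative assumptions, not real data.
TRUE_VALUE = 1.540   # hypothetical "true" quantity
SHARED_BIAS = 0.020  # hypothetical systematic error common to both methods
NOISE_SD = 0.002     # hypothetical random scatter of a single measurement

method_a = simulate_method(TRUE_VALUE, SHARED_BIAS, NOISE_SD, 200, rng)
method_b = simulate_method(TRUE_VALUE, SHARED_BIAS, NOISE_SD, 200, rng)

mean_a = statistics.mean(method_a)
mean_b = statistics.mean(method_b)

# Agreement between the two methods: a measure of precision.
between_method_gap = abs(mean_a - mean_b)

# Distance of the combined estimate from the truth: a measure of accuracy.
offset_from_truth = abs((mean_a + mean_b) / 2 - TRUE_VALUE)

print(f"gap between methods: {between_method_gap:.4f}")
print(f"offset from truth:   {offset_from_truth:.4f}")
```

Under these assumptions the gap between the methods is far smaller than their common offset from the true value, which is the situation the excerpt warns about: close agreement alone cannot reveal a bias that both methods share.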