Open Access News

News from the open access movement

Wednesday, August 06, 2008

Big science creates natural pressures for open data

Richard Poynder, In search of the Big Bang, Computer Weekly, August 6, 2008. Excerpt:

This month (August) the world's biggest particle accelerator, the Large Hadron Collider (LHC), will begin hurling subatomic particles called protons around a 27km circular tunnel running beneath the Swiss-French border, before crashing them into each other. By doing so, particle physicists hope to learn more about the physical universe. At the same time, they are reinventing the way they share their research with each other....

The next challenge will be managing the huge volumes of data generated....[A]bout 15 petabytes of data will be generated annually. If stored on CDs, this would create a 20km-high tower of discs.
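The CD-tower figure can be sanity-checked with rough arithmetic. The sketch below is not from the article: it assumes a 700 MB capacity and a 1.2 mm thickness per disc (typical CD values), decimal petabytes, and discs stacked without cases. The function name is hypothetical.

```python
# Rough check of the "20km-high tower of discs" claim.
# Assumptions (not from the article): 700 MB per CD, 1.2 mm per disc,
# decimal units (1 PB = 10**15 bytes), no cases or spacing between discs.

PETABYTE = 10**15
CD_CAPACITY_BYTES = 700 * 10**6
CD_THICKNESS_MM = 1.2


def cd_tower(data_bytes, capacity=CD_CAPACITY_BYTES, thickness_mm=CD_THICKNESS_MM):
    """Return (number of discs, stack height in km) needed to hold data_bytes."""
    discs = -(-data_bytes // capacity)          # ceiling division on integers
    height_km = discs * thickness_mm / 1_000_000  # mm -> km
    return discs, height_km


discs, height_km = cd_tower(15 * PETABYTE)
print(f"{discs:,} discs, about {height_km:.1f} km high")
```

Under these assumptions the stack comes to roughly 21 million discs and a height in the mid-20s of kilometres, the same order of magnitude as the article's 20km figure. A slightly thinner disc or larger capacity would bring it closer.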

Once collected, the data will be processed and used to perform complex theoretical simulations, a task requiring massive computing capacity. The problem, says [Rolf-Dieter Heuer, who becomes director of CERN in January], is that "no science centre, no research institution, and no particle physics lab in the world has enough computer power to do all the work".

CERN will distribute the data to a network of computing centres around the world using a dedicated computing grid. This will allow the workload to be shared....

But the biggest challenge will be how to store the data in a format that allows reuse....[B]ecause it is costing £4.75bn to collect the LHC data, it would be profligate not to allow reuse.

"Ten or 20 years ago we might have been able to repeat an experiment," says Heuer. "They were simpler, cheaper and on a smaller scale. Today that is not the case. So if we need to re-evaluate the data we collect to test a new theory, or adjust it to a new development, we are going to have to be able to reuse it. That means we are going to need to save it as open data...."

Openness is not an issue for data alone, however. The research papers produced from the LHC experiments will also have to be open - which presents a different kind of challenge.

Today, when scientists publish their papers, they assign copyright to the publisher. Publishers arrange for the papers to be peer-reviewed, and then sell the final version back to the research community in the form of journal subscriptions.

But because of an explosion in research during recent decades, along with rampant journal price inflation, few research institutions can now afford all the journals they need. "Journal prices are rising very strongly," says Heuer. "So the reality today is that lots of researchers can no longer afford access to the papers they need." ...

As the LHC countdown began, the HEP community launched a number of OA initiatives. In 2006, for instance, CERN spearheaded a new project called SCOAP3, which aims to pay publishers to organise peer review on an outsourced basis, thus allowing published research to be made freely available....

A second initiative will see the creation of a free online HEP database called INSPIRE. This will be pre-filled with nearly 2 million bibliographic records and full-text "preprints" harvested from existing HEP databases such as arXiv, SPIRES and the CERN Document Server (CDS).

If SCOAP3 proves successful, the final full-text version of every HEP paper published will be deposited in INSPIRE, making it a central resource containing the entire corpus of particle physics research....

This suggests scholarly publishing is set to migrate from a journal-based to a database model, and one likely consequence will be the development of "overlay journals". Instead of submitting their papers to publishers, researchers will deposit their preprints into online repositories such as INSPIRE. Publishers will then select papers, subject them to peer review (for which they will levy a service charge), and "publish" them as Web-based journals - although, in reality, the journals will be little more than a series of links to repository-based papers.

"INSPIRE would be an ideal test-bed to experiment with overlay journals, because it will contain the entire corpus of the discipline," says Holtkamp....

What is key to current developments is the belief that scientific information must be openly available. Because science is a cumulative process, the greater the number of people who can access research, critique it, check it against the underlying data and then build on it, the sooner new solutions and theories will emerge. And as "Big Science" projects like the LHC become the norm, the need for openness will be even greater because the larger the project, the more complex the task, and the greater the need for collaboration - a concept neatly expressed in the context of Open Source software by Linus' Law: "Given enough eyeballs, all bugs are shallow."

Holtkamp adds, "I am pretty confident that Open Access will be the standard of the future for scientific papers, although it remains unclear when Open Data will become the norm."

Certainly, if the public is asked to fund further multi-billion-pound projects like the LHC, there will be growing pressure on scientists to maximise the value of the data they generate - and that will require greater openness.

Posted by Peter Suber at 8/06/2008 02:15:00 PM.