Open Access News

News from the open access movement

Sunday, June 22, 2008

Carol Minton Morris, The Petabyte Problem: Scrubbing, Curating and Publishing Big Data, HatCheck, June 16, 2008.

[Alex] Szalay presented the third and final keynote, "Scientific Publishing in the Era of Petabyte Data," at [the Joint Conference on Digital Libraries (Pittsburgh, June 16-20, 2008)] on June 19, 2008.

He opened with a look at the evolution of science: 1,000 years ago science was empirical; during the last few hundred years science was theoretical, using models and generalizations; a computational branch emerged in the last few decades; and today science is about data exploration.

Scientific data doubles every year, which has fundamentally changed the nature of scientific computing. Today scientific computing cuts across disciplines and has become unwieldy, making it more difficult to extract knowledge. ...

Szalay has been personally involved in the exponential growth of astronomy data from the late 1990s to 2008 through his role with the Sloan Digital Sky Survey (SDSS), which has been "mapping the universe" as part of the Virtual Observatory activities for the last ten years. SDSS is now complete, and is in the process of developing the final data release. The completed SDSS archive will contain over 100 terabytes and will be managed by Johns Hopkins University. Sky Survey user sessions show a constant and increasing use of the SDSS data.

Data versioning was SDSS's biggest challenge, and he emphasized that there is a need to develop automation for more of the steps in curating data ...

Szalay believes that scientific discoveries are made at the edges and boundaries of large data sets, the places where you might not naturally be looking. The [greater the] number of connections that can be made among data sets, the more likely that something new will be discovered along the edges, suggesting that data federation is significant.

Scientific projects that generate data are often short term, 3-5 years. Data is only "uploaded" at the end of a project, so the data will never catch up with the published discoveries. He advocates for projects becoming more active data curators and publishers further upstream in the investigative process. ...

To answer the question, "How can you publish data so that others might recreate your results in 100 years?" he referred to Gray's laws of Data Engineering: scientific computing revolves around data; scale out the solution for analysis; take the analysis to the data; start with 20 queries; and go from working to working.

One successful experiment in scaling out the solution for analysis came about because the Sloan Digital Sky Survey generated more data than scientists have time to study or classify, coupled with the fact that astronomy is attractive to the public. Astronomers asked citizens for help in classifying over a million galaxies by establishing the Galaxy Zoo.

This public science analysis solution has received enormous publicity and has allowed 100,000 citizens from all over the globe to contribute to discovery by helping to classify galaxies online while viewing beautiful images of unknown locations in the universe. For example, a German teacher found and called attention to an object that she had no experience in analyzing. Her observation turned out to be a significant discovery: the object proved to be a Voorwerp.

Szalay believes that the educational impact of this work is enormous. Data sharing and publishing would benefit from the establishment of specialized journals for data. He emphasized that scholarly communications are no longer characterized by a paper trail, but rather by an email trail along with resources collected by the Internet Archive, wiki pages, some science blogs, collaborative workbenches, and even instant messages.

Technology plus sociology plus economics must come together to continue to work on how to preserve our intellectual data resources. ...

Posted by Gavin Baker at 6/22/2008 03:16:00 PM.