Open Access News

News from the open access movement


Wednesday, January 16, 2008

"Building a web of data"

Aaron Swartz has launched TheInfo.org, a wiki-based collection of tools and tips for those who scrape, process, view, and host large data sets.  (Thanks to Jonathan Gray.)  From the site:

This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects....

Some of us have spent years scraping news sites. Others have spent them downloading government data. Others have spent them grabbing catalog records for books. And each time, in each community, we reinvent the same things over and over again: scripts for doing crawls and notifying us when things are wrong, parsers for converting the data to RDF and XML, visualizers for plotting it on graphs and charts.

It's time to start sharing our knowledge and our tools. But more than that, it's time for us to start building a bigger picture together. To write robust crawl harnesses that deal gracefully with errors and notify us when a regexp breaks. To start converting things into common formats and making links between data sets. To build visualizers that will plot numbers on graphs or points on maps, no matter what the source of the input.

We've all been helping to build a Web of data for years now. It's time we acknowledge that and start doing it together.