Open Access News

News from the open access movement


Thursday, May 18, 2006

Chris Sherman building on Kevin Kelly

Chris Sherman, Building the Universal Library, Search Engine Watch, May 18, 2006. Excerpt:
What will it take for Google or another search engine to truly assemble a library of all of the world's information? A thought-provoking essay by Wired magazine's "senior maverick" [Kevin Kelly] takes a fascinating look at the challenges....

[Kelly] says these [book digitization] projects are scanning about a million books a year. Although this sounds like an impressive pace, it amounts to just 5% of all books currently in print. Fortunately, much of the new information created by humans is now in digital format, so it can more easily be included in the Universal Library without the extensive physical effort of scanning books.

And let's not forget the web. Although the search engines have become fairly proficient at creating comprehensive indexes of the surface web, they're still missing massive amounts of content located in databases or other dynamic sources (the Invisible web) -- not to mention web pages that have disappeared. "The grand library naturally needs a copy of the billions of dead Web pages no longer online and the tens of millions of blog posts now gone -- the ephemeral literature of our time."

Including this "ephemeral literature" could prove to be a major challenge. Various studies have put the "half-life" of an average web page at just under two years, with the half-life of a typical web site being just over two years. The most complete publicly accessible archive of the web, the Internet Archive, contains just a fraction of all content that has been posted to the web -- some 55 billion pages in all.

But I think it's a fair bet that Google and Yahoo haven't thrown away the pages they've crawled through the years. And there's a precedent for digital restoration on a massive scale: Google's painstaking effort to build an archive of Usenet. Assembling archives stored on magnetic tape, CD-ROM and other sources, Google restored a comprehensive archive of Usenet, dating back to 1981, and made it available to users in December 2001. Although still not totally complete, the renamed Google Groups now likely contains more than 99 percent of all Usenet postings ever made. It's not unthinkable that Google and Yahoo, the longest-surviving crawler-based engines, could collaborate to restore a comprehensive archive of the web. Surely there are data archives from search engines now long gone that could also be mined to build out an archive....
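A quick way to see what the half-life figures quoted above imply: if page loss is modeled as exponential decay with a two-year half-life (the model and the round two-year figure are assumptions for illustration, not from the studies Sherman cites), the surviving fraction after t years is 0.5^(t/2):

```python
# Expected fraction of web pages still online after t years,
# assuming exponential decay with a two-year half-life.
HALF_LIFE_YEARS = 2.0

def surviving_fraction(t_years: float, half_life: float = HALF_LIFE_YEARS) -> float:
    """Fraction of pages expected to survive after t_years."""
    return 0.5 ** (t_years / half_life)

for t in (1, 2, 5, 10):
    print(f"after {t:2d} years: {surviving_fraction(t):.1%} of pages remain")
# after  1 years: 70.7% of pages remain
# after  2 years: 50.0% of pages remain
# after  5 years: 17.7% of pages remain
# after 10 years:  3.1% of pages remain
```

Under that model, only about 3% of the pages posted a decade ago would still be online today, which is why the "ephemeral literature" can only come from archives.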