Wednesday, July 16, 2008

Google slow to index OAI repository records

Kat Hagedorn and Joshua Santelli, Google Still Not Indexing Hidden Web URLs, D-Lib Magazine, July/August 2008.  Excerpt:

This report is a follow-up to the McCown et al. article in IEEE Internet Computing two years ago, in which the researchers investigated the percentage of URLs from OAI records in Google, Yahoo and MSN search indexes. We were interested in whether Google in particular had increased the number of OAI-based resources in its search index.

To this end, we used a slightly different methodology using the OAIster metadata corpus to see what percentage of the corpus was found in the Google search index only. OAIster harvests and aggregates OAI metadata with links to digital resources; those without links to digital objects are removed during our transformation and indexing process....

Google's indexing does not seem to have retrieved more of the hidden web since the publication of the McCown, et al. article in 2006. We would venture to conclude that Google has not endeavoured to increase their support and access to OAI materials. Even taking into account the caveats, we would also conclude that aggregations of OAI records are as valuable for user research purposes as they were at least two years ago.

From our own experience, we know that providing the OAIster records in bulk to Google proved problematic for them, and eventually they requested only the OAIster URLs instead of the complete metadata. We are not, at this point, certain that Google is using these URLs (crawling them) for addition to their search index.

It is also interesting to note that Google has recently dropped support of OAI for website indexing. Given the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources, not less. (Granted, the OAI for Sitemaps feature may not have been an appropriate approach for Google.) ...

We would like to encourage other OAI aggregators to run their metadata against the Google index, to prove or disprove our conclusions. Our source code and raw data are available upon request.
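For aggregators who want to try a check of this kind, here is a minimal Python sketch of the general idea. It is not the authors' code, which is available only on request. It assumes records carry Dublin Core `dc:identifier` elements, keeps only identifiers that look like URLs, and reports what percentage a caller-supplied lookup finds. The names `extract_urls`, `coverage_percent` and `is_indexed` are hypothetical, and the lookup is deliberately a stub: how to query a real search index is left open.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, List

# Dublin Core element namespace used in oai_dc metadata.
DC_NS = "http://purl.org/dc/elements/1.1/"


def extract_urls(oai_xml: str) -> List[str]:
    """Return dc:identifier values that look like http(s) URLs."""
    root = ET.fromstring(oai_xml)
    urls = []
    for element in root.iter(f"{{{DC_NS}}}identifier"):
        text = (element.text or "").strip()
        # Records whose identifier is not a link are skipped.
        if text.startswith(("http://", "https://")):
            urls.append(text)
    return urls


def coverage_percent(urls: Iterable[str],
                     is_indexed: Callable[[str], bool]) -> float:
    """Percentage of urls for which is_indexed(url) is true.

    is_indexed is a hypothetical callback; plug in whatever index
    lookup you are permitted to use.
    """
    url_list = list(urls)
    if not url_list:
        return 0.0
    found = sum(1 for url in url_list if is_indexed(url))
    return 100.0 * found / len(url_list)


# Tiny made-up sample: three URL identifiers and one non-URL identifier.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record><dc:identifier>http://example.org/item/1</dc:identifier></record>
  <record><dc:identifier>http://example.org/item/2</dc:identifier></record>
  <record><dc:identifier>oai:example.org:3</dc:identifier></record>
  <record><dc:identifier>http://example.org/item/4</dc:identifier></record>
</records>"""

# Stand-in for a real index lookup: a fixed set of "indexed" URLs.
FAKE_INDEX = {"http://example.org/item/1"}

sample_urls = extract_urls(SAMPLE)
sample_coverage = coverage_percent(sample_urls, lambda u: u in FAKE_INDEX)
```

Passing the lookup in as a function keeps the counting logic separate from any particular search service, so the same tally can be rerun against different indexes or against a random sample of a larger corpus.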

Update (7/29/08). Also see Wouter Gerritsma's comments on this article.

Update (7/29/08). Also see the note on this article in Wired Campus, the blog of the Chronicle of Higher Education. See especially the comments from readers.