Open Access News

News from the open access movement

Friday, October 27, 2006

Comparing Google Custom searches with Vanilla Google searches for OA content

Andy Powell, Pushing an OpenDOAR, eFoundations, October 27, 2006.  These comments on the new OpenDOAR search engine probably apply just as well to the new ROAR search engine.  Excerpt:

The OpenDOAR directory of open access repositories has announced a new search service based on Google's Custom Search Engine facility.  Good stuff - though for me it raises several questions of policy and implementation....

I thought I'd do a little experiment, to try and compare results from the new OpenDOAR search service with results from a bog standard Google search....

What these results say to me is that, for known item searching at least, there is little evidence that Google is losing our research nuggets within large results sets.  What Google is doing is to push the nuggets to the top of the list.  In fact, in some cases at least, I suspect one could argue that the vanilla Google search is surrounding those nuggets with valuable non-repository resources that are missed in the OpenDOAR repository-only search engine.

For me, this exercise raises three interesting questions:

  1. Are repositories successfully exposing the full-text of articles (the PDF file or whatever) to Google rather than (or as well as) the abstract page?  If not, then they should be.  I think there is some evidence from these results that some repositories are only exposing the abstract page, not the full-text.  For a full-text search engine, this is less than optimal.  My suspicion is that the way that Google uses the OAI-PMH to steer its Web crawling is actually working against us here and that we either need to work with Google to improve the way this works, or bite the bullet and ask repository software developers to support Google sitemaps in order to improve the way that Google indexes our repositories.
  2. Are we consistent in the way we create hypertext links between research papers in repositories?  If not, then we should be.  In the context of Google searches, linking is important because each link to a paper increases its Google-juice, which helps to push that paper towards the top of Google's search results.  Researchers currently have the option of linking either direct to the full-text (or one of several full-texts) or to the abstract page.  This choice ultimately results in a lowering of the Google-juice assigned to both the paper and the abstract page - potentially pushing both further down the list of Google search results.  The situation is made worse by the use of OpenURLs, which do nothing for the Google-juice of the resource that they identify, in effect working against the way the Web works.  If we could agree on a consistent way of linking to materials in repositories, we would stand to improve the visibility of our high-quality research outputs in search engines like Google.
  3. What is the role of metadata in a full-text indexing world?  What the mini-experiment above and all my other experience say to me is that full-text indexing clearly works.  In terms of basic resource discovery, we're much better off exposing the full-text of research papers to search engines for indexing than we are exposing metadata about those papers.  Is metadata therefore useless?  No.  We need metadata to support the delivery of other bibliographic services.  In particular we need metadata to capture those attributes that are useful for searching, ranking and linking but that can't reliably be derived from the resource itself.  I'm thinking here primarily of the status of the paper and of the relationships between the paper and other things - the relationships between papers and people and organisations, the relationships between different versions, between different translations, between different formats and between different copies of a paper.  These are the kinds of relationships that we have been trying to capture in our work on the DC Eprints Application Profile.  It is these relationships that are important in the metadata, much more so than the traditional description and keywords kind of metadata.
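
To make the sitemap suggestion in the first question concrete, here is a minimal sketch of how a repository might list both the abstract page and the full-text file for each paper in a sitemap.  It is an illustration only, not something from Powell's post: the record fields (abstract_url, fulltext_url, lastmod) and the function name build_sitemap are hypothetical, and the XML namespace assumed is the one from the public sitemaps.org protocol rather than anything a particular repository package is known to emit.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org protocol (an assumption of this sketch).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(records):
    """Return sitemap XML listing abstract and full-text URLs per record.

    `records` is a list of dicts with hypothetical keys:
    'abstract_url', 'fulltext_url' (optional) and 'lastmod' (optional).
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for record in records:
        # Expose the full text as well as (not instead of) the abstract page.
        for key in ("abstract_url", "fulltext_url"):
            loc = record.get(key)
            if not loc:
                continue
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = loc
            if record.get("lastmod"):
                ET.SubElement(url, "lastmod").text = record["lastmod"]
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    example = [
        {
            "abstract_url": "https://repo.example.org/123/",
            "fulltext_url": "https://repo.example.org/123/paper.pdf",
            "lastmod": "2006-10-27",
        }
    ]
    print(build_sitemap(example))
```

Listing the PDF alongside the abstract page is the design choice that matters here: a crawler that reads the sitemap is told about the full text directly, rather than having to find it by following links from the abstract page.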

Overall, what I conclude from this (once again) is that it is not the act of depositing a paper in a repository that is important for open access, but the act of surfacing the paper on the Web - the repository is just a means to an end in that respect.  More fundamentally, I conclude that the way we configure, run and use repositories has to fit in with the way the Web works - not work against it or around it!  First and foremost, our 'resource discovery' efforts should centre on exposing the full text of research papers in repositories to search engines like Google and on developing Web-friendly and consistent approaches to creating hypertext links between research papers.
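
One way to picture the "consistent approach to linking" is a small helper that rewrites any known variant of a paper's URL to a single agreed target before the link is published.  This is a sketch under stated assumptions, not a method from the post: the function names normalise and canonical_link and the variants mapping are hypothetical, and the choice of the abstract page as the agreed target is made purely for illustration, since the post does not say which target a community should settle on.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url):
    """Lower-case the scheme and host and drop any fragment."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def canonical_link(url, variants):
    """Map a URL to the agreed link target for the paper it identifies.

    `variants` is a hypothetical dict from normalised variant URLs
    (for example a direct PDF link) to the single agreed target.
    URLs with no entry are returned in normalised form.
    """
    key = normalise(url)
    return variants.get(key, key)


if __name__ == "__main__":
    variants = {
        "https://repo.example.org/123/paper.pdf": "https://repo.example.org/123/",
    }
    print(canonical_link("https://Repo.Example.org/123/paper.pdf#page=2", variants))
```

If every citing page linked through something like this, the links that are currently split between the abstract page and the full text would all point at one URL, which is the effect on ranking that the second question is after.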