Open Access News

News from the open access movement

Thursday, July 31, 2008

How to hide OA content from search engines

The experienced folks at SHERPA have compiled a list of Ways to snatch Defeat from the Jaws of Victory. (Thanks to Peter Millington.) Excerpt:

You may have set up your repository and filled it with interesting papers, but it is still possible to screw things up technically so that search engines and harvesters cannot index your material. Here are some common gotchas:

Require all visitors to have a username and password

Harvesters and crawlers will be locked out, and a lot of end users will give up and go away. It is reasonable to require a username and password for depositing items, but not for just searching and reading.

Do not have a 'Browse' interface with hyperlinks between pages

Search engine crawlers will never index past your first page. Button-style controls cannot normally be followed.
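To illustrate the point, here is a minimal sketch of how a simple crawler discovers pages: it collects the href targets of anchor tags and has nothing to follow when navigation is done through form buttons. The two HTML snippets and the helper names (LinkCollector, crawlable_links) are hypothetical, and real search engine crawlers are considerably more sophisticated than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags, roughly as a simple crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchors with a non-empty href give the crawler somewhere to go.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawlable_links(html):
    """Return the list of link targets found in an HTML fragment."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical browse pages: one uses hyperlinks, the other a button-style control.
LINKED_PAGE = (
    '<ul><li><a href="/browse/year/2008">2008</a></li>'
    '<li><a href="/browse/year/2007">2007</a></li></ul>'
)
BUTTON_PAGE = (
    '<form method="post" action="/browse">'
    '<button name="year" value="2008">2008</button></form>'
)

print(crawlable_links(LINKED_PAGE))
print(crawlable_links(BUTTON_PAGE))
```

The hyperlinked page yields two targets to follow, while the button-only page yields none, which is why a crawler stops at the first page.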

Set a 'robots.txt' file and/or use 'robots' meta tags in HTML headers that prevent search engine crawling

Google, Yahoo!, etc., may find your pages, but if you tell them not to index the pages or not to follow their links, they won't.
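One way to check what a 'robots.txt' file actually permits is Python's standard urllib.robotparser module. The sketch below is illustrative only: the two robots.txt bodies, the repository URL, and the "ExampleBot" user agent are made-up values, and it does not cover 'robots' meta tags inside HTML pages.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, user_agent="ExampleBot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt bodies for a repository.
BLOCKING = "User-agent: *\nDisallow: /\n"   # shuts out every well-behaved crawler
OPEN = "User-agent: *\nDisallow:\n"         # an empty Disallow permits everything

ITEM_URL = "http://repository.example.org/1234/"  # hypothetical item page

print(can_crawl(BLOCKING, ITEM_URL))
print(can_crawl(OPEN, ITEM_URL))
```

Running the same check against a repository's real robots.txt is a quick way to confirm that item pages and full texts are not excluded by accident.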

Restrict access to embargoed and/or other (selected) full texts

Search engines and harvesters may index the metadata pages, but not the full texts of the relevant items.

Accept poor quality or restrictive PDF files

Some PDF-making software packages (usually free, cheap, or esoteric) generate poor quality PDF files that sometimes cannot be read properly by harvesting and indexing programs. However, you can still cause problems even with high-end software if you use it to restrict the functionality of the PDF file - e.g. preventing copy-and-paste. It may not be possible to index such files.

Hide your OAI Base URL

If harvesters cannot find your OAI Base URL, they cannot harvest your data. Good places to give the OAI Base URL are on your repository's 'About' page or home page. Also, register it with OpenDOAR and ROAR.
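For context, a harvester needs the Base URL because every OAI-PMH request is that URL plus a query string naming a verb. The following sketch builds two such request URLs; the Base URL shown is a hypothetical placeholder, and oai_request_url is an illustrative helper, not part of any harvesting library. Identify and ListRecords are standard OAI-PMH verbs, and oai_dc is the Dublin Core metadata prefix the protocol requires repositories to support.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a Base URL, a verb, and its arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical Base URL; a real one is whatever the repository advertises.
BASE_URL = "http://repository.example.org/oai/request"

# Ask the repository to describe itself.
print(oai_request_url(BASE_URL, "Identify"))

# Ask for all records in unqualified Dublin Core.
print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

Without the Base URL, neither request can be formed, so the repository's metadata stays invisible to harvesters.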

Have awkward URLs

Many harvesters and firewalls will spit out or block:

Numeric URLs - e.g. http://130.226.203.32/

URLs that use 'https:' instead of 'http:'

URLs that include unusual port numbers - e.g. :47231

Stick to 'http:' and alphabetical URLs. It should be possible to avoid using port numbers in URLs.
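The three criteria above can be turned into a quick self-check. This is a rough sketch that only mirrors the list in the excerpt: url_warnings and the warning labels are hypothetical names, and "unusual port" is assumed here to mean any explicit port other than the scheme's default.

```python
import ipaddress
from urllib.parse import urlsplit

# Default ports per scheme; anything else is treated as "unusual" (an assumption).
DEFAULT_PORTS = {"http": 80, "https": 443}


def url_warnings(url):
    """Return the excerpt's 'awkward URL' problems that apply to url."""
    parts = urlsplit(url)
    warnings = []

    # Numeric URLs: the host is an IP address rather than a name.
    try:
        ipaddress.ip_address(parts.hostname or "")
        warnings.append("numeric host")
    except ValueError:
        pass

    # URLs that use 'https:' instead of 'http:'.
    if parts.scheme == "https":
        warnings.append("https scheme")

    # URLs that include an explicit, non-default port number.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        warnings.append("unusual port")

    return warnings


print(url_warnings("http://130.226.203.32/"))
print(url_warnings("https://repository.example.org:47231/"))
print(url_warnings("http://repository.example.org/"))
```

A URL that produces an empty list satisfies the advice to stick to 'http:', alphabetical host names, and no port numbers.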

Posted by Peter Suber at 7/31/2008 12:05:00 PM.
