Open Access News

News from the open access movement


Thursday, July 31, 2008

How to hide OA content from search engines

The experienced folks at SHERPA have compiled a list of Ways to snatch Defeat from the Jaws of Victory.  (Thanks to Peter Millington.)  Excerpt:

You may have set up your repository and filled it with interesting papers, but it is still possible to screw things up technically so that search engines and harvesters cannot index your material. Here are some common gotchas:

  1. Require all visitors to have a username and password
    • Harvesters and crawlers will be locked out, and a lot of end users will give up and go away. It is reasonable to require a username and password for depositing items, but not for just searching and reading.
  2. Do not have a 'Browse' interface with hyperlinks between pages
    • Search engine crawlers will never index past your first page. Button-style controls cannot normally be followed.
  3. Set a 'robots.txt' file and/or use 'robots' meta tags in HTML headers that prevent search engine crawling
    • Google, Yahoo!, etc., may find your pages, but if you tell them not to index the pages or follow the links, they won't.
  4. Restrict access to embargoed and/or other (selected) full texts
    • Search engines and harvesters may index the metadata pages, but not the full texts of the relevant items.
  5. Accept poor quality or restrictive PDF files
    • Some PDF-making software packages (usually free, cheap, or esoteric) generate poor quality PDF files that sometimes cannot be read properly by harvesting and indexing programs. However, you can still cause problems even with high-end software if you use it to restrict the functionality of the PDF file - e.g. preventing copy-and-paste. It may not be possible to index such files.
  6. Hide your OAI Base URL
    • If harvesters cannot find your OAI Base URL, they cannot harvest your data. Good places to give the OAI Base URL are on your repository's 'About' page or home page. Also, register it with OpenDOAR and ROAR.
  7. Have awkward URLs
    • Many harvesters and firewalls will spit out or block:
      • Numeric URLs - e.g. http://130.226.203.32/
      • URLs that use 'https:' instead of 'http:'
      • URLs that include unusual port numbers - e.g. :47231

      Stick to 'http:' and alphabetical URLs. It should be possible to avoid using port numbers in URLs.
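
Two of the gotchas above (the robots.txt rules in item 3 and the awkward URLs in item 7) can be checked mechanically. The following is a minimal sketch, not a SHERPA tool: the function names `awkward_url_reasons` and `crawler_allowed` and the host `repository.example.org` are hypothetical, and treating any explicit port other than 80 as "unusual" is an assumption made for illustration. It uses only Python's standard library and makes no network requests.

```python
import ipaddress
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def awkward_url_reasons(url):
    """Return the item-7 reasons why a harvester or firewall might reject a URL."""
    parts = urlsplit(url)
    reasons = []

    # A host that parses as an IP address is a "numeric URL".
    try:
        ipaddress.ip_address(parts.hostname or "")
        reasons.append("numeric host")
    except ValueError:
        pass  # Not an IP address, so the host is a name.

    if parts.scheme == "https":
        reasons.append("https scheme")

    # Assumption: any explicitly stated port other than 80 counts as unusual.
    if parts.port is not None and parts.port != 80:
        reasons.append("unusual port")

    return reasons


def crawler_allowed(robots_txt, user_agent, url):
    """Report whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for candidate in (
        "http://130.226.203.32/",
        "https://repository.example.org:47231/",
        "http://repository.example.org/eprints/123",
    ):
        print(candidate, awkward_url_reasons(candidate))

    # A robots.txt that shuts every crawler out of the whole site.
    blocking_rules = "User-agent: *\nDisallow: /\n"
    print(crawler_allowed(blocking_rules, "Googlebot",
                          "http://repository.example.org/eprints/123"))
```

An empty list from `awkward_url_reasons` means the URL avoids all three patterns listed in item 7. `crawler_allowed` works on robots.txt text you paste in, so a repository manager could test a draft file before publishing it.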