Open Access News

News from the open access movement


Thursday, May 18, 2006

The Academic Invisible Web

Dirk Lewandowski and Philipp Mayr, Exploring the Academic Invisible Web, a preprint self-archived May 17, 2006.
Abstract: Purpose: To provide a critical review of Bergman’s 2001 study on the Deep Web. In addition, we bring a new concept into the discussion, the Academic Invisible Web (AIW). We define the Academic Invisible Web as consisting of all databases and collections relevant to academia but not searchable by the general-purpose internet search engines. Indexing this part of the Invisible Web is central to scientific search engines. We provide an overview of approaches followed thus far. Design/methodology/approach: Discussion of measures and calculations, estimation based on informetric laws. Literature review on approaches for uncovering information from the Invisible Web. Findings: Bergman’s size estimation of the Invisible Web is highly questionable. We demonstrate some major errors in the conceptual design of the Bergman paper. A new (raw) size estimation is given. Research limitations/implications: The precision of our estimation is limited due to small sample size and lack of reliable data. Practical implications: We can show that no single library alone will be able to index the Academic Invisible Web. We suggest collaboration to accomplish this task. Originality/value: Provides library managers and those interested in developing academic search engines with data on the size and attributes of the Academic Invisible Web.

From the body of the article:

Library collections and databases with millions of documents remain invisible to the eyes of users of general internet search engines. Furthermore, ongoing digitization projects are contributing to the continuous growth of the Invisible Web. Extant technical standards like Z39.50 or OAI-PMH (Open Archives Initiative – Protocol for Metadata Harvesting) are often not fully utilized, and consequently, valuable openly accessible collections, especially from libraries, remain invisible....

There are different models for enhancing access to the AIW, of which we can mention only a few. The four systems to be described [in this article] have a common focus on scholarly information, but the approaches and the content they provide are largely different. [1] Google Scholar and [2] Scirus are projects started by commercial companies. The core of their content is based on publishers’ repositories plus openly accessible materials. On the other hand, [3] Bielefeld Academic Search Engine (BASE) and [4] Vascoda are academic projects where libraries and information providers open their collections, mainly academic reference databases, library catalogues plus free extra documents (e.g. surface web content). All systems use or will use search engine technology enhanced with their own implementations (e.g. citation indexing, specific filtering or semantic heterogeneity treatment)....

[T]he AIW is very large and...its size is comparable to the indices of the largest general-purpose Web search engines. Therefore, only a co-operative approach is possible. We conclude that existing search tools and approaches show potential to make the AIW visible. What we do not see is a real will for lasting collaboration among the players mentioned.

Comment. There's a lot here for friends of OA to think about. One lesson is that an OA article can still be invisible in the relevant sense (not indexed by all or most search engines) if it has no incoming links, if it's in a file format most search engines ignore, or if it's in a relational database for which access requires filling out an interactive form. Most OA content is visible in this sense, but not all of it is. We can do better, both by making existing OA content more visible and (of course) by making more content OA.

See my tips (co-written with Google) on how to facilitate Google-crawling of OA repositories and my tips on how to make visible OA content even more visible or discoverable.
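
For readers curious about what using OAI-PMH, one of the standards named in the excerpt above, looks like in practice: a harvester asks a repository for its metadata records with an ordinary HTTP request. The sketch below only builds such a request URL. The repository address and the function name are hypothetical, invented for illustration; the `ListRecords` verb, the `metadataPrefix` and `from` arguments, and the `oai_dc` (Dublin Core) format come from the OAI-PMH specification.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper).

    base_url is the repository's OAI-PMH endpoint; the one used below
    is a made-up placeholder, not a real repository.
    """
    # 'verb' and 'metadataPrefix' are required for a first ListRecords request.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # 'from' is optional and limits the harvest to records changed since a date.
    if from_date is not None:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, assumed for illustration only.
url = oai_list_records_url("https://repository.example.org/oai",
                           from_date="2006-01-01")
print(url)
```

A repository that answers requests like this one makes its metadata harvestable by scholarly search services, whether or not a general-purpose crawler can reach the same records through links.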