Open Access News

News from the open access movement

Wednesday, November 02, 2005

Péter Jacsó reviews CiteSeer in the November issue of his Digital Reference Shelf. Excerpt:

Too often, interesting pilot projects fade away after the initial grant money runs out. Luckily, earlier this summer the National Science Foundation awarded a $1.2 million grant to the Penn State University School of Information Sciences and Technology (IST) and University of Kansas to enhance and improve the original CiteSeer project. The funding is much deserved in light of the direct utility and the inspirational value of CiteSeer....CiteSeer (which was likely the model for Google Scholar) started out in 1997 with this good name, then switched to ResearchIndex, then switched back to the original. It currently offers its services, including sophisticated citation searching options, based on nearly one million documents. The documents were collected and processed from the open-access Web. They are the self-archived papers, their preprint and/or reprint versions. CiteSeer stands out by offering the full text of (almost) all of the documents. The size of the database in and by itself is impressive, and the instant access to the source documents makes it immensely useful. This instant access concept certainly limited the scope of the database, but it is already huge and grew at an impressive rate during the past eight years. Beyond the instant access, there was another filter applied to collecting the computer science-related papers: Only papers in PDF and PostScript formats have been collected. This also reduced the scope of the collection, but certainly increased its quality. These two formats are the most common in computer science, so this is not as restrictive as it may sound.
The inclusion of papers in HTML and Word formats could have increased the size of the collection, but it would have lowered its quality by picking up from the open Web far less-relevant papers posted by undergraduate students in introductory distance education computer science courses offered by one of the online universities....The items on the [search] result list are sorted by decreasing citedness order. Clicking on the title of the paper brings up a much enhanced bibliographic record. Beyond the traditional content of author, title, source name and other publication data, it offers many (a little too many) additional links to the full text of the document from a variety of locations and different file formats. It also offers informative excerpts from a variety of lists about the cited, citing and otherwise related papers and their citedness indicator before making the complete lists available. This is an awesomely information-rich, but very dense, page....Let me emphasize one quintessential advantage of CiteSeer: you receive access to the source documents (with some exceptions) with no fuss and no muss, even if your library doesn't have a link resolver, because CiteSeer has a copy of the source document. This is partially true for Google Scholar, but to a far lesser extent....CiteSeer has ultra high-brow software, way beyond what end-users will see directly. Actually, what the end users see may not be as tender an interface as you see in most Web-wide search engines, and it has no help file (which is a sin). This may make it look user unfriendly. What it lacks in user friendliness it makes up in smartness, especially in selecting high-quality sources, and in normalizing/standardizing the terribly inconsistent, incomplete and inaccurate citations prevalent in every scholarly field....CiteSeer has perfected, within reasonable limits, the process of recognizing and consolidating matching records for incomplete and/or partially erroneous citations.
It can also locate the references in the full text (not merely in the footnotes) for many of the documents, in about 60-65% of the cases in my test....IST is one of the recipients of the grant. In light of past performance, that group is a guarantee that the fund for the project known as Next Generation CiteSeer will be well-used. Of course I regret even more that only a relatively small amount was awarded for this project. It showed a working example of the revolutionary new method of autonomous citation indexing which is done without human indexing, does not require the enormously expensive journal subscription and processing investments, and can be ported to other disciplines.

Posted by Peter Suber at 11/02/2005 08:54:00 AM.