Open Access News

News from the open access movement


Thursday, June 07, 2007

Federated searching of US government OA databases

Drew Robb, Exploring the deep web, GCN, June 4, 2007. Excerpt:

For the past decade, the Energy Department’s Office of Scientific and Technical Information in Oak Ridge, Tenn., has been using the Internet to speed research processes.

“When we first started posting information on the Web in 1997, we relied on search engines provided by the database vendors,” said OSTI Director Walt Warnick. “It soon occurred to us that it would be helpful to provide our patrons with the ability to search across multiple databases at one time.”

That led the agency to install federated search software....In April 1999, OSTI launched the EnergyFiles site, providing access to over 500 DOE databases and sites. That was followed in 2002 by Science.gov, which allows a single query to pull data from 30 scientific research databases at 12 federal agencies. February 2007 saw the release of Science.gov 4.0 with greatly enhanced relevance ranking. OSTI is now working to expand the system to include government research sites worldwide....

Google may dominate the search market, but it has two major shortcomings. The first is that it barely accesses what is known as the deep Web....“In 2000/2001 we did some analysis and realized that the quantity of documents from these deep-Web databases was far bigger than what everyone was calling the Internet,” said Jerry Tardif, vice president at search firm BrightPlanet. Tardif estimated that the deep Web is several hundred times the size of the surface Web....Others give a lower figure....But whatever that size, if you are only using Google or Yahoo, you are missing most of what is out there.

“Google makes search look simple, but in fact, search is not simple, particularly when completeness is important,” said David Fuess, a computer scientist at Lawrence Livermore National Laboratory’s Nonproliferation, Homeland and International Security (NHI) directorate.

The other problem is information overload. Public search engines may be fine for locating a hotel in Singapore, but not for professional research.

Federated search engines address both of these problems....
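
[Illustration, not part of the GCN article: the basic idea of federated search can be sketched roughly as follows. One query is sent to several databases at once, and the hits are merged into a single ranked list. Everything here is an assumption made for the example: the names `make_source` and `federated_search` are hypothetical, the in-memory lists stand in for real deep-Web databases, and the term-frequency score is a toy stand-in for whatever relevance ranking Science.gov actually uses.]

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# A source is any callable that takes a query and returns (title, score) pairs.
Source = Callable[[str], List[Tuple[str, float]]]


def make_source(docs: List[str]) -> Source:
    """Build a toy in-memory source (hypothetical stand-in for a real database)."""
    def search(query: str) -> List[Tuple[str, float]]:
        terms = query.lower().split()
        hits = []
        for title in docs:
            words = title.lower().split()
            if not words:
                continue
            # Toy relevance: fraction of title words that match a query term.
            score = sum(words.count(t) for t in terms) / len(words)
            if score > 0:
                hits.append((title, score))
        return hits
    return search


def federated_search(query: str, sources: Dict[str, Source],
                     limit: int = 5) -> List[Tuple[float, str, str]]:
    """Send one query to every source in parallel and merge the results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(src, query) for name, src in sources.items()}
    merged = []
    for name, future in futures.items():
        for title, score in future.result():
            merged.append((score, name, title))
    # Highest score first; ties broken by source name, then title.
    merged.sort(key=lambda row: (-row[0], row[1], row[2]))
    return merged[:limit]


if __name__ == "__main__":
    sources = {
        "energy": make_source(["Solar cell efficiency", "Wind turbine noise"]),
        "health": make_source(["Solar radiation and skin health", "Vaccine trials"]),
    }
    for score, name, title in federated_search("solar", sources):
        print(f"{score:.2f}  [{name}]  {title}")
```

[The searcher sees one ranked list rather than one result page per database, which is how a federated engine can reach deep-Web content while also trimming the overload. A production system would also need per-source timeouts, deduplication, and score normalization across databases, none of which this sketch attempts.]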

“Science.gov is mostly [research and development] findings,” Warnick said....[I]t gives searchers in-depth access to research papers from CENDI (originally the Commerce, Energy, NASA, Defense Information Managers Group), an interagency working group of senior scientific and technical information (STI) managers from a dozen agencies, including DTIC, the National Agricultural Library, the National Library of Medicine and the National Science Foundation. Together, CENDI members control more than 95 percent of the federal R&D budget, so accessing their databases provides a near-comprehensive overview of federally funded research. OSTI also hosts several other federated search sites including E-Print Network and Science Accelerator.

DTIC has its own federated search engine, STINET (Scientific and Technical Information Network) Federated Search, which specializes in providing research information to the Defense Department community.

“Our customers wanted to come to a single site and search for scientific information from both the DTIC and our sister organizations in other federal agencies,” said Ricardo Thoroughgood, chief of the STINET Management Division. “Initially, it was an internal DOD resource, but we shut down that site and made it available to the public with all unclassified and unlimited information, so that data is readily available to the public through the STINET databases.” ...