Open Access News

News from the open access movement


Sunday, November 04, 2007

More on the British Library / Microsoft book-scanning project

Jim Ashling, Progress Report: The British Library and Microsoft Digitization Partnership, Information Today, November 4, 2007.  Excerpt:

Microsoft made it clear that it wasn’t going to let Google tackle mass book digitization exclusively when it announced a partnership with The British Library (BL) in November 2005.

The BL/Microsoft project is designed to digitize 25 million pages of 100,000 out-of-copyright titles from the BL collection related to 19th-century literature. Access will be provided via Microsoft’s Live Search Books site and the BL’s Web site. Live Search Books now includes many partners: The University of California Libraries, Cornell University Library, the University of Toronto Library, The New York Public Library, and the American Museum of Veterinary Medicine have all joined, as well as more than 50 publishers....

Kristian Jensen, head of British Early Printed Collections, reviewed the selection process. In previous BL digitization projects, material had been selected on an item-by-item basis, but the sheer size of this project made such selectivity impossible. Instead, the focus is on English-language material collected by the BL during the 19th century. Jensen compared the process to mass microfilming. “Nonselectivity widens access,” he said....

The works of virtually unknown writers will be brought to the attention of scholars as easily as material by Charles Dickens....

The target is to scan 50,000 pages per day with a 2-year timetable for completion....
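A quick back-of-the-envelope check (my own aside, not from the article): at the stated rate, the full 25 million pages work out to roughly 500 scanning days, which is consistent with the two-year timetable.

    # Rough check of the stated target (figures from the article above).
    total_pages = 25_000_000     # pages to be digitized
    pages_per_day = 50_000       # stated daily scanning target
    print(total_pages / pages_per_day)   # 500.0 scanning days, roughly two working years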

Scanning produces high-resolution images (300 dpi) that are then transferred to a suite of 12 computers for OCR (optical character recognition) conversion. The scanners run 24/7, and the OCR software is specially tuned to deal with the spelling variations and old-fashioned typefaces used in the 1800s. The process creates multiple versions, including PDFs and OCR text for display in the online services, as well as an open XML file for long-term storage and potential conversion to any new formats that may become future standards. In all, the data will amount to 30 to 40 terabytes....
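For readers curious what a scan-to-OCR-to-XML workflow of this kind might look like in miniature, here is a hedged sketch. It is not the project's actual software: the libraries used (Pillow, pytesseract), the function, and the file names are all assumptions for illustration.

    # Hypothetical sketch of a scan -> OCR -> archival-XML pipeline,
    # loosely following the steps described in the article.
    from PIL import Image                  # Pillow, assumed available
    import pytesseract                     # Tesseract OCR wrapper, assumed available
    import xml.etree.ElementTree as ET

    def process_page(scan_path: str, xml_out: str) -> str:
        """OCR one 300 dpi page scan and store the text in a simple XML record."""
        image = Image.open(scan_path)
        text = pytesseract.image_to_string(image)   # OCR conversion step

        # Wrap the recognized text in a minimal XML record for long-term storage.
        page = ET.Element("page", source=scan_path)
        ET.SubElement(page, "text").text = text
        ET.ElementTree(page).write(xml_out, encoding="utf-8", xml_declaration=True)
        return text

    # Example (hypothetical file names):
    # process_page("scans/page_0001.tif", "archive/page_0001.xml")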

[W]orse still, British/EU legislation keeps material in copyright for life, plus 70 years.

Obviously, then, an issue exists here for a collection of 19th-century literature, since some authors may have lived beyond the late 1930s. An estimated 40 percent of the titles are also orphan works. Those two issues mean that item-by-item copyright checking would be an unmanageable task: estimates of the total time required range from a couple of decades to a couple of hundred years. The BL’s approach is to use two databases of authors to identify those who were still living in 1936 and to remove their work from the collection before scanning. That, coupled with wide publicity to encourage any rightsholders to step forward, may solve the problem....
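As an illustration of the screening rule described above (exclude authors still living in 1936, given the life-plus-70-years term), here is a minimal, hypothetical sketch. The author records, field names, and exact boundary handling are my assumptions, not the BL's actual process or data.

    # Hypothetical illustration of the screening rule: include only works whose
    # authors are known to have died before 1936, so they were certainly not
    # "still living in 1936" (the exact boundary handling is an assumption).
    CUTOFF_YEAR = 1936

    authors = [
        {"name": "Author A", "death_year": 1870},   # out of copyright: include
        {"name": "Author B", "death_year": 1952},   # lived past 1936: exclude
        {"name": "Author C", "death_year": None},   # unknown (orphan work): needs publicity or manual review
    ]

    def eligible_for_scanning(author: dict) -> bool:
        """Include only authors known to have died before the cutoff year."""
        death_year = author["death_year"]
        return death_year is not None and death_year < CUTOFF_YEAR

    to_scan = [a["name"] for a in authors if eligible_for_scanning(a)]
    print(to_scan)   # ['Author A']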