Open Access News

News from the open access movement


Saturday, May 05, 2007

More on quality problems with Google book scanning

Robert Townsend, Google Books: What’s Not to Like? American Historical Association Today, April 30, 2007. (Thanks to Library Journal Academic Newswire.) Excerpt:

The Google Books project promises to open up a vast amount of older literature, but a closer look at the material on the site raises real worries about how well it can fulfill that promise and what its real objectives might be.

Over the past three months I spent a fair amount of time on the site as part of a research project...and from a researcher’s point of view I have to say the results were deeply disconcerting. Yes, the site offers up a number of hard-to-find works from the early 20th century with instant access to the text. And yes, for some books it offers a useful keyword search function....But my experience suggests the project is falling far short of its central promise of exposing the literature of the world, and is instead piling mistake upon mistake with little evidence of basic quality control. The problems I encountered fit into three broad categories —the quality of the scans is decidedly mixed, the information about the books (the “metadata” in info-speak) is often erroneous, and the public domain is curiously restricted....

[I]n many instances you will be unable to inspect public domain items more closely, because the erroneous date places the information on the wrong side of the copyright line....

These problems are exacerbated by Google’s rather peculiar views on copyright. While taking an expansive view of copyright for recent works, it has taken a very narrow view about books that actually are in the public domain. As I have always understood it (and the U.S. Copyright Office confirms), “works by the U.S. government are not eligible for U.S. copyright protection.” But Google locks all government documents published after 1923 behind the same wall as any other copyrighted work....

What particularly troubles me is the likelihood that these problems will just be compounded over time. From my own modest experience here at the AHA, I know how hard it is to go back and correct mistakes online when the imperative is always to move forward, to add content and inevitably pile more mistakes on top of the ones already buried one or two layers down. With Google adding in more than 3,000 new books each day, the growth in the number of mistakes seems that much higher....