Open Access News

News from the open access movement

Monday, November 03, 2008

Google creates and searches OCR'd editions of scanned texts

Google is stepping up its use of OCR'd editions of image scans in its search index. From its October 30 announcement:

...Every day, people all over the world post scanned documents online -- everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text themselves....

In the past, scanned documents were rarely included in [Google] search results as we couldn't be sure of their content. We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR [Optical Character Recognition] on any scanned documents that we find stored in Adobe's PDF format....

This is a small but important step forward in our mission of making all the world's information accessible and useful.

While we've indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read....

Here's the example from the first paragraph of the announcement:

PDF of an image scan of Paul Krugman's 1978 paper, The Theory of Interstellar Trade
Google's OCR'd, searchable HTML edition, with the search terms highlighted
How the paper appears in a Google search

Comment. Google has been OCR'ing its scanned books from the start (December 2004), in order to make them searchable. But it didn't release HTML editions until July 2007, presumably to prevent easy indexing by rival search engines. When it released the HTML editions, it said its purpose was to help visually impaired users, whose reading software doesn't work on images. That was a good reason, but I never understood how it overcame Google's famous reluctance to share its work with rivals. As I wrote at the time:

Access for the visually impaired is important and long overdue. But the new plain-text layer also provides access for cutting and pasting, text-mining, and other forms of processing. Making these books accessible as texts, and not merely as images, is a breakthrough for all users.

I have a similar mix of appreciation and puzzlement today. But in addition to wondering why Google relaxed its grip on a competitive advantage, I'm wondering whether this has any connection to the new settlement with book publishers. Today's announcement is not about book texts, but the HTML editions are based on technology Google developed for its book scanning program.

Posted by Peter Suber at 11/03/2008 03:44:00 PM.

