Open Access News

News from the open access movement


Monday, November 03, 2008

Google creates and searches OCR'd editions of scanned texts

Google is stepping up its use of OCR'd editions of image scans in its search index.  From its October 30 announcement:

...Every day, people all over the world post scanned documents online -- everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text themselves....

In the past, scanned documents were rarely included in [Google] search results as we couldn't be sure of their content. We had occasional clues from references to the document-- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR [Optical Character Recognition] on any scanned documents that we find stored in Adobe's PDF format....

This is a small but important step forward in our mission of making all the world's information accessible and useful.

While we've indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read....

Here's the example from the first paragraph of the announcement:

Comment.  Google has been OCR'ing its scanned books from the start (December 2004), in order to make them searchable.  But it didn't release HTML editions until July 2007, presumably to prevent easy indexing by rival search engines.  When it released the HTML editions, it said its purpose was to help visually impaired users, whose reading software doesn't work on images.  That was a good reason, but I never understood how it overcame Google's famous reluctance to share its work with rivals.  As I wrote at the time:

Access for the visually impaired is important and long overdue.  But the new plain-text layer also provides access for cutting and pasting, text-mining, and other forms of processing.  Making these books accessible as texts, and not merely as images, is a breakthrough for all users.

I have a similar mix of appreciation and puzzlement today.  But in addition to wondering why Google relaxed its grip on a competitive advantage, I'm also wondering whether this has any connection to the new settlement with book publishers.  Today's announcement is not about book texts, but the new HTML editions of scanned PDFs are based on technology Google developed for its book-scanning program.