Open Access News

News from the open access movement

Saturday, October 13, 2007

Harnessing the crowd to make OA texts from public-domain print sources

Tim Armstrong, Crowdsourcing and Open Access, Info/Law, October 12, 2007. Excerpt:

I gave a short talk earlier today to my colleagues about the open access movement in legal scholarship, about which the three of us here at Info/Law have blogged from time to time (check out our open access tag for more). I used the occasion to go public with my own minor contribution to improving access to primary legal source materials....

The House Report on the Copyright Act of 1976 is a key reference in the intellectual property domain, routinely cited by courts in copyright cases. It has been indispensable in resolving disputes as to legislative intent in the face of uncertain statutory text. But so far as I’ve been able to determine, it’s not freely available online... That’s unfortunate. As has often been noted, the copyright statute is intractably, even maddeningly, vague in places, and the legislative reports have been crucial tools in figuring out just what Congress was trying to do across a host of issues.

Taking advantage of our spiffy new copier, I scanned the entire House Report, working a few pages at a time over the course of a couple of weeks. That left me with a big folder full of TIFF files on my PC, which I scrubbed with the wonderful tool unpaper before converting to PDF. You can now download the completed PDF here, although be warned that it’s a very large file (155 MB): House Report No. 94-1476 (PDF).

Getting the scanned page images online, though, is only part of the battle. What I ultimately would like to see online is the text of the report, freely searchable, copyable, and indexable, rather than just the images. Because I don’t have the time or energy to convert the images to text myself, I’ve thrown the project open as an experiment in crowdsourcing. All my page scans are now available on Wikimedia Commons, and volunteers are slowly converting the raw OCR output to intelligible text on Wikisource. It’s a lengthy document, but given enough eyeballs, as they say. The Wikisource index to the scanned pages already appears on the first page of the Google search results for “House Report 94-1476.” Eventually, this process should produce a fairly well cleaned-up version of the source text.

Assuming this ultimately works (a big “if,” to be sure), what are some other public domain legal source texts that should get the crowdsourcing treatment? ...