Open Access News

News from the open access movement


Saturday, April 08, 2006

Text-mining non-OA texts

Alf Eaton, Open Text Mining Interface (OTMI), HubLog, April 7, 2006. Excerpt:

The Open Text Mining Interface (OTMI) is a proposed method for making available the text of journal articles for indexing and analysis, while preserving any subscription model that funds the journals. This approach, presented in a Web 2.0 session at the Bio-IT World conference earlier this week, uses an Atom XML version of each article, with OTMI namespaced extensions, to provide all the sentences of the article in alphabetical order. Some extra information such as word frequency is also presented, but this could presumably be derived from the sentence text anyway.  All the articles in the 2020 Computing issue of Nature have OTMI files linked using <link rel="OTMI" type="application/atom+xml" href=""/> - here’s an example file.
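To make the excerpt's description concrete, here is a minimal Python sketch of the idea: split an article into sentences, sort them alphabetically, and count word frequencies. It does not produce Atom XML or follow the actual OTMI schema. The function names, the naive sentence splitter, and the tokenizing rule are all assumptions made for illustration. It also shows the excerpt's point that word frequencies can be recomputed from the sentence text alone.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, digits, and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def otmi_like_view(text):
    """Return (sentences in alphabetical order, word-frequency Counter).

    Sorting discards the original sentence order, so the result does not
    record which sentences were adjacent in the article.
    """
    sentences = sorted(split_sentences(text), key=str.lower)
    frequencies = Counter(WORD_RE.findall(text.lower()))
    return sentences, frequencies


def word_counts_from_sentences(sentences):
    """Recompute word frequencies from the sorted sentences alone."""
    return Counter(w for s in sentences for w in WORD_RE.findall(s.lower()))


if __name__ == "__main__":
    article = "Zebras run fast. Apples fall. Mice hide."
    sentences, frequencies = otmi_like_view(article)
    for sentence in sentences:
        print(sentence)
    # The separately supplied frequencies match those derived from the sentences.
    print(frequencies == word_counts_from_sentences(sentences))
```

Because the sort key is the sentence text itself, nothing in the output indicates where a sentence appeared in the original article.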

Comment. I have to commend the developers. Insofar as it's useful, however, OTMI will counteract what I've called the software strategy for OA: using very cool and useful tools optimized for OA files as incentives for authors and publishers to make their work OA. OTMI doesn't preserve information about which sentences are adjacent or even proximate, foiling attempts to reconstruct a readable version of the text. While this is an essential virtue of OTMI for toll-access publishers, I suspect that it's a vice for hard-core text-mining. There have to be text-mining applications for which OTMI files will be less useful than full-text originals with sentence-sequence and other contextual information intact. In any case, OTMI will reduce the number of text-mining apps that support the software strategy for OA.