Open Access News

News from the open access movement


Tuesday, April 08, 2008

Text-mining licensed non-OA literature

Heather Piwowar, Non-OA Full-text for text mining, Research Remix, April 7, 2008.  Excerpt:

Interesting discussion on Peter Murray-Rust’s blog about whether PubMed Central articles can be crawled and used for text mining. [PS: Blogged here on Sunday.]  The answer is no, not now, not unless they are open access (as opposed to traditional closed access but deposited in PMC).  Really unfortunate.  Incremental progress, we’ll get there....

I’ve been wondering about similar text mining questions. I think my needs are a bit different than those of PMR: ...I’m willing to limit myself to the articles that I have access to through my University’s subscriptions....I think once I have the papers I’m allowed to text mine them as fair use, since I have them under permission. So the question is what can I automatically download?

I learned I can’t spider PMC, but what about normal PubMed? Try as I might, I couldn’t find verbiage on the PubMed website allowing/disallowing spidering through to full-text links on publisher websites (the links that are populated and visible when I’m logged in through the University’s connection). Is this allowed? Still seems like it might not be. And then you end up at the publisher sites anyway, with all of their differing rules. Unfortunately, the publishers’ rules are often hard to find, confusing, and vague (as often noted by PMR and others). Aaaaah.

So last month I asked our librarians….

...I’d also like to access non-OA text for which Pitt has subscriptions, but it sounds like I can’t do this by “crawling” PMC based on their rules....I’m wondering if I can do it by “crawling” the normal, full PubMed. Basically write a script to find the “HSLS” links on the article citation pages, follow them (usually into the publisher’s websites), and automatically save the HTML or PDF articles that are returned from a PubMed query....

I wouldn’t have thought this sort of automated downloading would be a problem… but the Restrictions on Systematic Downloading of articles in the PMC copyright notice referenced above makes me want to double-check....

Are you aware of any restrictions for crawling PubMed to automatically access and save content for which I do indeed have access through Pitt? ...

The librarian responded that automatically following PubMed links should be fine, and that there shouldn’t be problems from publisher sites because we have subscriptions and my text mining falls under fair use....
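
The link-finding step of the script Piwowar describes in her note to the librarians might look something like the sketch below. It is only an illustration under stated assumptions, not her actual script: the names `LabelledLinkParser` and `find_labelled_links` are hypothetical, and it assumes the “HSLS” label shows up in a link's text, its `title` attribute, or the `alt` text of an icon inside it, which may not match the real citation pages. Fetching the pages and saving the articles is deliberately left out, since whether and how fast that may be done depends on the PubMed and publisher terms discussed above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LabelledLinkParser(HTMLParser):
    """Collect hrefs of <a> elements whose text, title, or icon alt text
    mentions a given label (hypothetical helper, for illustration only)."""

    def __init__(self, label, base_url):
        super().__init__()
        self.label = label.lower()
        self.base_url = base_url
        self.links = []
        self._href = None       # href of the <a> currently open, if any
        self._matched = False   # whether the open <a> has matched the label

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page they came from.
            self._href = urljoin(self.base_url, attrs["href"])
            self._matched = self.label in (attrs.get("title") or "").lower()
        elif tag == "img" and self._href:
            # Assumption: link icons may carry the label in their alt text.
            if self.label in (attrs.get("alt") or "").lower():
                self._matched = True

    def handle_data(self, data):
        if self._href and self.label in data.lower():
            self._matched = True

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            if self._matched and self._href not in self.links:
                self.links.append(self._href)
            self._href = None
            self._matched = False


def find_labelled_links(html, label="HSLS", base_url=""):
    """Return absolute URLs of links on a page that mention `label`."""
    parser = LabelledLinkParser(label, base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up citation page fragment; the URLs are placeholders.
sample_page = """
<a href="/abstract/1">Abstract</a>
<a href="https://publisher.example/article/1" title="HSLS full text">Full text</a>
<a href="/linkout/2"><img src="icon.gif" alt="HSLS"></a>
"""
full_text_links = find_labelled_links(sample_page, base_url="https://pubmed.example/")
```

A real script would pass each citation page through `find_labelled_links` and then request the returned URLs slowly and politely, stopping wherever a site's terms of use forbid systematic downloading.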