Friday, October 26, 2007

Clifford Lynch at ASIST 2007

Ken Varnum has blogged some notes on Clifford Lynch's talk at ASIST 2007 (Milwaukee, October 19-24, 2007).  Excerpt:

...We are crossing a threshold where people are authoring not just for people but for machines. Not just for indexing purposes, but for understanding, at some level, of research. Data needs to be available in forms that can be synthesized. What does this mean? Lots of tagging and microformats for specific data types. Roles of publishers and authors in supplying this markup are unclear. How to attach structured data to article (and by whom?).

Overwhelming issues

1) Entire journal delivery system is not designed to allow text mining -- in fact, publishers stop this when they notice. Often contractually prohibited or limited. Some open access sites are text-mining friendly -- even zipping entire corpus and making it available. License and delivery mechanisms need updating.

2) Intellectual property issues vastly challenging. Definition (legally) of a derivative work is complex. Does an algorithm generate a derivative work? Legally not, probably. Output of a text summary tool may be a derivative work. Are your PubMed summaries derivative works? We're running up against a set of new challenges with very high stakes in copyright area.

Google is scanning everything, but in-copyright material is only provided as "snippets." Fundamental argument is that Google is not doing economic damage by providing snippets. Google internally has a comprehensive database of literature which it can compute upon. We cannot know what they're doing with the results of computing on this database. This is a unique strategic asset. If they can develop text mining tools -- what can they do with it? It's a training set for a range of interesting purposes. Lexical analysis, AI systems... and more. We don't currently understand how to even talk about these questions....

Copyright remains a huge problem; most of the content that people will interact with was developed in living memory -- and therefore in copyright. How do we deal with that? ...