Tuesday, June 10, 2008

Sharing repository traffic data

Harvesting usage data? JISC Information Environment Team, June 9, 2008.  Excerpt:

I was talking with a researcher the other day who said that, despite his institution mandating deposit of research papers in his institutional repository, he didn't comply - preferring to deposit in an international subject repository. Naturally, I asked him 'why?'. He said that it was because he wanted each of his papers to be in one, and only one, place on the web, so that he could get accurate download statistics for it. Obviously, we're aware in the JISC IE team of the various arguments on this topic, and we've funded a piece of work to look at the practical ways in which subject and institutional repositories might work together, which could address this issue among others. We've also funded various projects on repository statistics, such as 'Interoperable Repository Statistics' (which has developed a tool that repository managers can use to analyse and share statistics) and an ongoing small piece of work on harmonising article-level usage data formats. There is also MESUR and other projects in this space.

However, in the real world, it is likely that copies of some research papers will be at various places on the web, and we wondered whether a tool could be built that used fuzzy matching to identify copies that were probably the same paper, some means of querying the servers on which they sat to get download data, and a reliable way of then aggregating that data into some acceptable statistics. Is that an important use case? Is it feasible to build something that addresses it? ...
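To make the proposed tool a little more concrete, here is a minimal Python sketch of two of its three steps: fuzzy matching of titles and aggregation of download counts. It is an illustration only, not anything JISC has built. The record layout (`repository`, `title`, `downloads`), the function names, the sample figures, and the 0.9 similarity threshold are all assumptions made for the example. The querying step is left out, because the excerpt does not name an interface for asking a repository for its download data; the records are simply assumed to have been collected already.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase a title, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())


def same_paper(a, b, threshold=0.9):
    """Guess whether two titles refer to the same paper (threshold is an assumption)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def aggregate_downloads(records, threshold=0.9):
    """Group records whose titles fuzzily match and sum their download counts."""
    groups = []
    for rec in records:
        for group in groups:
            if same_paper(group["title"], rec["title"], threshold):
                group["downloads"] += rec["downloads"]
                group["sources"].append(rec["repository"])
                break
        else:
            # No existing group matched, so this record starts a new one.
            groups.append({
                "title": rec["title"],
                "downloads": rec["downloads"],
                "sources": [rec["repository"]],
            })
    return groups


# Hypothetical records, as if already fetched from two repositories.
records = [
    {"repository": "institutional", "title": "Open Access and Citation Impact", "downloads": 120},
    {"repository": "subject", "title": "Open access and citation impact.", "downloads": 340},
    {"repository": "subject", "title": "Usage Statistics for Repositories", "downloads": 75},
]

for group in aggregate_downloads(records):
    print(group["title"], group["downloads"], group["sources"])
```

Matching on titles alone is the weakest part of such a sketch. A real tool would probably also compare authors, dates, or identifiers, and would need the repositories to count downloads in comparable ways before the totals meant much.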

Update. Also see Gavin Baker's thoughts on solutions to this problem, in a post on the CC-Community list.