Open Access News

News from the open access movement


Monday, January 19, 2009

How well do repositories accept automated mass deposits?

Stuart Lewis, DSpace at a third of a million items, Stuart Lewis' blog, January 19, 2009.  Excerpt:

As part of the JISC-funded ROAD (Robot-generated Open Access Data) project we are load testing DSpace EPrints and Fedora to see how they cope with holding large numbers of items. For a bit of background, see an earlier blog post: ‘About to load test DEF repositories

The project programmer Antony Corfield has created a SWORD deposit tool for this purpose....

The tool was left running over Christmas depositing 9Mb (approx) SWORD packages, each containing the results of an experiment. It is now almost up to 1/3rd of a million deposits. Whilst we would have liked to keep running the experiment to take the deposits to maybe 1 million, it would take more time than we have....

[PS:  Here omitting the chart of the time it took to deposit each item.]

Some observations from this chart:

  • As expected, the more items that were in the repository, the longer an average deposit took to complete.
  • On average deposits into an empty repository took about one and a half seconds
  • On average deposits into a repository with three hundred thousand items took about seven seconds
  • If this linear looking relationship between number of deposits and speed of deposit were to continue at the same rate, an average deposit into a repository containing one million items would take about 19 to 20 seconds.
  • Extrapolate this to work out throughput per day, and that is about 10MB deposited every 20 seconds, 30MB per minute, or 43GB of data per day.
  • The ROAD project proposal suggested we wanted to deposit about 2Gb of data per day, which is therefore easily possible.
  • If we extrapolate this further, then DSpace could theoretically hold 4 to 5 million items, and still accept 2B of data per day deposited via SWORD....

It is useful to note the overhead that SWORD puts on the deposit process. Those of you who have run normal imports into repositories such as DSpace will know that they zip along quite fast, probably several times (if not more) faster than the deposit by SWORD....

(Over the next few months we’ll run similar tests with EPrints and Fedora, and depending on time can try other tests such as performing the same tests but on a server which is under user-load....)