Open Access News

News from the open access movement

Thursday, July 19, 2007

Open-source tool for automated metadata extraction

The National Library of New Zealand has upgraded and opened the source code for its Metadata Extraction Tool.  (Thanks to ResourceShelf.)  From yesterday’s announcement:

The National Library of New Zealand Te Puna Mātauranga o Aotearoa is pleased to announce the open-source release of version 3.2 of its Metadata Extraction Tool.

The Metadata Extraction Tool programmatically extracts preservation metadata from a range of file formats including PDF documents, image files, sound files, office documents, and many others. It automatically extracts preservation-related metadata from digital files, then outputs that metadata in XML. It can be used through a graphical user interface or command-line interface.

The software was created in 2003, and redeveloped this year. It is now available as open-source software under the terms of the Apache Public License.


  • Kudos to the NLNZ.  The more we improve the tools for automated metadata extraction, the more we remove ergonomic barriers to self-archiving.  And by opening the source code to this tool, the National Library has greatly bumped the odds that it will continue to improve.
  • There's another nice consequence of opening the source code. Someone could make it into a module or plug-in for one of the open-source archiving packages, like EPrints, DSpace, or Fedora.  When I self-archive, I’d love to have the archiving software take an automated whack at filling out the metadata fields and only bother me to check its work and supply any missing information.  Even if this tool could only do 50% of the job today, rather than 95%, that’s a big step toward metadata consistency and streamlined self-archiving.  And over time it will only get better.

Update.  Thanks to Dorothea Salo for this splash of cold water: 

New Zealand's gizmo doesn't extract descriptive/bibliographic metadata, which is the sort your comment was about.  It extracts what they're calling preservation metadata and I usually call technical metadata -- technical information about the file itself. So if you feed it an image, it will output file format, size, bit depth, resolution, and so forth.  Still quite cool (though I think PRONOM and DROID are a bit more useful), but not quite what you're hoping for.

Thanks, Dorothea.  Got it now.  But when an open-source tool for extracting descriptive/bibliographic metadata comes along, we’ll know what to do with it.
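To make Dorothea's distinction concrete, here is a minimal Python sketch of what technical metadata extraction looks like for a single format. It is not the NLNZ tool's code and does not reproduce its XML output; the function names and XML element names are hypothetical, chosen only for illustration. It reads the fixed header of a PNG image and reports format, size, dimensions, and bit depth, then serializes them as XML. Note that nothing in it knows the title, author, or subject of the file, which is the descriptive/bibliographic metadata a self-archiving form needs.

```python
import struct
import xml.etree.ElementTree as ET

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_png_technical_metadata(data: bytes) -> dict:
    """Return technical metadata read from a PNG's signature and IHDR chunk."""
    # 8-byte signature + 4-byte length + 4-byte type + first 10 bytes of IHDR data
    if len(data) < 26 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing or malformed IHDR chunk")
    # IHDR begins with width, height (4 bytes each), bit depth, color type (1 byte each).
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "format": "PNG",
        "size_bytes": len(data),
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
    }


def to_xml(metadata: dict) -> str:
    """Serialize the metadata as XML (element names are hypothetical)."""
    root = ET.Element("technicalMetadata")
    for key, value in metadata.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Given the bytes of a PNG file, `to_xml(extract_png_technical_metadata(data))` returns a single XML element whose children describe the file itself: its format, byte size, pixel dimensions, and bit depth.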