The Second British Library DataCite Workshop

Alex Ball

Alex Ball reports on a one-day workshop on metadata supporting the citation of research data, held at the British Library, London, on 6 July 2012.

On Friday, 6 July 2012, I made my way to the British Library Conference Centre for the second in a series of DataCite workshops [1]. The theme was 'Describe, Disseminate, Discover: Metadata for Effective Data Citation'. In welcoming us to the event, Lee-Ann Coleman, Head of Scientific, Technical and Medical Information at the British Library, said there had been some doubt as to whether anyone would turn up to an event about metadata, but as it happened there were 36 of us, drawn from across the UK and beyond.

Overview

I had the honour of starting off proceedings with an overview of data citation and discovery. I began by using the OAIS Information Model [2] as a way of explaining the different metadata requirements for the three tasks of citation, discovery and reuse. I then fleshed out what those requirements were by first comparing four different data citation styles, and then comparing a range of metadata schemes used in data discovery portals and data archives. I also touched on some of the issues that arise in data citation, such as how one might enumerate all the contributors to a dataset and how to apply identifiers to dynamic datasets.

DataCite Metadata

I was followed by Elizabeth Newbold of the British Library, who explained how the DataCite Metadata Schema had been developed, from its conception in August 2009 to the release of version 2.2 in July 2011 [3]. Elizabeth took us through the five mandatory and 12 optional elements of the schema, explaining why each had been included and why certain others had been left out. There were some useful lessons for anyone thinking of developing their own schema, particularly on the costs and benefits of maintaining an authority list of element values. The talk concluded with some hints on what the future might hold for the schema: elements for recording spatio-temporal coverage, if the encoding scheme can be agreed, and even a corresponding Dublin Core DataCite Application Profile (DC2AP).
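To make the role of the mandatory elements concrete, here is a minimal sketch in Python. It assumes the five mandatory properties of version 2.2 are Identifier, Creator, Title, Publisher and PublicationYear, and that a citation can be assembled from them in the form 'Creator (PublicationYear): Title. Publisher. Identifier'. The dictionary layout, the function names and the sample record (including its DOI) are hypothetical, invented purely for illustration; this is not DataCite's own tooling.

```python
# Assumed names for the five mandatory DataCite properties (schema v2.2).
# The flat dictionary layout is an illustrative assumption, not the XML schema.
MANDATORY = ("identifier", "creators", "title", "publisher", "publication_year")


def missing_mandatory(record):
    """Return the mandatory properties that are absent or empty in `record`."""
    return [name for name in MANDATORY if not record.get(name)]


def format_citation(record):
    """Build 'Creator (PublicationYear): Title. Publisher. Identifier'."""
    missing = missing_mandatory(record)
    if missing:
        raise ValueError("missing mandatory properties: " + ", ".join(missing))
    creators = "; ".join(record["creators"])
    return "{} ({}): {}. {}. doi:{}".format(
        creators,
        record["publication_year"],
        record["title"],
        record["publisher"],
        record["identifier"],
    )


# Hypothetical record; none of these values refer to a real dataset.
example = {
    "identifier": "10.1234/example",
    "creators": ["Doe, Jane", "Roe, Richard"],
    "title": "Example Survey Dataset",
    "publisher": "Example Data Archive",
    "publication_year": 2012,
}

if __name__ == "__main__":
    print(format_citation(example))
```

The point of the sketch is that the mandatory set is just enough to produce a usable citation; the optional elements add value mainly for discovery and reuse.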

Case Studies: Repositories and Archives

Next up was David Boyd of the University of Bristol, who introduced us to data.bris, which is both a project and a research data storage facility at the University [4]. Researchers have to apply to use the facility, but once approved they can add data to it as if it were another network drive. The really nifty feature is that the facility automatically assembles metadata for the deposited data from the files themselves (using Apache Tika [5]) and various University systems, including the new Current Research Information System (CRIS). The system can then push this metadata to DataCite if researchers press the big red 'Publish' button, registering a DOI for the data in the process, and there are plans to have it push metadata to the CRIS as well.

After lunch, Michael Charno of the Archaeology Data Service (ADS) [6] demonstrated how its in-house content management system (CMS) is being used to manage metadata for both archaeological data sets and grey literature. The metadata are mostly elicited from depositors, though the ADS supplies technical and management metadata. Master metadata records are kept in the CMS and then pushed out to the various systems operated by the ADS (e.g. ArchSearch, Grey Literature Library, Heritage Gateway). Most of these systems have their own metadata schema, but because they are all heavily based on Dublin Core, they interoperate well. Recently a module was added to the CMS so that metadata records could be pushed out to DataCite as well; the module also handles minting DOIs and updating the corresponding URLs.

Case Studies: Discovery Portals

In 2007, the British Library (BL) commissioned a study of the research data landscape, a major recommendation of which was that the BL should consider providing a data discovery service [7]. Rachael Kotarski of the BL took up the story of what happened next. A basic set of metadata was devised that supported data discovery without going into disciplinary detail. Sample records were then loaded into the BL's Explore service [8] and trials were conducted to see if users found the metadata set optimal. Having established that the service would be useful and used, the BL proceeded to work out how to integrate dataset records into the main BL catalogue. This meant mapping the metadata set to MARC fields and, because the software did not have an entry template for databases, selecting the most appropriate from those that were available. Some negotiation was required between the cataloguers and the dataset curators before everyone was happy. The remaining issues relate to the ongoing maintenance of the records; tactics such as updating URLs and metadata from DataCite records and repository landing pages are being considered.

The last case study of the day came from Steve Donegan of STFC, who took us through the development of the NERC Data Catalogue Service (DCS) [9]. This is a combined catalogue of the datasets held by all the NERC data centres. It is the third generation of the catalogue, the preceding iterations being the NERC DataGrid (NDG), with its Data Discovery Service and Data Portal, and the NERC Metadata Gateway. Since 2007 there has been a legal requirement for NERC to make metadata for its datasets available using the profile of ISO 19115 designed for the Infrastructure for Spatial Information in the European Community (INSPIRE) [10]. NERC began writing its own profile of ISO 19115 to help it comply with this, but then found the Marine Environmental Data and Information Network (MEDIN) was already using a workable profile, so NERC adopted that with some minor tweaks. Now that version 2 of the UK GEMINI profile is available [11], NERC is considering switching to that instead. Steve concluded by explaining the current and planned architecture of the DCS, in particular how it relates (and will relate) to the UK Location Portal, which implements the INSPIRE requirements for the UK.

Advanced Metadata Applications

The day's talks concluded with David Shotton of the University of Oxford introducing RDF and the advanced applications to which it may be put. He showed how he had reconstructed the DataCite metadata schema using elements from Dublin Core [12], FOAF [13], PRISM [14], the Semantic Publishing and Referencing Ontologies (SPAR) [15], and a custom ontology [16]. He went on to demonstrate a Web form for entering DataCite metadata, automatically generated from this RDF scheme [17]. Finally, he discussed the practicalities of data citation before introducing the Open Citations Project [18]. The latter is an initiative to compile a database of biomedical literature citations; so far these have been harvested from PubMed Central, but the Project intends to pull them in from CrossRef and DataCite as well.
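For readers unfamiliar with RDF, the following Python sketch shows the general idea of expressing a dataset description as subject-predicate-object triples that reuse existing vocabularies. It is not the mapping presented in the talk: the choice of the Dublin Core Terms properties `title` and `creator` and the FOAF property `name` is an assumption made for illustration, as are the function names and the sample DOI. The output uses the simple N-Triples serialisation and relies only on the standard library.

```python
# Namespaces of two real vocabularies; which properties to use is an assumption.
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"


def uri(value):
    """Wrap an absolute URI in angle brackets, as N-Triples requires."""
    return "<{}>".format(value)


def literal(text):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"{}"'.format(escaped)


def record_to_triples(doi, title, creator_names):
    """Describe a dataset as (subject, predicate, object) triples."""
    subject = uri("http://dx.doi.org/" + doi)
    triples = [(subject, uri(DCTERMS + "title"), literal(title))]
    for index, name in enumerate(creator_names):
        # Each creator is a blank node carrying a FOAF name.
        person = "_:creator{}".format(index)
        triples.append((subject, uri(DCTERMS + "creator"), person))
        triples.append((person, uri(FOAF + "name"), literal(name)))
    return triples


def to_ntriples(triples):
    """Serialise triples one per line, each terminated by ' .'."""
    return "\n".join("{} {} {} .".format(s, p, o) for s, p, o in triples)


if __name__ == "__main__":
    # Hypothetical DOI and names, used only to show the output shape.
    print(to_ntriples(record_to_triples(
        "10.1234/example", "Example Survey Dataset", ["Doe, Jane"])))
```

Because every property is identified by a URI from a shared vocabulary, records built this way can be merged with descriptions from other sources without a bespoke crosswalk, which is the kind of benefit an RDF rendering of the schema is meant to offer.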

Discussion

To round things off, Caroline Wilkinson of the British Library chaired a discussion of issues raised by the talks. Of particular concern were how to handle sensitive and embargoed data, and how to satisfy the needs of both humans and machines in DOI landing pages. A couple of delegates said they now had a better appreciation of how DataCite's optional metadata would be used, and would therefore make a greater effort to provide it.

Conclusions

Confounding what one might expect from a workshop on metadata, the day turned out to be enjoyable, informative and highly relevant on a practical level to those who attended. I personally learned a lot about initiatives of which I had previously had only a shaky grasp. There wasn't quite the level of questioning and debate you'd expect from an event like this, but this was a credit to the clarity of the speakers rather than a sign of any lack of engagement from the delegates. Indeed, I had the impression that the workshop would make a tangible difference to how several institutions handle their research data.

The slides from the presentations are available from the British Library Web site [19]. The third workshop in this series has yet to be announced at the time of writing, but is expected to take place in October 2012.

References

[1] British Library DataCite workshops http://www.bl.uk/aboutus/stratpolprog/digi/datasets/dataciteworkshops/
[2] Reference Model for an Open Archival Information System (OAIS), Consultative Committee for Space Data Systems (CCSDS), Magenta Book CCSDS 650.0-M-2, June 2012 http://public.ccsds.org/publications/archive/650x0m2.pdf
[3] DataCite Metadata Schema for the Publication and Citation of Research Data, Version 2.2, July 2011 http://dx.doi.org/10.5438/0005
[4] data.bris Project http://data.blogs.ilrt.org/
[5] Apache Tika http://tika.apache.org/
[6] Archaeology Data Service http://archaeologydataservice.ac.uk/
[7] British Library, Background to Datasets Activity http://www.bl.uk/reshelp/experthelp/science/sciencetechnologymedicinecollections/researchdatasets/datasets.html
[8] Explore the British Library http://explore.bl.uk/
[9] NERC Data Catalogue Service http://data-search.nerc.ac.uk/
[10] Infrastructure for Spatial Information in the European Community (INSPIRE) http://inspire.jrc.ec.europa.eu/
[11] UK GEMINI (Geo-spatial Metadata Interoperability Initiative) http://www.agi.org.uk/uk-gemini/
[12] Dublin Core Metadata Initiative http://dublincore.org/
[13] Friend of a Friend (FOAF) Vocabulary http://xmlns.com/foaf/spec/
[14] Publishing Requirements for Industry Standard Metadata (PRISM) http://www.prismstandard.org/specifications/Prism1%5B1%5D.2.pdf
[15] Semantic Publishing and Referencing Ontologies (SPAR) http://purl.org/spar
[16] David Shotton and Silvio Peroni, The DataCite Ontology, Version 0.5, July 2012 http://purl.org/spar/datacite/
[17] Tanya Gray and David Shotton, DataCite Input Form http://www.miidi.org:8080/datacite/
[18] Open Citations database http://opencitations.net/
[19] British Library, Previous Workshops http://www.bl.uk/aboutus/stratpolprog/digi/datasets/workshoparchive/archive.html

Author Details

Alex Ball
Research Officer
UKOLN
University of Bath

Email: a.ball@ukoln.ac.uk
Web site: http://homes.ukoln.ac.uk/~ab318/

Alex Ball is a Research Officer working in the field of digital curation and research data management, and an Institutional Support Officer for the Digital Curation Centre. His interests include Engineering research data, Web technologies and preservation, scientific metadata, data citation and the intellectual property aspects of research data.