DataCite UK User Group Meeting
DataCite [1] is an international not-for-profit organisation dedicated to making research data a normal, citable part of the scientific record. Its membership comprises 15 major libraries and data centres which, along with four associate members, represent 11 countries across four continents. DataCite's current approach centres on assigning Digital Object Identifiers (DOIs) to datasets; it is a member of the International DOI Foundation and one of a handful of DOI registration agencies. As such, it is assembling a suite of services that allow repositories and data centres to assign DOIs to their holdings, associate metadata with those DOIs, and maintain and use those metadata.
This User Group meeting was a chance for UK users of DataCite services to share their experiences so far, to influence the future direction of the services and the underlying metadata scheme, and to try out the latest versions of the DataCite tools. It also attracted those who, like myself, were not yet users, but were nevertheless interested in the practicalities and potential of these services.
Short Talks
Following a welcome and introduction to the day from Max Wilkinson of the British Library (BL), the meeting began with a series of presentations from projects and centres which are already working with DOIs for data.
Mike Haft of the Freshwater Biological Association (FBA) outlined how his organisation is developing a repository to support easier access to its data holdings. The work so far, carried out by the FISHNet Project [2], has focused on creating a solid repository back-end upon which a light and flexible front-end may be built. The workflows built into the repository allow DOIs to be assigned to datasets that have passed usability assessments. There are still issues to be addressed, however: how to assign DOIs to frequently updated datasets and to aggregate data products; how to co-ordinate DOI management for jointly owned datasets; and how to handle annotated datasets formed from many small samples.
Sarah Callaghan of the Centre for Environmental Data Archival (CEDA) gave an overview of how the British Atmospheric Data Centre (BADC) has decided to implement dataset DOIs [3]. It has established four eligibility criteria that its datasets have to satisfy before receiving a DOI: the dataset has to be stable (i.e. the data will not change), complete (i.e. no more data will be added), permanent (i.e. kept in perpetuity) and of good quality. There is also a fifth criterion — authors’ permission — which is a greater hurdle for older datasets than for new and future ones. Once the DOI is assigned, the associated dataset is fixed: no additions, deletions or modifications to the bits or the directory structure are allowed; any such changes would result in a new version that would need a new DOI. Anyone resolving one of these DOIs would be taken to the catalogue record for the dataset, which provides a link to the data themselves and may also link to records for other versions. BADC follows the Natural Environment Research Council (NERC) convention of using a 32-digit hexadecimal number known as a Globally Unique Identifier (GUID) for the DOI suffix (the part after the 10.xxx/).
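As an aside, a suffix of this form can be generated with a single standard library call in most languages. The following minimal Python sketch is my own illustration; the DOI prefix shown is a placeholder, not NERC's actual prefix.

```python
import uuid

# A 32-digit hexadecimal GUID of the kind NERC uses for DOI suffixes.
suffix = uuid.uuid4().hex
doi = "10.1234/" + suffix  # hypothetical prefix for illustration only
print(doi)                 # e.g. 10.1234/3f0c9e2d8a7b4c6e9d1f2a3b4c5d6e7f
```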
Anusha Ranganathan, University of Oxford, explained that Oxford's DataBank [4] is using DOIs as a complement to its internal identifier system. The local identifiers consist of the string ‘bodleian’ followed by an accession code to represent (the latest version of) a dataset. Particular versions are specified by appending a dot and the version number, while files within that version are specified by further appending a slash and the filename. The local identifier for a particular version may be “promoted” to a DOI if it has a public catalogue page. Where a dataset already has a DOI, a new version triggered by a change to the data automatically receives a new DOI, but a new version triggered by additional files or changed metadata receives one only if the authors explicitly request it.
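These construction rules are simple enough to express in a few lines. Here is a rough Python sketch of my own; the ‘bodleian’ prefix follows the talk, but the accession code and function name are invented for illustration.

```python
def databank_identifier(accession, version=None, filename=None):
    """Construct a DataBank-style local identifier: 'bodleian' plus an
    accession code denotes the latest version of a dataset; appending a
    dot and a version number pins a particular version; appending a
    slash and a filename picks out a file within that version."""
    identifier = "bodleian" + accession
    if version is not None:
        identifier += "." + str(version)
        if filename is not None:
            identifier += "/" + filename
    return identifier

print(databank_identifier("0001"))              # bodleian0001
print(databank_identifier("0001", 2))           # bodleian0001.2
print(databank_identifier("0001", 2, "a.csv"))  # bodleian0001.2/a.csv
```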
Gudmundur Thorisson, University of Leicester, introduced the Café RouGE (now Café Variome) clearinghouse for genetics data [5]. It had planned to assign DOIs to incoming data from diagnostic laboratories, using suffixes starting with the string ‘caferouge’ followed by a dot, the gene name, a hyphen and then a numeric identifier. On the day, the wisdom of using the service name in the DOI was questioned due to (well-founded) fears it might change over time. The greatest concerns of the Café RouGE team were how to relate DOIs to dynamic datasets, and how ORCID [6] might be used in DataCite metadata to identify dataset authors.
Michael Charno of the Archaeology Data Service (ADS) reported how his organisation has used the DataCite application programming interfaces (APIs) to assign DOIs to around 400 datasets [7]. A few of these datasets have been updated subsequent to receiving a DOI; as a matter of policy, obsolete versions are removed from live access, so their DOI landing pages change to a page that refers the reader to the latest version. In future, these redirection pages may also contain instructions for applying to ADS for a copy of the obsolete version. The ADS also holds around 9,000 survey reports that it hopes to furnish with DOIs; before it does so it needs to establish that the contractors (e.g. Oxford Archaeology, Wessex Archaeology) are happy for ADS to take on that responsibility. A random number is used as the DOI suffix.
Susan Noble of MIMAS pointed out the challenges of applying DOIs to the datasets redistributed by ESDS International [8]. These datasets are provided free of charge to the UK Further and Higher Education sectors, but must otherwise be accessed for a fee from the original publishers. This means there is a risk of a dataset receiving one DOI relevant to UK researchers and another relevant to everyone else; this highlights the need for a service providing multiple resolutions for the same DOI, similar to that provided by CrossRef. Many of the datasets are also frequently updated. The solution has been to mint DOIs for monthly snapshots of the data; all DOIs for the same dataset share the same landing page, and are visible on that page, but only the latest snapshot is readily available. The DOI suffix is generated from the author, title and snapshot date for the dataset.
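By way of illustration, a suffix along these lines might be derived as in the following Python sketch. The slug-making approach and the example dataset are mine, not ESDS International's actual algorithm.

```python
import re

def snapshot_suffix(author, title, snapshot_date):
    """Illustrative slug-style DOI suffix built from author, title and
    snapshot date, in the spirit of the ESDS International approach."""
    slug = "-".join([author, title, snapshot_date]).lower()
    # Replace anything other than letters, digits and hyphens.
    return re.sub(r"[^a-z0-9-]+", "-", slug).strip("-")

print(snapshot_suffix("imf", "international financial statistics", "2011-05"))
# -> imf-international-financial-statistics-2011-05
```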
Matthew Brumpton of the UK Data Archive (UKDA) indicated how his organisation is adding DOIs to its collection of published datasets [9]. Each of these datasets already has a study number identifier, so this is used as the basis for the DOI suffix. Each DOI resolves to a special metadata page that contains the DOI, the full change history of the study and a link to the current catalogue record. High-impact changes to the data or metadata result in a new version and DOI, but low-impact changes (such as typographic corrections) do not. The UKDA has written its own interface for managing DOIs and DataCite metadata, using the DataCite APIs to connect to the central service.
Srikanth Nagella of the Science and Technology Facilities Council (STFC) showed how DOIs are being used in ICAT, the data catalogue used by large facilities such as the Diamond Light Source and ISIS [10]. The DOI suffixes are made up of the facility code (e.g. isis, dls), a dot, a code letter denoting either an experiment (e), a dataset (d) or a file (f), another dot, and a randomly assigned number. The DOIs resolve to the relevant catalogue record in ICAT.
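In other words, a suffix such as ‘isis.d.482915063’ would identify a dataset at ISIS. Here is a minimal Python sketch of the convention as described; the range of the random number is my assumption, not a detail from the talk.

```python
import random

def icat_suffix(facility, kind):
    """Compose an ICAT-style DOI suffix: facility code, a code letter
    (e = experiment, d = dataset, f = file) and a randomly assigned
    number, separated by dots."""
    assert kind in ("e", "d", "f")
    return "%s.%s.%d" % (facility, kind, random.randint(1, 10**9))

print(icat_suffix("isis", "d"))  # e.g. isis.d.482915063
print(icat_suffix("dls", "e"))   # e.g. dls.e.90517344
```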
Breakout Groups
Following the talks, we divided into groups to discuss some of the issues that had been raised. As already mentioned, a case was argued for DataCite to provide a service for resolving the same DOI to multiple locations, for cases where the route has an impact on the accessibility of the data. Another suggestion was for DataCite to provide sample contracts that could be used between a data publisher and a data redistributor. This approach would allow the latter to forge ahead with minting DOIs until the former decided it wished to take on that responsibility itself.
Several versioning issues were highlighted for further consideration. Of pressing concern was how to store multiple versions efficiently. With frequently updated time-series data, is it better to create snapshots of the entire time series at regular intervals, or chunk the data into a succession of short time series? How should the links between the different versions be exposed? Granularity was another vexing issue: it could become a significant administrative burden to provide DOIs for subsets and supersets of data, but how can users cite specific parts of a dataset using a DOI for the whole thing?
Peter Li (University of Birmingham) from the SageCite Project [11], along with others who had built tools on top of the DataCite APIs and metadata scheme, remarked that it would have been useful to have a forum for sharing code and approaches. Thankfully, this was a relatively simple matter to resolve: Michael Charno and Tom Pollard (BL) have now set up Google Groups for both DataCite developers [12] and users [13].
The DataCite Metadata Schema
After lunch, Elizabeth Newbold (BL) took us through version 2.0 of the DataCite Metadata Scheme [14][15], explaining its genesis and the rationale behind its five mandatory and 12 optional properties. The scheme is administered by the Metadata Supervisor at the German National Library of Science and Technology (TIB) with assistance from the DataCite Metadata Working Group. The mandatory properties correspond to the minimum metadata needed to construct a basic bibliographic reference, while the optional properties support fuller citations and simple data discovery applications.
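For the record, the five mandatory properties are Identifier, Creator, Title, Publisher and PublicationYear. The following Python snippet sketches what a minimal record satisfying them might look like; the namespace URI, DOI and names are placeholders of my own and should be checked against the official scheme [15].

```python
# A minimal DataCite v2.0 record containing only the five mandatory
# properties; the DOI, names and namespace URI are placeholders.
record = """<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns="http://datacite.org/schema/kernel-2.0">
  <identifier identifierType="DOI">10.1234/example-dataset</identifier>
  <creators>
    <creator><creatorName>Doe, Jane</creatorName></creator>
  </creators>
  <titles><title>An Example Dataset</title></titles>
  <publisher>Example Data Centre</publisher>
  <publicationYear>2011</publicationYear>
</resource>"""
```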
Having looked in detail at the metadata scheme, we took the opportunity to give our feedback on how it might be improved. The discussion mainly centred on the roles of various stakeholders in the data — creators, contributors, publishers, owners — and the options for identifying these stakeholders by formal identifiers instead of (or as well as) names. We also suggested that the use cases for the scheme should be clarified, so that more explicit guidance can be provided on how to use it.
Hands-on Exercise
The final session of the day, led by Ed Zukowski and Tom Pollard (BL), was an opportunity to practise using the DataCite tools to mint DOIs, toggle their activation status, and update their associated metadata. For this exercise, we teamed up into pairs, with each pair consisting (as far as possible) of an experienced user and a complete novice. Those of us with the code and right set-up to hand used the DataCite APIs to complete the exercise, while the others used the Web interface. In my pair, we used the APIs to perform the tasks and the Web interface to verify our progress. Both were easy and intuitive to use, and I found the code examples for the APIs most helpful.
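To give a flavour of the exercise, here is a rough Python sketch of how minting a DOI through the DataCite Metadata Store API looks. The credentials, DOI and URL are placeholders, and the exact workflow (metadata is normally uploaded before the DOI itself is registered) should be checked against the current DataCite documentation.

```python
import requests  # third-party HTTP library

MDS = "https://mds.datacite.org"
AUTH = ("ALLOC.USER", "secret")  # placeholder datacentre credentials

# Register a DOI by pairing it with its landing-page URL.
body = "doi=10.1234/example-dataset\nurl=http://example.org/datasets/1"
response = requests.post(
    MDS + "/doi",
    data=body.encode("utf-8"),
    auth=AUTH,
    headers={"Content-Type": "text/plain;charset=UTF-8"},
)
print(response.status_code, response.text)  # expect 201 Created

# Check which URL the DOI currently resolves to.
response = requests.get(MDS + "/doi/10.1234/example-dataset", auth=AUTH)
print(response.text)
```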
Conclusion
I found this workshop both enjoyable and motivating, with a good mix of theory and practice. The representatives from DataCite were clearly keen not only to communicate their work to date but also to develop it in response to the needs of their users and the wider community. There are still plenty of issues to solve and further systems to build before DataCite achieves its stated aims, but this is a good basis from which to proceed.
Acknowledgements
Many thanks to Tom Pollard, Gudmundur Thorisson and Peter Li for helpful comments.
References
1. The DataCite Web site http://datacite.org/
2. The FISHNet Web site http://www.fishnetonline.org/
3. The British Atmospheric Data Centre Web site http://badc.nerc.ac.uk/
4. The DataBank Web site http://databank.ouls.ox.ac.uk/
5. The Café Variome Web site http://www.cafevariome.org/
6. The Open Researcher and Contributor ID Initiative Web site http://www.orcid.org/
7. The Archaeology Data Service Archives catalogue http://archaeologydataservice.ac.uk/archives/
8. The ESDS International Web site http://www.esds.ac.uk/international/
9. The UK Data Archive Web site http://www.data-archive.ac.uk/
10. The ICAT Project Web site http://www.icatproject.org/
11. The SageCite Project Web site http://www.ukoln.ac.uk/projects/sagecite/
12. The DataCite developers discussion group Web page http://groups.google.com/group/datacite-developers
13. The DataCite users discussion group Web page http://groups.google.com/group/datacite-users
14. Starr, J., Gastl, A., “isCitedBy: A Metadata Scheme for DataCite”, D-Lib Magazine, 17(1/2), January/February 2011 http://dx.doi.org/10.1045/january2011-starr
15. The latest version of the DataCite Metadata Scheme is available from the DataCite Web site http://schema.datacite.org/
Author Details
Alex Ball
Research Officer
UKOLN
University of Bath
Email: a.ball@ukoln.ac.uk
Web site: http://www.ukoln.ac.uk/
Alex Ball is a research officer with UKOLN, University of Bath, working in the area of data curation and digital preservation. His research interests include scientific metadata, Open Data and the curation of engineering information and Web resources. He is a member of staff at the UK’s Digital Curation Centre; his previous projects include ERIM and the KIM Grand Challenge Project, and he contributed to the development of the Data Asset Framework.