Making Datasets Visible and Accessible: DataCite's First Summer Meeting
On 7-8 June 2010, DataCite held its First Summer Meeting in Hannover, Germany. More than 100 information specialists, researchers, and publishers came together to focus on making datasets visible and accessible [1]. Uwe Rosemann, German Technical Library (TIB), welcomed delegates and handed over to the current President of DataCite, Adam Farquhar, British Library. Adam gave an overview of DataCite, an international association that aims to support researchers by enabling them to locate, identify, and cite research datasets with confidence. Adam described DataCite as a ‘CrossRef for data’ and called for delegates to work together to explore the roles and responsibilities of publishers, data centres, and libraries.
Session 1: Metadata for Datasets: More Than Pure Citation Information?
The session kicked off with Toby Green, OECD Publishing, describing datasets as the ‘lost sheep’ in scholarly publishing. Metadata, Toby said, were the sheepdogs that help authors to cite, publishers to link, discovery systems to find, and librarians to catalogue. Toby demonstrated how OECD Publishing were using metadata to present datasets and highlighted some key citation challenges, which included dynamic datasets, different renditions of datasets, and what he referred to as the ‘Russian Doll Problem’, where datasets are progressively merged.
Jan Brase, DataCite and the German Technical Library, explained how the German Technical Library had used metadata to create a discovery service linking to external resources of data, such as data centres. Making data citable, Jan said, increases visibility of datasets, and supports not only easy reuse and verification, but also metrics for assessing the impact of data sharers. Jan noted that DataCite had agreed a core set of metadata elements (Creator, Title, Publisher, Publication Date, Discipline, and DOI) and was now focusing on optional elements and vocabularies.
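To make the idea of a core element set concrete, the sketch below shows how such a record might be assembled into a human-readable data citation. It is an illustration only, assuming hypothetical field names, example values, and a citation layout of my own; it is not DataCite's normative schema or recommended citation format.

```python
# Illustrative sketch only: a minimal record holding the core elements listed
# above, rendered as a human-readable citation. Field names, example values,
# and the citation layout are assumptions for illustration, not DataCite's
# normative schema or recommended format.

core_record = {
    "creator": "Example, A.",            # hypothetical example values
    "title": "Example dataset title",
    "publisher": "Example Data Centre",
    "publication_date": "2010",
    "discipline": "Earth Sciences",
    "doi": "10.1234/EXAMPLE.12345",      # hypothetical DOI
}

def format_citation(rec: dict) -> str:
    """Assemble a simple citation string from the core metadata elements."""
    return (f'{rec["creator"]} ({rec["publication_date"]}): {rec["title"]}. '
            f'{rec["publisher"]}. doi:{rec["doi"]}')

print(format_citation(core_record))
```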
Following Jan, Wolfgang Zenk-Moltgen, GESIS, Leibniz Institute for the Social Sciences, gave an overview of the Data Documentation Initiative (DDI), which aims to create an international standard for describing social science data [2]. The latest version of the standard, DDI 3.1, is built around a complex lifecycle model that allows metadata to be collected at all stages of a study, from conception to publication. The standard, said Wolfgang, could support data collection, distribution, discovery, analysis, and repurposing.
Session 2: Peer-review Systems and the Publication of Datasets: Ensuring Quality
Starting positively, Hans Pfeiffenberger, Earth System Science Data Journal, declared that ‘we who care about publishing data have won’. His comment referred to the vision statement of a European group of funders, who had called for ‘open access to the output of publicly funded research and permanent access to primary quality assured research data’ [3]. Hans suggested that publishing data through peer-reviewed journals could help to ensure quality. Earth System Science Data Journal, he explained, asked reviewers to consider the originality, significance, and quality of data as part of the peer review process. Not only did this help to ensure quality, he said, but it helped to reward researchers for data publication.
After Hans, Matthias Razum, Fachinformationszentrum (FIZ) Karlsruhe, Leibniz Institute for Information Infrastructure, introduced eSciDoc, an e-research environment that supports collaboration among science communities [4]. eSciDoc aims to support the whole research process from acquisition to archiving. By integrating with existing research systems, eSciDoc allows the context of research to be collected as metadata, helping to ensure quality.
The ability of researchers to produce ever-increasing volumes of data has led to a ‘structural crisis of scientific publishing’, said Michael Diepenbroek, World Data System. Michael said that simply increasing access to data was not enough, because knowing the quality of data was also important. Issuing DOIs to datasets raises the possibility of cross-referencing DOIs between publications and constructing novel metrics. This is an attractive incentive both for those who produce data and for those who publish it.
Michael Lautenschlager, German Computing Centre for Climate and Earth System Research (DKRZ), discussed the development of a Web-based workflow for publication of climate data. The workflow focused on four stages of data publication: permission to publish, scientific quality assurance, technical quality assurance, and final publication. The project team was developing quality assurance toolboxes and browser interfaces to assist scientists and publication agents with each of the workflow stages.
The Global Biodiversity Information Facility (GBIF) is an international initiative to facilitate free and open access to biodiversity data worldwide. Vishwas Chavan, GBIF, said that the GBIF had identified five components in the data-sharing challenge: technical and infrastructure; policy and political; socio-cultural; economic; and legal. Changing researcher behaviour, suggested Vishwas, represented a greater challenge than developing the technical infrastructure, and until the value of data was recognised, he said, it would always be considered the by-product of an article. His answer to this was a data usage index (DUI) allowing the impact of data publication to be measured [5].
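To illustrate the general idea of a usage-based measure of data impact, here is a minimal sketch of how such a score might be computed from search and download counts. The indicator names and weights are assumptions chosen purely for illustration; they are not the specific Data Usage Index defined in [5].

```python
# Minimal illustration of a usage-based score for a dataset, normalised by
# the number of records it contains. This is NOT the Data Usage Index as
# defined in [5]; the indicators and the weighting are assumptions chosen
# only to illustrate the idea of measuring data-publication impact.

def usage_score(searches: int, downloads: int, n_records: int,
                download_weight: float = 2.0) -> float:
    """Weight downloads more heavily than searches, per record served."""
    if n_records == 0:
        return 0.0
    return (searches + download_weight * downloads) / n_records

print(usage_score(searches=1200, downloads=300, n_records=50))  # 36.0
```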
In the final talk of the day, Andreas Hense, Bonn-Rhine-Sieg University, looked to Australia, which he said was a world leader in data publication. Andreas said the key to achieving quality data publication was motivating researchers and facilitating the publication process. Motivation could be achieved, he said, through creating renowned data publications and increasing the visibility of data. Andreas described the ARCS Data Fabric, which provides reliable storage for early phase research, assists with metadata capture, and simplifies publication [6].
Session 3: Trustworthiness of Data Centres: A Technological, Structural and Legal Discussion
Tuesday 8 June opened with a talk by Henk Harmsen, Data Archiving and Networked Services (DANS), who gave an overview of the DANS Data Seal of Approval [7]. The Seal of Approval is a minimum set of 16 requirements that DANS considers necessary for a responsible data centre, including three for data producers, three for data consumers, and ten for data repositories. Approval consists of a self-assessment, which must be made publicly available via the repository’s Web site, followed by a review by the Data Seal of Approval Board. Henk noted that self-assessment was simple to implement, taking no longer than a day. By September 2010, he added, a tool would be available to streamline the self-assessment process.
Jens Klump, GFZ German Research Centre for Geosciences, spoke next, presenting the Nestor Catalogue of Criteria for Trusted Digital Repositories [8]. The criteria, which have been developed to determine dependability of data centres, take into account service adequacy, measurability, documentation, and transparency. There could be no ‘one-size fits all’, Jens said, so the guidelines had been deliberately designed in an abstract way to allow application across a broad range of organisations.
The final presentation of the session, by Stefan Winkler-Nees, German Research Foundation (DFG), gave a funder’s perspective. Stefan suggested that 80-90% of the research data produced by projects funded by DFG, the largest funding body in Germany, ‘never see the light or are actually lost’. DFG’s vision, he said, was that ‘research data should be freely accessible…easy to get … and professionally curated on a long-term basis’. Achieving this vision, Stefan explained, would have numerous benefits, such as allowing new interpretations of data, preserving unique and non-reproducible data, and promoting quality scientific practice. Stefan estimated that as much as €10 million could be saved annually if scientists in Flanders saved 1% of their working hours by using a professional data management service. The major challenge, said Stefan, was creating incentives for scientists to share their data.
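As a back-of-envelope illustration of how a saving of 1% of working hours can reach a figure of the order of €10 million, consider the calculation below. All of the input figures (researcher head-count, annual hours, hourly cost) are assumptions invented for illustration, not numbers reported in the talk.

```python
# Back-of-envelope illustration of how "1% of working hours saved" can reach
# a figure of the order of €10 million per year. All input figures are
# illustrative assumptions, not values reported in the talk.

researchers = 12_500        # assumed number of researchers
hours_per_year = 1_600      # assumed annual working hours per researcher
cost_per_hour = 50          # assumed cost in euros per working hour
fraction_saved = 0.01       # 1% of working hours

annual_saving = researchers * hours_per_year * cost_per_hour * fraction_saved
print(f"€{annual_saving:,.0f} per year")   # €10,000,000 per year
```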
Session 4: Best Practice and Examples: What Can Be Done and Is Done Worldwide
Merce Crosas, DataVerse, Harvard University, opened the session with an overview of DataVerse, an open source application ‘to publish, share, reference, extract and analyse research data’ [9]. The idea of DataVerse, said Merce, was to solve the problems of data sharing with technology, reducing the burden for researchers, research institutions, and data publishers. By installing a DataVerse Network, an institution is able to create multiple DataVerses for its research groups, providing a data publication framework that supports author recognition, persistent citation, data discovery, and legal protection.
After Merce, Adrian Burton, Australian National Data Service (ANDS), gave an overview of ANDS, a federally funded initiative to support the reuse of research data throughout Australia. Trends driving ANDS, he said, included the rise in data science, increasing openness of public data, and the rising prominence of Freedom of Information (FOI) laws. Adrian highlighted the Government 2.0 Taskforce, which had been established by the Australian government to look at ways of using Web technologies to communicate with the public [10]. The recommendations of the taskforce were widely accepted by the government and had paved the way for public data to be used in interesting ways. He gave the example of a ‘Know Where You Live’ mashup, created using Australian Bureau of Statistics data [11]. Open access to the Bureau data, he said, had supported the creation of this beautiful and compelling interface.
Adrian said that despite the trends towards data sharing, there were still few incentives for researchers to share their data. ANDS is therefore developing services that provide such incentives, including ‘Identify my Data’, which persistently identifies researchers’ data, and ‘Register my Data’, which helps researchers to describe their data. Adrian said that while ANDS provides a discovery portal for searching for data, he was not convinced of its impact, given the already widespread provision of similar portals. Instead, he said, ANDS took a syndication approach, taking descriptions of otherwise inaccessible collections and publishing them as readable Web pages, so that search engines can index the data.
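The short sketch below illustrates the syndication idea: rendering a machine-readable collection description as a static, crawlable HTML landing page that search engines can index. The record fields, function name, and page layout are assumptions for illustration, not ANDS's actual implementation.

```python
# Illustrative sketch of the syndication approach described above: take a
# machine-readable collection description and render it as a static,
# crawlable HTML landing page so that search engines can index it.
# The record fields and page layout are assumptions, not ANDS's software.

from html import escape

def landing_page(record: dict) -> str:
    """Render a collection description as a minimal HTML page."""
    return f"""<!DOCTYPE html>
<html>
  <head><title>{escape(record['title'])}</title></head>
  <body>
    <h1>{escape(record['title'])}</h1>
    <p>{escape(record['description'])}</p>
    <p>Contact: {escape(record['contact'])}</p>
  </body>
</html>"""

example = {                                   # hypothetical collection record
    "title": "Hypothetical survey collection",
    "description": "Description of an otherwise inaccessible collection.",
    "contact": "data.manager@example.edu",
}
print(landing_page(example))
```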
Next up, Gert König-Langlo, Alfred Wegener Institute for Polar and Marine Research, acquainted delegates with the Earth’s radiation budget, which considers incoming and outgoing radiation as a key driver of the world climate system. Gert gave an overview of the Baseline Surface Radiation Network, which has a well established system for data collection and publication. Radiation measurements are taken via 47 data stations around the world. Each station has a data scientist who is responsible for submitting data to the World Radiation Center on a monthly basis. Following quality checks, datasets are assigned DOI names and archived in Pangaea.
Lee Dirks, Microsoft Research, gave an overview of projects at Microsoft Research, including collaborations with organisations such as DataCite, Open Planets Foundation, Creative Commons, and DataONE. Lee highlighted several developments, including a Creative Commons add-in for Word, allowing licences to be embedded within documents, and a data curation add-in for Excel created in collaboration with DataONE. Responding to a question from the audience, Lee said that Microsoft Research had no immediate intention to develop a data archive, but he would not rule this out.
Following Lee’s presentation, William (Bill) Michener, DataONE and University of New Mexico, offered a choice: ‘better data management or planetary destruction’! Bill highlighted the environmental threats facing the planet and said that better data publishing, sharing, and archiving could help to tackle these threats. The Data Observation Network for Earth (DataONE), he explained, aimed to provide open access to Earth data in order to help people to understand and conserve the environment. Bill went on to say that citizen science could provide exemplars for how we deal with data. He gave the example of eBird, a citizen science network that allows recreational bird-spotters to record bird sightings. Data from the network had been instrumental in guiding clean-up efforts after the Gulf of Mexico oil spill [12], and the project had been remarkably successful in getting people to archive and reuse data. DOIs had not been used here, he explained, which was something to be considered when trying to change the cultural paradigm.
Hannes Grobe, Pangaea and Alfred Wegener Institute, gave an overview of Pangaea, an open access repository of geoscientific and environmental data which has been working in partnership with a range of publishers since 2009. Referring to a Nature news feature entitled ‘Empty Archives’, Hannes said that by far the biggest challenge for data centres was acquiring data, and he stressed the importance of metadata, noting that ‘Nothing exists in any useful sense until it is identified’ [13]. Hannes demonstrated the power of geographical metadata by finding a dataset based in Hannover [14], and nicely rounded off the session with an example of one of the ‘interesting things you may find behind a DOI’: a singing iceberg [15].
Session 5: Visualisation of Datasets: More Than Meets the Eye
Jürgen Bernhard, Interactive Graphics Systems Group, University of Technology Darmstadt, focused on the use of visual methods for searching data: a researcher could, for example, sketch a profile on a graph and retrieve datasets with similar trends. Jürgen’s work was at an early stage, just six months into a PhD project, and was focused on developing a prototype search tool and researching possible use cases.
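The sketch below shows one simple way such query-by-sketch matching could work: comparing a hand-drawn profile against stored time series and ranking them by similarity. Euclidean distance is used here purely as a stand-in; the talk did not specify which similarity measure the prototype uses, and the function and dataset names are hypothetical.

```python
# Hedged sketch of the query-by-sketch idea: compare a hand-drawn profile
# against stored time series and rank them by similarity. Euclidean distance
# is a simple stand-in; the actual prototype's similarity measure was not
# described in the talk.

import math

def distance(sketch: list[float], series: list[float]) -> float:
    """Euclidean distance between a sketched profile and a series of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sketch, series)))

def rank_by_similarity(sketch: list[float], datasets: dict) -> list:
    """Return dataset names ordered from most to least similar to the sketch."""
    return sorted(datasets, key=lambda name: distance(sketch, datasets[name]))

datasets = {                                  # hypothetical example series
    "series_a": [0.0, 1.0, 2.0, 3.0],
    "series_b": [3.0, 2.0, 1.0, 0.0],
}
print(rank_by_similarity([0.0, 1.1, 1.9, 3.2], datasets))  # ['series_a', 'series_b']
```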
Great visualisations tell stories, said Toby Green, OECD Publishing, in his second talk of the meeting. As an example, Toby showed Charles Minard’s elegant representation of the fate of Napoleon’s soldiers on their march to Moscow [16]. He went on to introduce some of OECD’s visualisation tools, using OECD Regional Statistics to demonstrate how dynamic visualisations could draw attention to events that might otherwise have gone unnoticed. The key to a successful visualisation, Toby noted, was tailoring it to the audience. He highlighted some visualisation sites that he liked, including a chart showing the evolution of privacy settings on Facebook and a map showing the flow of taxis in New York City. Toby finished his talk by drawing attention to some popular data-related sites, including Swivel [17] and Many Eyes [18], which support data sharing and visualisation, and the Guardian DataBlog [19].
Brian McMahon, International Union of Crystallography, last to speak in the session, discussed visualisation of chemical data. Brian categorised crystallographic data into three groups: raw data (for example photographic film), primary data (analyses of raw data), and derived data (created from primary data). Publication of derived data was already widely practised, Brian said, and there was a movement towards publication of raw and primary data to support data mining and validation. Brian showed how crystal structures were commonly displayed within articles through embedded Java applets, and demonstrated an enhanced toolkit allowing users to create custom visualisations of published data.
Session 6: Datasets and Scholarly Communication: A Perfect Combination?
Beginning the final session, Susanne Haak, Thieme Publishing, introduced a collaborative project in which Thieme was working alongside the German Technical Library (TIB) to publish primary data in chemistry journals. The partnership, which was one of too few between publishers and libraries, said Susanne, had created a simple workflow allowing articles and data to be published independently but linked via DOI names. The workflow was still in development with open questions in areas such as peer review, legal ownership, and copyright.
Continuing in a similar vein, IJ Aalbersberg, Elsevier, explained how articles in ScienceDirect were being linked to underlying data. Aalbersberg suggested that attaching data to articles as supplementary material belonged in ‘the past’. Publishing data as supplements, he said, meant the data was distributed in multiple locations, frozen in time, and limited in size. According to the recent PARSE.Insight study, said Aalbersberg, researchers preferred data to be stored in repositories independent of the publisher [20]. This was already happening, he noted, in partnerships between publishers and data centres such as Pangaea. In the future, Aalbersberg said, closer interoperability between data centres and publishers could provide a richer user experience by creating single-page environments giving access to both the article and the data. Beginning the following week, Aalbersberg said, ScienceDirect would take a step towards this future by launching a new ‘extended view’ for chemistry articles, allowing supplementary data to be rendered directly within an article.
Delivering the final talk of the meeting, Eefke Smit, International Association of Scientific, Technical, and Medical Publishers, gave two key messages. Firstly, she said, it was time for stakeholders to converge and form partnerships. Secondly, she called for data and article publishing to be considered in the context of each other, to support the integration of services. Eefke noted that the PARSE.Insight report had found that scientists had a strong wish for better ways of sharing and discovering data, and said that publishers had ‘willingness in abundance’ to make this happen [20].
Time for Convergence
The benefits of making datasets visible and accessible were universally accepted and the meeting provided strong examples of how the research community is moving towards this aim. The overwhelming message was that the greatest challenge would be to encourage a research culture that was motivated to share, manage and make data persistent. Attaining this, it seems, requires better tools for data management and incentives for researchers to make their data visible and accessible, such as standards for data citation. The mood of delegates was positive and there was a keen willingness amongst stakeholders to work together. If this convergence is achieved, the goal of visible and accessible data could quickly become a reality.
References
1. DataCite Web site http://www.datacite.org/
2. Data Documentation Initiative Web site http://www.ddialliance.org/
3. The EUROHORCs and ESF Vision on a Globally Competitive ERA and their Road Map for Actions to Help Build It, EUROHORCs-ESF Science Policy Briefing 33 http://www.esf.org/publications/science-policy-briefings.html
4. eSciDoc Web site http://www.escidoc.org/
5. Chavan V and Ingwersen P, BMC Bioinformatics 2009, 10(Suppl 14): S2. doi:10.1186/1471-2105-10-S14-S2 http://www.biomedcentral.com/1471-2105/10/S14/S2
6. ARCS Data Fabric Web site http://www.arcs.org.au/index.php/arcs-data-fabric
7. DANS Data Seal of Approval Web site http://www.datasealofapproval.org/
8. Nestor Catalogue of Criteria for Trusted Digital Repositories Web site http://www.dcc.ac.uk/resources/tools-and-applications/nestor
9. The DataVerse Network Project Web site http://thedata.org/
10. Government 2.0 Taskforce Web site http://gov2.net.au/
11. Know Where You Live Web site http://www.hackdays.com/knowwhereyoulive/
12. eBird Gulf Coast Oil Spill Bird Tracker Web site http://ebird.org/content/ebird/news/new-gulf-coast-oil-spill-bird-tracker
13. Nelson B, “Data Sharing: Empty Archives”, Nature 461, 160-163 (2009). doi:10.1038/461160a http://www.nature.com/news/2009/090909/full/461160a.html
14. Hannover dataset http://doi.pangaea.de/10.1594/PANGAEA.206394
15. The Singing Iceberg http://dx.doi.org/10.1126/science.1117145
16. Napoleon March Visualisation http://en.wikipedia.org/wiki/File:Minard.png
17. Swivel Web site http://www.swivel.com/
18. Many Eyes Web site http://manyeyes.alphaworks.ibm.com/manyeyes/
19. The Guardian DataBlog http://www.guardian.co.uk/news/datablog
20. PARSE.Insight Report http://www.parse-insight.eu/downloads/PARSE-Insight_D3-6_InsightReport.pdf