A Dublin Core Application Profile for Scholarly Works
In May 2006, the Joint Information Systems Committee (JISC) [1] approached UKOLN [2] and the Eduserv Foundation [3] to collaborate on the development of a metadata specification for describing eprints (alternatively referred to as scholarly works, research papers or scholarly research texts) [4]. A Dublin Core (DC) [5] application profile was chosen as the basis of the specification given the widespread use of DC in existing repositories, the flexibility and extensibility of the DCMI Abstract Model [6] and its compatibility with the Semantic Web [7]. The main driver for this work was the establishment of a three-year project to aggregate content from repositories and offer cross-searching and other added-value services [8]. Drawing on the conclusions of the ePrints-UK Project [9] and the findings of the ongoing PerX Project [10], JISC was quick to identify that the quality and consistency of metadata would be a critical success factor for this project.
The work was carried out over a three-month period from May to July 2006, before moving in early August into what we have termed our community acceptance period. A working group of invited experts was assembled to contribute to the development of the application profile, both in person and through an active email discussion list, and the project deliverables were developed in the open, collaborative arena of the UKOLN Repositories Research Team wiki [11]. The core deliverables were a functional requirements specification, the application model, the application profile and usage guidelines, an eprints XML schema and 'dumb-down' guidelines. The remainder of this article offers a whistle-stop tour of the development process that led to the production of these deliverables and the application profile as a whole.
Identifying Metadata Requirements for Describing Scholarly Works
Identifying the functionality that we need to support is an important first step if the profile is going to be fit for its primary purpose. Current practice for repositories is to expose simple DC records over OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) [12] as mandated by that protocol. However, it is widely agreed that simple DC has limitations that pose problems for repository developers and aggregator services. Issues relating to normalised names, use of controlled subject vocabularies or other authority lists, dates and identifiers are common and many were identified in the course of our functional requirements gathering.
From the work specification supplied by JISC, we defined our primary use case in developing an application profile for scholarly works as: supporting the Intute repository search project to aggregate richer, more consistent, metadata from repositories. Through liaison with that project, a review of existing standards and previous project findings, plus consultation with our working group, we established a set of scenarios from which we derived an extensive list of functional requirements [13]. Principal amongst these were the following:
- Provide richer, more consistent metadata.
- Facilitate search, browse or filter by a range of elements, including journal, conference or publication title, peer-review status and resource type.
- Enable identification of the latest, or most appropriate, version and facilitate navigation between different versions.
- Support added-value services, particularly those based on the use of OpenURL ContextObjects [14].
- Implement an unambiguous method of identifying the full text(s).
- Enable identification of the research funder and project code.
- Facilitate identification of open access materials.
Note that by 'open access' we mean "free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself." [15]
The Application Model
In order to build up a DC application profile for scholarly publications we first need to develop an application model. This model shows the entities that we want to describe in DC and the key relationships between those entities. It is critical to undertake this modelling step in the development of any application profile. Without it, users of the application profile may become confused about which entity is being described by any given metadata property.
As a simple example, imagine developing an application profile to describe a personal audio CD collection. One might choose to model the following set of entities: the collection and its owner, each CD, the recording artist and the record label. Each of these entities could then be described separately, using a specific set of properties for each entity. We refer to such a model as the application model.
The application model for scholarly publications presented here [16] is based on the Functional Requirements for Bibliographic Records (FRBR) [17]. FRBR is an entity-relationship model developed by the library community for the entities that bibliographic records are intended to describe. FRBR models the bibliographic world using four key entities: work, expression, manifestation and item. This article does not attempt to summarise the FRBR model in any detail. Readers who are not familiar with it are encouraged to consult the FRBR documentation [17].
In the context of this model an eprint is defined to be a scientific or scholarly research text (as defined by the Budapest Open Access Initiative [18]), for example a peer-reviewed journal article, a preprint, a working paper, a thesis, a book chapter, a report, etc.
The Model
Although FRBR is used as the basis of the model, some of the entity and relationship labels used in FRBR have been modified for this model, in order to make them more intuitive to those dealing with eprints and to align them with the terminology used in DC:
DC | FRBR |
ScholarlyWork | Work |
Copy | Item |
Agent | Corporate Body |
isExpressedAs relationship | 'is realized through' |
isManifestedAs relationship | 'is embodied in' |
isAvailableAs relationship | 'is exemplified by' |
isCreatedBy relationship | 'is created by' |
isPublishedBy relationship | 'publisher' attribute of a Manifestation |
A ScholarlyWork is a distinct intellectual or artistic scholarly creation.
The isExpressedAs, isManifestedAs and isAvailableAs relationships can be thought of as 'vertical' relations between the ScholarlyWork and its Expressions, between an Expression and its Manifestation and between a Manifestation and its Copies. There are also 'horizontal' relationships between different Expressions of the same ScholarlyWork (e.g. the 'has a translation' relationship in FRBR), different Manifestations of the same Expression (e.g. the 'has an alternative' relationship in FRBR) and so on. These 'horizontal' relationships have not been included in this model. Software applications may be able to infer some of these 'horizontal' relationships by navigating up and down the 'vertical' relationships.
In natural language, what the above model says is:
A ScholarlyWork may be expressed as one or more Expressions. Each Expression may be manifested as one or more Manifestations. Each Manifestation may be made available as one or more Copies. Each ScholarlyWork may have one or more creators, funders and supervisors. Each Expression may have one or more editors. Each Manifestation may have one or more publishers.
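The cardinalities stated above can be sketched as a small set of Python data classes. This is only an illustration of the shape of the model, not part of the profile itself: the class and field names are assumptions chosen to mirror the entity and relationship labels, each entity carries just one or two of its attributes, and the helper `all_copies` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    name: str


@dataclass
class Copy:
    locator: str  # identifier/locator (URI)


@dataclass
class Manifestation:
    format: str
    publishers: List[Agent] = field(default_factory=list)  # isPublishedBy
    copies: List[Copy] = field(default_factory=list)  # isAvailableAs


@dataclass
class Expression:
    title: str
    editors: List[Agent] = field(default_factory=list)
    manifestations: List[Manifestation] = field(default_factory=list)  # isManifestedAs


@dataclass
class ScholarlyWork:
    title: str
    creators: List[Agent] = field(default_factory=list)  # isCreatedBy
    funders: List[Agent] = field(default_factory=list)
    supervisors: List[Agent] = field(default_factory=list)
    expressions: List[Expression] = field(default_factory=list)  # isExpressedAs


def all_copies(work: ScholarlyWork) -> List[Copy]:
    """Walk the 'vertical' relationships from a work down to its copies."""
    return [
        copy
        for expression in work.expressions
        for manifestation in expression.manifestations
        for copy in manifestation.copies
    ]
```

Every relationship is held as a list because each one is 'one or more' in the model. An empty list simply means that nothing has been recorded at that level.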
The most common forms of Expression of an eprint are the various 'revisions' that it goes through (draft, pre-print, ..., final published version, etc.) and its different translations. Therefore, the most important Expression to Expression relationships required are isVersionOf/hasVersion and isTranslationOf/hasTranslation.
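The earlier remark that software may infer some 'horizontal' relationships by navigating the 'vertical' ones can be illustrated with a minimal sketch. It assumes the isExpressedAs links are available as a simple mapping from ScholarlyWork URI to Expression URIs. The URIs are made-up placeholders and the function name is hypothetical.

```python
# Vertical 'isExpressedAs' links: ScholarlyWork URI -> Expression URIs.
# All URIs here are invented placeholders.
is_expressed_as = {
    "urn:example:work:1": [
        "urn:example:expr:draft",
        "urn:example:expr:published",
    ],
    "urn:example:work:2": ["urn:example:expr:other"],
}


def sibling_expressions(expression, links):
    """Go up to the parent work(s), then back down to the other expressions."""
    siblings = set()
    for expressions in links.values():
        if expression in expressions:
            siblings.update(e for e in expressions if e != expression)
    return sorted(siblings)
```

Navigation of this kind only shows that two Expressions belong to the same ScholarlyWork. It cannot say whether one is a version or a translation of the other, which is why the explicit isVersionOf/hasVersion and isTranslationOf/hasTranslation relationships are still needed.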
Attributes
A critical part of developing the application model is to identify the key attributes that will be used to describe each entity in the model. Initially, this can be done in a fairly generic way, noting for example that we want to capture the 'title' of the ScholarlyWork but not worrying about whether we are going to use DC Title or some other kind of title. The key attributes for each of the entities in our application profile are listed below.
Attributes of a ScholarlyWork
- title
- subject
- abstract
- grant number
- has adaptation
- identifier (URI)
Attributes of an Expression
- title
- description
- date available
- status
- version number or string
- language
- genre / type
- copyright holder
- has version
- has translation
- bibliographic citation
- references
- identifier (URI)
Attributes of a Manifestation
- format
- date modified
- identifier (URI)
Attributes of a Copy
- date available
- access rights
- licence
- is part of
- identifier/locator (URI)
Attributes of an Agent
- name
- family name
- given name
- type of agent
- workplace homepage
- mailbox
- homepage
- identifier (URI)
A Note on Implementing This Model Using DC Metadata
Many of the above relationships and attributes can be implemented fairly easily using metadata terms already defined by the Dublin Core Metadata Initiative (DCMI) [5]. DC metadata is sometimes only considered capable of describing flat, single-entity, constructs - a Web page, a document, an image, etc. However, the DCMI Abstract Model [6] introduces the notion of a description set, a group of related descriptions, which allows it to be used to capture metadata about more complex sets of entities, using application models like the one described here.
DCMI is currently developing a revised set of encoding guidelines for XML and RDF/XML, which will allow these more complex, multi-description, description set constructs to be encoded and shared between software applications.
The Application Profile and Vocabularies
The application profile provides a way of describing the attributes and relationships of each of the five entities as part of a description set. The profile also identifies mandatory elements, provides usage guidelines and offers illustrative examples. Note that for this application profile, we have made very few elements mandatory. Indeed, all that a minimal description set must include is either:
- a single ScholarlyWork description with at least one dc:title statement and one dc:type statement indicating that this is a ScholarlyWork entity,
or:
- a single ScholarlyWork description with one dc:type statement indicating that this is a ScholarlyWork and one eprints:isExpressedAs statement linking to a single Expression description with at least one dc:title statement and one dc:type statement indicating that this is an Expression.
All other aspects of the application profile are optional.
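The two minimal forms can be expressed as a short check. The sketch below makes several assumptions: a description set is a list of dictionaries, each with an `id` and a list of (property, value) statements; entity types are plain strings rather than the URIs a real description set would use; and 'a single ScholarlyWork description' is read as exactly one. The function names are hypothetical.

```python
def values(description, prop):
    """Return the values of every statement that uses the given property."""
    return [v for p, v in description["statements"] if p == prop]


def has_type(description, entity_type):
    return entity_type in values(description, "dc:type")


def has_title(description):
    return len(values(description, "dc:title")) >= 1


def is_minimal_description_set(description_set):
    works = [d for d in description_set if has_type(d, "ScholarlyWork")]
    if len(works) != 1:  # assumption: exactly one ScholarlyWork description
        return False
    work = works[0]
    # First form: the ScholarlyWork description carries the title itself.
    if has_title(work):
        return True
    # Second form: the title is carried by a linked Expression description.
    by_id = {d["id"]: d for d in description_set}
    for target in values(work, "eprints:isExpressedAs"):
        expression = by_id.get(target)
        if expression and has_type(expression, "Expression") and has_title(expression):
            return True
    return False
```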
It is not the intention of this article to offer a full analysis of the different metadata properties; readers should refer to the documentation for further information [4]. Briefly, the profile draws on properties from a number of schemes: the DC Metadata Element Set (simple DC) [19], DC Metadata Terms (which includes qualified DC terms) [20] and the MARC relator codes [21]. Properties from the Friend of a Friend (FOAF) scheme [22] introduce some Semantic Web flavour. Only five new properties have been created from scratch: grant number, affiliated institution, status, version and copyright holder.
Where existing dc:relation qualifiers have been used, the relationships being documented have been clearly defined alongside five new properties:
- has adaptation
- has translation
- is expressed as
- is manifested as
- is available as
To aid fulfilment of several of the functional requirements further, four vocabularies have been defined for:
- access rights (Open, Restricted or Closed)
- entity type (ScholarlyWork, Expression, Manifestation, Copy or Agent)
- status (Peer Reviewed or Non Peer Reviewed)
- resource type
Figure 2 shows the resource type vocabulary as an extension of the value 'Text' in the DCMI Type scheme [23].
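The three vocabularies whose values are enumerated above are small enough to sketch as Python enumerations. This is an illustration only: the profile identifies vocabulary terms in its own way, and the class names, the dictionary layout for a Copy and the helper function below are assumptions. The resource type vocabulary is omitted because its values are not listed in this article.

```python
from enum import Enum


class AccessRights(Enum):
    OPEN = "Open"
    RESTRICTED = "Restricted"
    CLOSED = "Closed"


class EntityType(Enum):
    SCHOLARLY_WORK = "ScholarlyWork"
    EXPRESSION = "Expression"
    MANIFESTATION = "Manifestation"
    COPY = "Copy"
    AGENT = "Agent"


class Status(Enum):
    PEER_REVIEWED = "Peer Reviewed"
    NON_PEER_REVIEWED = "Non Peer Reviewed"


def open_access_copies(copies):
    """Keep only the copies whose access rights value is 'Open'."""
    return [c for c in copies if c["access_rights"] is AccessRights.OPEN]
```

A controlled list of this kind is what lets an aggregator meet the requirement to identify open access materials with a simple filter rather than by interpreting free-text rights statements.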
Eprints DC XML: An XML Format for the Eprints Application Profile
At the time of writing (January 2007), the DCMI does not define an XML format to support the serialisation of DC description sets as described by the DCMI Abstract Model (DCAM). The existing DCMI recommendation Guidelines for implementing Dublin Core in XML [24] pre-dated the development of the DCAM and is based on two simpler 'abstract models' for DC metadata which are described in that specification itself. The DCMI Architecture Community [25] is currently considering a working draft [26] that describes a new XML format which is based on the DCAM, with the intention of producing a new DCMI recommendation, probably in early/mid-2007.
Since the Eprints application profile makes use of the full range of features of the DCAM, the serialisation of description sets based on that application profile requires a format which supports those features, so the working group has defined an XML format known as Eprints DC XML [27]. The format is based very closely on the latest drafts being considered by the DCMI Architecture Community, although it does make use of different XML Namespace Names from those used in the DCMI drafts.
Figure 3 shows an example instance of the Eprints DC XML format.
A W3C XML Schema and a RELAX NG [28] Schema for Eprints DC XML are available.
The Eprints Application Profile and 'dumb-down'
One of the requirements of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) specification [12] is that, for each "item" in a repository, the repository must support the dissemination of metadata records in the "oai_dc" "metadata format". "oai_dc" is a format defined by the OAI-PMH specification to serialise "Simple DC" description sets. Simple DC is an application profile in which:
- the description set comprises a single description
- each statement within that description references one of the 15 properties of the Dublin Core Metadata Element Set [19]
- each of those 15 properties may be referenced in multiple statements
- each statement has a single value string
- each value string may have an associated language tag
- there is no use of resource URIs, vocabulary encoding scheme URIs, syntax encoding scheme URIs, value URIs or rich representations
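The constraints listed above translate fairly directly into a check. The sketch below assumes a description is a list of statements, each a dictionary with a `property`, a `value` and an optional `lang` key; that layout and the function name are assumptions made for illustration. The fifteen property names are those of the Dublin Core Metadata Element Set.

```python
# The 15 properties of the Dublin Core Metadata Element Set.
DCMES = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})

ALLOWED_KEYS = {"property", "value", "lang"}


def is_simple_dc(description):
    """Check one description against the Simple DC constraints."""
    for statement in description:
        # No value URIs, encoding scheme URIs or rich representations.
        if set(statement) - ALLOWED_KEYS:
            return False
        if statement.get("property") not in DCMES:
            return False
        # Exactly one value string per statement.
        if not isinstance(statement.get("value"), str):
            return False
        # The language tag is optional, but must be a string if present.
        if "lang" in statement and not isinstance(statement["lang"], str):
            return False
    return True
```

A property may appear in any number of statements, so the check places no limit on repetition.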
The Simple DC profile is used by many systems as a 'lowest common denominator' for basic interoperability, and the process of transforming description sets based on some richer application profile into description sets based on the Simple DC profile is sometimes referred to as 'dumb-down', reflecting the fact that such a transformation involves a loss of information content. The working group provided a mapping from the Eprints application profile to the Simple DC application profile [29]. Because a description set based on the Eprints application profile typically contains multiple descriptions, each of a single resource, the mapping generates multiple Simple DC description sets from a single Eprints application profile description set.
In the proposed mapping, the resulting Simple DC description sets describe the eprint only at the ScholarlyWork and Copy levels. The Simple DC description set for the ScholarlyWork complies with the guidelines specified by the ePrints-UK Project [30]. This is not the only possible approach to mapping the Eprints application profile to Simple DC. For example, it would also be possible to map to a group of Simple DC description sets, one for each entity in the model or to a single Simple DC description set only about the ScholarlyWork. However, the working group felt that the chosen mapping offered the most useful set of simple DC descriptions with minimal loss of information.
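The general shape of such a 'dumb-down' can be sketched as follows. This is not the working group's mapping, for which the published document [29] should be consulted. The input layout, the choice of which attribute goes to which Simple DC property and the function name are all assumptions made for illustration. The sketch produces one Simple DC description for the ScholarlyWork and one for each Copy.

```python
def dumb_down(work):
    """Return a list of Simple DC descriptions as lists of (property, value)."""
    # One description for the ScholarlyWork itself.
    work_description = [("title", work["title"])]
    work_description += [("creator", name) for name in work.get("creators", [])]
    if "abstract" in work:
        work_description.append(("description", work["abstract"]))
    work_description.append(("identifier", work["uri"]))
    descriptions = [work_description]

    # One description per Copy, reached through Expression and Manifestation.
    for expression in work.get("expressions", []):
        for manifestation in expression.get("manifestations", []):
            for copy in manifestation.get("copies", []):
                descriptions.append([
                    ("identifier", copy["locator"]),
                    ("format", manifestation["format"]),
                    ("rights", copy["access_rights"]),
                ])
    return descriptions
```

In this sketch Expression-level attributes such as status and version are simply dropped, which illustrates the kind of information loss that 'dumb-down' involves.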
Conclusion: Towards Community Acceptance
This application profile represents a relatively innovative approach to metadata, taking as it does the FRBR model and applying it to scholarly works. By making use of the benefits afforded by the DCMI Abstract Model, the profile is able to group descriptions of multiple entities into a single description set. Overall this approach is guided by the functional requirements identified above and the primary use case of richer, more functional, metadata. It also makes it easier to rationalise 'traditional' citations between 'expressions' and 'modern' hypertext links between 'copies', as well as supporting navigation between different versions and the identification of appropriate, and, we hope, open access, full-text copies. In practice, this seemingly complex model may be manifest in relatively simple metadata and/or end-user interfaces. Furthermore, it is likely that many repositories already capture the metadata properties identified in the profile, but are prevented from usefully exposing this metadata to other services by the limitations imposed by simple DC.
Yet the application profile alone cannot bring about interoperability or provide Intute and other aggregators with the metadata necessary to offer rich functionality. For this we need community uptake by repositories and repository software developers, agreement on common approaches and, most of all, Eprints DC XML metadata being generated and exposed. There are growing signs that our community acceptance and dissemination activities to date are generating momentum, with support built into the newly released GNU Eprints version 3, alongside statements of support from DSpace and Fedora [31] developers, interest from European colleagues and lively discussions at the recent DC-2006 and Open Scholarship 2006 conferences in Mexico and Glasgow.
References
[1] The Joint Information Systems Committee (JISC) http://www.jisc.ac.uk/
[2] UKOLN, University of Bath http://www.ukoln.ac.uk/
[3] The Eduserv Foundation http://www.eduserv.org.uk/foundation/
[4] Eprints Application Profile http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_Application_Profile
[5] The Dublin Core Metadata Initiative (DCMI) http://dublincore.org/
[6] Powell, Andy, Nilsson, Mikael, Naeve, Ambjörn and Johnston, Pete. DCMI Abstract Model. DCMI Recommendation. May 2005. http://dublincore.org/documents/abstract-model/
[7] W3C Semantic Web Activity http://www.w3.org/2001/sw/
[8] Intute Repository Search Project http://www.intute.ac.uk/projects.html
[9] Eprints UK Project http://eprints-uk.rdn.ac.uk/project/
[10] PerX Project http://www.icbl.hw.ac.uk/perx/
[11] Repositories Research Team wiki http://www.ukoln.ac.uk/repositories/digirep/
[12] Lagoze, Carl, Van de Sompel, Herbert, Nelson, Michael and Warner, Simeon. The Open Archives Initiative Protocol for Metadata Harvesting. Protocol Version 2.0 of 2002-06-14. http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
[13] Eprints Application Profile Functional requirements specification http://www.ukoln.ac.uk/repositories/digirep/index/Functional_Requirements
[14] ANSI/NISO Z39.88-2004: The OpenURL Framework for Context-Sensitive Services http://www.niso.org/standards/standard_detail.cfm?std_id=783
[15] Budapest Open Access Initiative: Frequently Asked Questions http://www.earlham.edu/~peters/fos/boaifaq.htm#openaccess
[16] Eprints Application Model http://www.ukoln.ac.uk/repositories/digirep/index/Model
[17] IFLA. Functional Requirements for Bibliographic Records. 1998. http://www.ifla.org/VII/s13/frbr/frbr.pdf
[18] Budapest Open Access Initiative http://www.soros.org/openaccess/
[19] Dublin Core Metadata Element Set, Version 1.1. DCMI Recommendation. April 2006. http://dublincore.org/documents/dces/
[20] DCMI Usage Board. DCMI Metadata Terms. DCMI Recommendation. December 2006. http://dublincore.org/documents/dcmi-terms/
[21] Library of Congress Network Development and MARC Standards Office. MARC Code Lists for Relators, Sources, Description Conventions. January 2007. http://www.loc.gov/marc/relators/
[22] Brickley, Dan and Miller, Libby. FOAF Vocabulary Specification. January 2006. http://xmlns.com/foaf/0.1/
[23] DCMI Usage Board. DCMI Type Vocabulary. DCMI Recommendation. August 2006. http://dublincore.org/documents/dcmi-type-vocabulary/
[24] Powell, Andy and Johnston, Pete. Guidelines for implementing Dublin Core in XML. DCMI Recommendation. April 2003. http://dublincore.org/documents/dc-xml-guidelines/
[25] DCMI Architecture Community http://dublincore.org/groups/architecture/
[26] Johnston, Pete and Powell, Andy. Expressing Dublin Core metadata using XML. DCMI Working Draft. May 2006. http://dublincore.org/documents/2006/05/29/dc-xml/
[27] Johnston, Pete. Eprints DC XML. November 2006. http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_DC_XML
[28] OASIS RELAX NG Technical Committee http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=relax-ng
[29] Powell, Andy. Mapping the Eprints Application Profile to Simple DC. August 2006. http://www.ukoln.ac.uk/repositories/digirep/index/Mapping_the_Eprints_Application_Profile_to_Simple_DC
[30] Powell, Andy, Day, Michael and Cliff, Peter. Using simple Dublin Core to describe eprints. Version 1.2. http://eprints-uk.rdn.ac.uk/project/docs/simpledc-guidelines/
[31] Eprints Application Profile Community Acceptance Plan http://www.ukoln.ac.uk/repositories/digirep/index/Community_Acceptance_Plan