Unique Identifiers in a Digital World

Andy Powell

Andy Powell reports on a seminar organised jointly by Book Industry Communication and UKOLN on the use of unique identifiers in electronic publishing.

On the afternoon of Friday 14 March, more than 50 people involved in electronic publishing met for a seminar reviewing recent developments in the unique identification of digital objects. Delegates included representatives of publishers, libraries and other organisations. The seminar was organised jointly by Book Industry Communication (BIC) and the UK Office for Library and Information Networking (UKOLN) with support from the eLib programme. A brief report follows.

Introduction - Why we need identifiers

Brian Green (BIC) and Mark Bide (Mark Bide and Associates) introduced the seminar with an overview of why the publishing industry needs identifiers [1]. Unique identifiers for digital objects are an essential part of the technology that allows:

electronic trading including rights transactions;
copyright management;
electronic tables of contents;
production tracking and other in house administration;
bibliographic control and resource discovery.

Several issues were highlighted:

What level of ‘granularity’ is required? Traditionally publishers have worked at the book or journal level, using the International Standard Book Number (ISBN) and International Standard Serial Number (ISSN) as identifiers. However, the unit of publication is getting smaller. Recent schemes allow for the identification of individual articles within publications. Increasingly we need to identify much smaller fragments of complete works, for example parts of text, images, video clips and pieces of software.
Identifiers are either ‘dumb’ or ‘intelligent’. A dumb identifier has no inherent meaning and can only be resolved by looking it up in a database. An intelligent identifier carries some meaning in its structure. The ISBN is a relatively intelligent identifier because its various parts have meaning: the first part, for example, identifies the country, language or geographic region in which the book was published. However, as book rights are sold from one publisher to another, the intelligence of the ISBN decreases and a particular ISBN can only be resolved by querying it against a central database. As the unit of publication gets smaller, the number of identifiers required grows and it becomes increasingly difficult to maintain intelligent identification schemes. The trend is likely to be towards ‘dumb’ identifiers.
It is important to distinguish between identification and location. The Uniform Resource Locator (URL) that we are all familiar with on the Web is a locator rather than an identifier. If an object moves, its associated URL changes and people using the old URL are likely to get a failure indicating that it is no longer available. A true identifier must remain the same whatever the current location of the object. The IETF URN Working Group are in the process of defining Uniform Resource Names (URNs) [2] which are persistent identifiers for information resources.
How persistent should an identifier be? In the Internet world it is generally accepted that identifiers for digital objects need to last for a long time - significantly longer than the objects they identify. Indeed, they probably need to outlast current Internet technology and computer systems.
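
The ‘intelligence’ of the ISBN mentioned above can be illustrated with a short Python sketch. It splits a hyphenated ISBN into its parts and applies the standard ISBN-10 check (weights 10 down to 1, sum divisible by 11). The check rule is not described in the seminar report and the function names are purely illustrative. The sketch relies on the hyphens being present, because the part boundaries cannot be recovered from the bare digits without range tables, which is one reason lookups against a database are still needed.

```python
from typing import NamedTuple


class IsbnParts(NamedTuple):
    group: str      # country, language or geographic region
    publisher: str  # publisher prefix within that group
    title: str      # item number assigned by the publisher
    check: str      # check character ('0'-'9' or 'X')


def split_isbn10(hyphenated: str) -> IsbnParts:
    """Split a hyphenated ISBN-10 into its four parts."""
    parts = hyphenated.strip().split("-")
    if len(parts) != 4:
        raise ValueError("expected four hyphen-separated parts")
    if sum(len(p) for p in parts) != 10:
        raise ValueError("an ISBN-10 has exactly ten characters")
    return IsbnParts(*parts)


def isbn10_is_valid(hyphenated: str) -> bool:
    """Apply the ISBN-10 weighted sum (weights 10 down to 1) modulo 11."""
    chars = hyphenated.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), chars):
        if ch == "X" and weight == 1:
            value = 10  # 'X' stands for 10 and may only be the check character
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += weight * value
    return total % 11 == 0


# The ISBN of reference [1] serves as the example.
parts = split_isbn10("1-873671-18-0")
print(parts.group, parts.publisher, parts.title, parts.check)
print(isbn10_is_valid("1-873671-18-0"))
```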

There is another complication at the moment in that we are in a transitional period of publishing. Publishers must continue to deal with traditional paper publications, while also being involved with new electronic only publications and with parallel publications.

The music industry is facing similar problems. In response the International Confederation of Authors and Composers’ Societies (CISAC) [3] has been developing the Common Information System (CIS). This system includes identifiers for various manifestations of content and for creators and publishers. A recent development is the International Standard Work Code (ISWC) which identifies the musical composition itself, rather than the recorded or printed expression of the work. It has been suggested that the ISWC might be extended to cover literature and the visual arts as well. Creators and publishers are identified by the Compositeur, Auteur, Editeur (CAE) number, which will be extended and renamed the Interested Party (IP) number.

The Digital Object Identifier

Carol Risher (Association of American Publishers (AAP)) and Albert Simmonds (RR Bowker) gave an overview of the Digital Object Identifier (DOI) [4]. Their presentation included a video, based largely on the first public demonstration of the DOI given in February, that showed documents and other files being retrieved on the Web using DOIs rather than URLs. Development of the DOI is being carried out by RR Bowker and the Corporation for National Research Initiatives (CNRI) on behalf of the AAP.

A DOI contains two parts. The first part, known as the ‘Publisher ID’, indicates the numbering agency and publisher and is assigned by the DOI Agency. The second part, known as the ‘Item ID’, is assigned by the publisher and can be made up of any alphanumeric sequence of characters. The use of an existing standard scheme in the Item ID, a SICI or PII for example, is encouraged, though some publishers may choose to use a proprietary scheme. A DOI can be assigned to any digital object at a level of granularity that is appropriate to the publisher. Typically this might mean that a separate DOI is assigned to each component (text, image, sound, video) of a multimedia document.
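
As a rough illustration of the two-part structure, the following Python sketch splits a DOI into its Publisher ID and Item ID. The report does not specify how the two parts are delimited. The sketch assumes a single slash between them, and the example DOI is hypothetical. Splitting at the first slash only means that an Item ID which itself contains slashes is left intact.

```python
from typing import NamedTuple


class Doi(NamedTuple):
    publisher_id: str  # assigned by the DOI Agency
    item_id: str       # assigned by the publisher, in any scheme it chooses


def parse_doi(doi: str) -> Doi:
    """Split a DOI at the first slash (the separator is an assumption)."""
    publisher_id, sep, item_id = doi.strip().partition("/")
    if not sep or not publisher_id or not item_id:
        raise ValueError("expected '<publisher id>/<item id>'")
    return Doi(publisher_id, item_id)


# Hypothetical DOI, for illustration only.
example = parse_doi("10.1000/journal-article-42")
print(example.publisher_id)
print(example.item_id)
```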

The DOI system has two parts - the ‘DOI agency’ and the ‘DOI computers’. The DOI agency assigns Publisher IDs, issues guidelines for DOI usage and works with the relevant standards bodies to maintain the integrity of the system as a whole. The DOI computers form a distributed system that resolves any DOI to its associated URL. The system is based on the CNRI handle system [5]. Any user who knows the DOI of a digital object can query the DOI Directory directly by typing it into a Web based search form. Typically, however, DOIs are likely to be embedded in Web pages, hidden behind clickable buttons. Queries to the DOI Directory are resolved and the client is passed directly to the publisher’s system.

The current state of the DOI system is as follows:

the DOI system is real and can be used now;
publisher procedures are still being formulated;
DOIs tend to be long but in general will not be seen;
the DOI system is free to readers;
publishers will have to pay to register a Publisher ID with the DOI agency;
a European agency is planned.

Once assigned, a DOI remains unchanged. If the ownership of an object changes, the new owner registers the change with the DOI agency. If the object pointed to by the DOI moves (that is, the URL changes), the DOI entry for that object can be updated.
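
The separation between a fixed identifier and a changeable location can be sketched as a simple lookup table. The Python below is a toy, in-memory stand-in for the DOI Directory, not a description of the CNRI handle system. The class name, method names, DOI and URLs are all hypothetical.

```python
from typing import Dict


class DoiDirectory:
    """Toy directory mapping each DOI to the current URL of its object."""

    def __init__(self) -> None:
        self._urls: Dict[str, str] = {}

    def register(self, doi: str, url: str) -> None:
        """Record a new DOI; an existing DOI is never reassigned."""
        if doi in self._urls:
            raise ValueError(f"DOI already registered: {doi}")
        self._urls[doi] = url

    def update_location(self, doi: str, new_url: str) -> None:
        """The object has moved: change the URL, keep the DOI."""
        if doi not in self._urls:
            raise KeyError(doi)
        self._urls[doi] = new_url

    def resolve(self, doi: str) -> str:
        """Return the current URL for a DOI (KeyError if unknown)."""
        return self._urls[doi]


directory = DoiDirectory()
directory.register("10.1000/item-1", "http://publisher.example/old/item-1")
directory.update_location("10.1000/item-1", "http://publisher.example/new/item-1")
print(directory.resolve("10.1000/item-1"))
```

A link that embeds the DOI keeps working after the move, whereas a link that embeds the old URL would fail.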

It is anticipated that the charges associated with registering with the DOI agency will be small enough that DOIs will be used in non-commercial areas of the Internet as well as by commercial publishers. The DOI agency will assign Publisher IDs to individuals and other organisations in addition to traditional publishers.

The DOI is non-proprietary and will be introduced to ISO in May. Development of the DOI system will continue over the summer culminating in a full demonstration at the Frankfurt Book Fair in October 1997.

The SICI and the BICI

Sandy Paul (SISAC/BISAC) gave an overview of the Serial Item and Contribution Identifier (SICI) [6], a scheme for identifying serials and parts of serials. The scheme has been in use since the late 1980s and is now widely used, mainly at the issue level, by a broad range of publishers in EDI message transactions and by libraries and subscription agents.

The original version of the SICI allowed an identifier to be assigned to each issue of a serial (the Serial Item Identifier) and to each contribution (article) within a serial (the Serial Contribution Identifier). Recently the SICI has been updated to identify fragments other than articles (for example a table of contents, an abstract or an index) and to identify particular physical formats. The SICI contains the ISSN of the serial.

A final draft of the Book Item and Component Identifier (BICI) [7] is now available. This is essentially a book version of the SICI, using the ISBN in place of the ISSN. The BICI can be used to identify a part, a chapter or a section within a chapter, or any other text component, such as an introduction, foreword or index. It can also identify an entry in a directory, encyclopaedia or similar work that is not structured into chapters.

The PII

Norman Paskin (Elsevier Science) gave an overview of the Publisher Item Identifier (PII) [8] which was developed in 1995 by the Scientific and Technical Information (STI) group of publishers. The requirements for the PII were:

format independence;
capability for future extension;
one document per identifier, one identifier per document;
easy to generate;
generated by the publisher;
minimal restrictions on applicability;
compatible with other standards.

The PII is made up of 17 characters and contains the ISBN or ISSN in order to guarantee uniqueness. It is a ‘dumb’ identifier with a capacity of 10,000 items per journal per year. Future versions of the PII will have extensions to cover document components and versions. Development of any new version of the PII will take account of developments in other areas, for example the DOI system and URNs.

Some interesting figures were given for the numbers of identifiers required for the STI area of publishing. Estimating 1 million articles per year, identifying all the versions of all the components of those articles may require somewhere in the region of 10¹⁴ identifiers! [9]

Group Sessions

The seminar closed with three group sessions covering:

Copyright management applications
Using DOIs in the information supply chain
DOI syntax and system.

These were followed by group reports and a plenary discussion. Some interesting issues were raised.

Should the DOI Agency be closely aligned to a country’s ISBN agency?
Should the publisher part of the Publisher ID be based on the Interested Party number from the Common Information System (see above)?
Can DOIs be assigned to off-line (for example CD-ROM based) digital objects? Yes.
Does the DOI have any relevance to traditional print-only publications? No.
What happens to ‘dead’ DOIs?
How does the DOI system cope with digital objects that are mirrored across several sites? The DOI resolves to the URL of a page that contains a list of pointers to the mirrored resources.

It was generally agreed that the group sessions could have gone on for far longer than the 45 minutes allocated and that follow-up meetings in specific areas may be required.

This was an interesting seminar and thanks are due to Brian Green (BIC) and Rosemary Russell (UKOLN) for organising a very successful event.

References

[1] Unique Identifiers: a brief introduction, Brian Green and Mark Bide, ISBN 1-873671-18-0,
http://www.bic.org.uk/bic/uniquid
[2] IETF URN Working Group,
http://www.bunyip.com/research/ietf/urn-ietf/
[3] International Confederation of Authors and Composers’ Societies (CISAC),
http://www.cisac.org/
[4] Digital Object Identifiers,
http://www.doi.org/
[5] CNRI Handle System,
http://www.handle.net/
[6] SICI standard,
http://sunsite.Berkeley.EDU/SICI/
[7] A Standard Identifier for Book Items and Contributions - draft (Report prepared for BIC and the British National Bibliography Research Fund), David Martin - available after 21 April 1997,
http://www.bic.org.uk/bic/bici.html
[8] The PII as a means of Document identification,
http://www.elsevier.nl/inca/homepage/about/pii/
[9] Information Identifiers, Norman Paskin, Learned Publishing (vol 10 issue 2, pp 135-156)

Author Details

Andy Powell,
Technical Development and Research Officer,
Email: A.Powell@ukoln.ac.uk
Web page: http://www.ukoln.ac.uk/~lisap/
Tel: +44 1225 323933
Address: UKOLN, University of Bath, Bath, BA2 7AY