Abstract Modelling of Digital Identifiers
Discussion of digital identifiers, and persistent identifiers in particular, has often been confused by differences in underlying assumptions and approaches. To bring more clarity to such discussions, the PILIN Project has devised an abstract model of identifiers and identifier services, which is presented here in summary. Given such an abstract model, it is possible to compare different identifier schemes, despite variations in terminology; and policies and strategies can be formulated for persistence without committing to particular systems. The abstract model is formal and layered; in this article, we give an overview of the distinctions made in the model. This presentation is not exhaustive, but it presents some of the key concepts represented, and some of the insights that result.
The main goal of the Persistent Identifier Linking Infrastructure (PILIN) project [1] has been to scope the infrastructure necessary for a national persistent identifier service. There are a variety of approaches and technologies already on offer for persistent digital identification of objects. But true identity persistence cannot be bound to particular technologies, domain policies, or information models: any formulation of a persistent identifier strategy needs to outlast current technologies, if the identifiers are to remain persistent in the long term.
For that reason, PILIN has modelled the digital identifier space in the abstract. It has arrived at an ontology [2] and a service model [3] for digital identifiers, and for how they are used and managed, building on previous work in the identifier field [4], as well as semiotic theory [9]. The ontology, as an abstract model, addresses the questions ‘what is (and isn’t) an identifier?’ and ‘what does an identifier management system do?’. This more abstract view also brings clarity to the ongoing conversation about whether URIs can be (and should be) universal persistent identifiers.
Identifier Model
For the identifier model to be abstract, it cannot commit to a particular information model. The notion of an identifier depends crucially on the understanding that an identifier only identifies one distinct thing. But different domains will have different understandings of what things are distinct from each other, and what can legitimately count as a single thing. (This includes aggregations of objects, and different versions or snapshots of objects.) In order for the abstract identifier model to be applicable to all those domains, it cannot impose its own definitions of what things are distinct: it must rely on the distinctions specific to the domain.
This means that information modelling is a critical prerequisite to introducing identifiers to a domain, as we discuss elsewhere [10]: identifier users should be able to tell whether any changes in a thing’s content, presentation, or location mean it is no longer identified by the same identifier (i.e. whether the identifier is restricted to a particular version, format, or copy).
The abstract identifier model also cannot commit to any particular protocols or service models. In fact, the abstract identifier model should not even presume the Internet as a medium. A sufficiently abstract model of identifiers should apply just as much to URLs as it does to ISBNs, or names of sheep; the model should not be inherently digital, in order to avoid restricting our understanding of identifiers to the current state of digital technologies. This means that our model of identifiers comes close to the understanding in semiotics of signs, as our definitions below make clear.
There are two important distinctions between digital identifiers and other signs which we needed to capture. First, identifiers are managed through some system, in order to guarantee the stability of certain properties of the identifier. This is different to other signs, whose meaning is constantly renegotiated in a community. Those identifier properties requiring guarantees include the accountability and persistence of various facets of the identifier—most crucially, what is being identified. For digital identifiers, the identifier management system involves registries, accessed through defined services. An HTTP server, a PURL [11] registry, and an XRI registry are all instances of identifier management systems.
Second, digital identifiers are straightforwardly actionable: actions can be made to happen in connection with the identifier. Those actions involve interacting with computers, rather than other people: the computer consistently does what the system specifies is to be done with the identifier, and has no latitude for subjective interpretation. This is in contrast with human language, which can involve complex processes of interpretation, and where there can be considerable disconnect between what a speaker intends and how a listener reacts. Because the interactions involved are much simpler, the model can concentrate on two actions which are core to digital identifiers, but which are only part of the picture in human communication: working out what is being identified (resolution), and accessing a representation of what is identified (retrieval).
So to model managing and acting on digital identifiers, we need a concept of things that can be identified, names for things, and the relations between them. (Semiotics already gives us such concepts.) We also need a model of the systems through which identifiers are managed and acted on; what those systems do, and who requests them to do so; and what aspects of identifiers the systems manage.
Our identifier model (as an ontology) thus encompasses:
- Entities - including actors and identifier systems;
- Relations between entities;
- Qualities, as desirable properties of entities (actions are typically undertaken in order to make qualities apply to entities);
- Actions, as the processes carried out on entities (and corresponding to services in implementations).
An individual identifier system can be modelled using concepts from the ontology, with an identifier system model.
In the remainder of this article, we go through the various concepts introduced in the model under these classes. We present the concept definitions under each section, before discussing issues that arise out of them. Resolution and Retrieval are crucial actions for identifiers, whose definition involves distinct issues; they are discussed separately from other Actions. We briefly discuss the standing of HTTP URIs in the model at the end.
Entities
The following concept definitions apply to entities:
- Whatever exists and can be referred to in the model is a thing. In formal terms, “thing” is the root of this ontology.
- Users and Administrators of systems are modelled as parties: people and groups that participate in activities. Authorities are parties who are responsible for other entities: they include system administrators, and people responsible for setting policies.
- Policies are sets of rules about entities, some of which can be enforced through systems.
- A name is an association of a label (a symbol) with a context that the label is in. Contexts define how the label is to be made sense of. There is only one instance of a given label in any one context. Contexts can impose policies on their labels.
- Typical policies for contexts are: a label format policy (what labels are allowed in the context); an access policy (what parties are authorised to carry out particular actions on the label); and an association policy (how the label is used in an identifier).
- Identifiers are an association of a name with a single thing. The name is said to identify the thing.
- The range of things that may be identified is defined through an information model. The range of things that will be identified is defined through an association policy.
- Contexts are identified by identifiers of their own.
- Labels may be mapped to other labels through an encoding scheme.
- Actions are triggered by parties, typically through a system; actions may produce results, which may be entities or changes in quality. Actions are subject to authorisation.
In the following examples, we notate names as (context, label), and identifiers as (name, thing) = ((context, label), thing).
Example:
Parties:
- The University of Hard Knocks
- Joe Bloggs
- the IT Department
Authorities:
- The University of Hard Knocks, responsible for the local PURL identifier system.
Label:
- 8323
- cat
- document1.pdf
Context:
- University of Hard Knocks Purl Server Context (defined by the university’s PURL system)
- URL (defined by DNS and the Internet as a concrete network)
- Employees of BHP
- French Literature
Name:
- (University of Hard Knocks Purl Server Context, report/312)
- (Australian Mobile Phone Numbers, 0482321234)
- (Employees of BHP, Joe Bloggs)
Policies: Label Format:
- “All labels must be four digits long, and start with 8”
Policies: Access:
- “Information on the label 8323 can only be looked up by staff members”
Policies: Association:
- “All labels in this context are only used to identify SIM cards” (i.e. this is a context for mobile phone numbers)
Identifiers:
- ((Australian Mobile Phone Numbers, 0482321234), “my phone”)
- ((University of Hard Knocks Purl Server Context, report/312), “the report on global warming from last week”)
Identifiers: Contexts:
- (http://purl.org/uni-hardknocks/, “University of Hard Knocks Purl Server Context”)
These definitions support the following insights:
Any association of a name with a thing - by anyone - establishes an identifier. A name is not an identifier unless it identifies something. (E.g. an unassigned phone number is a name, but not an identifier.)
An identifier is not restricted to an association of an ASCII string or a stream of ones and zeros with a thing; a spoken word or a picture also count as identifiers. (Identifiers are in fact defined as linguistic signs.)
The context of the name differentiates instances of a label from each other, and determines which particular instance is being associated with a thing. This allows the same label to mean different things in different contexts [12].
An identifier management system delimits its own context for the identifiers it manages; so the same label, managed by two different identifier management systems, forms two different identifiers.
Isolated facets of identifier management systems, such as protocol and encoding scheme, may also be considered part of the identifier context - meaning that a change in either brings about a different identifier. But this is a matter of identifier management policy: a particular identifier system model can also decide that its identifiers remain the same regardless of protocol or encoding.
Policies are specific to contexts. In some instances, particular policies in fact set the context of an identifier (see below).
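The (context, label) and (name, thing) notation used in the examples above can be sketched as a pair of small data types. This is only an illustrative sketch, not part of the PILIN ontology itself: the class names, the `identify` helper, and the unassigned phone number 0400000000 are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Name:
    """A label placed in a context: (context, label)."""
    context: str
    label: str


@dataclass(frozen=True)
class Identifier:
    """A name associated with a single thing: (name, thing)."""
    name: Name
    thing: str


def identify(name: Name, thing: Optional[str]) -> Optional[Identifier]:
    """Return an identifier only if the name is associated with a thing."""
    return Identifier(name, thing) if thing is not None else None


# An assigned phone number is an identifier.
assigned = identify(Name("Australian Mobile Phone Numbers", "0482321234"), "my phone")

# An unassigned number (hypothetical) is a name, but not an identifier.
unassigned = identify(Name("Australian Mobile Phone Numbers", "0400000000"), None)

# The same label in two contexts gives two different names.
pele_player = Name("Names of soccer players", "Pelé")
pele_asteroid = Name("Names of asteroids", "Pelé")
```

Making the context an explicit field is the design point: two names compare equal only when both context and label match, which is what lets the same label mean different things in different contexts.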
Relations: Equivalence
The following are definitions of relations for identifiers:
- Two identifiers are equivalent if they identify the same thing at a given point in time. Any claim of equivalence is only meaningful with reference to a specified time.
- Two identifiers are synonymous if some authority claims that they are equivalent. (i.e. we trust the authority that they are equivalent, rather than confirm it for ourselves.)
- An identifier is an alias of another identifier (called a target) if the identifiers are synonymous, and the alias is managed to be dependent on the target for its equivalence.
- A preferred identifier is one out of a set of synonyms that an authority privileges, and which the authority is responsible for keeping persistent.
- Two names are the same only if they contain the same label in the same context.
Example:
Equivalent Identifiers:
- ((Employees of BHP Staff Numbers, 8336), “my cousin Fred”)
- ((Human Names, Fred Q. Bloggs), “my cousin Fred”)
Synonymous Identifiers:
- “Taiwan Province of China” and “Taiwan” are synonyms according to the authority of the People’s Republic of China (but not according to the authority of the Republic of China)
Preferred Identifier:
- ISBN 0195306090 (as opposed to “Shirk, S. 2007. China: Fragile Superpower” or NLA:an40693053), according to the International ISBN Agency
Not Same Name:
- (Names of soccer players, Pelé) is not the same as (Names of asteroids, Pelé)
- (Handle Server 102.100.272, Pelé) is not the same as (PURL Server purl.nla.gov, Pelé) - even if both refer to the same thing
Equivalence between two identifiers may happen to be true at a given point in time; it does not mean that the two identifiers will always mean the same thing, or should always be treated as interchangeable. Judging whether two things are the same or not presupposes an information model for the things being compared.
Synonyms presuppose an authority which weighs in on the equivalence of two identifiers; the authority can also weigh in on which identifier should be preferred in given contexts. This is still only one authority’s claim, and other authorities can make different judgements; but the claim matters for any systems for which that authority is also responsible, as the authority is assumed to enforce its claims throughout its domain. By introducing responsibility, synonymy is a stronger claim than equivalence.
Aliases require that the authority does not just assert equivalence, but actively manages the equivalence itself: it is responsible for making sure the two identifiers stay equivalent while they are being managed, and do not drift apart in what they refer to.
The identifier model constrains when two instances of labels count as the same name. If their contexts differ, they may currently be equivalent when used as identifiers; but nothing guarantees that they will stay equivalent, because the association policies of the two contexts are independent—that is, the two names are managed separately. The model deals with this case by saying that the labels may be the same, but the names are different (since their contexts are different), so they belong to different identifiers. In the example above, the Handle Server and the PURL server may currently agree to use the same label to point to the same thing; but nothing prevents one of the authorities reassigning the identifier later on, and the other keeping it as is.
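The Handle and PURL case can be sketched as two independently managed contexts that happen to agree at one point in time. This is a minimal sketch under stated assumptions: the `Context` class, its methods, and the things "the footballer Pelé" and "the asteroid Pelé" are hypothetical and stand in for whatever each registry really records.

```python
from typing import Dict, Optional, Tuple

Name = Tuple[str, str]  # (context, label)


class Context:
    """A hypothetical concrete context with its own label-to-thing associations."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._things: Dict[str, str] = {}

    def assign(self, label: str, thing: str) -> None:
        """Associate (or reassign) a label with a thing in this context."""
        self._things[label] = thing

    def thing_for(self, label: str) -> Optional[str]:
        return self._things.get(label)


def same_name(a: Name, b: Name) -> bool:
    """Two names are the same only if both context and label match."""
    return a == b


def equivalent_now(ctx_a: Context, label_a: str, ctx_b: Context, label_b: str) -> bool:
    """Equivalence holds only for the current state of both contexts."""
    thing_a = ctx_a.thing_for(label_a)
    return thing_a is not None and thing_a == ctx_b.thing_for(label_b)


handle = Context("Handle Server 102.100.272")
purl = Context("PURL Server purl.nla.gov")
handle.assign("Pelé", "the footballer Pelé")
purl.assign("Pelé", "the footballer Pelé")

before = equivalent_now(handle, "Pelé", purl, "Pelé")

# One authority reassigns its label; the other keeps it as is.
purl.assign("Pelé", "the asteroid Pelé")
after = equivalent_now(handle, "Pelé", purl, "Pelé")
```

Because each context holds its own associations, nothing in the sketch ties the two assignments together; that independence is what the model captures by treating the two names as different.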
Other existing information models may have a larger or smaller repertoire of relations for identifiers. This set of relations may map to existing information models in different ways, but is intended to make explicit the role of authorities, which is often left implicit.
Relations: Other
The following additional definitions apply to relations:
- Entities have one or more representations, which can be used to communicate the entities to an audience. Labels are represented through encoding schemes appropriate to the medium of communication; e.g. URL-encoding is necessary for labels occurring inside URLs.
- Representations of names (context + label) can combine a context identifier with a label in the one representation. Two representations can look different, because of different encodings, but still represent the same name.
- The representation of an identifier is the same as the representation of the identifier name.
- An identifier management system manages an entity if it is used to record and update representations of the entity and its attributes, which parties can then consult through the system. A system manages an entity to enable an authority to be responsible for that entity: the authority initiates management actions on the entity, to maintain the entity’s desirable qualities.
- A context (enclosing context) contains another context (subcontext) if all labels in the subcontext are also in the enclosing context, and all policies enforced by the enclosing context are also enforced by the subcontext.
- Some entities are concrete, meaning that they are managed by a specific identifier system. Entities that are not concrete are abstract. In particular, an identifier management system defines a single concrete context specific to it. Abstract contexts are defined instead by their purpose and owner.
- Concrete and abstract contexts define concrete and abstract names, which in turn define concrete and abstract identifiers.
- Concrete contexts realise abstract contexts, if the concrete context’s identifiers correspond to the abstract context’s identifiers, and the two contexts’ policies are consistent. Correspondingly, a concrete identifier can realise an abstract identifier. Realisation is necessary because abstract identifiers cannot be managed by a specific system, by definition.
- Two concrete identifiers are homologues if they are equivalent and realise the same abstract identifier with the same label. Homologues are a simple way of realising a single abstract identifier in multiple systems.
Example:
Representation: identifier
- ((University of Hard Knocks Purl Server Context, report/312), “the report on global warming from last week”) : represented as: http://purl.org/uni-hardknocks/report/312;
Representation: name
- (Handle Server 102.100.272, XYZZY) : represented as: hdl:102.100.272/XYZZY;
Representation: label
- Pelé : represented as:
- Pel%E9 (URL encoding scheme)
- .--. . .-.. ..-.. (Morse Code encoding scheme)
- Pele (ASCII)
Same Name, different representation:
- (Names of soccer players, Pel%E9) = (Names of soccer players, Pelé): There are several soccer players named Pelé, but the name is the same regardless of what it refers to
Context: Concrete:
- University of Hard Knocks Purl Server Context (defined by the university’s PURL system)
- URL (defined by DNS and the Internet as a concrete network)
- Australian Mobile Phone Numbers (defined by the mobile telephony system)
Context: Abstract:
- University of Hard Knocks Library
- Employees of BHP
- French Literature
Homologues:
- ((Handle Server 102.100.272, Pel%E9), “Pelé”), ((PURL Server purl.nla.gov, Pel%E9), “Pelé”)
- or, as representations, hdl:102.100.272/Pel%E9 and http://purl.nla.gov/Pel%E9. Both identifiers realise the same abstract identifier ((Names of soccer players, Pel%E9), “Pelé”), are equivalent, and have the same label.
Changing the encoding scheme of a label does not change the label itself; so different encodings of an identifier are not considered distinct identifiers—so long as we know what the encoding scheme is. So the IRI http://en.wikipedia.org/wiki/Pelé and the URI http://en.wikipedia.org/wiki/Pel%e9 are not considered to be distinct identifiers, but different encodings of the same identifier. Allowing different representations of labels lets us treat labels as Platonic ideals, which can be realised in several ways, for example: a spoken URL, a handwritten URL, and a URL transmitted in an HTTP request are the same identifier. The alternative of treating each as a distinct identifier is untenable.
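The point that different encodings represent the same label, provided the encoding scheme is known, can be illustrated with Python's standard `urllib.parse` module. In this sketch the `decode_label` helper is a name introduced here; the Pel%E9 form in the text corresponds to percent-encoding the Latin-1 byte for é, whereas percent-encoding the UTF-8 bytes gives a different representation of the same label.

```python
from urllib.parse import quote, unquote

label = "Pelé"

# Percent-encode the Latin-1 bytes of the label (the Pel%E9 form used above).
latin1_form = quote(label, encoding="latin-1")

# Percent-encode the UTF-8 bytes of the label (the urllib.parse default).
utf8_form = quote(label)


def decode_label(representation: str, encoding: str) -> str:
    """Recover the label from a percent-encoded representation.

    The caller must know which character encoding the representation used.
    """
    return unquote(representation, encoding=encoding)


# Different representations, including a lower-case hex variant, decode to one label.
recovered = {
    decode_label(latin1_form, "latin-1"),
    decode_label("Pel%e9", "latin-1"),
    decode_label(utf8_form, "utf-8"),
}
```

Decoding with the wrong scheme would yield a different string, which is why the equivalence of representations depends on knowing the encoding.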
Contexts are seldom made explicit in digital identifiers. Contexts have identifiers of their own, but they are seldom included when citing an identifier. However, scheme prefixes in URIs identify contexts at least partly: http://www.example.com is a distinct identifier from ftp://www.example.com, because the named protocol provides a distinct system context for the label proper, www.example.com. Identifiers and identifier contexts are not defined by the services provided for them, or the protocols enabling those services: the contexts exist independently of them. For example, the RFC 3986 [5] definition of HTTP URI specifies that an HTTP URI is not constrained to be processed through HTTP [13].
This seems to contradict our preceding claim - that http://www.example.com and ftp://www.example.com are distinct identifiers. Our claim does hold up though: the two identifiers share the same DNS domain but are managed separately. http://www.example.com/a.pdf can be a different document from ftp://www.example.com/a.pdf (because the respective server roots are different). That immediately makes them distinct identifiers. That said, http://www.example.com/a.pdf could be accessed through the FTP protocol instead of HTTP, without becoming identical to ftp://www.example.com/a.pdf. In general, a digital identifier can be acted on through several services and several protocols, but remain the same digital object, managed in the same identifier management system. The critical distinction is the management system for the identifier, not the service protocol for accessing it.
It can be useful to point out that the nominated context for an identifier is a subcontext of another, whose policies have already been specified. Because the larger context determines policies for the identifier which the subcontext follows, the larger context is of more interest to users: its policies are more generally applicable. And if the larger context’s policies have already been specified, there will be much less policy to specify for the subcontext. The nesting of contexts is also useful to point out if the two contexts will end up managed through the same identifier management system.
For instance, we could argue that in the PURL http://purl.foo.com/net/jdoe/bar, the label is bar, and the context name is http://purl.foo.com/net/jdoe/. We could instead argue that the label is net/jdoe/bar, and the context name is http://purl.foo.com/. Both segmentations are legitimate; but http://purl.foo.com/ defines a larger context for identifiers than does http://purl.foo.com/net/jdoe/ (all PURLs vs. all PURLs in the net/jdoe subdomain); and any constraints set by the enclosing context (e.g. “is resolved by purl.foo.com”) also apply to the net/jdoe subdomain. So we take the enclosing context as the starting point for understanding the PURL, rather than the subdomain.
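The alternative segmentations of the PURL above can be enumerated mechanically. The following is a rough sketch, assuming that contexts nest only at path boundaries; the `segmentations` function is a name introduced for illustration.

```python
from typing import List, Tuple
from urllib.parse import urlsplit


def segmentations(url: str) -> List[Tuple[str, str]]:
    """List every (context name, label) split of a URL at a path boundary.

    The outermost (enclosing) context comes first.
    """
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}/"
    segments = [s for s in parts.path.split("/") if s]
    splits = []
    for i in range(len(segments)):
        context = root + "".join(seg + "/" for seg in segments[:i])
        label = "/".join(segments[i:])
        splits.append((context, label))
    return splits


splits = segmentations("http://purl.foo.com/net/jdoe/bar")
enclosing_context, full_label = splits[0]
innermost_context, short_label = splits[-1]
```

Every context name in the list begins with the first one, which reflects the nesting: constraints set by the enclosing context carry over to each subcontext.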
Identifiers can exist in the abstract, as mental constructs; but they can only be managed and acted on in the physical world, through identifier management systems, as digital objects. Systems allow interaction with identifiers digitally, which enables actions on the identifiers; but they also allow the identifier administrators to take responsibility for the identifiers, and to maintain the identifiers as digital objects. An abstract identifier and a concrete identifier are not the same thing: the identifier management system can place no constraints on the abstract identifier. Instead, the identifier model has the concrete identifier realise a corresponding abstract identifier.
This means that one abstract identifier can be realised by more than one concrete identifier. This happens when two different identifiers, in two different identifier schemes, are managed to be equivalent (e.g. both a Handle and an ARK, or URLs in two different domains). This equivalence makes sense only if the two identifiers are understood to be fulfilling the same underlying purpose (and are not merely contingently equivalent). So the identifier model accounts for the statement that http://www.example.com/pdf/a.pdf is migrated to http://cms.example.com/repository/a.pdf (an ostensible change in identifier), by claiming that both concrete identifiers realise the abstract identifier (“example.com’s PDF repository”, “a.pdf”). The concrete identifiers are defined by particular servers and systems; the abstract identifier is defined by the management and intention common to both.
If the label in the two concrete identifiers is the same, then the same label, in different contexts, is used to identify the same thing. The contexts are still different, so the two are not guaranteed to remain synonymous. Because the two concrete identifiers can nonetheless be confused as being the same, the identifier model gives them a distinct name (homologues). Retaining the same label across contexts is useful in managing multiple contexts.
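Realisation and homologues can be sketched as a check over concrete identifiers that point back to the abstract identifier they realise. This is an illustrative sketch only: the flattened `Identifier` record, the `realises` field, the `homologues` function, and the Handle label "12345" are assumptions made here, not terms fixed by the model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identifier:
    context: str
    label: str
    thing: str
    # Set only on concrete identifiers: the abstract identifier they realise.
    realises: Optional["Identifier"] = None


def homologues(a: Identifier, b: Identifier) -> bool:
    """Concrete identifiers in different contexts that are equivalent and
    realise the same abstract identifier with the same label."""
    return (
        a.context != b.context
        and a.thing == b.thing
        and a.realises is not None
        and a.realises == b.realises
        and a.label == b.label == a.realises.label
    )


abstract = Identifier("Names of soccer players", "Pel%E9", "Pelé")
handle = Identifier("Handle Server 102.100.272", "Pel%E9", "Pelé", abstract)
purl = Identifier("PURL Server purl.nla.gov", "Pel%E9", "Pelé", abstract)

# A hypothetical concrete identifier realising the same abstract identifier
# under a different label: equivalent, but not a homologue.
opaque = Identifier("Handle Server 102.100.272", "12345", "Pelé", abstract)
```

Here `homologues(handle, purl)` holds, while `homologues(purl, opaque)` does not, because the labels differ even though both realise the same abstract identifier.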
The identifier model’s approach of distinguishing abstract from concrete identifiers is not common: usually, if two concrete identifiers are distinct, no attempt is made to model any underlying identity between them. In some approaches, this differentiation even extends to differences of encoding, or of names of contexts. (For example, the URIs doi:10.1000/182 and info:doi:10.1000/182 are distinct strings, and some systems will process them as distinct identifiers for that reason, even though they are in fact different encodings of the same identifier.) However, the notion of an abstract identifier allows us to capture the intent behind associating a label with a thing, which ultimately resides in an authority rather than a specific system, let alone a particular encoding or representation. This allows identifiers to be considered not merely equivalent, but synonymous (deliberately and reliably equivalent), because the same authority intends them to mean the same thing. Because identifiers are signs used meaningfully, this intent is important to capture.
Qualities
The following are definitions of qualities of entities:
- A thing is unique if there exists only one of the thing within a given scope. (A scope is a subset of the universe of all things.)
- An identifier is universal if it is the unique identifier identifying a thing within a given context.
- An entity or quality is persistent if it is managed and maintained for a defined period. (This does not have to be forever: the model emphasises uninterrupted maintenance of the entity, rather than chronological duration.)
- An entity or quality is accountable to a party, if that party can access well-maintained information on its previous and current responsible authorities. Accountability is realised through accountability data, such as authority metadata.
- An entity is trusted by a party if that party is confident that their use of the entity meets certain expectations. Accountability helps establish trust.
- Meeting expectations about system performance (e.g. uptime, load handling) makes the entity reliable.
- An entity is nameable if it may be treated as a name (i.e. it has a representation).
- Other qualities are defined with respect to the actions realising them. These include registered, actionable, resolvable, reserved, published, citable, verified, verifiable.
It is critical to this model that identifiers identify things uniquely; what that means is determined by the information model used for the things identified. An identifier can identify an aggregation (which is a single thing, but has multiple components); it can also identify an abstraction, which may encompass multiple concrete things (e.g. different versions of a digital object can be identified by the same identifier, because the identifier does not identify a single version.)
Uniqueness is only meaningful relative to a scope. For example, “Perth” is not unique in the scope of city names on Earth (let alone in the scope of the universe); but it is unique in the scope of city names in Western Australia. The scope of uniqueness of a name motivates the definition of a context for the name. So defining the context of the name “Perth” as “city names in Western Australia” means “Perth” can still be used as an identifier unambiguously, in the given context.
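The dependence of uniqueness on scope can be shown with a small sketch. The list of cities below is illustrative and far from complete, and the `is_unique` helper is a name introduced here.

```python
from collections import Counter
from typing import Iterable

# Illustrative (city name, region) pairs; not a complete gazetteer.
cities = [
    ("Perth", "Western Australia"),
    ("Perth", "Scotland"),
    ("Fremantle", "Western Australia"),
]


def is_unique(label: str, scope: Iterable[str]) -> bool:
    """A label is unique if it occurs exactly once within the given scope."""
    return Counter(scope)[label] == 1


# Two scopes: all listed city names, and city names in Western Australia only.
all_city_names = [name for name, _ in cities]
wa_city_names = [name for name, region in cities if region == "Western Australia"]

unique_globally = is_unique("Perth", all_city_names)
unique_in_wa = is_unique("Perth", wa_city_names)
```

Narrowing the scope is what turns an ambiguous label into a usable name, which is why the scope of uniqueness motivates the choice of context.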
Universality is useful for discovery: if only one identifier exists for an object in a registry, then searching for all instances of that identifier in the registry will discover all references to the object. If the object is known to have multiple identifiers, on the other hand, then discovery requires a separate search for each identifier.
If the context for an identifier is “all known naming systems” (the global context), universality is not a realistic expectation. There cannot be only one possible identifier in the world for a given thing, so long as any authority can set up its own identifier management system. However, various alternate strategies emulate universality - particularly preferred identifiers: an authority can advocate that one identifier should be preferred over its synonyms in its specific sphere of influence. This allows the search space for discovery to be constrained. Establishing preferred identifiers is the motivation for normalising names in catalogues and databases.
If the context for an identifier is “a single identifier management system”, by contrast, universality is often realised: a particular identifier management system will often have only one identifier for a given thing.
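The preferred-identifier strategy described above can be sketched as normalisation before searching. The sketch below reuses the ISBN example from earlier in the article; the synonym table, the record list, the placeholder "ISBN 0000000000", and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Synonyms mapped to the identifier that one authority prefers.
PREFERRED: Dict[str, str] = {
    "NLA:an40693053": "ISBN 0195306090",
    "Shirk, S. 2007. China: Fragile Superpower": "ISBN 0195306090",
}

# Hypothetical records, each citing a work by some identifier.
records: List[Tuple[str, str]] = [
    ("catalogue entry A", "NLA:an40693053"),
    ("reading list B", "ISBN 0195306090"),
    ("catalogue entry C", "ISBN 0000000000"),  # placeholder for an unrelated work
]


def normalise(identifier: str) -> str:
    """Replace a known synonym with its preferred identifier."""
    return PREFERRED.get(identifier, identifier)


def discover(identifier: str, items: List[Tuple[str, str]]) -> List[str]:
    """Find every record citing the same work, via a single normalised search."""
    target = normalise(identifier)
    return [title for title, cited in items if normalise(cited) == target]


found = discover("Shirk, S. 2007. China: Fragile Superpower", records)
```

With normalisation, one search on the preferred form stands in for a separate search per synonym, which is the constraint on the search space that preferred identifiers provide.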
Both “persistent” and “accountable” are second-order qualities when applied to identifiers. Identifiers are not persistent or accountable in themselves, but persistent or accountable with regard to other qualities, such as resolvability, citability, registration, association, and so on.
Persistence is not defined through a timeframe alone. It is defined through an assertion that the given quality of the identifier will be maintained throughout a nominated timeframe. Because persistence is an assertion, it needs to gain users’ trust through demonstrating that appropriate management is taking place.
Making an identifier persistent is a matter of policy and not technology. The ability to redirect actions on an identifier to another management system (as done under DNS, Handle, and “Cool” HTTP URIs [14]) makes it easier to implement identifier persistence policy; but it does not automatically make the identifiers persistent.
Qualities need to be associated with digital objects, if they will be acted on in the digital realm. Accountability, for instance, is realised through accountability data; persistence, as shown below, is realised through maintaining association data.
Actions
The following are definitions of actions applied to entities:
- Actions on identifiers typically are realised through an identifier management system.
- If an identifier can undergo an action through an identifier management system, it is said to be actionable. (This means that any identifier managed through an identifier management system should be actionable.)
- Some actions change the state of an identifier in the identifier management system. These can be referred to as read/write actions, and include the following.
- To Create a thing is to bring it into being.
- To Register a thing is to start maintaining and managing it in a system. Entities can be created without being registered, but well-behaved identifier systems require registration.
- To Update a thing is to alter characteristics of the thing as it is maintained in a system. It presupposes that the thing is registered.
- To Deregister a thing is to delete it from a system; it is the opposite of Register, rather than Create.
- To Destroy a thing is to make it cease to exist; it is the opposite of Create.
- To Reserve a thing is to assign it a temporary or ‘in use’ status; it is used to mark an identifier object in a system as not yet fully populated.
- To Identify a thing is to associate it with a name. This action creates an identifier.
- To Publish a thing is to enable access to it through a given non-curatorial action, from outside the curation boundary (see next section for details).
- Other actions do not change the status of the identifier on the identifier management system. These can be referred to as read actions, and include:
- To Cite a thing is to communicate a representation of a thing to an audience. Identifiers can be cited, as can identifier actions (e.g. service calls). Depending on what the representation is embedded in, the thing is citable in different ways, e.g. Web-Citable, or Print-Citable.
- To Query an entity is to obtain selected information about the entity from a system.
- To Resolve an identifier is to get information which distinguishes the thing identified from all other things.
- To Retrieve through an identifier is to access a representation of the thing identified.
- To Verify an entity is to confirm that a value for an entity, managed in a system, is what it should be. What the value should be is decided with regard to particular qualities, such as accountable and resolvable.
- Entities are verifiable with regard to a quality, if the Verify action is possible for that entity and quality; entities are verified if the Verify action has actually taken place.
As digital objects, identifiers can be registered and deregistered. That is distinct from creating and destroying identifiers: if someone has made the connection between a name and a thing in their head, they have created an identifier, and only erasing their memory will destroy that association. Though concrete identifiers exist only by virtue of their management systems, identifiers can be recorded outside those management systems. (This is important for archival purposes: we can still use deregistered identifiers to identify things in a historical sense.)
Actions on concrete digital identifiers are realised through services on identifier management systems.
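As a minimal sketch of how these actions might be realised as services, the following toy in-memory system separates write actions from read actions. The class and method names are illustrative assumptions, not part of the PILIN model, and the identifier name reuses the example Handle that appears later in this article.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    RESERVED = "reserved"      # marked as in use, not yet fully populated
    REGISTERED = "registered"  # maintained and managed by the system


@dataclass
class IdentifierRecord:
    name: str
    status: Status
    association_data: list = field(default_factory=list)


class IdentifierSystem:
    """Toy identifier management system (hypothetical, for illustration)."""

    def __init__(self):
        self._records = {}

    # Write actions: these change what the system manages.
    def reserve(self, name):
        self._records[name] = IdentifierRecord(name, Status.RESERVED)

    def register(self, name, association_data):
        self._records[name] = IdentifierRecord(
            name, Status.REGISTERED, list(association_data))

    def update(self, name, association_data):
        if name not in self._records:
            raise KeyError(f"{name} must be in the system before it can be updated")
        self._records[name].association_data = list(association_data)

    def deregister(self, name):
        del self._records[name]

    # Read actions: these leave the record unchanged.
    def is_managed(self, name):
        return name in self._records

    def query(self, name):
        return self._records[name]

    def resolve(self, name):
        record = self._records[name]
        if record.status is not Status.REGISTERED:
            raise LookupError(f"{name} is reserved but not yet resolvable")
        return list(record.association_data)


system = IdentifierSystem()
system.reserve("102.100.272/XYZ")
system.register("102.100.272/XYZ", ["http://www.example.com/a.pdf"])

cited_name = "102.100.272/XYZ"  # citing only copies the name; no system is needed
resolved = system.resolve(cited_name)

system.deregister(cited_name)   # the cited name survives outside the system
```

In this sketch the cited name remains usable as a historical reference after deregistration, although it can no longer be resolved through the system.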
The usual target of verification is the resolvability of an identifier: verification confirms that the identifier resolves to something, and moreover that it resolves to the correct thing.
Actionability on an identifier requires the use of an identifier management system. Citing an identifier, for instance, does not depend on the existence of an identifier management system; so citing an identifier (e.g. writing the identifier name down on a piece of paper) does not make the identifier actionable. When an identifier is referred to as Actionable, what is usually meant is that the identifier is Resolvable. The distinction between Resolve and Retrieve is discussed further below.
Publishing
The following concept definitions relate to the action of publishing entities:
- Read/Write actions are curatorial actions, which take place in order to manage entities - that is, to realise or maintain desirable qualities of the entities.
- A party authorised to perform curatorial actions on an entity through a system is an administrator on that system.
- A party authorised to perform Read actions but not Curatorial actions on an entity through a system is an end-user on that system.
- Systems have a curation boundary [15], defined by who is authorised to perform curatorial actions through the system - i.e. who the system administrators are. Curatorial actions occur within the curation boundary of a system: only administrators can undertake such actions.
- An entity crosses the curation boundary, when end users are granted access to the entity: this is what is commonly understood by publishing an entity.
- Granting an end-user access to an entity means allowing them to perform actions on the entity. Those actions, by definition, are Read actions.
The notion of a curation boundary helps us distinguish between administrators and end-users - even if the community of administrators is distributed and sizeable. Making an identifier accessible to an administrator remotely does not count as publishing it, any more than making it available locally for editing does. An identifier is only published when a user who cannot update the identifier is newly given the ability to act on the identifier in some other way (typically, as we will see, to resolve it).
Publishing an identifier depends on who is allowed to act on it, as well as on how they can act on it. An identifier may be resolved by administrators, while it is being prepared for release. But it is only considered published once end-users are also allowed to act on it.
Querying and verifying identifiers are actions typically undertaken in order to curate the identifier, though they are not write operations, and might be accessible outside the curation boundary.
This definition of publishing centres on authorising Read actions through a system. An alternate definition of publishing depends on who has knowledge of the entity published: if an end-user becomes aware of an identifier, and can, for example, cite it, we could speak of the identifier being published. But the definition adopted here requires the end-users to perform actions through the identifier management system: if they can write the identifier down, but they cannot yet resolve it, this definition does not consider it as published yet.
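The definition adopted in this section can be sketched as an authorisation check. This is a hedged illustration only: the class, the party names and the set of grantable actions are assumptions made for the example. Cite is deliberately left out of the grantable actions, since citing does not go through the identifier management system.

```python
# Read actions that a system can open up to end-users (assumed set).
READ_ACTIONS = {"query", "resolve", "retrieve", "verify"}


class CuratedSystem:
    """Hypothetical system whose curation boundary is its set of administrators."""

    def __init__(self, administrators):
        self.administrators = set(administrators)
        self.published = {}  # identifier name -> read actions open to end-users

    def is_administrator(self, party):
        return party in self.administrators

    def publish(self, party, name, action="resolve"):
        # Publishing is itself performed from inside the curation boundary.
        if not self.is_administrator(party):
            raise PermissionError("only administrators can publish")
        if action not in READ_ACTIONS:
            raise ValueError("only read actions can be granted to end-users")
        self.published.setdefault(name, set()).add(action)

    def may_perform(self, party, name, action):
        # Administrators may act on an identifier before it is released.
        if self.is_administrator(party):
            return True
        return action in self.published.get(name, set())

    def is_published(self, name):
        # Published only once some read action is open to end-users.
        return bool(self.published.get(name))


system = CuratedSystem(administrators={"alice"})
name = "102.100.272/XYZ"

before = (system.may_perform("alice", name, "resolve"),
          system.may_perform("bob", name, "resolve"),
          system.is_published(name))

system.publish("alice", name, "resolve")

after = (system.may_perform("bob", name, "resolve"),
         system.may_perform("bob", name, "update"),
         system.is_published(name))
```

Here `before` shows an administrator resolving an unpublished identifier, and `after` shows that publication grants the end-user resolution but never update.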
Resolution and Retrieval
The following concept definitions relate to the actions of resolving and retrieving on identifiers:
- To Resolve an identifier is to get information which distinguishes the thing identified from all other things (association data). This information tells us what the thing identified is: it “identifies” the thing.
- Resolving an identifier is a way of dereferencing it: navigating from the identifier to the thing identified.
- Information on how to access the thing (the locator of the thing) is one type of resolution data: the locator distinguishes the thing to be accessed from all other things, which have different locators (or no locators at all). But resolving an identifier does not involve actually providing access to the thing.
- An identifier is Resolvable if it can be resolved, and Web-Resolvable if it can be resolved to information usable directly on the Web (e.g. resolvable to a locator, i.e. a URL).
- Retrieval on an identifier provides access to a representation of the thing, via a locator. Access is typically the responsibility of an external system. The representation is domain-specific, and can take various forms.
- An identifier can be resolved without being retrieved, if the thing it identifies is not an online resource; e.g. XML namespaces, vocabulary terms.
- Resolve is the main non-curatorial action used with identifiers; when an identifier is published, that is normally understood to mean making it available for resolution by end-users. For instance, the identifier info:hdl:102.100.272/XYZ may be resolved to the locator http://www.example.com/a.pdf through a Handle Resolver. But accessing the latter URL, and downloading the PDF at that location, is a distinct action of retrieval, enabled by the www.example.com HTTP server rather than the 102.100.272 Handle server.
- An identifier can have multiple instances of association data, all of them providing access to the same thing according to the system’s information model. Such identifiers can undergo multiple resolution, in which all instances of association data are returned to the requester. Multiple resolution typically feeds into an appropriate copy selection process, which determines which association data is the best to use for the request [16].
- For instance, a document is stored in multiple repositories. The identifier for that document has association data that includes multiple URL locators, one for each copy of the document. This is allowed by the information model: the thing identified is the abstract document, and not just a particular instance of the document on a server. So all the URLs are distinctive to the underlying abstract document (they are not associated with any other abstract document), even though they are also distinct from each other (as different concrete copies of the document). Any of the locators can be returned to the requester as resolutions of the identifier, because they are information distinguishing the abstract document from all other abstract documents (under a particular information model dealing with abstract documents). All of the locators can be returned to the requester, as a multiple resolution of the identifier.
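The multiple-repository example above can be sketched as follows. The repository host names and the selection rule (prefer a host from a caller-supplied list) are assumptions made for illustration; real appropriate copy selection may weigh licensing, network proximity or format instead.

```python
from urllib.parse import urlsplit

# Association data for one abstract document held in two repositories.
# Host names are hypothetical.
ASSOCIATION_DATA = {
    "info:hdl:102.100.272/XYZ": [
        "http://repository-a.example.org/docs/a.pdf",
        "http://repository-b.example.net/mirror/a.pdf",
    ],
}


def resolve_all(identifier):
    """Multiple resolution: return every locator held for the identifier."""
    return list(ASSOCIATION_DATA[identifier])


def select_appropriate_copy(locators, preferred_hosts):
    """Return the first locator on a preferred host, else the first locator."""
    for host in preferred_hosts:
        for locator in locators:
            if urlsplit(locator).hostname == host:
                return locator
    return locators[0]


locators = resolve_all("info:hdl:102.100.272/XYZ")
chosen = select_appropriate_copy(locators, ["repository-b.example.net"])
```

Each locator on its own is a valid resolution of the identifier; the selection step only decides which one best suits a particular request.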
Association data captures the association in an identifier management system of an identifier’s name and the thing identified. Maintaining this data is the primary responsibility of an identifier management system. However, an identifier record, as a digital object, may also contain other information.
Resolving an identifier is different from querying it. Querying a Handle identifier is done by viewing the entire Handle digital record - including not only any URLs registered (as association data), but also timestamps, permissions, and other metadata. Resolving a Handle identifier, on the other hand, typically involves mapping the Handle to one of the registered URLs.
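The difference can be illustrated with a simplified record. The layout below is only loosely modelled on a Handle record: the field names, the e-mail value and the timestamps are illustrative assumptions, not the actual Handle data format.

```python
# Simplified, hypothetical identifier record with typed values.
RECORD = {
    "handle": "102.100.272/XYZ",
    "values": [
        {"index": 1, "type": "URL", "data": "http://www.example.com/a.pdf",
         "timestamp": "2008-06-01T00:00:00Z"},
        {"index": 2, "type": "EMAIL", "data": "curator@example.com",
         "timestamp": "2008-06-01T00:00:00Z"},
    ],
}


def query(record):
    """Querying returns the whole record, metadata included."""
    return record


def resolve(record):
    """Resolving maps the identifier to one of its registered URLs."""
    for value in record["values"]:
        if value["type"] == "URL":
            return value["data"]
    raise LookupError("no URL registered for " + record["handle"])


full_record = query(RECORD)
locator = resolve(RECORD)
```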
Resolution and retrieval are often conflated. Resolution distinguishes what the identifier identifies from what it does not; it does not necessarily involve accessing what is identified. In contemporary digital identifier systems, some sort of resolution to a locator is a prerequisite for retrieval. However, metadata describing a resource are an acceptable way of resolving an identifier - so long as that metadata uniquely discriminates the thing identified from all other candidates. In the HTTP protocol, resolution and retrieval can be distinguished as HEAD vs. GET.
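The HEAD vs. GET analogy can be made concrete with Python's standard library. This sketch only builds the two requests and sends nothing over the network; the function names are illustrative assumptions.

```python
from urllib.request import Request


def resolution_request(uri):
    """HEAD asks for the response headers only, without the body."""
    return Request(uri, method="HEAD")


def retrieval_request(uri):
    """GET asks for a representation of the resource itself."""
    return Request(uri, method="GET")


head = resolution_request("http://www.example.com/a.pdf")
get = retrieval_request("http://www.example.com/a.pdf")
```

Both requests target the same URI and differ only in method. Passing either object to `urllib.request.urlopen` would send it.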
Multiple resolution is intrinsic to the functioning of appropriate copy protocols like OpenURL [17], which assume that a single abstract resource can have multiple concrete instances, each with its own locator. Multiple resolution is also commonplace in the operation of large-scale, mirrored Web sites. Usually the selection of one of the multiple instances is a process hidden from the user.
It bears repeating that digital identifiers do not apply exclusively to the digital realm. Not all things identified by digital identifiers are online digital objects, so they cannot all meaningfully be retrieved (e.g. a vocabulary item or an organisation - although the description of the organisation may well be a digital object, such as a Web page). Not all identifiers are associated with services to resolve or retrieve the identifiers digitally (e.g. a name roster in Excel); in fact, digital identifiers need not provide retrieval as an option at all. However, there is a strong expectation that online identifiers should at least be resolvable: a user should be able to determine, through some service, what is being identified.
- A request for a service call on an identifier is logically distinct from the identifier itself. For example, the URI http://www.example.com as an identifier is logically distinct from an HTTP GET request on http://www.example.com (although the distinction has been blurred in the history of HTTP). This distinction is important because:
- More than one service may be associated with the same identifier.
- The identifier should not be bound or restricted to the specified service.
- Best practice for persistent identifiers is to manage the association between the name and thing, independent of whatever service is used to retrieve the thing. The way the thing is retrieved may not persist; but the association between name and thing should persist.
For example, a request to retrieve a resource by its URI identifier (HTTP GET on the URI) can be distinct from a request for the most appropriate copy of the resource, or metadata concerning the resource (HTTP GET on the URI embedded in an OpenURL request), or an archived version of the resource (HTTP GET on the URI embedded in a Wayback Machine [18] request). Under the Semantic Web, HTTP URIs identifying abstractions may not be intended for dereferencing at all - even if they hyperlink to descriptions of the thing identified (see e.g. XML namespaces, or the Semantic Web use of HTTP Status Code 303 See Other [19]).
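The following sketch builds three distinct service requests from one and the same identifier. The OpenURL resolver address is hypothetical. The `url_ver` and `rft_id` keys and the Wayback Machine URL pattern are assumptions about those services' request formats, not something specified in this article.

```python
from urllib.parse import urlencode

IDENTIFIER = "http://www.example.com/a.pdf"

# Hypothetical institutional OpenURL resolver.
OPENURL_RESOLVER = "http://resolver.example.org/openurl"


def retrieval_request(uri):
    """Plain retrieval: HTTP GET on the identifier itself."""
    return uri


def openurl_request(uri, resolver=OPENURL_RESOLVER):
    """Appropriate copy or metadata: the identifier embedded in an OpenURL."""
    query = urlencode({"url_ver": "Z39.88-2004", "rft_id": uri})
    return f"{resolver}?{query}"


def wayback_request(uri, timestamp="2008"):
    """Archived version: the identifier embedded in a Wayback Machine URL."""
    return f"http://web.archive.org/web/{timestamp}/{uri}"


requests = {
    "retrieve": retrieval_request(IDENTIFIER),
    "openurl": openurl_request(IDENTIFIER),
    "archive": wayback_request(IDENTIFIER),
}
```

The identifier stays constant across all three entries; only the service wrapped around it changes.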
For persistent identification of digital resources, identifier management systems should maintain association data independently of the locator used to retrieve the resource - e.g. as a prose description identifying the resource. Even if the network location of the resource is compromised or no longer maintained, administrators should be able to recover what was supposed to be identified.
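A minimal sketch of this recommendation keeps a prose description alongside an optional locator, so that resolution still yields something when the locator fails. The class, the liveness check and the sample description are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AssociationData:
    description: str               # prose identifying the thing, kept regardless
    locator: Optional[str] = None  # network location, which may not persist


def resolve(data: AssociationData, is_live: Callable[[str], bool]) -> str:
    """Return the locator while it is live; otherwise fall back to the description."""
    if data.locator is not None and is_live(data.locator):
        return data.locator
    return data.description


data = AssociationData(
    description="Example report, PDF edition (hypothetical description)",
    locator="http://www.example.com/a.pdf",
)

while_live = resolve(data, is_live=lambda url: True)
after_loss = resolve(data, is_live=lambda url: False)
```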
Universality of HTTP
The HTTP protocol is currently close to universal for interacting with resources on the Internet; this has proven of great benefit in expanding the reach of the Internet and guaranteeing its integrity. Any digital identifier scheme used online realistically needs to provide at least some services through the HTTP protocol. This amounts to exposing those identifiers as HTTP URIs - as is already commonplace, e.g. with Handles, XRIs and ARKs, through resolution and retrieval services.
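As a sketch of what such exposure can look like, the function below rewrites an info:hdl identifier as an HTTP URI under a resolver proxy. The function name is an assumption; the two proxy addresses are those that appear in this article's own examples and references.

```python
HANDLE_PROXY = "http://hdl.handle.net/"


def handle_to_http_uri(identifier, proxy=HANDLE_PROXY):
    """Expose an info:hdl identifier as an HTTP URI under a resolver proxy."""
    prefix = "info:hdl:"
    if not identifier.startswith(prefix):
        raise ValueError(f"not an info:hdl identifier: {identifier}")
    return proxy + identifier[len(prefix):]


global_uri = handle_to_http_uri("info:hdl:102.100.272/XYZ")
local_uri = handle_to_http_uri("info:hdl:102.100.272/XYZ",
                               proxy="http://resolver.net.au/hdl/")
```

The same Handle can be exposed under more than one proxy, which illustrates that the HTTP URI is one service binding for the identifier rather than the identifier itself.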
It is also clear from the foregoing, and from the current definition of HTTP URIs [5], that HTTP URIs qualify as identifiers (and are no longer bound to be locators, as URLs). Provided they are appropriately managed, nothing prevents them being used as persistent identifiers. There is of course a long history of HTTP URLs not being managed appropriately; but persistence has always been a policy matter. There is no technical barrier to HTTP URIs being persistent, as indeed Tim Berners-Lee pointed out in 1998 [14].
That said, we take issue with the following common assumptions, which do not follow [20]:
- A universal service protocol (such as HTTP) is the same thing as a universal identifier scheme;
- HTTP URIs are the preferred identifier for all authorities (although they may well be preferred for HTTP-oriented authorities);
- HTTP URIs are the preferred identifiers in contexts where HTTP services are not relevant (e.g. internal document management);
- HTTP will always be a universal protocol, and persistent identifier providers should assume it will be;
- HTTP URIs will capture all functionality, data, or services presented by other identifier schemes;
- Identifiers in other schemes should be maintained only to the extent of exposing them under HTTP;
- All identifiers, even when mapped to an HTTP URI, must be meaningfully dereferenceable through a Web browser.
Different identifier schemes address different business requirements, by presenting users with different services and policies. The HTTP protocol has a deliberately restricted repertoire of services, consistent with a resource-oriented rather than a service-oriented view of architecture; and it does not natively support a rich environment for managing identifiers, such as we believe is necessary to support identifier persistence properly [10]. Other identifier schemes, more explicitly oriented towards persistence, provide users with different levels of support and management.
It is important for the Web that all digital identifiers behave as HTTP URIs for dereferencing - resolution and/or retrieval. This has made the modern Web architecture possible. But this does not mean all digital identifiers have to be HTTP URIs, and in particular managed as HTTP URIs, in order to achieve interoperability with other identifiers. HTTP as a service protocol for identifiers does not address all purposes equally well, and there is a place in the Web for other identifier schemes to continue in use, so long as they are exposed through HTTP.
Conclusion
Under the PILIN Project, we have sketched a model for identifiers and identifier services. This model has allowed us to compare different identifier schemes, and identifiers used in different domains, without losing sight of their underlying commonalities. One of the major problems in debates on persistent identifiers has been the different understanding of terminology between proponents of different identifier schemes: this has led to misunderstanding, or inordinate focus on incidental details. The ontology allows us to analyse identifier systems in terms of their base functionality and the requirements they fulfil, rather than being distracted by implementation specifics. To give an example, debate over how identifiers are actionable is simplified by a comparison of how identifier systems dereference identifiers, and by the recognition that all contemporary digital identifier systems provide retrieval, but only some provide resolution.
This more abstract layer of comparison brings clarity to the identifier debates; it enabled us to articulate identifier policy guidelines in a much more focussed manner. Identifier systems can then be mapped back to the business requirements they satisfy more accurately.
The model as presented here is not novel: it represents a convergence of views in various identifier communities, even though communication between those communities has often been difficult. The basic notions underlying the model are drawn from semiotics, and are much older. However, making such a model explicit helps establish which differences between identifier schemes are essential, and which are incidental. It does so especially by foregrounding the requirements users have for identifiers, as desirable identifier qualities.
We hope that our model can help others in the identifier community likewise approach the recurring debates over identifier systems with more clarity and less risk of confusion - and in that way, can focus discussion on issues which truly make a difference to identifier managers and users.
Acknowledgements
This article reports on work done under the PILIN project and the PILIN ANDS Transition Project. PILIN was funded by the Australian Commonwealth Department of Education, Science and Training (DEST) under the Systemic Infrastructure Initiative (SII) as part of the Commonwealth Government’s Backing Australia’s Ability - An Innovation Action Plan for the Future (BAA). The PILIN ANDS Transition Project was funded by the Australian Government under the National Collaborative Research Infrastructure Strategy (NCRIS), as part of the transition to the Australian National Data Service (ANDS).
The authors wish to acknowledge the support and feedback of the rest of the PILIN team. We also thank Dan Rehak for his feedback.
References
- PILIN Project http://www.linkaffiliates.net.au/pilin2/
- Nicholas, Nick. 2008. PILIN Ontology for Identifiers and Identifier Services http://resolver.net.au/hdl/102.100.272/G9JR4TLQH. The present article is an expansion of the PILIN Ontology Summary http://resolver.net.au/hdl/102.100.272/T9G74WJQH
- Nicholas, Nick & Ward, Nigel. 2008. PILIN Service Usage Model. Version 1.1. http://resolver.net.au/hdl/102.100.272/0LHBLDTRH
- An overview of current identifier schemes, and of issues in identifier resolution and persistence, is given in Tonkin, Emma. 2008. “Persistent Identifiers: Considering the Options”. Ariadne, Issue 56 http://www.ariadne.ac.uk/issue56/tonkin/
- Berners-Lee, Tim, Fielding, Roy & Masinter, Larry. 2005. Uniform Resource Identifier (URI): Generic Syntax. IETF RFC 3986 http://www.ietf.org/rfc/rfc3986.txt
- The Digital Object Identifier System http://doi.org/
- OASIS Extensible Resource Identifier (XRI) TC http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xri
- Kunze, John & Rogers, R.P. Channing. 2008. The ARK Identifier Scheme. IETF Internet Draft http://tools.ietf.org/html/draft-kunze-ark-15
- For an introduction to semiotics, see e.g. Chandler, Daniel. 2001. Semiotics: The Basics. London: Routledge.
- Nicholas, Nick, Ward, Nigel & Blinco, Kerry. 2009. A policy checklist for enabling persistence of identifiers. D-Lib 15: 1/2. doi:10.1045/january2009-nicholas http://www.dlib.org/dlib/january09/nicholas/01nicholas.html
- PURL (Persistent Uniform Resource Locator) http://purl.oclc.org/
- The identifier model’s notion of context is deliberately vague, and is built in to the name: once the name is tied to a specific context, it can be associated with a thing without ambiguity. Semiotics proper does not keep its associations static: it allows context to play a role in how the name is associated with the thing (cf. Charles Peirce’s “interpretant” with the identifier model’s association policy; on the interpretant, see e.g. http://plato.stanford.edu/entries/peirce/#prag). For digital identifiers, however, this simpler model is adequate.
- RFC 3986 [5], §1.2.2 (pp. 8-9): “Although many URI schemes are named after protocols, this does not imply that use of these URIs will result in access to the resource via the named protocol. […] Even when a URI is used to retrieve a representation of a resource, that access might be through gateways, proxies, caches, and name resolution services that are independent of the protocol associated with the scheme name.”
- Berners-Lee, Tim. 1998. Cool URIs don’t change http://www.w3.org/Provider/Style/URI
- On the ‘curation boundary’, see Treloar, Andrew, Groenewegen, David & Harboe-Lee, Cathrine. 2007. The Data Curation Continuum: Managing Data Objects in Institutional Repositories. D-Lib 13: 9/10. http://www.dlib.org/dlib/september07/treloar/09treloar.html
- On appropriate copy, see e.g. Rehak, Daniel R. 2005. The Appropriate Version Problem: Separating Learning Designs and Course Structures from Learning Object Versions, Variants and Copies. CORDRA: Content Object Repository Discovery and Registration/Resolution Architecture http://hdl.handle.net/2000.01/D6E3BF9462684182AC293D64D3DDE192
- Ex Libris Group, OpenURL http://www.exlibrisgroup.com/category/sfxopenurl
- Internet Archive Frequently Asked Questions http://www.archive.org/about/faqs.php#The_Wayback_Machine
- www-tag@w3.org list, post from Roy T. Fielding, 18 June 2005 http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html; see also Sauermann, Leo & Cyganiak, Richard. 2008. Cool URIs for the Semantic Web. http://www.w3.org/TR/cooluris/
- Further discussion in Nicholas, Nick. 2008. Using URIs as Persistent Identifiers http://resolver.net.au/hdl/102.100.272/DMGVQKNQH