Versioning in Repositories: Implementing Best Practice
The VIF Project
The Version Identification Framework (VIF) [1] Project ran between July 2007 and May 2008 and was funded by the Joint Information Systems Committee (JISC) under the Repositories and Preservation Programme [2] to help develop versioning best practice in repositories.
The project partners were the London School of Economics and Political Science (LSE) [3], the Science and Technology Facilities Council (STFC) [4], the University of Leeds [5] and Erasmus University Rotterdam [6]. The project has produced a detailed Web-based framework [7], which provides information and guidance about versioning in repositories. The article ‘Version Identification: A Growing Problem’ [8], published in Ariadne Issue 54, explored the issues associated with versions in institutional repositories and outlined the research and work carried out to date. This successor article highlights some of the best practice developed within the VIF Project, which is also available in more detail in the framework itself. It also accompanies the event report in Ariadne Issue 55, ‘Which One’s Which? Understanding Versioning in Repositories’ [9], which reported on VIF’s final workshop in April 2008.
Why Pay Attention to Versioning?
Versioning is so inherent to the research process that it can be easy to give the matter little thought; it is one of those facts of research life that, at a cursory glance, would not appear to be able to contend with heavier issues such as resourcing or licensing.
As such, until recently, there has not been a huge amount of time dedicated to versioning issues, but we do know it is a recognised problem. In the survey that VIF carried out in autumn of 2007 [10], only 5% of academics and 6.5% of information professionals surveyed found it easy to identify versions of digital objects within institutional repositories. Across multiple repositories the figures were only 1.8% of academics and 1.1% of information professionals. Moreover, a third of information professionals who work with repositories stated that they either have no system currently in place or ‘don’t know’ how they deal with versioning at present.
However, the institutional repository movement has grown rapidly and is now well past its infancy, and the types of problem encountered have changed as repositories have become more established. The bigger issues are still pressing and constantly under debate, but increasingly the devil shows itself in the detail. The more content that is entered into repositories, the more academics are concerned about issues such as citation, and the greater the potential for version relationships, and even simple version identification, to cause confusion. Not tackling the issue can compound the problems.
On further investigation, the subject of versioning proves to be an excellent example of how dedicating some time to the finer points and applying best practice is well worth the effort. Working through the detail allows a repository manager to challenge and improve all aspects of the repository, from strategy and policies to configuration and communication with its users. Advanced repositories can use a review of versioning issues as an opportunity to take stock and clear up, ready for the next stage of growth. Meanwhile, fledgling repositories can take full advantage of the lessons learnt by the earlier repositories and bypass potential problems, issues and wasted resources.
Most importantly, attention paid to versioning really puts the emphasis back on the role of the end-user and helps to build trust in a repository. When researchers search a repository and access a digital object, they need to have confidence that they have found the right item for their purposes. People working on a repository or even an author may think that they know what the best version to store is, but research is very rarely a linear or objective process, and all researchers have their own independent needs.
For example, one researcher may need to find the published version in order to give an accurate citation, whilst another would like to view an earlier version of the same work in order to assess the development in the thought of the author or content creator. Another person might need to know exactly what information was provided at a particular date to provide evidence, for example in a court case. If the only version information provided on this object is ‘final version’, how can any of these researchers know that they have found the relevant, or appropriate item? How can the end-users trust the information or indeed the repository itself?
The Confusion Caused by Versioning
The first point of confusion for VIF to tackle was what we mean when we talk about versions. Research projects are more often than not highly dynamic and complex processes, spawning vast numbers of separate entities, all of which may relate to each other in some way, but may or may not be understood as ‘versions’ of the same thing.
Part of the misunderstanding is caused by a lack of differentiation between a version relationship between two or more objects and a single object’s version status. Sometimes researchers just need to know what is in front of them. For example, the following questions can all be asked of a single object:
- I want to cite the published version; is this it?
- Is this a draft version which has subsequently been added to?
- Is this the presentation which was delivered in London last year or Cardiff this year after the author developed his thesis?
If the researcher wants to gain a more complex understanding of the relationships an object has with other objects, the questions might be more wide ranging:
- In what order were these draft versions created?
- Is this working paper related to the article of the same name?
- Are conference papers and posters stored in the same place as the article which refers to them?
- There are two articles with the same name which appear identical; are they sequential versions of each other, or just copies in different formats?
To help clarify understanding, VIF has provided the following definitions:
‘A ‘version’ is a digital object (in whatever format) that exists in time and place and has a context that can be described by the relationship it has to one or more other objects.
‘A ‘version relationship’ is an understanding or expression of how two or more objects relate to each other.’ [11]
These definitions encapsulate the notion that all objects associated with a piece of work (perhaps a concept or a research project) have a relationship with each other. These relationships vary, and some people may consider the related objects to be versions of each other. The VIF Project has avoided specifically defining what is and what is not a version and left it to the repository manager and the end-user to decide what they would call versions. VIF has instead encouraged repository managers to focus on ensuring that all important information is made available to users so that they can understand both the version status and any version relationship.
The other cause of confusion is the lack of common agreement about terminology despite a few recent attempts to define common vocabularies. The NISO/ALPSP Working Group on Journal Articles [12] and the VERSIONS Project [13] focussed upon defining the position of a journal article in the publication process. The RIVER Project [14] looked more broadly at any sort of digital object, and its lifespan in general terms, not just publication, and offers expressions to describe the relationships between digital objects.
However, across the wider community, terms are still frequently used interchangeably and the boundaries between them are often subjective. For example, one person’s ‘revision’ might be minor changes, but represent another person’s significant changes or landmark versions (for example, peer-reviewed, published, etc.). When do formatting or stylistic changes (for example, typesetting or font) or a change of file format (creating a digital variant) become significant?
Best Practice
The Strategy
A clear, documented understanding of the purpose of the repository will help navigate some of the thornier decisions. A repository can be used as: a permanent archive; a research database; a tool for statistical analysis and reporting (for example, helping to support the Research Assessment Exercise submission); or even as a workbench on which academics develop their work.
These purposes are not mutually exclusive but acknowledgement of what is important for your institution and what type of object the repository will contain will help clarify what versioning issues may come up. The repository may for a variety of reasons want to limit the number of drafts of the same work, provide restricted access to some types of material, or decide to remove duplicated or old material; but there should be a framework in place to guide these procedures. Without strategy and subsequent policy setting, decision making will be at best inconsistent, and at worst incoherent, leading to confusing explanations and instructions to both contributors and users.
For example, an archive may feel that adjusting an object by adding a coversheet or watermark may alter the original integrity of the work; but a workbench repository may find these tools extremely helpful. A scientific repository of images of cell cultures that is a permanent archive will focus more on cataloguing the individual objects; but one that is used for researchers to examine and cross reference data systematically may pay more attention to highlighting the relationships between images.
A university repository which currently contains journal articles in order to collect an institution’s intellectual output, and which is considering including e-theses, might have the aim of changing the academic administrative process and moving away from a print focus to electronic submission. If the repository is a collection of the university’s academic output, should only theses completed at the university be included? If so, how should the repository present this decision to an academic who completed her thesis elsewhere but has become interested in the repository because she is attracted by the idea of depositing all of her own work in one place?
It is also worth considering the future goals and uses of the repository. A repository which only deals with one kind of object for one purpose, like e-theses, will find consistent application of versioning tools much simpler than those containing multiple formats related in different ways. However, what if it is later expanded to include supporting research material in different formats, or follow-up journal articles? Will the versioning tools implemented be able to be easily adapted? Multimedia objects are in some ways harder to deal with than text documents because there are more variations in the sort of versions associated with them, such as formatting and compression; and frequently there are fewer obvious places to present text-based versioning information.
Once the purpose of the repository, the likely content and the needs of the end-users are clear, the appropriate software should be selected. One software system might be more appropriate if the repository is going to be used as a workbench and requires a high degree of customisation because of the repository’s specialism. However, the same software might be too challenging for a start-up repository that is intended to store a number of text-based documents but which has little technical IT support.
Priorities for functionality should be considered and exploited. Is automatic ingest functionality essential to save manual input time? Is keeping every version important or will the ability to delete be vital? Is flagging (for example, the last available version) something that would be useful?
All software available has a continuing programme of upgrading and it is worth keeping up to date with improved versions. There are certainly some major advances anticipated from the main software providers in respect of version control due in 2008.
Once strategy is set, it is critical to document the thought processes involved and set transparent policies. Version policies do not have to be stand-alone, but can be integrated into existing policy documents. The OpenDOAR [15] policies tool provides a number of templates for repositories to adapt and has taken up VIF’s recommendations for including versioning policies.
Advocacy
The best and most accurate version information comes directly from content creators at the time of deposit. Identifying version status and relationships should be seen as an essential part of the ingest process. The framework provides a few simple tips for content creators to include within an existing advocacy programme that will enhance the quality of version information. These are:
- Keep a record of which versions are made publicly available and where.
- Use a numbering system that denotes major revisions.
- Make explicit the author, title, date last changed and version status on all versions of work. This can be done:
  - descriptively within the object, for example, on a title page, title slide, first frame of film and so on.
  - by using a clear, updated and relevant filename for every different version.
  - by filling in available ‘Properties’ details or ‘ID tags’ (full guidance below).
Repository staff should suggest that content creators look at the Toolkit or the Content Creation section of the Framework [16], which contains more information about how to include version information within specific object types. The VERSIONS Project produced a free toolkit [17], available to download or as a printed booklet to hand out to academics. It is an extremely practical information resource containing a lot of best practice and guidance for repository managers, authors and researchers.
The Essential Versioning Information
Some pieces of information about an object, such as the author and the title are obvious, but more information is necessary to make a qualified assessment of what version is at hand. Although only one piece of information may be needed to identify a particular object, the more information that the repository can make transparent, the less possibility for confusion.
VIF has identified five important pieces of information that can be used to identify a version quickly and easily (if applied in a clear and transparent manner) by either embedding them into an object or storing them in metadata:
1. Defined Dates
A date is probably the most obvious and potentially effective way of identifying a version and it may be the only information required. However, despite the fact that a date without explanation can easily cause more confusion and ambiguity, it seems that undefined dates are very common. VIF strongly recommends that any date used is clear and defined, and that the most relevant date is then applied consistently within the repository. If only one date is used, it should be Date Modified (by the author, not the repository) wherever possible and should be accompanied by a description of who made the changes and why. As version identification, it should relate to the object at hand, not to the repository, nor to an understanding of the workflow.
2. Identifiers
An identifier is a ‘name’ which uniquely identifies an object (simple or complex). One object might have a number of different identifiers, such as a filename, author, email address, handle, DOI, URI, Repository Number or ISBN. As many identifiers as are known should be displayed in both metadata and the object itself, although persistent identifiers should particularly be made available for citation purposes.
3. Version Numbering
Version numbering is very common amongst authors, as it works very well for individuals. A variety of systems are used, although consistency across authors is not usual and multiple authors working together can cause problems by branching off onto different lines. Whilst systems can be very easy and intuitive to follow, it can be difficult to impose a consistent version numbering system since a major change for one person may be regarded as a minor one by someone else. In addition, the version number is only really meaningful if other versions are also known. However, repository managers may consider this version information useful to include if provided by the creator at time of deposit.
4. Version Labels / Taxonomies
Taxonomies are extremely useful in positioning an object in a workflow, which is why people like the idea of them so much. For example, the names suggested by VERSIONS, which have enjoyed some take-up for journal articles, are ‘Draft’, ‘Submitted Version’, ‘Accepted Version’, ‘Published Version’ and ‘Updated Version’.
However, there are a number of challenges in implementing a robust and consistent system, particularly when dealing with more than one kind of object as workflows differ enormously. VIF’s advice is that if terminologies are used, they must be used consistently and the terms must be clear and explained to the end-user.
5. Text Description
A written description will always be the best and clearest method of communicating the object’s version status and relationships to other objects. Complexities can be fully explained, and if it is the content creator who has written the description, then the level of trust in the information is very high. However, statements like ‘last version’ should be avoided as they can very rapidly become out of date. There are also few opportunities to add this level of detail, particularly in non-text-based objects, and it is time-consuming to do.
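As an illustration only, the five pieces of information could be held together in one small record. The sketch below is not taken from the framework: the class name, field names and sample values are all assumptions, and only the five labels come from the VERSIONS names quoted above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional

# The VERSIONS names quoted in the text; any other taxonomy could be swapped in.
VERSIONS_LABELS = (
    "Draft",
    "Submitted Version",
    "Accepted Version",
    "Published Version",
    "Updated Version",
)


@dataclass
class VersionInfo:
    """Hypothetical container for the five pieces of version information."""
    date_modified: date              # 1. a defined date: last changed by the author
    modified_by: str                 #    who made the change
    change_note: str                 #    why the change was made
    identifiers: Dict[str, str] = field(default_factory=dict)  # 2. e.g. handle, DOI
    version_number: Optional[str] = None   # 3. as supplied by the content creator
    version_label: Optional[str] = None    # 4. a term from a documented taxonomy
    description: str = ""                  # 5. free-text status and relationships

    def label_is_documented(self) -> bool:
        """True if no label is used, or the label comes from the taxonomy."""
        return self.version_label is None or self.version_label in VERSIONS_LABELS

    def summary(self) -> str:
        """One-line statement in which the date is explicitly defined."""
        parts = [f"Date modified by author: {self.date_modified.isoformat()}"]
        if self.version_label:
            parts.append(f"Status: {self.version_label}")
        if self.version_number:
            parts.append(f"Version {self.version_number}")
        return "; ".join(parts)


# Sample values are invented for the example.
info = VersionInfo(
    date_modified=date(2008, 3, 14),
    modified_by="A. Author",
    change_note="Revised after peer review",
    identifiers={"handle": "hdl:0000/example"},
    version_number="2.0",
    version_label="Accepted Version",
)
print(info.summary())
```

Keeping the date next to the name of the person who changed the object and the reason for the change reflects the recommendation that a date should never appear without a definition.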
How to Use Version Information
The best way to capture information about an object within a repository is within metadata fields which can be amended by repository staff. Richer metadata used for describing repository content will enable better export and harvesting leading to more accurate searching in repository search services and faster version identification. The framework provides information about how to use different metadata schemas and share their metadata with others by mapping fields consistently.
However, it is also critical to make this information directly available to an end-user within the object itself, because there are ways of accessing objects that bypass the associated metadata. If the information is embedded, it cannot be lost, whereas metadata is not always available: for example, because the object was accessed directly through an Internet search; because it has been saved locally, breaking the connection to the repository; or because the standard format provided by a cross-repository search service has not replicated the repository’s version information.
VIF has therefore strongly recommended that repositories systematically implement at least one of the following solutions to embed versioning information into objects:
1. ID Tags and Properties Fields
This opportunity already exists in a number of object types, but content creators need to be encouraged to complete the fields or the repository staff must make the decision to add information at deposit.
2. Cover Sheets
Implementing consistent cover sheets across all objects in a repository, or across a group of them, not only provides detailed, standardised version information but also offers an opportunity to link to other information such as copyright details. However, without an automated system, this can prove to be a time-consuming and resource-intensive process.
3. File-naming Conventions
A consistent system of file-naming can provide uniformity in an obvious place, although the amount of information that can be provided is manifestly limited.
4. Watermarks
The use of a watermark is an effective but unobtrusive way of planting certain bits of metadata into the object. It can be an automated process and can be used on a number of different types of digital object.
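Two of these solutions, file-naming conventions and cover sheets, lend themselves to simple automation. The sketch below is one possible approach under stated assumptions: the filename pattern (author, short title, date modified, status) and the cover sheet wording are invented for illustration and are not prescribed by the framework.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Lower-case the text and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def versioned_filename(author: str, title: str, modified: date,
                       status: str, extension: str) -> str:
    """Hypothetical convention: author_title_YYYYMMDD_status.ext"""
    stem = "_".join([slug(author), slug(title),
                     modified.strftime("%Y%m%d"), slug(status)])
    return stem + "." + extension.lstrip(".")


def cover_sheet_text(title: str, author: str, modified: date, status: str,
                     identifier: str, rights_note: str = "") -> str:
    """Plain-text cover sheet; a real system would render this into the object."""
    lines = [
        f"Title: {title}",
        f"Author: {author}",
        f"Date modified by author: {modified.isoformat()}",
        f"Version status: {status}",
        f"Persistent identifier: {identifier}",
    ]
    if rights_note:
        lines.append(f"Rights: {rights_note}")
    return "\n".join(lines)


# Sample values are invented for the example.
name = versioned_filename("Smith", "Open Access Trends",
                          date(2008, 1, 15), "Accepted Version", "pdf")
sheet = cover_sheet_text("Open Access Trends", "Smith", date(2008, 1, 15),
                         "Accepted Version", "hdl:0000/example")
print(name)
print(sheet)
```

The filename carries only a few fields, which matches the observation that the amount of information a filename can hold is limited; the cover sheet has room for the fuller statement.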
Linking Objects in the Repository
Implementing a consistent understanding of how to link objects is a difficult task that raises many questions with no obvious answers. Are all outputs from one research project or wider body of work versions of each other, and if not, where is the line drawn? How should one define what constitutes a ‘project’ or the extent of the ‘body of work’? Are objects with identical content but different file formats versions of each other? When organising the repository, when, why and how should records be linked, and when should one record contain more than one object?
It is important to understand the conceptual model of how records link in the software. Although a repository manager cannot easily change this model, it should be considered when selecting the software system.
VIF has recommended IFLA’s Functional Requirements for Bibliographic Records (FRBR) [18] conceptual model as a useful way of understanding version relationships. The FRBR model presents objects in a structural tree, moving from conceptual levels of the ‘work’ through different expressions and manifestations down to distinct physical or electronic copies.
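The tree described above can be sketched as four nested levels. This is a simplified illustration rather than a full FRBR implementation: the class and attribute names, and the example URLs, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    location: str  # a distinct physical or electronic copy


@dataclass
class Manifestation:
    file_format: str  # e.g. "PDF" or "Word"
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    label: str  # e.g. "Accepted Version"
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    expressions: List[Expression] = field(default_factory=list)

    def outline(self) -> List[str]:
        """Return the tree as indented lines, one level per two spaces."""
        lines = [self.title]
        for expression in self.expressions:
            lines.append("  " + expression.label)
            for manifestation in expression.manifestations:
                lines.append("    " + manifestation.file_format)
                for item in manifestation.items:
                    lines.append("      " + item.location)
        return lines


# Invented example: one work, two expressions, copies in two formats.
work = Work("Open Access Trends", [
    Expression("Accepted Version", [
        Manifestation("Word", [Item("https://repository.example.org/1/a.doc")]),
        Manifestation("PDF", [Item("https://repository.example.org/1/a.pdf")]),
    ]),
    Expression("Published Version", [
        Manifestation("PDF", [Item("https://repository.example.org/2/p.pdf")]),
    ]),
])
print("\n".join(work.outline()))
```

In this arrangement a change of file format appears as a new manifestation of the same expression, whereas a change of content appears as a new expression of the same work.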
VIF has supported the development and promotion of several application profiles currently available or in development (see the Sharing Standards section of the Framework [19] for details about the individual profiles). Funded by JISC, each of these application profiles is a metadata specification for describing a digital object type, and most have been created using the Dublin Core metadata scheme as a basis for creating a FRBRised structure in which to store metadata.
The use of an application profile helps a cross-repository search tool to aggregate more consistent, richer metadata from repositories, and helps with version identification because the profile provides metadata down to the item level.
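To give a flavour of version information expressed in Dublin Core, the following sketch builds a small XML record using the standard DCMI terms `modified` and `isVersionOf`. It is plain Dublin Core, not one of the JISC application profiles; the `record` wrapper element, the function name and the sample identifiers are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Dublin Core namespaces as published by DCMI.
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", DCTERMS)


def version_record(title: str, creator: str, modified: str, identifier: str,
                   is_version_of: Optional[str] = None) -> ET.Element:
    """Build a record; 'record' is a hypothetical wrapper element."""
    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    ET.SubElement(record, f"{{{DC}}}identifier").text = identifier
    # A defined date: when the author last modified the object.
    ET.SubElement(record, f"{{{DCTERMS}}}modified").text = modified
    if is_version_of:
        # Points at the object of which this one is a version.
        ET.SubElement(record, f"{{{DCTERMS}}}isVersionOf").text = is_version_of
    return record


# Sample values are invented for the example.
record = version_record(
    title="Open Access Trends",
    creator="Smith, A.",
    modified="2008-03-14",
    identifier="hdl:0000/example-2",
    is_version_of="hdl:0000/example-1",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Using a named term for the date, rather than a bare date element, is one way of keeping the date defined when the record is harvested by another service.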
Whether the FRBR model is taken up or not, VIF recommends that repository systems are able at least to link objects that have a version relationship, and to provide a flexible architecture that allows objects to be stored and displayed in a way that reflects their version status in a workflow. The framework highlights the following examples:
- All objects have their own record, and no records are linked together or make reference to other records.
- All objects have their own record, but if there is a relationship between two objects this is recorded by the repository and displayed to the user. This relationship may not be described, but one or both repository records refer to the other.
- Versions created as part of the development of a piece of work can be stored in a linear fashion, with one version being the latest, and previous versions linked to within that record.
- Versions of different format, such as a Word document and a PDF of the same thing, are stored in one record.
For more complex relationships, it might be useful to consider having the records stored separately, each with its own metadata and attributes, but linked, using categories such as ‘supplemental material’:
- A research paper that draws on a small set of survey results that are saved as a spreadsheet. Both are submitted for inclusion in the repository. Should the data file have its own record? Preferably not. It should be able to be stored in a record which represents the body of work of which both pieces are components.
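The linking arrangements listed above can be sketched with a simple record structure. This is an illustration under assumptions, not a description of any repository platform: the `Record` fields and the helper function are hypothetical. Each record may hold several files (different formats of the same version), may point to the version it supersedes, and may describe its relationship to other records.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Record:
    record_id: str
    title: str
    files: List[str] = field(default_factory=list)   # e.g. a Word file and a PDF
    previous: Optional[str] = None                    # id of the superseded version
    related: Dict[str, str] = field(default_factory=dict)  # record id -> description


def version_history(records: Dict[str, Record], latest_id: str) -> List[str]:
    """Walk back from the latest record; returns ids, newest first."""
    history: List[str] = []
    seen = set()
    current: Optional[str] = latest_id
    # Stop at the first version, at an unknown id, or if links form a loop.
    while current in records and current not in seen:
        seen.add(current)
        history.append(current)
        current = records[current].previous
    return history


# Invented example: three linear versions plus a linked survey data file.
records = {
    "r1": Record("r1", "Paper (draft)", ["paper_v1.doc"]),
    "r2": Record("r2", "Paper (submitted)", ["paper_v2.doc"], previous="r1"),
    "r3": Record("r3", "Paper (accepted)", ["paper_v3.doc", "paper_v3.pdf"],
                 previous="r2", related={"d1": "supplemental material"}),
    "d1": Record("d1", "Survey results", ["survey.xls"],
                 related={"r3": "data underlying the paper"}),
}
print(version_history(records, "r3"))
```

Recording the relationship in both directions, as with `r3` and `d1` here, means a user who arrives at either record can discover the other.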
Conclusion
Implementing versioning best practice within a repository does not have to be difficult or too time-consuming, but it does require thought and planning, which will pay dividends over time.
There is a wide range of repositories, all with different purposes, dealing with a range of complex versioning issues and most definitely with a wide spectrum of resourcing available. The framework has been designed not to be prescriptive, but rather to provide a whole range of guidance and advice from which all repository managers and software developers can pick and choose. Even a person running a repository on a part-time basis should find some useful information to help improve version identification within the repository.
Although there are some clear recommendations given within the framework, much of the guidance is balanced, for example, presenting both pros and cons of the different solutions so that a repository manager can judge what is most appropriate for the repository in question. Repository managers should review the guidance and implement what they are able to within their repositories.
Whatever the approach taken, there are some key principles to follow to enhance end-users’ trust and confidence in repositories and in knowing that the digital objects they access are appropriate for their purposes.
- All information provided in the repository should be as transparent as possible.
- Any solution adopted should be implemented consistently.
- Decisions made should be made explicit; policies should be recorded and user education and communication should be included in procedures.
Bearing these guiding principles in mind will help improve a repository, provide a user-led approach and enhance trust in the repository’s content.
References
1. The Version Identification Framework http://www.lse.ac.uk/library/vif/
2. JISC Repositories and Preservation Programme http://www.jisc.ac.uk/whatwedo/programmes/programme_rep_pres.aspx
3. London School of Economics and Political Science (LSE) http://www.lse.ac.uk/library/
4. Science and Technology Facilities Council (STFC) http://www.scitech.ac.uk/
5. University of Leeds http://www.leeds.ac.uk/
6. Erasmus University Rotterdam http://www.eur.nl/english/
7. The Version Identification Framework (VIF) http://www.lse.ac.uk/library/vif/
8. Puplett, D. “Version Identification: A Growing Problem”, January 2008, Ariadne, Issue 54 http://www.ariadne.ac.uk/issue54/puplett/
9. Molley, S. “Which One’s Which? Understanding Versioning in Repositories”, April 2008, Ariadne, Issue 55 http://www.ariadne.ac.uk/issue55/vif-wrkshp-rpt/
10. Cave, P. “Work package 2: Requirements exercise - report of a survey of Academics and Information Professionals”, 2007, p.3 http://www.lse.ac.uk/library/vif/documents.html
11. VIF Importance of Version Identification page http://www.lse.ac.uk/library/vif/Problem/importance.html#what
12. Recommendations of the NISO/ALPSP Working Group on Versions of Journal Articles, 2006 http://www.niso.org/committees/Journal_versioning/Recommendations_TechnicalWG.pdf
13. VERSIONS Project http://www.lse.ac.uk/library/versions/
14. Rumsey, S. et al, “Scoping Study on Repository Version Identification (RIVER) Final Report”, Rightscom, 2006 http://www.jisc.ac.uk/uploaded_documents/RIVER%20Final%20Report.pdf
15. OpenDOAR http://www.opendoar.org/tools/en/policies.php
16. Content Creators section of the framework http://www.lse.ac.uk/library/vif/Framework/ContentCreation/index.html
17. VERSIONS Toolkit http://www.lse.ac.uk/library/versions/VERSIONS_Toolkit_v1_final.pdf
18. IFLA FRBR Final Report http://www.ifla.org/VII/s13/frbr/frbr.pdf
19. Sharing Standards section of the Framework http://www.lse.ac.uk/library/vif/Framework/SoftwareDevelopment/standards.html
Acknowledgements
The VIF Project team included Dave Puplett, LSE, Catherine Jones, STFC and Paul Cave, University of Leeds. Each of the team members significantly contributed to the thought and development of the project, the framework and therefore also to this article.