Web Magazine for Information Professionals

VIF: Version Identification Workshop

Sarah Molloy reports on a half-day workshop on the use of the Version Identification Framework, held in Hatton Garden, London on 22 April 2008.

The Version Identification Framework Project (VIF) [1] is a project partly funded by the Joint Information Systems Committee (JISC) and is in partnership with the Science & Technology Facilities Council, the University of Leeds and Erasmus University, Rotterdam. The project was undertaken in order to investigate the growing issues surrounding the identification of revised or related materials being deposited in repositories, with the aim of providing a framework for consistent identification. Two surveys undertaken by the Project in late 2007 identified a number of problems related to version identification and the impact this might have on user perceptions of reliability; it was these issues that the VIF hoped to address. The framework is intended as best practice guidance for the repository community, software developers and content creators. The VIF workshop [2] held on Tuesday 22 April 2008 was an opportunity to introduce the VIF and the results of the project surveys, whilst also providing a forum for discussion of the issues and the proposed solutions. It was also hoped that the workshop would encourage acceptance of the Framework as a standard for version identification.

Session 1

Introduction to Versioning and the VIF Project

Frances Shipsey, VIF Project Director based at LSE, opened the workshop on behalf of the VIF Project Partners, welcoming participants and handing over to Jenny Brace, VIF Project Manager based at LSE, to begin the first session.

Jenny gave a brief overview of the VIF Project. She then went on to elaborate on the audience for which guidance is designed before returning to look at the definition of a version, and how such a definition might include both linear, developmental versions as well as related versions, such as a conference paper in relation to a published article.

Jenny also spent some time looking at examples of potential versions, seeking opinions from the audience as to what constitutes a version. This highlighted the different understandings of what people call versions, and showed that it should not be assumed that all repository users will perceive a version in the same way. For example, minor changes to an object to correct errors might be considered versions, whilst significant changes affecting the overall status of the object might be called revisions.

She then went on to investigate the difference between a version object, perhaps the latest iteration of a report being prepared for publication, and a version relationship which might identify two distinct but related objects, perhaps a published journal article and the dataset used in the supporting research. How would they be defined within the VIF and what distinction, if any, could or should be made?

Paul Cave, Project Officer, VIF at University of Leeds, opened his presentation by explaining the research phase of the Project, detailing how evidence had been gathered via two surveys, one designed for Information Professionals and the other for academics, and then outlining the key outcomes. Most significant was the response from academics regarding their own ability to identify versions of objects in an institutional repository (IR): only 5% found this to be easy. However, when asked whether they were satisfied with the way in which they identified their own versions, 65% of academics stated that they were broadly happy with their own arrangements.

There was a strong indication from the survey results that academics, depending on discipline, were keen for the version held in an IR to be the final version, and not earlier drafts or iterations.

Dave Puplett, VIF Project and Communications Officer at LSE, then discussed the stages undertaken to identify possible solutions to the identified issues. Firstly, whilst admitting that the majority of objects currently held in institutional repositories are documents, Dave highlighted the need to be broad when developing a version framework, since a growing number of repositories accept alternative types of media, for example, audio-visual clips and images.

The review of proposed versioning solutions found no single tool that would resolve all the main issues surrounding version identification. However, a combined approach was identified and strongly supported by respondents to both surveys. Dave highlighted that many respondents, in line with current beliefs about what is already being deposited, placed higher importance on being able to identify the final version of a digital object than on any subsequent versioning that might take place. It was also necessary to consider that there is a certain level of detachment from the issues surrounding versioning on the part of academics. Whatever framework was to be put in place needed to be simple and constructive in order at least to maintain the status quo rather than alienate supportive academics by making unacceptable demands. In the next part of his presentation, Dave began to look at the VIF in detail, pointing to five key pieces of information that are required, either in isolation or in combination, in order to identify a specific version. These are: defined dates, identifiers, version numbering, version labels or taxonomy, and text description.
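As a rough illustration of how those five pieces of information might sit together in one record, here is a minimal Python sketch. The class name, field names and sample values are all hypothetical, chosen only for this example; the Framework itself does not prescribe this structure.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class VersionInfo:
    """Hypothetical record holding the five kinds of version information."""
    defined_date: Optional[date] = None    # e.g. date last modified
    identifier: Optional[str] = None       # e.g. a repository handle or URI
    version_number: Optional[str] = None   # e.g. "1.2"
    version_label: Optional[str] = None    # e.g. "Author's final draft"
    description: Optional[str] = None      # free-text note on what changed

    def summary(self) -> str:
        """Join whichever pieces are present into one display string."""
        parts = [
            self.version_label,
            f"v{self.version_number}" if self.version_number else None,
            self.defined_date.isoformat() if self.defined_date else None,
            self.identifier,
            self.description,
        ]
        return "; ".join(p for p in parts if p)
```

Every field is optional because the pieces may be used in isolation or in combination; the summary simply skips whatever is missing.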

He also looked at the benefits of embedding the version information in the object itself to ensure that this is not lost if the hosting repository metadata is by-passed. The VIF suggests that a combination of tools be used in order to identify the object and its version. Such tools include coversheets, which are already in use in some institutional repositories and are added manually, or perhaps watermarking of the object, which can be done automatically. Simple solutions like having a uniform system for file naming and introducing ID tags and properties were also suggested.
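A uniform file-naming system of the kind mentioned above could be as simple as the following Python sketch. The naming pattern (title, version number, date) and the function name are assumptions made for illustration, not a convention laid down by the VIF.

```python
import re
from datetime import date


def versioned_filename(title: str, version: str, modified: date, ext: str) -> str:
    """Build a file name of the form <slug>_v<version>_<YYYY-MM-DD>.<ext>."""
    # Lower-case the title and collapse every run of non-alphanumeric
    # characters into a single hyphen, trimming hyphens at either end.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}_v{version}_{modified.isoformat()}.{ext}"
```

Because the version number and date travel with the file name, some version information survives even when the file is retrieved without its repository metadata.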

Dave provided a brief overview of the recommendations for each repository user type in concluding this first session, providing some points of reference for what repository managers, content creators and software developers should consider doing in order to effect a change of policy more broadly in accepting the VIF. He concluded by inviting people to access the framework itself and to read the recommendations in more detail.

Session 2

Breakout Sessions

Two breakout sessions had been prepared: the first looked at metadata and the implications for interoperability and data mining, whilst the second focussed on strategy and advocacy, investigating the importance of preparing both repository managers and academics or content creators for the need for versioning and clear, precise identification. I chose to attend the second of these sessions, wanting to understand the practical aspects of advocating the use of the VIF, and what strategic planning might be required in advocating its use to any steering committee or group charged with managing and implementing an institutional repository.

Dave Puplett opened this session with a look at the specific recommendations laid out in the VIF. Recommendations for repository managers included clarifying what the repository was intended for, since this could affect which level of version identification was adopted. He also emphasised the need to investigate whether the current installation of any repository software was capable of meeting the needs of version identification and, if not, what additional functionality might reasonably be expected or requested. Dave commented that, in some instances, it might not be possible to provide updates to current software and that this therefore had major implications for implementation at the local institutional level. Repository managers should also take into consideration the types or formats of any digital objects that they might include in their repositories, ensuring that a strategy for whatever version identification might be required was in place, even for formats that they did not currently accept for deposit.

In terms of advocacy, Dave highlighted that it was important to enable and support content creators to adopt good versioning practice. In developing a relationship with the depositor, repository managers would then be able to refer ambiguously identified versions back to them for clarification. There was some discussion regarding the necessity and desirability of checking each deposit, particularly in a self-deposit environment, and the implications for the workload of repository staff in both mediated and self-deposit were a cause for concern.

Lastly, Dave concentrated on the role of advocacy in relation to content creators, noting that there may be considerable difficulties to overcome if authors, for example, needed to be convinced of the benefits of depositing in the first place. Dave pointed delegates to the content creators section of the Framework Web site, which offers good practice guidelines and links to version management software tools.

Peter Millington, Technical Officer, OpenDOAR (Directory of Open Access Repositories) [3] spoke next. OpenDOAR is one of the SHERPA [4] services based at Nottingham University. In its mandate, OpenDOAR is tasked not only with providing access to open access repository content via its own interface, but also with advocacy and support in best practice guidance for repository management. Peter’s presentation opened with an overview of the current OpenDOAR policy tool that enables institutions to define a policy on sharing and re-use of data for their repository. As a result of the Version Identification Framework and the previous VERSIONS Project, additional options for including version identification in policies were being implemented. Peter demonstrated these options, which included deposited item status (for example, draft, submitted, published) and a Version Control Policy option that allows institutions to identify under what circumstances changes to versions, or additional versions, might be deposited.

The final part of the breakout session was a presentation by Peter van Huisstede, RePub, Erasmus University, Rotterdam (EUR) on the implementation of the Subversion version control software in the IR at Rotterdam. Subversion is normally used to control versioning of software code as it is being written, but EUR have built it into the back end of their repository to enable version control of the metadata records for all objects in the repository. In this way, each element of the metadata is a separate entity within the Subversion repository, running concurrently with the main repository. The benefit of keeping these two separate is that managing and changing version information with regard to single elements is more straightforward. A single change can be made in the Subversion repository that will be reflected globally within the main repository. For example, if the name of a department within a University changes, then the change can be made within the Subversion repository and, because that element of the data in the institutional repository is identified only by a unique number, this change will be reflected in each instance of that element in the IR. In a repository with over 7000 records, this is a powerful tool.
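The principle behind the department example, storing a shared element once and referring to it by a unique number, can be sketched in a few lines of Python. This is only an illustration of the idea: it does not use Subversion and is not EUR's implementation, and the table, record layout and names are all hypothetical.

```python
# Hypothetical authority table: one entry per department, keyed by a unique number.
departments = {101: "Department of Economics"}

# Hypothetical repository records store only the number, never the name itself.
records = [
    {"title": "Paper A", "department_id": 101},
    {"title": "Paper B", "department_id": 101},
]


def display_department(record: dict) -> str:
    """Resolve the department name at display time from its unique number."""
    return departments[record["department_id"]]


# Renaming the department is a single change in a single place ...
departments[101] = "School of Economics"

# ... and every record that refers to it now shows the new name.
names = [display_department(r) for r in records]
```

No record had to be edited for the new name to appear, which is what makes the approach attractive across several thousand records.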

Session 3

Software

The final session of the workshop was a set of presentations looking at the potential for development in repository software to enable versioning to be implemented. Catherine Jones, VIF Project Officer from the Science and Technology Facilities Council (STFC), presented some recommendations for software development that would facilitate the introduction of version identification to institutional repositories, and the adoption of the VIF at institutional level. These recommendations focussed on the desirability of enabling versioning in the software, or introducing it where it was not currently available, and particularly highlighted the need for a facility to identify duplicate objects, both within the institutional repository and across external repositories or co-operatives. Support for Application Profiles within the software metadata was essential if harvesting and cross-searching were to take place, meaning that a standardised structure for metadata must be accepted.

Tim Brody, University of Southampton, representing EPrints, gave a short presentation on the potential for versioning enhancements to the EPrints software, outlining two possible models. In the first model, separate EPrints records representing each digital object could be linked together by a version statement (perhaps through a link in each record), whilst the second model used a single EPrints record in which to display all linked objects.

Richard Green, University of Hull, representing Fedora, started by outlining the way in which Fedora software deals with objects. He explained that Fedora is not delivered to the customer with an interface, and that objects are dealt with via their attributes, using metadata and content datastreams. Within the content datastream, versions could be both deposited and identified.

Richard made two recommendations for Fedora users: that a single object would be most suited to developments (perhaps revisions) of the same thing, whilst different forms of content (perhaps the author’s final draft and the published version) might be better recorded as separate objects with a relationship linking them.

Jim Rutherford, HP Labs, representing DSpace, began by explaining that the current edition of DSpace is not able to include versioning information or manage versions and related objects in this context. It is able to track changes via logging and to record the provenance of metadata. These can be used to build up a history of an object but not to manage simultaneous versions.

Whilst the three approaches to versioning issues were different, there was consensus in the belief that it is not the role of the software to enable versioning, but rather that its role is to display the relationship information that helps end-users to identify the version that they are viewing. The identification of versions is the responsibility of content creators and repository managers in following good practice by identifying versions through consistent methods.

The final part of the workshop was given over to a panel discussion chaired by Frances Shipsey, who was accompanied by members of the VIF project team and the software representatives.

It was stated that in the most recent installation of EPrints, by default, the most recently added version of an item was identified as the latest version, which may not always be the case if repository managers were to begin retrospective deposits of materials. Tim Brody was asked what could be done to resolve this potential difficulty. He stated that it depended on the way in which objects were related within the database, and that it should be possible to resolve the matter.
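The difficulty raised here can be shown with a small Python sketch: after a retrospective deposit, the item added most recently is not necessarily the latest version. The records and field names below are hypothetical and say nothing about how EPrints stores or relates its objects internally.

```python
from datetime import date

# Hypothetical records: an older draft is deposited retrospectively,
# so deposit order no longer matches the order in which versions were created.
versions = [
    {"label": "Published version",
     "version_date": date(2007, 6, 1), "deposited": date(2008, 1, 10)},
    {"label": "Early draft",
     "version_date": date(2006, 3, 15), "deposited": date(2008, 4, 22)},
]

# Treating the most recently deposited item as the latest picks the draft ...
latest_by_deposit = max(versions, key=lambda v: v["deposited"])

# ... whereas an explicit, defined version date picks the published version.
latest_by_version_date = max(versions, key=lambda v: v["version_date"])
```

Recording a defined date for each version, rather than relying on the deposit date, is one way in which the ordering could be made independent of when items happen to be added.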

Balviar Notay asked whether the Scholarly Works Application Profile (SWAP) [5] could be embedded into EPrints. Tim Brody replied that customers needed to understand better how it might work. There was a project under way at University of Warwick [6] to try to embed SWAP.

Conclusion

The workshop was drawn to a close by Jenny Brace, who offered a few final thoughts for delegates to consider. Version identification was an accepted problem in the repositories community. However, whilst there was no single solution, there were some simple procedures that could be followed: she felt that ensuring the identification system was transparent, consistent and open, regardless of the access route to material, was essential. Moreover, deciding in advance what the repository was intended to achieve was of paramount importance.

On a personal level, I attended this workshop to gain a clearer understanding of why versioning was important, and what impact it could have both at the institutional level and across shared repository systems. The workshop not only presented the evidence, but also offered a forum for discussion of the principles of version identification, of its importance to the repositories community, and of its importance in defining the provenance of objects. It was also an opportunity to discuss with colleagues what procedures and developments they might expect to take back to their own institutions for discussion and implementation.

My institution is in the early stages of building its repository. For our repository team, the main focus is on how to get the project moving, what our strategy might be for highlighting the benefits of an institutional repository to our research community, and how exposing research outputs to a wider audience can be of significant benefit to researchers and the institution alike. Given that we are at such an early stage of development, this is an excellent time to be considering version identification. For us, good practice in version identification should and can be built into our own repository policies and procedures from the beginning.

References

  1. LSE Version Identification Framework Web Site http://www.lse.ac.uk/library/vif/About/index.html
  2. LSE Version Identification Framework Workshop Web Site http://www.lse.ac.uk/library/vif/Project/workshop.html
  3. OpenDOAR Web Site http://www.opendoar.org
  4. SHERPA Web Site http://www.sherpa.ac.uk
  5. Scholarly Works Application Profile, JISC http://www.jisc.ac.uk/whatwedo/programmes/programme_rep_pres/swap.aspx
  6. Warwick Research Archive Project http://www2.warwick.ac.uk/services/library/main/research/instrep/erepositories/

Author Details

Sarah Molloy
Repository Administrator
Queen Mary University of London

Email: s.h.molloy@qmul.ac.uk
Web site: http://www.library.qmul.ac.uk/e-resources/oa.htm
