Repository Software Comparison: Building Digital Library Infrastructure at LSE

Ed Fay

Ed Fay presents a comparison of repository software that was carried out at LSE in support of digital library infrastructure development.

Digital collections at LSE (London School of Economics and Political Science)[1] are significant and growing, as are the requirements of their users. LSE Library collects materials relevant to research and teaching in the social sciences, crossing the boundaries between personal and organisational archives, rare and unique printed collections and institutional research outputs. Digital preservation is an increasing concern alongside our commitment to continue to develop innovative digital services for researchers and students. The Digital Library Management and Infrastructure Development Programme is a cross-library initiative to build our capacity to collect, manage, preserve and provide access to our digital collections. The initial phase of the programme has investigated our collections, their users and best practice in the wider community and produced functional specifications for testing against the current best-of-breed open source repository software. Following this comparison we made a recommendation for the implementation of a repository system to operate at the core of our digital library. This article gives a summary of that comparison.

Methodology

Recent studies such as those by the Repository Support Project (2009) [2], the National Library of Medicine (2009) [3] and Purdue University (forthcoming) [4] have carried out comparisons of repository software, but while these were useful in stimulating our thinking, none provided a comprehensive answer to our question: which is the best software to meet our needs?

In previous studies, two different methodologies seem to have emerged:

Like-for-like comparisons of repository features (comparing for example supported metadata schemas or underlying databases);
Comparisons of repository features against local functional requirements.

While like-for-like comparisons can provide the most reusable data, they also tend to flatten the most interesting differences between the platforms, making it hard to identify their relative strengths and weaknesses in different contexts. For us, context is everything and it provided the basis for producing our own set of functional criteria for testing, placing us more towards the latter category of methodologies.

The comparison of repository software was the culmination of a 9-month phase of our infrastructure development programme that progressed through a series of inter-related workpackages:

Digital collections audit (including a risk assessment [5])
Initial user requirements (these are evolving in parallel with other projects looking at specific areas of digital collections, such as born-digital archives)
Investigation of best practice in the wider community
(Draft) metadata specification
Functional specification
Repository software comparison

Parallel work is underway to look at digital collections policies, including born-digital deposit agreements and digital preservation policies. These will feed additional requirements into the implementation phase of the programme, but by early 2010 we had gathered sufficient information to understand our core functional criteria and to carry out a high-level comparison of alternative repositories.

Digital Collections

Digital collections for which the Library has preservation responsibility include the outputs from digitisation projects, born-digital archives and research outputs. Existing systems and processes are in place for their management, at varying degrees of maturity, but over time our scale of operation is anticipated to exceed our capacity for management in this heterogeneous environment.

The table below gives some details about our digital collections.

Material Type	Existing Systems	Functional Requirements
Digitisation projects: images; multi-page text; maps (including geo-data)	Presentation through institutional CMS, custom Web sites and 3rd-party provision.	Presentation through rich multi-media Web and mobile interfaces; discovery, search and browse; preservation storage of master image files
Born-digital archives: any file formats and structures	Storage on original physical media and server (RAID, tape backup). Description in archives management system.	Preservation storage of born-digital archive deposits; integration with existing collection management systems; future integration with library discovery systems and multimedia presentation interfaces
Research outputs: mostly PDF, office formats; increasingly video, audio	Comprehensive collection management in mature EPrints repository.	‘Traditional’ IR functionality including cataloguing, ingest, discovery, presentation and harvesting to third parties

Table 1: Details of LSE Library digital collections

In the future, the preservation remit of the Library is likely to expand further to include additional collections and formats, some of which are nearer on the horizon than others, such as research data, learning materials (currently managed in the institutional Virtual Learning Environment (VLE)) and Web archiving (of the institution’s Web presence and of Web-published material by archive depositors).

Digital Library Technical Aims

Our digital library development is designed to produce unified infrastructure to support the management, preservation of and access to all of our digital collections. This will involve building shared functionality where possible (generally at ‘low’ system levels such as storage, identification, bit-security and format management) and more targeted functionality where necessary (generally at ‘high’ system levels such as access interfaces for discovery and presentation). All collections will benefit from shared approaches to digital object management (including economies of scale for the institution as collections grow over time) while retaining enough flexibility to innovate with specific collections or formats.

We realised that a modular approach to developing this functionality was preferable to a monolithic approach, and this was confirmed by our review of best practice in institutions with digital collections of similar size and diversity. A modular architecture allows independence of functional components, a separation of digital objects from particular software installations and an iterative approach to developing capacity in different functional areas according to evolving demands and requirements, such as projects involving content creation.

A modular architecture also facilitates interfaces with existing library systems such as catalogues holding descriptive metadata and access points providing federated discovery across collections. This also reduces our chances of producing another ‘silo’ of isolated content. Within this ecosystem we identified specific areas to prioritise for functional development. They turned out to be digital object storage, management and preservation in the first instance, followed by ingest and management functionality for curators and access functionality for end-users. Having identified these high-level requirements, considered our architecture and outlined our priorities, we were in a position to capture initial functional requirements in preparation for our comparison of repository software.
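The separation of digital objects from particular software installations can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ObjectStore` interface, the `InMemoryStore` stand-in and the `migrate` helper are hypothetical names, not part of any repository software or of the LSE design.

```python
from typing import Iterable, Protocol


class ObjectStore(Protocol):
    """Hypothetical storage interface that other components depend on."""

    def put(self, identifier: str, content: bytes) -> None: ...
    def get(self, identifier: str) -> bytes: ...
    def identifiers(self) -> Iterable[str]: ...


class InMemoryStore:
    """Stand-in storage module; a real one would wrap a repository or file system."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, identifier: str, content: bytes) -> None:
        self._objects[identifier] = content

    def get(self, identifier: str) -> bytes:
        return self._objects[identifier]

    def identifiers(self) -> Iterable[str]:
        return list(self._objects)


def migrate(source: ObjectStore, target: ObjectStore) -> int:
    """Copy every object between stores using only the shared interface."""
    count = 0
    for identifier in source.identifiers():
        target.put(identifier, source.get(identifier))
        count += 1
    return count
```

Because `migrate` relies only on the shared interface, either store could in principle be replaced without touching the components that use it, which is the kind of independence a modular architecture is meant to provide.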

Repository Comparison

In formulating our criteria we made use of previous internal work carried out during the implementation of our institutional repository, which was based on a checklist provided by the Open Society Institute (2004) [6]. This document is out of date compared with recent advances in repository software; its main use was the historical insight it provided into how our requirements have evolved. Other sources we consulted included ‘York Digital Library Project Requirements Specification’ (2008) [7] and ‘Wellcome Library requirements for a digital object repository’ (2008) [8].

We arrived at 24 functional criteria, grouped into 7 functional areas which correspond approximately to OAIS functional classes [9]. On the basis of these criteria, we devised a series of tests to carry out against out-of-the-box installations of each software package. We installed the latest production release of each repository in a virtual machine with identical specifications, to allow ease of testing and ease of reset to default configurations. The repository versions we tested were DSpace 1.6, EPrints 3.2.0 and Fedora 3.3.

Our approach to scoring a repository on a functional criterion was to give a colour rating (red, yellow, green) and explanatory text giving our reasons for the scoring. This approach to classification was intended to give an indication of how well each repository met our needs:

Green: the repository would satisfy our requirements in that area out-of-the-box, perhaps with minimal configuration of the repository software;
Yellow: the repository could be made to satisfy our requirements in that area, but it would take technical development to extend the repository software;
Red: the repository does not satisfy our requirements or the repository software could not practically be extended to satisfy our requirements.

We deliberately avoided a numerical comparison of repository features and a selection based on a cumulative total to allow us to compare relative strengths and weaknesses in different functional areas. It is important to note that if a repository did not provide specific functionality out-of-the-box, but was designed to support that functionality with the addition of third-party tools or applications, it would score green rather than yellow or red. We realised early on that our purpose was to identify whether the repository could function as a component in our digital library infrastructure, rather than provide a comprehensive solution in all functional areas.
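The scoring scheme described above could be recorded in a structure along the following lines. This is a sketch under stated assumptions rather than the tooling used in the comparison: the `Assessment` record, the three yes/no inputs to `rate` and the `ratings_by_area` helper are all hypothetical, and no actual ratings are reproduced.

```python
from dataclasses import dataclass
from enum import Enum


class Rating(Enum):
    GREEN = "green"    # meets the requirement out-of-the-box or with minimal configuration
    YELLOW = "yellow"  # could meet it, but only with technical development
    RED = "red"        # does not meet it and could not practically be extended


@dataclass
class Assessment:
    repository: str
    area: str       # functional area, e.g. "Storage"
    criterion: str
    rating: Rating
    reason: str     # explanatory text recorded alongside every rating


def rate(native: bool, designed_for_third_party: bool, extensible: bool) -> Rating:
    """Map three yes/no observations (hypothetical inputs) onto the colour scale."""
    # Functionality the repository is designed to delegate to third-party
    # tools counts as green, as described in the text.
    if native or designed_for_third_party:
        return Rating.GREEN
    if extensible:
        return Rating.YELLOW
    return Rating.RED


def ratings_by_area(assessments: list[Assessment]) -> dict[str, dict[str, list[Rating]]]:
    """Group ratings by functional area and repository, with no cumulative total."""
    grouped: dict[str, dict[str, list[Rating]]] = {}
    for a in assessments:
        grouped.setdefault(a.area, {}).setdefault(a.repository, []).append(a.rating)
    return grouped
```

Grouping by functional area rather than summing scores keeps relative strengths and weaknesses visible, in line with the decision to avoid a cumulative total.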

Findings

Our key finding is that repository software packages are not equal. Competing solutions have relative strengths and weaknesses in different functional areas, most likely because they were designed to solve different problems. This might make a particular repository a useful tool and the natural choice in one context, but less useful and a poor choice in another. It is important to remember that our specific context will heavily skew these findings according to our functional requirements, and a perceived weakness when measured against our criteria may not in fact be a weakness in another context.

DSpace and EPrints represent a different approach to digital collections management compared with Fedora. They are both monolithic repositories that package solutions for multiple functional areas into one piece of software. They provide functionality not only for digital object storage and management, but also for ingest workflows and access front-ends. Both repositories are designed primarily as open access publication databases, which means they do not provide some required functionality for born-digital archives and digitised material:

data model to support complex object types;
repository-independent and flexible persistent identifier schema;
ingest workflows with integrated object packaging tools;
integrated bitstream preservation tools (although this is changing, particularly with new versions of EPrints);
repository-independent storage;
flexible and configurable access control;
extensible front-ends for rich presentation interfaces;

while some functionality that is provided is not suitable or necessary for our purposes:

ingest workflows with manual metadata capture;
insecure indexing engines for repository content;
inflexible front-ends for simple object structures and file formats.

Selection of either of these repositories would require significant development work to extend core functionality. We did not consider this to be a viable option, as it would:

increase the risk of destabilising other, dependent functionality;
increase the degree to which we are tied to the platform;
increase the difficulty of performing upgrades to latest version releases.

Fedora, meanwhile, does not bundle functionality into a monolithic core, and instead provides a flexible architecture that is designed to be customised to local requirements. In some cases this requires the use of additional software to provide a complete repository solution, which we deem to be beneficial as appropriate components can be selected independently according to requirements in specific functional areas. This approach is advantageous as all the above risks associated with monolithic solutions are mitigated or avoided:

core functionality is not modified, resulting in a stable repository core;
independence of components in discrete functional areas means we are not tied to any one piece of software;
upgrades of independent components can be performed as necessary.

Although this approach requires greater set-up and configuration in the short term, in the long term the costs are anticipated to be lower, and additional benefits should accrue from the ability to adapt more easily to evolving requirements for preservation and innovative access methods.

The full investigation has not been reproduced here, but the table below gives a listing of the functional areas and testing criteria, and a summary of our findings in each area.

Functional Area	Testing Criteria	Summary of Findings
Data Model	object structures; collections structures; external aggregations	DSpace and EPrints have restrictive data models (although DSpace is currently more flexible than EPrints) which reduces their ability to hold complex, structured objects. Both would require modification or significant customisation to support the range of digital objects in scope. Fedora is designed explicitly to be a flexible and extensible digital object store and would support the diversity of digital objects natively, but would require configuration.
Ingest, Data Management and Administration	Ingest: command-line interface; Web interface; machine interface; batch import; custom workflows; metadata capture. Data Management: bulk updates; audit history. Administration: CRUD (Create, Read, Update, Delete) objects	All repositories provide sufficient functionality to support flexible human- and machine-ingest for a range of materials with different workflows, although configuration would be required for all. All repositories provide sufficient functionality to support digital object management and administration operations on content stored within the repository. Fedora’s flexibility gives it a slight advantage for customisation of ingest workflows.
Descriptive Information (Metadata)	persistent identifiers; human-readable, hierarchical identifiers; support for standards; OAI compatibility	DSpace and EPrints suffer from a lack of support for identifier schemas (although EPrints is significantly worse) which tends to reduce options for making collection and object identifiers independent from the repository software and human-readable interpretations of collection hierarchies. Fedora provides sufficient flexibility in its identifier schemas and architecture to support independent, persistent and human-readable identifiers. All repositories support extensible metadata schemas for the implementation of descriptive and technical metadata (including format-specific and preservation metadata) and OAI compatibility.
Storage	file system structures; integrity checking	DSpace and EPrints suffer from a lack of attention to preservation storage, both requiring the maintenance of the repository software and environment to retrieve digital objects in the event of a hardware or software failure. Fedora supports the storage of digital objects independent of the repository, which is preferable for preservation purposes.
Access	batch export; Web front-end; API/URI schemas; indexing engine; access control approach; authorisation; authentication	DSpace and EPrints have limitations in their access functionality. Both provide out-of-the-box front-ends which are designed for open access publications, and which would require modification to support digitised collections and born-digital archives. Their machine interfaces to support the development of independent Web applications are also limited. Fedora comes with no front-ends out-of-the-box, which would require the use of third-party front-ends (of which several exist for Fedora) or custom in-house development. Fedora provides multiple machine interfaces to support the development of independent Web applications and the exposure of content to the Web of linked data. While all repositories provide sufficient functionality to control access to content at granular levels, Fedora’s use of XACML (Extensible Access Control Markup Language) makes its access control more easily configurable. All repositories come with support or plugins for authentication using LDAP (Lightweight Directory Access Protocol), Shibboleth and CAS (Central Authentication Service).

Table 2: Repository software comparison by functional area, testing criteria and summary of findings
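The ‘integrity checking’ criterion under Storage in Table 2 can be illustrated with a minimal sketch of a checksum manifest for files held on an ordinary file system. It is not how any of the three repositories implements integrity checking; the function names and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root, keyed by relative path."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose checksum has changed."""
    problems = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(relative)
    return problems
```

A manifest of this kind depends only on the files themselves, so it can be checked without the repository software being available, which is the property that makes repository-independent storage attractive for preservation.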

Recommendation

The infrastructure development team has recommended to our senior management that we implement Fedora at the core of our digital library. Our senior managers have accepted this recommendation and mandated the implementation. In line with our decision to develop a modular architecture, we will be iterating development in our prioritised functional areas over time, according to our increasing understanding of our requirements and growing collections. Initial development will implement a Fedora repository configured for in-scope content types, along with the necessary hardware environment and fully tested redundancy and back-up regimes. Subsequent development will then build functionality for ingest, management and access. This implementation forms part of the wider body of work of the digital library development, which includes significant effort devoted to user engagement, advocacy within the institution and internal communication amongst collection and technical specialists and the wider library staff and community.

Resource Implications

Our best practice review discovered that scales of resourcing for digital library teams vary according to the scale of operation of the host institution, but that certain roles and responsibilities are always clearly defined. This includes strategic planners, collection specialists, digital curators, systems administrators, technical developers and management. These roles may be combined into a single person, or distributed across a team. From a systems development point of view, we are anticipating ‘development humps’ where demand for technical skills will be increased, along with an overall increase in the longer term in the demand for technical skills to carry out system and collection maintenance.

We observed that DSpace and EPrints can be implemented with little additional burden on systems administrators and little to no technical development for a limited set of use cases — primarily open-access publications databases. However, once the use cases and collection types multiply, technical demands increase. By comparison, we observed that Fedora requires significant local set-up in order to achieve any useful functionality. This can include simple configuration, but it is also likely to involve the implementation or development of additional pieces of software as modules to provide specific functionality that does not come as standard.

In our context we observed a tipping point, where the necessary effort to extend an existing system (as in our case with EPrints) outstrips the effort to implement a new system that supports requirements more natively. In our case, this point has been reached, as our collections have multiplied and diversified to the point where it makes more sense in the long term to expend not insignificant effort in the short term, in order to build a core system that will serve as a platform for our digital library that is extensible and scalable.

During our investigations it was observed that maintaining digital library infrastructure is comparable in terms of scale and complexity to maintaining a traditional library management system, with additional peaks of development activity required to build functionality (as might be achieved by a supplier request in the case of an LMS). While arrangements for resourcing development work will and should vary according to the evolution of requirements, alongside projects and project funding, in the long term it is anticipated that digital collections infrastructure maintenance will require the assignment of permanent technical resource. This is in the nature of systems management as much as it is the result of the specific and peculiar requirements of digital collections management. LSE is not seeking to research and develop new technology; we are seeking to implement the best-of-breed solutions produced by that research in pioneering and national institutions. The technology is available, but implementation and maintenance are activities that must be resourced at the local level according to local requirements in support of local collections.

Conclusion

Repository software packages are not equal. Competing solutions have relative strengths and weaknesses in different functional areas, most likely because they were designed to solve different problems. This might make a particular repository a useful tool and the natural choice in one context, but less useful and a poor choice in another. At LSE, the volume and heterogeneity of digital collections are increasing, requiring both the extension of current capacity and the flexibility to adapt to evolving content types and requirements. This, coupled with existing systems carrying out aspects of collection management, pushed us naturally towards a solution of a certain kind.

Specific aspects of our context further supported our selection of repository software:

The heterogeneous nature of our collections and the pattern of emerging requirements naturally disposes us to a modular architecture, where functionality in different collection and technical areas is independent;
Although digital preservation is of increasingly urgent concern, we have systems already in place handling aspects of collection management, so we are not starting from scratch and attempting to develop a solution in all functional areas in one go;
Despite competing demands for resources, increased in intensity by the financial situation in the sector, we do have access to members of staff to carry out technical development and maintenance.

These are important considerations beyond a purely technical analysis, and they were factors in our selection—differences in this wider context could have affected our final choice.

Over the coming months, LSE Library will be working towards the implementation of its digital library. The progress of our infrastructure development will be described in a future issue.

Acknowledgments

Members of the digital library infrastructure development team: Mike Hallas, Mike McFarlane, Simon McLeish, Peter Spring, Neil Stewart, Nick White, Nicola Wright.

References

1. LSE Library (also known as the British Library of Political & Economic Science)
http://www.library.lse.ac.uk/
2. Repository Software Survey, Repository Support Project, March 2009
http://www.rsp.ac.uk/software/Repository-Software-Survey-2009-03.pdf
3. Evaluation of Digital Repository Software at the National Library of Medicine, Jennifer L. Marill and Edward C. Luczak, May/June 2009, D-Lib Magazine, Volume 15 Number 5/6
http://www.dlib.org/dlib/may09/marill/05marill.html
4. A Comparative Analysis of Institutional Repository Software, 25 February 2010
http://blogs.lib.purdue.edu/rep/
5. Digital Collections Risk Assessment at LSE: Using DRAMBORA
http://blogs.ecs.soton.ac.uk/keepit/2010/07/19/digital-collections-risk-assessment-at-lse-using-drambora/
6. A Guide to Institutional Repository Software, Open Society Institute, August 2004, 3rd Edition
http://www.soros.org/openaccess/pdf/OSI_Guide_to_IR_Software_v3.pdf
7. Requirements Specification, University of York Digital Library Project (SAFIR), Version 1.1, March 2008
https://vle.york.ac.uk/bbcswebdav/xid-89716_3
8. Wellcome Library requirements for a digital object repository, 2008
http://library.wellcome.ac.uk/assets/wtx055600.doc
9. Reference Model for an Open Archival Information System (OAIS), Consultative Committee for Space Data Systems (CCSDS), CCSDS 650.0-B-1, Blue Book, January 2002
http://public.ccsds.org/publications/archive/650x0b1.pdf

Author Details

Ed Fay
Collection Digitisation Manager
Library
The London School of Economics and Political Science

Email: E.Fay@lse.ac.uk
Web site: http://www.library.lse.ac.uk/
Twitter: http://www.twitter.com/digitalfay
Blog: http://lselibrarydigidev.blogspot.com/
