E-Archiving: An Overview of Some Repository Management Software Tools
In recent years, initiatives to create software packages for electronic repository management have mushroomed all over the world. Some institutions engage in these activities in order to preserve content that might otherwise be lost, others in order to provide greater access to material that might otherwise be too obscure to be widely used, such as grey literature. The open access movement has also been an important factor in this development. Digital initiatives such as pre-print, post-print, and document servers are being created to explore new ways of publishing. With journal prices, especially in the scientific, technical and medical (STM) sector, still out of control, more and more authors and universities want to take an active part in the publishing and preservation process themselves.
In picking a tool, a library has to consider a number of questions:
- What material should be stored in the repository?
- Is long-term preservation an issue?
- Which software should be chosen?
- What is the cost of setting the system up? and
- How much know-how is required?
This article will discuss LOCKSS [1], EPrints [2] and DSpace [3], which are some of the most widely known repository management tools, in terms of who uses them, their cost, underlying technology, required know-how, and functionality.
LOCKSS
Libraries usually do not purchase the content of an electronic journal but a licence that allows access to the content for a certain period of time. If the subscription is not renewed the content is usually no longer available. Before the advent of electronic journals, libraries subscribed to their own print copies since there was no easy and fast way to access journals somewhere else. Nowadays libraries no longer need to obtain every journal they require in print since they can provide access via databases and e-journal subscriptions. Subscribing to a print journal means that the library owns the journal for as long as it chooses to maintain the journal by archiving it in some way. Thus a side effect of owning print copies is that somewhere in the U.S. or elsewhere there are a number of libraries preserving copies of a journal by binding and/or microfilming issues and making them available through interlibrary loan.
It is this system of preservation that Project LOCKSS (Lots of Copies Keep Stuff Safe), developed at Stanford University, is recreating in cyberspace. With LOCKSS, content of electronic journals that was available while the library subscribed to it can be archived and will still be available even after a subscription expires. This works for subscriptions to individual e-journals, titles purchased through consortia, and open access titles. Because LOCKSS collects new content slowly, it is suited to archiving stable content that does not change frequently or erratically. The primary aim of the LOCKSS system is therefore to preserve access to electronic journals, since journal content is only added at regular intervals. Key to this project is that an original copy of the journal is preserved, rather than a separately created back-up copy, to ensure the reliability of the content. It is estimated that approximately six redundant copies of a title are required to safeguard its long-term preservation [4].
Participation in LOCKSS is open to any library. Nearly 100 institutions from around the world are currently participating in the project, most of them in the United States and in Europe. Among the publishing platforms that are making content available for archiving are Project Muse, Blackwell Publishers, Emerald Group Publishing, Nature Publishing Group, and Kluwer Academic Publishers. Additionally, a number of periodicals that are freely available over the Web are being archived as well.
LOCKSS archives publications that appear on a regular schedule, are delivered through HTTP, and have a URL. Publications such as Web sites that change frequently are not suited for archiving with LOCKSS. If a journal contains advertisements that change, the ads will not be preserved. Work is currently under way to investigate whether LOCKSS can be used to archive government documents published on the Web. In another initiative, LOCKSS is used to archive Web sites that no longer change.
The advantage of preserving content with LOCKSS is that it can be done cheaply and without having to invest much time. Libraries that participate in the LOCKSS Project need a LOCKSS virtual machine, which can be an inexpensive generic computer. The computer needs to be able to connect to the Internet, although a dial-up connection is not sufficient. Minimum requirements for this machine are a CPU of at least 600MHz, at least 128MB RAM, and one or two disk drives that can store at least 60GB. Everything that is needed to create the virtual machine is provided through the LOCKSS software. LOCKSS boots from a CD which also contains the operating system OpenBSD. The required software, including the operating system, is open source [5]. Configuration information is made available on a separate floppy disk. Detailed step-by-step downloading and installation information can be found on the LOCKSS site [6]. In order to be able to troubleshoot problems that may occur, the person who installs and configures LOCKSS should have technical skills and experience in configuring software. Once LOCKSS is set up, it largely runs on its own and needs little monitoring from a systems administrator. For technical support, institutions can join the LOCKSS Alliance. The Alliance helps participants with some of the work, such as obtaining permissions from publishers.
LOCKSS collects journal content by continuously crawling publisher sites and preserves the content by caching it. A number of formats are accepted (HTML, jpg, gif, waf, pdf). LOCKSS preserves only the metadata input from publishers rather than local data input from libraries. Libraries have the option to create metadata in the administration module for each title that is archived. When requested, the cache distributes content by acting as a Web proxy. The system then retrieves the copy either from the publisher's site or, if it is no longer available there, from its own cache. Crawling publisher sites requires that institutions first obtain permission to do so from the publisher. This permission is granted through the licence agreement. Model licence language for the LOCKSS permission is available on the LOCKSS page [7]. Publishers will then add to their Web site a page that lists available volumes for a journal. The page also indicates that LOCKSS has permission to collect the content.
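The proxy behaviour described above can be illustrated with a minimal Python sketch. It is not LOCKSS code: the function name serve, the fetch_from_publisher callback and the dictionary-based cache are all hypothetical stand-ins, chosen only to show the order in which the two sources are consulted.

```python
from typing import Callable, Dict, Optional


def serve(url: str,
          fetch_from_publisher: Callable[[str], Optional[bytes]],
          cache: Dict[str, bytes]) -> Optional[bytes]:
    """Return the publisher's copy if it is available, else the cached copy.

    fetch_from_publisher is a hypothetical callback that returns None
    when the publisher no longer serves the requested URL.
    """
    content = fetch_from_publisher(url)
    if content is not None:
        return content
    # Publisher copy is gone: fall back to the locally preserved copy.
    return cache.get(url)
```

In this sketch a reader never needs to know which source answered the request, which mirrors the idea that the fallback is invisible to the user.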
Since individual journals have their own idiosyncrasies, plug-ins are required to help LOCKSS manage them. The plug-in gives LOCKSS information like where to find a journal, its publishing frequency, and how often to crawl. For a publishing platform like HighWire only one plug-in is necessary. The LOCKSS plug-in generation tool allows the administrator to create and test plug-ins without having to do any programming.
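To make the role of a plug-in more concrete, the following Python sketch models the three pieces of information mentioned above as a small record. The class name JournalPlugin, its field names and the crawl_due method are assumptions made for illustration; they do not reflect the actual LOCKSS plug-in format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class JournalPlugin:
    """Hypothetical per-journal settings, not the real LOCKSS plug-in format."""
    start_url: str                  # where to find the journal
    publishing_interval: timedelta  # how often new issues appear
    crawl_interval: timedelta       # how often the site should be crawled

    def crawl_due(self, last_crawl: datetime, now: datetime) -> bool:
        """Report whether enough time has passed to crawl the journal again."""
        return now - last_crawl >= self.crawl_interval


quarterly = JournalPlugin(
    start_url="http://journal.example.org/",
    publishing_interval=timedelta(days=90),
    crawl_interval=timedelta(days=14),
)
```

A quarterly journal crawled every two weeks, as in the example, would be due for a new crawl 19 days after the last one but not after four days.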
An essential aspect of electronic archiving is to ascertain that the material is available, that it is reliable, and that it does not contain any errors. With LOCKSS the process of checking content for faults and backing it up is completely automated. A LOCKSS computer continually compares the content it has for a certain journal in its cache with the content of the same journal in the cache of other LOCKSS computers. The system conducts polls with those LOCKSS peers. This process is accomplished with the LCAP (Library Cache Auditing Protocol) peer-to-peer polling system. If discrepancies are detected between two copies, the problem is fixed by downloading a fresh copy either from the publisher or from another LOCKSS computer without human intervention. In this system LOCKSS can repair any damage, even if disaster strikes and all content is lost from a cache. The polling system ensures that content is preserved reliably, that errors are eliminated, and that missing data is fixed. If journal content on a publisher's Web site is not available, LOCKSS serves the content out of its cache. This process is invisible to the user.
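The idea behind polling and repair can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the LCAP protocol: it assumes that each peer's copy can be read directly, that copies are compared by SHA-256 digest, and that a strict majority decides which copy is correct. The function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict


def digest(content: bytes) -> str:
    """Fingerprint a copy so that copies can be compared cheaply."""
    return hashlib.sha256(content).hexdigest()


def poll_and_repair(local: bytes, peers: Dict[str, bytes]) -> bytes:
    """Return the copy this cache should hold after one simplified poll."""
    votes = Counter(digest(copy) for copy in peers.values())
    votes[digest(local)] += 1  # the local cache votes as well
    winning_digest, count = votes.most_common(1)[0]
    total_voters = len(peers) + 1
    if count * 2 <= total_voters:
        return local  # no strict majority: keep the local copy for now
    if digest(local) == winning_digest:
        return local  # local copy agrees with the majority
    # Local copy lost the poll: repair it from a peer on the winning side.
    for copy in peers.values():
        if digest(copy) == winning_digest:
            return copy
    return local
```

With three peers holding an identical copy and a damaged local copy, the function returns the peers' version, which corresponds to the automatic repair described above.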
A good preservation system is a safe system. Frequent virus attacks and other intrusions make security an especially pressing issue when it comes to archiving content on the Web. The LOCKSS polling system can detect when a peer is being attacked. Human intervention is then required to prevent damage. LOCKSS' goal is to make it as costly and time-consuming as possible for somebody to attack the system. Even if the system is attacked and some peers are eliminated, the decentralised architecture of LOCKSS is a security measure in itself in so far as there is no single point of failure due to the physically distributed nature of the caches.
LOCKSS is not concerned with the preservation medium itself that is used for archiving. Should the hardware become obsolete, the entire cached content will have to be moved onto a new storage medium. However, in order to find answers to the still burning question of how to deal with issues concerning the long-term accessibility of material even when the technology changes, LOCKSS is now addressing the question of format migration. Changes in technology, for example in file formats, may make electronic resources unreadable. The LOCKSS creators have now started to develop a system that makes it possible to render content collected in one format into another format. This works through the 'migration on access' method which means that content is preserved in its format until a reader accesses the content, at which point it is converted into a current format. In January 2005 the LOCKSS team published an article in which it was announced that, as a proof-of-concept, the team had successfully managed to migrate the GIF format into a PNG format [8]. Currently, LOCKSS is developing format converters that will facilitate format migration on a larger scale. In order to develop this process, LOCKSS is planning to integrate format and bibliographic metadata extraction.
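The 'migration on access' approach can be outlined as follows. This Python sketch is an assumption-laden illustration rather than the LOCKSS implementation: the converter table, the function name serve_with_migration and the use of MIME type strings as format labels are all hypothetical, and a real converter would perform an actual image conversion.

```python
from typing import Callable, Dict, Tuple

Converter = Callable[[bytes], bytes]


def serve_with_migration(
    stored: bytes,
    stored_format: str,
    converters: Dict[str, Tuple[str, Converter]],
) -> Tuple[str, bytes]:
    """Convert content at delivery time, leaving the stored copy untouched.

    converters maps an obsolete format to a (current format, converter) pair.
    """
    if stored_format in converters:
        target_format, convert = converters[stored_format]
        return target_format, convert(stored)
    # Format is still current: deliver the preserved bytes as they are.
    return stored_format, stored
```

Because the conversion happens only when a reader requests the item, the archive keeps the original bytes and a better converter can be substituted later without touching the stored content.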
EPrints
A growing number of authors are bypassing the traditional publishers of scholarly communication and are turning towards self-publishing instead. Some provide their works on preprint servers such as arXiv.org [9], the CERN Document Server [10], or Cogprints [11]. Others make their work freely available in post-print servers to allow publishers to make use of articles for some time first before the copyright goes back to the author. A number of institutions have also become active in preserving and providing access to grey literature produced on their campuses such as the GrayLIT Network [12], ETH E-Collection [13] or SPIRES [14]. Some repositories are hybrids that store more than one type of material.
EPrints is a tool that is used to manage the archiving of research in the form of books, posters, or conference papers. Its purpose is not to provide a long-term archiving solution that ensures that material will be readable and accessible through technology changes, but instead to give institutions a means to collect, store and provide Web access to material. Currently, there are over 140 repositories worldwide that run the EPrints software. For example, at the University of Queensland in Australia, EPrints is used as 'a deposit collection of papers that showcases the research output of UQ academic staff and postgraduate students across a range of subjects and disciplines, both before and after peer-reviewed publication.' [15] The University of Pittsburgh maintains a PhilSci Archive for preprints in the philosophy of science [16].
EPrints is a free open source package that was developed at the University of Southampton in the UK [17]. It is OAI (Open Archives Initiative)-compliant which makes it accessible to cross-archive searching. Once an archive is registered with OAI, 'it will automatically be included in a global program of metadata harvesting and other added-value services run by academic and scientific institutions across the globe.' [18]
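As an illustration of what metadata harvesting involves, the sketch below builds the kind of request a harvester might send to an OAI-compliant archive using the OAI Protocol for Metadata Harvesting (OAI-PMH). The ListRecords verb and the oai_dc metadata prefix are part of that protocol; the base URL and the function name are hypothetical, and the endpoint of any real archive has to be looked up in its own documentation.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for a given archive endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hypothetical archive endpoint, used for illustration only.
print(list_records_url("http://archive.example.org/oai"))
```

The archive would answer such a request with an XML document containing the metadata records, which a service provider can then index for cross-archive searching.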
The most current version is EPrints 2.3.11. The initial installation and configuration of EPrints can be time-consuming, although installation is quick and relatively easy if the administrator keeps the default settings. EPrints requires no in-depth technical skills on the part of the administrator; however, he or she needs some familiarity with Apache, MySQL, Perl, and XML. The administrator installs the software on a server, runs scripts, and performs some maintenance.
To set up EPrints, a computer that can run a Linux, Solaris or Mac OS X operating system is required. The Apache Web server, the MySQL database, and the EPrints software itself are also necessary (all of which are open source products). For technical support, administrators can consult the EPrints support Web site or subscribe to the EPrints technical mailing list [19].
EPrints comes with a user interface that can be customised. The interface includes a navigation toolbar that contains links to Home, About, Browse, Search, Register, User Area, and Help pages. Authors who want to submit material have to register first and are then able to log on in the User Area to upload material. Authors have to indicate what kind of article they are uploading (book chapter, thesis, etc) and they have to enter the metadata. Any metadata schema can be used with EPrints. It is up to the administrator to decide what types of materials will be stored. Based on those types the administrator then decides which metadata elements should be held for submitted items of a certain type. Only 'title' and 'author' are mandatory data. In addition to that a variety of information about the item can be stored such as whether the article has been published or not, abstract, keywords, and subjects. Once the item has been uploaded, the author will be issued a deposit verification. Uploaded material is first held in the so-called 'buffer' unless the administrator has disabled the buffer (in which case it is deposited into the archive right away). The purpose of the buffer is to allow the submitted material to be reviewed before it is finally deposited.
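The deposit process described above can be summarised in a short Python sketch. EPrints itself is written in Perl, so this is not EPrints code: the function names, the list-based buffer and archive, and the buffer_enabled flag are assumptions used only to show the mandatory-field check and the optional review buffer.

```python
from typing import Dict, List

# Only these two fields are mandatory according to the description above.
MANDATORY_FIELDS = ("title", "author")


def missing_fields(metadata: Dict[str, str]) -> List[str]:
    """List the mandatory fields that are absent or blank."""
    return [name for name in MANDATORY_FIELDS
            if not metadata.get(name, "").strip()]


def deposit(metadata: Dict[str, str],
            buffer: List[Dict[str, str]],
            archive: List[Dict[str, str]],
            buffer_enabled: bool = True) -> str:
    """Place a submission in the review buffer, or directly in the archive."""
    missing = missing_fields(metadata)
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))
    if buffer_enabled:
        buffer.append(metadata)   # held for review before final deposit
        return "buffer"
    archive.append(metadata)      # buffer disabled: deposited right away
    return "archive"
```

Optional fields such as abstract or keywords would simply be further keys in the metadata dictionary, which is why the sketch checks only the two mandatory ones.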
Users of the archive have the option to browse by subject, author, year, EPrint type or latest addition. They also have the option to search fields such as title, abstract or full text. Available fields depend on which fields the administrator implemented. An example of how the user interface works can be seen in the Cogprints archive [11]. In this archive citations on the results list contain the author name, publication date, title, publisher, and page numbers. If a citation is accessed, the user can link to the full text or read an abstract first. Subject headings and keywords are also displayed. At the Queensland University of Technology in Australia, archive visitors and contributors can also view access statistics [20].
DSpace
The DSpace open source software [21] has been developed by the Massachusetts Institute of Technology Libraries and Hewlett-Packard. The current version of DSpace is 1.2.1.
According to the DSpace Web site [22], the software allows institutions to
- capture and describe digital works using a custom workflow process
- distribute an institution's digital works over the Web, so users can search and retrieve items in the collection
- preserve digital works over the long term
DSpace is used by more than 100 organisations [23]. For example, the Sissa Digital Library is an Italian DSpace-based repository [24]. It contains preprints, technical reports, working papers, and conference papers. At the Universiteit Gent in Belgium, DSpace is used as an image archive that contains materials such as photographs, prints, drawings, and maps [25]. MIT itself has a large DSpace repository on its Web site for materials such as preprints, technical reports, working papers, and images [26].
DSpace is more flexible than EPrints in so far as it is intended to archive a large variety of types of content such as articles, datasets, images, audio files, video files, computer programs, and reformatted digital library collections. DSpace also takes a first step towards archiving Web sites. It is capable of storing self-contained, non-dynamic HTML documents. DSpace is also OAI- and OpenURL-compliant.
It is suitable for large and complex organisations that anticipate material submissions from many different departments (so-called communities) since DSpace's architecture mimics the structure of the organisation that uses DSpace. This supports the implementation of workflows that can be customised for specific departments or other institutional entities.
DSpace runs on a UNIX-type operating system such as Linux or Solaris. It also requires other open source tools such as the Apache Web server, Tomcat (a Java servlet engine), a Java compiler, and PostgreSQL, a relational database management system. As far as hardware is concerned, DSpace needs an appropriate server (for example an HP rx2600 or SunFire 280R) and enough memory and disk storage. Running DSpace requires an experienced systems administrator, who has to install and configure the system. A Java programmer will have to perform some customising.
DSpace comes with user interfaces for the public, submitters, and administrators. The interface used by the public allows for browsing and searching. The look of the Web user interface can be customised. Users can browse the content by community, title, author, or date, depending on what options the administrator provides for. In addition to a basic search, an advanced search option for field searching can also be set up. DSpace also supports the display of links to new collections and recent submissions on the user interface. Access to items can be restricted to authorised users only. A new initiative that DSpace launched earlier in 2004 is a collaboration with Google to enable searching across DSpace repositories.
Before authors can submit material they have to register. When they are ready to upload items they do so through the MY DSpace page. Users also have to input metadata which is based on the Dublin Core Metadata Schema. A second set of data contains preservation metadata and a third set contains structural metadata for an item. The data elements that are input by the person submitting the item are: author, title, date of issue, series name and report number, identifiers, language, subject keywords, abstract, and sponsors. Only three data elements are required: title, language, and submission date. Additional data may be automatically produced by DSpace or input by the administrator.
DSpace's authorisation system gives certain user groups specific rights. For example, administrators can specify who is allowed to submit material, who is allowed to review submitted material, who is allowed to modify items, and who is allowed to administer communities and collections. Before the material is actually stored, the institution can decide to put it through a review process. The workflow in DSpace allows for multiple levels of reviewing: Reviewers can return items that are deemed inappropriate; Approvers check the submissions for errors, for example in the metadata; and Metadata Editors have the authority to make changes to the metadata.
DSpace's capabilities go beyond storing items by making provisions for changes in file formats. DSpace guarantees that the file does not change over time even if the physical media around it change. It captures the specific format in which an item is submitted: 'In DSpace, a bitstream format is a unique and consistent way to refer to a particular file format.' [27] The DSpace administrator maintains a bitstream format registry. If an item is submitted in a format that is not in the registry, the administrator has to decide if that format should be entered into the registry. There are three types of formats the administrator can select from: supported (the institution will be able to support bitstreams of this format in the long term), known (the institution will preserve the bitstream and make an effort to move it into the 'supported' category), unsupported (the institution will preserve the bitstream).
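The bitstream format registry and its three support levels can be modelled with a small Python sketch. DSpace itself is a Java application, so none of this is DSpace code: the class and method names are hypothetical, and treating a format that is not in the registry as unsupported is an assumption made here for simplicity.

```python
from enum import Enum
from typing import Dict


class SupportLevel(Enum):
    SUPPORTED = "supported"      # institution commits to long-term support
    KNOWN = "known"              # preserved, with an effort to make it supported
    UNSUPPORTED = "unsupported"  # bitstream is preserved as it is


class BitstreamFormatRegistry:
    """Hypothetical registry mapping format names to support levels."""

    def __init__(self) -> None:
        self._formats: Dict[str, SupportLevel] = {}

    def register(self, format_name: str, level: SupportLevel) -> None:
        """Record the administrator's decision for a format."""
        self._formats[format_name] = level

    def is_registered(self, format_name: str) -> bool:
        return format_name in self._formats

    def level_for(self, format_name: str) -> SupportLevel:
        """Look up a format; unregistered formats are assumed unsupported."""
        return self._formats.get(format_name, SupportLevel.UNSUPPORTED)
```

An administrator who registers PDF as supported and a word processor format as known could then check the level of each incoming file and decide whether a new format should be added to the registry.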
Systems administrators can refer to the DSpace Web site where they can find installation instructions, a discussion forum and mailing lists. Institutions can also participate in the DSpace Federation [28] where administrators and designers share information.
Conclusion
E-archiving is still in its infancy but nonetheless there are tools for libraries big and small to get an archiving project off the ground. Any archiving project requires time, planning, and technical know-how. It is up to the library to match the right tool to its needs and resources. Participating in the LOCKSS project is feasible for libraries that do not have any content of their own to archive but that want to participate in the effort of preserving scientific works for the long term. The type of data that can be preserved with LOCKSS is very limited since only material that is published at regular intervals is suitable to be archived with LOCKSS. However, efforts are underway to explore whether LOCKSS can be used for materials other than journals. As far as administration goes, LOCKSS is easier and cheaper to administer than EPrints and DSpace. Moreover, LOCKSS has opened up a promising way to find a solution to the problem of preserving content in the long run through format migration.
Institutions that want to go beyond archiving journal literature can use EPrints or DSpace. They are suitable for institutions that want to provide access to material that is produced on their campuses in addition to preserving journal literature. More technical skills are necessary to set them up, but especially with DSpace, just about any kind of material can be archived. EPrints is a viable option for archiving material on a specific subject matter, while DSpace is especially suitable for large institutions that expect to archive materials on a large scale from a variety of departments, labs and other communities on their campus.
References
1. LOCKSS Web site http://lockss.stanford.edu/
2. EPrints Web site http://www.eprints.org/
3. DSpace Web site http://www.dspace.org/
4. LOCKSS. (2004, April 1). Collections Work. Retrieved November 3, 2004, from http://lockss.stanford.edu/librarians/building.htm
5. It can be downloaded from http://sourceforge.net/projects/lockss/
6. LOCKSS Web site http://www.lockss.org/publicdocs/install.html
7. LOCKSS licence language http://lockss.stanford.edu/librarians/licenses.htm
8. Rosenthal, D. S. H., Lipkis, T., Robertson, T. S., & Morabito, S. (2005, January). Transparent format migration of preserved web content. D-Lib Magazine, 11.1. Retrieved March 8, 2005, from http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html
9. arXiv.org e-Print archive http://arxiv.org/
10. CERN Document Server http://cdsweb.cern.ch/?c=Preprints
11. Cogprints electronic archive http://cogprints.ecs.soton.ac.uk/
12. GrayLIT Network http://graylit.osti.gov/
13. ETH E-Collection http://e-collection.ethbib.ethz.ch/index_e.html
14. SPIRES High-Energy Physics Literature Database http://www.slac.stanford.edu/spires/hep/
15. ePrints@UQ http://eprint.uq.edu.au/
16. PhilSci Archive http://philsci-archive.pitt.edu/
17. It can be downloaded from http://software.eprints.org/
18. University of Southampton. (2004). GNU EPrints 2 - EPrints Handbook. Retrieved March 22, 2005, from http://software.eprints.org/handbook/managing-background.php
19. EPrints Mailing List http://software.eprints.org/docs/php/contact.php
20. QUT ePrints http://eprints.qut.edu.au/
21. DSpace can be downloaded from http://sourceforge.net/projects/dspace/
22. MIT Libraries, & Hewlett-Packard Company. (2003). DSpace Federation. Retrieved March 22, 2005, from http://www.dspace.org/
23. Denny, H. (2004, April). DSpace users compare notes. Retrieved March 22, 2005, from Massachusetts Institute of Technology Web site: http://web.mit.edu/newsoffice/2004/dspace-0414.html
24. SISSA Digital Repository https://digitallibrary.sissa.it/index.jsp
25. Pictorial Archive Ugent Library https://archive.ugent.be/handle/1854/219
26. DSpace at MIT https://dspace.mit.edu/index.jsp
27. Bass, M. J., Stuve, D., Tansley, R., Branschofsky, M., Breton, P., et al. (2002, March). DSpace - a sustainable solution for institutional digital asset services - spanning the information asset value chain: ingest, manage, preserve, disseminate. Retrieved March 22, 2005, from DSpace Web site: http://dspace.org/technology/functionality.pdf
28. DSpace Federation http://dspace.org/federation/index.html