Collecting Born Digital Archives at the Wellcome Library
Society trusts libraries and archives to ensure that the report we read or the information we rely on for research will still be available when next we need it. The digital world presents new challenges of acquisition and life cycle management for libraries, archives and readers. This article looks at the first steps taken by the Wellcome Library to bring born digital material [1] into its collections.
Plans for the Future
The Wellcome Library acknowledges that digital material will form part of its collections in the future. As more members of our donor/creator community produce their records in digital form, we must actively seek out this material if our collections are to grow and remain relevant. The Library enjoys significant support for its plans to collect digital material both from internal management and from archival staff. This support is proving crucial in tiding the Library over a period of investigation, trial, evaluation and development.
Yet going digital involves more than simply accepting digital material on CD-ROM or floppy disk from our existing donor community. We face a number of new challenges. We need to decide on preferred preservation formats that suit our purpose, we need to identify tools to perform technical processes, and we have yet to determine what technical metadata we will need.
We are already clear on a number of key principles [2]. We intend to hold digital material in a managed environment such as a repository and we will collect descriptive and technical metadata about the material we hold. What we collect will be in line with our Library collection development policy. Most importantly we will make the collection and ongoing management of digital material an economically sustainable activity by building it into the everyday business of the Library. Digital material will be integrated into existing collections, not treated as anything ‘different’. The work that we have done to date has been to test and further refine these key principles.
Social Change, Not Just Technology
Digital material sets its own challenges but will be handled within the broad intellectual framework provided by existing archival practice: respect for provenance and the integrity of the document, selection, structure, context and appraisal remain key to successful long-term life cycle management of digital material. Existing ways of working will need modification, and the sociology of archive work, the relationship between donor and archivist, will adapt.
Digital material is evanescent and prone to obsolescence: archivists cannot rely on material waiting for decades in a storeroom before they encounter it. To ensure that this material is captured and its integrity safeguarded, they will have to work with it quickly and establish precisely what they have, that it is what they wanted and expected, and that its formats are accessible. In fact, they will be obliged, where possible, to intervene much earlier in the life cycle of the material. Archivists may even advise on its creation, offering opinions on the forms of record best suited to long-term survival. How organisations’ own Electronic Document and Records Management Systems (EDRMS) may affect the process is not yet known, but experience with paper records suggests that records management models will range widely, from strictly controlled to virtually anarchic, and that we should be prepared for a similar range even where a formal records management system exists. Even now, where an organisation that has deposited material at the Library has no records manager of its own, archivists may, as part of the maintenance of a good working relationship, provide records management advice. So it is possible that the Library may end up functioning as a de facto EDRMS for some organisations. The costs will have to be monitored and balanced against the benefits of obtaining the material; but generally all parties profit from this sort of arrangement, since the records received should improve in quality.
The need for much earlier access to material will have to be built into projections of future donation cycles and workloads. The Library is actively engaging with its donor/creator community to identify ways in which this ideal can be realised. Clear policy documents help our donors understand what we are trying to achieve and give them the opportunity to contribute to the process. Working more closely with organisations or individuals may also have an effect upon access arrangements: material is likely to arrive in semi-current form and to be subject to access restrictions. There is likely to be a move towards regular, scheduled accessions, perhaps with the same cycle governing new material arriving and older material coming out of access restrictions.
As now, a donor may carry out selection or weeding in advance of transfer, liaising with the archivist. The experience of the Public Record Office of Northern Ireland (PRONI) [3] in preparing for ingest of electronic records suggests that appraisal is likely to take place at series level or higher, rather than file by file; this will probably also be the case when appraisal takes place after material has arrived in the archive.
Social change will also be felt within the Library and it is important to manage this impact. Workflows must be tailored for use by non-specialists, since digital records will occur increasingly as part of all collections and will become part of the daily work of the Library, not a specialist extra. As plans develop, more staff are being brought into the process, which raises levels of engagement.
Work to Date
The Library is implementing a three-fold approach to digital materials. We have implemented a developmental Fedora [4] digital object repository. We are developing the policy framework within which we will work with material and we are modelling this using workflows, considering every step of the records life cycle from creation to eventual dissemination. That has allowed us to see what factors may affect the process and where difficulties may arise.
Using the Fedora Digital Object Repository
Using Fedora has introduced the Library to the basic concepts and processes of the acquisition and storage of born digital material. We have shown that we can install a simple out-of-the-box system and use it to ingest and retrieve material. We have made very few modifications to Fedora and have not yet worked on access to the material we have archived. This has allowed us to focus on developing our internal processes in a way that does not depend on any one technology.
We have attempted to work with the types of material we expect to be collecting. We believe these will be mostly textual, word-processed or email type objects along with more complex text objects such as spreadsheets. These are also formats that Fedora can display using native disseminators. We have been working with Simple [5] and Simple/Complex [6] objects (emails, both text and HTML; Microsoft Word documents; and JPEG image files). Objects have been ingested into Fedora in their native format and not normalised or migrated to a preferred archival format. This has allowed us to focus on repository-based activity but has highlighted the need to decide on the formats we may prefer to acquire and hold.
Development of New Document Tools
The Library’s three-year Strategic Plan includes the aim of engaging with digital material. Three new documents form the foundation of our plans for the future; these are:
- Preservation Plan, covering all material held in our permanent collections both physical and digital.
- A revised donor/loan agreement that covers the acquisition of digital material.
- A manifest template that allows us to record all necessary technical information about digital donations at the time of donation/acquisition.
Documents such as the Preservation Plan already existed, but covered only physical materials or talked loosely about ‘material in all formats’. The new plan contains two sections, one addressing the specific needs of physical, the other the needs of digital material. This has allowed specific standards to be applied to the management of each type of material whilst allowing for the differences in preservation approaches between physical and digital. In this way a single coherent plan for the management of materials in our permanent collections is created that outlines our commitment to the preservation of all material.
The revised donor/loan agreement also makes clear distinctions between physical and digital donations to the Library based on the different issues around the management of digital material. For digital donations the new agreement seeks additional clarification regarding copyright and who else may have rights in the material. It asks if material has been donated only to our library or to other institutions, and it makes clear the possibility that material may be re-formatted and copied as part of preservation actions in the future. The bases for the new agreement were the forms developed by the Paradigm Project [7] at the University of Oxford and those developed by the East of England Digital Preservation Pilot Project [8]. Having readily available models to draw on saved the Library considerable time and effort.
The manifest that describes the technical properties of material being donated is also based on models from these two projects. Again, the use of existing models allowed us to think more closely about what information we needed in our own Library, what we would do with it and what processes would follow from it rather than focusing on designing forms.
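To make the idea of a technical manifest concrete, the following is a minimal sketch of how a file listing for a donation might be generated at the point of transfer. It is not the Library's manifest template: the field names (`donor`, `file_count`, `path`, `size_bytes`, `sha256`) and the function names are assumptions chosen for illustration, and a real manifest would also carry the information gathered from the donor.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(donation_dir: str, donor: str) -> dict:
    """List every file in a donation with its size and checksum.

    The field names are hypothetical, not taken from any published template.
    """
    root = Path(donation_dir)
    files = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        files.append({
            "path": path.relative_to(root).as_posix(),
            "size_bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return {"donor": donor, "file_count": len(files), "files": files}
```

Calling `build_manifest` on the folder copied from a donor's disk returns a dictionary that can be saved with the standard `json` module. Recording a checksum at the time of donation gives a baseline against which later copies can be compared.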
A set of high-level workflows has begun to map the processes and responsibilities that will be involved in our handling of digital material, showing how these fit together. Existing workflows for paper materials were examined in detail; it was seen that these require supplementing to take account of digital issues, but not complete replacement. Issues such as who owns the material, who can see it, who can copy it and on what basis, are common to paper and digital material alike. However, describing processes and activities on paper has shown where physical and digital materials have different management needs. It has also begun to highlight staff on whom new workloads may fall, identifying training needs.
The Challenge of Metadata
The creation, collection and management of technical metadata for digital material is a new challenge. It is uncertain what type of metadata - and what level of detail - will be essential to long-term lifecycle management and dissemination of the material (though standards such as PREMIS [9] provide a framework). Determining that may take time and further experience with digital material.
Using Fedora provided a starting model for technical metadata. Fedora’s use of METS [10] offered a relatively simple look at the process of producing and storing technical metadata. Manual work with METS records highlighted the principle that technical metadata should be automatically derived as far as possible. It also showed the need for additional work to incorporate technical metadata standards like PREMIS which are not included in the basic Fedora metadata set.
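As an illustration of deriving technical metadata automatically rather than by hand, the sketch below identifies a file's format from its leading bytes and checks whether the file extension agrees. It is an assumption-laden example, not part of Fedora, METS or PREMIS: the three signatures cover only the kinds of object mentioned in this article plus PDF, and the labels, expected extensions and output fields are invented for the example.

```python
from pathlib import Path

# A few well-known file signatures (leading "magic" bytes).
# Labels and expected extensions are illustrative choices.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG image", {".jpg", ".jpeg"}),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
     "OLE2 compound file (e.g. legacy MS Word)", {".doc", ".xls", ".ppt"}),
    (b"%PDF-", "PDF document", {".pdf"}),
]


def identify_format(path: str) -> dict:
    """Derive basic format metadata for one file from its content."""
    file_path = Path(path)
    with file_path.open("rb") as handle:
        header = handle.read(16)
    for magic, label, extensions in SIGNATURES:
        if header.startswith(magic):
            return {
                "name": file_path.name,
                "format": label,
                # False means the extension and the content disagree
                "extension_matches": file_path.suffix.lower() in extensions,
            }
    # Unknown signature: leave the format to be identified another way
    return {"name": file_path.name, "format": None, "extension_matches": None}
```

A file whose content and extension disagree, or whose signature is not recognised, is a natural candidate for closer inspection by an archivist before ingest.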
It is clear that for catalogue metadata, the General International Standard for Archival Description (ISAD(G)) - as used for our paper holdings - provides an appropriate ethos for the description of digital collections, supported by adequate levels of technical metadata. Cataloguers’ experience in applying ISAD(G) in hierarchical catalogues to achieve informative but slimline records for paper records should be applicable to digital material. The Library’s cataloguing manual for archivists asks them to balance providing as much information as the reader needs with ‘as little as you can get away with’ and this balancing act will continue to direct our practice. Within the archives department a useful groundwork of working to standards and mapping between them already exists, thanks to interoperability work between Library catalogues: ISAD(G) data can easily be mapped to Dublin Core and this, if desired, will provide a route whereby ISAD(G) data and METS records can work together.
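The mapping from ISAD(G) to Dublin Core can be sketched as a simple crosswalk table. The example below is illustrative only: the choice of element pairs is one plausible mapping rather than the Library's own, the snake_case field names are invented for the example, and the sample record in the usage note is hypothetical.

```python
# Illustrative crosswalk: ISAD(G) element -> Dublin Core element.
# The pairings are an assumption, not an authoritative mapping.
ISAD_TO_DC = {
    "reference_code": "identifier",
    "title": "title",
    "dates": "date",
    "name_of_creator": "creator",
    "extent_and_medium": "format",
    "scope_and_content": "description",
    "language_of_material": "language",
    "conditions_governing_access": "rights",
}


def isad_to_dublin_core(record: dict) -> dict:
    """Convert an ISAD(G)-style record into Dublin Core elements.

    Unmapped or empty fields are skipped. Values are collected in lists
    because Dublin Core elements are repeatable.
    """
    dublin_core = {}
    for isad_field, value in record.items():
        dc_field = ISAD_TO_DC.get(isad_field)
        if dc_field and value:
            dublin_core.setdefault(dc_field, []).append(value)
    return dublin_core
```

For example, a hypothetical record such as `{"reference_code": "EX/1/2", "title": "Correspondence files", "dates": "1998-2004"}` would yield `identifier`, `title` and `date` elements, which could then be embedded in a descriptive metadata section of a METS record.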
New Business Collaborations
The Library is an associate member of the Digital Curation Centre (DCC) [11] and the Digital Preservation Coalition (DPC) [12], both of which provide models and tools that can be applied within the specific context of our Library. Participation in events hosted by both organisations has allowed the Library to build expertise it would have struggled to gain alone. It has also allowed Library staff to present plans to peers, who can then provide feedback and comment and put staff in touch with individuals willing to share their expertise and skills.
Library staff have also made personal contact with individuals involved in related digital projects. The Paradigm Project has been a key partner for the Library. The Library has drawn heavily on its experience and especially its workbook: models, practices and policies developed by the Project have formed the basis of our own. This has represented a key saving in time and effort in that models and documentation can be re-worked for local use and lessons already learned quickly applied.
The Library has also developed new types of relationships with internal partners such as our IT department. We need access to new services and hardware and we rely heavily on IT staff to provide guidance and advice on issues such as capacity planning or system design. Their support and technical expertise have been essential to establishing our first Fedora repository.
Are We Succeeding?
At this early stage it is difficult to identify key success factors, but the most significant step to date has been to acknowledge that the Library needs to start work now if we are to engage with digital material in the future. As key stakeholders, archival staff have shown great enthusiasm for working with digital material and have been leading the development of workflows, policies and practices that will deal with the practicalities of digital collection development. As a result we can demonstrate progress towards our goal; progress that is supported internally and built on the best practice of our peers.
Conclusion: Engaging with Digital Material Means Business as Usual
Interestingly many of the skills we require have already been found to exist within the Library. Sound archival practice is proving to be very robust when applied to the management of digital material. Basing what we do on archival practice has allowed us to advance quickly to the point where the principles of working with this new medium can be readily understood and applied.
As a new business we have found that documentation is indispensable: we have created policies and documented workflows in order to establish and then test our plans and intentions. The use of workflows has allowed us to model responsibilities, tasks and processes and to fine-tune them based on consensus and shared input. If going digital is to be business as usual then clear workflows must be set up and documented, and this, plus staff training, will ensure that in the long run the management of this material is not confined to a small cadre of specialists. Documentation has also acted as a means of recording these early decisions so that they are not lost and has demonstrated that we are indeed progressing towards the reality of collecting digital materials.
We have found that collaboration with like partners has proved to be an efficient way of gaining experience as well as of providing peer input. Collaboration has shown us what is common or best practice where no formal standards yet exist. At a less formal level, collaboration has allowed staff to exchange ideas with colleagues facing similar challenges. Collaboration and consensus will continue to be crucial. Communication and the creation of clear documentation are ensuring that the skills needed to work with digital materials are being distributed throughout the Library. If going digital is to be business as usual then new skills cannot be locked in silos: they must be distributed amongst all staff.
We are making systematic progress, beginning by modelling the basics such as the development of policy and processes. Going digital clearly will require a great deal more than accepting digital material from our existing donor community on CD-ROM. We will need to engage more closely with our donor/creators and at an earlier stage in the creation of material if we are to be responsible for its long-term management. This means sharing with them our plans, explaining the ways in which we plan to work with them and their digital material and actively seeking their feedback.
The process will not be a trivial one. There is a steep learning curve that should not be underestimated. Acquiring and managing born digital collections requires new infrastructure, new skills, commitment of already stretched resources and a determination on the part of management to see the process through both now and into the future. However, building on existing archival good practice provides a firm foundation for progress.
Editor’s note: Readers may also be interested in the follow-up article “Further Experiences in Collecting Born Digital Archives at the Wellcome Library” in Issue 53.
References
- Hereafter simply referred to as digital material.
- We are not yet planning for the dissemination of digital material, which may in some cases be subject to access restrictions imposed by legislation (for example, the Data Protection Act) or by the donor. Dissemination will require careful thought and developing this will be a separate process.
- ‘Developing a pre-ingest strategy for digital records’, Zoë Smyth, PRONI: presentation made at Digital Curation Centre / Liverpool University Centre for Archive Studies joint workshop on ingestion of digital records, Liverpool, 30 November 2006
- Fedora Digital Object Repository http://www.fedora.info
- A single email, JPEG image or MS Word document
- MS Word file with an embedded JPEG image
- The Paradigm Project http://www.paradigm.ac.uk
- Report of the East of England Digital Preservation Regional Pilot Project, MLA East of England and East of England Regional Archive Council. 2006 Available from http://www.data-archive.ac.uk/news/publications/darp2006.pdf Accessed 1 February 2007.
- PREMIS (PREservation Metadata: Implementation Strategies) maintenance activity http://www.loc.gov/standards/premis/
- METS (Metadata Encoding and Transmission Standard) http://www.loc.gov/mets/
- Digital Curation Centre http://www.dcc.ac.uk/
- Digital Preservation Coalition http://www.dpconline.org/