Joint Workshop on Future-proofing Institutional Websites
This DCC [1] and Wellcome Library [2] workshop sought to provide insight into ways that content creators and curators can ensure ongoing access to reliable Web sites over time. The issue is not merely one of archiving; it is also about designing and managing a Web site so that it is suitable for long-term preservation with minimum intervention by curators to ensure the content remains reliable and understandable through time.
Practical Approaches to Future Proofing Web Sites
John Kunze of the California Digital Library (CDL) [3] chaired the first session, kicking off with an introduction to practical approaches from his perspective at the CDL. His presentation covered a wide range of activities associated with Web site creation, management and preservation, and its central message was that short-term Web site management activities are a good indication of the activities required to preserve Web sites in the long term. He summarised the approach as the three R’s: Reduce, Replicate and Redirect. The Internet Engineering Task Force (IETF) plain text online archive of RFC’s (Requests for Comments), which has persisted for almost as long as the Internet itself, was cited as a specific example of Reduction. Replication requires the preservation of objects in multiple formats. Redirection is necessary to avoid dead links. The task is ultimately one of ‘future-improving’ Web sites rather than future-proofing them, and of downgrading expectations to minimise the worst potential loss, applying ‘not-bad’ practices rather than seeking the best solution for each set of data.
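The Redirect principle can be illustrated with a short sketch. The Python below is not taken from the presentation: the paths and the `resolve` helper are hypothetical, and it assumes a site keeps a simple table mapping retired paths to their current locations, so that an old link is answered with a permanent redirect rather than a dead end.

```python
from typing import Dict, Tuple

# Hypothetical table of retired paths and where their content lives now.
REDIRECTS: Dict[str, str] = {
    "/reports/workshop.html": "/events/workshop/",
    "/events/workshop/": "/archive/events/workshop/",
    "/old-news/": "/news/archive/",
}


def resolve(path: str, redirects: Dict[str, str]) -> Tuple[int, str]:
    """Return (status, location) for a requested path.

    Chains of moves are followed to the end so that a client is sent
    straight to the final location. 301 means the path has moved; 200
    means no move is recorded (whether the page exists is left to the
    Web server).
    """
    seen = {path}
    target = path
    while target in redirects:
        target = redirects[target]
        if target in seen:
            # A loop in the table would otherwise redirect for ever.
            raise ValueError(f"redirect loop involving {target}")
        seen.add(target)
    status = 301 if target != path else 200
    return status, target
```

Collapsing chains in this way means a page that has moved several times still needs only one redirect, and the loop check catches a table entry that would otherwise send clients round in circles.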
Richard Price, Head of Modern British Collections at the British Library [4], followed on the subject of ‘Formulating a collection development policy for Web sites as collections items’. Such a policy is necessary to attune the issue to a wider organisational remit, and leads directly to selection criteria and guidelines that allow the Web sites to be collected and preserved as part of an organisation’s wider preservation strategy. This, in effect, future-proofs Web sites that do not future-proof themselves.
Rachel Andrew of the Web Standards Project [5] offered a complementary presentation on using standards to future-proof Web sites at the point of creation, rather than by taking action at a later date. She quoted Eric Meyer, a Web designer and high-profile standards expert, as saying:
‘Web standards are intended to be a common base… a foundation for the world wide web so that browsers and other software understand the same basic vocabulary’.
Web sites thus designed using standards for modern-day consistent accessibility by a range of software are therefore more easily sustained and likely to endure with minimum interference to the format over the long term.
Andy Powell of the Eduserv Foundation [6] introduced the topic of Persistent Identifiers (PI’s). Following on from John’s perception that ‘what’s good for the short term is good for the long term’, Andy raised the issue that ‘the only good long-term identifier is a good short-term identifier’. He encouraged people to think in terms of 15 to 20 years, rather than forever, as it is more reasonable to anticipate changes over that period. On the topic of different types of persistent identifiers, such as PURL’s, HTTP URI’s, ARK and DOI’s, Andy was keen to point out that there is no point in having multiple identities for the same resource. He recommended HTTP URI’s as the publicly visible persistent identifier and made several recommendations on how to make them more persistent than several of the identifiers currently designated for identifying Web resources.
Julien Masanès of the International Internet Preservation Consortium (IIPC) [7] closed the session with a view on important metadata for Web material, based on the IIPC’s experiences harvesting the Web. Metadata that documents technical dependencies and the tools used to harvest Web sites is more useful and necessary than purely descriptive metadata. Furthermore, the sampling process by which Web sites are selected for archiving must be documented so that future users can understand the context of the available collection when isolated from its original, networked environment.
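As a rough illustration of the kind of record this implies, the following Python sketch stores the harvesting tool, the technical dependencies and the selection rationale alongside each harvested site. It is not an IIPC schema: the `HarvestRecord` class, its field names and the example values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class HarvestRecord:
    """Hypothetical record of how and why a Web site was harvested."""

    seed_url: str
    harvested_on: str              # ISO 8601 date of the harvest
    harvester: str                 # tool used to collect the site
    harvester_version: str
    selection_rationale: str       # why this site was sampled
    technical_dependencies: List[str] = field(default_factory=list)


def to_json(record: HarvestRecord) -> str:
    """Serialise the record so it can be stored next to the archived files."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Illustrative values only.
example = HarvestRecord(
    seed_url="http://www.example.org/",
    harvested_on="2006-01-27",
    harvester="example-crawler",
    harvester_version="1.0",
    selection_rationale="Sampled as one of the project sites in scope",
    technical_dependencies=["HTML 4.01", "CSS 2", "JavaScript for navigation"],
)
```

Keeping the record as plain JSON beside the harvested files is one simple way of ensuring that the context travels with the collection.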
Tools and Current Archiving Activity
Dave Thompson of the Wellcome Library chaired this session on Tools and Current Activities. A member of the programme committee, he also spoke on the UK Web Archiving Consortium (UKWAC) [8], which uses a customised version of the PANDAS software [9] to collect and archive Web sites flagged by the UKWAC as relevant to consortium members. Noting that much of the problem in future-proofing Web sites arises from the absence of records management practices, Dave made the point that archiving alone does not necessarily result in future-proofed Web sites. Embedding records management principles and practices in current Web site management is the key. However, archiving Web sites is necessary to maintain access to them in the long term. The UKWAC has a role to play in this by archiving the Web sites of organisations which, for whatever reason, do not archive their own sites. UKWAC has been granted permission by over 175 Web site owners to collect and archive their sites. Dave provided several pointers for organisations wishing to future-proof their Web sites: minimise the use of proprietary file formats, use standards (preferably open ones), follow existing Web guidelines, and talk to clients to make sure they understand that Web site design can affect the portability of their Web site into the future.
For organisations wishing to future-proof and archive their own Web sites, Hans Goutier of the Dutch Ministry of Transport, Public Works and Water Management [10] introduced the Ministry’s pilot efforts to archive old versions of its own and its affiliated agencies’ Web sites. Good records management is recognised as fundamental to successful Web site preservation and begins with the initial Web site design. Records management retention decisions in this context are led by Dutch legislation, which requires preservation of certain records but not the Web sites themselves. The Ministry can therefore make decisions on exactly what it will try to preserve and this does not necessarily include the entire site. Based on its experiences to date, the Ministry has issued guidelines for agencies to assist them in designing Web sites with longevity in mind. Similarly, Web sites that are designed to cater for broad accessibility are far easier to maintain into the future than Web sites which are not. Hans concluded with several recommendations, including clear delineation of tasks and responsibilities, maintenance of an inventory of Web sites (for organisations with more than one), identification of record-keeping responsibilities for Web site content, and implementation of quality control and a modular design.
Julien Masanès spoke again, this time on IIPC-developed tools to enable Web archiving. Several of his slides were inaccessible to the audience, but despite this, Julien gave an enthusiastic presentation on the tools required for harvesting and collecting different types of Web archives, such as local file systems, Web-served archives, and non-Web archives. The IIPC Toolset is aimed at medium- and large-scale archiving for the Web-served archives model, and the architecture of tools for Web archives incorporates index, ingest, search, access, and storage. The tools are open source and so may be obtained and used by a wider audience.
UKOLN’s Brian Kelly [11] followed with a presentation on lessons learnt from experiences with project Web sites. In contrast to many of the other speakers, for whom technical issues were a focus, Brian spoke about the organisational and human issues that could often lead to Web site failure or loss, such as failure to re-register domain names and failure to prepare and implement an adequate exit strategy for the end of a project. This was illustrated by the case of the first Web HTML validation service, developed and made available from webtechs.com [12]: many other sites linked to the webtechs site through their ‘Valid HTML’ icon, but someone forgot to pay the bill for the domain name and it was promptly taken over by a porn site. This human failure led to the perceived loss of a very valuable site and service and the accidental promotion of an undesirable one. Again, good practice standards were cited as the key to maintaining accessibility in the short term and as providing a favourable environment for preservation in the long term.
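The domain name lesson lends itself to a simple safeguard. The sketch below is an illustration rather than anything presented at the workshop: it assumes an organisation keeps its own inventory of domain names and expiry dates, and the `domains_needing_renewal` helper, the example domains and the 90-day warning period are all hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List


def domains_needing_renewal(
    expiry_dates: Dict[str, date],
    today: date,
    warn_days: int = 90,
) -> List[str]:
    """List domains that have expired or will expire within warn_days.

    expiry_dates is a locally maintained inventory; no registry is
    queried here.
    """
    cutoff = today + timedelta(days=warn_days)
    return sorted(
        name for name, expires in expiry_dates.items() if expires <= cutoff
    )


# Illustrative inventory only.
inventory = {
    "project-a.example.org": date(2006, 3, 1),
    "project-b.example.org": date(2007, 6, 30),
    "project-c.example.org": date(2005, 12, 31),
}
due = domains_needing_renewal(inventory, today=date(2006, 1, 27))
```

Run regularly, a check of this kind turns re-registration from something that depends on one person’s memory into a routine part of Web site management.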
Matthew Walker of the National Library of Australia (NLA) [13] introduced PANDAS 3, the NLA’s software tool for managing the process of gathering, archiving, and publishing Web site resources. Scheduled for release in March 2006, PANDAS 3 is an evolution of PANDAS 2 (released in August 2002). PANDAS 3 is more robust and operationally faster, with improved error handling, incorporated gather- and processing-related functionality, and a new user interface focussed on core workflows. It can also run in a standard servlet container or Java application server (e.g. Tomcat) instead of an Apple WebObjects application server. Importantly for implementers who wish to adapt the functionality, PANDAS 3 will be available as Open Source software.
International Activity and Legislation
The final session was chaired by Seamus Ross, Director of the Humanities Advanced Technology and Information Institute (HATII) [14] at the University of Glasgow and Associate Director of the DCC.
Adrian Brown, Head of Digital Preservation at The National Archives (TNA) [15], gave a thought-provoking presentation on the UK Government Web Archive. Web sites are selected on the basis that they are records containing evidence of interactions, specific business processes, or the development of e-government. Collection frequency varies for different Web sites, using either harvesting or direct transfer. Authenticity is a key issue for TNA, unlike many of the other organisations featured in the workshop, and is derived from the significant properties of the Web site as displayed on-screen; of course, the significant properties must therefore be defined before attempts can be made to preserve them. Preservation strategies may transform the source of the records, by normalisation, migration on demand or migration at obsolescence, or they may transform the means of access to the records through a form of emulation or the use of virtual computers. Preservation management, necessary whichever type of preservation strategy is implemented, must feature three key aspects: passive preservation; active preservation; and managing multiple manifestations. Legal issues and challenges arising from the presence of dynamic content must also be considered. Finally, Adrian identified some of the standards for government Web sites and summed up ‘archive-friendly’ sites as those which provide archival access with persistent, stable URL’s, documentation and metadata, and which do not use constraining technology.
Matthew Walker gave the final presentation of the workshop and introduced PANDORA, the NLA’s Archive of Web resources. After briefly discussing the history of PANDORA and its relationship with PANDAS, the presentation focussed on the context surrounding the PANDORA archive and the workflow it used. The NLA has no legal right to archive Web resources and must obtain permission to archive from every Web site owner (as must the UKWAC). Theirs is a selective approach with manual Quality Assurance processes, and it is scalable to the available resources. The lack of resources means that, as with many other current Web harvesting programmes, PANDORA [9] does not include the deep Web and the full linking structure of the Web is not retained. Consequently, resources may be perceived as missing in the eyes of future researchers. There are nine stages to the workflow: nominating/identifying; selection; gathering; processing; archiving; publishing; cataloguing; permissions; and restrictions. The NLA uses tools such as PANDAS [9] and Xinq [16], but other tools such as HTTrack [17], pageVault [18], and Heritrix [19] are also useful. Matthew finished with some final recommendations on Web archiving: do something and do it now; build on what you already have; think about what you have done and revise as necessary.
Conclusions
DCC Director Chris Rusbridge joined Seamus Ross to bring the workshop to a close. They presented a mind map [20] Chris had constructed that identified the main issues surrounding future-proofing of Web sites raised during the course of the workshop. The advice to Web masters had been spread throughout several presentations and gathering it together in a single document was very useful for delegates. The main issues were:
- think about the records perspective;
- reduce, replicate and redirect;
- protect your domain;
- be archive-friendly;
- carry out ‘not-bad practice’;
- experiment; and
- identify unhelpful practice.
The broad conclusions of the workshop placed successful Web site management and archiving firmly in the context of good records management. As one delegate put it, the topic should be approached from the perspective of good practice for making information available and accessible now, rather than forcing curators to adopt practices for preservation at a later date. Good practices for consistent and persistent Web site accessibility now make it easier to preserve and provide access to a reliably archived Web site at a later date.
References
- Digital Curation Centre Web site http://www.dcc.ac.uk
- Wellcome Library Web site http://library.wellcome.ac.uk/
- California Digital Library Web site http://www.cdlib.org/
- British Library Web site http://www.bl.uk/
- Web Standards Project Web site http://www.webstandards.org/
- Eduserv Foundation Web site http://www.eduserv.org.uk/foundation/
- International Internet Preservation Consortium (IIPC) Web site http://netpreserve.org/
- UK Web Archiving Consortium (UKWAC) Web site http://www.webarchive.org.uk/
- PANDAS and PANDORA Web site http://pandora.nla.gov.au/pandas.html
- Dutch Ministry of Transport, Public Works and Water Management Web site http://www.verkeerenwaterstaat.nl/
- UKOLN Web site http://www.ukoln.ac.uk
- Versions of this site are still available from the Internet Archive Web site http://www.archive.org/
- National Library of Australia Web site http://www.nla.gov.au/
- HATII Web site http://www.hatii.arts.gla.ac.uk/
- The National Archives Web site http://www.nationalarchives.gov.uk/
- Xinq Web site http://www.nla.gov.au/xinq/
- HTTrack Web site http://www.httrack.com/
- pageVault Web site http://www.projectcomputing.com/products/pageVault/
- Heritrix Web site http://crawler.archive.org/
- This mindmap is available from the DCC Web site at http://www.dcc.ac.uk/training/fpw-2006/ along with copies of the presentations.