Digital Preservation Coalition Forum on Web Archiving

Manjula Patel; Maureen Pennock

Maureen Pennock and Manjula Patel report on the Digital Preservation Coalition's second Web Archiving Forum, which took place at the British Library in London on 12 June 2006.

The Digital Preservation Coalition (DPC) [1] ran its first Web archiving forum in 2002, when archiving the Web was still a relatively unexplored concept for most organisations. This second Web archiving forum sought to review national and international activities since then and to provide an update on them. It gave delegates an excellent opportunity to exchange experiences and to identify emerging areas of research and future developments in Web archiving.

Session 1: Technical Aspects

The first session focused on technical aspects of archiving Web content. Philip Beresford, Web archiving project manager at the British Library [2], spoke first about the British Library’s involvement with the UK Web Archiving Consortium (UKWAC) [3] and the first two years of UKWAC activity. After introducing the aims and objectives of UKWAC, which focus on developing collaborative approaches to Web archiving in the UK on a shared infrastructure, Philip identified some of the problems and challenges faced so far in the practical development of the UKWAC archive. Some of these are technical and relate to the harvesting software; others derive from more practical issues, such as obtaining written permission from site owners for UKWAC to archive their sites and the resource-intensive nature of that process. Emerging tools, notably the Web Curator Tool from the International Internet Preservation Consortium (IIPC) [4] and a new version of the harvesting software, should help address some of these problems and enable UKWAC to develop its role and approach to collaborative Web archiving.

Adrian Brown, Head of Digital Preservation at the UK National Archives (TNA) [5], followed with a presentation on collecting and preserving Web content. As the nature of the Web changes, it is becoming increasingly difficult to preserve: Web sites are no longer discrete entities that can be collected, but are experiences that arise from interaction between the Web server and the client (i.e. the Web browser). These experiences can be defined as either content-driven or Web-driven. Archiving on the client side, also known as transactional archiving, thus captures a particular experience, while archiving on the server side allows multiple future experiences by capturing the back-end databases. Server-side archiving potentially allows the most authentic rendition of a Web site but is resource-intensive and may require support for multiple technologies. After discussing these options in more detail, Adrian went on to cover some of the specific challenges in preserving collected Web sites. TNA are looking at three key areas of preservation:

Preserving the bits (passive preservation)
Preserving the record (active preservation)
Managing multiple manifestations

Adrian identified a number of emerging tools that will assist in meeting the preservation challenge. Both Web-specific tools, such as LOCKSS [6] and the IIPC toolset, and generic tools, such as the JSTOR/Harvard Object Validation Environment (JHOVE) [7] and the National Archives’ online registry of technical information, PRONOM [8], will have a role to play.

Session 2: Collection Development and Legal Aspects

After a break for coffee, John Tuck, Head of British Collections at the British Library, opened the session on collection development and legal aspects. He introduced the legal deposit framework in the UK and the Legal Deposit Libraries Act (2003) [9], then focused on the British Library’s Collection Development Policy [10] with regard to Web archiving. Given the huge scale and dynamic nature of the Web, it is considered neither feasible nor affordable to attempt to capture the entire UK domain. Activities are therefore concentrated on taking a complete annual snapshot of the UK Web presence, and on archiving more frequently a limited number of sites judged to be of research value both now and in the future. Barriers to progress include the permissions process identified earlier by Philip Beresford, and the speed at which Web archiving tools and technologies develop relative to the speed of Web site development.

The last presentation of the morning was given by Andrew Charlesworth, Director of the Centre for IT and Law (CITL) [11] at the University of Bristol, who revisited some of the legal issues identified at the DPC’s first forum on Web archiving back in 2002. He noted that although legislation had changed little in the past three years, people were certainly more aware of the potential value of information posted to the Internet and of the detrimental effect Web archives may have upon the value of that material. Rights-holders in digital works are becoming increasingly proactive in protecting their intellectual property online; seeking permission to archive is therefore the safest way to approach harvesting. Andrew spoke in detail about some of the specific legal issues and risk factors in archiving Web sites:

Copyright: content owners are much more aware of copyright, but there have not yet been many instances of copyright cases on the UK Web archiving scene.
Defamation: the UK remains a favourable environment in which to bring a defamation case, largely as a result of the no-win-no-fee basis on which defamation claims can be launched. Cases can easily be brought in from other countries as the content is available in the UK, regardless of the country of origin.
Privacy and data protection: data protection in the UK is considered less risky now than it was three years ago, as it is now more widely understood.
Content liability laws: the question of who is liable for content collected and made available in a Web archive has not been a great issue over the past three years.

The key legal change in the past three years has probably been the move towards legal deposit, although that is itself a slow process.

Session 3: Other Projects and Future Developments

The afternoon session aimed to cover a select number of other projects and future developments, especially those taking place at international or European levels.

Julien Masanés, director of the European Web Archive [12] and co-ordinator of the International Internet Preservation Consortium (IIPC), kicked off with an exploration of the issues surrounding preservation of the Internet. The nature of the Web, as a grid of stable and discrete objects that are linked and both evolve and increase in number over time, means that an integrated and linked approach to preserving it is required. The structure of the Web can facilitate this, as it is linked in a way that makes it easy for crawlers or harvesters to go from site to related site, and sites are mostly compatible with one another. Functional collaboration is the most effective and efficient way to approach the challenge, distributing the tasks or resources between partners and sharing technical development. Julien also introduced the European Web Archive, a non-profit foundation operating as a technical peering operation with the Internet Archive and a technology partner for cultural institutions wishing to establish Web collections.
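The point that links make it easy for a harvester to move from site to related site can be illustrated with a minimal sketch. It is not the IIPC toolset or any harvester mentioned at the forum: the `crawl` function, the `fetch` callable and the example.org / example.net pages are all hypothetical, and the "Web" here is a small in-memory dictionary so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first traversal of linked pages, starting from a seed URL.

    `fetch` is a hypothetical callable that returns the HTML for a URL,
    or None when the page is unavailable.
    """
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the page they were found on.
            target = urljoin(url, href)
            if urlparse(target).scheme in ("http", "https") and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Toy in-memory "Web" (hypothetical pages, for illustration only).
PAGES = {
    "http://example.org/": '<a href="/about">About</a> <a href="http://example.net/">Partner</a>',
    "http://example.org/about": '<a href="/">Home</a>',
    "http://example.net/": '<a href="http://example.org/about">About</a>',
}

visited = crawl("http://example.org/", PAGES.get)
```

Starting from a single seed, the traversal reaches the second site purely by following a link, which is the property that lets harvesting be distributed between partners by seed list or domain. A real harvester would also need politeness delays, robots.txt handling and durable storage, none of which is shown here.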

Paul Bevan, previously the strategic manager for UKWAC at the National Library of Wales (NLW) [13] and now NLW Digital Asset Management System (DAMS) implementation manager, followed with a presentation on the NLW’s role in archiving Web sites relating to the 2005 general election. This was a UKWAC project shared out between the National Libraries of Wales and Scotland, and the British Library. The primary aim of the project was to develop a general archive of the 2005 general election, while simultaneously offering partners an opportunity to explore and identify some of the issues of such a collaborative project and test the capacity of the UKWAC system to manage events-based harvesting. Each partner had a slightly different selection strategy, with different harvesting frequencies and a different approach to permission seeking. The project highlighted how permission seeking could have a detrimental impact on the efficiency of the gathering process and how this affects the collection overall in an event-driven gathering initiative.

The final presentation was given by Catherine Lupovici of the International Internet Preservation Consortium and focused on standards and tools for domain-scale Web archiving. She discussed some of the tools and formats being developed by the IIPC as part of its toolset, covering the whole archiving chain from acquisition through to access. Introducing the IIPC content management approach, Catherine contended that the only true way to record the Web was to approach it from the domain-level perspective; this large-scale approach means that automatic tools and processes are essential, and it will allow future users to apply smart mining tools to collections. The next phase of the IIPC will extend the consortium, building on the initial toolset to develop more sophisticated tools for acquisition and access and to examine preservation issues in more detail.

Conclusions

The forum concluded with a panel session chaired by Robert Kiley of the Wellcome Trust [14] and Richard Boulderstone of the British Library. Speakers and delegates held a lively discussion on the issues touched upon during the day. The preservation of Web archives was raised as a special concern, particularly with regard to obsolete tags, plug-ins and browsers, and the wide variety of source materials used to derive or contain Web-accessible content. Audio-visual material, documents and publications, spreadsheets and slide shows, and a whole host of other file types are commonly posted on the Web and collected in Web archiving activity; these will also require preservation effort to ensure the contents remain accessible over time. Web archiving should therefore not be perceived as an isolated activity. Furthermore, the scale of the challenge requires the development of automated tools for different aspects of the archiving process, such as quality control and preservation; most current tools focus only on capture.

Collaboration between different initiatives, such as the IIPC, UKWAC, and the European Web Archive, will be key to addressing these issues.

References

[1] Digital Preservation Coalition (DPC) http://www.dpconline.org/
[2] British Library http://www.bl.uk/
[3] UK Web Archiving Consortium (UKWAC) http://www.webarchive.org.uk/
[4] International Internet Preservation Consortium (IIPC) http://netpreserve.org/
[5] UK National Archives http://www.nationalarchives.gov.uk/
[6] LOCKSS (Lots Of Copies Keeps Stuff Safe) http://www.lockss.org/
[7] JSTOR/Harvard Object Validation Environment (JHOVE) http://hul.harvard.edu/jhove/
[8] PRONOM http://www.nationalarchives.gov.uk/pronom/
[9] Legal Deposit Libraries Act (2003) http://www.opsi.gov.uk/acts/acts2003/20030028.htm
[10] British Library Collection Development Policy http://www.bl.uk/about/policies/collections.html
[11] Centre for IT and Law (CITL) at the University of Bristol http://www.law.bris.ac.uk/research/centreitlaw.html
[12] European Web Archive http://europarchive.org/
[13] National Library of Wales (NLW) http://www.llgc.org.uk/
[14] Wellcome Trust http://www.wellcome.ac.uk/

Author Details

Maureen Pennock
Research Officer
Digital Curation Centre
UKOLN
University of Bath

Email: m.pennock@ukoln.ac.uk
Web sites: http://www.dcc.ac.uk/, http://www.ukoln.ac.uk/

Manjula Patel
Research Officer
Digital Curation Centre
UKOLN
University of Bath

Email: m.patel@ukoln.ac.uk
Web sites: http://www.dcc.ac.uk/, http://www.ukoln.ac.uk/
