From Link Rot to Web Sanctuary: Creating the Digital Education Resource Archive (DERA)
When I started as Technical Services Librarian at the Institute of Education (IOE) in September 2009, one of the first tasks I was given was to do something about all the broken links in the catalogue. Link rot [1] is the bane of the Systems Librarian’s life and I was well aware that you had to run fast to stand still. It occurs when a resource referenced by a static URL, for example one placed in a library catalogue to point to a relevant Web site, is subsequently moved to a different server or removed altogether.
I ran a report and identified that we had about 16,000 links to external resources in bibliographic records within our SirsiDynix Symphony library system. Of those, the in-built link checker was reporting about 1,200 (7.5%) as non-functional. However, we knew that the situation was actually worse than that. The link checker was doing its job to the best of its ability, but it could not detect sites which returned a genuine HTML page alerting the user that the document could not be found (a so-called ‘soft 404’), because that page was, in itself, a valid document.
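The distinction between a hard failure and a soft 404 can be sketched in a few lines. This is an illustrative checker only, not the SirsiDynix one: the phrase list and thresholds are assumptions, and a real tool would need far more heuristics.

```python
import urllib.request
import urllib.error

# Phrases that often mark a "soft 404": the server answers HTTP 200 but the
# page body is really an error notice. (Illustrative list, not exhaustive.)
SOFT_404_PHRASES = ("page not found", "document not found", "no longer available")

def classify(status, body):
    """Classify a response as 'ok', 'broken' or 'soft-404'."""
    if status >= 400:
        return "broken"
    if any(p in body.lower() for p in SOFT_404_PHRASES):
        return "soft-404"
    return "ok"

def check_link(url):
    """Fetch a URL and classify it; network failures count as 'broken'."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read(20000).decode("utf-8", errors="ignore")
            return classify(resp.status, body)
    except (urllib.error.URLError, OSError, ValueError):
        return "broken"
```

Separating the classification logic from the fetch makes the heuristic easy to test and tune against known-bad pages.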
Having listed the 1,200 known broken links by URL and catalogue record identifier, we decided to tackle them wholesale and try to fix them manually as a first step. This exercise turned out to be very fruitful as it identified some interesting trends. First, there were blocks of URLs relating to a single organisation’s Web site which had obviously broken all at once. In some cases, swapping the base part of the URL for the new one rectified the problem. More commonly, however, we found that, perhaps as the result of a migration to or between content management systems, the original file names had been changed or, worse still, removed, making amendment far more time-consuming or, in the worst instances, impossible.
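Where a whole block of links broke at once, the base-swap repair lends itself to a simple script. The mapping below is hypothetical (the department URLs are invented for illustration); the point is that a table of old-root/new-root pairs fixes whole batches at a stroke.

```python
# Hypothetical mapping of old site roots to their replacements, derived
# from inspecting blocks of links that broke together.
BASE_MAP = {
    "http://www.dfes.gov.uk/research/": "http://www.education.gov.uk/research/",
}

def remap_url(url, base_map):
    """Swap the base part of a URL when the rest of the path survived the
    site move; return None when no rule applies."""
    for old, new in base_map.items():
        if url.startswith(old):
            return new + url[len(old):]
    return None
```

Any URL the function returns would still need checking, since file names are often renamed as well as relocated during a migration.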
We were also able to take a snapshot view which told us how many of the broken links referred to documents that were no longer available [2]. Searching via Google would sometimes bring up an alternative candidate (if the file happened to have been mounted elsewhere on the Web), but in general it was evident that about 10% of the links referred to documents which no longer existed.
In many cases, these were documents from what we would classify as Official Publications [3] for the purpose of our specialist education collection. Official Publications are those from Government departments (e.g. the Department for Education), organisations involved in the inspection of education (e.g. Ofsted) and a range of other quasi-governmental bodies. The documents are largely, but not exclusively, in Word or PDF format, though we anticipate that the formats will expand to include multimedia in the future. The works include the results of research and monitoring exercises which inform educational policy changes. The take-up of persistent identifiers in the field of Official Publications appeared to be low, which further compounded the problem.
In the past, our method of last resort had been to retain a printed version of the resource, generated as part of our cataloguing workflow. This would be placed in store and made accessible to users only on request.
How Could We Do This Differently?
From this point, it became clear that we needed a better solution to this problem. We had been using an in-house research repository running the EPrints software [4] since about 2009. EPrints was originally developed by the University of Southampton to build open access repositories compliant with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It is widely used in the Higher Education community.
Our instance is hosted by the University of London Computing Centre (ULCC). The arrangement works well, providing us with an administrative interface which allows us to tailor our instance to local requirements, backed up by a robust infrastructure, support and technical expertise from ULCC itself. It occurred to us that this software could enable us to eradicate our link rot problem, whilst building in a core level of digital preservation and increasing the discoverability of these documents. We were convinced that a citation which linked to a record in a Web archive was far more likely to survive than one which did not [5]. We needed a quick solution to the problem of link rot. Whilst we were aware that other systems such as DSpace and Fedora were contenders, we were keen to work with ULCC and build on local EPrints expertise which we had developed at the IOE.
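OAI-PMH compliance is what makes records in an EPrints repository harvestable by other services. A minimal sketch of building a harvesting request is below; the `/cgi/oai2` endpoint path is the usual EPrints convention but is assumed here, not confirmed for this instance.

```python
import urllib.parse

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL. EPrints repositories
    conventionally expose an endpoint at <repository>/cgi/oai2."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # OAI-PMH sets allow selective harvesting, e.g. by collection.
        params["set"] = set_spec
    return base_url + "?" + urllib.parse.urlencode(params)
```

Fetching that URL returns an XML response of Dublin Core records, which an aggregator pages through using the protocol’s resumption tokens.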
A number of issues needed to be clarified. First, the copyright implications. Although all these documents were publicly available, harvesting them into a local repository, rather than simply linking to the original location, raised concerns which we wanted to address from the outset by negotiating with the organisations holding the original documents.
IPR Issues
Simultaneously, the election of the Coalition Government and its planned cuts [6] meant that we were embarking on this project at a critical moment, one at which the risk of content loss from government departments was rising. We initially considered using a suitable Creative Commons licence, but discovered that most organisations’ publications were covered by a PSI (Public Sector Information) click-through licence which was in many respects similar. We had no wish to exploit the content commercially and therefore decided to contact directly those organisations which did not offer their content for use under PSI. We felt this approach would be likely to quell any fears among the bodies we approached, and decided to position ourselves as offering ‘free preservation’, using the Institute’s reputation as grounds for being able to all but guarantee longevity. In most cases, our approach was welcomed with open arms.
Later in the year, this task was made even easier when the copyright situation was clarified by the introduction of the Open Government Licence [7].
Configuring the Metadata
The next stage was to decide which metadata we wanted to capture.
We were aware of the National Archives Electronic Records Online Pilot Project [8], which has some similarities to our own. For this reason we were doubly sure that the metadata associated with the documents needed to be of high quality, in keeping with the Library’s mission, and to offer the end-user both straightforward discovery and comprehensive coverage. It was at about this time that we realised that we were actually in the process of creating a new virtual collection of official publications in the arena of education. Potential names started to be bandied about for the resource and the Digital Education Resource Archive (DERA) [9] was born.
Rather than creating metadata fields from scratch, we used our existing EPrints repository [10] as a starting point, as many of the fields there remained appropriate in this context, for example the ability to choose different licence types. We also tried to build in flexibility in the schema, for example by retaining the ability to restrict access to the full text, should future donors decide they wished to proceed on the basis of that access model instead. The new fields which we introduced were:
- Type: We decided to call this “Document from Web”
- Collection: to allow granular grouping and filtering of documents by collection
- Subjects: There was some debate about the appropriateness of using our in-house subject schema, the London Education Thesaurus (LET) [11], a more generic one based on Library of Congress Subject Headings (LCSH), or none at all. In the end, building on research from the University of Southampton [12] which found that subject schemas had little impact in the realm of repositories, we decided not to use one at all. For future developments, we are still considering whether we might mine terms from the full text using LET and feed them into the discovery layer which the IOE is considering implementing.
- Organisation: This was felt to be critical in order to be able to filter on the provenance of organisations, some of which would soon no longer exist. For example, if it is decided to close an organisation, a notice is often put on its Web site. Without intervention, there would be no guarantee that the electronic assets would survive beyond the closure date. We become aware of such cases via our network of educational contacts and are now able to offer a solution. Typically, we would ask the organisation to supply the files on a pen drive with a spreadsheet of related metadata to allow us to upload them in bulk. In doing so, all the content would be linked to the organisation’s name as it existed at the time of closure. A worked example of how this happened in practice is given later on.
Getting Used to EPrints
One challenge presented by EPrints was getting to grips with tasks such as adding fields and changing the layout of screens. It was not so much the process itself that was difficult as the lack of documentation describing how to do it. Not having admin access at server level meant I had to rely on ULCC colleagues more than I might otherwise have wished.
Our pre-repository workflow was that potential documents for cataloguing were recorded in a spreadsheet by a Collection Development Librarian, who visited each relevant Web site at regular intervals. These were then added to the library catalogue by Technical Services as time permitted. Of course, this meant that in some cases the document no longer existed by the time of cataloguing. At the very least, we were aware that the link to the resource we were cataloguing might have little longevity, so a printed copy of the document was made at the same time, attached to the catalogue record and sent to store. This was as far as our preservation could go.
When we moved to using EPrints, we wanted to be able to retrieve the document at the time it was found (before link rot had time to set in) and, if necessary, add the extra metadata later. EPrints is very flexible in this respect and we will shortly add a new brief-format workflow which allows this to happen quickly, so that preservation can begin at the point of discovery.
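The capture-at-discovery idea can be sketched as follows. This is a minimal illustration of the principle, not the EPrints workflow itself: fetch the bytes immediately, store them, and record a provenance stub (source URL, timestamp, checksum) so that full cataloguing can follow later.

```python
import datetime
import hashlib
import json
import os
import urllib.request

def make_stub(url, data):
    """Minimal provenance record: where the bytes came from, when they were
    captured, and a checksum for later fixity checking."""
    return {
        "source_url": url,
        "captured": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }

def capture(url, dest_dir="captures"):
    """Fetch a document at the point of discovery and store it alongside
    its metadata stub, so preservation starts before link rot sets in."""
    os.makedirs(dest_dir, exist_ok=True)
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = resp.read()
    name = hashlib.sha1(url.encode()).hexdigest()  # stable file name per URL
    with open(os.path.join(dest_dir, name + ".bin"), "wb") as f:
        f.write(data)
    stub = make_stub(url, data)
    with open(os.path.join(dest_dir, name + ".json"), "w") as f:
        json.dump(stub, f, indent=2)
    return stub
```

Keeping the checksum from the moment of capture also gives a basis for later fixity audits, a core element of digital preservation.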
Retrospective Data Loading
At the end of 2010, configuration was largely complete and we had added several hundred documents to the system. We were able to soft-launch the service whilst we concentrated on building the repository up to a critical mass of content. We took a spreadsheet of 900 references which had not been catalogued in the interim period, transformed it into EPrints XML format and imported it in bulk, placing the original URL in a hidden notes field to give some extra provenance. Editors then started working through the records to link the documents and enhance the metadata to meet our standards.
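The spreadsheet-to-EPrints-XML transformation can be sketched like this. The field set here is deliberately minimal and illustrative (a real import needs the repository's full schema, creators, types and so on); the `note` element stands in for the hidden notes field carrying the original URL.

```python
import xml.etree.ElementTree as ET

def rows_to_eprints_xml(rows):
    """Turn spreadsheet rows (dicts) into a minimal EPrints XML import
    document. Only a handful of fields are shown; 'note' carries the
    original URL for provenance."""
    root = ET.Element("eprints", xmlns="http://eprints.org/ep2/data/2.0")
    for row in rows:
        ep = ET.SubElement(root, "eprint")
        ET.SubElement(ep, "title").text = row["title"]
        ET.SubElement(ep, "date").text = row.get("date", "")
        ET.SubElement(ep, "note").text = row.get("url", "")
    return ET.tostring(root, encoding="unicode")
```

In practice the rows would come from reading the spreadsheet with the `csv` module, and the resulting file would be loaded through the EPrints bulk import interface.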
Coincidentally, in the autumn of 2010 we were approached by staff at the British Educational Communications and Technology Agency (BECTA), one of the casualties of the government cuts. They were looking for a safe long-term home for their electronic document archive. We were able to satisfy them that DERA was the appropriate place, and they supplied us with a spreadsheet containing the metadata and the files themselves on a pen drive. Again, the metadata was loaded via XML and we attached the 400 documents supplied. By February the repository was a viable and useful resource, and we launched it officially on 28 February 2011.
Lessons Learnt and Future Plans
We now find ourselves in the position of having opened a sluice gate into which a backlog of content will need to be poured and refined. This extra workload would possibly have benefitted from some earlier project planning, but it is a much better place to find ourselves in than one in which there was no interest at all.

Another point is that we have had to accept all common file formats for the present. In practice, the majority are PDF, some MS Word and a few Excel files. For preservation purposes it would be preferable to convert and ingest in PDF/A format, at least for the textual formats. However, our view was that batch-migrating to that format at a later stage would involve only a small overhead, so it was better to spend time up front on metadata rather than file conversion. We felt that this was a pragmatic response which kept us within the spirit of digital preservation best practice.
We are also aware that data-based formats such as Excel cannot be meaningfully integrated into a full-text search and that these objects would benefit from better representations in which the data themselves can be interrogated. We do not think that EPrints is the right vehicle for this, and will be investigating how we might use models or services such as those used by the UK Data Archive [13] for this type of data.
The main things we have learnt from the project are that:
- Placing files in a repository provides digital preservation for key documents in the subject field and eliminates the link rot problem.
- Adding high-quality metadata enhances the resource and allows it to hold its head high and become an integral part of a library’s collection.
- A specialist library can play an important role in preserving domain-specific government content as part of its long-term strategy and ensure high-quality resources remain available.
- Provided you are prepared to get to grips with its complexity, the EPrints software is well suited to the task and provides good interoperability with legacy systems for importing metadata.
- The added value of being able to search the full text provides a potentially very rich resource for data mining whether by current or future researchers of educational history.
Future plans are to build up content levels to a critical mass. We will also be listening very carefully to what users say about the service. We intend to keep an eye on developments in this rapidly changing field to ensure that the service remains relevant and continues to fill the gap in content provision which we have identified. We also need to ensure that we can properly integrate this resource with our other more traditional library content, which makes the ability to cross-search our resources a greater imperative than ever.
References
1. Ailsa Parker, "Link Rot: How the Inaccessibility of Electronic Citations Affects the Quality of New Zealand Scholarly Literature". Library Journal Articles, 2007 http://www.coda.ac.nz/cgi/viewcontent.cgi?article=1000&context=whitireia_library_jo
2. "The National Library of Australia's Digital Preservation Agenda". National Library of Australia Staff Papers, 2009 http://www.nla.gov.au/openpublish/index.php/nlasp/article/viewArticle/1319
3. Official Publications Collection http://www.ioe.ac.uk/services/545.html
4. Open Access and Institutional Repositories with EPrints http://www.eprints.org/
5. Richard M. Davis, "Moving Targets: Web Preservation and Reference Management". January 2010, Ariadne, Issue 62 http://www.ariadne.ac.uk/issue62/davis/
6. "George Osborne outlines detail of £6.2bn spending cuts". BBC News, 24 May 2010 http://news.bbc.co.uk/1/hi/8699522.stm
7. Open Government Licence, The National Archives http://www.nationalarchives.gov.uk/doc/open-government-licence/
8. Electronic Records Online, The National Archives http://www.nationalarchives.gov.uk/ero/
9. Digital Education Resource Archive (DERA) http://dera.ioe.ac.uk/
10. IOE EPrints http://eprints.ioe.ac.uk/
11. Foskett, D. J., The London Education Classification: a thesaurus-classification of British educational terms. London, 1974
12. Leslie Carr, "Use of Navigational Tools in a Repository" [email, JISC-REPOSITORIES archives] https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=JISC-REPOSITORIES;66dc4da9.0603
13. UK Data Archive http://www.data-archive.ac.uk/
Author Details
Bernard M. Scaife
Technical Services Librarian
Institute of Education
University of London
Email: b.scaife@ioe.ac.uk
Web site: http://www.ioe.ac.uk/staff/31238.html
Bernard Scaife is Technical Services Librarian at the Institute of Education. His background is in library management systems and he is responsible for the cataloguing and acquisitions functions including the development of systems across the Library and Archives. Bernard is also involved in digitisation activities, migration of legacy metadata on the Web and in lowering the barriers to accessing high-quality education resources.