Web Mirrors: Building the UK Mirror Service
On 1st August 1999 the UK Mirror Service [1] replaced HENSA as the JISC [2] funded mirror service for the UK academic community. The new service is run by the same teams at Kent and Lancaster that supported the HENSAs, but it is not merely a revamp of the HENSA sites; there are some fundamental changes.
This article takes a look at the implementation of the service and our plans for the future.
A Distributed Service
We decided at the outset to locate the service at both Kent and Lancaster, so that the service could survive the loss of either one of the sites. Our experience was that individual sites do occasionally fall off JANET for various reasons (network outages, power cuts, router upgrades etc.). These events are infrequent, but happen often enough that there is a real gain in availability of service from the two-site approach.
The service is also distributed over a group of machines at each site. Incoming Web and FTP requests are handled by an array of front end hosts. These forward the requests to back ends which contain the mirrored data. The front ends cache data from the back ends, but operate independently of each other - there is no attempt to avoid duplication of material between the caching front ends. While this costs some disk space, it does mean that the system scales well to handle the peak demands that occur when new releases of well-known packages appear.
The key advantage of this arrangement is that it is scalable. As the load increases we can simply add more front ends. The front ends are cheap Intel boxes running Linux [3] and Apache [4]. As the front ends contain only cached data they do not need to be backed up, and setting them up is a trivial process.
Mirroring
We use a locally developed system called Cftp [5] to handle mirroring of FTP and Web sites. This had been in use at HENSA Unix since 1997, and has grown robust over the years to handle the vagaries of FTP servers that crash, provide listings in odd formats and violate the FTP RFC [6] in various interesting ways.
We needed to change Cftp to support mirroring to both the Kent and Lancaster sites. Our first attempt at this was a master/slave model: Cftp at the master site completed all its updates, then propagated the changes to the slave site. But this meant that updates were lost if the two sites lost network connectivity, and also that the slave site's updates were lost if there were problems at the master site.
Our second (and current) solution was to make each site carry out mirroring independently, but for each site to look 'sideways' at the other. The mirroring process works like this at each site:
- We build an index file describing the current state of the origin site (the site we are mirroring).
- We use the index file to compare our local copy with the current state of the origin site.
- If we need to update a file, we first look at our peer site to see if it already has an up-to-date copy. If so, we fetch it from the peer rather than the origin site.
To avoid both sites working in lock step, one traverses the tree forwards and the other backwards. To handle more than two cooperating sites we would need to use a random order or some other scheme.
This scheme has worked well in practice. Although the two sites are handling mirroring independently, there is very little duplicated copying of resources.
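The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Cftp's actual code: the Entry record, the plan_updates function and the use of size plus modification time to decide whether a copy is current are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entry:
    """What an index records about one file (hypothetical fields)."""
    size: int
    mtime: int  # modification time reported by the origin site


Index = Dict[str, Entry]  # maps a path to its index entry


def plan_updates(origin: Index, local: Index, peer: Index,
                 reverse: bool = False) -> List[Tuple[str, str]]:
    """Return (path, source) pairs for every file that needs fetching.

    source is "peer" when the peer site already holds a copy matching
    the origin index, otherwise "origin". One site would run with
    reverse=False and the other with reverse=True, so that the two
    sites do not work through the tree in lock step.
    """
    plan = []
    for path in sorted(origin, reverse=reverse):
        wanted = origin[path]
        if local.get(path) == wanted:
            continue  # local copy is already up to date
        source = "peer" if peer.get(path) == wanted else "origin"
        plan.append((path, source))
    return plan
```

With one site walking forwards and the other backwards, each site tends to find that the far end of the tree has already been fetched by its peer, which is consistent with the low level of duplicated copying described above.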
Mirror Status Information
A problem with most mirror sites is that you have no idea how up to date they are. Most sites attempt an update at least daily, but the mirroring process is vulnerable to various failure modes at the origin sites, the mirror site or the network. Typically the user has no way of knowing whether a mirror site is up to date, and we felt that this was a significant factor deterring people from using mirrors.
We therefore decided to label each mirrored directory with 'last mirrored' information. This includes the time the index file was generated and the time the directory itself was last compared with the origin site. Cftp maintains this information in a status file stored in each directory. We felt it was important to timestamp each directory individually rather than just keeping a 'last mirrored' timestamp for each origin site, because mirroring can fail part way through an update.
The mirroring process also maintains other information in the status file, including when each file in the directory was last copied, error messages for attempted updates that failed and total size information for directories. Currently this information is only used to show last updated and total size information via the web interface, but we plan to extend this soon. For example, we will make the FTP server show directory timestamps that indicate the last modified time of the most recently changed file anywhere below the directory.
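As a rough illustration of the planned directory timestamps, the sketch below models a per-directory status record and computes the most recent modification time anywhere below a directory. The DirStatus layout and its field names are hypothetical, and it assumes that per-file modification times are available alongside the status information; the real status file format is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DirStatus:
    """Hypothetical in-memory view of one directory's status file."""
    index_generated: int   # when the origin index was built
    last_compared: int     # when this directory was checked against it
    file_mtimes: Dict[str, int] = field(default_factory=dict)
    subdirs: Dict[str, "DirStatus"] = field(default_factory=dict)


def newest_mtime(status: DirStatus) -> Optional[int]:
    """Latest modification time of any file at or below this directory.

    Returns None for a tree that contains no files at all.
    """
    candidates = list(status.file_mtimes.values())
    for child in status.subdirs.values():
        child_newest = newest_mtime(child)
        if child_newest is not None:
            candidates.append(child_newest)
    return max(candidates) if candidates else None
```

Because each directory carries its own record, a partially failed update only leaves stale timestamps on the directories that were not reached.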
We also plan to use the status files to indicate which files have not been copied. Mirror sites are traditionally more or less selective about what they copy, but again users have no way of knowing what has been left out. Because of this the UK Mirror Service has up to now been fairly conservative, typically mirroring everything except files that are duplicated elsewhere in the archive. Explicit labelling of what has been skipped will let us omit 'dead' items - old releases of software for example - that are not often accessed. The Web interface could display these items as links to the origin server copies (we could also do this under the FTP interface).
Configuration
Under HENSA we maintained various configuration files which controlled the mirroring process and the Web and FTP servers. Adding a new mirror involved updates to all of these files. The UK Mirror Service uses a different approach: all the information about a mirror is contained in a single per-mirror XML [7] file. Among other things this file contains:
- A description of the site
- Contact information
- Where the site should appear in the Web and FTP trees
- How the site should be mirrored
These files are used to generate the configuration files for the Web and FTP servers, which are then automatically propagated to the front and back end servers. The use of XML means that we can easily extend the mirror description files with additional information (for example, if we wished to use an alternative method of mirroring).
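To make the idea concrete, here is a minimal sketch of reading one mirror description and emitting a single Web server configuration line from it. The element and attribute names, the example site and the /data root are invented for illustration; the actual schema of the mirror description files is not given in this article. The output uses Apache's Alias directive, but whether the service generates its configuration in this way is an assumption.

```python
import xml.etree.ElementTree as ET

# A hypothetical per-mirror description file.
MIRROR_XML = """\
<mirror name="example">
  <description>An example archive</description>
  <contact>admin@example.org</contact>
  <location web="/sites/example" ftp="/pub/example"/>
  <method type="ftp" origin="ftp://ftp.example.org/pub/"/>
</mirror>
"""


def parse_mirror(xml_text):
    """Collect the fields of one mirror description into a dict."""
    root = ET.fromstring(xml_text)
    location = root.find("location")
    method = root.find("method")
    return {
        "name": root.get("name"),
        "description": root.findtext("description"),
        "contact": root.findtext("contact"),
        "web_path": location.get("web"),
        "ftp_path": location.get("ftp"),
        "method": method.get("type"),
        "origin": method.get("origin"),
    }


def web_alias_line(mirror, data_root="/data"):
    """Map the mirror's place in the Web tree to its data directory."""
    return f'Alias {mirror["web_path"]} {data_root}/{mirror["name"]}'
```

Keeping everything about a mirror in one file means that adding a mirror is a single edit, with the server configuration regenerated from it.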
Searching
Searching at HENSA was implemented with in-house tools, and had different implementations and user interfaces at the Unix and Micros sites. For the UK Mirror Service we are using Zebra [8], which is a Z39.50 [9] based indexing system. Locally developed scripts handle the extraction of index information (e.g. descriptive text from Linux RPM [10] files) and the construction of a single search index for the whole site.
We view effective search support as critical to the usability of the UK Mirror Service. We have concentrated up to now on building the foundations for a powerful search system, but the user interface is at present far from ideal (frankly due to the pressure of time). In particular many users have complained of the lack of a platform-based search facility. Thus work on the searching user interface is a high priority at present.
The first step is to add basic platform-specific search support. After that we plan to extend the extraction of descriptive text from the various package formats; for example README files and the like from software distributions. Also planned is intelligent handling of multiple versions of packages - usually the current version of a package is the only one of interest.
Future work
We have a number of projects planned for the short and medium term. Right now we are working on improvements to the searching user interface as described in the previous section. We are also looking at extending our coverage, particularly of material outside the traditional area of software packages.
We are currently looking at ways of making the Mirror Service work with the National JANET Web Caching Service [11]. There is obvious scope for the UK Mirror Service to satisfy requests for FTP URLs. We are in the early stages of collaboration with the National Web Cache to develop support for this.
The archive is currently accessible via the Web or by FTP. We would like to add NFS [12] and Samba [13] access, but this will need some software development to support effectively. Because the archive is distributed over several machines we cannot use standard software to support NFS access, but we should be able to modify freely available NFS server software to do the job.
We would like to offer added value for popular items, such as prebuilt binary versions of some source packages, and perhaps UK- or mirror-specific help for some very popular items (e.g. 'How to install Linux from the UK Mirror Service'). We also plan substantial enhancements to our tutorial information, including additional help on things like file formats and software installation.
A slightly longer term project is to set up a user forum to allow people to review resources, as well as provide helpful information. This would require careful nurturing and steering if it is to succeed, but we think it would be extremely useful.
As well as these externally visible improvements, there is much internal work to do. For example, we would like to implement automatic failover between the front end Web servers, so that if one machine fails another immediately takes over its IP address. We would also like to make the front end hosts automatically handle failure of a back end server by switching to the alternative host.
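The back end failover could look something like the following sketch, in which a front end tries each back end in turn. It is not a description of existing code: the function name, the host names in the usage note and the idea of passing in the transfer function are all assumptions made to keep the example self-contained.

```python
from typing import Callable, Iterable


def fetch_with_failover(path: str,
                        backends: Iterable[str],
                        fetch: Callable[[str, str], bytes]) -> bytes:
    """Try each back end in order and return the first successful reply.

    fetch(host, path) stands in for whatever transfer the front end
    really performs; it is expected to raise OSError on failure.
    """
    last_error = None
    for host in backends:
        try:
            return fetch(host, path)
        except OSError as exc:
            last_error = exc  # remember the failure, try the next host
    raise ConnectionError(f"no back end could serve {path}") from last_error
```

A front end at Kent might call this with a hypothetical list such as ["kent-backend", "lancaster-backend"], so that the local back end is preferred and the remote one is used only when the local one fails.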
Another mundane but important area of work is improved processing of access logs. As can be imagined, we generate large volumes of log data, and buried in there is valuable information about what is popular, who is using the service, etc.
In conclusion, please try out the service and let us know what you think. We particularly welcome suggestions for resources or facilities you would like to see added to the site. Our aim is to provide what our users need, and to do that we need feedback from you.
References
- [1] The UK Mirror Service, http://www.mirror.ac.uk/
- [2] The Joint Information Systems Committee, http://www.jisc.ac.uk/
- [3] The Linux operating system, http://www.linux.org/
- [4] The Apache web server, http://www.apache.org/
- [5] Cftp: A Caching FTP Server, Computer Networks and ISDN Systems 30 (1998) 2211-2222
- [6] RFC 959: File Transfer Protocol, ftp://ftp.isi.edu/in-notes/rfc959.txt
- [7] XML: Extensible Markup Language, http://www.w3.org/XML/
- [8] The Zebra Server, http://www.indexdata.dk/zebra/
- [9] Z39.50, http://www.cni.org/projects/z3950/z3950dir.html
- [10] The RPM software packaging tool, http://www.rpm.org/
- [11] The National JANET Web Cache, http://wwwcache.ja.net/
- [12] The NFS Distributed File Service, http://www.sun.com/software/white-papers/wp-nfs/
- [13] The Samba file and print service, http://www.samba.org/
Author details
Mark Russell