OMNI Corner

Sue Welsh

In her regular appearance in Ariadne, Sue Welsh introduces a new experiment in network indexing underway at OMNI.

OMNI is by no means the first or only one of the Electronic Libraries Programme Access to Network Resources projects to experiment with using the well-known and popular Harvest software to create descriptions of networked resources automatically. EEVL (the Edinburgh Engineering Virtual Library), for example, recently announced the availability of its own Harvest experiment [1], and non-eLib gateway projects had, in any case, beaten us to it some time ago.

In the light of this activity, setting up a Harvest for OMNI is neither a difficult nor a revolutionary thing to do, and in fact such a thing has existed in the development area of the OMNI server for some time. The best way to deploy the Harvester, though, was not immediately clear, especially taking into account the prominence given by OMNI throughout the Project's existence to the issue of the quality of Internet resources. This article explains how these concerns were resolved and describes the new OMNI Harvester. Finally, if you are a researcher or practitioner looking after a list of links in your own subject area which you'd like to share, we'd very much like to hear from you. Read on!

Harvest for Beginners

Harvest creates full text indexes to web documents. It can be used to index single files or entire servers, or be allowed to wander far and wide, using hyperlinks as its means of knowing which route to take. The two key variables which can be controlled are:

the number of hops that the Harvest can make (or how many documents to index before stopping)
the starting point or starting points - a list of URLs

It is also possible to restrain the Harvest software by preventing it from leaving one particular server.
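
The two controls described above, together with the single-server restriction, can be pictured as a breadth-first walk over hyperlinks. The Python sketch below is illustrative only: it is not Harvest's own code or configuration syntax, and the gather function, the LINKS table and the example URLs are all hypothetical, standing in for real page fetching and link extraction.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for the web: URL -> URLs it links to.
LINKS = {
    "http://a.example/list.html": [
        "http://a.example/page1.html",
        "http://b.example/home.html",
    ],
    "http://a.example/page1.html": ["http://a.example/page2.html"],
    "http://b.example/home.html": ["http://c.example/far.html"],
    "http://a.example/page2.html": [],
    "http://c.example/far.html": [],
}


def gather(start_urls, max_hops, same_host_only=False):
    """Return the URLs reachable from start_urls within max_hops links.

    URLs are returned in the order they are visited (breadth first).
    With same_host_only=True, links that lead to a different server
    than the page containing them are ignored.
    """
    starts = list(dict.fromkeys(start_urls))  # drop duplicate starting points
    seen = set(starts)
    visited = []
    queue = deque((url, 0) for url in starts)
    while queue:
        url, hops = queue.popleft()
        visited.append(url)  # a real gatherer would index the document here
        if hops == max_hops:
            continue  # hop limit reached: do not follow this page's links
        for target in LINKS.get(url, []):
            if target in seen:
                continue
            if same_host_only and urlparse(target).hostname != urlparse(url).hostname:
                continue
            seen.add(target)
            queue.append((target, hops + 1))
    return visited
```

Calling gather with max_hops set to 0 indexes only the starting points themselves, while a larger value lets the walk spread outwards one link at a time.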

Commonly, Harvest has been used to make an automated, full text version of gateways containing records created by hand. In this scenario, the URLs contained in the gateway records are used as starting points for Harvest. After indexing the document identified by a URL, the Harvest is usually allowed to follow URLs contained in the document, and continue indexing.

As all the starting points are relevant, it is likely that the URLs in the corresponding documents will point to other relevant resources. Of course, the further Harvest is allowed to wander, the less relevant the documents it finds will be to the subject gateway. It is assumed, however, that most documents which are only a few hops from the originals selected by the subject gateway will be useful.

It's the Quality, Stupid!

The same assumptions cannot, however, be made about the quality of the items found by this method. It is not possible to say if the items linked to the documents used as starting points are of the same quality, or if the author intended to recommend them to his readers, or hold them up as an example of what not to do! It is even less possible to make predictions about the accuracy and authority of resources that are five or six hops down the line. Restraining the Harvest by keeping to one server may help, but cannot be considered foolproof.

OMNI has always made evaluation and selection cornerstones of its approach to the creation of its gateway to biomedical resources. If we created this sort of Harvest database, it would be necessary to make quite clear to our users that they were using something quite different; nothing in the Harvest database could be said to have been reviewed, evaluated or selected. After much discussion, there was no consensus on whether this was a useful thing to do. Was there another option?

The OMNI Harvester - Built on Expert Advice

Towards the end of 1996, a different approach was suggested. OMNI is regularly contacted by subject experts who are compiling lists of Internet resources in their speciality, usually with a request to place links to their lists in the OMNI databases. The best way of dealing with these lists has been a subject of concern. Do they belong in the main OMNI database or should they be listed separately? How can we avoid OMNI users searching OMNI and finding only other lists, lists within lists, when they are searching for "content"? However, one thing these lists undeniably represent is selection: selection by subject experts. There must be a way to harness this sort of activity and make it available for OMNI users.

The OMNI Harvester is an experiment based around these two strands of activity. It is a full text database of resources taken from listings compiled by subject experts (in virology, nutrition, pediatrics, orthopaedics, neuroscience) or projects such as PharmWeb [2] (for pharmacology) and the UK Human Genome Mapping Project [3] website (for genetics/molecular biology). Seven subject areas are covered so far, and the scope of the Harvester will be extended. Because we constrain the Harvest software so that it does not travel away from the resources contained in the lists, the key element of selection is retained.

In the future we hope to integrate the Harvester and the main OMNI databases so that both can be searched simultaneously.

For more information...

Visit the OMNI Harvester and associated documents [4] and tell us what you think.

Even better, if you are a subject expert maintaining a list of resources, contact us and we'll include your resources next time the Harvester is updated. All the listings involved are used with permission, and are prominently linked from the Harvester pages.

References

[1] EEVL, the Edinburgh Engineering Virtual Library,
http://eevl.ac.uk/

[2] PharmWeb: Pharmacy Information on the Internet,
http://www.pharmweb.net/

[3] UK MRC HGMP Resource Centre,
http://www.hgmp.mrc.ac.uk/

[4] The OMNI Harvester,
http://omni.ac.uk/general-info/harvest.html

Author Details

Sue Welsh is the project officer for the OMNI eLib project.
Email: swelsh@nimr.mrc.ac.uk