Meta Detectors
How do you find out what is of interest on the network? The answer is with difficulty. What should libraries and the eLib subject services be doing about this? The answer is not clear. Let's postpone the question for a while, and look at the rapidly shifting service and technical environment in which they are operating.
For many people the first ports of call are the major robot-based 'vacuum-cleaner' services such as Lycos and Alta Vista, which provide access to web pages worldwide, or classified listings such as Yahoo. Within UK higher education, there is a variety of services: NISS, BUBL, the new eLib subject-based services (ADAM, EEVL, IHR-Info, OMNI, SOSIG, et al), various listings and sensitive maps, and multiple more or less useful specialist resources. In an interesting recent development, Hensa at the University of Kent is experimenting with Harvest to create an index of academic sites. (The absence of such a service, or of some crawler-based approach concentrating on UK services, has been a notable omission. Another is a UK mirror, if it could be negotiated, of a significant global service such as Lycos or Alta Vista. As noted elsewhere in this issue, there is some speculation that Alta Vista may be about to launch a European mirror.)
At the same time there are many local initiatives such as BARD at Oxford or the LSE pages which are of much wider than local appeal.
Within the subset that is supported by higher education funds there is some overlap and even competition. BUBL has a growing classified collection of thousands of links and aspires to create MARC records for these links. NISS operates an information gateway and populates it with records in a variety of ways. BUBL and NISS aim to cover all subject areas. The subject-based services are creating subject specific gateways. Some are now available; others will come on stream, but major subject areas are not covered and projects are not constrained to take the same technical or service approach. The Arts and Humanities Data Service will have various 'gateways' or 'catalogues' as part of its services in particular subject domains. OCLC's NetFirst will be available in the UK through a central licensing deal: this will provide access to a database of 50,000 network resource descriptions, and will be hosted by NISS. One might argue that such variety is a welcome sign of vitality and that multiple experiments are desirable in an environment in which technical, service and organisational models need to be explored. This is certainly true at the project level: preferred future directions have to emerge and the need for more research and coordinated development effort is clear. At the funded service level, I think there is a danger that this plurality leads to dissipation of expertise, effort, and funding, as well as to confusion and wasted time among users. This is not to argue that there should not be diversity, but to wonder how central funding should be levered most effectively. (There is a line between service and project, even if it is sometimes difficult to know where to draw it.)
So, we have a range of projects and services with different scope, coverage and ambition. What these services do is provide access to metadata. Metadata, a clumsy term, is data which helps in the identification, location or description of a resource. Catalogue records are one form of metadata. To make the term concrete, a minimal record might look something like the following sketch (in Python, purely for illustration; the field names are hypothetical, not drawn from any particular scheme):
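    # A minimal, illustrative metadata record: data describing a resource,
    # not the resource itself. Field names and values are hypothetical.
    record = {
        "title":       "Survey of Networked Resource Discovery",
        "creator":     "A. N. Other",
        "description": "An overview of subject gateways and indexing services.",
        "location":    "http://www.example.ac.uk/survey.html",  # where it can be found
        "format":      "text/html",                             # a technical characteristic
        "subject":     ["metadata", "resource discovery"],
    }

We can identify some emerging developments in the creation and use of metadata on the Internet.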
The first, from which others flow, is that the discipline or control exercised over the production of collections of resources will improve as the web becomes a more mature publishing environment. There will be managed repositories of information objects. Such repositories may be managed by information-producing organisations themselves, universities for example, by traditional and 'new' publishers, or by other organisations (the Arts and Humanities Data Service, for example, or industrial and other research organisations, archives, image libraries, and so on). (This is not to suggest that the existing permissive publishing environment will not continue to exist in parallel.) One concern of a managed repository will be that its contents are consistently described and that these descriptions are promulgated in such a way that potential users are alerted to the resources they describe.
Secondly, discovery systems will be based on fuller descriptions. One aspect of this is that there will be more description of resources by those responsible for the resources themselves. This includes author-created descriptions, but will also be a central function of the repository managers identified above. The META and LINK tags in web page heads will be used to provide inline metadata which will be harvested by programs (a sketch of this appears below). Value may be added to this data at various stages along whatever use chain it traverses: by a local repository manager, by subject-based services, by crawler-based indexing services, by various other intermediary services. Librarians and others who now invest in current awareness and SDI (selective dissemination of information) services will be potential users and manipulators of such data. These developments will be facilitated by more off-the-shelf products; one likely to be of significant importance is Netscape's Catalog Server, drawing on Harvest technologies. A second aspect is that libraries, commercial services like OCLC's NetFirst, the UK subject-based services and others will in many cases originate descriptions or put in place frameworks for the origination of descriptions. It will be interesting to see what sort of market conditions prevail for the creation and sharing of such records. A number of factors, including the perceived value of a resource, will determine the relative balance between author-produced, added-value and third-party original descriptions in different scenarios.
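To make the inline approach concrete, here is a minimal sketch using only the standard Python library: a page carries descriptive META tags in its head, and a small program harvests them. The page, the tag names and the values are purely illustrative; no particular scheme is assumed.

    # A sketch of inline metadata and a simple harvester for it.
    from html.parser import HTMLParser

    PAGE = """
    <html><head>
    <title>Report on Network Resource Discovery</title>
    <meta name="author" content="A. N. Other">
    <meta name="description" content="A survey of resource discovery services.">
    <meta name="keywords" content="metadata, harvesting, discovery">
    </head><body>...</body></html>
    """

    class MetaHarvester(HTMLParser):
        """Collect name/content pairs from META tags."""
        def __init__(self):
            super().__init__()
            self.metadata = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attributes = dict(attrs)
                if "name" in attributes and "content" in attributes:
                    self.metadata[attributes["name"]] = attributes["content"]

    harvester = MetaHarvester()
    harvester.feed(PAGE)
    print(harvester.metadata)
    # {'author': 'A. N. Other', 'description': '...', 'keywords': '...'}

A crawler, a local repository manager or a subject-based service could each run something of this kind over the pages in its scope, adding value to the harvested records as they pass along the use chain.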
Thirdly, descriptions will have to accommodate more data and complexity than is currently the case. Web-crawlers tend to describe individual web pages; approaches based on manual description (e.g. BUBL, NISS or OMNI) have initially tended to focus on servers, without describing particular information objects on those servers. Neither approach is complete: users are interested in resources at various levels of granularity and aggregation which may not be satisfied by either of these simplified views. It is not clear how to express the variety of relationships, intellectual and formal, between resources and their characteristics. For example, what is the relationship between a web document and the Java or OLE objects it contains? A particularly important relationship, here expressed in traditional bibliographic terms, is that between a 'work' (an abstract intellectual object) and its 'manifestations' (its concrete physical instances): at a simple level, a work may exist as PostScript, RTF and HTML files, or it may exist in several places. (URNs - Uniform Resource Names - whenever they are operationalised, will have a role here.) At the same time, basic 'description' is not enough. To support a mature information environment a variety of types of metadata need to exist: data about terms and conditions; administrative metadata (data about the metadata itself: how current it is, who created it, links to fuller descriptions, and so on); data about file types, size and other technical characteristics. In some cases, a link to some other source may be included; this might be especially suitable for variable data such as terms and availability. This data will allow humans and clients to make sensible decisions about selected resources based on a range of attributes. (A client may pull a resource from a nearby server rather than a transatlantic one to conserve bandwidth, for example; a sketch of this follows below.) Again, the level of created structure (however it is designed) and the level of intellectual input deemed necessary will depend on the perceived value of the resources.
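The work/manifestation distinction, and the kind of client-side decision that richer metadata would support, can be sketched as a simple data structure. Everything here - the classes, the 'region' attribute, the example locations - is hypothetical and intended only to illustrate the idea:

    # A sketch of the 'work' / 'manifestation' distinction: one abstract
    # work, several concrete instances differing in format and location.
    from dataclasses import dataclass, field

    @dataclass
    class Manifestation:
        format: str      # e.g. "application/postscript" or "text/html"
        location: str    # a URL; a URN would name the work itself
        region: str      # a crude stand-in for network proximity
        size_bytes: int  # a technical characteristic

    @dataclass
    class Work:
        title: str
        manifestations: list = field(default_factory=list)

        def nearest(self, preferred_region):
            """Prefer a nearby copy, to conserve bandwidth."""
            local = [m for m in self.manifestations if m.region == preferred_region]
            return local[0] if local else self.manifestations[0]

    survey = Work("Survey of Subject Gateways")
    survey.manifestations = [
        Manifestation("text/html", "http://www.example.com/survey.html", "us", 40000),
        Manifestation("application/postscript", "http://www.example.ac.uk/survey.ps", "uk", 250000),
    ]
    copy = survey.nearest("uk")  # the UK copy, not the transatlantic one

A URN would name the work; each manifestation keeps its own location, format and technical characteristics, and a human or client can choose among them.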
Next, programs will collect and manipulate data in a variety of ways, passing it off to other applications in the process. Harvest is an interesting example of a set of tools which allow data to be extracted in customised ways from resources and passed between nodes to support the construction of subject- or community-specific resource discovery services. Data may be collected and served up in various environments, converted on the fly to appropriate formats: it may be collected by robot, added to a local 'catalogue', or pulled into a subject-based service. The metadata we have been talking about refers to network information resources. There may also be metadata about people, about courses, about research departments, and about other objects. Programs might periodically look for resources that match a particular user profile, might search for people with a particular research interest, and so on. At the same time greater use will be made of existing data. For example, NetLab at the University of Lund has experimented with the automatic classification of resources based on matches between metadata and UDC schedules and with generating sets of related or central resources based on following links from a seed set. They have looked at the generation of a set of engineering resources by automatically following links from several central known resources in this area and ranking found resources by the number of times they are linked to.
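The link-following experiment can be illustrated with a short sketch of the general technique (not NetLab's actual code): follow links one step out from a seed set of known central resources, and rank what is found by the number of times it is linked to. The links_from function and its sample data are hypothetical stand-ins for fetching and parsing real pages.

    # Rank resources by in-link counts from a seed set.
    from collections import Counter

    def links_from(url):
        """Hypothetical stand-in: fetch a page and return the URLs it links to."""
        sample_web = {
            "http://seed-a.example.edu/": ["http://x.example.org/", "http://y.example.org/"],
            "http://seed-b.example.edu/": ["http://y.example.org/", "http://z.example.org/"],
        }
        return sample_web.get(url, [])

    def rank_by_inlinks(seeds):
        counts = Counter()
        for seed in seeds:
            for target in links_from(seed):
                counts[target] += 1
        return counts.most_common()  # most frequently linked-to first

    print(rank_by_inlinks(["http://seed-a.example.edu/", "http://seed-b.example.edu/"]))
    # [('http://y.example.org/', 2), ('http://x.example.org/', 1), ('http://z.example.org/', 1)]

In practice the crawl would run more than one step and feed the counts into selection or classification; the principle - centrality inferred from being cited by known good resources - is as shown.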
Finally, these developments will take place in a rapidly changing distributed environment in which directory protocols (Whois++, ...), search and retrieve protocols (Z39.50, ...), Harvest, and a variety of other approaches will be deployed. These will be hidden from the user by the web, itself likely to be transformed by the integration of Java and distributed object technologies.
How do the eLib subject projects fit into this environment? In different degrees, they emphasise selection of good quality resources and fullish descriptions. Some have additional emphases. They are in the early stages of development and are establishing an identity, a view of user requirements and subject-specific indexing and descriptive approaches, and links into the subject bases that support them. Like other eLib projects, they have been told that they should consider how to become self-financing. They may look at alternative sources of funding, relationships with abstracting and indexing or other services which wish to move into this area, and various forms of charging and of record sharing. However, it seems to me that there is a strong case for continued central funding for one particular aspect of their work. Individual universities, and the higher education and research system as a whole, will be making more information about their courses, more of their research results, and more training and course materials available on the network in more or less controlled environments. In an increasingly competitive environment, the JISC should have an interest in the organised, effective disclosure of the results of UK research and teaching. The subject services can play an obvious role here, but can also encourage the development of well-managed information repositories within their domains, and provide a focus for the collection of metadata. Whatever role they play, it is important that they put in place mechanisms to support the creation and submission of data from information repositories themselves, that they exploit emerging 'harvesting' technologies to enhance their services, and that they explore collaborative possibilities. They may add value to this collected data in various ways, but without such an approach, except in the narrowest of subject areas, their one, two or three 'cataloguers' will be overwhelmed by the volume, variety and volatility of the resources to be described.
And individual libraries? They have the immediate practical problem of describing network resources they own or license, as well as the question of how to present access to emerging resource discovery services. In the medium term, they have a role, perhaps shared with others on campus, in helping manage local repositories of data, creating customised discovery services, and integrating access to network and print resources in ways that save the time of the user. They can become the active channels through which their institutions effectively promote their own work as well as discover that of others.
Library interests touch on those of the subject services and others who will be bound together in the same metadata use chain. They will all benefit from the protocol and standards framework which allows easy flow of data and seamless access to diverse services, and a service and organisational framework which identifies roles and responsibilities. However, this framework is not in place. Without a coordinated approach, valuable work will be done, but we are likely to end up with a patchwork of incompatible services which waste time and effort. With some coordination, and some concentration of vision and development effort, the collection of services could be greater than the sum of its parts. I would argue that a development agency aiming to facilitate concerted action at technical, service and organisational levels would repay investment. Its function would not be to be prescriptive, but to set a realistic aspirational level and to outline how we get from here to there. It would provide a focus for concerted action. We are embarking on a construction phase in which it should be possible to create something which avoids the fragmentation of current bibliographic services. It would be a pity if this opportunity were not taken, and if some future eLib programme then had to place high on its agenda the creation of a framework for the description of higher education resources, or the purchase of such a service from some other source.
(I would like to thank Peter Stone and Frank Norman for helpful comments on an earlier draft of this piece. The views expressed are the author's own.)