Sharing History of Science and Medicine Gateway Metadata Using OAI-PMH
The MedHist gateway [1] was launched in August 2002, providing access to a searchable and browsable catalogue of high-quality, evaluated history of medicine Internet resources. MedHist has been funded and developed by the Wellcome Library for the History and Understanding of Medicine [2], but is hosted by the BIOME health and life sciences hub [3], and as such is part of the Resource Discovery Network (RDN). MedHist was developed principally to fill the gaps left in the coverage of the history of medicine by existing resource discovery services within and outside the RDN. Both the Humbul Humanities Hub [4] and the OMNI [5] gateway within BIOME provided some coverage of the subject, although this was not exhaustive. Outside the RDN, resource discovery services for the history of medicine were either defunct or concentrated on far narrower or broader subject areas [6].
Because the history of medicine is such an interdisciplinary subject, the Wellcome Library had difficulty deciding where to locate MedHist. Keen to keep the service within the RDN, the Library decided to make MedHist a part of BIOME, whose federated structure of health and medicine related gateways under a single hub suited the creation of an independent gateway affiliated to an existing service.
However, the interdisciplinary nature of the subject area suggested that it would also be important to make MedHist's resource description records available to other services with overlapping subject interests, such as Humbul and the SOSIG [7] gateway, and to import any relevant metadata from other gateways, such as Humbul's History and Philosophy of Science records. Therefore, early in the development of MedHist, methods of making its metadata available to other services, and of importing external metadata, were investigated. The solution decided upon was the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [8].
Gateway metadata
MedHist, in line with other RDN gateways, collects and makes available descriptive metadata about Internet resources, catalogued in accordance with the Dublin Core Element Set [9]. In addition to obvious access points such as title and URL, resource descriptions include an evaluative paragraph outlining the purpose and main features of the resource, and keywords assigned from the National Library of Medicine's MeSH (Medical Subject Headings) thesaurus [10]. Where a resource is dedicated to an individual, personal name headings from the Library of Congress Name Authority File (LCNAF) are also added [11]. Alongside this descriptive metadata, administrative metadata such as site creator and owner is also collected, but not displayed on the MedHist Website.
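As a rough illustration of the distinction between descriptive and administrative metadata, the sketch below models a resource description in Python. The field names follow Dublin Core element names, but the `ResourceDescription` class, the `public_view` helper and the sample values are hypothetical and are not taken from the MedHist system.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceDescription:
    # Descriptive metadata, shown to users
    title: str
    identifier: str            # the resource URL
    description: str           # evaluative paragraph
    subjects: list[str] = field(default_factory=list)  # MeSH / LCNAF headings
    # Administrative metadata, collected but not displayed
    creator: str = ""
    publisher: str = ""

    def public_view(self) -> dict:
        """Return only the fields intended for public display."""
        return {
            "title": self.title,
            "identifier": self.identifier,
            "description": self.description,
            "subjects": list(self.subjects),
        }


# Illustrative values only
record = ResourceDescription(
    title="Example history of medicine site",
    identifier="http://www.example.org/",
    description="A hypothetical resource used for illustration.",
    subjects=["History of Medicine"],
    creator="A. N. Author",
)
print(record.public_view())
```

Keeping the administrative fields on the same object but out of the display view mirrors the practice described above: the data is held, but a separate decision governs what is shown.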
Approaches to resource sharing
MedHist records are automatically available via a number of different access points: the MedHist Website; the BIOME Website, which searches the catalogues of all its constituent gateways; and the RDN ResourceFinder database, a "union catalogue" of all the RDN's service providers' catalogues [12].
However, sharing metadata directly between different RDN services required the use of a separate process. Several options were considered before deciding on the use of OAI-PMH to expose and harvest metadata:
- Live cross-searching
A possible solution was the use of live cross-searching from one gateway of another's resource catalogue using Z39.50. Leaving aside some of the disadvantages of Z39.50 in terms of server response times [13], relying on cross-searching would have required some amendments to MedHist's and Humbul's user interfaces, to, for example, include a tick box requiring the user to indicate they wished to search another gateway's catalogue. Additionally, relying on cross-searching would also cause problems for users wishing to browse. Using Z39.50 does not easily allow for the incorporation of external records into the home service's browse structure.
- Custom-built solutions
Another approach considered was the use of custom-built solutions such as the My Humbul include [14]. This particular service allows for the selection of a number of records from the Humbul hub which can be reproduced on another Website, using a Javascript include. This service and the use of custom-built solutions were rejected for a number of reasons. Firstly, the My Humbul service would simply allow for the display of a selection of records from Humbul on the MedHist Website. The records themselves would still reside on the Humbul database and would not be searchable via MedHist or truly integrated into MedHist's subject browsing structure. Secondly, reliance on custom-built solutions was not felt to be effective because they would not necessarily be re-usable and would be specific to each gateway which entered into a resource sharing arrangement.
- RSS feeds
Similar to the above solution would be the use of RSS (RDF Site Summary) [15] feeds to display third party metadata. Some gateways make their new acquisitions lists available in this format, using a service such as UKOLN's RSSxpress [16]. As with the use of custom-built solutions, records displayed in this way are neither searchable nor truly browsable, and there may be limited control over their display, even with the use of CSS stylesheets.
- OAI-PMH
The use of OAI-PMH to expose and harvest metadata records between MedHist and other gateways seemed the most effective solution. Not only was it an emerging standard for the exchange of metadata, it was already in use across the RDN to create the central ResourceFinder database: BIOME and other gateways already made available entire sets of their metadata records in OAI repositories which are harvested by the RDN [17]. BIOME and Humbul also had in place the necessary harvesting tools and technologies to import third party OAI repository data and index the information into their own databases.
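To make the harvesting model more concrete, here is a minimal sketch of the two halves of an OAI-PMH exchange: building a `ListRecords` request and reading unqualified Dublin Core out of the response. It assumes the XML namespaces of version 2.0 of the protocol (the article does not say which version the gateways ran), and the base URL, set name and sample response are invented for illustration. No network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and the Dublin Core element set
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, set_spec=None, from_date=None):
    """Build a ListRecords request for unqualified Dublin Core (oai_dc)."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec       # restrict the harvest to one set
    if from_date:
        params["from"] = from_date     # only records changed since this date
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract identifier, title and subjects from each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{{{OAI_NS}}}record"):
        ident = rec.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = next((e.text for e in rec.iter(f"{{{DC_NS}}}title")), None)
        subjects = [e.text for e in rec.iter(f"{{{DC_NS}}}subject")]
        records.append({"identifier": ident, "title": title, "subjects": subjects})
    return records


# Hypothetical response from a hypothetical repository
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example resource</dc:title>
          <dc:subject>History of Medicine</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://www.example.org/oai", set_spec="hps"))
print(parse_records(SAMPLE))
```

A production harvester would also need to follow resumption tokens and handle deleted records, which this sketch leaves out.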
Using OAI-PMH to share metadata records
At present, MedHist records are exported to Humbul, and Humbul's History and Philosophy of Science (HPS) records are imported into MedHist, on a weekly basis. Whilst the exchange of records has been largely successful, a number of ongoing issues, some of which are examined below, have so far prevented MedHist records from being displayed on Humbul. However, Humbul HPS records are fully accessible via MedHist.
Importing Humbul HPS records into MedHist
Humbul OAI records are currently exported to MedHist on a weekly basis:
- A script on the Humbul server automatically emails the BIOME technical staff a file of Humbul OAI records (in tar format)
- The records are indexed onto the MedHist database
- BIOME alert MedHist staff to new Humbul HPS records
Once imported, MedHist staff:
- Decide on which records are most suitable for MedHist's subject coverage and make them "live" on MedHist
- Add local metadata such as MeSH and LCNAF subject headings.
- Suppress any resource descriptions from Humbul that "duplicate" existing MedHist descriptions.
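The review steps above are carried out by MedHist staff, but their logic can be sketched in code. The following is a hypothetical outline, not the BIOME or MedHist software: the function names, the record layout and the URL-based duplicate check are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def normalise_url(url):
    """Reduce a URL to a comparable key (lower-case host, no trailing slash)."""
    parts = urlsplit(url.strip())
    return parts.netloc.lower() + parts.path.rstrip("/")


def review_imports(imported, existing_urls, keywords_for):
    """Split imported records into live and suppressed lists.

    imported      -- list of dicts with "url" and "title" keys
    existing_urls -- URLs already described in the local catalogue
    keywords_for  -- dict mapping URL to locally assigned subject headings;
                     records without an entry stay pending (in neither list)
    """
    seen = {normalise_url(u) for u in existing_urls}
    live, suppressed = [], []
    for rec in imported:
        if normalise_url(rec["url"]) in seen:
            suppressed.append(rec)      # duplicates an existing description
            continue
        headings = keywords_for.get(rec["url"])
        if headings:                    # selected and catalogued by staff
            live.append({**rec, "subjects": headings, "status": "live"})
    return live, suppressed


# Illustrative data only
live, suppressed = review_imports(
    imported=[
        {"url": "http://www.example.org/a/", "title": "A"},
        {"url": "http://www.example.org/b", "title": "B"},
    ],
    existing_urls=["http://WWW.example.org/a"],
    keywords_for={"http://www.example.org/b": ["Philosophy"]},
)
print([r["title"] for r in live], [r["title"] for r in suppressed])
```

Matching on a normalised URL is only one possible test for a "duplicate"; in practice the judgement is an editorial one, which is why the step remains manual.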
Once live, Humbul HPS records are available on the MedHist Website:
- For searching via MedHist's search engine
- For browsing through MedHist's subject browse. Assigned keywords allow Humbul records to be fully integrated into MedHist's browse structure.
- Humbul HPS records display in the same format as MedHist records with an added rights statement that indicates the record is from Humbul's History and Philosophy of Science collection.
Record display for the Aldous Huxley: the author and his times Website
...and the same record displaying in MedHist
Some issues
To date the process has highlighted some issues which are currently being addressed by both Humbul and MedHist.
Staff time overheads
To fully integrate Humbul records into MedHist, MedHist staff must spend time adding subject headings to imported records, and deleting any Humbul records which "duplicate" existing MedHist entries. Given the number of records currently imported from Humbul this is not too time consuming, although it raises questions about the sustainability of the process if it were extended to include records from other gateways. One approach currently being considered is the automated addition of suitable subject headings by the data provider, i.e. for Humbul to add agreed MeSH keywords to each of the records exported. This would allow records to be incorporated into MedHist's browse structure, although it is likely they would have to be very generic headings (e.g. "Science" and "Philosophy"). Whether it is possible, or even desirable, for third party metadata to be in an automatic "live" state after import is another area which needs to be examined more closely.
Re-presenting metadata
MedHist and Humbul have slightly different conventions for the display of metadata records. Humbul favour both a short and a full record display, the latter displaying information such as site author and publisher in addition to title, URL, description etc. MedHist has only one record display, which features hyperlinked title, description and keywords. BIOME have tended to use author / publisher information only for internal purposes, whereas Humbul consider this part of the full record display, in the same way it would be displayed for books within a library OPAC. Certainly BIOME have expressed some concern about this administrative data being displayed and the implications it may have for data protection. This is an area which may need to be considered across the RDN as a whole.
Rights
In a similar vein, some thought has had to be given to the expression of rights statements when displaying third party metadata. At present MedHist displays a single rights statement which indicates the record is from Humbul's History and Philosophy of Science collection, but does not reflect the rights statements published by Humbul which acknowledge the work of individual cataloguers within its distributed cataloguing system. The most likely way that this will be addressed is to have a hyperlinked rights statement that will lead users to the full record within Humbul, where they will also be able to see the resource creator and publisher information.
Collection development
Over-reliance on third party metadata could potentially encourage gateways not to catalogue within certain areas of their collections which may be covered by other gateways. Whilst this may cut down duplication of effort and be seen as beneficial, it may mean that some key resources are never described by a particular gateway for its own key audience. MedHist continues to catalogue all history of medicine resources, including ones that may already be within Humbul, and views other Humbul records as a way of supplying records which may be slightly more peripheral, although still of interest to the subject area. In addition to "adding value" to the service, these more tangential records help to provide contextual information for those interested in, for instance, developments within scientific thinking during a particular period of time.
OAI and Dublin Core
At present, OAI records only support the use of unqualified Dublin Core. This means that records imported and exported cannot express the full richness of the metadata collected. For instance, MedHist keywords are exported without any indication they are from the MeSH or LCNAF thesauri. Similarly, Humbul's author / creator distinctions (e.g. Web designer, author, compiler) are lost during the export process (although it must be noted that MedHist could not make use of these in any case).
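The loss of detail can be shown with a small sketch. The richer record layout below (subjects tagged with a scheme, creators tagged with a role) is an assumption made for illustration and is not the internal format of either gateway; the point is only that unqualified Dublin Core offers a plain element name and value, with nowhere to carry the qualifier.

```python
def to_unqualified_dc(record):
    """Flatten a richer record into unqualified Dublin Core name/value pairs."""
    pairs = [("title", record["title"]), ("identifier", record["url"])]
    # The scheme of each subject term has no place in unqualified DC
    for term in record["subjects"]:
        pairs.append(("subject", term["value"]))
    # Creator roles (Web designer, author, compiler) are dropped as well
    for person in record["creators"]:
        pairs.append(("creator", person["name"]))
    return pairs


# Illustrative record; the layout and values are hypothetical
rich = {
    "title": "Example resource",
    "url": "http://www.example.org/",
    "subjects": [
        {"scheme": "MeSH", "value": "History of Medicine"},
        {"scheme": "LCNAF", "value": "Huxley, Aldous, 1894-1963"},
    ],
    "creators": [{"role": "compiler", "name": "A. N. Author"}],
}
flat = to_unqualified_dc(rich)
for name, value in flat:
    print(name, "=", value)
```

After flattening, a harvester sees two `subject` values and cannot tell which thesaurus either came from, which is the situation described above.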
Different records for different services?
At present, BIOME must export two sets of OAI records: one for the RDN, which features a "cut-down" version of its gateways' metadata (basic descriptive metadata and metadata about the metadata record itself), and one for Humbul, which must additionally feature author and publisher information. This raises questions about standards within the RDN and the extra work for technical staff which "bilateral" agreements between gateways can create. This is an issue that will probably have to be considered across the RDN as a whole.
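One way to picture the two exports is as two field selections applied to the same underlying record. The sketch below is hypothetical: the field lists are assumptions based on the description above, not the actual export profiles used by BIOME.

```python
# Assumed field lists; the real profiles are not given in this article
RDN_FIELDS = ("title", "identifier", "description", "subject")
HUMBUL_FIELDS = RDN_FIELDS + ("creator", "publisher")


def export_view(record, fields):
    """Return only the requested fields that the record actually holds."""
    return {name: record[name] for name in fields if name in record}


# Illustrative record only
source = {
    "title": "Example resource",
    "identifier": "http://www.example.org/",
    "description": "A hypothetical resource.",
    "subject": ["History of Medicine"],
    "creator": "A. N. Author",
    "publisher": "Example Press",
}
rdn_record = export_view(source, RDN_FIELDS)
humbul_record = export_view(source, HUMBUL_FIELDS)
print(sorted(rdn_record))
print(sorted(humbul_record))
```

Each further bilateral agreement would add another field list to maintain, which is the overhead for technical staff that the paragraph above points to.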
Conclusion
It has become clear that OAI is an effective way of sharing metadata between gateway services, but that it is not a panacea for all interoperability ills. The process between MedHist and Humbul has not been as straightforward as originally envisaged. It has been a "learning process" which has raised almost as many questions as it has answered. However, it is clear that the issues raised are ones which may have to be addressed by anyone using OAI-PMH or metadata aggregation services, and are hurdles to be overcome. Overall, OAI-PMH has shown itself to be an efficient and effective means of exchanging metadata, and has also demonstrated how data may be re-used and re-formatted outside its original context. We certainly look forward to resolving some of the outstanding problems and pushing the resource sharing agenda forward with other related gateways, including SOSIG.
References
[1] MedHist is at: http://medhist.ac.uk
[2] The Wellcome Library home page is at: http://library.wellcome.ac.uk
[3] BIOME is at: http://biome.ac.uk/
[4] Humbul is at: http://www.humbul.ac.uk
[5] OMNI is at: http://omni.ac.uk
[6] The Medical History on the Internet site: http://www.anes.uab.edu/medhist.htm closed as of December 2000. For a service concentrating on a narrower topic see the Karolinska Institute History of Biomedicine page: http://www.mic.ki.se/History.html; for a service covering a broader area, the ECHO history of science, technology and medicine pages: http://echo.gmu.edu/center
[7] SOSIG is at: http://sosig.ac.uk
[8] The Open Archives Initiative Website is at: http://www.openarchives.org
[9] The list of current Dublin Core elements can be accessed at: http://dublincore.org/usage/terms/dc/current-elements/
[10] The MeSH 2002 home page is at: http://www.nlm.nih.gov/mesh
[11] The Library of Congress Authorities page is at: http://authorities.loc.gov
[12] The RDN ResourceFinder catalogue is at: http://rdn.ac.uk
[13] See for instance Pete Stubley, 'Clumps as catalogues: virtual success or failure?', Ariadne Issue 22, http://www.ariadne.ac.uk/issue22/distributed/distukcat2.htm
[14] The My Humbul page is at: http://www.humbul.ac.uk/user/myhome.php
[15] More information on RSS can be found at: http://groups.yahoo.com/group/rss-dev/files/specification.html
[16] The RSSxpress page is at: http://rssxpress.ukoln.ac.uk
[17] See Pete Cliff, 'Building ResourceFinder', Ariadne Issue 30, http://www.ariadne.ac.uk/issue30/rdn-oai/
Author Details
David Little, MedHist Project Officer, Wellcome Library for the History and Understanding of Medicine, Wellcome Trust. Email: d.little@wellcome.ac.uk